For the current approach, I need a method that is fast, accurate, and, if possible, light on resources to convert all of the approximately 9000 audio files into feature vectors.

I was originally going to use `PANNs` or `VGGish` for audio embedding generation, but `PANNs` crashed on me with `CUDA out of memory` errors, and `VGGish` looks kind of complicated.

Anyway, I asked Claude Sonnet for directions. It gave me more results than searching Google for `Audio Embedding Generation`. It recommended the following embedding models:
1. CLAP
2. BYOL-A
3. PANNs
4. wav2vec 2.0
5. MERT
6. VGGish

I had never heard of most of these options. I discovered `PANNs` through an Elasticsearch article. Also, `Zilliz` (or `Milvus`) has published an article ranking the embedding models, which is why I wanted to try out these three models: `PANNs`, `wav2vec`, and `VGGish`.

Each model has its own quirks to run. Although `Towhee` offers a uniform way to use all of these embedding models, I have my doubts about the project: it seems to be inactive, and there are allegations that it used improper methods to gain more stars on GitHub.

I will have to set up a comparison of search results across all of these embedding models.

Claude Sonnet also recommended chopping the audio into smaller 10-second chunks. That explains the `CUDA out of memory` errors I was getting: I had not chunked my audio into smaller pieces, and most of the files are about 30 minutes long. It also recommended overlapping the chunks. Please see the exported `JSON` chat for details.

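To make the chunking idea concrete, here is a minimal sketch that computes the start and end time of each chunk. The function name `chunk_bounds` and the 2-second overlap are my own assumptions for illustration; the actual overlap value should come from the chat export.

```python
def chunk_bounds(total_seconds, chunk_seconds=10.0, overlap_seconds=2.0):
    """Return (start, end) times in seconds for overlapping chunks.

    The 2-second default overlap is an assumed value, not a recommendation
    taken from the chat export.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk length must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be smaller than the chunk length")

    hop = chunk_seconds - overlap_seconds  # distance between chunk starts
    bounds = []
    start = 0.0
    while start < total_seconds:
        # The last chunk is cut short at the end of the track.
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break
        start += hop
    return bounds


if __name__ == "__main__":
    # A 30-minute track, as in most of my files.
    print(len(chunk_bounds(30 * 60)))
```

With these assumed values a 30-minute file turns into 225 chunks, so each model call only sees about 10 seconds of audio instead of the whole track.
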
The audio must be pre-processed:
1. Load all channels of the audio into memory
2. Resample the audio to the sample rate given in the model's instructions or training parameters
3. Split the audio into chunks of 10-15 seconds
   Each chunk may carry metadata describing its position (time within the full track) and its channel (L, R)

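The steps above can be sketched roughly as follows. This assumes the audio is already decoded into one list of samples per channel (step 1), and the names `resample_linear` and `preprocess` are hypothetical. The linear-interpolation resampler is only a crude stand-in with no anti-aliasing filter; a real run would probably use the resampler of an audio library instead.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Crude linear-interpolation resampler (stand-in, no anti-aliasing)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def preprocess(channels, src_rate, dst_rate,
               chunk_seconds=10.0, overlap_seconds=0.0):
    """Resample each channel and split it into chunks with metadata.

    channels: dict mapping a channel name ("L", "R") to a list of samples.
    Returns one record per chunk with channel, start/end time, and samples.
    """
    chunk_len = int(chunk_seconds * dst_rate)
    hop = int((chunk_seconds - overlap_seconds) * dst_rate)
    if chunk_len <= 0 or hop <= 0:
        raise ValueError("chunk length and hop must be positive")

    records = []
    for name, samples in channels.items():
        resampled = resample_linear(samples, src_rate, dst_rate)
        for start in range(0, len(resampled), hop):
            piece = resampled[start:start + chunk_len]
            records.append({
                "channel": name,
                "start_s": start / dst_rate,               # time in full track
                "end_s": (start + len(piece)) / dst_rate,
                "samples": piece,
            })
            if start + chunk_len >= len(resampled):
                break  # the track end has been reached
    return records
```

Keeping the position and channel next to each chunk means a search hit can later be traced back to the exact spot in the original track.
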
# Benchmark
Using 200 randomly selected audio clips, record for each audio embedding model mentioned above the time it takes to process all 200 clips, and store its resulting vectors on disk.
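
A minimal sketch of such a benchmark harness is below. Everything in it is an assumption for illustration: `pick_clips` and `benchmark` are hypothetical names, each model is assumed to be wrapped in a plain Python callable that takes a clip and returns a list of floats, and JSON is used for storage only to keep the sketch dependency-free.

```python
import json
import random
import time
from pathlib import Path


def pick_clips(paths, n=200, seed=0):
    """Randomly select up to n clips; a fixed seed keeps the set reproducible."""
    paths = list(paths)
    return random.Random(seed).sample(paths, min(n, len(paths)))


def benchmark(models, clips, out_dir):
    """Time each model over the same clips and store its vectors on disk.

    models: dict mapping a model name to a callable(clip) -> list of floats.
            These callables are hypothetical wrappers around the real models.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    timings = {}
    for name, embed in models.items():
        start = time.perf_counter()
        vectors = [embed(clip) for clip in clips]
        # Only the embedding time is measured, not the disk write below.
        timings[name] = time.perf_counter() - start
        (out_dir / f"{name}.json").write_text(json.dumps(vectors))

    (out_dir / "timings.json").write_text(json.dumps(timings, indent=2))
    return timings
```

Every model sees the same 200 clips, so the recorded times and the stored vectors can be compared directly afterwards.
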