For the current approach, I need a method that is fast, accurate, and, if possible, low on resources, to convert all of the roughly 9,000 audio files into feature vectors.

I was originally going to use `PANNs` or `VGGish` for audio embedding generation, but `PANNs` has crashed on me with `CUDA out of memory` errors, and `VGGish` looks kind of complicated.

Anyway, I asked Claude Sonnet for directions. It gave me more results than searching Google for `Audio Embedding Generation` did. It recommended the following embedding models (a quick sketch with one of them follows the list):

1. CLAP
2. BYOL-A
3. PANNs
4. wav2vec 2.0
5. MERT
6. VGGish
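
To get a feel for what running one of these looks like, here is a minimal sketch of embedding a single clip with CLAP through Hugging Face `transformers`. The `laion/clap-htsat-unfused` checkpoint, the 48 kHz sample rate, and the file name are assumptions for illustration, not something I have run yet:

```python
# Minimal sketch: embed one clip with CLAP via Hugging Face transformers.
# Assumes the `laion/clap-htsat-unfused` checkpoint, whose feature
# extractor expects 48 kHz mono audio.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

audio, sr = librosa.load("some_clip.wav", sr=48_000, mono=True)
inputs = processor(audios=audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_audio_features(**inputs)  # shape: (1, 512)
```

One point in CLAP's favour for this project: the same model also embeds text, so a text query could be matched against the audio vectors directly.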

Apart from `PANNs`, which I discovered through an Elasticsearch article, I had not heard of any of these options. `Zilliz` (the company behind `Milvus`) has also published an article ranking the embedding models, which is why I want to try out these three: `PANNs`, `wav2vec 2.0`, and `VGGish`.

Each model has its own quirks to run. Although `Towhee` offers a uniform way to use all of these embedding models, I have my doubts about the project: it seems to be inactive, and there are allegations that it used inadequate ways to gain more stars on GitHub.

I will have to set up a comparison between searching with each of these embedding models.

Claude Sonnet also recommended chopping the audio up into smaller 10-second chunks. I was wondering why I was getting `CUDA out of memory` errors; it's because I hadn't chunked my audio into smaller pieces, and most of the audio files are around 30 minutes long, which explains the error. It also recommended overlapping the chunks. Please see the exported `JSON` chat for details.
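
Here is the chunking idea as a small sketch. The 10-second window follows Claude's suggestion; the 5-second hop (50% overlap) is an illustrative choice, not a recommendation from the chat:

```python
import numpy as np

def chunk_signal(signal: np.ndarray, sr: int,
                 chunk_s: float = 10.0, hop_s: float = 5.0):
    """Yield (start_time_in_seconds, window) pairs over a 1-D signal.

    hop_s < chunk_s makes consecutive windows overlap. A tail shorter
    than one hop falls outside the last window and is dropped.
    """
    chunk_len = int(chunk_s * sr)
    hop_len = int(hop_s * sr)
    last_start = max(len(signal) - chunk_len, 0)
    for start in range(0, last_start + 1, hop_len):
        yield start / sr, signal[start:start + chunk_len]
```

A 30-minute track is 1,800 seconds, so this yields roughly 360 overlapping 10-second chunks instead of one enormous input, which is exactly what was blowing up the GPU memory.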

The audio must be pre-processed (a sketch of the whole pass follows the list):

1. Load all channels of the audio into memory
2. Resample the audio according to the model's instructions or training parameters
3. Split the audio into chunks of 10-15 seconds

Each chunk may have metadata associated with it: its position (time within the full track) and channel information (L, R).
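
Putting the three steps and the metadata together, the whole pass might look roughly like this. It reuses the `chunk_signal` helper sketched above, assumes mono or stereo input, and the 48 kHz target rate is only a placeholder, since each model dictates its own:

```python
from dataclasses import dataclass

import librosa
import numpy as np

@dataclass
class AudioChunk:
    path: str          # source audio file
    channel: str       # "L", "R", or "mono"
    start_s: float     # position of the chunk within the full track
    samples: np.ndarray

def preprocess(path: str, target_sr: int = 48_000):
    """Step 1: load all channels; step 2: resample; step 3: chunk."""
    # mono=False keeps every channel; librosa resamples to target_sr on load.
    audio, _ = librosa.load(path, sr=target_sr, mono=False)
    audio = np.atleast_2d(audio)  # (channels, samples), even for mono files
    names = ["mono"] if audio.shape[0] == 1 else ["L", "R"]
    for name, channel in zip(names, audio):
        for start_s, window in chunk_signal(channel, target_sr):
            yield AudioChunk(path, name, start_s, window)
```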

# Benchmark

With 200 randomly selected audio clips, each of the embedding models mentioned above must have its time for processing the 200 clips recorded, and its vector results stored on disk.
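
A sketch of the benchmark harness, assuming each model is first wrapped behind a common `embed(clip) -> np.ndarray` callable; that interface and the wrapper names are hypothetical, and the per-model wrappers still have to be written:

```python
import time
from pathlib import Path

import numpy as np

def benchmark(models: dict, clips: list, out_dir: Path) -> None:
    """Time each model over the same clips and store its vectors on disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, embed in models.items():
        start = time.perf_counter()
        vectors = np.stack([embed(clip) for clip in clips])  # one vector per clip
        elapsed = time.perf_counter() - start
        np.save(out_dir / f"{name}.npy", vectors)
        print(f"{name}: {elapsed:.1f} s for {len(clips)} clips")

# e.g. benchmark({"panns": panns_embed, "vggish": vggish_embed,
#                 "wav2vec2": wav2vec_embed}, clips, Path("vectors"))
```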