# Audio Embedding generation

For the current approach, I need a method that is fast, accurate, and ideally low-resource to convert all of the roughly 9,000 audio files into feature vectors.

I was originally going to use PANNs or VGGish for audio embedding generation, but PANNs crashed on me with CUDA out-of-memory errors, and VGGish looks kind of complicated.

Anyway, I asked Claude Sonnet for directions, and it gave me more results than a Google search for "audio embedding generation" did. It recommended the following embedding models:

  1. CLAP
  2. BYOL-A
  3. PANNs
  4. wav2vec 2.0
  5. MERT
  6. VGGish

I had never heard of most of these options. I discovered PANNs through an Elasticsearch article, and Zilliz (the company behind Milvus) has published an article ranking embedding models. That is why I wanted to try out three of these models: PANNs, wav2vec 2.0, and VGGish.

Each model has its own quirks to run. Although Towhee offers a uniform way to use all of these embedding models, I have my doubts about that project: it seems to be inactive, and there are allegations that it used questionable tactics to gain stars on GitHub.
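To give a feel for those quirks, here is a minimal sketch of running two of the candidates directly. It assumes the `panns_inference` package and the Hugging Face `transformers` port of CLAP with the `laion/clap-htsat-unfused` checkpoint; the file name is a placeholder.

```python
import librosa

# --- PANNs: expects 32 kHz mono audio shaped (batch, samples) ---
from panns_inference import AudioTagging

panns = AudioTagging(checkpoint_path=None, device="cuda")  # fetches CNN14 weights
audio32, _ = librosa.load("clip.wav", sr=32000, mono=True)
_, panns_emb = panns.inference(audio32[None, :])  # 2048-dim embedding

# --- CLAP via transformers: expects 48 kHz audio fed through a processor ---
from transformers import ClapModel, ClapProcessor

clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
audio48, _ = librosa.load("clip.wav", sr=48000, mono=True)
inputs = processor(audios=audio48, sampling_rate=48000, return_tensors="pt")
clap_emb = clap.get_audio_features(**inputs)  # 512-dim embedding
```

Already these two disagree on sample rate, input shape, and how weights are fetched, which is exactly the friction Towhee tries to paper over.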

I will have to set up a comparison of search results across all of these embedding models.
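One way to structure that comparison, sketched with plain NumPy cosine similarity and a hypothetical on-disk layout (one embedding matrix per model, rows aligned to the same chunk list so the result lists are directly comparable):

```python
import numpy as np

def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k corpus vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

# Hypothetical file layout: embeddings/<model>.npy holds (n_chunks, dim),
# queries/<model>_query.npy holds the query embedded by the same model.
for model_name in ["panns", "clap", "vggish"]:
    corpus = np.load(f"embeddings/{model_name}.npy")
    query = np.load(f"queries/{model_name}_query.npy")
    print(model_name, top_k(query, corpus))
```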

Claude Sonnet also recommended chopping the audio into smaller 10-second chunks, which explains why I was getting CUDA out-of-memory errors: I hadn't chunked my audio into smaller pieces, and most of the files are around 30 minutes long. It also recommended overlapping the chunks. Please see the exported JSON chat for details.

The audio must be pre-processed:

  1. Load all channels of the audio into memory
  2. Resample the audio to the sample rate the model was trained with
  3. Split the audio into chunks of 10-15 seconds. Each chunk should carry metadata recording its position (start time within the full track) and channel (L or R); see the sketch after this list.
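A sketch of that pipeline, assuming librosa for loading and resampling; the 50% overlap is my placeholder value, not something fixed yet:

```python
import librosa
import numpy as np

def preprocess(path: str, target_sr: int, chunk_s: float = 10.0, overlap_s: float = 5.0):
    """Yield (chunk, metadata) pairs for every channel of an audio file."""
    # 1. Load all channels (mono=False keeps L/R separate), resampled to the
    #    rate the embedding model expects.
    audio, sr = librosa.load(path, sr=target_sr, mono=False)
    if audio.ndim == 1:
        audio = audio[None, :]  # treat mono as a single channel

    chunk_len = int(chunk_s * sr)
    hop = int((chunk_s - overlap_s) * sr)

    # 3. Slide an overlapping window over each channel, tagging every chunk
    #    with its start time and channel so a match can be traced back.
    for ch, channel in enumerate(audio):
        for start in range(0, max(1, len(channel) - chunk_len + 1), hop):
            meta = {
                "file": path,
                "channel": "LR"[ch] if audio.shape[0] == 2 else str(ch),
                "start_s": start / sr,
            }
            yield channel[start:start + chunk_len], meta
```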

## Benchmark

Using 200 randomly selected audio clips, each embedding model mentioned above must have its time for processing the 200 clips recorded, and its vector results stored on disk.
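A harness along these lines could record both the timing and the vectors; the `embed_fns` mapping from model name to embedding function is hypothetical glue that each model's quirky loader would have to provide:

```python
import os
import time
import numpy as np

def benchmark(embed_fns: dict, clip_paths: list[str]) -> None:
    """Time each model over the same clips and persist its vectors to disk."""
    os.makedirs("benchmark", exist_ok=True)
    for name, embed in embed_fns.items():
        start = time.perf_counter()
        vectors = np.stack([embed(path) for path in clip_paths])  # (n_clips, dim)
        elapsed = time.perf_counter() - start
        np.save(f"benchmark/{name}.npy", vectors)
        print(f"{name}: {elapsed:.1f}s for {len(clip_paths)} clips")
```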