For the current approach, I need a method that is fast, accurate, and, if possible, low on resources, to convert all of the roughly 9,000 audio files into feature vectors.

I was originally going to use `PANNs` or `VGGish` for audio embedding generation, but `PANNs` has crashed on me with `CUDA out of memory` errors, and `VGGish` looks kind of complicated.

Anyway, I asked Claude Sonnet for directions. It gave me more results than searching Google for `Audio Embedding Generation` did. It recommended the following embedding models (a quick sketch with one of them follows the list):

1. CLAP
2. BYOL-A
3. PANNs
4. wav2vec 2.0
5. MERT
6. VGGish
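
To get a feel for what running one of these looks like, here is a minimal sketch of embedding a single clip with CLAP through Hugging Face `transformers`. The `laion/clap-htsat-unfused` checkpoint, the 48 kHz sample rate, and the file name are assumptions for illustration, not something I have run yet:

```python
# Minimal sketch: embed one clip with CLAP via Hugging Face transformers.
# Assumes the `laion/clap-htsat-unfused` checkpoint, whose feature
# extractor expects 48 kHz mono audio.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

audio, sr = librosa.load("some_clip.wav", sr=48_000, mono=True)
inputs = processor(audios=audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    embedding = model.get_audio_features(**inputs)  # shape: (1, 512)
```

One point in CLAP's favour for this project: the same model also embeds text, so a text query could be matched against the audio vectors directly.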

Apart from `PANNs`, which I discovered through an Elasticsearch article, I had not heard of any of these options. `Zilliz` (the company behind `Milvus`) has also published an article ranking the embedding models, which is why I want to try out these three: `PANNs`, `wav2vec 2.0`, and `VGGish`.

Each model has its own quirks to run. Although `Towhee` offers a uniform way to use all of these embedding models, I have my doubts about the project: it seems to be inactive, and there are allegations that it used inadequate ways to gain more stars on GitHub.

I will have to set up a comparison between searching with each of these embedding models.

Claude Sonnet also recommended chopping the audio up into smaller 10-second chunks. I was wondering why I was getting `CUDA out of memory` errors; it's because I hadn't chunked my audio into smaller pieces, and most of the audio files are around 30 minutes long, which explains the error. It also recommended overlapping the chunks. Please see the exported `JSON` chat for details.
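
Here is the chunking idea as a small sketch. The 10-second window follows Claude's suggestion; the 5-second hop (50% overlap) is an illustrative choice, not a recommendation from the chat:

```python
import numpy as np

def chunk_signal(signal: np.ndarray, sr: int,
                 chunk_s: float = 10.0, hop_s: float = 5.0):
    """Yield (start_time_in_seconds, window) pairs over a 1-D signal.

    hop_s < chunk_s makes consecutive windows overlap. A tail shorter
    than one hop falls outside the last window and is dropped.
    """
    chunk_len = int(chunk_s * sr)
    hop_len = int(hop_s * sr)
    last_start = max(len(signal) - chunk_len, 0)
    for start in range(0, last_start + 1, hop_len):
        yield start / sr, signal[start:start + chunk_len]
```

A 30-minute track is 1,800 seconds, so this yields roughly 360 overlapping 10-second chunks instead of one enormous input, which is exactly what was blowing up the GPU memory.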

The audio must be pre-processed (a sketch of the whole pass follows the list):

1. Load all channels of the audio into memory
2. Resample the audio according to the model's instructions or training parameters
3. Split the audio into chunks of 10-15 seconds

Each chunk may have metadata associated with it: its position (time within the full track) and channel information (L, R).
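
Putting the three steps and the metadata together, the whole pass might look roughly like this. It reuses the `chunk_signal` helper sketched above, assumes mono or stereo input, and the 48 kHz target rate is only a placeholder, since each model dictates its own:

```python
from dataclasses import dataclass

import librosa
import numpy as np

@dataclass
class AudioChunk:
    path: str          # source audio file
    channel: str       # "L", "R", or "mono"
    start_s: float     # position of the chunk within the full track
    samples: np.ndarray

def preprocess(path: str, target_sr: int = 48_000):
    """Step 1: load all channels; step 2: resample; step 3: chunk."""
    # mono=False keeps every channel; librosa resamples to target_sr on load.
    audio, _ = librosa.load(path, sr=target_sr, mono=False)
    audio = np.atleast_2d(audio)  # (channels, samples), even for mono files
    names = ["mono"] if audio.shape[0] == 1 else ["L", "R"]
    for name, channel in zip(names, audio):
        for start_s, window in chunk_signal(channel, target_sr):
            yield AudioChunk(path, name, start_s, window)
```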

# Benchmark

With 200 randomly selected audio clips, each of the embedding models mentioned above must have its time for processing the 200 clips recorded, and its vector results stored on disk.
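
A sketch of the benchmark harness, assuming each model is first wrapped behind a common `embed(clip) -> np.ndarray` callable; that interface and the wrapper names are hypothetical, and the per-model wrappers still have to be written:

```python
import time
from pathlib import Path

import numpy as np

def benchmark(models: dict, clips: list, out_dir: Path) -> None:
    """Time each model over the same clips and store its vectors on disk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, embed in models.items():
        start = time.perf_counter()
        vectors = np.stack([embed(clip) for clip in clips])  # one vector per clip
        elapsed = time.perf_counter() - start
        np.save(out_dir / f"{name}.npy", vectors)
        print(f"{name}: {elapsed:.1f} s for {len(clips)} clips")

# e.g. benchmark({"panns": panns_embed, "vggish": vggish_embed,
#                 "wav2vec2": wav2vec_embed}, clips, Path("vectors"))
```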