DLSiteFSearch/DLSiteFSearchObsidian/Approach.md
2025-04-05 17:08:03 +02:00


My objective: given a DLSite ASMR audio file, which I will refer to as the Search Audio (because it is the audio that needs to be reverse-searched; we know nothing about it other than that it came from DLSite), retrieve the original DLSite product information or metadata.
My approach is heavily inspired by the [lolishinshi/imsearch](https://github.com/lolishinshi/imsearch) project, an image search project built on an image's local features: it extracts local image features (vectors) using ORB_SLAM3, indexes and searches them using [FAISS](https://github.com/facebookresearch/faiss), and ultimately stores all indexed image metadata in [RocksDB](https://rocksdb.org/).
(wow, 2 Facebook projects)
I assume we need two things:
- Audio to be indexed (basically, all works from DLSite)
- Vector Database (acting as an index and search engine: FAISS or Milvus, or other traditional databases with vector support: Postgres, Apache Solr/Lucene/Cassandra, MySQL, MongoDB, etc.)
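To make the second requirement concrete, here is a minimal sketch of what the vector database has to do: store one feature vector per indexed work and answer nearest-neighbor queries. This is a brute-force stand-in for what FAISS or Milvus does at scale; the product IDs and vectors are invented examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(index, query, k=1):
    """Return the k most similar (product_id, score) pairs.
    `index` maps product IDs to feature vectors."""
    scored = [(pid, cosine_similarity(vec, query)) for pid, vec in index.items()]
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:k]

# Toy index: product ID -> feature vector. Real vectors would come
# from whatever audio feature extractor wins the experiments below.
index = {
    "RJ00000001": [1.0, 0.0, 0.2],
    "RJ00000002": [0.1, 0.9, 0.3],
}
print(search(index, [0.9, 0.1, 0.2], k=1))  # best match is RJ00000001
```

A dedicated vector database replaces this linear scan with approximate-nearest-neighbor structures (inverted lists, HNSW graphs, product quantization), which is what makes searching millions of vectors feasible.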
The audio to be indexed presents me with an obstacle: I don't have the money to purchase all DLSite ASMR works; it would be ridiculous. And I do not think DLSite would want to collaborate on a project that mainly focuses on reverse-searching similar audio. It would also put a ton of responsibility on me, and I don't want that.
So we are going for the second-best option of sailing the high seas. Fortunately there are already data hoarders with tons of ASMR audio; their collections are ridiculously big, so I will have to build the index in batches.
For the vector database we could just use FAISS, but the training stage would probably be a problem, because I don't want my system to be maxed out at 100% GPU usage for days on end. I will try different database solutions, like Milvus; traditional (and better-known) databases might also be possible? I will have to find that out.
I will conduct a small-scale test, picking about 316 DLSite audio works (translated and raw, with additional images, documents, etc.), and see how I can construct this application.
The first BIG **problem** is: how the hell do I convert that audio to vectors? If it were images, we would just run ORB_SLAM3 for feature extraction and that would work quite well. If it were text, there are text-to-embedding models out there that would also work? I just need to make sure to pick open-source models.
But for audio... There are commercial products that use algorithms to identify songs (Shazam), but I... have ASMR audio on my hands.
My planned use case is that the end user possesses a potential audio file from DLSite, knowing only that it probably came from DLSite and nothing else. The audio could be lossy-compressed, and my application's job is to find which product ID corresponds to that audio.
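One robustness trick for that use case (a sketch, not a decided design): if each indexed work contributes many per-chunk vectors rather than one, the query audio can be matched chunk by chunk and the final product ID decided by majority vote, so a few chunks ruined by lossy compression don't sink the lookup. The product IDs here are invented examples.

```python
from collections import Counter

def identify(chunk_matches):
    """Given, per query chunk, the product ID of its nearest indexed
    chunk, return the product with the most votes plus the vote share."""
    votes = Counter(chunk_matches)
    product_id, count = votes.most_common(1)[0]
    return product_id, count / len(chunk_matches)

# Suppose 5 chunks of the query audio matched these indexed products
# (one chunk mismatched, e.g. due to lossy compression artifacts):
matches = ["RJ123456", "RJ123456", "RJ999999", "RJ123456", "RJ123456"]
print(identify(matches))  # ('RJ123456', 0.8)
```

The vote share doubles as a crude confidence score to report alongside the result.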
From the articles that I could find:
https://milvus.io/docs/audio_similarity_search.md
https://zilliz.com/ai-faq/what-is-audio-similarity-search
https://www.elastic.co/search-labs/blog/searching-by-music-leveraging-vector-search-audio-information-retrieval
https://zilliz.com/learn/top-10-most-used-embedding-models-for-audio-data
One path is to use embedding models from deep learning to extract feature vectors from the audio. I have my doubts, since these models are trained on real-world audio or music, and they might not be suitable for ASMR audio works. But I could be proven wrong; I wish to be proven wrong.
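If I go this route, the pipeline would roughly be: split the audio into fixed windows, embed each window with a pretrained model, and pool the window embeddings into one vector per work (or keep them per-window for the chunk-matching idea). A sketch with a dummy `embed` standing in for a real pretrained model, which is the part still to be chosen:

```python
def mean_pool(window_embeddings):
    """Average per-window embeddings into a single track-level vector."""
    dim = len(window_embeddings[0])
    n = len(window_embeddings)
    return [sum(e[i] for e in window_embeddings) / n for i in range(dim)]

def embed(window):
    """Placeholder for a real audio embedding model; returns a toy
    2-D vector so the pipeline shape is visible."""
    return [float(sum(window)), float(len(window))]

windows = [[1, 2], [3, 4]]  # fake "audio windows"
track_vector = mean_pool([embed(w) for w in windows])
print(track_vector)  # [5.0, 2.0]
```

Whether mean pooling loses too much information for hours-long ASMR is exactly the kind of thing the small-scale test should answer.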
Another path is this paper I found while searching:
![[Efficient_music_identification_using_ORB_descripto.pdf]]
This paper applied the ORB algorithm to the spectrogram image, which is interesting. But the paper specifically says it was tested for music identification, not ASMR audio. Although I am sure a spectrogram is just another image to the ORB algorithm, the usual length of an ASMR audio work ranges from a few minutes to hours. And I am not sure whether ORB can handle such extreme image proportions (extremely large images, with the audio length proportional to the X dimension of the image).
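To see why the width blows up: the spectrogram is just the magnitudes of a short-time Fourier transform, with one image column per analysis frame, so the number of columns grows linearly with audio length. A naive pure-Python STFT (illustrative only; a real pipeline would use an FFT library) makes the geometry concrete:

```python
import cmath, math

def stft_magnitudes(samples, frame_size=64, hop=32):
    """Naive STFT: one magnitude spectrum per overlapping frame.
    The resulting 2-D list is the 'image' ORB would operate on
    (frames along X, frequency bins along Y)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        mags = []
        for k in range(frame_size // 2):  # keep non-negative frequencies
            s = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame))
            mags.append(abs(s))
        frames.append(mags)
    return frames

# Test tone: 8 cycles per 64-sample frame, so energy lands in bin 8.
samples = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = stft_magnitudes(samples)
print(len(spec), len(spec[0]))  # 7 32  (frames x frequency bins)
```

With a fixed hop size, a 2-hour work at 44.1 kHz would produce millions of columns, which is the extreme aspect ratio I'm worried about.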
One way I came up with is to chop the audio into pieces and then run the ORB algorithm on each piece to extract features. That way we don't end up with extraordinary spectrogram image sizes, but I am not sure of its effectiveness, so I will also have to experiment with that.
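The chopping step itself is trivial; the open question is the chunk length, which is a parameter to tune, not a known-good value:

```python
def chunk_audio(samples, sample_rate, chunk_seconds=30):
    """Split raw samples into fixed-length chunks so each chunk's
    spectrogram stays a manageable image size. The 30 s default is
    a guess to be tuned experimentally."""
    size = sample_rate * chunk_seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# 95 seconds of (fake) audio at a toy 10 Hz sample rate -> 4 chunks,
# the last one partial.
chunks = chunk_audio([0.0] * 950, sample_rate=10)
print([len(c) for c in chunks])  # [300, 300, 300, 50]
```

Overlapping chunks (hop smaller than chunk size) might also be worth trying, so a query clip that straddles a chunk boundary still matches something.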
So my current approach will be to experiment with these two methods using the local DLSite audio that I have, and compare the results against each other.
I also want to index more aspects of the audio works: these DLSite packages usually come with images and documents, and I want to index those too. The documents can be converted to vectors using embedding models, and for the images we can use the same approach as `imsearch`.
I will have to do some analysis on the files I have on hand. The collection of DLSite works I was able to find has approximately 50k audio works, each weighing in at 3 GB to 8 GB, with some outliers eating up 20 GB to 110 GB of space. A rough estimate is that all of these works combined will use up more than 30 to 35 TB of space. I don't have the space for that, so I will have to do the indexing in batches.
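Back-of-the-envelope batch math (the 2 TB scratch space is an assumption about my own disk, not a measured number):

```python
import math

def batches_needed(total_tb, scratch_tb):
    """How many download-index-delete passes it takes to push a
    `total_tb` collection through `scratch_tb` of local working space."""
    return math.ceil(total_tb / scratch_tb)

# Rough numbers from above: ~30-35 TB of works, through 2 TB of scratch.
print(batches_needed(35, 2))  # 18
```

So on the order of fifteen to twenty batches, each of which has to be fully downloaded, indexed, and deleted before the next one starts.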