Initial thoughts

2025-04-05 17:08:03 +02:00
parent ff4d962168
commit eee3f614fa
13 changed files with 13829 additions and 0 deletions
--- a/DLSiteFSearchObsidian/Approach.md
+++ b/DLSiteFSearchObsidian/Approach.md
@@ -0,0 +1,40 @@
+My objective: given a DLSite ASMR audio file, recover the original DLSite product information (metadata). I will call this input the Search Audio, because it is the audio that needs to be reverse-searched. We know nothing about the Search Audio other than that it came from DLSite.
+
+My approach is heavily inspired by the [lolishinshi/imsearch](https://github.com/lolishinshi/imsearch) project, an image search project built on local image features: it extracts local features (vectors) with the ORB implementation from ORB_SLAM3, indexes and searches them with [FAISS](https://github.com/facebookresearch/faiss), and stores the metadata of every indexed image in [RocksDB](https://rocksdb.org/).
+(wow, 2 Facebook projects)
+
+I assume we need two things:
+- Audio to be indexed (basically, all works from DLSite)
+- A vector database acting as both index and search engine: FAISS or Milvus, or a traditional database with vector support (Postgres, Apache Solr/Lucene/Cassandra, MySQL, MongoDB, etc.)
+
+The audio to be indexed presents me with an obstacle: I don't have the money to purchase all DLSite ASMR works; that would be ridiculous. I also do not think DLSite would want to collaborate on a project that mainly focuses on reverse-searching audio. Such a collaboration would also put a ton of responsibility on me, and I don't want that.
+So we are going for the second-best option: sailing the high seas. Fortunately, there are already data hoarders with tons of ASMR audio. Their collections are ridiculously big, so I will have to build the index in batches.
+
+For the vector database we could just use FAISS, but the training stage would probably be a problem, because I don't want my system maxed out at 100% GPU usage for days on end. I will try different database solutions, like Milvus. Traditional (and better-known) databases might also work; I will have to find that out.
+
+I will be conducting a small-scale test with about 316 DLSite audio works (translated and raw, with additional images, documents, etc.) to see how I can construct this application.
+
+The first BIG **problem** is how the hell do I convert that audio to vectors? For images, we just need to run ORB_SLAM3 for feature extraction, and that works quite well. For text, there are text-embedding models out there that should also work; I just need to make sure to pick open-source models.
+
+But the audio... There are commercial products that use an algorithm to search for songs (Shazam), but I... have ASMR audio on my hands.
+My planned use case: the end user has an audio file that probably came from DLSite and knows nothing else about it. The audio could be lossy-compressed, and my application's job is to find which product ID corresponds to it.
+Articles that I could find:
+https://milvus.io/docs/audio_similarity_search.md
+https://zilliz.com/ai-faq/what-is-audio-similarity-search
+https://www.elastic.co/search-labs/blog/searching-by-music-leveraging-vector-search-audio-information-retrieval
+https://zilliz.com/learn/top-10-most-used-embedding-models-for-audio-data
+
+One path is to use deep-learning embedding models to extract feature vectors from the audio. I have my doubts, since these models are trained on audio from other real-world scenarios, or on music, and they might not be suitable for ASMR audio works. But I could be proven wrong; I wish to be proven wrong.
+
+Another path is this paper I found while searching:
+![[Efficient_music_identification_using_ORB_descripto.pdf]]
+
+This paper applies the ORB algorithm to the spectrogram image, which is interesting. But the paper specifically says that it was tested for music identification, not ASMR audio. I am fairly sure that a spectrogram is just another image as far as ORB is concerned. ASMR audio, however, usually runs from a few minutes to several hours, and I am not sure ORB can handle such extreme image proportions (extremely wide images, since the X dimension of the image is proportional to the audio length).
+One idea I came up with is to chop the audio into pieces and run ORB on each piece's spectrogram, so that we don't end up with extraordinary image sizes. I am not sure how effective this is, so I will have to experiment with it as well.
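A minimal sketch of the chunking idea, assuming the audio has already been decoded to a mono NumPy array. The function names, the 30-second chunk length, and the FFT settings are illustrative choices, not values from the paper; each resulting image would then be handed to an ORB extractor.

```python
import numpy as np

def split_into_chunks(samples, sample_rate, chunk_seconds=30.0):
    """Split a mono signal into consecutive chunks of at most chunk_seconds."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

def spectrogram_image(chunk, n_fft=1024, hop=256):
    """Return a log-magnitude spectrogram as a uint8 image (frequency x time)."""
    if len(chunk) < n_fft:
        chunk = np.pad(chunk, (0, n_fft - len(chunk)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(chunk) - n_fft) // hop
    frames = np.stack([chunk[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (n_fft // 2 + 1, n_frames)
    log_mag = 20.0 * np.log10(magnitude + 1e-10)
    lo, hi = log_mag.min(), log_mag.max()
    scaled = (log_mag - lo) / (hi - lo) if hi > lo else np.zeros_like(log_mag)
    return (scaled * 255).astype(np.uint8)

# Synthetic stand-in for decoded audio: 65 seconds of a 440 Hz tone at 8 kHz.
sample_rate = 8000
t = np.arange(65 * sample_rate) / sample_rate
samples = np.sin(2 * np.pi * 440.0 * t)

chunks = split_into_chunks(samples, sample_rate, chunk_seconds=30.0)
images = [spectrogram_image(c) for c in chunks]
```

With fixed-length chunks the image width stays bounded no matter how long the work is; only the last chunk is narrower. Whether features near chunk boundaries survive is one of the things the experiment would have to show.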
+
+So my current approach is to experiment with these two methods on the local DLSite audio that I have, and compare their results.
+
+I also want to index more aspects of the audio works. These DLSite packages usually come with images and documents, and I want to index those too. The documents can be converted to vectors using embedding models, and for the images we can use the same approach as `imsearch`.
+
+I will have to do some analysis on the files that I have on hand. The collection of DLSite works I was able to find has approximately 50k audio works, each weighing in at 3 GB to 8 GB, with some outliers eating up 20 GB to 110 GB of space. A rough estimate is that all of these works combined will use up more than 30 to 35 TB of space. I don't have the space for that, so I will have to do the indexing in batches.
+
--- a/DLSiteFSearchObsidian/Efficient_music_identification_using_ORB_descripto.pdf
+++ b/DLSiteFSearchObsidian/Efficient_music_identification_using_ORB_descripto.pdf
--- a/DLSiteFSearchObsidian/Notes
+++ b/DLSiteFSearchObsidian/Notes
@@ -0,0 +1,16 @@
+There is an unofficial binding on [GitHub](https://github.com/mnixry/python-orb-slam3/) that makes the ORB algorithm from ORB-SLAM3 usable in Python 3. The ORB algorithm is the only thing it exposes; there is no SLAM functionality, but that is already enough for our reverse-search purposes.
+
+What baffles me is the Python <--> C++ integration. I haven't looked into how C++ code is incorporated into Python, but I do know it is possible.
+The problem is that ORB_SLAM3 also uses OpenCV, and if I install OpenCV for Python, how on earth is Python able to receive `cv.KeyPoint, List cv.Mat`?
+After digging a little, I found that the integration here works entirely differently from `imsearch`. To return results, the C++ code actually constructs new Python objects of type `cv.KeyPoint` or lists of `cv.Mat`, and the wrapper code automatically populates those objects with the results obtained from the earlier C++ code. Which is kind of fascinating. It also requires NumPy for conversion somewhere in the process. The binding is generated using `pybind11`.
+
+Also, [the binding is available precompiled on PyPI](https://pypi.org/project/python-orb-slam3/), so you can download the compiled ORB_SLAM3 module and use it right away. The Python wheel (which is basically an archive file) contains the Python code and the DLL module used by Python (in `.pyd` format, which is just OpenCV + ORB_SLAM3 compiled for x64).
+
+Be careful if you decide to build it on your own: install `pipx` and `pdm`, plus Visual Studio with the C++ development tools to get the compiler. You will also need the OpenCV binary from the official OpenCV site. It does not include GPU acceleration, so you will have to compile OpenCV yourself if you want that.
+After downloading and extracting the OpenCV files, locate the directory that contains the `.cmake` files (such as `OpenCVConfig.cmake`) and note that path down.
+`pdm build -v` will actually build the wheel for Python, but you have to set the `OpenCV_DIR` environment variable to point to that directory with the `.cmake` files. `pdm` will then configure the wheel with CMake, generate a VS project file, compile it, and there is your Python wheel.
+After installing the wheel, YOU MUST MANUALLY COPY A DLL FILE (`opencv_world4110.dll`, FOR EXAMPLE) INTO `site-packages`. The OpenCV World DLL should be placed next to the `python-orb-slam3-py313-x64 whatever.PYD` file. Otherwise, importing the module in Python will fail with a complaint about a missing DLL (that DLL is OpenCV). The wheel you built is dynamically linked to OpenCV, not statically linked, so you have to copy that file manually.
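The manual copy step could be scripted along these lines. This is only a sketch: `copy_dll_next_to_module` is a hypothetical helper, and the glob pattern `*orb_slam3*.pyd` is a guess at the compiled module's file name, so adjust it to whatever your build actually produced.

```python
import shutil
from pathlib import Path

def copy_dll_next_to_module(dll_path, site_packages, module_glob="*orb_slam3*.pyd"):
    """Copy dll_path into every folder under site_packages that holds a matching .pyd.

    module_glob is an assumed naming pattern, not something guaranteed by the project.
    Returns the list of folders the DLL was copied into.
    """
    dll_path = Path(dll_path)
    targets = sorted({pyd.parent for pyd in Path(site_packages).rglob(module_glob)})
    for folder in targets:
        shutil.copy2(dll_path, folder / dll_path.name)
    return targets

# Example call (both paths are placeholders):
# copy_dll_next_to_module(r"path\to\opencv_world4110.dll", r"path\to\site-packages")
```

An empty return value means no matching `.pyd` was found, which usually points at a wrong `site-packages` path or a different module file name.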
+
+Also, PyPI doesn't include binaries for Python 3.13 yet.
+
+Test results are available in the Python Notebook file.
--- a/DLSiteFSearchObsidian/ORB.md
+++ b/DLSiteFSearchObsidian/ORB.md
@@ -0,0 +1,21 @@
+ORB (Oriented FAST and Rotated BRIEF) is a Key Point detector and descriptor. Basically, it is an algorithm that selects which points of the image it considers Key Points, and then describes each of them with a vector.
+
+In other words, it is a very quick way of converting an image into a set of high-dimensional feature vectors, without using AI.
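To make "feature vector" concrete: an ORB descriptor is a 256-bit binary string, stored as 32 `uint8` values, and two descriptors are compared by counting differing bits (Hamming distance). The sketch below uses random bytes as stand-ins for real descriptors; `hamming_distance` is an illustrative helper, not part of any library mentioned here.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two packed binary descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=32, dtype=np.uint8)  # stand-in 256-bit descriptor
desc_b = desc_a.copy()
desc_b[0] ^= 0b00000111  # flip three bits of the first byte

print(hamming_distance(desc_a, desc_a))  # 0
print(hamming_distance(desc_a, desc_b))  # 3
```

A small distance means the two Key Points look alike, which is the basis of the matching during a reverse search.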
+
+OpenCV already has an ORB implementation built in. It even has GPU acceleration, provided that you compile OpenCV yourself: the official binary distributions of OpenCV from their site and from PyPI do not have CUDA support.
+
+But the built-in OpenCV ORB implementation might present a problem, since the Key Points (and therefore the descriptors) it extracts are clustered extremely close together. If we are extracting local features from an image, the Key Points have to be spread evenly so that every local feature can be taken into account.
+
+> ORB_SLAM3 - Solves the issue of the traditional ORB algorithm extracting feature points that are too concentrated
+> [- GitHub:lolishinshi/imsearch](https://github.com/lolishinshi/imsearch)
+
+ORB_SLAM3 is a project from the University of Zaragoza that focuses on SLAM ([Simultaneous localization and mapping](https://en.wikipedia.org/wiki/Simultaneous_localization_and_mapping)). In this case, we only use its ORB algorithm for our image reverse-search purposes. We can see the differences between the Key Points extracted by OpenCV and by ORB_SLAM3:
+
+Here's OpenCV ORB (1024 features/Key Points)
+![[source_keypoints_opencvorb.jpg]]
+
+Here's ORB_SLAM3
+![[source_keypoints_slamorb3.jpg]]
+
+You can clearly see that the Key Point distribution of ORB_SLAM3 is much more even and covers the entire image consistently, while OpenCV's Key Points cluster around specific features of the image.
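One rough way to put a number on "more even" is to lay a grid over the image and count how many cells contain at least one Key Point. This is my own illustrative metric (`grid_coverage` is a hypothetical helper), and the two point sets below are synthetic stand-ins, not the actual extractor outputs shown in the images.

```python
import numpy as np

def grid_coverage(points_xy, width, height, cells=8):
    """Fraction of cells in a cells x cells grid that contain at least one keypoint."""
    pts = np.asarray(points_xy, dtype=float)
    cols = np.clip((pts[:, 0] / width * cells).astype(int), 0, cells - 1)
    rows = np.clip((pts[:, 1] / height * cells).astype(int), 0, cells - 1)
    occupied = np.zeros((cells, cells), dtype=bool)
    occupied[rows, cols] = True
    return float(occupied.mean())

rng = np.random.default_rng(0)
# Synthetic examples: 1024 points packed around the image centre vs. spread out.
clustered = rng.normal(loc=(320, 240), scale=10, size=(1024, 2))
spread = rng.uniform(low=(0, 0), high=(640, 480), size=(1024, 2))

coverage_clustered = grid_coverage(clustered, 640, 480)
coverage_spread = grid_coverage(spread, 640, 480)
```

With real OpenCV key points, the input would be `[kp.pt for kp in keypoints]`; a value close to 1.0 means almost every region of the image contributes at least one feature.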
--- a/DLSiteFSearchObsidian/Studying
+++ b/DLSiteFSearchObsidian/Studying
@@ -0,0 +1,20 @@
+`imsearch` is an image reverse-search server written in Rust. It mixes C++ code from ORB_SLAM3 with its own custom wrapper, and is also linked with FAISS and OpenCV.
+For some reason the repository contains a copy of the FAISS C++ code instead of just including FAISS headers from C++, but that is beyond me.
+
+The core idea is that the program lets the server owner add a bunch of images, and then exposes an API or CLI to reverse-search any image against the added database.
+
+This program is not using any GPU acceleration features.
+
+While adding images, it extracts features using [[ORB]] from the ORB_SLAM3 module and adds them to the FAISS index. Meanwhile, it also stores the image path (image metadata) in a separate key-value database, RocksDB. This separate database maps each image path to the FAISS index IDs of its vectors (image features).
+
+So during adding, the features are extracted from the images and the metadata is stored alongside. Since `imsearch` uses the `BinaryIVF` index from FAISS, and this type of index requires "training" to be useful for image reverse search, there is an extra step. The GPU training can take several tens of hours, if not days, on large datasets.
+
+While searching, the program extracts the features from the query image and runs an ANN (Approximate Nearest Neighbor) search against all vectors stored in the FAISS index. If similar feature vectors are stored in the index, the search returns their FAISS index IDs; querying RocksDB with those IDs then yields the possible original image.
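The add/search flow can be mimicked in a few lines. This is a toy stand-in under loud assumptions: a brute-force exact Hamming search replaces the FAISS ANN index, a plain `dict` replaces RocksDB, the descriptors are random bytes, and the `max_distance=32` cut-off is arbitrary. It is not how `imsearch` itself is implemented.

```python
from collections import Counter
import numpy as np

def hamming_to_all(query, database):
    """Hamming distance from one packed descriptor to every row of database."""
    return np.unpackbits(np.bitwise_xor(database, query), axis=1).sum(axis=1)

def search(query_descriptors, index_vectors, id_to_path, max_distance=32):
    """Vote for the stored image whose descriptors best match the query descriptors."""
    votes = Counter()
    for q in query_descriptors:
        distances = hamming_to_all(q, index_vectors)
        best = int(np.argmin(distances))   # row number plays the role of the index ID
        if distances[best] <= max_distance:
            votes[id_to_path[best]] += 1   # metadata lookup (dict instead of RocksDB)
    return votes.most_common()

rng = np.random.default_rng(0)
# "Adding": 50 stand-in descriptors per image go into one big matrix,
# and the ID -> path mapping is recorded separately.
image_a = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)
image_b = rng.integers(0, 256, size=(50, 32), dtype=np.uint8)
index_vectors = np.vstack([image_a, image_b])
id_to_path = {i: ("a.jpg" if i < 50 else "b.jpg") for i in range(100)}

# "Searching": 20 descriptors of image A, each with one bit flipped.
query = image_a[:20].copy()
query[:, 0] ^= 1
results = search(query, index_vectors, id_to_path)
```

The image that collects the most votes is the most likely original. A real ANN index trades a little accuracy for speed, which is where the training step comes in.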
+
+`imsearch` is a fucking pain in the ass to compile; at least on Windows I wasn't able to build any compatible binary. I cannot build FAISS on Windows with GPU support for some reason, and some strange linker errors also occur during compilation.
+
+I am also kind of interested in how the C++ <--> Rust integration works in this program. From what I can observe, OpenCV and ORB_SLAM3 are both C++ dependencies, which basically means they need a wrapper in order to work with Rust.
+
+What puzzles me is that ORB_SLAM3 itself also depends heavily on OpenCV. So if Rust is using an OpenCV wrapper, what is ORB_SLAM3 supposed to use? Especially since ORB_SLAM3 returns vectors of OpenCV types, and Rust may not understand C++ OpenCV types.
+
+After a bit of digging, I found that `imsearch` uses a premade wrapper for OpenCV, which is fine. During the compilation of `imsearch`, linking OpenCV is the step that often fails (because the OpenCV Rust binding requires you to bring your own OpenCV or use your system package's OpenCV). My hypothesis is that ORB_SLAM3 in the `imsearch` code links against the same OpenCV library that the Rust calls use, so the two sides can pass raw pointers to each other, which the Rust OpenCV binding allows. The presence of `imsearch/src/ORB_SLAM3/ocvrs_common.hpp` suggests that `ORB_SLAM3` and `imsearch` are indeed passing pointers around; the custom wrapper is `imsearch/src/ORB_SLAM3/ORBwrapper.cc`.
--- a/DLSiteFSearchObsidian/source_keypoints_opencvorb.jpg
+++ b/DLSiteFSearchObsidian/source_keypoints_opencvorb.jpg
--- a/DLSiteFSearchObsidian/source_keypoints_slamorb3.jpg
+++ b/DLSiteFSearchObsidian/source_keypoints_slamorb3.jpg