added more information

.gitignore (vendored) | 4
@@ -1 +1,3 @@
.obsidian
DLSiteFSearchPython_venv
pypy_venv

@@ -31,6 +31,7 @@ Another path is this paper I found while searching:

This paper applied the ORB algorithm to spectrogram images, which is interesting. However, the paper specifically says it was tested for music identification, not ASMR audio. I am sure a spectrogram is just another image to the ORB algorithm, but typical ASMR audio ranges from a few minutes to hours long, and I am not sure whether ORB can handle such extreme image proportions (extremely wide images, since the audio length is proportional to the X dimension of the spectrogram).

One approach I came up with is to chop the audio into pieces and then run the ORB algorithm on each piece to extract features. That way we avoid extraordinary spectrogram image sizes, but I am not sure of its effectiveness, so I will have to experiment with that.
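
To make the idea concrete, here is a rough, untested sketch of the chop-then-extract loop. `librosa` is an assumption on my part (it is not in `requirements.txt`), and the chunk length and spectrogram settings are placeholder values to experiment with:

```python
# Sketch: chop audio into fixed-length chunks, render each chunk's
# spectrogram as an 8-bit image, and run ORB on it.
# Assumes librosa and opencv-python; chunk_s and nfeatures are placeholders.
import cv2
import librosa
import numpy as np

def orb_features_per_chunk(path: str, chunk_s: float = 60.0):
    y, sr = librosa.load(path, sr=22050, mono=True)
    orb = cv2.ORB_create(nfeatures=500)
    hop = int(chunk_s * sr)
    descriptors = []
    for start in range(0, len(y), hop):
        chunk = y[start:start + hop]
        if len(chunk) < sr:  # skip sub-second leftovers at the end
            continue
        # Log-mel spectrogram rendered as a grayscale uint8 image
        s = librosa.feature.melspectrogram(y=chunk, sr=sr)
        img = librosa.power_to_db(s, ref=np.max)
        img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        _, des = orb.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des)  # each row is a 32-byte ORB descriptor
    return descriptors
```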

We could always just use existing music fingerprinting programs like Shazam or Chromaprint, but I highly doubt their effectiveness here.

So my current plan is to experiment with both approaches using the local DLSite audio that I have, and compare the results against each other.

DLSiteFSearchObsidian/Local dataset analysis.md | 35 (new file)
@@ -0,0 +1,35 @@
Due to space (disk partition) constraints, the local dataset is split into three subsets:

- ASMROne
- ASMRTwo
- ASMRThree

There are no substantial differences between the subsets.

Subset sizes and audio work counts:

- ASMROne: 119 audio works, 470 GB (504,791,391,855 bytes)
- ASMRTwo: 90 audio works, 439 GB (471,683,782,635 bytes)
- ASMRThree: 121 audio works, 499 GB (536,552,753,022 bytes)

Total: 330 audio works, 1409 GB (1,513,027,927,512 bytes)

The works vary in language (audio language, or bundled translated subtitle files), size, audio encoding format, etc.

Basic statistics at the filesystem level:

| Subset    | File count | Folder count |
| --------- | ---------- | ------------ |
| ASMROne   | 6317       | 1017         |
| ASMRTwo   | 7435       | 760          |
| ASMRThree | 6694       | 1066         |

Average audio work size:

$1409 \, \text{GB} \div 330 \, \text{works} = 4.2\overline{69} \, \text{GB/work}$

Avg.: approximately 4.27 GB per work.

In this project we will index only the following types of files:

- Audio
- Image
- Document

An in-depth analysis of the dataset contents is in `LocalDatasetAnalysis.ipynb`.
@@ -17,4 +17,99 @@ Also it kind of interests me how the C++ <--> Rust integration works in this pro

What puzzles me is that ORB_SLAM3 itself also depends heavily on OpenCV, so if Rust is using an OpenCV wrapper, which OpenCV is ORB_SLAM3 supposed to use? Especially since ORB_SLAM3 returns vectors of OpenCV types, and Rust may not understand C++ OpenCV types.
After a bit of digging, I found that `imsearch` uses a premade wrapper for OpenCV, which is fine. During compilation of `imsearch`, linking OpenCV is the step that most often fails (the OpenCV-Rust binding requires you to bring your own OpenCV, or your system package's OpenCV). My hypothesis is that ORB_SLAM3 inside the `imsearch` code links against the same OpenCV library used by the Rust calls; the two sides can then pass raw pointers to each other, which the Rust OpenCV binding allows. `imsearch/src/ORB_SLAM3/ocvrs_common.hpp` indicates that `ORB_SLAM3` and `imsearch` do pass pointers around; the custom wrapper is `imsearch/src/ORB_SLAM3/ORBwrapper.cc`.

# Search method

First, `imsearch` needs a source dataset: an image database that future queries will be compared against.

During `imsearch add-image (directory)`, the program loops through all image files in the directory and extracts their feature vectors using ORB_SLAM3.

The "feature vector", which is actually the source descriptor (an OpenCV descriptor), is stored in RocksDB. A source descriptor is a matrix of size $\text{Number of Features} \times 32$, where each entry is a `uint8` value (see `test_slamorb.ipynb`).
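
As a rough illustration of that descriptor shape (my own sketch, not `imsearch` code), using the `python-orb-slam3` package from `requirements.txt`, whose `ORBExtractor` API I am taking from that package's README:

```python
# Sketch: extract ORB_SLAM3 descriptors from one image.
# The ORBExtractor name follows python-orb-slam3's README and may differ.
import cv2
from python_orb_slam3 import ORBExtractor

image = cv2.imread("FeatureExtraction/测试.png", cv2.IMREAD_GRAYSCALE)
extractor = ORBExtractor()
keypoints, descriptors = extractor.detectAndCompute(image)

# descriptors is an (N, 32) uint8 matrix: one 256-bit descriptor per feature
print(descriptors.shape, descriptors.dtype)
```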

All features are stored in an internal key-value database using RocksDB. `imsearch` uses several tables.

The first table is the Features table. Each feature gets its own ID, assigned in incremental order. Note that it does not store the original image path, which is the only important metadata.

The second table is the Image Path table, which stores all image paths (the only metadata about images) added to `imsearch`. Each image path gets its own ID.

The third table is the Relation table. For every stored feature, it maps the Feature ID to the Image Path ID.

For example, running ORB_SLAM3 on an image may return a 480x32 matrix, meaning the image has 480 feature vectors. When adding that single image to the database, `imsearch` extracts all the feature vectors and stores each of them (all 480) in the Features table, inserts the image path into the Image Path table, and finally inserts into the Relation table each Feature ID of the image together with its corresponding Image Path ID.

```mermaid
erDiagram
    fid[ImageFeatureColumn] {
        uint64 FeatureID
        vector feature
    }
    ipid[ImagePathColumn] {
        uint64 ImagePathID
        string path
    }
    fid2ipid[FeatureID2ImagePathID] {
        uint64 FeatureID
        uint64 ImagePathID
    }
    fid }|--|{ fid2ipid : Relation
    ipid ||--|| fid2ipid : Relation
```

This establishes a many-to-one relationship between features and image paths.
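
To make the three-table layout concrete, here is a toy model with plain Python dicts standing in for the RocksDB tables. All names here are illustrative, not `imsearch`'s actual schema:

```python
# Toy model of the three tables; plain dicts stand in for RocksDB.
features = {}           # FeatureID -> 32-byte descriptor
image_paths = {}        # ImagePathID -> path string
feature_to_image = {}   # FeatureID -> ImagePathID (many features, one path)

def add_image(path: str, descriptors: list[bytes]) -> None:
    image_path_id = len(image_paths)
    image_paths[image_path_id] = path
    for descriptor in descriptors:
        feature_id = len(features)      # incremental feature IDs
        features[feature_id] = descriptor
        feature_to_image[feature_id] = image_path_id

# Hypothetical example: one image contributing two descriptors
add_image("works/RJ000001/cover.jpg", [b"\x00" * 32, b"\xff" * 32])
```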

After all features are added to RocksDB, `imsearch export-data` is called, and all the feature vectors are exported into a serialized NumPy array.
After exporting, the Python script provided in `utils/train.py` creates a new FAISS index using `IndexBinaryIVF` with dimension 256 (a `uint8` is 8 bits and one feature vector holds 32 `uint8` values, so a single feature vector occupies 256 bits; hence the binary index dimension is 256). `k`, or `nlist`, is at the user's discretion, depending on the number of features in the database. The `nlist` parameter divides [all feature vectors into clusters](https://github.com/facebookresearch/faiss/wiki/Faster-search); it is the number of clusters to form. This kind of index requires training: the script trains the index on all the feature vectors contained in the serialized NumPy array exported by `imsearch`. Once training completes, the newly created FAISS index is serialized and saved to `~/.config/imsearch/`.
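
A minimal sketch of what that training step boils down to (file names and `nlist` are placeholders; the real script is `utils/train.py`):

```python
# Sketch of the training step, assuming descriptors were exported as an
# (N, 32) uint8 array.
import faiss
import numpy as np

descriptors = np.load("features.npy")    # shape (N, 32), dtype uint8
d = descriptors.shape[1] * 8             # 32 bytes -> 256-bit binary vectors

nlist = 4096                             # cluster count, tuned to N
quantizer = faiss.IndexBinaryFlat(d)
index = faiss.IndexBinaryIVF(quantizer, d, nlist)

index.train(descriptors)                 # learns clusters; index stays empty
faiss.write_index_binary(index, "index.bin")
```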

Afterwards, running `imsearch build-index` actually adds all the vectors to the index. During training the index itself stays empty; training only learns the clusters, so that the feature vectors added afterwards are grouped well and the KNN search becomes much more performant.

Once index building completes, `imsearch` can finally be used for reverse image search, either via the CLI or the Web API.

During a search, a query image is passed in, and ORB_SLAM3 extracts all the feature vectors present in it, yielding a 2D matrix. If the image has 480 feature vectors, the matrix is 480x32; each row corresponds to a single feature vector of the image.

`imsearch` then performs a KNN search for every feature present in the image; all 480 feature vectors are searched, and each returns its neighbor vectors (their indices and their distances).
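
Roughly, in FAISS terms (continuing the training sketch above; the query matrix here is fake data):

```python
# Sketch of the query step against the trained binary index.
import faiss
import numpy as np

index = faiss.read_index_binary("index.bin")
index.add(np.load("features.npy"))        # roughly what `build-index` does

query = np.random.randint(0, 256, (480, 32), dtype=np.uint8)  # fake query image
knn = 3
distances, indices = index.search(query, knn)  # Hamming distances
# distances, indices: shape (480, knn), one row of neighbors per query feature
```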

After getting all the neighbor vectors (their IDs) and their distances, we look up each neighbor's vector ID in the RocksDB `FeatureID2ImagePathID` table to find its corresponding image file path. We then assign a similarity score to each feature based on its distance to the neighbor vector. We essentially obtain a statistical chart: a HashMap with the image path as key and, as value, a list of similarity scores, one per comparison between a query feature vector and a neighbor vector (and its image). Please see [`lolishinshi/imsearch/src/imdb.rs`](https://github.com/lolishinshi/imsearch/blob/master/src/imdb.rs#L185).

Usually, if the query has a match, the search finds several image paths, each with a number of feature similarity scores attached. If the scores are high and there are plenty of them under the same image path, that path is probably the original image we are trying to find. If not, either the scores will be low, or the matches will be scattered across completely different image paths, with low similarity on each feature-to-neighbor comparison.

Finally, all the scores are weighted using Wilson's score, giving each image path a single uniform similarity score, and the result is passed back to the end user.
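
For reference, a sketch of that final aggregation. Treating each feature score as a Bernoulli "success" ratio is my own simplification; `imsearch`'s exact weighting lives in `src/imdb.rs`:

```python
# Sketch: rank image paths by the Wilson score interval's lower bound.
from math import sqrt

def wilson_lower_bound(successes: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for n trials."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

# scores_by_path: image path -> per-feature similarity scores in [0, 1]
scores_by_path = {"a.jpg": [0.9, 0.8, 0.95], "b.jpg": [0.4]}
ranked = sorted(
    ((wilson_lower_bound(sum(s), len(s)), path) for path, s in scores_by_path.items()),
    reverse=True,
)
print(ranked)  # paths with many high scores win over single lucky matches
```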

Still, it's not a trivial process; whoever came up with the idea, I must give you my praise. But holy shit, the source code for `lolishinshi/imsearch` is hard to read. It comes with basically no documentation (other than how to use it), and reading Rust code is extremely hard for me, especially when there is some chaining action going on, like this:

```rust
// Fragment of lolishinshi/imsearch/src/index.rs @ L185
pub fn search<M>(&self, points: &M, knn: usize) -> Vec<Vec<Neighbor>>
where
    M: Matrix,
{
    assert_eq!(points.width() * 8, self.d as usize);
    let mut dists = vec![0i32; points.height() * knn];
    let mut indices = vec![0i64; points.height() * knn];
    let start = Instant::now();
    unsafe {
        // FFI call into FAISS: fills `dists` and `indices` in place
        faiss_IndexBinary_search(
            self.index,
            points.height() as i64,
            points.as_ptr(),
            knn as i64,
            dists.as_mut_ptr(),
            indices.as_mut_ptr(),
        );
    }

    debug!("knn search time: {:.2}s", start.elapsed().as_secs_f32());
    // Pair up (index, distance), then regroup the flat result into
    // one Vec<Neighbor> of length `knn` per query point
    indices
        .into_iter()
        .zip(dists.into_iter())
        .map(|(index, distance)| Neighbor {
            index: index as usize,
            distance: distance as u32,
        })
        .chunks(knn)
        .into_iter()
        .map(|chunk| chunk.collect())
        .collect()
}
```

I had to whip out GitHub Copilot for this hieroglyphic, because between the nonexistent code documentation, the ludicrous amount of `into_iter()` and chaining, the `unwrap`s and `Result`s, and the unfamiliar macros, it's a frustrating experience reading this code if you are not a Rust developer.

I will be adapting this image search method to Python and Milvus. Thank you, `lolishinshi`.
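
As a starting point for that adaptation, a sketch of what the Milvus side could look like: a collection of 256-bit binary ORB descriptors with the image path stored next to each vector, so the separate path tables become unnecessary. Collection and field names are my placeholders, not a finished design:

```python
# Sketch of a Milvus collection for 256-bit ORB descriptors.
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530", token="root:Milvus")

schema = client.create_schema(auto_id=True)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("vector", DataType.BINARY_VECTOR, dim=256)   # 32-byte descriptor
schema.add_field("path", DataType.VARCHAR, max_length=1024)   # image path metadata

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="vector",
    index_type="BIN_IVF_FLAT",   # binary IVF, analogous to FAISS IndexBinaryIVF
    metric_type="HAMMING",
    params={"nlist": 1024},
)

client.create_collection("orb_features", schema=schema, index_params=index_params)

# Each ORB descriptor is inserted as 32 raw bytes plus its image path
client.insert("orb_features", [{"vector": b"\x00" * 32, "path": "works/RJ000001/cover.jpg"}])
```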

FeatureExtraction/AudioFeatureExtraction.ipynb | 9153 (new file; diff suppressed because it is too large)
FeatureExtraction/ImageFeatureExtraction.ipynb | 5438 (new file; diff suppressed because one or more lines are too long)
FeatureExtraction/dataset_files.py | 81 (new file)
@@ -0,0 +1,81 @@
from pathlib import Path

ASMRThreePath = Path("C:\\ASMRThree")
ASMRTwoPath = Path("D:\\ASMRTwo")
ASMROnePath = Path("E:\\ASMROne")

size_one, size_two, size_three = 0, 0, 0
files_one, files_two, files_three = [], [], []
folders_one, folders_two, folders_three = [], [], []

# Statistics for ASMROne
for root, dirs, files in ASMROnePath.walk():  # walk() visits every folder (Python >= 3.12)
    if root.absolute() != ASMROnePath.absolute():  # Skip the root of ASMROnePath itself
        folders_one.append(root)  # Record the folder
    for fname in files:  # Iterate through all files under the current root
        file = root / fname  # Build the file path
        assert file.is_file()
        files_one.append(file)
        size_one += file.stat().st_size  # Accumulate file size

# Statistics for ASMRTwo
for root, dirs, files in ASMRTwoPath.walk():
    if root.absolute() != ASMRTwoPath.absolute():  # Skip the root of ASMRTwoPath itself
        folders_two.append(root)
    for fname in files:
        file = root / fname
        assert file.is_file()
        files_two.append(file)
        size_two += file.stat().st_size

# Statistics for ASMRThree
for root, dirs, files in ASMRThreePath.walk():
    if root.absolute() != ASMRThreePath.absolute():  # Skip the root of ASMRThreePath itself
        folders_three.append(root)
    for fname in files:
        file = root / fname
        assert file.is_file()
        files_three.append(file)
        size_three += file.stat().st_size

DataSubsetPaths = [ASMROnePath, ASMRTwoPath, ASMRThreePath]
DLSiteWorksPaths = []
# Collect ASMR works (RJ ID, paths)
for ASMRSubsetPath in DataSubsetPaths:
    for WorkPaths in ASMRSubsetPath.iterdir():
        DLSiteWorksPaths.append(WorkPaths)

fileExt2fileType = {
    ".TXT": "Document",
    ".WAV": "Audio",
    ".MP3": "Audio",
    ".PNG": "Image",
    ".JPG": "Image",
    ".VTT": "Subtitle",
    ".PDF": "Document",
    ".FLAC": "Audio",
    ".MP4": "Video",
    ".LRC": "Subtitle",
    ".SRT": "Subtitle",
    ".JPEG": "Image",
    ".ASS": "Subtitle",
    "": "NO EXTENSION",
    ".M4A": "Audio",
    ".MKV": "Video",
}

fileext_stat = {}
file_list = files_one + files_two + files_three
file_list_count = len(file_list)

for file in file_list:
    f_ext = file.suffix.upper()
    if f_ext in fileext_stat:
        fileext_stat[f_ext]['Count'] += 1
        fileext_stat[f_ext]['List'].append(file)
        fileext_stat[f_ext]['ExtensionMass'] += file.stat().st_size
    else:
        fileext_stat[f_ext] = {}
        fileext_stat[f_ext]['Count'] = 1
        fileext_stat[f_ext]['List'] = [file]
        fileext_stat[f_ext]['ExtensionMass'] = file.stat().st_size  # Total size of all files sharing this extension
        fileext_stat[f_ext]['MediaType'] = fileExt2fileType[f_ext]

FeatureExtraction/测试.png | BIN (new file, 1.2 MiB; binary file not shown)
LocalDatasetAnalysis.ipynb | 10016 (new file; diff suppressed because it is too large)
milvustests/quickstart.ipynb | 253 (new file)
@@ -0,0 +1,253 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "47471ef9",
   "metadata": {},
   "source": [
    "Creating client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d08ab631",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymilvus import MilvusClient\n",
    "\n",
    "client = MilvusClient(uri=\"http://localhost:19530\", token=\"root:Milvus\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecf3a2dd",
   "metadata": {},
   "source": [
    "Creating collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7bf82b6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "if client.has_collection(collection_name=\"demo_collection\"):\n",
    "    client.drop_collection(collection_name=\"demo_collection\")\n",
    "\n",
    "client.create_collection(\n",
    "    collection_name=\"demo_collection\",\n",
    "    dimension=768,  # The vectors we will use in this demo has 768 dimensions\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eef3759b",
   "metadata": {},
   "source": [
    "Adding sample vector data using Embeddings to Milvus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7f6083de",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "d:\\Repository\\DLSiteFSearch\\DLSiteFSearchPython_venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n",
      "d:\\Repository\\DLSiteFSearch\\DLSiteFSearchPython_venv\\Lib\\site-packages\\huggingface_hub\\file_download.py:144: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\\Users\\qt\\.cache\\huggingface\\hub\\models--GPTCache--paraphrase-albert-small-v2. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.\n",
      "To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development\n",
      "  warnings.warn(message)\n",
      "d:\\Repository\\DLSiteFSearch\\DLSiteFSearchPython_venv\\Lib\\site-packages\\huggingface_hub\\file_download.py:144: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\\Users\\qt\\.cache\\huggingface\\hub\\models--GPTCache--paraphrase-albert-onnx. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.\n",
      "To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development\n",
      "  warnings.warn(message)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dim: 768 (768,)\n",
      "Data has 3 entities, each with fields: dict_keys(['id', 'vector', 'text', 'subject'])\n",
      "Vector dim: 768\n"
     ]
    }
   ],
   "source": [
    "from pymilvus import model\n",
    "# If connection to https://huggingface.co/ failed, uncomment the following path\n",
    "# import os\n",
    "# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'\n",
    "\n",
    "# This will download a small embedding model \"paraphrase-albert-small-v2\" (~50MB).\n",
    "embedding_fn = model.DefaultEmbeddingFunction()\n",
    "\n",
    "# Text strings to search from.\n",
    "docs = [\n",
    "    \"Artificial intelligence was founded as an academic discipline in 1956.\",\n",
    "    \"Alan Turing was the first person to conduct substantial research in AI.\",\n",
    "    \"Born in Maida Vale, London, Turing was raised in southern England.\",\n",
    "]\n",
    "\n",
    "vectors = embedding_fn.encode_documents(docs)\n",
    "# The output vector has 768 dimensions, matching the collection that we just created.\n",
    "print(\"Dim:\", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)\n",
    "\n",
    "# Each entity has id, vector representation, raw text, and a subject label that we use\n",
    "# to demo metadata filtering later.\n",
    "data = [\n",
    "    {\"id\": i, \"vector\": vectors[i], \"text\": docs[i], \"subject\": \"history\"}\n",
    "    for i in range(len(vectors))\n",
    "]\n",
    "\n",
    "print(\"Data has\", len(data), \"entities, each with fields: \", data[0].keys())\n",
    "print(\"Vector dim:\", len(data[0][\"vector\"]))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e89e602",
   "metadata": {},
   "source": [
    "Inserting data to Milvus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e2098f0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'insert_count': 3, 'ids': [0, 1, 2]}\n"
     ]
    }
   ],
   "source": [
    "res = client.insert(collection_name=\"demo_collection\", data=data)\n",
    "\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a0e4a35",
   "metadata": {},
   "source": [
    "Semantic search / Vector search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2a687f94",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data: [\"[{'id': 2, 'distance': 0.5859946012496948, 'entity': {'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}}, {'id': 1, 'distance': 0.5118255615234375, 'entity': {'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history'}}]\"]\n"
     ]
    }
   ],
   "source": [
    "query_vectors = embedding_fn.encode_queries([\"Who is Alan Turing?\"])\n",
    "# If you don't have the embedding function you can use a fake vector to finish the demo:\n",
    "# query_vectors = [ [ random.uniform(-1, 1) for _ in range(768) ] ]\n",
    "\n",
    "res = client.search(\n",
    "    collection_name=\"demo_collection\",  # target collection\n",
    "    data=query_vectors,  # query vectors\n",
    "    limit=2,  # number of returned entities\n",
    "    output_fields=[\"text\", \"subject\"],  # specifies fields to be returned\n",
    ")\n",
    "\n",
    "print(res)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f8e5ba8",
   "metadata": {},
   "source": [
    "Metadata filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "03d6ae37",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data: ['[]']\n"
     ]
    }
   ],
   "source": [
    "# Insert more docs in another subject.\n",
    "docs = [\n",
    "    \"Machine learning has been used for drug design.\",\n",
    "    \"Computational synthesis with AI algorithms predicts molecular properties.\",\n",
    "    \"DDR1 is involved in cancers and fibrosis.\",\n",
    "]\n",
    "vectors = embedding_fn.encode_documents(docs)\n",
    "data = [\n",
    "    {\"id\": 3 + i, \"vector\": vectors[i], \"text\": docs[i], \"subject\": \"biology\"}\n",
    "    for i in range(len(vectors))\n",
    "]\n",
    "\n",
    "client.insert(collection_name=\"demo_collection\", data=data)\n",
    "\n",
    "# This will exclude any text in \"history\" subject despite close to the query vector.\n",
    "res = client.search(\n",
    "    collection_name=\"demo_collection\",\n",
    "    data=embedding_fn.encode_queries([\"tell me AI related information\"]),\n",
    "    filter=\"subject == 'biology'\",\n",
    "    limit=2,\n",
    "    output_fields=[\"text\", \"subject\"],\n",
    ")\n",
    "\n",
    "print(res)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "DLSiteFSearchPython_venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

requirements.txt | 8 (new file)
@@ -0,0 +1,8 @@
opencv-python
python-orb-slam3
pandas
matplotlib
numpy
pymilvus[model]
protobuf
grpcio-tools

source_keypoints_opencvorb.jpg | BIN (new file, 387 KiB; binary file not shown)
source_keypoints_slamorb3.jpg | BIN (new file, 445 KiB; binary file not shown)
test.ipynb | 1411 (diff suppressed because one or more lines are too long)
test_cvorb.ipynb | 2180 (diff suppressed because one or more lines are too long)
test_slamorb.ipynb | 1387 (new file; diff suppressed because one or more lines are too long)