# Feature extraction for DLSite Works Audio

There are several ways to convert audio to vectors:
- Embedding models: PANNS, VGGish, wav2vec
- ORB-based similarity search (on chunked or whole audio)
- Shazam-like algorithm
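
To make the last option more concrete, here is a minimal sketch of a Shazam-like fingerprint using only NumPy. It is a simplification, not the actual Shazam algorithm: it keeps a single spectral peak per frame and compares two recordings by the overlap of their hash sets, whereas a real implementation would pick several peaks per region and match on time offsets. The names `spectrogram`, `fingerprint`, and `similarity`, and all parameter defaults, are assumptions made for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Magnitude spectrogram from a Hann-windowed short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len // 2 + 1)

def fingerprint(signal, fan_out=3):
    """Build a set of (f1, f2, dt) hashes from the strongest bin of each frame."""
    peaks = spectrogram(signal).argmax(axis=1)  # simplification: one peak per frame
    hashes = set()
    for t, f1 in enumerate(peaks):
        # Pair each anchor peak with the peaks of the next `fan_out` frames
        for dt in range(1, fan_out + 1):
            if t + dt < len(peaks):
                hashes.add((int(f1), int(peaks[t + dt]), dt))
    return hashes

def similarity(a, b):
    """Jaccard overlap between two fingerprint sets (0.0 to 1.0)."""
    return len(a & b) / max(len(a | b), 1)
```

Two clips can then be compared with `similarity(fingerprint(x), fingerprint(y))`, where `x` and `y` are mono waveforms as 1-D NumPy arrays of at least `frame_len` samples.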

After looking at the performance of the image feature extraction, I do not think that ORB feature extraction would be feasible in Python.

PyPy also has extremely limited compatibility with the pip packages required for this project.

We must move to C++, C, Go, or Rust in order to gain more performance.

In [1]:
# Basic dependencies
from towhee import pipe, ops, DataCollection
from pathlib import Path
import sys
import dataset_files
import cv2 as cv
import orb_slam3
import numpy as np
import time
import threading
import concurrent.futures
import torch
from pprint import pprint
from matplotlib import pyplot as plt

audio_paths = []
fileext_dict = dataset_files.fileext_stat # Dictionary containing all DLSite work's file path, separated by file extensions
for extension in fileext_dict:
    if fileext_dict[extension]['MediaType'] == "Audio":
        audio_paths += fileext_dict[extension]['List']
        
pprint(audio_paths)
print(f"Found {len(audio_paths)} audio files in all three subsets")

[PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラック01 プロローグ 天才怪盗くらんたん参上っ!.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラック02 メスガキボディ&生吐息! 耳舐め囁きダブル手コキ♪.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラック03 えちえちお口でバキュームご奉仕♪ くらんたんのベロテクフェラ♪.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラック04 くらんたん反撃→2秒で即負け♪ ナマイキ少女にわからせぱんぱん♪.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラック05 メロメロおまんこに高速ぱんぱん♪ ラブラブとろとろナマナマえっち♪.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラック06 エピローグ.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/トラックEX あまあまボイス&生吐息たっぷりクイックオナサポwithかぽかぽコール♪.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/01_WAV/フリートーク.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/03_SEなし/WAV_SEなし/トラック01 プロローグ 天才怪盗くらんたん参上っ!【SEなし】.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/03_SEなし/WAV_SEなし/トラック02 メスガキボディ&生吐息! 耳舐め囁きダブル手コキ♪【SEなし】.wav'),
 PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/03_SEなし/

In [2]:
# Reduce amount of samples to test
audio_paths = audio_paths[0:50] # Machine Learning on my machine is hard
#audio_paths_str = [str(p.absolute()) for p in audio_paths]
audio_paths_len = len(audio_paths)

## Feature extraction with PANNS

In [3]:
# Setup Pipeline with Towhee
# p = (
#     pipe.input('path')
#         .map('path', 'frame', ops.audio_decode.ffmpeg())
#         .map('frame', ('labels', 'scores', 'vec'), ops.audio_classification.panns())
#         .output('path', 'labels', 'scores', 'vec')
# )

In [4]:
features_panns = {} # HashMap: Path --> np.ndarray

# Iterate through all audio files
t_start = time.perf_counter()
# for i, path in enumerate(audio_paths):
#     path_str = str(path.absolute())
#     # Run inference
#     try:
#         res = p(path_str)
#         print(f"[OK] Extracted ({i+1}/{audio_paths_len}):", path_str)
#         features_panns[path] = res.get_dict()['vec']
#     except Exception as e:  # avoid a bare except so Ctrl+C still interrupts the loop
#         print(f"[ERROR] Failed to extract ({i+1}/{audio_paths_len}):", path_str, e)
#     finally:
#         torch.cuda.empty_cache()

features_stored = sum( [len(features_panns[file_path]) for file_path in features_panns] )
t_end = time.perf_counter()
delta_t = round(t_end - t_start, 2)

print()
print(f"Successfully extracted features for {len(features_panns)} files, stored {features_stored} vectors.")
print(f"Took {delta_t} seconds to extract all features")
# Note: sys.getsizeof counts only the dict container itself, not the vectors stored in it
print("Size in memory: ", sys.getsizeof(features_panns))


Successfully extracted features for 0 files, stored 0 vectors.
Took 0.0 seconds to extract all features
Size in memory:  64
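
The size printed above covers only the dictionary object, so it would stay small even with every vector stored. A rough sketch of a more representative measurement, assuming each value can be converted to a NumPy array (`deep_size_bytes` is a hypothetical helper, not part of the notebook):

```python
import sys
import numpy as np

def deep_size_bytes(features):
    """Approximate memory use of a {path: vectors} dict, including array data."""
    total = sys.getsizeof(features)  # the dict container itself
    for vec in features.values():
        # Assumption: each value is an ndarray or a uniform list of ndarrays
        total += np.asarray(vec).nbytes
    return total
```

For example, `deep_size_bytes(features_panns)` would add the raw bytes of every stored vector on top of the container size.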


In [5]:
features_panns

{}

Using PANNS is a bust: running inference on an RTX 3060 Laptop GPU with 6 GB of VRAM causes most of the feature extraction to fail with CUDA out-of-memory errors.

Although I am sure batching would greatly accelerate the process, even with Towhee's `batch` API the inference fails with CUDA out-of-memory errors, on my machine at least.

In terms of performance, counting the files that failed to extract, it runs at a rate of approximately 2 to 4 seconds per audio file. Again, insane performance.
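
One thing that might be worth trying, assuming the out-of-memory errors come from feeding whole long tracks to the model at once (I have not verified this), is to split each waveform into fixed-length windows, embed a few windows at a time, and average the results. The sketch below is model-agnostic: `embed_fn` is a hypothetical stand-in for whatever embedding model is used, taking an array of shape `(n_windows, window_samples)` and returning one vector per window. The window length and batch size are arbitrary defaults.

```python
import numpy as np

def chunked_embedding(waveform, sample_rate, embed_fn, window_s=10.0, batch_size=4):
    """Embed a long waveform window by window and mean-pool the results."""
    win = int(window_s * sample_rate)
    # Zero-pad so the waveform splits into a whole number of windows
    n_windows = max(1, -(-len(waveform) // win))  # ceiling division
    padded = np.zeros(n_windows * win, dtype=np.float32)
    padded[:len(waveform)] = waveform
    windows = padded.reshape(n_windows, win)

    # Only `batch_size` windows are passed to the model at a time,
    # which bounds the peak memory needed per call
    parts = [embed_fn(windows[i:i + batch_size])
             for i in range(0, n_windows, batch_size)]
    per_window = np.concatenate(parts, axis=0)
    return per_window.mean(axis=0)  # one vector per file
```

Mean-pooling discards where in the track a sound occurs, so for similarity search on partial clips it may be better to keep the per-window vectors instead.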