# Local DLSite Audio Work File Analysis

This Python notebook attempts to gather basic statistics on each work stored in the local dataset.

Due to space constraints on the local system, the dataset is divided into three subsets:
- ASMROne
- ASMRTwo
- ASMRThree

In [3]:
import pandas as pd
from pathlib import Path
from IPython.display import display
import platform
#pd.set_option('display.float_format', lambda x: '%.2f' % x)
# CONSTANTS
ASMRThreePath = Path("C:\\ASMRThree")
ASMRTwoPath = Path("D:\\ASMRTwo")
ASMROnePath = Path("E:\\ASMROne")

if (platform.system() == 'Linux'):
    ASMROnePath = Path('/mnt/Scratchpad/ASMROne')
    ASMRTwoPath = Path('/mnt/MyStuffz/ASMRTwo')
    ASMRThreePath = Path('/mnt/Windows11/ASMRThree')

Basic file-system-level statistics: subset size, file count, and folder count

In [4]:
def scan_subset(subset_path):
    """Walk one subset and return (total size in bytes, file paths, folder paths)."""
    size, files, folders = 0, [], []
    for root, dirnames, fnames in subset_path.walk(): # Path.walk() requires Python 3.12+
        if root.absolute() != subset_path.absolute(): # Skip the subset's own root
            folders.append(root) # Record every nested folder
        for fname in fnames: # Iterate through all files in the current folder
            file = root/fname # Build the full file path
            assert file.is_file()
            files.append(file)
            size += file.stat().st_size # Accumulate file size in bytes
    return size, files, folders

# Same statistics for each subset
size_one, files_one, folders_one = scan_subset(ASMROnePath)
size_two, files_two, folders_two = scan_subset(ASMRTwoPath)
size_three, files_three, folders_three = scan_subset(ASMRThreePath)

fsstat_dict = {
    "Total Size": [size_one, size_two, size_three],
    "File Count": [len(files_one), len(files_two), len(files_three)],
    "Folder Count": [len(folders_one), len(folders_two), len(folders_three)]
}
df_fsstats = pd.DataFrame(fsstat_dict, index=["ASMROne", "ASMRTwo", "ASMRThree"])

In [5]:
df_fsstats

Unnamed: 0,Total Size,File Count,Folder Count
ASMROne,504791391855,6317,1017
ASMRTwo,471683782635,7435,760
ASMRThree,536552753022,6694,1066


In [6]:
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    display(df_fsstats.describe().loc[['count', 'mean', 'min', 'max']])

Unnamed: 0,Total Size,File Count,Folder Count
count,3.0,3.0,3.0
mean,504342642504.0,6815.33,947.67
min,471683782635.0,6317.0,760.0
max,536552753022.0,7435.0,1066.0


In [7]:
size_total = size_one + size_two + size_three
print("Total size in GigaBytes:", round(size_total / 1024**3, 2), "GB")
print("Average size per subset:", round(size_total / 1024**3 / 3, 2), "GB")

Total size in GigaBytes: 1409.12 GB
Average size per subset: 469.71 GB
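
The figures above divide by 1024**3, so strictly speaking they are gibibytes (GiB) even though the label says GB. As a small sketch, the conversion can be reproduced from the three subset sizes in the table; the `to_gib` and `to_gb` helper names are illustrative and not part of the notebook.

```python
# Subset sizes in bytes, copied from the df_fsstats output above
subset_sizes = {
    "ASMROne": 504791391855,
    "ASMRTwo": 471683782635,
    "ASMRThree": 536552753022,
}

def to_gib(n_bytes):
    """Binary unit: 1 GiB = 1024**3 bytes (what the notebook computes)."""
    return n_bytes / 1024**3

def to_gb(n_bytes):
    """Decimal unit: 1 GB = 10**9 bytes (what drive vendors usually quote)."""
    return n_bytes / 10**9

total_bytes = sum(subset_sizes.values())
print("Total:", round(to_gib(total_bytes), 2), "GiB =", round(to_gb(total_bytes), 2), "GB")
print("Average per subset:", round(to_gib(total_bytes) / 3, 2), "GiB")
```

The two units differ by about 7% at this scale, which matters when comparing the totals against the advertised capacity of a drive.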


Statistics on each DLSite work

In [8]:
DataSubsetPaths = [ASMROnePath, ASMRTwoPath, ASMRThreePath]
DLSiteWorksPaths = []
worksStats_dict = {"Product Code": [], "Size": [], "File Count": [], "Folder Count": []}

# Collect ASMR Works (RJ ID, Paths)
for ASMRSubsetPath in DataSubsetPaths:
    for WorkPaths in ASMRSubsetPath.iterdir():
        DLSiteWorksPaths.append(WorkPaths)

# Add statistical data to worksStats_dict
for WorkPath in DLSiteWorksPaths:
    p_code = WorkPath.name
    p_size = 0
    p_folder_path_list = []
    p_files_path_list = []
    for root, dirnames, fnames in WorkPath.walk():
        if root.absolute() != WorkPath.absolute(): # Skip root of audio work's path
            p_folder_path_list.append(root) # Add folder to list
        for fname in fnames: # Iterate through all files in current root
            file = root/fname # Get file path
            assert file.is_file()
            p_files_path_list.append(file)
            p_size += file.stat().st_size # Get file size
    # Return result
    worksStats_dict['Product Code'].append(p_code)
    worksStats_dict['Size'].append(p_size)
    worksStats_dict['File Count'].append(len(p_files_path_list))
    worksStats_dict['Folder Count'].append(len(p_folder_path_list))

df_generalWorkStatistics = pd.DataFrame(worksStats_dict)
df_generalWorkStatistics.Size /= 1024**2 # Convert size from bytes to MegaBytes
with pd.option_context('display.max_rows', None): print(df_generalWorkStatistics)

    Product Code          Size  File Count  Folder Count
0     RJ01008923   3460.146747          36             6
1     RJ01011708    577.368072          12             3
2     RJ01011709   2170.986806          16             3
3     RJ01011995   1494.311481          32             6
4     RJ01016536   3908.667958          63             3
5     RJ01033660   5560.561470          40            11
6     RJ01037597   4230.816780          32             4
7     RJ01049458   4250.531967          63             8
8     RJ01050049   2957.886879          44             6
9     RJ01055275   3942.416523          36            10
10    RJ01058285   2286.479638          17             3
11    RJ01058640   1908.764962          31             4
12    RJ01059481    954.700173          10             2
13    RJ01066223   6018.408055          27             3
14    RJ01067534   3191.479604          40             8
15    RJ01068516   2661.421838          29             3
16    RJ01068828   3990.299317 

In [9]:
df_generalWorkStatistics

Unnamed: 0,Product Code,Size,File Count,Folder Count
0,RJ01008923,3460.146747,36,6
1,RJ01011708,577.368072,12,3
2,RJ01011709,2170.986806,16,3
3,RJ01011995,1494.311481,32,6
4,RJ01016536,3908.667958,63,3
...,...,...,...,...
325,RJ416843,300.660580,2,0
326,RJ417065,2757.642691,18,4
327,RJ417539,108.208293,7,0
328,RJ417567,6039.398273,65,8


The local dataset contains 330 works. Each work's size ranges from about 1 MB to 35,163 MB, averaging 4,373 MB per work.

File counts range from 1 to 2,375 files, averaging 62 files per work.

In [10]:
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    display(df_generalWorkStatistics.describe().loc[['count', 'mean', 'min', 'max']])

Unnamed: 0,Size,File Count,Folder Count
count,330.0,330.0,330.0
mean,4372.53,61.96,7.62
min,0.97,1.0,0.0
max,35163.43,2375.0,72.0


Statistics on file types

In [11]:
fileExt2fileType = {
    ".TXT": "Document",
    ".WAV": "Audio",
    ".MP3": "Audio",
    ".PNG": "Image",
    ".JPG": "Image",
    ".VTT": "Subtitle",
    ".PDF": "Document",
    ".FLAC": "Audio",
    ".MP4": "Video",
    ".LRC": "Subtitle",
    ".SRT": "Subtitle",
    ".JPEG": "Image",
    ".ASS": "Subtitle",
    "": "NO EXTENSION",
    ".M4A": "Audio",
    ".MKV": "Video"
}
fileext_stat = {}
file_list = files_one + files_two + files_three
file_list_count = len(file_list)

for file in file_list:
    f_ext = file.suffix.upper()
    if (f_ext in fileext_stat.keys()):
        fileext_stat[f_ext]['Count'] += 1
        fileext_stat[f_ext]['List'].append(file)
        fileext_stat[f_ext]['ExtensionMass'] += file.stat().st_size
    else:
        fileext_stat[f_ext] = {}
        fileext_stat[f_ext]['Count'] = 1
        fileext_stat[f_ext]['List'] = [file]
        fileext_stat[f_ext]['ExtensionMass'] = file.stat().st_size # The total sum of  sizes of the same file extension
        fileext_stat[f_ext]['MediaType'] = fileExt2fileType[f_ext]

# Data Wrangler    
fileext_stat

{'.TXT': {'Count': 1984,
  'List': [PosixPath('/mnt/Scratchpad/ASMROne/RJ01008923/readme.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01011708/readme.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01011708/台本.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01011709/readme.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01011709/台本.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01016536/音軌列表・說明(日文).txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01033660/春風恋歌 桜色の初恋【KU100ハイレゾ】.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01049458/クレジット.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01050049/readme.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01055275/瑠璃雪楼の狂詩曲 愛玩メイド青女の肛悦【KU100ハイレゾ】.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01058285/readme.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01058285/台本.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01058640/2-.FLAC版【低損耗形式・推薦線上收聽】/2.noSE/射精環節秒數♥.txt'),
   PosixPath('/mnt/Scratchpad/ASMROne/RJ01058640/4-.故事概要・音軌列表等圖檔集/06.音軌列表&介紹(日文).txt')

In [12]:
fileext_stat_df = pd.DataFrame(
    {
        "Media Type": [fileext_stat[ext]['MediaType'] for ext in fileext_stat.keys()],
        "File Extensions": fileext_stat.keys(),
        "Count": [fileext_stat[ext]['Count'] for ext in fileext_stat.keys()],
        "Extension Space Usage (GigaBytes)": [fileext_stat[ext]['ExtensionMass'] / 1024**3 for ext in fileext_stat.keys()],
        "Extension Amount Percentage%": [(fileext_stat[ext]['Count'] / file_list_count) * 100 for ext in fileext_stat.keys()],
        "Extension Space Percentage%": [(fileext_stat[ext]['ExtensionMass'] / size_total) * 100 for ext in fileext_stat.keys()]
    }
)
print("Total file count:", file_list_count)
print("Total size in GigaBytes", size_total / 1024**3)
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    display(fileext_stat_df)

Total file count: 20446
Total size in GigaBytes 1409.1170649155974


Unnamed: 0,Media Type,File Extensions,Count,Extension Space Usage (GigaBytes),Extension Amount Percentage%,Extension Space Percentage%
0,Document,.TXT,1984,0.02,9.7,0.0
1,Audio,.WAV,4101,1069.45,20.06,75.89
2,Audio,.MP3,4578,126.06,22.39,8.95
3,Image,.PNG,1067,5.72,5.22,0.41
4,Image,.JPG,4093,3.87,20.02,0.27
5,Subtitle,.VTT,2676,0.02,13.09,0.0
6,Document,.PDF,173,0.19,0.85,0.01
7,Audio,.FLAC,248,78.27,1.21,5.55
8,Video,.MP4,613,108.59,3.0,7.71
9,Subtitle,.LRC,730,0.0,3.57,0.0


In the table above, audio is the most space-consuming file type, especially uncompressed WAV audio, which alone takes up about a terabyte (1069.45 GB).

Since most works in the dataset provide both an MP3 and a WAV version of the same audio, we can observe that lossy MP3 compression shrinks the audio to roughly 1/10 of its original size, which is quite impressive. I cannot make the same comparison for FLAC because there are only 248 FLAC files, versus roughly 4,100 WAV and 4,600 MP3 files. Works that use FLAC compression are rare.
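
The "roughly 1/10" estimate can be sanity-checked from the table alone. This is a rough sketch: it compares the average WAV file size with the average MP3 file size, which assumes the two sets of files cover comparable tracks (the notebook does not verify this pairing). The numbers are copied from the table; the variable and function names are illustrative.

```python
# Count and total size (GB as printed, i.e. bytes / 1024**3) from the table above
table = {
    ".WAV": {"count": 4101, "size_gb": 1069.45},
    ".MP3": {"count": 4578, "size_gb": 126.06},
}

def average_mb(ext):
    """Average file size in MB (1024-based) for one extension."""
    row = table[ext]
    return row["size_gb"] * 1024 / row["count"]

avg_wav = average_mb(".WAV")
avg_mp3 = average_mb(".MP3")
ratio = avg_mp3 / avg_wav

print(f"Average WAV: {avg_wav:.1f} MB, average MP3: {avg_mp3:.1f} MB")
print(f"MP3 size as a fraction of WAV size: {ratio:.3f}")
```

A stricter comparison would pair each WAV file with the MP3 of the same track inside the same work, which needs the file lists rather than the summary table.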

Looking at the Extension Amount Percentage, which is each extension's share of the total file count, the files are spread fairly evenly across Audio/WAV, Audio/MP3, and Image/JPG, at about 20% each. For some reason, JPG seems to be the more popular image format: roughly 1,000 PNG files take up almost 6 GB, while roughly 4,000 JPG files take up only about 4 GB.

Videos also take up a considerable amount of space: about 109 GB.

In [13]:
fileext_stat['']['List']

[PosixPath('/mnt/MyStuffz/ASMRTwo/RJ291279/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ291279/1_AAC/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ291279/2_MP3/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ291279/3_FLAC/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ305883/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ305883/1_AAC/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ305883/2_MP3/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ305883/3_FLAC/.DS_Store'),
 PosixPath('/mnt/MyStuffz/ASMRTwo/RJ305883/プレゼント企画/.DS_Store')]

The files with "no extension" turn out to be .DS_Store files (created by the macOS Finder).
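
Since these files are Finder metadata rather than part of any work, they could be excluded before computing statistics. A minimal sketch, assuming a name-based filter is sufficient; `IGNORED_NAMES` and `is_os_metadata` are hypothetical names, and the Windows entries are an assumption about what else might show up, not something observed in this dataset.

```python
from pathlib import PurePosixPath

# File names treated as OS metadata. Only .DS_Store was observed in the dataset;
# Thumbs.db and desktop.ini are assumed Windows equivalents.
IGNORED_NAMES = {".DS_Store", "Thumbs.db", "desktop.ini"}

def is_os_metadata(path):
    """Return True if the file name matches a known OS metadata file."""
    return path.name in IGNORED_NAMES

# Two paths from the output above plus one regular file
sample = [
    PurePosixPath("/mnt/MyStuffz/ASMRTwo/RJ291279/.DS_Store"),
    PurePosixPath("/mnt/MyStuffz/ASMRTwo/RJ291279/2_MP3/.DS_Store"),
    PurePosixPath("/mnt/Scratchpad/ASMROne/RJ01008923/readme.txt"),
]
kept = [p for p in sample if not is_os_metadata(p)]
print(kept)
```

The same predicate could be applied to `file_list` before building `fileext_stat`, which would also remove the "NO EXTENSION" row from the table.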

Examining all the types of media contained in the audio works:

Documents (txt, pdf, etc.) can be easily indexed using full-text search or text embedding search.
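
To make the full-text option concrete, here is a toy sketch of an inverted index over character bigrams, a common trick for Japanese and Chinese text that has no spaces between words. It is a stand-in for a real search engine, not the project's chosen design; the function names, the product codes, and the sample texts are all made up for illustration.

```python
from collections import defaultdict

def bigrams(text):
    """Return the set of 2-character substrings, ignoring case and whitespace."""
    text = "".join(text.lower().split())
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every bigram of the query.

    This yields candidates only: the bigrams may occur in a document
    without the query appearing there as one contiguous phrase.
    """
    grams = bigrams(query)
    if not grams:  # queries shorter than two characters are not supported
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))

# Hypothetical documents keyed by a made-up relative path
docs = {
    "RJ000001/readme.txt": "トラックリスト 01 introduction",
    "RJ000002/台本.txt": "台本 introduction scene",
}
index = build_index(docs)
print(search(index, "トラック"))
print(search(index, "introduction"))
```

In practice the candidate set would be re-checked against the original text, and PDF files would need text extraction before they could be indexed this way.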

How to index audio files has yet to be examined; see Approach.md in the Obsidian notes for how to convert audio files into embeddings.

To index image files, we can use the same approach as the [lolishinshi/imsearch](https://github.com/lolishinshi/imsearch) project: extract each image's feature vectors and build an index over them. Indexing video, however, is out of scope for this project. For future reference, see [soruly/trace.moe](https://github.com/soruly/trace.moe).