DLSiteFSearch/Local dataset analysis.md at a9d3d10da9ce1d8fc930fea77860e2e8c1a3ead1

qtnull/DLSiteFSearch

QTØ a9d3d10da9

added more information

2025-04-07 00:40:21 +02:00

1.2 KiB


Due to space (disk partition) constraints, the local dataset is split into three subsets:

ASMROne
ASMRTwo
ASMRThree

There are no substantial differences between the subsets. Subset sizes and audio work counts (GB here denotes binary gigabytes of 2^30 bytes, rounded down):

ASMR One --> 119 audio works, 470 GB / 504 791 391 855 bytes
ASMR Two --> 90 audio works, 439 GB / 471 683 782 635 bytes
ASMR Three --> 121 audio works, 499 GB / 536 552 753 022 bytes

Total: 330 audio works, 1409 GB / 1 513 027 927 512 bytes
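
The GB figures above match the byte counts when each is divided by 2^30 and rounded down. The following is a minimal sketch of that check, not the project's own code; the names `SUBSET_BYTES` and `to_binary_gb` are made up for illustration, and the binary unit is an assumption inferred from the numbers.

```python
# Byte counts copied from the listing above.
SUBSET_BYTES = {
    "ASMR One": 504_791_391_855,
    "ASMR Two": 471_683_782_635,
    "ASMR Three": 536_552_753_022,
}

# Assumption: "GB" in the listing is the binary unit (2**30 bytes).
BYTES_PER_GB = 2**30


def to_binary_gb(n_bytes: int) -> int:
    """Convert a byte count to whole binary gigabytes, rounding down."""
    return n_bytes // BYTES_PER_GB


total_bytes = sum(SUBSET_BYTES.values())

for name, size in SUBSET_BYTES.items():
    print(f"{name}: {to_binary_gb(size)} GB")
print(f"Total: {to_binary_gb(total_bytes)} GB ({total_bytes} bytes)")
```

Note that the total is computed from the summed byte counts rather than from the rounded per-subset figures, which is why it is 1409 and not 470 + 439 + 499 = 1408.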

The works vary in language (audio language, or an included translated subtitle file), size, audio encoding format, and so on.

Basic statistics at the filesystem level:

Subset	File count	Folder count
ASMR One	6317	1017
ASMR Two	7435	760
ASMR Three	6694	1066
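
Counts like these can be gathered with a recursive directory walk. The sketch below uses Python's `os.walk`; the function name `count_entries` is hypothetical. It is also an assumption that the subset root itself is not counted as a folder, since the source does not say how the figures were produced.

```python
import os


def count_entries(root):
    """Return (file_count, folder_count) for everything below `root`.

    The root directory itself is not counted as a folder (an assumption;
    the counting rule behind the table is not stated).
    """
    files = 0
    folders = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)
        files += len(filenames)
    return files, folders
```

Calling `count_entries` on a subset's root directory returns the two numbers for one table row. By default `os.walk` does not descend into symbolic links to directories, so linked content is not counted twice.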

Average audio work size:

$$1409 \, \text{GB} \div 330 \, \text{works} = 4.2\overline{69} \, \text{GB/work}$$

That is approximately 4.27 GB per work.

In this project, we will index only the following file types:

Audio
Image
Document
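
One simple way to restrict indexing to these three categories is to classify files by extension. The sketch below is illustrative only: the extension sets in `INDEXED_TYPES` and the `classify` helper are assumptions, not the project's actual list, and should be adjusted to whatever the dataset contains.

```python
from pathlib import Path

# Hypothetical extension sets; the source does not list the exact formats.
INDEXED_TYPES = {
    "audio": {".mp3", ".wav", ".flac", ".ogg", ".m4a"},
    "image": {".jpg", ".jpeg", ".png", ".webp"},
    "document": {".txt", ".pdf"},
}


def classify(path):
    """Return "audio", "image", or "document", or None if not indexed."""
    suffix = Path(path).suffix.lower()
    for kind, extensions in INDEXED_TYPES.items():
        if suffix in extensions:
            return kind
    return None
```

Files for which `classify` returns `None` would simply be skipped by the indexer. Matching on the lowercased suffix keeps names such as `track01.MP3` in the audio category.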

An in-depth analysis of the dataset contents is located in LocalDatasetAnalysis.ipynb.
