added more information
35
DLSiteFSearchObsidian/Local dataset analysis.md
Normal file
@@ -0,0 +1,35 @@
Due to disk-partition space constraints, the local dataset is split into three subsets:

- ASMROne
- ASMRTwo
- ASMRThree

There are no substantial differences between the subsets.

Subset sizes and audio work counts:

- ASMR One --> 119 audio works, 470 GiB (504 791 391 855 bytes)
- ASMR Two --> 90 audio works, 439 GiB (471 683 782 635 bytes)
- ASMR Three --> 121 audio works, 499 GiB (536 552 753 022 bytes)

Total: 330 audio works, 1409 GiB (1 513 027 927 512 bytes)
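
The totals can be cross-checked from the byte counts. The sketch below assumes the gigabyte figures are binary (2^30 bytes per unit) and rounded down, which is what the listed byte counts imply. The dictionary layout and the `to_gib` helper are illustrative names, not part of the project.

```python
GIB = 1024 ** 3  # bytes per binary gigabyte (GiB)

# Work counts and byte sizes copied from the list above.
subsets = {
    "ASMR One": {"works": 119, "bytes": 504_791_391_855},
    "ASMR Two": {"works": 90, "bytes": 471_683_782_635},
    "ASMR Three": {"works": 121, "bytes": 536_552_753_022},
}


def to_gib(n_bytes: int) -> int:
    """Convert bytes to whole binary gigabytes, rounding down."""
    return n_bytes // GIB


total_works = sum(s["works"] for s in subsets.values())
total_bytes = sum(s["bytes"] for s in subsets.values())

for name, s in subsets.items():
    print(f"{name}: {s['works']} works, {to_gib(s['bytes'])} GiB")
print(f"Total: {total_works} works, {to_gib(total_bytes)} GiB ({total_bytes} bytes)")
```

The rounded per-subset figures add up to 1408 rather than 1409 because each one is rounded down separately. The total is derived from the summed byte count.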

The works vary in language (audio language, or an included translated subtitle file), size, audio encoding format, and so on.

Basic filesystem-level statistics:

| Subset     | File count | Folder count |
| ---------- | ---------- | ------------ |
| ASMR One   | 6317       | 1017         |
| ASMR Two   | 7435       | 760          |
| ASMR Three | 6694       | 1066         |
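
Counts like these can be gathered with a recursive directory walk. This is a minimal sketch, not necessarily how the table was produced: `count_entries` is a hypothetical helper, and it assumes the subset root folder itself is not counted as a folder.

```python
import os


def count_entries(root: str) -> tuple[int, int]:
    """Return (file_count, folder_count) for everything below root.

    The root directory itself is not included in the folder count.
    """
    file_count = 0
    folder_count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        folder_count += len(dirnames)
        file_count += len(filenames)
    return file_count, folder_count
```

Calling `count_entries` once per subset root would give one table row each. Summing the table gives 20 446 files and 2 843 folders across the three subsets.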

Average audio work size:

$1409 \, \text{GiB} \div 330 \, \text{works} = 4.2\overline{69} \, \text{GiB/work}$

That is approximately 4.27 GiB per work.
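
The same average can be broken down per subset from the byte counts. This is a small sketch assuming binary gigabytes (2^30 bytes); `average_gib` is an illustrative helper name.

```python
GIB = 1024 ** 3  # bytes per binary gigabyte (GiB)

# (work count, size in bytes) per subset, taken from the figures in this note.
subsets = {
    "ASMR One": (119, 504_791_391_855),
    "ASMR Two": (90, 471_683_782_635),
    "ASMR Three": (121, 536_552_753_022),
}


def average_gib(works: int, n_bytes: int) -> float:
    """Average size of one work in binary gigabytes."""
    return n_bytes / works / GIB


per_subset = {name: average_gib(w, b) for name, (w, b) in subsets.items()}
overall = average_gib(
    sum(w for w, _ in subsets.values()),
    sum(b for _, b in subsets.values()),
)

for name, avg in per_subset.items():
    print(f"{name}: {avg:.2f} GiB per work")
print(f"Overall: {overall:.2f} GiB per work")
```

By this calculation the per-subset averages are roughly 3.95, 4.88, and 4.13 GiB per work, so ASMR Two holds somewhat larger works on average than the other two subsets.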

In this project, only the following file types will be indexed:

- Audio
- Image
- Document
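
One simple way to restrict indexing to these three types is to classify files by extension. The sketch below is only an illustration: the extension sets and the `classify` helper are assumptions, since this note does not list the formats actually present in the dataset.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension sets; adjust them to the formats found in the dataset.
INDEXED_TYPES = {
    "Audio": {".mp3", ".wav", ".flac", ".ogg", ".m4a"},
    "Image": {".jpg", ".jpeg", ".png", ".webp", ".gif"},
    "Document": {".txt", ".pdf", ".md"},
}


def classify(path: str) -> Optional[str]:
    """Return the indexed category for a file, or None if it is not indexed."""
    suffix = Path(path).suffix.lower()
    for category, extensions in INDEXED_TYPES.items():
        if suffix in extensions:
            return category
    return None
```

Files for which `classify` returns `None` (for example, video files under these assumed sets) would be skipped by the indexer.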

An in-depth analysis of the dataset contents is located in `LocalDatasetAnalysis.ipynb`.