Due to disk partition space constraints, the local dataset is split into three subsets:

- ASMR One
- ASMR Two
- ASMR Three

There are no substantial differences between the subsets.

Subset sizes and audio work counts:
- ASMR One: 119 audio works, 470 GB (504 791 391 855 bytes)
- ASMR Two: 90 audio works, 439 GB (471 683 782 635 bytes)
- ASMR Three: 121 audio works, 499 GB (536 552 753 022 bytes)

Total: 330 audio works, 1409 GB (1 513 027 927 512 bytes)

The GB figures are binary gigabytes (1 GB = $2^{30}$ bytes), rounded down.

The works vary in language (either in the audio itself or through an included translated subtitle file), size, audio encoding format, and so on.

Basic filesystem-level statistics:

| Subset     | File count | Folder count |
| ---------- | ---------- | ------------ |
| ASMR One   | 6317       | 1017         |
| ASMR Two   | 7435       | 760          |
| ASMR Three | 6694       | 1066         |
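
The file and folder counts in the table above could be gathered with a short directory walk. The following is only a sketch: the function name `count_entries` is hypothetical, and it assumes that the subset root itself is not counted as a folder, which may differ from how the original figures were obtained.

```python
import os
from pathlib import Path


def count_entries(root: Path) -> tuple[int, int]:
    """Return (file_count, folder_count) for everything below root.

    Assumption: the root directory itself is not counted as a folder.
    """
    file_count = 0
    folder_count = 0
    # os.walk visits every directory once and lists its direct children.
    for _dirpath, dirnames, filenames in os.walk(root):
        folder_count += len(dirnames)
        file_count += len(filenames)
    return file_count, folder_count
```

Calling `count_entries(Path("path/to/subset"))` on each subset root (the path here is a placeholder) would give one table row per subset.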

Average audio work size:

$1409 \, \text{GB} \div 330 \, \text{works} = 4.2\overline{69} \, \text{GB/work}$

That is approximately 4.27 GB per work.
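
The totals and the average above can be cross-checked from the per-subset byte counts. The sketch below assumes that the GB figures are binary gigabytes ($2^{30}$ bytes) rounded down, which is what the numbers suggest: the three rounded subset sizes add up to 1408, while the byte total converts to 1409. The variable names are illustrative only.

```python
GIB = 2**30  # bytes per binary gigabyte (assumed unit behind the "GB" figures)

# (audio works, size in bytes) per subset, copied from the figures in this document
subsets = {
    "ASMR One": (119, 504_791_391_855),
    "ASMR Two": (90, 471_683_782_635),
    "ASMR Three": (121, 536_552_753_022),
}

total_works = sum(works for works, _ in subsets.values())
total_bytes = sum(size for _, size in subsets.values())

for name, (works, size) in subsets.items():
    # Integer division rounds down, matching the listed sizes.
    print(f"{name}: {works} works, {size // GIB} GB")

print(f"Total: {total_works} works, {total_bytes // GIB} GB ({total_bytes} bytes)")
print(f"Average: {total_bytes / GIB / total_works:.2f} GB per work")
```

Computing the average from the exact byte total rather than from the rounded 1409 GB figure still gives roughly 4.27 GB per work.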

In this project, only the following file types will be indexed:

- Audio
- Image
- Document

An in-depth analysis of the dataset contents is located in `LocalDatasetAnalysis.ipynb`.
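
The restriction to audio, image, and document files could be expressed as a simple extension lookup. This is a minimal sketch, not the project's actual filter: the extension sets and the names `INDEXED_TYPES` and `classify` are assumptions made for illustration, and the real lists would depend on what the dataset contains.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension sets; adjust them to the formats found in the dataset.
INDEXED_TYPES = {
    "audio": {".mp3", ".wav", ".flac", ".m4a", ".ogg"},
    "image": {".jpg", ".jpeg", ".png", ".webp"},
    "document": {".txt", ".pdf", ".vtt", ".lrc"},
}


def classify(path: Path) -> Optional[str]:
    """Return the indexed category for path, or None if the file is skipped."""
    suffix = path.suffix.lower()  # compare extensions case-insensitively
    for category, extensions in INDEXED_TYPES.items():
        if suffix in extensions:
            return category
    return None
```

A file such as `track01.MP3` would be classified as `"audio"`, while an unlisted type such as `archive.zip` would return `None` and be left out of the index.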