added more information

2025-04-07 00:40:21 +02:00
parent af81c82d18
commit a9d3d10da9
16 changed files with 27549 additions and 2515 deletions

@@ -0,0 +1,35 @@
Due to disk partition space constraints, the local dataset is split into three subsets:
- ASMROne
- ASMRTwo
- ASMRThree
There are no substantial differences between the subsets.
Subset sizes and audio work counts:
- ASMR One --> 119 audio works, 470 GiB (504 791 391 855 bytes)
- ASMR Two --> 90 audio works, 439 GiB (471 683 782 635 bytes)
- ASMR Three --> 121 audio works, 499 GiB (536 552 753 022 bytes)

Total: 330 audio works, 1409 GiB (1 513 027 927 512 bytes)
Sizes are given in binary units (1 GiB = 1024³ bytes) and rounded down to whole GiB.
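
The figures above can be cross-checked with a few lines of Python. This is a minimal sketch, not part of the project code: it copies the byte counts from the list, sums them, and converts each to binary gigabytes (bytes divided by 1024³), which is the conversion that reproduces the rounded sizes shown. The names `SUBSET_BYTES`, `SUBSET_WORKS`, and `to_gib` are chosen here for illustration.

```python
# Byte counts and work counts copied from the list above.
SUBSET_BYTES = {
    "ASMR One": 504_791_391_855,
    "ASMR Two": 471_683_782_635,
    "ASMR Three": 536_552_753_022,
}
SUBSET_WORKS = {"ASMR One": 119, "ASMR Two": 90, "ASMR Three": 121}

GIB = 1024 ** 3  # bytes per binary gigabyte (gibibyte)


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gigabytes."""
    return n_bytes / GIB


total_bytes = sum(SUBSET_BYTES.values())
total_works = sum(SUBSET_WORKS.values())

for name, size in SUBSET_BYTES.items():
    print(f"{name}: {SUBSET_WORKS[name]} works, {to_gib(size):.2f} GiB")
print(f"Total: {total_works} works, {total_bytes} bytes, {to_gib(total_bytes):.2f} GiB")
```

Dividing by 10⁹ instead would give decimal gigabytes, which are about 7% larger in number for the same byte count.
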
The works vary in language (spoken audio language, or an included translated subtitle file), size, audio encoding format, and so on.
Basic filesystem-level statistics:
| Subset | File count | Folder count |
| ---------- | ---------- | ------------ |
| ASMR One | 6317 | 1017 |
| ASMR Two | 7435 | 760 |
| ASMR Three | 6694 | 1066 |
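
Counts like those in the table can be gathered by walking each subset directory. The following is a sketch under assumptions: the source does not say how the counts were produced, the `/data/...` mount points are hypothetical, and whether the subset root itself was counted as a folder is unknown (this version does not count it).

```python
import os
from pathlib import Path
from typing import Tuple


def count_entries(root: Path) -> Tuple[int, int]:
    """Return (file_count, folder_count) for everything below root.

    The root directory itself is not included in the folder count.
    """
    file_count = 0
    folder_count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        folder_count += len(dirnames)
        file_count += len(filenames)
    return file_count, folder_count


if __name__ == "__main__":
    # Hypothetical mount points; replace with the real subset locations.
    for subset in ("ASMROne", "ASMRTwo", "ASMRThree"):
        root = Path("/data") / subset
        if root.is_dir():
            files, folders = count_entries(root)
            print(f"{subset}: {files} files, {folders} folders")
```

By default `os.walk` does not descend into symbolic links to directories, so linked folders are counted once as folders but their contents are not.
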
Average audio work size:
$1409 \, \text{GiB} \div 330 \, \text{works} = 4.2\overline{69} \, \text{GiB/work}$
That is approximately 4.27 GiB per work.
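
As a cross-check, the same average can be worked out from the exact byte total rather than the rounded size. The result depends on which gigabyte is meant, so both conversions are shown:

$1\,513\,027\,927\,512 \, \text{bytes} \div 330 \, \text{works} \approx 4\,584\,933\,114 \, \text{bytes/work}$

$4\,584\,933\,114 \div 1024^{3} \approx 4.27$ (binary gigabytes per work)

$4\,584\,933\,114 \div 10^{9} \approx 4.58$ (decimal gigabytes per work)

The 4.27 figure therefore corresponds to binary gigabytes (bytes divided by $1024^{3}$).
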
In this project we will index only the following types of files:
- Audio
- Image
- Document
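
One simple way to restrict indexing to these three types is to map file extensions to categories. The sketch below is illustrative only: the source does not list which extensions belong to each type, so the sets in `CATEGORY_EXTENSIONS` are assumptions, and `categorize` and `count_indexable` are hypothetical helper names.

```python
from collections import Counter
from pathlib import Path
from typing import Optional

# Assumed extension lists; the source does not enumerate them.
CATEGORY_EXTENSIONS = {
    "Audio": {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".opus"},
    "Image": {".jpg", ".jpeg", ".png", ".webp", ".gif"},
    "Document": {".txt", ".pdf", ".vtt", ".srt", ".lrc"},
}


def categorize(path: Path) -> Optional[str]:
    """Return the category for a file, or None if it should be skipped."""
    suffix = path.suffix.lower()  # match extensions case-insensitively
    for category, extensions in CATEGORY_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


def count_indexable(root: Path) -> Counter:
    """Count files under root per category, ignoring all other files."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            category = categorize(path)
            if category is not None:
                counts[category] += 1
    return counts
```

Extension matching is cheap but can misclassify renamed files; inspecting file headers would be more robust at the cost of reading every file.
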
An in-depth analysis of the dataset's contents is in `LocalDatasetAnalysis.ipynb`.