Due to space (disk partition) constraints, the local dataset is split into three subsets:

- ASMROne
- ASMRTwo
- ASMRThree

There are no substantial differences between the subsets.

Subset sizes and audio work counts:

- ASMR One --> 119 audio works, 470 GB / 504 791 391 855 bytes
- ASMR Two --> 90 audio works, 439 GB / 471 683 782 635 bytes
- ASMR Three --> 121 audio works, 499 GB / 536 552 753 022 bytes

Total: 330 audio works, 1409 GB / 1 513 027 927 512 bytes

The GB figures are binary gigabytes (GiB, $2^{30}$ bytes), rounded down to whole units; the byte counts are exact.

The works differ in language (audio language, or an included translation subtitle file), size, audio encoding format, etc.

Basic statistics at the filesystem level:

| Subset     | File count | Folder count |
| ---------- | ---------- | ------------ |
| ASMR One   | 6317       | 1017         |
| ASMR Two   | 7435       | 760          |
| ASMR Three | 6694       | 1066         |

Average audio work size:

$1409 \, \text{GB} \div 330 \, \text{works} = 4.2\overline{69} \, \text{GB/work}$

That is approximately 4.27 GB per work.

In this project we will index only the following file types:

- Audio
- Image
- Document

An in-depth analysis of the dataset contents is located in `LocalDatasetAnalysis.ipynb`.
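
The filesystem-level figures and the average work size quoted in this section can be reproduced with a short script. The following is only a minimal sketch, not the code used in `LocalDatasetAnalysis.ipynb`: the helper name `subset_stats` is hypothetical, and the counting convention (the subset root folder itself is not counted, symbolic links to folders are not followed) is an assumption that may differ from how the table was produced.

```python
import os

GIB = 1024 ** 3  # the GB figures in this section correspond to 2**30 bytes


def subset_stats(root):
    """Return (file_count, folder_count, total_bytes) for everything under root.

    Hypothetical helper: the root folder itself is not counted as a folder.
    """
    file_count = folder_count = total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folder_count += len(dirnames)
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, folder_count, total_bytes


# Byte totals and work counts as reported in this section.
reported_bytes = {
    "ASMR One": 504_791_391_855,
    "ASMR Two": 471_683_782_635,
    "ASMR Three": 536_552_753_022,
}
reported_works = {"ASMR One": 119, "ASMR Two": 90, "ASMR Three": 121}

total_bytes = sum(reported_bytes.values())
total_works = sum(reported_works.values())
avg_gib = total_bytes / total_works / GIB

print(f"{total_works} works, {total_bytes} bytes, {avg_gib:.2f} GB per work")
```

Calling `subset_stats` on the folder of one subset returns the three numbers to compare against the table and the size list. Computing the average from exact byte counts rather than from the rounded 1409 GB gives the same value to two decimal places.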