Overview
Data cleaning is essential, since even minor contamination in a dataset can significantly impact model performance and robustness. However, with the rise of deep learning and large-scale datasets, data cleaning has fallen out of focus, as large models were shown to work relatively well even with training data of mediocre quality. Validating and cleaning large datasets is furthermore challenging, especially for high-dimensional data, where thorough manual verification is often not feasible. Thus, much research has been devoted to learning from noisy data rather than fixing quality issues, as the overwhelming benefits of large-scale datasets are believed to outweigh the drawback of diminished control. However, this established line of argument is overly focused on training. Many benchmarks have been shown to contain data quality issues in their evaluation sets, which undermines the very framework by which scientific progress is measured. Moreover, when near-duplicate data is present in both training and evaluation sets, reported results are overestimates. The applicant’s research lab recently proposed a self-supervised cleaning framework for images that enjoys great success in the community. In this project, this paradigm shall be brought to the audio domain.