Join us on Zoom: https://ucsc.zoom.us/j/
Description: Text classification is one of the most fundamental tasks in Natural Language Processing. One practical challenge is the lack of high-quality labeled data, especially for supervision-starved tasks where such data is difficult to obtain. My work aims to tackle this problem by effectively utilizing unlabeled data in text classification and applying weakly supervised learning methods to improve performance beyond what the existing labeled data alone supports. In this PhD dissertation, I present my work on developing and applying weakly supervised learning methods for text classification in various contexts, along with their potential ethical implications.
In the first three studies, we proposed several new weakly supervised learning methods that utilize unlabeled data to improve accuracy and interpretability. My first study followed the line of research on learning with noisy labels: in the context of research replication prediction, a supervision-starved text classification task, we proposed two methods aided by weakly supervised learning. My second study focused on semi-supervised learning approaches; specifically, we targeted long text classification tasks and used a new weakly supervised interpretable model to improve interpretability on them. In my third study, we proposed a new ensemble method that assigns better pseudo or noisy labels to samples in the unlabeled dataset for semi-supervised learning methods.
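To give a flavor of the pseudo-labeling idea mentioned above, here is a minimal sketch of ensemble-based pseudo-label assignment: several weak classifiers vote on each unlabeled sample, and a pseudo-label is kept only when enough members agree. The toy classifiers, texts, and agreement threshold are illustrative assumptions, not the dissertation's actual models.

```python
from collections import Counter

# Toy weak classifiers (assumptions for illustration only).
def keyword_clf(text):
    # labels by presence of a keyword
    return "positive" if "good" in text else "negative"

def length_clf(text):
    # labels by text length
    return "positive" if len(text.split()) > 3 else "negative"

def punct_clf(text):
    # labels by presence of an exclamation mark
    return "positive" if "!" in text else "negative"

def ensemble_pseudo_label(texts, classifiers, min_agree=2):
    """Assign a pseudo-label only when at least `min_agree` ensemble
    members agree; otherwise the sample stays unlabeled."""
    labeled = []
    for t in texts:
        votes = Counter(clf(t) for clf in classifiers)
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            labeled.append((t, label))
    return labeled

unlabeled = ["a good long result indeed!", "bad", "good!"]
pseudo = ensemble_pseudo_label(unlabeled, [keyword_clf, length_clf, punct_clf])
```

The confidently pseudo-labeled samples can then be merged with the labeled set to train a stronger classifier; raising `min_agree` trades coverage for pseudo-label quality.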
In my fourth study, I explored the ethical implications of weakly supervised learning, focusing on its fairness. We revealed disparate impacts on different sub-populations (e.g., by race and gender) when applying semi-supervised learning methods.
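One simple way to surface the kind of disparate impact described above is to compare a classifier's accuracy across sub-populations. The sketch below computes per-group accuracy and the gap between the best- and worst-served groups; the labels, predictions, and group names are fabricated for illustration and are not results from the study.

```python
def subgroup_accuracy(y_true, y_pred, groups):
    """Per-group accuracy; the gap between groups is one simple
    indicator of disparate impact."""
    acc = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        acc[g] = correct / len(idx)
    return acc

# Fabricated example data: two sub-populations "A" and "B".
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
groups = ["A", "A", "A", "B", "B", "B"]

acc = subgroup_accuracy(y_true, y_pred, groups)
gap = max(acc.values()) - min(acc.values())
```

A nonzero gap means a model that looks accurate on average can still systematically underserve one sub-population, which is the effect the fairness study examines.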
Finally, in my fifth study, we contributed a weakly supervised learning benchmark (Research Replication Prediction) to the community.