Leveraging Natural Supervision for Language Representation Learning and Generation: Overview

1 Jun 2024


(1) Mingda Chen.

1.1 Overview

Learning from Improved Self-Supervision (Chapter 3). Deriving training signals from plain text (also known as self-supervision) is the driving force behind recent breakthroughs in NLP. Approaches like BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) effectively transfer knowledge from plain text to various downstream NLP tasks. Recent research has demonstrated potential flaws in BERT’s learning objectives (Yang et al., 2019; Liu et al., 2019) and has improved GPT-3’s downstream task performance using human-annotated resources (Mishra et al., 2021; Wei et al., 2022). In this thesis, we present techniques that improve self-supervised training objectives without requiring extra human annotations.
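
To make the idea of self-supervision concrete, the following is a minimal sketch of how plain text can supply its own training signal: some words are hidden, and the hidden words become the prediction targets. It is illustrative only and is not the method developed in the thesis. The function name `make_masked_example`, the `[MASK]` placeholder string, the whitespace tokenization, and the masking probability are all assumptions made for this example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration


def make_masked_example(text, mask_prob=0.15, seed=0):
    """Turn a plain sentence into (inputs, targets) for masked-word prediction.

    inputs  : tokens with some positions replaced by MASK
    targets : the original token at masked positions, None elsewhere
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    tokens = text.split()      # naive whitespace tokenization (assumption)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    # Guarantee at least one masked position so every sentence yields a signal.
    if tokens and all(t is None for t in targets):
        i = rng.randrange(len(tokens))
        inputs[i], targets[i] = MASK, tokens[i]
    return inputs, targets


if __name__ == "__main__":
    inputs, targets = make_masked_example("the cat sat on the mat", mask_prob=0.3)
    print(inputs)
    print(targets)
```

No human labels are involved: the targets are recovered from the text itself, which is what allows this kind of objective to scale to large unannotated corpora.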

Learning from Rich Data Structures: Wikipedia Articles (Chapter 4). Pretrained language models primarily use learning objectives related to word prediction based on nearby context (Mikolov et al., 2013b; Peters et al., 2018; Devlin et al., 2019). In this thesis, we present approaches to leverage the rich article structures in Wikipedia for learning vector representations of various texts.

Learning from Rich Data Structures: Paired Data (Chapter 5). Much of the recent NLP work on learning disentangled representations and controllable generation has focused on disentangling and controlling attributes such as sentiment (Hu et al., 2017; Shen et al., 2017a) or formality (Ficler and Goldberg, 2017). In this thesis, we show that leveraging paired data structures enables us to disentangle the semantics and syntax in sentence representations and control the syntax of output sentences using a sentential exemplar.

Building Evaluation Tasks from Textual Resources (Chapter 6). We construct various text generation datasets from fan-contributed websites. The rich information provided on these websites allows these new datasets to have different focuses (e.g., long-form text rather than single sentence generation) and domains (e.g., television series rather than news) compared to prior work in the same task setting. We show that their unique characteristics lead to challenging research questions.

This paper is available on arXiv under a CC 4.0 license.