Hyung Won Chung is a research scientist at OpenAI, specializing in large language models.
He has worked on various aspects of LLMs, including pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, and more.
Notable works include the scaling Flan paper (Flan-T5 and Flan-PaLM) and T5X, the training framework used to train the PaLM language model.
Before OpenAI, he was at Google Brain and holds a PhD from MIT.
The goal of the lecture is to develop a unified perspective on the history of Transformers and use it to project potential future developments in AI. Examining the early history of Transformers yields lessons and insights into where the field is heading.
The lecture will examine the architectures of Transformers to gain a deeper understanding of their development.
A Transformer is a sequence model that uses attention to model the interaction between sequence elements.
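To make that mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention. The shapes are illustrative, and the learned query/key/value projections, multiple heads, and feed-forward layers of a full Transformer are omitted.

```python
import numpy as np

def attention(Q, K, V):
    """Each query position takes a weighted sum over all value positions,
    with weights given by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise interactions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over keys
    return w @ V

x = np.random.randn(4, 8)         # a toy sequence: 4 tokens, dimension 8
out = attention(x, x, x)          # self-attention: Q, K, V from one sequence
print(out.shape)                  # (4, 8)
```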
There are three types of Transformers: encoder-decoder, encoder-only, and decoder-only.
The encoder-decoder Transformer is the original Transformer and has a more structured architecture compared to the other two types.
The encoder-only Transformer produces a single vector representation of the input sequence, regardless of its length.
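As a hypothetical illustration of that property, the sketch below runs one bidirectional (unmasked) attention layer and then mean-pools, yielding a fixed-size vector for any input length; real encoder-only models such as BERT typically use a [CLS] token or learned pooling instead.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x):
    d = x.shape[-1]
    # No mask: every token attends to every token (bidirectional attention).
    return softmax(x @ x.T / np.sqrt(d)) @ x

seq = np.random.randn(11, 16)          # 11 tokens, d_model = 16
h = encoder_layer(seq)
sentence_vector = h.mean(axis=0)       # pool to one fixed-size vector
print(sentence_vector.shape)           # (16,) regardless of sequence length
```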
The decoder-only Transformer is used in language models like GPT-3 and can generate sequences.
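The following sketch, with a randomly initialized embedding and output head standing in for a trained model, shows the decoder-only pattern: a causal mask blocks attention to future positions, and generation proceeds one token at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16
embed = rng.normal(size=(vocab, d))    # stub token embeddings
unembed = rng.normal(size=(d, vocab))  # stub output head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x):
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    scores = np.where(np.tril(np.ones((n, n), bool)), scores, -np.inf)
    return softmax(scores) @ x

tokens = [3, 14, 7]                        # prompt
for _ in range(5):                         # greedy autoregressive decoding
    h = causal_self_attention(embed[tokens])
    logits = h[-1] @ unembed               # predict from the last position
    tokens.append(int(logits.argmax()))
print(tokens)
```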
The cross-attention mechanism in the encoder-decoder Transformer allows each decoder position to attend to the encoder's sequence representations (conventionally, the output of the encoder's final layer).
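A minimal sketch of cross-attention follows, assuming queries come from the decoder states and keys/values from the encoder output; the learned projections are again omitted for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_h, encoder_out):
    d = decoder_h.shape[-1]
    scores = decoder_h @ encoder_out.T / np.sqrt(d)  # (n_target, n_input)
    return softmax(scores) @ encoder_out             # pull in encoder information

encoder_out = np.random.randn(10, 16)   # 10 encoded input tokens
decoder_h = np.random.randn(4, 16)      # 4 target positions generated so far
print(cross_attention(decoder_h, encoder_out).shape)  # (4, 16)
```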
The key design features of the decoder-only architecture are that a single self-attention mechanism serves both roles, handling within-sequence interaction and target-to-input interaction, and that the parameters are shared between the input and target sequences.
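The sketch below illustrates this unification under toy assumptions, with one shared weight matrix standing in for the model's parameters: input and target are concatenated into a single sequence, and one causally masked self-attention handles both kinds of interaction.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 16
W = rng.normal(size=(d, d)) / np.sqrt(d)     # ONE weight matrix for both segments

input_seq = rng.normal(size=(6, d))          # "input" tokens
target_seq = rng.normal(size=(3, d))         # "target" tokens
x = np.concatenate([input_seq, target_seq])  # one unified sequence, shape (9, d)

h = x @ W                # shared parameters: no encoder/decoder split
n = len(x)
scores = h @ h.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((n, n), bool)), scores, -np.inf)  # causal mask
out = softmax(scores) @ h
# Rows 6-8 (the target) attend to rows 0-5 (the input) through the SAME
# self-attention that handles input-to-input interaction: no separate
# cross-attention module exists.
print(out.shape)         # (9, 16)
```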
The encoder-decoder and decoder-only architectures differ in their attention mechanisms, parameter sharing, and target-to-input attention patterns.
The encoder-decoder architecture has separate self-attention and cross-attention mechanisms, separate encoder and decoder parameters, and decoder layers that all attend to the encoder's final-layer representation. The decoder-only architecture collapses both attention mechanisms into one, shares parameters across input and target, and lets each layer attend to the representation of the input at that same layer.
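The schematic sketch below contrasts the two information flows. It is a deliberate simplification (single-head attention, no projections, decoder self-attention and the causal mask omitted), meant only to show which representations attend to which.

```python
import numpy as np

def mix(q, kv):
    # Stand-in for one attention layer (single head, no projections).
    d = q.shape[-1]
    s = q @ kv.T / np.sqrt(d)
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ kv

L, d = 3, 8
inp, tgt = np.random.randn(5, d), np.random.randn(2, d)

# Encoder-decoder: the encoder runs to completion first; every decoder
# layer then cross-attends to the SAME final encoder representation.
enc = inp
for _ in range(L):
    enc = mix(enc, enc)
dec = tgt
for _ in range(L):
    dec = mix(dec, enc)

# Decoder-only: input and target advance together, so attention at layer i
# sees the layer-i representation of the input.
x = np.concatenate([inp, tgt])
for _ in range(L):
    x = mix(x, x)
```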
The additional structures in the encoder-decoder architecture, such as separate input and target parameters, cross-attention, and target-to-input attention, are useful when the input and target sequences are sufficiently different or long.
However, these assumptions may not hold for larger language models, longer target sequences, or multi-turn chat applications.
Bidirectional input attention may not be necessary at scale and presents engineering challenges for modern multi-turn chat applications, while unidirectional attention is more efficient because previously computed encodings can be cached and reused across turns, as sketched below.
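Here is a sketch of that caching argument, with illustrative shapes: because causal attention never lets earlier positions depend on later ones, each turn's keys and values can be appended to a cache, and only the new tokens need fresh computation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 16
kv_cache = np.empty((0, d))   # grows monotonically across turns

def extend(cache, new_tokens):
    """Append one turn; attention is computed only for the NEW tokens."""
    n_old = len(cache)
    kv = np.concatenate([cache, new_tokens])     # cached entries reused as-is
    n_new, n_total = len(new_tokens), len(kv)
    scores = new_tokens @ kv.T / np.sqrt(d)      # queries: new tokens only
    # Causal mask: new token i sees the cache plus new tokens up to itself.
    mask = np.arange(n_total)[None, :] <= (n_old + np.arange(n_new))[:, None]
    out = softmax(np.where(mask, scores, -np.inf)) @ kv
    return kv, out

turn1 = np.random.randn(7, d)                    # first turn of the chat
kv_cache, _ = extend(kv_cache, turn1)
turn2 = np.random.randn(4, d)                    # later turn: cache is reused
kv_cache, _ = extend(kv_cache, turn2)
print(kv_cache.shape)                            # (11, 16)
# With bidirectional (encoder-style) attention, every new turn would change
# the earlier tokens' representations, forcing a full re-encode each time.
```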
The shift from bidirectional to unidirectional attention is driven by the exponential decrease in compute costs and the scaling efforts that cheaper compute enables.
Analyzing historical artifacts and current events can provide insights into the assumptions and limitations of AI research, enabling the development of more general and scalable solutions.