Stanford CS25: V4 I Transformers that Transform Well Enough to Support Near-Shallow Architectures
23 May 2024
Self-Attention Mechanism
- Professor Jake Williams proposes a modified self-attention mechanism, an alternative to the standard dimensionalizing version, that treats self-attention as a feed-forward layer producing the attention weights (see the sketch after this list).
- The modified mechanism is compatible with the traditional self-attention approach and avoids the need for additional model complexity.
- The key point is that the vectors used for self-attention should support consistent and meaningful comparisons.
- Optimizing the keys and queries of standard self-attention is similar to learning token and word embeddings; using multiple self-attention heads introduces indeterminacy, since each head creates a different dimensional space.
- This indeterminacy relates to the lottery ticket hypothesis: multiple different embeddings can be run in parallel for robustness, with poorly initialized parameters eliminated.
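
A minimal sketch of the simplest reading of the bullets above: the inputs are compared directly to themselves to produce softmax attention weights, and the weighted context feeds a single softmax output layer. The shapes, the absence of learned key/query projections, and the parameter names (`W_out`, `b_out`) are illustrative assumptions, not the exact formulation from the talk.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def feedforward_self_attention(X, W_out, b_out):
    """Compare the inputs to themselves to get attention weights, then pass
    the attention-weighted context through a single feed-forward (softmax)
    output unit.  X: (T, d) token vectors for one context window."""
    scores = X @ X.T                          # inputs compared to themselves
    A = softmax(scores, axis=-1)              # one row of attention weights per position
    context = A @ X                           # attention-weighted context vectors
    return softmax(context @ W_out + b_out)   # next-token distribution per position

# toy usage with random stand-in vectors
T, d, V = 5, 16, 100
X = np.random.randn(T, d)
W_out = np.random.randn(d, V) * 0.01
b_out = np.zeros(V)
probs = feedforward_self_attention(X, W_out, b_out)
print(probs.shape)  # (5, 100)
```
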
Dimensionality Reduction and Vector Comparison
- Dimensionality reduction is necessary for language modeling, but it poses challenges: computational intractability, and the distance of the embedding layers from the information they must learn from.
- The discernability hypothesis proposes that low-dimensional vectors should be able to distinguish features, with more common features assigned more distinguishable vectors.
- The Bit Cipher algorithm generalizes one-hot vectors to low dimensions, allowing for controlled exploration of dimensionality (a stand-in sketch follows this list).
- A deterministic low-dimensionalization procedure enables non-random initialization of layers in neural networks, improving performance compared to random initialization.
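
The published Bit Cipher construction is not reproduced here; the following stand-in only illustrates the stated properties: a deterministic, low-dimensional generalization of one-hot codes in which more frequent tokens receive sparser, more distinguishable vectors. The frequency-ranked combination scheme is an assumption for illustration.

```python
import numpy as np
from itertools import combinations

def rank_codes(vocab_by_freq, d):
    """Deterministically assign d-dimensional codes to tokens ordered by
    descending frequency.  The d most frequent tokens get one-hot vectors;
    later tokens get L2-normalized combinations of 2 active bits, then 3, ...
    so more common tokens receive sparser, more distinguishable codes.
    NOTE: a stand-in for the Bit Cipher idea, not the published algorithm."""
    codes = {}
    ranks = iter(vocab_by_freq)
    for k in range(1, d + 1):                  # number of active bits
        for idx in combinations(range(d), k):  # deterministic order
            try:
                tok = next(ranks)
            except StopIteration:
                return codes
            v = np.zeros(d)
            v[list(idx)] = 1.0
            codes[tok] = v / np.linalg.norm(v)
    return codes

# toy usage: 10 tokens, 4 dimensions
vocab = [f"tok{i}" for i in range(10)]         # already sorted by frequency
codes = rank_codes(vocab, d=4)
print(codes["tok0"])   # one-hot: most frequent, maximally distinguishable
print(codes["tok9"])   # denser code for a rarer token
```

Because the assignment is deterministic, the same codes can also serve as a non-random initialization for network layers, as the last bullet above notes.
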
Warm-Starting Language Models
- The softmax activation function is necessary for the self-attention features, and a criterion derived by differentiation determines the targets for self-attention.
- Warm-starting a network with non-random vectors reduces perplexity and improves learning compared to a cold start (see the sketch after this list).
- The modified version of self-attention compares inputs to themselves, with keys and queries in between, and uses non-random initialization for the parameters.
- The warm-start solution can be applied to feed-forward layers with non-negative inputs.
- For non-unit normed vectors, the optimal value of K (number of features per prediction) is the average norm of the inputs.
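
A minimal sketch of the warm-start idea, under the assumption that "non-random initialization" means copying precomputed token vectors into a layer's parameters, with K set to the average input norm as stated above. The layer shape and the source of `warm_vectors` are illustrative.

```python
import numpy as np

def warm_start_layer(warm_vectors):
    """Initialize a softmax output layer from precomputed token vectors
    (e.g., deterministic codes or co-occurrence embeddings) instead of
    random noise.  Shapes and scaling are illustrative assumptions."""
    W = warm_vectors.T.copy()            # (d, V): one column per token
    b = np.zeros(W.shape[1])
    return W, b

def choose_K(inputs):
    """Rule stated in the notes: for non-unit-normed inputs, set K
    (features per prediction) to the average norm of the inputs."""
    return float(np.mean(np.linalg.norm(inputs, axis=-1)))

# toy usage with stand-in vectors
V, d = 100, 16
warm_vectors = np.random.randn(V, d)     # stand-in for precomputed vectors
W, b = warm_start_layer(warm_vectors)
X = np.random.randn(8, d)                # non-negative inputs in the actual setting
print(W.shape, choose_K(X))
```
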
Context Models and Training
- Longer context windows provide more information, but without feature weights, models do not simply get better as the window grows.
- Self-attention is needed to determine the best weights for context vectors.
- Different context models (block, radial, document) provide different information and can be integrated to improve language modeling.
- Bit Cipher vectors don't capture similarities between similar tokens, so traditional co-occurrence methods can be used to create vectors with meaningful similarities (see the first sketch after this list).
- A co-occurrence matrix is used to build vectors suitable for self-attention feed-forward unit models.
- Caching vector comparisons reduces the self-attention layer cost from quadratic to linear, making training faster (see the second sketch after this list).
- Models trained on small data can be effective but may not generalize well to larger datasets.
- Packing long contexts can be used to improve the utilization of the block model of context, but it requires careful engineering.
- Dynamically changing the context length allows for more efficient use of self-attention parameters without the need for packing.
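
A minimal sketch of building co-occurrence vectors with meaningful similarities, as described above; the window size, symmetric counting, and L2 normalization are illustrative choices rather than the exact recipe from the talk.

```python
import numpy as np

def cooccurrence_vectors(corpus, vocab, window=2):
    """Count symmetric co-occurrences within a fixed window and L2-normalize
    each row, so tokens that share contexts end up with similar vectors."""
    index = {tok: i for i, tok in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        ids = [index[t] for t in sent if t in index]
        for i, ti in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j != i:
                    C[ti, ids[j]] += 1.0
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    return C / np.maximum(norms, 1e-12)

# toy usage
corpus = [["turn", "the", "light", "on"], ["turn", "the", "light", "off"]]
vocab = ["turn", "the", "light", "on", "off"]
E = cooccurrence_vectors(corpus, vocab)
print(np.round(E @ E.T, 2))   # "on" and "off" share contexts, so their rows are similar
```
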
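
One way to read the caching point, as an assumption rather than the talk's exact bookkeeping: if the vectors being compared are fixed, every pairwise comparison between vocabulary items can be computed once up front and then looked up per position, so the expensive dot products are never recomputed during training.

```python
import numpy as np

def precompute_comparisons(E):
    """With fixed token vectors E of shape (V, d), compute all pairwise
    comparisons once.  Attention scores for any sequence are then gathered
    from this table instead of being recomputed (the exact caching scheme
    in the talk may differ)."""
    return E @ E.T                        # (V, V) similarity table

def cached_attention_scores(token_ids, table):
    """Look up attention scores for a sequence of token ids."""
    return table[np.ix_(token_ids, token_ids)]

# toy usage with identity vectors as a stand-in for fixed token vectors
V = 5
E = np.eye(V)
table = precompute_comparisons(E)
scores = cached_attention_scores(np.array([0, 2, 2, 4]), table)
print(scores.shape)                       # (4, 4)
```
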
Alternative Self-Attention Strategies
- The proposed method uses a warm start to initialize the embedding layer, which saturates quickly and doesn't require a large amount of data.
- Training times are significantly faster compared to standard self-attention models, even for large models with billions of parameters.
- The method is effective for training models on specific tasks without pre-training, as demonstrated by a use case of predicting whether to turn a light on or off from voice commands (a hypothetical sketch follows this list).
- The approach involves continuous data collection, transcription, language modeling, anticipation of user intent, and correction of training data.
- The models used for this task are small enough to fit on a microprocessor or a single-chip GPU, enabling real-time predictions and operation without an internet connection.
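
A hypothetical sketch of the light on/off use case: average fixed token vectors over a transcribed command and feed them to a tiny classifier small enough to run on a microcontroller-class device. The feature construction and logistic classifier are illustrative assumptions, not the pipeline from the talk.

```python
import numpy as np

def command_features(tokens, codes, d):
    """Average fixed token vectors over the transcribed command.  `codes`
    could come from the deterministic or co-occurrence sketches above;
    unknown words are skipped."""
    vecs = [codes[t] for t in tokens if t in codes]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

def predict_light(tokens, codes, w, d):
    """Tiny logistic model: returns the probability the user wants the
    light ON.  Runs entirely on-device, no internet connection needed."""
    x = command_features(tokens, codes, d)
    return 1.0 / (1.0 + np.exp(-(w @ x)))

# toy usage with hand-set token vectors and weights
d = 4
codes = {"on": np.array([1., 0, 0, 0]), "off": np.array([0., 1, 0, 0])}
w = np.array([2.0, -2.0, 0.0, 0.0])
print(predict_light(["turn", "the", "light", "on"], codes, w, d))   # > 0.5
print(predict_light(["turn", "the", "light", "off"], codes, w, d))  # < 0.5
```
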
Future Work
- Future work includes incorporating a speech recognition system into the model and exploring warm starts for other layer types.
- Implementations of the self-attention feed-forward unit (SAFU) will be made available after publication, though significant work remains on developing evaluation systems.