This paper proposes a new approach to understanding large language models (LLMs) using dictionary learning with sparse autoencoders. The authors aim to find monosemantic features, units that each represent a single, human-interpretable concept, in the activations of LLMs. They apply the method to a small transformer model, training an autoencoder on its neuron activations. The autoencoder is overcomplete, meaning its dictionary contains more features than the activation space has dimensions, which lets each activation vector be decomposed into a sparse combination of dictionary basis features corresponding to human-recognizable concepts. The paper demonstrates the effectiveness of this approach by extracting features that respond to specific kinds of text, such as Arabic script and numbers. The authors also examine the polysemanticity of neurons, where a single neuron fires in many unrelated contexts, and show how their method disentangles these mixed responses into separate features. The work has implications for tasks such as code generation, sentiment analysis, and topic modeling. While the paper does not claim to solve all interpretability problems in LLMs, it makes a significant contribution to the field by providing a new method for examining the internal workings of these models.
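
To make the setup concrete, the following is a minimal sketch of an overcomplete sparse autoencoder of the kind described above, written in PyTorch. The layer sizes, L1 coefficient, and the random stand-in for recorded activations are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sparse autoencoder for decomposing neuron activations.
# Sizes, the L1 coefficient, and the training data are assumed values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_activations: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: n_features > n_activations.
        self.encoder = nn.Linear(n_activations, n_features)
        self.decoder = nn.Linear(n_features, n_activations)

    def forward(self, x):
        # ReLU keeps feature activations non-negative and sparse.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Hypothetical sizes: 512 MLP neurons, an 8x overcomplete dictionary.
model = SparseAutoencoder(n_activations=512, n_features=4096)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_coefficient = 1e-3  # strength of the sparsity penalty (assumed value)

# Stand-in for a batch of recorded MLP activations.
activations = torch.randn(1024, 512)

for step in range(100):
    reconstruction, features = model(activations)
    # Reconstruction loss plus an L1 penalty that pushes each activation
    # vector to be explained by only a few dictionary features.
    loss = ((reconstruction - activations) ** 2).mean() \
           + l1_coefficient * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The columns of the decoder weight matrix play the role of the dictionary basis features; sparsity in the hidden layer is what allows individual features, rather than individual neurons, to line up with single concepts.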