Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
The paper "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" presents a novel approach to understanding interpretability inside large language models (LLMs). It proposes using sparse autoencoders to extract features that represent human-level ideas from the activations of neurons within an LLM. The authors argue that many neurons are polysemantic, meaning they can fire intensely for different tokens such as Arabic text or numbers. They introduce the concept of monosemanticity, which refers to a singular aspect of reality and is what the paper sets out to find. The problem set up involves training an autoencoder on the activations of neurons in a simple transformer network with a single layer NLP multilayer perceptron. The authors use dictionary learning techniques to identify features that represent human discernible concepts or ideas within the model's embeddings. They argue that these features can be thought of as basis vectors that span the vector space of activations, and they can be combined to create more complex features. The paper also discusses the idea of universality in topological structures learned by models, suggesting that different transformers or LLMs trained on various data sets might learn similar topologies. This opens up a new area of research into understanding how ideas are represented within these models and whether there is a common structure to them. Overall, this paper provides valuable insights into the interpretability of large language models and offers an interesting approach to understanding their inner workings.
Company: Arize
Date published: Nov. 2, 2023
Author(s): Sarah Welsh
Word count: 5012
Language: English
Hacker News points: None found.