Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Nick Jiang* 1, 3, Xiaoqing Sun* 2, 3, Lisa Dunlap1, Lewis Smith, and Neel Nanda

* Equal contribution
1 University of California, Berkeley 2 Massachusetts Institute of Technology 3 MATS

We present a data analysis toolkit built on interpretable embeddings from sparse autoencoders (SAEs). An SAE is an autoencoder trained with a sparsity penalty so that only a small number of latent dimensions activate for any given input, and each latent tends to correspond to a human-interpretable concept. The toolkit helps researchers and practitioners understand the latent structure of their data and the relationships between different variables. It is implemented in Python and available on GitHub.

Paper    Code   

What makes Grok-4 qualitatively unique compared to other model families (e.g. GPT-5)?

Introduction

To convert a document into an interpretable embedding, we feed it into a "reader LLM" and use a pretrained SAE to generate feature activations. Then, we max-pool activations across tokens, producing a single embedding whose dimensions each map to a human-understandable concept. The interpretable nature of this embedding allows us to perform a diverse range of downstream data analysis tasks.
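As a concrete illustration, the sketch below produces such an embedding for a single document. It assumes a HuggingFace-style reader LLM and a pretrained SAE exposing an encode method; the model name, layer index, and load_sae helper are illustrative placeholders, not part of our released code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

READER = "meta-llama/Llama-3.1-8B"   # hypothetical reader LLM
LAYER = 16                           # residual-stream layer the SAE was trained on

tokenizer = AutoTokenizer.from_pretrained(READER)
model = AutoModelForCausalLM.from_pretrained(READER, output_hidden_states=True)
sae = load_sae(READER, layer=LAYER)  # hypothetical loader for the pretrained SAE

def embed(document: str) -> torch.Tensor:
    """Return one sparse embedding whose dimensions are SAE latents."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, d_model)
        acts = sae.encode(hidden)                       # (1, seq_len, n_latents)
    # Max-pool over tokens: each dimension keeps the strongest activation
    # of its concept anywhere in the document.
    return acts.max(dim=1).values.squeeze(0)            # (n_latents,)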

Tasks

Dataset diffing

Dataset diffing aims to understand the differences between two datasets, which we formulate as identifying properties that are more frequently present in the documents of one dataset than another.

Method. Find the top latents that activate more often in one dataset than the other. For each latent, we compute the difference between its activation frequencies in the two datasets. We then relabel the top 200 latents by this difference and summarize their descriptions.
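A minimal sketch of this diffing step, assuming each dataset has already been embedded into a documents-by-latents matrix of max-pooled activations; the activation threshold and function names below are our own placeholders.

import numpy as np

def diff_latents(acts_a: np.ndarray, acts_b: np.ndarray,
                 threshold: float = 0.0, top_k: int = 200) -> np.ndarray:
    """Indices of the top_k latents that activate more often in A than in B."""
    # A latent counts as "activated" on a document if its pooled value exceeds the threshold.
    freq_a = (acts_a > threshold).mean(axis=0)   # per-latent activation frequency in A
    freq_b = (acts_b > threshold).mean(axis=0)   # per-latent activation frequency in B
    diff = freq_a - freq_b                       # positive values: more frequent in A
    return np.argsort(-diff)[:top_k]             # candidates to relabel and summarize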

Comparing model outputs. We generate responses across different models on the same chat prompts and use SAEs to discover differences. We apply diffing across three axes of model changes:

  1. Model families: we diff three models (Grok-4, Gemini 2.5 Pro, and GPT-OSS-120B) against nine other frontier models.
  2. Finetuned vs. base: we diff the language backbone of LLaVA-1.6 against Vicuna-7B-v1.5, the language model it was finetuned from.
  3. Different system prompts: we diff responses generated with the system prompts "You are being evaluated" (Evaluation) and "You are being deployed" (Deployment) against responses to the default system prompt (i.e. no system prompt).

We compare the hypotheses generated by SAEs with those found by LLM baselines, finding that SAEs discover larger differences at a 2-8x lower token cost. SAE embeddings are particularly cost-effective when the same dataset is reused across multiple comparisons (e.g. across model families), since its embeddings only need to be computed once.

Correlations

We aim to identify arbitrary biases in a dataset (e.g. all French documents have emojis).

Method. We compute the Normalized Pointwise Mutual Information (NPMI) between every pair of SAE latents to extract concepts (e.g. "French" and "emoji") that co-occur more often than chance. To restrict to genuinely arbitrary correlations rather than pairs of related concepts, we only consider pairs whose latent descriptions (provided via Goodfire's auto-interp) have a semantic similarity below 0.2.
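A sketch of this computation, assuming a boolean documents-by-latents activation matrix and unit-normalized embeddings of the latent descriptions; the 0.2 threshold comes from the text above, while the function names and epsilon are illustrative assumptions.

import numpy as np

def npmi_matrix(active: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """active: (n_docs, n_latents) boolean matrix of latent activations."""
    n_docs = active.shape[0]
    p = active.mean(axis=0)                                             # marginal p(latent fires)
    joint = (active.T.astype(float) @ active.astype(float)) / n_docs    # p(i and j fire together)
    pmi = np.log(joint + eps) - np.log(np.outer(p, p) + eps)
    return pmi / (-np.log(joint + eps))                                 # normalize PMI into [-1, 1]

def surprising_pairs(active, desc_emb, sim_thresh=0.2, top_k=50):
    """High-NPMI latent pairs whose descriptions are semantically unrelated."""
    npmi = npmi_matrix(active)
    sims = desc_emb @ desc_emb.T              # cosine similarity of unit-norm description embeddings
    i, j = np.triu_indices_from(npmi, k=1)    # each unordered latent pair once
    keep = sims[i, j] < sim_thresh            # drop pairs whose concepts are merely related
    order = np.argsort(-npmi[i[keep], j[keep]])[:top_k]
    return list(zip(i[keep][order], j[keep][order]))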

Targeted clustering

Retrieval

Case Studies

How have OpenAI models changed over time?

Debugging Tulu-3's post-training dataset

Citation

@article{jiangsun2025interp_embed,
    title={Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit},
    author={Nick Jiang and Xiaoqing Sun and Lisa Dunlap and Lewis Smith and Neel Nanda},
    year={2025}
}