Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Nick Jiang* 1, 3, Xiaoqing Sun* 2, 3, Lisa Dunlap1, Lewis Smith, and Neel Nanda

* Equal contribution
1 University of California, Berkeley 2 Massachusetts Institute of Technology 3 MATS

We present a data analysis toolkit built on interpretable embeddings from sparse autoencoders (SAEs). An SAE is an autoencoder trained with a sparsity penalty so that only a small number of latent dimensions activate for any given input, and each latent tends to correspond to a human-interpretable concept. The toolkit helps researchers and practitioners understand the latent structure of their data and the relationships between different variables. It is implemented in Python and available on GitHub.

Paper    Code   

What makes Grok-4 qualitatively unique compared to other model families (e.g. GPT-5)?

Introduction

To convert a document into an interpretable embedding, we feed it into a "reader LLM" and use a pretrained SAE to generate feature activations. Then, we max-pool activations across tokens, producing a single embedding whose dimensions each map to a human-understandable concept. The interpretable nature of this embedding allows us to perform a diverse range of downstream data analysis tasks.
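As a concrete illustration, the sketch below produces such an embedding for a single document. It assumes a HuggingFace-style reader LLM and a pretrained SAE exposing an encode method; the model name, layer index, and load_sae helper are illustrative placeholders, not part of our released code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

READER = "meta-llama/Llama-3.1-8B"   # hypothetical reader LLM
LAYER = 16                           # residual-stream layer the SAE was trained on

tokenizer = AutoTokenizer.from_pretrained(READER)
model = AutoModelForCausalLM.from_pretrained(READER, output_hidden_states=True)
sae = load_sae(READER, layer=LAYER)  # hypothetical loader for the pretrained SAE

def embed(document: str) -> torch.Tensor:
    """Return one sparse embedding whose dimensions are SAE latents."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, d_model)
        acts = sae.encode(hidden)                       # (1, seq_len, n_latents)
    # Max-pool over tokens: each dimension keeps the strongest activation
    # of its concept anywhere in the document.
    return acts.max(dim=1).values.squeeze(0)            # (n_latents,)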

Tasks

Dataset diffing

Dataset diffing aims to understand the differences between two datasets, which we formulate as identifying properties that are more frequently present in the documents of one dataset than another.

Method. Find the top latents that activate more often in one dataset than the other. For each latent, we compute the difference between its activation frequencies in the two datasets. We then relabel the top 200 latents by this difference and summarize their descriptions.
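A minimal sketch of this diffing step, assuming each dataset has already been embedded into a documents-by-latents matrix of max-pooled activations; the activation threshold and function names below are our own placeholders.

import numpy as np

def diff_latents(acts_a: np.ndarray, acts_b: np.ndarray,
                 threshold: float = 0.0, top_k: int = 200) -> np.ndarray:
    """Indices of the top_k latents that activate more often in A than in B."""
    # A latent counts as "activated" on a document if its pooled value exceeds the threshold.
    freq_a = (acts_a > threshold).mean(axis=0)   # per-latent activation frequency in A
    freq_b = (acts_b > threshold).mean(axis=0)   # per-latent activation frequency in B
    diff = freq_a - freq_b                       # positive values: more frequent in A
    return np.argsort(-diff)[:top_k]             # candidates to relabel and summarize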

Comparing model outputs. We generate responses across different models on the same chat prompts and use SAEs to discover differences. We apply diffing across three axes of model changes:

  1. Model families: we diff three models (Grok-4, Gemini 2.5 Pro, and GPT-OSS-120B) against nine other frontier models.
  2. Finetuned vs. base: we diff the language backbone of LLaVA-1.6 against Vicuna-7B-v1.5, the language model it was finetuned from.
  3. Different system prompts: we diff responses generated with the system prompts "You are being evaluated" (Evaluation) and "You are being deployed" (Deployment) against responses to the default system prompt (i.e. no system prompt).

We compare the hypotheses generated by SAEs with those found by LLM baselines, finding that SAEs discover larger differences at a 2-8x lower token cost. SAE embeddings are particularly cost-effective when the same dataset is reused across multiple comparisons (e.g. across model families), since its embeddings only need to be computed once.

Correlations

We aim to identify arbitrary biases in a dataset (e.g. all French documents have emojis).

Method. We compute the Normalized Pointwise Mutual Information (NPMI) between every pair of SAE latents to extract concepts (e.g. "French" and "emoji") that co-occur more often than chance. To restrict to genuinely arbitrary correlations rather than pairs of related concepts, we only consider pairs whose latent descriptions (provided via Goodfire's auto-interp) have a semantic similarity below 0.2.
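A sketch of this computation, assuming a boolean documents-by-latents activation matrix and unit-normalized embeddings of the latent descriptions; the 0.2 threshold comes from the text above, while the function names and epsilon are illustrative assumptions.

import numpy as np

def npmi_matrix(active: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """active: (n_docs, n_latents) boolean matrix of latent activations."""
    n_docs = active.shape[0]
    p = active.mean(axis=0)                                             # marginal p(latent fires)
    joint = (active.T.astype(float) @ active.astype(float)) / n_docs    # p(i and j fire together)
    pmi = np.log(joint + eps) - np.log(np.outer(p, p) + eps)
    return pmi / (-np.log(joint + eps))                                 # normalize PMI into [-1, 1]

def surprising_pairs(active, desc_emb, sim_thresh=0.2, top_k=50):
    """High-NPMI latent pairs whose descriptions are semantically unrelated."""
    npmi = npmi_matrix(active)
    sims = desc_emb @ desc_emb.T              # cosine similarity of unit-norm description embeddings
    i, j = np.triu_indices_from(npmi, k=1)    # each unordered latent pair once
    keep = sims[i, j] < sim_thresh            # drop pairs whose concepts are merely related
    order = np.argsort(-npmi[i[keep], j[keep]])[:top_k]
    return list(zip(i[keep][order], j[keep][order]))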

Targeted clustering

Retrieval

Case Studies

How have OpenAI models changed over time?

Debugging Tulu-3's post-training dataset

Citation

@article{jiangsun2025interp_embed,
    title={Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit},
    author={Nick Jiang and Xiaoqing Sun and Lisa Dunlap and Lewis Smith and Neel Nanda},
    year={2025}
}