We propose a method for discovering and visualizing the differences between two learned
representations, enabling more direct and interpretable
model comparisons. We validate our method, which we call Representational Difference
Explanations (RDX), by using it to compare models with
known conceptual differences and demonstrate that it recovers meaningful
distinctions where existing explainable AI (XAI) techniques fail. Applied to
state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX
reveals both insightful representational differences and
subtle patterns in the data.
Although comparison is a cornerstone of scientific
analysis, current tools in machine learning, namely post hoc XAI methods,
struggle to support model comparison effectively. Our work addresses this
gap by introducing an effective and explainable tool for contrasting model
representations.
We train a model on a subset of MNIST containing the digits 3, 5, and 8. We compare representations from two checkpoints: a strong model, MS, with 94% accuracy, and an expert model, ME, with 98% accuracy. Below we show PCA plots of each model's representation of the dataset. It is clear that the expert model separates the clusters for each digit better.
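For readers who want to reproduce this kind of plot, a minimal sketch follows. It assumes the two checkpoints' features have already been extracted into arrays `feats_ms` and `feats_me` (with matching digit `labels`); these names are illustrative, not from our code.

```python
# Minimal sketch: 2D PCA of each checkpoint's features, colored by digit.
# Assumes feats_ms, feats_me are (N, D) numpy arrays of penultimate-layer
# features for the same images, and labels is the (N,) array of digit labels.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca(feats, labels, title, ax):
    proj = PCA(n_components=2).fit_transform(feats)
    for digit in (3, 5, 8):
        mask = labels == digit
        ax.scatter(proj[mask, 0], proj[mask, 1], s=4, label=str(digit))
    ax.set_title(title)
    ax.legend()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
plot_pca(feats_ms, labels, "MS (94%)", ax1)
plot_pca(feats_me, labels, "ME (98%)", ax2)
plt.show()
```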
We apply existing dictionary learning methods, Sparse Autoencoders (SAEs), Non-negative Matrix Factorization (NMF), and KMeans, and generate image grids. Image grids are commonly used to describe the concept vectors discovered by an explanation method. Below we show the image grids generated by each method for each model's (MS and ME) representation. We find that it is hard to mentally match an explanation to the model it is explaining, indicating that the explanations do not capture the right information for this task. For example, the NMF explanations are nearly identical for both models.
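As a rough illustration of this baseline pipeline, the sketch below fits NMF on one model's (non-negative) features and builds a grid from the top-k images per concept. The array names and hyperparameters are placeholders, not our exact configuration.

```python
# Sketch of the baseline pipeline: fit NMF on a model's features and build an
# image grid from the top-k images per concept. Assumes feats is a
# non-negative (N, D) feature matrix (e.g., post-ReLU activations) and images
# is the matching (N, 28, 28) array of MNIST digits.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF

def concept_grids(feats, images, n_concepts=8, top_k=9):
    # W[i, c] is the coefficient of concept c for image i.
    W = NMF(n_components=n_concepts, max_iter=500).fit_transform(feats)
    grids = []
    for c in range(n_concepts):
        top = np.argsort(-W[:, c])[:top_k]  # images that activate concept c most
        grids.append(images[top])
    return grids

# Example: show the grid for the first concept.
# grids = concept_grids(feats_ms, mnist_images)
# fig, axes = plt.subplots(3, 3)
# for ax, img in zip(axes.ravel(), grids[0]):
#     ax.imshow(img, cmap="gray"); ax.axis("off")
# plt.show()
```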
Below we show the output of RDX. Our method generates explanations that answer the question: what does model A consider to be more similar than model B (and vice versa)? This means that, in the left panel, the images within each image grid are considered more similar by model A than by model B. We can see that model A has some cleanly grouped 3s (grids 1 and 2) and 5s (grid 3) that model B considers dissimilar. On the right, we see that model B considers digits from different classes more similar than model A does (grids 2 and 3). Now, if we try to guess which checkpoints correspond to model A and model B, we can intuit that model B is MS, since we saw that MS has more mixing between classes.
RDX aims to sample images that are close together in one model's representation but far apart in the other model's representation. This is done by constructing a graph for each representation and clustering the difference between the graphs. In the figure, we can see that the green points are tightly grouped in the MS representation but spread apart in the ME representation. In the last column, we show the corresponding image grids (indicated by color). In the second slide, we show how the images from the grid spread apart into their correct clusters in the ME representation, showing conclusively that MS confuses some 3s, 5s, and 8s that ME categorizes correctly. This result partially explains the performance difference between the models.
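To make the idea concrete, here is a hedged sketch of the graph-difference step, not the exact algorithm from the paper: build a kNN affinity graph per representation, keep the pairs that model A links but model B does not, and cluster that difference graph. The choices of cosine similarity, spectral clustering, and k below are assumptions for illustration only.

```python
# Hedged sketch of the graph-difference idea: affinity graph per model,
# subtract, keep pairs where model A sees more similarity than model B,
# then cluster. Not the paper's exact algorithm; cosine kNN graphs and
# spectral clustering are illustrative choices. feats_* are (N, D) arrays.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import normalize

def knn_affinity(feats, k=15):
    # Symmetric cosine-similarity kNN graph as a dense (N, N) array.
    graph = kneighbors_graph(normalize(feats), k, mode="connectivity").toarray()
    return np.maximum(graph, graph.T)

def difference_clusters(feats_a, feats_b, n_clusters=6, k=15):
    # Positive entries: pairs that model A links but model B does not.
    diff = np.clip(knn_affinity(feats_a, k) - knn_affinity(feats_b, k), 0, None)
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(diff + 1e-6)  # small offset keeps the graph connected
    return labels

# "A more similar than B": clusters of images MS groups but ME pulls apart.
# cluster_ids = difference_clusters(feats_ms, feats_me)
```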
Dictionary learning (DL) methods like NMF and SAEs learn a linear combination of concept vectors that approximates the original representation. To visualize a concept vector, we select the top-k images with the largest coefficients for that concept. In the figure above (panels 1 and 2), we visualize the concept coefficients after applying NMF to MS and ME. We can see that concept 1 for MS and concept 2 for ME both activate most strongly on images of 5s. However, we can also see that concept 2 from ME is more selective for 5s than concept 1 from MS. This information is unavailable to the user when visualizing the image grids for these two concepts (bottom), but it is crucial for understanding the difference between the two models.
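One simple way to surface this hidden information is to summarize a concept's full coefficient distribution per class rather than only its top-k images. The sketch below does this, reusing the illustrative NMF coefficient matrices from the earlier sketch; the concept indices are placeholders.

```python
# Sketch: measure how selective a concept is from its full coefficient
# distribution per class, information the top-k image grid hides.
# W_ms, W_me are NMF coefficient matrices (as in the earlier sketch) and
# labels holds the digit labels.
import numpy as np

def class_profile(W, labels, concept, classes=(3, 5, 8)):
    # Mean coefficient of `concept` over the images of each class.
    return {c: float(W[labels == c, concept].mean()) for c in classes}

# e.g. {3: 0.02, 5: 0.41, 8: 0.05} indicates a concept selective for 5s,
# while a flatter profile indicates a less selective concept.
# print(class_profile(W_ms, labels, concept=0))
# print(class_profile(W_me, labels, concept=1))
```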
Additionally, DL methods represent each image as a linear combination of concept vectors, so we sometimes need to consider the contributions of multiple concepts when interpreting an image. However, it is extremely difficult to extract meaning from a weighted sum over image grids. How are we supposed to anticipate that the sum of these two concepts will produce a fairly normal-looking 5? This problem is amplified when we consider that the explanations might not accurately reflect the meaning of the concept vectors, as we saw in the previous section.
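The sketch below illustrates the issue: an image's representation is approximated by a weighted sum of several concept vectors, and the dominant contributions can only be read off from the coefficients, not from the image grids. Names again follow the earlier NMF sketch and are illustrative.

```python
# Sketch: in DL methods an image's representation is approximated by a
# weighted sum of several concept vectors, not by a single concept.
# W comes from NMF.fit_transform(feats) and H from model.components_,
# so feats is approximately W @ H.
import numpy as np

def top_contributions(W, H, image_idx, top_k=3):
    # Concepts with the largest contribution to this image's reconstruction.
    weights = W[image_idx]
    order = np.argsort(-weights)[:top_k]
    recon = weights[order] @ H[order]  # partial reconstruction from top concepts
    return list(zip(order.tolist(), weights[order].tolist())), recon

# A "normal-looking" 5 may only be explained by summing two or more concepts,
# which is hard to anticipate from the individual image grids alone.
# contribs, recon = top_contributions(W_ms, H_ms, image_idx=0)
```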
Finally, we conduct comparisons in more realistic scenarios. In the top panel, we look for concepts that DINOv2 encodes but DINO does not. We find that DINOv2 organizes various types of primates from ImageNet better, which likely contributes to its improved performance on fine-grained categories. In the bottom panel, we look for concepts that a CLIP model fine-tuned on iNaturalist discovers that the original CLIP model did not know. We find that the fine-tuned model has distinct groups for fall-colored Red Maple leaves and for fall-colored and green Silver Maple leaves, whereas the original CLIP model shows much more mixing for these same images. In both settings, we are able to isolate concepts unique to one of the two models, allowing us to form hypotheses about the differences between them. We include several more examples in our preprint.
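For context, here is a hedged sketch of how one might extract features from DINO and DINOv2 via torch.hub before running a representation comparison of this kind; the specific ViT-S checkpoints and preprocessing are our assumptions, not necessarily the setup used in the paper.

```python
# Hedged sketch: extract image features from DINO and DINOv2 for a
# representation comparison. The checkpoints (ViT-S variants) and
# preprocessing are illustrative assumptions.
import torch
from torchvision import transforms

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(model, pil_images):
    # Stack preprocessed images into a batch and return CLS-token features.
    batch = torch.stack([preprocess(img) for img in pil_images])
    return model(batch).cpu().numpy()  # (N, D) array, ready for comparison

# feats_dino, feats_dinov2 = embed(dino, images), embed(dinov2, images)
```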
@article{kondapaneni2025repdiffexp,
title={Representational Difference Explanations},
author={Kondapaneni, Neehar and Mac Aodha, Oisin and Perona, Pietro},
journal={arXiv preprint arXiv:2505.23917},
year={2025}
}