Representational Difference Explanations

California Institute of Technology, University of Edinburgh

What differs between these representations of the same data? Our method, RDX (right), isolates the differences between two representations while ignoring shared structure. This helps users focus on more complex aspects of a representation, since shared structures are often simpler and less interesting.

Abstract

We propose a method for discovering and visualizing the differences between two learned representations, enabling more direct and interpretable model comparisons. We validate our method, which we call Representational Difference Explanations (RDX), by using it to compare models with known conceptual differences and demonstrate that it recovers meaningful distinctions where existing explainable AI (XAI) techniques fail. Applied to state-of-the-art models on challenging subsets of the ImageNet and iNaturalist datasets, RDX reveals both insightful representational differences and subtle patterns in the data.

Although comparison is a cornerstone of scientific analysis, current tools in machine learning, namely post hoc XAI methods, struggle to support model comparison effectively. Our work addresses this gap by introducing an effective and explainable tool for contrasting model representations.

Analyzing Models with Known Differences

We conduct a simple experiment to evaluate RDX and compare it to existing methods.

We train a model on a subset of MNIST with the digits 3, 5, and 8. We compare representations from two checkpoints, one with strong performance (94%) and one with expert performance (98%). Below we show PCA plots for each model's representation of the dataset. It is clear that the expert model has better separation between clusters for each digit.
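For concreteness, here is a minimal sketch of how such PCA plots can be produced. The features and labels below are synthetic stand-ins for the activations of the strong (MS) and expert (ME) checkpoints on the 3/5/8 subset; this is an illustration, not the paper's exact pipeline.

        # Minimal sketch: PCA scatter plots of two representations of the same images.
        # feats_ms / feats_me are synthetic stand-ins for real checkpoint activations.
        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        labels = rng.choice([3, 5, 8], size=600)                        # digit labels
        feats_ms = rng.normal(size=(600, 64)) + 0.3 * labels[:, None]   # weaker class structure
        feats_me = rng.normal(size=(600, 64)) + 1.0 * labels[:, None]   # stronger class structure

        def plot_pca(features, title, ax):
            coords = PCA(n_components=2).fit_transform(features)
            for digit in (3, 5, 8):
                mask = labels == digit
                ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(digit))
            ax.set_title(title)
            ax.legend()

        fig, axes = plt.subplots(1, 2, figsize=(8, 4))
        plot_pca(feats_ms, "MS (94%)", axes[0])
        plot_pca(feats_me, "ME (98%)", axes[1])
        plt.show()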


A Matching Game Reveals Issues

We apply existing dictionary learning methods, namely Sparse Autoencoders (SAEs), Non-negative Matrix Factorization (NMF), and KMeans, and generate image grids. Image grids are commonly used to describe the concept vectors discovered by an explanation method. Below we show the image grids generated by each method for each model's (MS and ME) representation. We find that it is hard to mentally match each explanation to the model it explains, indicating that the explanations are not capturing the right information for this task. For example, the NMF explanations are nearly identical for both models.

Which model does each explanation correspond to?



RDX Can Explain Differences

Below we show the output of RDX. Our method generates explanations that answer the question: what does model A consider more similar than model B does (and vice versa)? This means that, in the left panel, the images within each image grid are considered more similar by model A than by model B. We can see that model A has some cleanly grouped 3s (grids 1 and 2) and 5s (grid 3) that model B considers dissimilar. On the right, we see that model B considers digits from different classes more similar than model A does (grids 2 and 3). Now, if we try to guess which models correspond to model A and model B, we can intuit that model B is MS, since we saw that MS has more mixing between the classes.



The RDX Objective

RDX aims to sample images that are close together in one model's representation but far apart in the other model's representation. This is done by constructing a graph for each representation and clustering the difference between the graphs. In the figure, we can see that the green points are tightly grouped in the MS representation but spread apart in the ME representation. In the last column, we show the corresponding image grids (indicated by color). In the second slide, we show how the images from the grid spread apart into their correct clusters in the ME representation, showing conclusively that MS confuses some 3s, 5s, and 8s that ME categorizes correctly. This result partially explains the performance difference between the models.
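To make the objective concrete, here is a minimal sketch of the difference-graph idea under illustrative assumptions (a symmetric k-NN affinity graph per representation and spectral clustering of the positive part of their difference); this is a simplification, not RDX's exact algorithm.

        # Sketch: cluster pairs that are close under model A but far apart under model B.
        import numpy as np
        from sklearn.neighbors import kneighbors_graph
        from sklearn.cluster import SpectralClustering

        def knn_affinity(features, k=10):
            # Symmetric k-NN connectivity graph for one representation.
            graph = kneighbors_graph(features, n_neighbors=k, mode="connectivity").toarray()
            return np.maximum(graph, graph.T)

        def difference_clusters(feats_a, feats_b, k=10, n_clusters=3):
            # Keep only edges present in A's graph but absent from B's, then cluster them.
            diff = np.clip(knn_affinity(feats_a, k) - knn_affinity(feats_b, k), 0, None)
            diff += 1e-6  # keep the affinity matrix connected for spectral clustering
            return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                      random_state=0).fit_predict(diff)

        # Synthetic features standing in for the two model representations.
        rng = np.random.default_rng(0)
        feats_a, feats_b = rng.normal(size=(300, 32)), rng.normal(size=(300, 32))
        grids = difference_clusters(feats_a, feats_b)  # each cluster -> candidate image grid

Swapping the two arguments asks the complementary question: which images does model B consider more similar than model A does?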



Issues with Using Dictionary-learning Methods for Comparison


Dictionary-learning (DL) methods like NMF and SAEs learn a linear combination of concept vectors to approximate the original representation. To visualize a concept vector, we select the top-k images with the largest coefficients for that concept. In the figure above (panels 1 and 2), we visualize the concept coefficients after applying NMF to MS and ME. We can see that concept 1 for MS and concept 2 for ME both activate most strongly for images of 5s. However, we can also see that concept 2 from ME shows more selectivity for 5s than concept 1 from MS. This information is unavailable to the user when visualizing the image grids for these two concepts (bottom), but it is crucial for understanding the difference between the two models.
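As a rough sketch (with synthetic non-negative features, not the paper's models), the snippet below builds a top-k image grid for one NMF concept and prints the per-class selectivity information that the grid alone does not convey.

        # Sketch: NMF concept coefficients, a top-k image grid, and per-class selectivity.
        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(0)
        labels = rng.choice([3, 5, 8], size=600)
        features = np.abs(rng.normal(size=(600, 64)) + labels[:, None])  # non-negative stand-ins

        # features ≈ W @ H, where rows of H are concept vectors and W holds the coefficients.
        nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
        W = nmf.fit_transform(features)

        concept = 0
        grid = np.argsort(W[:, concept])[::-1][:16]       # images shown in the grid
        print("grid image labels:", labels[grid])

        # Selectivity per class: visible in the coefficients, hidden by the image grid.
        for digit in (3, 5, 8):
            print(digit, W[labels == digit, concept].mean().round(3))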




Additionally, DL methods generate linear combinations of concept vectors. Sometimes we need to consider the contributions of multiple concepts when thinking about a single image. However, it is extremely difficult to extract meaning from a weighted sum over image grids. How are we supposed to anticipate that the sum of these two concepts will produce a fairly normal-looking 5? This problem is amplified when we consider that the explanations might not accurately reflect the meaning of the concept vectors, as we saw in the previous section.
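A tiny worked example of this point, again with stand-in NMF factors rather than the paper's models: a single image's representation is a weighted sum of several concept vectors, so interpreting that image requires mentally combining several image grids at once.

        # Sketch: one image's representation as a weighted sum of concept vectors.
        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(0)
        features = np.abs(rng.normal(size=(200, 64)))      # stand-in representation matrix
        nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
        W, H = nmf.fit_transform(features), nmf.components_

        i = 0                                              # pick one image
        approx = W[i] @ H                                  # weighted sum of concept vectors
        top2 = np.argsort(W[i])[::-1][:2]
        print("image", i, "mixes concepts", top2, "with weights", W[i][top2].round(3))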

Applying RDX to Realistic Scenarios

Discovering Unknown Differences


Finally, we conduct comparisons in more realistic scenarios. In the top panel, we look for concepts that DINOv2 encodes but DINO does not. We find that DINOv2 better organizes various types of primates from ImageNet, which likely contributes to its improved performance on fine-grained categories. In the bottom panel, we look for concepts that a CLIP model fine-tuned on iNaturalist encodes but the original CLIP model does not. We find that the fine-tuned model has distinct groups for fall-colored Red Maple leaves and for fall-colored and green Silver Maple leaves. When we look at the original CLIP model, we see much more mixing among these same images. In both settings, we are able to isolate concepts unique to one of the two models, allowing us to form hypotheses about the differences between them. We include several more examples in our preprint.

BibTeX


        @article{kondapaneni2025repdiffexp,
          title={Representational Difference Explanations},
          author={Kondapaneni, Neehar and Mac Aodha, Oisin and Perona, Pietro},
          journal={arXiv preprint arXiv:2505.23917},
          year={2025}
        }