February 24, 2026 · 8 min read

How BiomedCLIP Builds Its Understanding of Chest X-Rays, Layer by Layer


TL;DR

We apply RDX (Kondapaneni et al., NeurIPS 2025) to consecutive layer pairs of BiomedCLIP on 2500 RSNA chest X-rays. We observe that early layers organize images by acquisition properties — brightness, orientation, hardware type, etc. By layer 8→9, disease-relevant concepts emerge (e.g., ground-glass opacities) that likely contribute to the model’s pneumonia detection performance. Later layers refine this into finer anatomical organization, including side-specific lung field structure.

A vision model trained on medical images doesn’t just learn to predict labels — it builds internal structure. We want to know what that structure is, layer by layer, without deciding in advance what to look for.

Representational Difference Explanations (RDX) is an unsupervised tool for this kind of open-ended exploration. It constructs a graph over two representations of the same data, with edges encoding the change in pairwise distance between images. Clusters of images whose distances decreased — meaning they were pulled together in the deeper representation — surface as candidate concepts. RDX doesn’t require any labels or pre-defined features.

Why not other approaches? Linear probes are effective when you can anticipate which concepts matter, but they require labels upfront — they confirm what you already suspect and support discovery only if you know what to look for. Dictionary learning methods like sparse autoencoders are unsupervised, but they represent concepts as linear combinations of feature directions, with coefficient tails that typically go unvisualized. When two representations are as similar as consecutive layers, the differences between them can be subtle enough to live in those tails — precisely where dictionary learning loses precision. RDX avoids this by working directly on pairwise distance changes, so nothing is hidden.

Method

RDX compares two representations of the same data and finds clusters of images that one representation groups together but the other does not. Each cluster is a candidate concept: something one representation has organized that the other hasn’t. Shared structure recedes, and what comes forward are the differences between the two representations.
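To make that concrete, here is a minimal sketch of the core quantity under simplifying assumptions: it uses raw cosine distances, whereas RDX proper uses rank-based distances and a biased, clamped difference (described below). The helper name is ours.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_change(feats_shallow, feats_deep):
    """Pairwise-distance change between two representations of the same images.

    feats_shallow, feats_deep: (n_images, dim) arrays from the two layers.
    Positive entries in the returned (n_images, n_images) matrix mark pairs
    that the deeper layer pulled closer together.
    """
    d_shallow = squareform(pdist(feats_shallow, metric="cosine"))
    d_deep = squareform(pdist(feats_deep, metric="cosine"))
    return d_shallow - d_deep  # > 0 where the deeper layer contracted a pair

Groups of images with large positive mutual entries are the candidate concepts.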

We applied RDX to each consecutive layer pair in BiomedCLIP, a vision-language model pretrained on biomedical image-text pairs from scientific literature, on 2500 images from the RSNA Pneumonia Detection Challenge. For each transition from layer i−1 to layer i, RDX identifies the image clusters that the deeper layer has newly organized — concepts that were not yet separated one layer earlier. Running this across all layer pairs traces how the model’s representation of the data evolves from input to output.
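Collecting per-block features might look like the following sketch. It is not our exact pipeline: the Hub ID is the one published with the BiomedCLIP release, the trunk.blocks path matches the layer names used in the results, and mean-pooling the token outputs is one simple choice of per-image summary.

import torch
import open_clip

# BiomedCLIP's vision tower wraps a timm ViT; its transformer blocks live at
# model.visual.trunk.blocks, matching the layer names used below.
model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
)
model.eval()

activations = {}

def save_output(name):
    def hook(module, inputs, output):
        # One vector per image: mean-pool the block's token outputs
        # (a simplifying choice, not necessarily what was used here).
        activations[name] = output.mean(dim=1).detach()
    return hook

for i, block in enumerate(model.visual.trunk.blocks):
    block.register_forward_hook(save_output(f"trunk.blocks.{i}"))

with torch.no_grad():
    batch = torch.stack([preprocess(img) for img in images])  # images: a list of PIL chest X-rays
    model.encode_image(batch)

# activations now maps "trunk.blocks.i" to an (n_images, dim) tensor;
# each consecutive pair (i-1, i) feeds one RDX comparison.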

RDX was run with γ = 0.05 and β = 5, using top-k max-affinity sampling. Distances are symmetrized neighbor ranks; the difference metric is locally biased to emphasize the most relevant portion of each distance change and clamped to [−1, 1] via a tanh — see the paper for details. The full interactive tool is embedded at the bottom of this page. For selected probe images, embedded snapshots in the results below show the key neighborhood changes at each transition.
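As a rough paraphrase of that distance construction (our reading of the paper, not the reference implementation; the local biasing is replaced by a plain normalization for brevity):

import numpy as np

def symmetrized_rank_distance(d):
    """Convert a raw pairwise distance matrix to symmetrized neighbor ranks.

    ranks[i, j] is j's rank among i's neighbors by distance; averaging with
    the transpose makes the result symmetric.
    """
    ranks = d.argsort(axis=1).argsort(axis=1).astype(float)
    return (ranks + ranks.T) / 2.0

def rdx_difference(d_pre, d_post):
    """tanh-clamped change in rank distance, in [-1, 1].

    Positive values mark pairs the deeper layer pulled closer. Normalizing
    by n stands in for the paper's local biasing, which is omitted here.
    """
    n = d_pre.shape[0]
    delta = symmetrized_rank_distance(d_pre) - symmetrized_rank_distance(d_post)
    return np.tanh(delta / n)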

Results

pre_trunk.blocks.0 -> trunk.blocks.0

At the first layer transition, concepts are straightforward to interpret and require no domain knowledge. The model groups images primarily by acquisition properties.

How to read a snapshot

Each snapshot shows three rows of neighbors for a selected probe image. Color bars beneath each image show the pre-distance, post-distance, and RDX difference value; blue indicates images pulled closer, red indicates images pushed away.

trunk.blocks.0 -> trunk.blocks.1

The model begins attending to image content. Two organizing principles emerge: the presence of hardware in the image and the patient’s position within the field of view.

trunk.blocks.3 -> trunk.blocks.4

This transition shows a mix of sensitivity to annotation artifacts and the beginnings of anatomical organization.

trunk.blocks.5 -> trunk.blocks.6

This transition largely continues the behavior seen at trunk.blocks.3 -> trunk.blocks.4: the model still matches on annotation style, textures, gastric bubbles, and relative lung volume.

trunk.blocks.8 -> trunk.blocks.9

Sensitivity to metacharacteristics — tags, brightness, patient orientation — diminishes noticeably at this transition. The primary organizing features shift to properties of the medically relevant regions of the image.

trunk.blocks.9 -> trunk.blocks.10

Refinement of anatomically grounded concepts continues, with finer distinctions becoming visible.

trunk.blocks.10 -> trunk.blocks.11

The model organizes images by lung field structure with a clear side-specific bias. Some residual sensitivity to image brightness remains but is not the primary organizing factor.


These are preliminary results, and interpretation is ongoing. The visualization below is the primary artifact — we encourage you to explore it directly rather than take our descriptions at face value. If you have radiology expertise and want to weigh in on a cluster, we'd love to hear from you.

Explore the Clusters

Browse the concepts that emerge at each layer of BiomedCLIP. Select a layer pair to see which image groups the deeper layer organizes that the shallower one does not. Press the ? symbol for a guide on using the visualizer.

↗ Open in new tab

References

@article{kondapaneni2025rdx,
  title={Representational Difference Explanations},
  author={Kondapaneni, Neehar and Mac Aodha, Oisin and Perona, Pietro},
  journal={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025}
}

@article{zhang2023biomedclip,
  title={{BiomedCLIP}: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs},
  author={Zhang, Sheng and Xu, Yanbo and Usuyama, Naoto and Xu, Hanwen and Bagga, Jaspreet and Tinn, Robert and Preston, Sam and Rao, Rajesh and Wei, Mu and Valluri, Naveen and others},
  journal={arXiv preprint arXiv:2303.00915},
  year={2023}
}

@article{shih2019augmenting,
  title={Augmenting the {National Institutes of Health} chest radiograph dataset with expert annotations of possible pneumonia},
  author={Shih, George and Wu, Carol C and Halabi, Safwan S and Kohli, Marc D and Prevedello, Luciano M and Cook, Tessa S and Sharma, Arjun and Amorosa, Judith K and Arteaga, Veronica and Galperin-Aizenberg, Maya and others},
  journal={Radiology: Artificial Intelligence},
  volume={1},
  number={1},
  pages={e180041},
  year={2019},
  publisher={Radiological Society of North America}
}