February 24, 2026 · 8 min read

How BiomedCLIP Builds Its Understanding of Chest X-Rays, Layer by Layer

TL;DR

We apply RDX (Kondapaneni et al., NeurIPS 2025) to consecutive layer pairs of BiomedCLIP on 2500 RSNA chest X-rays. Early layers organize images by acquisition properties such as brightness, orientation, and hardware type. By layer 8→9, disease-relevant concepts emerge (e.g., ground glass opacities) that likely contribute to the model’s pneumonia detection performance. Later layers refine this into finer anatomical organization, including side-specific lung field structure.

A vision model trained on medical images doesn’t just learn to predict labels — it builds internal structure. We want to know what that structure is, layer by layer, without deciding in advance what to look for.

Representational Difference Explanations (RDX) is an unsupervised tool for this kind of open-ended exploration. It constructs a graph over two representations of the same data, where each edge represents the change in pairwise distance between them. Clusters of images whose distances decreased, meaning they were pulled together in the deeper representation, surface as candidate concepts. RDX doesn’t require any labels or predefined features.

Why not other approaches? Linear probes are effective when you can anticipate which concepts matter, but require labels upfront — they confirm what you suspect, and can support discovery only if you know what to look for. Dictionary learning methods like sparse autoencoders are unsupervised, but represent concepts as linear combinations of feature directions with coefficient tails that typically go unvisualized. When comparing two similar representations like consecutive layers, the differences between them can be subtle enough to live in those tails — precisely where dictionary learning loses precision. RDX avoids this by working directly on pairwise distance changes, so nothing is hidden.

Method

RDX compares two representations of the same data and finds clusters of images that one representation groups together but the other does not. Each cluster is a candidate concept: something one representation has organized that the other hasn’t. Shared structure recedes, and what comes forward are the differences between the two representations.
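To make the core idea concrete, here is a minimal sketch. It is not the RDX implementation: it uses plain Euclidean distances on toy 2-D embeddings and simply subtracts the two distance matrices. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_change(rep_a, rep_b):
    """Positive entries mark pairs that are closer in rep_b than in rep_a."""
    return pairwise_distances(rep_a) - pairwise_distances(rep_b)

# Toy data: 4 "images" in two representations.
# In rep_b, image 1 has moved toward image 0; the other images stay put.
rep_a = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 5.0], [9.0, 9.0]])
rep_b = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [9.0, 9.0]])

change = distance_change(rep_a, rep_b)

# The pair whose distance decreased the most is the strongest
# "pulled together" candidate.
i, j = np.unravel_index(np.argmax(change), change.shape)
print(f"most pulled-together pair: ({i}, {j}), change = {change[i, j]:.2f}")
```

Shared structure cancels in the subtraction: pairs whose distance is the same in both representations get a change of zero, which is the sense in which shared structure recedes.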

We applied RDX to each consecutive layer pair in BiomedCLIP, a vision-language model pretrained on biomedical image-text pairs from scientific literature, on 2500 images from the RSNA Pneumonia Detection Challenge. For each transition from layer i−1 to layer i, RDX identifies the image clusters that the deeper layer has newly organized: concepts that were not yet separated one layer earlier. Running this across all layer pairs traces how the model’s representation of the data evolves from input to output.
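The layer pairs can be enumerated with a few lines of Python. This is a sketch: the layer names follow the transition labels used in the results below, and the default of 12 blocks is an assumption inferred from the highest block index (11) that appears there. The helper name `consecutive_layer_pairs` is hypothetical.

```python
def consecutive_layer_pairs(num_blocks=12, prefix="trunk.blocks"):
    """Return (shallower, deeper) layer-name pairs, input side first.

    num_blocks=12 is an assumption based on the block indices in the results.
    """
    names = [f"pre_{prefix}.0"] + [f"{prefix}.{i}" for i in range(num_blocks)]
    return list(zip(names[:-1], names[1:]))

pairs = consecutive_layer_pairs()
for shallow, deep in pairs:
    print(f"{shallow} -> {deep}")
```

Each pair would then be passed to RDX with the shallower layer as the first representation and the deeper layer as the second.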

RDX was run with γ = 0.05, β = 5, using top-k max affinity sampling. Distances are symmetrized neighbor ranks; the difference metric is locally biased to focus on the most relevant portion of distance changes and clamped to [−1, 1] via a tanh — see the paper for details. The full interactive tool is embedded at the bottom of this page. For selected probe images, embedded snapshots in the results below show the key neighborhood changes at each transition.
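For intuition about rank-based distances and the tanh clamp, the sketch below shows one plausible reading. It is not the formula from the paper: averaging the two directed ranks is an assumed symmetrization, `scale` is a hypothetical stand-in parameter, and how γ and β actually enter the locally biased difference is defined in the paper and not reproduced here.

```python
import numpy as np

def neighbor_ranks(dist):
    """ranks[i, j] = position of j when i's neighbors are sorted by distance."""
    n = dist.shape[0]
    order = np.argsort(dist, axis=1, kind="stable")
    ranks = np.empty((n, n), dtype=int)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def symmetrized_ranks(dist):
    """Average of the two directed ranks (an assumed symmetrization)."""
    r = neighbor_ranks(dist)
    return (r + r.T) / 2.0

def squashed_rank_difference(dist_pre, dist_post, scale=1.0):
    """tanh-clamped change in symmetrized rank; positive = pulled closer."""
    delta = symmetrized_ranks(dist_pre) - symmetrized_ranks(dist_post)
    return np.tanh(scale * delta)

def abs_distances(points):
    """Pairwise distances for 1-D toy positions."""
    return np.abs(points[:, None] - points[None, :])

# Toy 1-D positions: images 1 and 2 swap places between the two layers,
# so image 2 moves toward image 0 and image 1 moves away from it.
pre = np.array([0.0, 1.0, 5.0, 6.0])
post = np.array([0.0, 5.0, 1.0, 6.0])

diff = squashed_rank_difference(abs_distances(pre), abs_distances(post))
print(np.round(diff, 3))
```

Working on ranks rather than raw distances makes the comparison insensitive to differences in overall scale between two layers, and the tanh keeps every entry within [−1, 1].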

Results

pre_trunk.blocks.0 -> trunk.blocks.0

At the first layer transition, concepts are straightforward to interpret and require no domain knowledge. The model groups images primarily by acquisition properties.

#1063 — Images that are low-contrast and low-brightness, with the patient centered in the frame, are pulled together. This is a purely photometric grouping with no apparent relationship to clinical content.

How to read a snapshot

Each snapshot shows three rows of neighbors for a selected probe image:

Pre — nearest neighbors in the shallower representation
RDX affinity — images most pulled toward the probe at this transition
Post — nearest neighbors in the deeper representation

Color bars beneath each image show pre-distance, post-distance, and RDX difference value; blue indicates images pulled closer, red indicates images pushed away.
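The three rows can be reproduced from two distance matrices and a difference matrix. The sketch below is illustrative only: `snapshot_rows` is a hypothetical helper, and the toy affinity is a raw distance difference rather than the RDX difference value.

```python
import numpy as np

def snapshot_rows(probe, dist_pre, dist_post, affinity, k=3):
    """Return the image indices shown in each row of a snapshot."""
    def nearest(dist_row):
        # Smallest distances first, excluding the probe itself.
        order = np.argsort(dist_row, kind="stable")
        return [int(j) for j in order if j != probe][:k]

    def most_pulled(aff_row):
        # Largest affinity (most pulled toward the probe) first.
        order = np.argsort(-aff_row, kind="stable")
        return [int(j) for j in order if j != probe][:k]

    return {
        "pre": nearest(dist_pre[probe]),
        "rdx_affinity": most_pulled(affinity[probe]),
        "post": nearest(dist_post[probe]),
    }

# Toy 1-D positions for 4 images in the shallower and deeper layer.
pre = np.array([0.0, 1.0, 5.0, 6.0])
post = np.array([0.0, 5.0, 1.0, 6.0])
dist_pre = np.abs(pre[:, None] - pre[None, :])
dist_post = np.abs(post[:, None] - post[None, :])

# Stand-in affinity: positive where the distance to the probe shrank.
affinity = dist_pre - dist_post

rows = snapshot_rows(probe=0, dist_pre=dist_pre, dist_post=dist_post,
                     affinity=affinity, k=2)
print(rows)
```

In this toy case image 2 tops the affinity row and also becomes the probe's nearest neighbor in the deeper layer, which is the pattern to look for when a snapshot shows a clean regrouping.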

trunk.blocks.0 -> trunk.blocks.1

The model begins attending to image content. Two organizing principles emerge: the presence of hardware in the image and the patient’s position within the field of view.

#99 — Images without monitoring leads or pacemaker wires are pushed away from the probe, while images with large lung fields criss-crossed by wires are pulled closer. The organizing feature is the presence and extent of hardware in the lung field.
#1527 — Exhibits the same wire-based grouping as #99.
#946 — The probe is a portable X-ray. Images with more of the stomach visible in the frame are pushed away, while images with a similar amount of stomach coverage are pulled closer. The model appears to be organizing by field-of-view composition.
#2102 — No clearly interpretable semantic change is observed at this transition.

trunk.blocks.3 -> trunk.blocks.4

This transition shows a mix of sensitivity to annotation artifacts and the beginnings of anatomical organization.

#1673 — Images bearing a particular style of L laterality marker are pushed away, while images with other L styles and images marked with H are pulled closer. The net result is a cluster anchored by H-notation images. As a secondary effect, images with a gastric bubble are also drawn toward the probe.
#2102 — Images are weakly aligned by the notation tag visible at the top of the film, combining tag identity with tag style. A secondary alignment by the presence of a highly occluded lung field is also visible, with both effects combining to produce a strong influence on neighboring image #2360.
#2483 — Images with ground glass textures in the lung field and images with gastric bubbles in the stomach area are pulled together. This is the first appearance of a concept that will strengthen across subsequent layers.

trunk.blocks.5 -> trunk.blocks.6

This transition largely continues the behavior seen at trunk.blocks.3 -> trunk.blocks.4, with the model continuing to match on notation style, textures, gastric bubbles, and relative lung volume.

#2102 — The notation-based alignment observed at the previous transition continues. The largest similarity increases occur on images that share the same notation style as the probe and have differing lung volumes on each side.
#2483 — The ground glass and gastric bubble grouping from the previous transition continues without substantial change.

trunk.blocks.8 -> trunk.blocks.9

Sensitivity to metacharacteristics — tags, brightness, patient orientation — diminishes noticeably at this transition. The primary organizing features shift to properties of the medically relevant regions of the image.

#2483 — A concept that specifically detects ground glass opacities in the lung field consolidates here. In the pre-transition representation, the cluster contains a mixture of images with this finding and images without it (#2395, #1676). The RDX difference map shows images without ground glass opacities being pushed away, while images with the finding are pulled together. The prior emphasis on gastric bubbles also appears to decrease. Ground glass opacity is a known radiological feature of pneumonia, and this transition likely contributes directly to the model’s performance improvement after block 9.

#1926 — The probe image has consolidation in the right lung and a gastric bubble on the left. Before this transition, neighboring images are missing one or both of these features. After the transition, neighboring images contain both, suggesting the model is forming a representation sensitive to the conjunction of these two findings.
#2102 — A concept forms that detects images in which the lung volumes on the right and left sides differ more than typical.

trunk.blocks.9 -> trunk.blocks.10

Refinement of anatomically grounded concepts continues, with finer distinctions becoming visible.

#657 — A concept emerges that detects images where the distance between the left border of the heart and the edge of the rib cage is smaller than normal, a finding relevant to cardiomegaly, consolidation, or overlying hardware. The concept shows no sensitivity to right lung volume.
#2392 — A concept detects images with a smooth opacity gradient running from bottom to top of the X-ray, with highest opacity at the bottom and lowest at the top, consistent with pleural effusion or atelectasis. Images with a sharper opacity transition are pushed away. The RDX affinity map shows that images with this property are being pulled closer, but not close enough to appear among the top-k spatial neighbors in layer 10’s representation — a case where the affinity signal reveals reorganization that is not yet visible in the nearest-neighbor structure.
#2102 — The asymmetric lung volume concept from trunk.blocks.8 -> trunk.blocks.9 continues without substantial change.
#2483 — The ground glass opacity concept continues, with one exception: the image with the largest affinity increase at this transition does not exhibit the ground glass texture, suggesting the concept boundary may be beginning to shift.
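The situation described for #2392, where images are pulled toward the probe without yet entering its top-k neighbors, can be checked mechanically. The sketch below uses a hypothetical helper and a raw distance difference as a stand-in for the RDX affinity; the threshold `min_affinity` is likewise an assumption.

```python
import numpy as np

def pulled_but_not_neighbors(probe, dist_post, affinity, k=2, min_affinity=0.0):
    """Images pulled toward the probe that are still outside its post top-k."""
    order = np.argsort(dist_post[probe], kind="stable")
    neighbors = [int(j) for j in order if j != probe][:k]
    pulled = [int(j) for j in np.flatnonzero(affinity[probe] > min_affinity)
              if j != probe]
    return [j for j in pulled if j not in neighbors]

# Toy 1-D positions: image 3 moves from 10 to 5, closer to the probe (image 0)
# but still farther away than images 1 and 2.
pre = np.array([0.0, 1.0, 2.0, 10.0])
post = np.array([0.0, 1.0, 2.0, 5.0])
dist_pre = np.abs(pre[:, None] - pre[None, :])
dist_post = np.abs(post[:, None] - post[None, :])
affinity = dist_pre - dist_post  # stand-in for the RDX difference value

hidden = pulled_but_not_neighbors(probe=0, dist_post=dist_post,
                                  affinity=affinity, k=2)
print(hidden)
```

A nearest-neighbor comparison alone would report no change for this probe, while the affinity row flags image 3.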

trunk.blocks.10 -> trunk.blocks.11

The model organizes images by lung field structure with a clear side-specific bias. Some residual sensitivity to image brightness remains but is not the primary organizing factor.

#2102 — The concept now specifically identifies images in which the right lung is highly opaque — consistent with total collapse — while the left lung is clear. The previous layer organized by relative volume differences without regard to side; this transition introduces an explicit left-right distinction.

#657 — The concept adds an additional constraint relative to the previous layer: the right lung volume must also match that of the probe image, making the grouping more specific.
#2483 — The ground glass opacity behavior continues with an added constraint that lung volumes match more closely. The grouping effect is weaker at this transition, and the sets of pre- and post-transition nearest neighbors do not change substantially.
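One simple way to quantify how much a probe's neighbor set changes across a transition is the Jaccard overlap between its pre- and post-transition top-k neighbors. This measure is our illustrative choice here, not part of RDX, and the function names and toy distances are assumptions.

```python
def top_k_neighbors(dist_row, probe, k):
    """Indices of the k closest images to the probe, excluding the probe."""
    order = sorted(range(len(dist_row)), key=lambda j: dist_row[j])
    return [j for j in order if j != probe][:k]

def neighbor_overlap(dist_pre_row, dist_post_row, probe, k):
    """Jaccard overlap of top-k neighbor sets: 1.0 means no change."""
    pre = set(top_k_neighbors(dist_pre_row, probe, k))
    post = set(top_k_neighbors(dist_post_row, probe, k))
    return len(pre & post) / len(pre | post)

# Toy distances from probe 0 to five images, before and after a transition.
dist_pre_row = [0.0, 1.0, 2.0, 3.0, 9.0]
dist_post_row = [0.0, 1.5, 1.0, 8.0, 2.0]

overlap = neighbor_overlap(dist_pre_row, dist_post_row, probe=0, k=3)
print(f"top-3 neighbor overlap: {overlap:.2f}")
```

An overlap near 1.0 corresponds to the case described for #2483 at this transition, where the nearest neighbors barely change.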

These are preliminary results and the interpretation is ongoing. The visualization below is the primary artifact — we encourage you to explore it directly rather than take our descriptions at face value. If you have radiology expertise and want to weigh in on a cluster, we'd love to hear from you.
