AIM: Amending Inherent Interpretability via Self-Supervised Masking

1Max-Planck-Institute for Informatics, Saarland Informatics Campus, Germany
2RTG Neuroexplicit Models of Language, Vision, and Action, SaarbrĂĽcken, Germany
3Data and Web Science Group, University of Mannheim, Germany
International Conference on Computer Vision (ICCV) 2025 🌟 Highlighted Paper
arXiv · Code (coming soon)
Teaser Image

AIM uses self-supervised masking to focus more on the object of interest, relying only on the image label. As shown, it outperforms baseline methods in attribution localization, even in challenging scenarios such as the Waterbirds and Hard ImageNet datasets.

Abstract

It has been observed that deep neural networks (DNNs) often use both genuine and spurious features. In this work, we propose “Amending Inherent Interpretability via Self-Supervised Masking” (AIM), a simple yet surprisingly effective method that promotes the network’s use of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, Hard ImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine, meaningful features that directly contribute to improved generalization and human-aligned interpretability.

AIM makes its final decision using spatially sparse feature maps:

AIM produces these sparse maps by employing binary mask estimators as a feature-selection mechanism. As an illustration, we show the masks produced at two stages of a ConvNeXt+AIM model alongside the resulting spatially sparse feature maps (first two columns). The last column shows the final merged feature maps the model uses for classification.

Note: Model: ConvNeXt‑tiny with AIM (2, 35%).
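The masking-and-merging pipeline described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the learned, trainable mask estimators are replaced here by a hypothetical top-k thresholding rule on mean channel activation, and `keep_ratio=0.35` assumes the "35%" in the note above refers to the fraction of spatial locations each mask keeps.

```python
import numpy as np

def binary_mask(feat, keep_ratio=0.35):
    """Keep the top `keep_ratio` fraction of spatial locations by mean
    channel activation (a hypothetical stand-in for AIM's learned,
    self-supervised mask estimators)."""
    # feat: (C, H, W) feature map for one sample
    saliency = feat.mean(axis=0)                      # (H, W) per-location score
    thresh = np.quantile(saliency, 1.0 - keep_ratio)  # per-sample threshold
    return (saliency >= thresh).astype(feat.dtype)    # binary (H, W) mask

def merge_and_apply(feat, masks):
    """Element-wise merge of per-stage binary masks, then apply to features."""
    merged = np.prod(np.stack(masks), axis=0)   # logical AND of binary masks
    return feat * merged[None, :, :], merged    # spatially sparse features

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 7, 7))             # toy (C, H, W) feature map
m1 = binary_mask(feat, keep_ratio=0.35)           # stage-1 mask
m2 = binary_mask(np.abs(feat), keep_ratio=0.35)   # stage-2 mask (different statistic)
sparse_feat, merged = merge_and_apply(feat, [m1, m2])
print(merged.mean())  # fraction of spatial locations surviving the merge
```

Since the merge is an element-wise AND, the merged mask is at most as dense as the sparsest per-stage mask, so the classifier only ever sees locations that every stage agreed to keep.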

AIM uses sparse feature maps

Mask Evolution Across Epochs

Videos show the evolution of learned masks across training epochs for representative images from ImageNet‑100, Hard ImageNet, and Waterbirds‑100.

Displayed are masks from two model blocks and their element-wise merged masks, indicating the spatial regions preserved in the final feature maps.

Note: Model: ConvNeXt‑tiny with AIM (2, 35%).

Results

Results

BibTeX

@misc{alshami2025aimamendinginherentinterpretability,
        title={AIM: Amending Inherent Interpretability via Self-Supervised Masking}, 
        author={Eyad Alshami and Shashank Agnihotri and Bernt Schiele and Margret Keuper},
        year={2025},
        eprint={2508.11502},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2508.11502}, 
  }