This repository is a code supplement to the following paper:
H. Baniecki, M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, P. Biecek. Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions. NeurIPS 2025.
TL;DR: We introduce faithful interaction explanations of CLIP and SigLIP models (FIxLIP), offering a unique, game-theoretic perspective on interpreting image–text similarity predictions.
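For context, FIxLIP builds on the weighted Banzhaf interaction index. The sketch below states the textbook semivalue form for a game v over a player set N with |N| = n; this is generic notation that may differ cosmetically from the paper's:

```latex
% weighted Banzhaf value of player i with weight p in (0, 1);
% p = 1/2 recovers the classical Banzhaf value
\phi_p(i) = \sum_{S \subseteq N \setminus \{i\}}
  p^{|S|} (1 - p)^{n - 1 - |S|} \bigl( v(S \cup \{i\}) - v(S) \bigr)

% pairwise interaction: the same weights applied to the discrete derivative
\Delta_{\{i,j\}} v(S) = v(S \cup \{i,j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S)

\phi_p(\{i,j\}) = \sum_{S \subseteq N \setminus \{i,j\}}
  p^{|S|} (1 - p)^{n - 2 - |S|} \, \Delta_{\{i,j\}} v(S)
```

In FIxLIP, the players of the game are the text tokens and the image patches, and the weight p controls the coalition distribution.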
New! For faster and more reliable computation, check out our new implementation in `example_faster.ipynb`.
The original environment for reproducibility:

```bash
conda env create -f env.yml
conda activate fixlip
```

The new environment with the updated `shapiq` package allowing faster computation:

```bash
conda env create -f env_faster.yml
conda activate fixlip_faster
```

Check out the demo for explaining CLIP with FIxLIP in `example.ipynb`.

```python
import src
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
# load model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model.to('cuda')
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# load data
input_text = "black dog next to a yellow hydrant"
input_image = Image.open("assets/dog_and_hydrant.png")
# define game
game = src.game_huggingface.VisionLanguageGame(
    model=model,
    processor=processor,
    input_image=input_image,
    input_text=input_text,
    batch_size=64,
)
# define approximator
fixlip = src.fixlip.FIxLIP(
    n_players_text=game.n_players_text,
    n_players_image=game.n_players_image,
    max_order=2,
    p=0.5,  # semivalue weight; p=0.5 yields classical Banzhaf interactions
    mode="banzhaf",
    random_state=0,
)
# compute explanation
interaction_values = fixlip.approximate_crossmodal(
    game=game,
    budget_text=2**6,    # approximation budget for text players
    budget_image=2**13,  # approximation budget for image players
)
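# the result is a shapiq-style InteractionValues object: it stores first-order
# attributions and order-2 interactions for all players of the cross-modal
# game, i.e., text tokens and image patches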
print(interaction_values)
# visualize explanation
text_tokens, input_image_denormalized = ...  # elided; see example.ipynb
src.plot.plot_image_and_text_together(
    iv=interaction_values,
    text=text_tokens,
    img=input_image_denormalized,
    image_players=list(range(game.n_players_image)),
    plot_interactions=True,
    ...
)
```
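In the `VisionLanguageGame` above, the value of a coalition is the similarity score of the correspondingly masked image-text pair. The snippet below is a minimal, self-contained sketch of such a value function, assuming a simple masking strategy (graying out absent patches, hiding absent tokens via the attention mask); it illustrates the idea and is not the repository's exact implementation:

```python
import numpy as np
import torch

def coalition_similarity(text_mask: np.ndarray, patch_mask: np.ndarray) -> float:
    # Conceptual sketch of a vision-language game's value function: score the
    # similarity of an input pair where players outside the coalition
    # (mask == False) are removed. The masking strategy here is an assumption
    # for illustration, not taken from the repository.
    inputs = processor(text=[input_text], images=input_image,
                       return_tensors="pt", padding=True)
    # gray out 32x32 image patches outside the coalition
    pixels = inputs["pixel_values"]  # shape (1, 3, H, W)
    patches_per_row = pixels.shape[-1] // 32
    for idx in np.flatnonzero(~patch_mask):
        row, col = divmod(idx, patches_per_row)
        pixels[0, :, 32 * row:32 * (row + 1), 32 * col:32 * (col + 1)] = 0.0
    # hide text tokens outside the coalition via the attention mask
    inputs["attention_mask"][0, torch.from_numpy(~text_mask)] = 0
    with torch.no_grad():
        output = model(**inputs.to(model.device))
    return output.logits_per_image.item()  # image-text similarity logit
```

FIxLIP estimates the weighted Banzhaf interactions from a limited budget of such coalition evaluations (`budget_text`, `budget_image`), which the game batches through the encoder (`batch_size`).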
Check out the demo for explaining SigLIP 2 with FIxLIP via ProxySHAP in `example_faster.ipynb`.

```python
import src
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
# load model
model = AutoModel.from_pretrained("google/siglip2-base-patch32-256")
model.to('cuda')
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch32-256")
# load data
input_text = "a giraffe drinking water from a river"
input_image = Image.open("assets/giraffe_drinking.jpg")
# define game
game = src.game_huggingface.VisionLanguageGame(
    model=model,
    processor=processor,
    input_image=input_image,
    input_text=input_text,
    batch_size=64,
)
# define approximator
fixlip = src.fixlip.FIxLIP(
    n_players_text=game.n_players_text,
    n_players_image=game.n_players_image,
    random_state=0,
)
# compute explanation
interaction_values = fixlip.approximate_crossmodal(
    game=game,
    budget_text=2**5,
    budget_image=2**13,
    approximation_type="proxyshap",  # new!
)
print(interaction_values)
# visualize explanation
text_tokens, input_image_denormalized = ...  # elided; see example_faster.ipynb
clique = {77, 78, 91, 105, 197, 198}  # indices of players whose interactions to highlight
_ = src.plot.plot_interaction_subset(
    iv=interaction_values,
    clique=clique,
    image_players=list(range(game.n_players_image)),
    img=input_image_denormalized,
    text=text_tokens,
    ...
)
```

Repository structure:

- `src` - main code base with the FIxLIP implementation
- `data` - code for processing datasets
- `experiments` - code for running experiments
- `results` - experimental results
- `analysis` - analyze and visualize the results
- `gradeclip` - code and experiments with Grad-ECLIP
- `exclip` - code and experiments with exCLIP
If you use the code in your research, please cite:
```bibtex
@inproceedings{baniecki2025explaining,
  title     = {Explaining Similarity in Vision-Language Encoders
               with Weighted Banzhaf Interactions},
  author    = {Hubert Baniecki and Maximilian Muschalik and Fabian Fumagalli and
               Barbara Hammer and Eyke H{\"u}llermeier and Przemyslaw Biecek},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://openreview.net/forum?id=on22Rx5A4F}
}
```

FIxLIP is powered by `shapiq`. See also Grad-ECLIP and exCLIP.
This work was financially supported by the state budget within the Polish Ministry of Science and Higher Education program "Pearls of Science", project number PN/01/0087/2022.