
FIxLIP [NeurIPS 2025]


This repository is a code supplement to the following paper:

H. Baniecki, M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, P. Biecek. Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions. NeurIPS 2025

TL;DR: We introduce faithful interaction explanations of CLIP and SigLIP models (FIxLIP), offering a unique, game-theoretic perspective on interpreting image–text similarity predictions.
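For background, FIxLIP builds on the weighted Banzhaf interaction index. As a rough sketch in our own notation (see the paper for the exact definition used by FIxLIP), the order-2 interaction between players i and j of a similarity game v over the n players in N (text tokens and image patches) is

\Phi_p(i, j) = \sum_{T \subseteq N \setminus \{i, j\}} p^{|T|} (1-p)^{n-2-|T|} \left[ v(T \cup \{i, j\}) - v(T \cup \{i\}) - v(T \cup \{j\}) + v(T) \right]

where the weight parameter p controls how coalitions of different sizes are weighted; p = 0.5 recovers the classic Banzhaf interaction index.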

New! For faster and more reliable computation, check out our new implementation in example_faster.ipynb.

Setup

The original environment for reproducibility:

conda env create -f env.yml
conda activate fixlip

The new environment, with an updated shapiq package that enables faster computation:

conda env create -f env_faster.yml
conda activate fixlip_faster
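Optionally, verify that the key dependencies import inside the activated environment. A minimal sanity check (not part of the official setup):

python -c "import torch, transformers, shapiq; print('environment OK')"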

Getting started

Check out the demo for explaining CLIP with FIxLIP in example.ipynb.

import src
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
# load model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model.to('cuda')
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# load data
input_text = "black dog next to a yellow hydrant"
input_image = Image.open("assets/dog_and_hydrant.png")
# define game
game = src.game_huggingface.VisionLanguageGame(
    model=model,
    processor=processor,
    input_image=input_image,
    input_text=input_text,
    batch_size=64
)
# define approximator
fixlip = src.fixlip.FIxLIP(
    n_players_text=game.n_players_text,
    n_players_image=game.n_players_image, 
    max_order=2,
    p=0.5, # weight parameter of the weighted Banzhaf index
    mode="banzhaf",
    random_state=0
)
# compute explanation
interaction_values = fixlip.approximate_crossmodal(
    game=game, 
    budget_text=2**6,
    budget_image=2**13,
)
print(interaction_values)
# visualize explanation
text_tokens, input_image_denormalized = ...
src.plot.plot_image_and_text_together(
    iv=interaction_values,
    text=text_tokens,
    img=input_image_denormalized,
    image_players=list(range(game.n_players_image)),
    plot_interactions=True,
    ...
)
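The text_tokens and input_image_denormalized placeholders above are prepared in example.ipynb. The snippet below is only a rough sketch of one way to obtain them with the Hugging Face processor; the exact inputs expected by src.plot may differ, so treat these steps as assumptions.

import numpy as np
# sketch only: one label per text player, here the CLIP subword tokens
text_tokens = processor.tokenizer.tokenize(input_text)
# sketch only: undo the CLIP normalization to get a displayable (H, W, 3) image
inputs = processor(images=input_image, return_tensors="pt")
pixel_values = inputs["pixel_values"][0]  # (3, H, W), normalized
mean = np.array(processor.image_processor.image_mean).reshape(3, 1, 1)
std = np.array(processor.image_processor.image_std).reshape(3, 1, 1)
input_image_denormalized = (pixel_values.numpy() * std + mean).clip(0, 1).transpose(1, 2, 0)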

Faster and more reliable FIxLIP approximation

Check out the demo for explaining CLIP with FIxLIP via ProxySHAP in example_faster.ipynb.

import src
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
# load model
model = AutoModel.from_pretrained("google/siglip2-base-patch32-256")
model.to('cuda')
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch32-256")
# load data
input_text = "a giraffe drinking water from a river"
input_image = Image.open("assets/giraffe_drinking.jpg")
# define game
game = src.game_huggingface.VisionLanguageGame(
    model=model,
    processor=processor,
    input_image=input_image,
    input_text=input_text,
    batch_size=64
)
# define approximator
fixlip = src.fixlip.FIxLIP(
    n_players_text=game.n_players_text, 
    n_players_image=game.n_players_image,
    random_state=0
)
# compute explanation
interaction_values = fixlip.approximate_crossmodal(
    game=game, 
    budget_text=2**5,
    budget_image=2**13,
    approximation_type="proxyshap" # new!
)
print(interaction_values)
# visualize explanation
text_tokens, input_image_denormalized = ...
clique = {77, 78, 91, 105, 197, 198}
_ = src.plot.plot_interaction_subset(
    iv=interaction_values,
    clique=clique,
    image_players=list(range(game.n_players_image)),
    img=input_image_denormalized,
    text=text_tokens,
    ...
)
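The clique above is a hand-picked set of player indices from example_faster.ipynb. One way to shortlist such players is to rank the strongest pairwise interactions; the sketch below assumes the returned object behaves like a shapiq InteractionValues with interaction_lookup and tuple indexing.

# sketch only: rank order-2 interactions by absolute value (shapiq-style API assumed)
pairs = [k for k in interaction_values.interaction_lookup if len(k) == 2]
top = sorted(pairs, key=lambda k: abs(interaction_values[k]), reverse=True)[:10]
for i, j in top:
    print(f"players ({i}, {j}): {interaction_values[(i, j)]:.4f}")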

Running experiments

  • src - main code base with the FIxLIP implementation
  • data - code for processing datasets
  • experiments - code for running experiments
  • results - experimental results
  • analysis - code for analyzing and visualizing the results
  • gradeclip - code and experiments with Grad-ECLIP
  • exclip - code and experiments with exCLIP

Citation

If you use the code in your research, please cite:

@inproceedings{baniecki2025explaining,
    title     = {Explaining Similarity in Vision-Language Encoders 
                 with Weighted Banzhaf Interactions},
    author    = {Hubert Baniecki and Maximilian Muschalik and Fabian Fumagalli and 
                 Barbara Hammer and Eyke H{\"u}llermeier and Przemyslaw Biecek},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2025},
    url       = {https://openreview.net/forum?id=on22Rx5A4F}
}

Acknowledgements

FIxLIP is powered by shapiq. See also Grad-ECLIP and exCLIP.

This work was financially supported by the state budget within the Polish Ministry of Science and Higher Education program "Pearls of Science" project number PN/01/0087/2022.
