
FIxLIP [NeurIPS 2025]


This repository is a code supplement to the following paper:

H. Baniecki, M. Muschalik, F. Fumagalli, B. Hammer, E. Hüllermeier, P. Biecek. Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions. NeurIPS 2025

TL;DR: We introduce faithful interaction explanations of CLIP and SigLIP models (FIxLIP), offering a unique, game-theoretic perspective on interpreting image–text similarity predictions.
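For background, FIxLIP builds on the weighted Banzhaf interaction index. As a rough sketch in our own notation (see the paper for the exact definition used by FIxLIP), the order-2 interaction between players i and j of a similarity game v over the n players in N (text tokens and image patches) is

\Phi_p(i, j) = \sum_{T \subseteq N \setminus \{i, j\}} p^{|T|} (1-p)^{n-2-|T|} \left[ v(T \cup \{i, j\}) - v(T \cup \{i\}) - v(T \cup \{j\}) + v(T) \right]

where the weight parameter p controls how coalitions of different sizes are weighted; p = 0.5 recovers the classic Banzhaf interaction index.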

New! For faster and more reliable computation, check out our new implementation in example_faster.ipynb.

Setup

The original environment for reproducibility:

conda env create -f env.yml
conda activate fixlip

The new environment, with an updated shapiq package that enables faster computation:

conda env create -f env_faster.yml
conda activate fixlip_faster
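Optionally, verify that the key dependencies import inside the activated environment. A minimal sanity check (not part of the official setup):

python -c "import torch, transformers, shapiq; print('environment OK')"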

Getting started

Check out the demo for explaining CLIP with FIxLIP in example.ipynb.

import src
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
# load model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model.to('cuda')
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# load data
input_text = "black dog next to a yellow hydrant"
input_image = Image.open("assets/dog_and_hydrant.png")
# define game
game = src.game_huggingface.VisionLanguageGame(
    model=model,
    processor=processor,
    input_image=input_image,
    input_text=input_text,
    batch_size=64
)
# define approximator
fixlip = src.fixlip.FIxLIP(
    n_players_text=game.n_players_text,
    n_players_image=game.n_players_image, 
    max_order=2,
    p=0.5, # weight parameter of the weighted Banzhaf index
    mode="banzhaf",
    random_state=0
)
# compute explanation
interaction_values = fixlip.approximate_crossmodal(
    game=game, 
    budget_text=2**6,
    budget_image=2**13,
)
print(interaction_values)
# visualize explanation
text_tokens, input_image_denormalized = ...
src.plot.plot_image_and_text_together(
    iv=interaction_values,
    text=text_tokens,
    img=input_image_denormalized,
    image_players=list(range(game.n_players_image)),
    plot_interactions=True,
    ...
)
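The text_tokens and input_image_denormalized placeholders above are prepared in example.ipynb. The snippet below is only a rough sketch of one way to obtain them with the Hugging Face processor; the exact inputs expected by src.plot may differ, so treat these steps as assumptions.

import numpy as np
# sketch only: one label per text player, here the CLIP subword tokens
text_tokens = processor.tokenizer.tokenize(input_text)
# sketch only: undo the CLIP normalization to get a displayable (H, W, 3) image
inputs = processor(images=input_image, return_tensors="pt")
pixel_values = inputs["pixel_values"][0]  # (3, H, W), normalized
mean = np.array(processor.image_processor.image_mean).reshape(3, 1, 1)
std = np.array(processor.image_processor.image_std).reshape(3, 1, 1)
input_image_denormalized = (pixel_values.numpy() * std + mean).clip(0, 1).transpose(1, 2, 0)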

Faster and more reliable FIxLIP approximation

Check out the demo for explaining CLIP with FIxLIP via ProxySHAP in example_faster.ipynb.

import src
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
# load model
model = AutoModel.from_pretrained("google/siglip2-base-patch32-256")
model.to('cuda')
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch32-256")
# load data
input_text = "a giraffe drinking water from a river"
input_image = Image.open("assets/giraffe_drinking.jpg")
# define game
game = src.game_huggingface.VisionLanguageGame(
    model=model,
    processor=processor,
    input_image=input_image,
    input_text=input_text,
    batch_size=64
)
# define approximator
fixlip = src.fixlip.FIxLIP(
    n_players_text=game.n_players_text, 
    n_players_image=game.n_players_image,
    random_state=0
)
# compute explanation
interaction_values = fixlip.approximate_crossmodal(
    game=game, 
    budget_text=2**5,
    budget_image=2**13,
    approximation_type="proxyshap" # new!
)
print(interaction_values)
# visualize explanation
text_tokens, input_image_denormalized = ...
clique = {77, 78, 91, 105, 197, 198}
_ = src.plot.plot_interaction_subset(
    iv=interaction_values,
    clique=clique,
    image_players=list(range(game.n_players_image)),
    img=input_image_denormalized,
    text=text_tokens,
    ...
)
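The clique above is a hand-picked set of player indices from example_faster.ipynb. One way to shortlist such players is to rank the strongest pairwise interactions; the sketch below assumes the returned object behaves like a shapiq InteractionValues with interaction_lookup and tuple indexing.

# sketch only: rank order-2 interactions by absolute value (shapiq-style API assumed)
pairs = [k for k in interaction_values.interaction_lookup if len(k) == 2]
top = sorted(pairs, key=lambda k: abs(interaction_values[k]), reverse=True)[:10]
for i, j in top:
    print(f"players ({i}, {j}): {interaction_values[(i, j)]:.4f}")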

Running experiments

  • src - main code base with the FIxLIP implementation
  • data - code for processing datasets
  • experiments - code for running experiments
  • results - experimental results
  • analysis - code for analyzing and visualizing the results
  • gradeclip - code and experiments with Grad-ECLIP
  • exclip - code and experiments with exCLIP

Citation

If you use the code in your research, please cite:

@inproceedings{baniecki2025explaining,
    title     = {Explaining Similarity in Vision-Language Encoders 
                 with Weighted Banzhaf Interactions},
    author    = {Hubert Baniecki and Maximilian Muschalik and Fabian Fumagalli and 
                 Barbara Hammer and Eyke H{\"u}llermeier and Przemyslaw Biecek},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2025},
    url       = {https://openreview.net/forum?id=on22Rx5A4F}
}

Acknowledgements

FIxLIP is powered by shapiq. See also Grad-ECLIP and exCLIP.

This work was financially supported by the state budget within the Polish Ministry of Science and Higher Education program "Pearls of Science" project number PN/01/0087/2022.
