
Add torchvision as a dependency so that vision doesn't fall back to torchvision's slow processor#1839

Open
RWL-Dittrich wants to merge 4 commits into exo-explore:main from RWL-Dittrich:fix/torchvision

Conversation

@RWL-Dittrich

Motivation

Without the torchvision dependency, any vision prompt shows this warning in the logs: Using use_fast=True but torchvision is not available. Falling back to the slow image processor.
To fix this, I added torchvision as a dependency and made sure the vision pipeline is compatible with it.

The fast image processor (torchvision-based) returns PyTorch tensors, not NumPy arrays. Using return_tensors="np" fails when the fast processor is active because it relies on torchvision transforms that produce torch.Tensor outputs. This fixes image processing in VisionEncoder when using the fast image processor path.

Changes

  • Added torchvision as a dependency in pyproject.toml and python/parts.nix (with ignoreMissing for Nix compatibility)
  • Changed VisionEncoder to request PyTorch tensors (return_tensors="pt") instead of NumPy ("np") from the image processor
  • Converted the PyTorch tensors to NumPy via .numpy() before passing them to mx.array()

Why It Works

The fast image processor internally uses torchvision transforms, which produce torch.Tensor objects. Requesting "np" tensors caused a failure because the processor couldn't convert its internal torch tensors to NumPy in the expected way. By requesting "pt" (PyTorch) tensors and explicitly calling .numpy() before constructing MLX arrays, we align with what the fast processor actually produces while maintaining the same data flow into MLX.
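The conversion described above can be sketched as follows. This is a minimal illustration, not the actual VisionEncoder code from the PR; the helper name pixel_values_to_numpy is hypothetical, and it assumes the processor returns a dict-like batch containing a "pixel_values" tensor.

```python
# Sketch of the fix: request "pt" tensors from the fast processor and
# convert to NumPy before constructing an MLX array.
# pixel_values_to_numpy is a hypothetical helper, not from the PR.
import numpy as np
import torch


def pixel_values_to_numpy(batch):
    """Convert a fast-processor torch.Tensor output to a NumPy array."""
    pixel_values = batch["pixel_values"]
    if isinstance(pixel_values, torch.Tensor):
        # .numpy() only works on detached CPU tensors
        pixel_values = pixel_values.detach().cpu().numpy()
    return pixel_values


# Usage inside the encoder would look roughly like:
#   batch = processor(images=images, return_tensors="pt")
#   pixels = mx.array(pixel_values_to_numpy(batch))
```

Requesting "pt" and converting explicitly avoids relying on the processor's own NumPy conversion, which is the path that fails on the fast (torchvision-based) processor.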

Test Plan

Manual Testing

Hardware: Two M4 Pro Mac minis (64 GB each), connected via Thunderbolt 4
What I did:
Launched Qwen3.5-35B and attached a PDF to the prompt. No more warnings about the slow image processor show up in the console.

Automated Testing

@rltakashige
Collaborator

Hi there! Thanks for this - we intentionally left torchvision out of the multimodality PR, as it pulls in a bunch of CUDA dependencies on Linux. How to handle these dependencies is currently being discussed, with branches such as https://github.com/exo-explore/exo/tree/vllm-nix .

Out of curiosity, how much faster is the fast image processor? The "slow" one already seems quite fast.

@RWL-Dittrich
Author

I did some benchmarking, and it seems the "fast" processor is actually a tiny bit slower (based on how long the TTFT actually is). Honestly, though, it's probably within the margin of error. I attached the sample images zip here so you can test for yourself. sample images.zip
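For reference, TTFT in a benchmark like this can be measured with a small helper along these lines (a sketch, not the actual script used; in practice `stream` would be the model's streaming response iterator):

```python
import time


def time_to_first_token(stream):
    """Return seconds elapsed until the first item arrives from a token
    stream, or None if the stream yields nothing."""
    start = time.perf_counter()
    for _ in stream:
        # The first yielded chunk marks time-to-first-token.
        return time.perf_counter() - start
    return None
```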

I used mlx-community/Qwen3.5-35B-A3B-8bit spread over two M4 Pro Mac minis with 64 GB of RAM each.

One weird and unexpected observation: Qwen seemed to struggle to recognize the bird in image 2 when the slow processor was used, while in all four benchmark runs with the fast processor it recognized the bird species. So maybe the slow and fast processors encode the images differently, somehow giving the model more context with the "fast" one.

Below is the result of the tests I did.
[screenshot: benchmark results]
