Add torchvision as a dependency so that vision doesn't fall back to torchvision's slow processor #1839
RWL-Dittrich wants to merge 4 commits into exo-explore:main
Conversation
Hi there! Thanks for this - we intentionally left torchvision out of the multimodality PR, as it pulls in a bunch of CUDA dependencies on Linux. How we will handle the dependencies is currently being discussed, with branches such as https://github.com/exo-explore/exo/tree/vllm-nix . Out of curiosity, how much faster is the fast image processor? The "slow" one already seems quite fast.
I did some benchmarking and the "fast" processor actually seems to be a tiny bit slower (based on the measured TTFT), but honestly it's probably within the margin of error. I attached the sample images zip here so you can test for yourself: sample images.zip. One weird and unexpected observation: Qwen seemed to struggle to recognize the bird in image 2 when the slow processor is used, while during all four runs of the benchmark with the fast processor it recognized the bird species. So maybe the slow and fast processors encode the images differently, somehow giving more context to the model with the "fast" processor.
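The TTFT comparison described above can be sketched with a minimal timing harness (an illustrative sketch, not the benchmark actually used; `fake_stream` is a hypothetical stand-in for the tokens streamed back from the inference server):

```python
import time

def measure_ttft(stream):
    """Return seconds from iteration start until the first token arrives,
    or None if the stream yields nothing."""
    start = time.perf_counter()
    for _token in stream:
        return time.perf_counter() - start  # first token seen
    return None

# Stand-in for a streaming model response; real code would iterate the
# response from the server, with image preprocessing counted in the delay.
def fake_stream():
    time.sleep(0.05)  # simulated prefill / image-processing latency
    yield "first-token"
    yield "second-token"

ttft = measure_ttft(fake_stream())
```

Running the same prompt and images through both processor paths and comparing the measured TTFT values is enough to see whether the fast path is actually faster.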

Motivation
Without the torchvision dependency, any vision prompt shows this warning in the logs:

`Using use_fast=True but torchvision is not available. Falling back to the slow image processor.`

To fix this, I added torchvision as a dependency and made sure the vision pipeline is compatible with it.
The fast image processor (torchvision-based) returns PyTorch tensors, not NumPy arrays. Using `return_tensors="np"` fails when the fast processor is active because it relies on torchvision transforms that produce `torch.Tensor` outputs. This PR fixes image processing in `VisionEncoder` when the fast image processor path is used.

Changes
- Added `torchvision` as a dependency in `pyproject.toml` and `python/parts.nix` (with `ignoreMissing` for Nix compatibility)
- Updated `VisionEncoder` to request PyTorch tensors (`return_tensors="pt"`) instead of NumPy (`"np"`) from the image processor
- Call `.numpy()` on the processor output before passing it to `mx.array()`

Why It Works
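The resulting conversion path can be sketched as follows (an illustrative sketch, not the actual PR diff; `FakeTensor` is a hypothetical stand-in for the `torch.Tensor` the fast processor returns, and `to_mx_compatible` is an invented helper name):

```python
import numpy as np

# Minimal stand-in for a torch.Tensor; only the .numpy() method used
# by the conversion is modeled here.
class FakeTensor:
    def __init__(self, data):
        self._data = np.asarray(data, dtype=np.float32)

    def numpy(self):
        return self._data

def to_mx_compatible(pixel_values):
    """Bridge the fast processor's PyTorch output to something
    mx.array() accepts: mx.array consumes NumPy arrays directly,
    so an explicit .numpy() call is all that's needed."""
    return pixel_values.numpy()

# The real code path is roughly: request "pt" instead of "np" tensors,
#   inputs = processor(images=images, return_tensors="pt")
#   pixel_values = mx.array(inputs["pixel_values"].numpy())
out = to_mx_compatible(FakeTensor([[0.1, 0.2], [0.3, 0.4]]))
```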
The fast image processor internally uses torchvision transforms, which produce `torch.Tensor` objects. Requesting `"np"` tensors caused a failure because the processor couldn't convert its internal torch tensors to NumPy in the expected way. By requesting `"pt"` (PyTorch) tensors and explicitly calling `.numpy()` before constructing MLX arrays, we align with what the fast processor actually produces while maintaining the same data flow into MLX.

Test Plan
Manual Testing
Hardware: two Mac Mini M4 Pro machines (64 GB each), connected via Thunderbolt 4
What you did:
Launched Qwen3.5-35B and attached a PDF to the prompt. No more warnings about the slow image processor show up in the console.
Automated Testing