
[None][fix] Do not leak KV cache quantization into vision encoder#13181

Merged
2ez4bz merged 1 commit into NVIDIA:main from 2ez4bz:dev-nano-v3-fp8-kv-cache-fix
Apr 20, 2026

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented Apr 19, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved quantization configuration handling in RADIO vision models to prevent conflicts between parent and sub-component settings when quantization is disabled, ensuring correct initialization and stable inference behavior.
  • Tests

    • Added unit tests for RADIO vision model quantization scenarios, improving coverage and guarding against future regressions in quantization behavior.

Description

  • Why?

We were erroneously passing the KV cache quantization config
from the LLM into the vision encoder for Nemotron models.

  • What?

This commit fixes that and adds a regression test for it.
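The essence of the fix can be sketched as follows. Note this is an illustrative stand-in, not the actual tensorrt_llm code: the `QuantConfig` dataclass and the `sanitize_vision_quant_config` helper below are hypothetical, modeled only on the behavior described in this PR (start from a blank `QuantConfig()` when quantization is disabled, rather than inheriting the parent LLM's `kv_cache_quant_algo`).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantConfig:
    # Stand-in for tensorrt_llm's QuantConfig; only the fields relevant here.
    quant_algo: Optional[str] = None
    kv_cache_quant_algo: Optional[str] = None

def sanitize_vision_quant_config(parent_config: QuantConfig,
                                 disable_quantization: bool) -> QuantConfig:
    """Return the quant config the vision encoder should use.

    When quantization is disabled for the vision encoder, start from a
    blank QuantConfig() instead of inheriting the parent LLM's config, so
    settings like an FP8 kv_cache_quant_algo do not leak into a
    sub-component that has no KV cache manager.
    """
    if disable_quantization:
        return QuantConfig()
    return parent_config

# The parent LLM may run with an FP8 KV cache...
llm_config = QuantConfig(kv_cache_quant_algo="FP8")
# ...but the vision encoder must not inherit that setting.
vit_config = sanitize_vision_quant_config(llm_config, disable_quantization=True)
print(vit_config.kv_cache_quant_algo)  # None
```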

Test Coverage

  • New test file added to catch the error that this commit fixes (verified to fail without it).

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz
Collaborator Author

2ez4bz commented Apr 19, 2026

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Apr 19, 2026

📝 Walkthrough

Walkthrough

This change modifies RADIO vision model quantization handling to clear KV-cache quantization settings when disable_quantization is True, preventing conflicts with vision encoders lacking KV cache managers. A unit test validates the corrected behavior under FP8 settings.

Changes

  • Vision Model Quantization Fix — tensorrt_llm/_torch/models/modeling_radio.py: Modified RADIOVisionModel.__init__ to replace the existing quant_config with a blank QuantConfig() when disable_quantization=True, instead of retaining kv_cache_quant_algo settings that could conflict with the vision encoder's architecture.
  • Testing Infrastructure — tests/integration/test_lists/test-db/l0_a10.yml, tests/unittest/_torch/modeling/test_modeling_radio.py: Added a new CUDA-only unit test for the RADIO vision model that validates forward-pass execution with disable_quantization=True under FP8 KV-cache settings. The test constructs a minimal ViT configuration and verifies output tensor dimensions. The test selection list was updated to include the new test module in the PyTorch pre-merge pipeline.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: The PR title clearly and specifically describes the main fix: preventing KV cache quantization from being incorrectly applied to the vision encoder.
  • Description check — ✅ Passed: The PR description includes all required sections: rationale (Why), solution summary (What), test coverage details, and a completed PR checklist matching the repository template.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/unittest/_torch/modeling/test_modeling_radio.py (1)

82-95: Consider asserting quant config is sanitized before forward.

A direct assertion makes the regression intent explicit and catches config leakage earlier than backend execution failures.

Proposed test hardening
 def test_radio_fp8_parent_kv_cache_does_not_leak_into_vit(tiny_vit_config):
@@
     vision_model = RADIOVisionModel(_make_fp8_model_config(), disable_quantization=True)
+    assert vision_model.model_config.quant_config is not None
+    assert vision_model.model_config.quant_config.quant_algo is None
+    assert vision_model.model_config.quant_config.kv_cache_quant_algo is None
@@
     with torch.inference_mode():
         features = vision_model.forward(pixel_values)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_radio.py` around lines 82 - 95,
The test should explicitly assert that the model's quantization config was
sanitized/disabled before calling forward: after creating vision_model via
RADIOVisionModel(_make_fp8_model_config(), disable_quantization=True) add an
assertion that the quant/config state reflects disabling (for example assert
vision_model.disable_quantization is True or assert getattr(vision_model,
"quantization_config", None) is None) so the regression intent is explicit and
any config leakage is caught prior to calling vision_model.forward.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9cf1ca88-f436-4572-85e1-d22760bd6142

📥 Commits

Reviewing files that changed from the base of the PR and between f85e3a3 and 3c2a3cb.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_radio.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/_torch/modeling/test_modeling_radio.py

@tensorrt-cicd
Collaborator

PR_Github #44153 [ run ] triggered by Bot. Commit: 3c2a3cb Link to invocation

@2ez4bz 2ez4bz enabled auto-merge (squash) April 19, 2026 05:28
@tensorrt-cicd
Collaborator

PR_Github #44153 [ run ] completed with state SUCCESS. Commit: 3c2a3cb
/LLM/main/L0_MergeRequest_PR pipeline #34580 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tijyojwad
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44207 [ run ] triggered by Bot. Commit: 3c2a3cb Link to invocation

@ZhanruiSunCh
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44225 [ run ] triggered by Bot. Commit: 3c2a3cb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44225 [ run ] completed with state SUCCESS. Commit: 3c2a3cb
/LLM/main/L0_MergeRequest_PR pipeline #34648 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-fp8-kv-cache-fix branch from 3c2a3cb to 5643300 on April 20, 2026 17:47
@2ez4bz
Collaborator Author

2ez4bz commented Apr 20, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44491 [ run ] triggered by Bot. Commit: 5643300 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44491 [ run ] completed with state SUCCESS. Commit: 5643300
/LLM/main/L0_MergeRequest_PR pipeline #34892 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz 2ez4bz merged commit 30a65b7 into NVIDIA:main Apr 20, 2026
5 checks passed