36 changes: 30 additions & 6 deletions docs/docs/advanced/intrinsics.md
@@ -4,17 +4,23 @@ description: "Adapter-accelerated RAG quality checks using LoRA/aLoRA adapters w
# diataxis: how-to
---

**Prerequisites:** `pip install "mellea[hf]"`, a GPU or Apple Silicon Mac recommended for
acceptable inference speed. All intrinsics require a `LocalHFBackend` with a
[Granite](https://huggingface.co/ibm-granite) model.
**Prerequisites:** `pip install "mellea[hf]"` for LocalHFBackend (GPU or Apple
Silicon Mac recommended), or `pip install mellea` for OpenAIBackend with a
[Granite Switch](../guide/glossary#granite-switch) model served via vLLM.

Intrinsics are adapter-accelerated operations for RAG quality checks. They use
LoRA/aLoRA adapters loaded directly into the HuggingFace backend — faster and more
reliable than prompting a general-purpose model for these specialized micro-tasks.

> **Backend note:** Intrinsics require `LocalHFBackend` with an IBM Granite model
> (e.g., `ibm-granite/granite-4.0-micro`). They do not work with Ollama, OpenAI, or
> other remote backends.
> **Backend note:** Intrinsics work with two backends:
>
> - **LocalHFBackend** — loads LoRA/aLoRA adapters from the catalog at runtime.
> All intrinsics are available. Requires a GPU or Apple Silicon Mac.
> - **OpenAIBackend** — uses a Granite Switch model served via vLLM with
> `load_embedded_adapters=True`. Only intrinsics embedded in the model are
> available — check the model's `adapter_index.json` for the list.
>
> Intrinsics do not work with Ollama or other remote backends.

Set up the backend once and reuse it across intrinsic calls:

@@ -24,6 +30,22 @@ from mellea.backends.huggingface import LocalHFBackend
backend = LocalHFBackend(model_id="ibm-granite/granite-4.0-micro")
```

Or, with a Granite Switch model via the OpenAI backend:

```python
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name),
    base_url="http://localhost:8000/v1",  # vLLM server
    api_key="EMPTY",
    load_embedded_adapters=True,
)
```

## Answerability

Check whether a set of retrieved documents can answer a given question:
@@ -208,4 +230,6 @@ print(out) # {"requirement_likelihood": 1.0}
```

The `Intrinsic` component loads aLoRA adapters (falling back to LoRA) by task name.
For OpenAI backends with Granite Switch, adapters are loaded from the model's
HuggingFace repository configuration instead of the intrinsic catalog.
Output format is task-specific — `requirement_check` returns a likelihood score.
8 changes: 6 additions & 2 deletions docs/docs/advanced/lora-and-alora-adapters.md
@@ -15,8 +15,12 @@ and use it as a requirement validator in any Mellea program.
Apple Silicon Mac with sufficient VRAM for the chosen base model. Uploading requires a
Hugging Face account.

> **Backend note:** Trained adapters can only be loaded into `LocalHFBackend`. They do
> not work with Ollama, OpenAI, or other remote backends.
> **Backend note:** Custom-trained adapters can only be loaded into `LocalHFBackend`.
> They do not work with Ollama, OpenAI, or other remote backends.
>
> Granite Switch models ship with pre-trained intrinsic adapters embedded in the
> model weights, which can be used via `OpenAIBackend` with
> `load_embedded_adapters=True`. See [Intrinsics](./intrinsics) for details.

## LoRA vs aLoRA

1 change: 1 addition & 0 deletions docs/docs/examples/index.md
@@ -78,6 +78,7 @@ to run.
| -------- | ------------- |
| `aLora/` | Training aLoRA adapters for fast constraint checking; performance optimisation |
| `intrinsics/` | Answer relevance, hallucination detection, citation validation, context relevance — specialised adapter-backed checks |
| `granite-switch/` | Running intrinsics via OpenAI backend with Granite Switch embedded adapters |
| `sofai/` | Two-tier sampling: fast-model iteration with escalation to a slow model; cost optimisation |

### Multimodal
25 changes: 21 additions & 4 deletions docs/docs/guide/glossary.md
@@ -27,7 +27,9 @@ See: [act() and aact()](./act-and-aact)
An **Activated LoRA** (aLoRA) is a LoRA adapter dynamically loaded by
`LocalHFBackend` at inference time to serve as a lightweight requirement verifier.
Instead of running a full LLM call to check a requirement, the adapter is activated
on the same model weights already in memory.
on the same model weights already in memory. [Granite Switch](#granite-switch)
models embed these adapters directly in the model weights, enabling intrinsic
functions via `OpenAIBackend` without runtime adapter loading.

See: [LoRA and aLoRA Adapters](../advanced/lora-and-alora-adapters)

@@ -292,6 +294,20 @@ See: [Making Agents Reliable](../tutorials/04-making-agents-reliable)

---

## Granite Switch

A Granite model variant with LoRA and aLoRA adapters pre-baked into the model
weights. When served via vLLM and accessed through `OpenAIBackend` with
`load_embedded_adapters=True`, these embedded adapters enable
[Intrinsics](../advanced/intrinsics) (RAG quality checks, requirement
validation, safety evaluation) without runtime adapter loading. Only intrinsics
embedded in the model are available — check the model's `adapter_index.json`.

See: [Intrinsics](../advanced/intrinsics) |
[OpenAI and OpenAI-Compatible APIs](../integrations/openai)

---

## KV smashing

The technique of concatenating key-value attention caches from separately prefilled
@@ -360,9 +376,10 @@ See: [Use Images and Vision Models](../how-to/use-images-and-vision)
## Intrinsic

An `Intrinsic` is a backend-level primitive in Mellea — a structured generation
operation with special handling (e.g., constrained decoding, RAG retrieval). The
`LocalHFBackend` exposes Intrinsics directly; server backends route them through
adapter endpoints.
operation with special handling (e.g., constrained decoding, RAG retrieval).
`LocalHFBackend` exposes Intrinsics via runtime adapter loading. `OpenAIBackend`
supports Intrinsics when backed by a [Granite Switch](#granite-switch) model with
`load_embedded_adapters=True`.

See: [Intrinsics](../advanced/intrinsics)

4 changes: 4 additions & 0 deletions docs/docs/integrations/huggingface.md
@@ -81,6 +81,10 @@ See [Prefix Caching and KV Blocks](../advanced/prefix-caching-and-kv-blocks) for
adapters — lightweight domain-specific requirement validators that run on local GPU
hardware. See the aLoRA guide for training and usage.

> **Tip:** For intrinsics without local GPU requirements, Granite Switch models
> serve pre-embedded adapters via vLLM and the OpenAI backend. See
> [Intrinsics](../advanced/intrinsics) for details.

## Vision support

Vision support for `LocalHFBackend` is model-dependent and experimental. Pass a PIL
70 changes: 69 additions & 1 deletion docs/docs/integrations/openai.md
@@ -236,6 +236,73 @@ m = MelleaSession(
> LiteLLM provides a verified integration — see
> [Backends and Configuration](../guide/backends-and-configuration).

## Intrinsics with Granite Switch

Granite Switch models embed LoRA/aLoRA adapters directly in the model weights.
When served via vLLM, these adapters enable intrinsic functions (RAG quality
checks, safety evaluation, requirement validation) through the OpenAI-compatible
API without loading adapter weights at runtime.

Start a vLLM server with the Granite Switch model:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model <granite-switch-model-id> \
    --dtype bfloat16 \
    --enable-prefix-caching
```
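Before creating the backend, it can help to confirm the server is reachable. A minimal sketch, assuming a stock vLLM deployment (the `/v1/models` route is vLLM's standard OpenAI-compatible model listing; the helper name is ours):

```python
import requests

def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url."""
    try:
        requests.get(f"{base_url}/v1/models", timeout=timeout)
        return True
    except requests.ConnectionError:
        return False

print(server_is_up("http://localhost:8000"))
```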

Then create a backend with `load_embedded_adapters=True`:

```python
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=True,
)
```

The high-level intrinsic wrappers (`rag.check_answerability`,
`core.check_certainty`, etc.) work identically with this backend. See
[Intrinsics](../advanced/intrinsics) for the full list of available intrinsics.

> **Note:** `load_embedded_adapters=True` downloads adapter I/O configurations
> from the model's HuggingFace repository on first use. No adapter weights are
> transferred — the adapters are already part of the model. Only intrinsics
> embedded in the model are available — check the model's `adapter_index.json`
> for the list.

For more control, load adapters manually with `load_embedded_adapters=False`:

```python
from mellea.backends.adapters.adapter import EmbeddedIntrinsicAdapter
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=False,
)

# Load a single adapter from the model's HuggingFace repo
adapters = EmbeddedIntrinsicAdapter.from_hub(
    IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    intrinsic_name="answerability",
)
for adapter in adapters:
    backend.add_adapter(adapter)
```

## Troubleshooting

### `OPENAI_API_KEY` not set error
@@ -257,4 +324,5 @@ local servers, list available models from the server's API or UI.
---

**See also:** [Backends and Configuration](../guide/backends-and-configuration) |
[Enforce Structured Output](../how-to/enforce-structured-output)
[Enforce Structured Output](../how-to/enforce-structured-output) |
[Intrinsics](../advanced/intrinsics)
58 changes: 58 additions & 0 deletions docs/examples/granite-switch/README.md
@@ -0,0 +1,58 @@
# Granite Switch Examples

This directory contains examples for running Mellea intrinsics through an
OpenAI-compatible backend using Granite Switch models.

## What is Granite Switch?

Granite Switch models ship with LoRA and aLoRA adapters pre-baked into the model
weights. Instead of loading adapters at runtime (as `LocalHFBackend` does), these
embedded adapters are activated via control tokens injected by the model's chat
template. Only the I/O transformation configs are downloaded — no adapter weights
are transferred.

## Prerequisites

1. A Granite Switch model hosted via [vLLM](https://docs.vllm.ai/):

   ```bash
   python -m vllm.entrypoints.openai.api_server \
       --model <granite-switch-model-id> \
       --dtype bfloat16 \
       --enable-prefix-caching
   ```

2. `pip install mellea`

## Available adapters

Not all intrinsics are embedded in every Granite Switch model. Check the model's
`adapter_index.json` for the list of available adapters. The current model
includes: `answerability`, `citations`, `context_relevance`, `guardian-core`,
`hallucination_detection`, `query_clarification`, `query_rewrite`, and
`requirement-check`.
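To see which adapters a given model checkout actually ships, the index can be inspected directly. A minimal sketch, assuming `adapter_index.json` is a JSON object keyed by adapter name (the exact schema may vary between Granite Switch releases):

```python
import json

def embedded_adapter_names(index_text: str) -> list[str]:
    """List adapter names from an adapter_index.json payload."""
    index = json.loads(index_text)
    return sorted(index)

# Illustrative payload only; read the real file from the model repository.
example = '{"answerability": {}, "citations": {}, "hallucination_detection": {}}'
print(embedded_adapter_names(example))
```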

## Files

### answerability_openai.py

Demonstrates `rag.check_answerability()` using `OpenAIBackend` with
`load_embedded_adapters=True` — the simplest way to use intrinsics with Granite
Switch.

### hallucination_detection_openai.py

Demonstrates `rag.flag_hallucinated_content()` using `OpenAIBackend` with
`load_embedded_adapters=True`.

### manual_adapter_loading.py

Shows how to manually load embedded adapters using
`EmbeddedIntrinsicAdapter.from_hub()` and `backend.add_adapter()`. Useful when
you only need a subset of adapters or want more control over adapter
registration.

## Related

- [`../intrinsics/`](../intrinsics/) — the same intrinsics using `LocalHFBackend`
- [Intrinsics documentation](../../docs/docs/advanced/intrinsics.md)
56 changes: 56 additions & 0 deletions docs/examples/granite-switch/answerability_openai.py
@@ -0,0 +1,56 @@
# pytest: e2e, vllm, skip

"""Example: running the answerability intrinsic via OpenAI backend with Granite Switch.

Requires a vLLM server hosting a Granite Switch model.

To start the server:
    python -m vllm.entrypoints.openai.api_server \
        --model <granite-switch-model-id> \
        --dtype bfloat16 --enable-prefix-caching

To run this script from the root of the Mellea source tree:
    uv run python docs/examples/granite-switch/answerability_openai.py
"""

import os
import sys

import requests

VLLM_BASE_URL = os.environ.get("VLLM_SWITCH_TEST_BASE_URL", "http://localhost:8000")
try:
    requests.get(f"{VLLM_BASE_URL}/v1/models", timeout=2)
except requests.ConnectionError:
    # Detected by docs/examples/conftest.py subprocess runner as a skip.
    print(f"Skipped: vLLM server not reachable at {VLLM_BASE_URL}", file=sys.stderr)
    raise SystemExit(1)

from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.backends.openai import OpenAIBackend
from mellea.formatters import TemplateFormatter
from mellea.stdlib.components import Document, Message
from mellea.stdlib.components.intrinsic import rag
from mellea.stdlib.context import ChatContext

SWITCH_MODEL_ID = IBM_GRANITE_SWITCH_4_1_8B.hf_model_name
assert SWITCH_MODEL_ID is not None

backend = OpenAIBackend(
    model_id=SWITCH_MODEL_ID,
    formatter=TemplateFormatter(model_id=SWITCH_MODEL_ID),
    base_url=f"{VLLM_BASE_URL}/v1",
    api_key="EMPTY",
    load_embedded_adapters=True,
)

context = ChatContext().add(Message("assistant", "Hello there, how can I help you?"))
question = "What is the square root of 4?"
documents_answerable = [Document("The square root of 4 is 2.")]
documents_unanswerable = [Document("The square root of 8 is not 2.")]

result = rag.check_answerability(question, documents_answerable, context, backend)
print(f"Answerability (answer in docs): {result}")

result = rag.check_answerability(question, documents_unanswerable, context, backend)
print(f"Answerability (answer NOT in docs): {result}")