36 changes: 30 additions & 6 deletions docs/docs/advanced/intrinsics.md
@@ -4,17 +4,23 @@ description: "Adapter-accelerated RAG quality checks using LoRA/aLoRA adapters w
# diataxis: how-to
---

**Prerequisites:** `pip install "mellea[hf]"`, a GPU or Apple Silicon Mac recommended for
acceptable inference speed. All intrinsics require a `LocalHFBackend` with a
[Granite](https://huggingface.co/ibm-granite) model.
**Prerequisites:** `pip install "mellea[hf]"` for LocalHFBackend (GPU or Apple
Silicon Mac recommended), or `pip install mellea` for OpenAIBackend with a
[Granite Switch](../guide/glossary#granite-switch) model served via vLLM.

Intrinsics are adapter-accelerated operations for RAG quality checks. They use
LoRA/aLoRA adapters loaded directly into the HuggingFace backend — faster and more
reliable than prompting a general-purpose model for these specialized micro-tasks.

> **Backend note:** Intrinsics require `LocalHFBackend` with an IBM Granite model
> (e.g., `ibm-granite/granite-4.0-micro`). They do not work with Ollama, OpenAI, or
> other remote backends.
> **Backend note:** Intrinsics work with two backends:
>
> - **LocalHFBackend** — loads LoRA/aLoRA adapters from the catalog at runtime.
> All intrinsics are available. Requires a GPU or Apple Silicon Mac.
> - **OpenAIBackend** — uses a Granite Switch model served via vLLM with
> `load_embedded_adapters=True`. Only intrinsics embedded in the model are
> available — check the model's `adapter_index.json` for the list.
>
> Intrinsics do not work with Ollama or other remote backends.

Set up the backend once and reuse it across intrinsic calls:

@@ -24,6 +30,22 @@ from mellea.backends.huggingface import LocalHFBackend
backend = LocalHFBackend(model_id="ibm-granite/granite-4.0-micro")
```

Or, with a Granite Switch model via the OpenAI backend:

```python
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name),
    base_url="http://localhost:8000/v1",  # vLLM server
    api_key="EMPTY",
    load_embedded_adapters=True,
)
```

## Answerability

Check whether a set of retrieved documents can answer a given question:
@@ -208,4 +230,6 @@ print(out) # {"requirement_likelihood": 1.0}
```

The `Intrinsic` component loads aLoRA adapters (falling back to LoRA) by task name.
For OpenAI backends with Granite Switch, adapters are loaded from the model's
HuggingFace repository configuration instead of the intrinsic catalog.
Output format is task-specific — `requirement_check` returns a likelihood score.
8 changes: 6 additions & 2 deletions docs/docs/advanced/lora-and-alora-adapters.md
@@ -15,8 +15,12 @@ and use it as a requirement validator in any Mellea program.
Apple Silicon Mac with sufficient VRAM for the chosen base model. Uploading requires a
Hugging Face account.

> **Backend note:** Trained adapters can only be loaded into `LocalHFBackend`. They do
> not work with Ollama, OpenAI, or other remote backends.
> **Backend note:** Custom-trained adapters can only be loaded into `LocalHFBackend`.
> They do not work with Ollama, OpenAI, or other remote backends.
>
> Granite Switch models ship with pre-trained intrinsic adapters embedded in the
> model weights, which can be used via `OpenAIBackend` with
> `load_embedded_adapters=True`. See [Intrinsics](./intrinsics) for details.

## LoRA vs aLoRA

1 change: 1 addition & 0 deletions docs/docs/examples/index.md
@@ -78,6 +78,7 @@ to run.
| -------- | ------------- |
| `aLora/` | Training aLoRA adapters for fast constraint checking; performance optimisation |
| `intrinsics/` | Answer relevance, hallucination detection, citation validation, context relevance — specialised adapter-backed checks |
| `granite-switch/` | Running intrinsics via OpenAI backend with Granite Switch embedded adapters |
| `sofai/` | Two-tier sampling: fast-model iteration with escalation to a slow model; cost optimisation |

### Multimodal
25 changes: 21 additions & 4 deletions docs/docs/guide/glossary.md
@@ -27,7 +27,9 @@ See: [act() and aact()](./act-and-aact)
An **Activated LoRA** (aLoRA) is a LoRA adapter dynamically loaded by
`LocalHFBackend` at inference time to serve as a lightweight requirement verifier.
Instead of running a full LLM call to check a requirement, the adapter is activated
on the same model weights already in memory.
on the same model weights already in memory. [Granite Switch](#granite-switch)
models embed these adapters directly in the model weights, enabling intrinsic
functions via `OpenAIBackend` without runtime adapter loading.

See: [LoRA and aLoRA Adapters](../advanced/lora-and-alora-adapters)

@@ -292,6 +294,20 @@ See: [Making Agents Reliable](../tutorials/04-making-agents-reliable)

---

## Granite Switch

A Granite model variant with LoRA and aLoRA adapters pre-baked into the model
weights. When served via vLLM and accessed through `OpenAIBackend` with
`load_embedded_adapters=True`, these embedded adapters enable
[Intrinsics](../advanced/intrinsics) (RAG quality checks, requirement
validation, safety evaluation) without runtime adapter loading. Only intrinsics
embedded in the model are available — check the model's `adapter_index.json`.

See: [Intrinsics](../advanced/intrinsics) |
[OpenAI and OpenAI-Compatible APIs](../integrations/openai)

---

## KV smashing

The technique of concatenating key-value attention caches from separately prefilled
@@ -360,9 +376,10 @@ See: [Use Images and Vision Models](../how-to/use-images-and-vision)
## Intrinsic

An `Intrinsic` is a backend-level primitive in Mellea — a structured generation
operation with special handling (e.g., constrained decoding, RAG retrieval). The
`LocalHFBackend` exposes Intrinsics directly; server backends route them through
adapter endpoints.
operation with special handling (e.g., constrained decoding, RAG retrieval).
`LocalHFBackend` exposes Intrinsics via runtime adapter loading. `OpenAIBackend`
supports Intrinsics when backed by a [Granite Switch](#granite-switch) model with
`load_embedded_adapters=True`.

See: [Intrinsics](../advanced/intrinsics)

4 changes: 4 additions & 0 deletions docs/docs/integrations/huggingface.md
@@ -81,6 +81,10 @@ See [Prefix Caching and KV Blocks](../advanced/prefix-caching-and-kv-blocks) for
adapters — lightweight domain-specific requirement validators that run on local GPU
hardware. See the aLoRA guide for training and usage.

> **Tip:** For intrinsics without local GPU requirements, Granite Switch models
> serve pre-embedded adapters via vLLM and the OpenAI backend. See
> [Intrinsics](../advanced/intrinsics) for details.

## Vision support

Vision support for `LocalHFBackend` is model-dependent and experimental. Pass a PIL
70 changes: 69 additions & 1 deletion docs/docs/integrations/openai.md
@@ -236,6 +236,73 @@ m = MelleaSession(
> LiteLLM provides a verified integration — see
> [Backends and Configuration](../guide/backends-and-configuration).

## Intrinsics with Granite Switch

Granite Switch models embed LoRA/aLoRA adapters directly in the model weights.
When served via vLLM, these adapters enable intrinsic functions (RAG quality
checks, safety evaluation, requirement validation) through the OpenAI-compatible
API without loading adapter weights at runtime.

Start a vLLM server with the Granite Switch model:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model <granite-switch-model-id> \
    --dtype bfloat16 \
    --enable-prefix-caching
```
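Before creating the backend, it can help to confirm the server is reachable. A minimal sketch, assuming a stock vLLM deployment (the `/v1/models` route is vLLM's standard OpenAI-compatible model listing; the helper name is ours):

```python
import requests

def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url."""
    try:
        requests.get(f"{base_url}/v1/models", timeout=timeout)
        return True
    except requests.ConnectionError:
        return False

print(server_is_up("http://localhost:8000"))
```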

Then create a backend with `load_embedded_adapters=True`:

```python
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=True,
)
```

The high-level intrinsic wrappers (`rag.check_answerability`,
`core.check_certainty`, etc.) work identically with this backend. See
[Intrinsics](../advanced/intrinsics) for the full list of available intrinsics.

> **Note:** `load_embedded_adapters=True` downloads adapter I/O configurations
> from the model's HuggingFace repository on first use. No adapter weights are
> transferred — the adapters are already part of the model. Only intrinsics
> embedded in the model are available — check the model's `adapter_index.json`
> for the list.

For more control, load adapters manually with `load_embedded_adapters=False`:

```python
from mellea.backends.adapters.adapter import EmbeddedIntrinsicAdapter
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_8B.hf_model_name),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=False,
)

# Load a single adapter from the model's HuggingFace repo
adapters = EmbeddedIntrinsicAdapter.from_hub(
    IBM_GRANITE_SWITCH_4_1_8B.hf_model_name,
    intrinsic_name="answerability",
)
for adapter in adapters:
    backend.add_adapter(adapter)
```

## Troubleshooting

### `OPENAI_API_KEY` not set error
@@ -257,4 +324,5 @@ local servers, list available models from the server's API or UI.
---

**See also:** [Backends and Configuration](../guide/backends-and-configuration) |
[Enforce Structured Output](../how-to/enforce-structured-output)
[Enforce Structured Output](../how-to/enforce-structured-output) |
[Intrinsics](../advanced/intrinsics)
58 changes: 58 additions & 0 deletions docs/examples/granite-switch/README.md
@@ -0,0 +1,58 @@
# Granite Switch Examples

This directory contains examples for running Mellea intrinsics through an
OpenAI-compatible backend using Granite Switch models.

## What is Granite Switch?

Granite Switch models ship with LoRA and aLoRA adapters pre-baked into the model
weights. Instead of loading adapters at runtime (as `LocalHFBackend` does), these
embedded adapters are activated via control tokens injected by the model's chat
template. Only the I/O transformation configs are downloaded — no adapter weights
are transferred.

## Prerequisites

1. A Granite Switch model hosted via [vLLM](https://docs.vllm.ai/):

   ```bash
   python -m vllm.entrypoints.openai.api_server \
       --model <granite-switch-model-id> \
       --dtype bfloat16 \
       --enable-prefix-caching
   ```

2. `pip install mellea`

## Available adapters

Not all intrinsics are embedded in every Granite Switch model. Check the model's
`adapter_index.json` for the list of available adapters. The current model
includes: `answerability`, `citations`, `context_relevance`, `guardian-core`,
`hallucination_detection`, `query_clarification`, `query_rewrite`, and
`requirement-check`.
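To see which adapters a given model checkout actually ships, the index can be inspected directly. A minimal sketch, assuming `adapter_index.json` is a JSON object keyed by adapter name (the exact schema may vary between Granite Switch releases):

```python
import json

def embedded_adapter_names(index_text: str) -> list[str]:
    """List adapter names from an adapter_index.json payload."""
    index = json.loads(index_text)
    return sorted(index)

# Illustrative payload only; read the real file from the model repository.
example = '{"answerability": {}, "citations": {}, "hallucination_detection": {}}'
print(embedded_adapter_names(example))
```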

## Files

### answerability_openai.py

Demonstrates `rag.check_answerability()` using `OpenAIBackend` with
`load_embedded_adapters=True` — the simplest way to use intrinsics with Granite
Switch.

### hallucination_detection_openai.py

Demonstrates `rag.flag_hallucinated_content()` using `OpenAIBackend` with
`load_embedded_adapters=True`.

### manual_adapter_loading.py

Shows how to manually load embedded adapters using
`EmbeddedIntrinsicAdapter.from_hub()` and `backend.add_adapter()`. Useful when
you only need a subset of adapters or want more control over adapter
registration.

## Related

- [`../intrinsics/`](../intrinsics/) — the same intrinsics using `LocalHFBackend`
- [Intrinsics documentation](../../docs/docs/advanced/intrinsics.md)
56 changes: 56 additions & 0 deletions docs/examples/granite-switch/answerability_openai.py
@@ -0,0 +1,56 @@
# pytest: e2e, vllm, skip

"""Example: running the answerability intrinsic via OpenAI backend with Granite Switch.

Requires a vLLM server hosting a Granite Switch model.

To start the server:
    python -m vllm.entrypoints.openai.api_server \
        --model <granite-switch-model-id> \
        --dtype bfloat16 --enable-prefix-caching

To run this script from the root of the Mellea source tree:
    uv run python docs/examples/granite-switch/answerability_openai.py
"""

import os
import sys

import requests

VLLM_BASE_URL = os.environ.get("VLLM_SWITCH_TEST_BASE_URL", "http://localhost:8000")
try:
    requests.get(f"{VLLM_BASE_URL}/v1/models", timeout=2)
except requests.ConnectionError:
    # Detected by docs/examples/conftest.py subprocess runner as a skip.
    print(f"Skipped: vLLM server not reachable at {VLLM_BASE_URL}", file=sys.stderr)
    raise SystemExit(1)

from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_8B
from mellea.backends.openai import OpenAIBackend
from mellea.formatters import TemplateFormatter
from mellea.stdlib.components import Document, Message
from mellea.stdlib.components.intrinsic import rag
from mellea.stdlib.context import ChatContext

SWITCH_MODEL_ID = IBM_GRANITE_SWITCH_4_1_8B.hf_model_name
assert SWITCH_MODEL_ID is not None

backend = OpenAIBackend(
    model_id=SWITCH_MODEL_ID,
    formatter=TemplateFormatter(model_id=SWITCH_MODEL_ID),
    base_url=f"{VLLM_BASE_URL}/v1",
    api_key="EMPTY",
    load_embedded_adapters=True,
)

context = ChatContext().add(Message("assistant", "Hello there, how can I help you?"))
question = "What is the square root of 4?"
documents_answerable = [Document("The square root of 4 is 2.")]
documents_unanswerable = [Document("The square root of 8 is not 2.")]

result = rag.check_answerability(question, documents_answerable, context, backend)
print(f"Answerability (answer in docs): {result}")

result = rag.check_answerability(question, documents_unanswerable, context, backend)
print(f"Answerability (answer NOT in docs): {result}")