57 changes: 43 additions & 14 deletions docs/byok_guide.md
@@ -10,6 +10,7 @@ The BYOK (Bring Your Own Knowledge) feature in Lightspeed Core enables users to

* [What is BYOK?](#what-is-byok)
* [How BYOK Works](#how-byok-works)
* [Prioritization of BYOK content](#prioritization-of-byok-content)
* [Prerequisites](#prerequisites)
* [Configuration Guide](#configuration-guide)
* [Step 1: Prepare Your Knowledge Sources](#step-1-prepare-your-knowledge-sources)
@@ -77,17 +78,45 @@ Both modes rely on:
- **Vector Database**: Your indexed knowledge sources stored as vector embeddings
- **Embedding Model**: Converts queries and documents into vector representations for similarity matching

Inline RAG additionally supports:
- **Score Multiplier**: Optional weight applied per BYOK vector store when mixing multiple sources. Allows custom prioritization of content.
### Prioritization of BYOK content

> [!NOTE]
> OKP and BYOK scores are not directly comparable (different scoring systems), so
> `score_multiplier` does not apply to OKP results. To control the amount of retrieved
> context, set the `BYOK_RAG_MAX_CHUNKS` and `OKP_RAG_MAX_CHUNKS` constants in `src/constants.py`
> (defaults: 10 and 5 respectively). For Tool RAG, use `TOOL_RAG_MAX_CHUNKS` (default: 10).
> The `INLINE_RAG_MAX_CHUNKS` constant (value: 10) caps the final merged inline RAG
> chunks (BYOK + OKP) delivered to the LLM. Tool RAG is controlled independently
> by `TOOL_RAG_MAX_CHUNKS`.
When multiple BYOK stores are configured for Inline RAG, their results are merged and ranked. Two mechanisms control prioritization:

- **Score Multiplier** (`score_multiplier`): A per-store weight applied to raw similarity scores during Inline RAG. Values above 1.0 boost a store's results; values below 1.0 reduce them. It affects only BYOK stores, because OKP scores come from a different scoring system and are not directly comparable.

- **Reranker**: When enabled, a cross-encoder model re-scores the merged chunk pool (BYOK + OKP) using semantic similarity to the query. This normalizes scores across sources, making OKP and BYOK results directly comparable. BYOK score boosts are applied after reranking.
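
The score multiplier described above can be illustrated with a minimal sketch. The `Chunk` class, the `apply_score_multipliers` helper, and the default of 1.0 for stores without a multiplier are assumptions made for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk; field names are illustrative, not the project's model."""

    store_id: str
    text: str
    score: float


def apply_score_multipliers(chunks, multipliers):
    """Weight each chunk's raw score by its store's multiplier, then rank.

    Stores missing from `multipliers` are assumed to use 1.0 (no change).
    """
    weighted = [
        Chunk(c.store_id, c.text, c.score * multipliers.get(c.store_id, 1.0))
        for c in chunks
    ]
    # Highest weighted score first.
    return sorted(weighted, key=lambda c: c.score, reverse=True)


chunks = [
    Chunk("vs_123", "OCP install notes", 0.80),
    Chunk("vs_456", "Internal KB article", 0.70),
]
ranked = apply_score_multipliers(chunks, {"vs_123": 1.0, "vs_456": 1.2})
```

In this sketch the `vs_456` chunk has the lower raw score but ranks first once its 1.2 multiplier is applied.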

**Chunk limits** control how many chunks flow through the pipeline. Configure them in `lightspeed-stack.yaml`:

| Config path | Default | Description |
|-------------|---------|-------------|
| `rag.byok.max_chunks` | 10 | Total chunks fetched across all BYOK stores |
| `rag.okp.max_chunks` | 5 | Chunks fetched from OKP |
| `rag.retrieval.inline.max_chunks` | 10 | Final cap on merged inline RAG chunks delivered to the LLM |
| `rag.retrieval.tool.max_chunks` | 10 | Max chunks retrieved via Tool RAG (`file_search`) |
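
As a rough illustration of how the four limits and their defaults fit together, the sketch below reads them from an already-parsed configuration dictionary. `ChunkLimits`, `_lookup`, and `chunk_limits_from_config` are hypothetical names; the project's real configuration models may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class ChunkLimits:
    """Defaults mirror the table above; the class itself is hypothetical."""

    byok_max_chunks: int = 10
    okp_max_chunks: int = 5
    inline_max_chunks: int = 10
    tool_max_chunks: int = 10


def _lookup(config: dict, dotted_path: str, default: int) -> int:
    """Walk a nested dict along a dotted path, returning default if absent."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def chunk_limits_from_config(config: dict) -> ChunkLimits:
    """Build limits from a parsed lightspeed-stack.yaml-style dictionary."""
    return ChunkLimits(
        byok_max_chunks=_lookup(config, "rag.byok.max_chunks", 10),
        okp_max_chunks=_lookup(config, "rag.okp.max_chunks", 5),
        inline_max_chunks=_lookup(config, "rag.retrieval.inline.max_chunks", 10),
        tool_max_chunks=_lookup(config, "rag.retrieval.tool.max_chunks", 10),
    )


# Only the OKP limit is overridden; the other three fall back to defaults.
limits = chunk_limits_from_config({"rag": {"okp": {"max_chunks": 3}}})
```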

```mermaid
flowchart TD
subgraph Sources["Source Fetching"]
B1["BYOK Store 1"] --> BPool
B2["BYOK Store 2"] --> BPool
BN["BYOK Store N"] --> BPool
BPool["BYOK Pool\ncapped at rag.byok.max_chunks"]
OKP["OKP (Solr)\ncapped at rag.okp.max_chunks"]
end

BPool --> Pool["Merged Pool\n(all chunks, sorted by score)"]
OKP --> Pool

Pool --> Decision{Reranker\nenabled?}

Decision -->|Yes| Rerank["Cross-Encoder Rerank\n+ BYOK score boost"]
Decision -->|No| Cut

Rerank --> Cut["Top K cut\nrag.retrieval.inline.max_chunks"]

Cut --> Context["Final Inline RAG Context"]
```

---

@@ -288,7 +317,7 @@ byok_rag:

> [!NOTE]
> pgvector is not yet supported via `byok_rag` in `lightspeed-stack.yaml` (see [LCORE-2437](https://redhat.atlassian.net/browse/LCORE-2437)).
> It must be configured directly in the Llama Stack configuration file.
> It must be configured directly in the `run.yaml` configuration file.

```yaml
vector_io:
@@ -342,11 +371,11 @@ rag:

### Example 2: Multiple Knowledge Sources with pgvector

A configuration combining a local FAISS store (via `byok_rag`) with a remote pgvector store (configured directly in the Llama Stack configuration file):
A configuration combining a local FAISS store (via `byok_rag`) with a remote pgvector store (configured directly in the `run.yaml` configuration file):

> [!NOTE]
> pgvector is not yet supported via `byok_rag` in `lightspeed-stack.yaml` (see [LCORE-2437](https://redhat.atlassian.net/browse/LCORE-2437)).
> The pgvector provider must be configured directly in the Llama Stack configuration file.
> The pgvector provider must be configured directly in the `run.yaml` configuration file.

**`lightspeed-stack.yaml`** — FAISS store and RAG strategy:

@@ -373,7 +402,7 @@ rag:
- local-docs
```

**Llama Stack configuration file** — pgvector provider:
**`run.yaml` configuration file** — pgvector provider:

```yaml
vector_io:
257 changes: 176 additions & 81 deletions docs/openapi.json

Large diffs are not rendered by default.

53 changes: 43 additions & 10 deletions docs/rag_guide.md
@@ -34,6 +34,38 @@ Lightspeed Core Stack (LCS) supports two complementary RAG strategies:

Both strategies can be enabled independently via the `rag` section of `lightspeed-stack.yaml`. See [BYOK Feature Documentation](byok_guide.md) for configuration details.

> [!NOTE]
> **Backward compatibility:** if neither `retrieval.inline.sources` nor `retrieval.tool.sources` is
> configured, all registered vector stores (BYOK and OKP) are automatically exposed as
> Tool RAG (`file_search`). Inline RAG is **not** enabled in this fallback — only Tool RAG.
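
The fallback rule in the note above could be sketched as follows. This is an illustration only: `resolve_tool_sources` is a hypothetical helper, and treating both `None` and an empty list as "not configured" is an assumption.

```python
def resolve_tool_sources(inline_sources, tool_sources, registered_store_ids):
    """Return the source IDs exposed through Tool RAG (file_search).

    Assumption: a source list that is None or empty counts as not configured.
    """
    if not inline_sources and not tool_sources:
        # Backward-compatible fallback: expose every registered store as Tool RAG.
        return list(registered_store_ids)
    # Otherwise only the explicitly listed tool sources are exposed.
    return list(tool_sources or [])


# Nothing configured: every registered store becomes a Tool RAG source.
fallback = resolve_tool_sources(None, None, ["ocp-docs", "knowledge-base", "okp"])

# Inline sources configured, no tool sources: Tool RAG exposes nothing.
explicit = resolve_tool_sources(["okp"], None, ["ocp-docs", "knowledge-base", "okp"])
```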

### Inline RAG chunk flow

```mermaid
flowchart TD
subgraph Sources["Source Fetching"]
B1["BYOK Store 1"] --> BPool
B2["BYOK Store 2"] --> BPool
BN["BYOK Store N"] --> BPool
BPool["BYOK Pool\ncapped at byok.max_chunks"]
OKP["OKP (Solr)\ncapped at okp.max_chunks"]
end

BPool --> Pool["Merged Pool\n(all chunks, sorted by score)"]
OKP --> Pool

Pool --> Decision{Reranker\nenabled?}

Decision -->|Yes| Rerank["Cross-Encoder Rerank\n+ BYOK score boost"]
Decision -->|No| Cut

Rerank --> Cut["Top K cut\nretrieval.inline.max_chunks"]

Cut --> Context["Final Inline RAG Context"]
```

Each BYOK store is queried in parallel, and the merged BYOK results are capped at `byok.max_chunks` total. OKP fetches up to `okp.max_chunks`. Together these form the reranking pool. If the reranker is enabled, the full pool is reranked with a cross-encoder and BYOK score boosts are applied. With or without reranking, the result is then capped at `retrieval.inline.max_chunks`.
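
The chunk flow described above can be sketched roughly as below. The tuple shape, the function names, and the choice to keep the highest-scoring chunks when applying each cap are assumptions for illustration. The `rerank` callable stands in for the cross-encoder step and is not a real API.

```python
from typing import Callable, Optional

Chunk = tuple[str, str, float]  # (source, text, score); illustrative shape


def by_score(chunks: list[Chunk]) -> list[Chunk]:
    """Sort chunks by score, highest first."""
    return sorted(chunks, key=lambda c: c[2], reverse=True)


def build_inline_context(
    byok_per_store: list[list[Chunk]],
    okp_chunks: list[Chunk],
    byok_max: int = 10,
    okp_max: int = 5,
    inline_max: int = 10,
    rerank: Optional[Callable[[list[Chunk]], list[Chunk]]] = None,
) -> list[Chunk]:
    """Merge BYOK and OKP chunks, optionally rerank, then apply the final cap."""
    # Merge results from all BYOK stores and cap the combined total.
    byok_pool = by_score([c for store in byok_per_store for c in store])[:byok_max]
    # OKP contributes up to its own limit.
    okp_pool = by_score(okp_chunks)[:okp_max]
    # Both pools form one merged pool, sorted by score.
    pool = by_score(byok_pool + okp_pool)
    # The optional reranker re-scores and reorders the whole pool.
    if rerank is not None:
        pool = rerank(pool)
    # Final cut applies whether or not reranking happened.
    return pool[:inline_max]


store_a = [("ocp-docs", f"a{i}", 0.9 - i * 0.01) for i in range(8)]
store_b = [("knowledge-base", f"b{i}", 0.5 - i * 0.01) for i in range(8)]
okp = [("okp", f"o{i}", 5.0 - i) for i in range(7)]
context = build_inline_context([store_a, store_b], okp)
```

The OKP scores in the usage example are deliberately on a different scale than the BYOK scores. Without a reranker they dominate the merged pool, which is the comparability problem the cross-encoder step is meant to address.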

The **Embedding Model** is used to convert queries and documents into vector representations for similarity matching.

> [!NOTE]
@@ -90,7 +122,7 @@ This example shows how to configure a remote PostgreSQL database with the [pgvec

> [!NOTE]
> pgvector is not yet supported via `byok_rag` in `lightspeed-stack.yaml` (see [LCORE-2437](https://redhat.atlassian.net/browse/LCORE-2437)).
> It must be configured directly in the Llama Stack configuration file.
> It must be configured directly in the `run.yaml` configuration file.

> You will need to install a PostgreSQL version compatible with pgvector, then log in with `psql` and enable the extension with:
> ```sql
@@ -313,15 +345,16 @@ Example:
**Chunk volume:**

OKP and BYOK scores are not directly comparable (different scoring systems), so
`score_multiplier` (a BYOK-only concept) does not apply to OKP results. To control
the number of retrieved chunks, set the constants in `src/constants.py`:

| Constant | Value | Description |
|----------|-------|-------------|
| `INLINE_RAG_MAX_CHUNKS` | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
| `OKP_RAG_MAX_CHUNKS` | 5 | Fetch hint for OKP (Inline RAG); controls how many chunks enter the reranking pool |
| `BYOK_RAG_MAX_CHUNKS` | 10 | Fetch hint for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
| `TOOL_RAG_MAX_CHUNKS` | 10 | Max chunks retrieved via Tool RAG (`file_search`); independent from `INLINE_RAG_MAX_CHUNKS` |
`score_multiplier` (a BYOK-only concept) does not apply to OKP results. However, when
the reranker is enabled, it normalizes scores across sources using a cross-encoder model.
To control the number of retrieved chunks, configure `max_chunks` in `lightspeed-stack.yaml`:

| Config path | Default | Description |
|-------------|---------|-------------|
| `rag.retrieval.inline.max_chunks` | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
| `rag.okp.max_chunks` | 5 | Fetch limit for OKP (Inline RAG); controls how many chunks enter the reranking pool |
| `rag.byok.max_chunks` | 10 | Fetch limit for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
| `rag.retrieval.tool.max_chunks` | 10 | Max chunks retrieved via Tool RAG (`file_search`); independent of `rag.retrieval.inline.max_chunks` |

**Limitations:**

73 changes: 39 additions & 34 deletions examples/lightspeed-stack-byok-okp-rag.yaml
@@ -34,40 +34,45 @@ quota_handlers:
scheduler:
# scheduler ticks in seconds
period: 10
byok_rag:
- rag_id: ocp-docs # referenced in rag.inline / rag.tool
rag_type: inline::faiss
embedding_model: sentence-transformers/all-mpnet-base-v2
embedding_dimension: 768
vector_db_id: vs_123 # Vector store ID (from index generation)
db_path: /tmp/ocp.faiss
score_multiplier: 1.0 # Weight for this vector store's results (Inline RAG only)
- rag_id: knowledge-base # referenced in rag.inline / rag.tool
rag_type: inline::faiss
embedding_model: sentence-transformers/all-mpnet-base-v2
embedding_dimension: 768
vector_db_id: vs_456 # Vector store ID (from index generation)
db_path: /tmp/kb.faiss
score_multiplier: 1.2 # Weight for this vector store's results (Inline RAG only)

# RAG configuration
rag:
# Inline RAG: context injected before the LLM request from the listed sources
# List rag_ids from byok_rag, or 'okp' to include OKP
inline:
- ocp-docs
- knowledge-base
- okp
# Tool RAG: LLM can call file_search on demand to retrieve context
# List rag_ids from byok_rag, or 'okp' to include OKP
# Omit to use all registered BYOK stores (backward compatibility)
tool:
- ocp-docs
- knowledge-base
byok:
max_chunks: 10 # Max total chunks across all BYOK stores
stores:
- rag_id: ocp-docs # Referenced in retrieval.inline / retrieval.tool
backend: faiss
embedding_dimension: 1024
vector_db_id: vs_123 # Llama-stack vector_store_id
db_path: /tmp/ocp.faiss
score_multiplier: 1.0 # Weight for this vector store's results (Inline RAG only)
- rag_id: knowledge-base # Referenced in retrieval.inline / retrieval.tool
backend: faiss
embedding_dimension: 384
vector_db_id: vs_456 # Llama-stack vector_store_id
db_path: /tmp/kb.faiss
score_multiplier: 1.2 # Weight for this vector store's results (Inline RAG only)

# OKP provider settings (only used when 'okp' is listed in retrieval sources)
okp:
offline: true # true = use parent_id for source URLs, false = use reference_url
max_chunks: 5 # Max chunks fetched from OKP
# Additional Solr filter query applied to every OKP search request.
# Use Solr boolean syntax
# chunk_filter_query: "product:*ansible* AND product:*openshift*"

# OKP provider settings (only used when 'okp' is listed in rag.inline or rag.tool)
okp:
offline: true # true = use parent_id for source URLs, false = use reference_url
# Additional Solr filter query applied to every OKP search request.
# Use Solr boolean syntax
# chunk_filter_query: "product:*ansible* AND product:*openshift*"
retrieval:
# Inline RAG: context injected before the LLM request from the listed sources
# List rag_ids from byok stores, or 'okp' to include OKP
inline:
sources:
- ocp-docs
- knowledge-base
- okp
max_chunks: 10 # Cap on merged inline result
# Tool RAG: LLM can call file_search on demand to retrieve context
# List rag_ids from byok stores, or 'okp' to include OKP
tool:
sources:
- ocp-docs
- knowledge-base
max_chunks: 10 # Tool RAG limit
24 changes: 13 additions & 11 deletions examples/quota-limiter-configuration-sqlite.yaml
@@ -33,17 +33,19 @@ conversation_cache:
ssl_mode: disable
gss_encmode: disable

#byok_rag:
# - rag_id: ocp_docs
# rag_type: inline::faiss
# embedding_dimension: 1024
# vector_db_id: vector_byok_1
# db_path: /tmp/ocp.faiss
# - rag_id: knowledge_base
# rag_type: inline::faiss
# embedding_dimension: 384
# vector_db_id: vector_byok_2
# db_path: /tmp/kb.faiss
#rag:
# byok:
# stores:
# - rag_id: ocp_docs
# backend: faiss
# embedding_dimension: 1024
# vector_db_id: vector_byok_1
# db_path: /tmp/ocp.faiss
# - rag_id: knowledge_base
# backend: faiss
# embedding_dimension: 384
# vector_db_id: vector_byok_2
# db_path: /tmp/kb.faiss

quota_handlers:
sqlite:
6 changes: 3 additions & 3 deletions src/app/endpoints/rags.py
@@ -24,7 +24,7 @@
RAGInfoResponse,
RAGListResponse,
)
from models.config import Action, ByokRag
from models.config import Action, RagStore
from utils.endpoints import check_configuration_loaded

logger = get_logger(__name__)
@@ -107,7 +107,7 @@ async def rags_endpoint_handler(
raise HTTPException(**response.model_dump()) from e


def _resolve_rag_id_to_vector_db_id(rag_id: str, byok_rags: list[ByokRag]) -> str:
def _resolve_rag_id_to_vector_db_id(rag_id: str, byok_rags: list[RagStore]) -> str:
"""Resolve a user-facing rag_id to the llama-stack vector_db_id.

Checks if the given ID matches a rag_id in the BYOK config and returns
@@ -166,7 +166,7 @@ async def get_rag_endpoint_handler(

# Resolve user-facing rag_id to llama-stack vector_db_id
vector_db_id = _resolve_rag_id_to_vector_db_id(
rag_id, configuration.configuration.byok_rag
rag_id, configuration.configuration.rag.byok.stores
)

try:
8 changes: 6 additions & 2 deletions src/client.py
@@ -91,10 +91,14 @@ def _enrich_library_config(self, input_config_path: str) -> str:
config = configuration.configuration

# Enrichment: BYOK RAG
enrich_byok_rag(ls_config, [b.model_dump() for b in config.byok_rag])
enrich_byok_rag(ls_config, [s.model_dump() for s in config.rag.byok.stores])

# Enrichment: Solr - enabled when "okp" appears in either inline or tool list
enrich_solr(ls_config, config.rag.model_dump(), config.okp.model_dump())
rag_config_for_solr = {
"inline": config.rag.retrieval.inline.sources,
"tool": config.rag.retrieval.tool.sources,
}
enrich_solr(ls_config, rag_config_for_solr, config.rag.okp.model_dump())

# Enrichment: Azure Entra ID deferred auth
entra_id_config = (
17 changes: 10 additions & 7 deletions src/configuration.py
@@ -479,7 +479,7 @@ def okp(self) -> "OkpConfiguration":
"""Return OKP configuration."""
if self._configuration is None:
raise LogicError("logic error: configuration is not loaded")
return self._configuration.okp
return self._configuration.rag.okp

@property
def reranker(self) -> "RerankerConfiguration":
@@ -502,12 +502,15 @@ def rag_id_mapping(self) -> dict[str, str]:
if self._configuration is None:
raise LogicError("logic error: configuration is not loaded")
byok_mapping = {
brag.vector_db_id: brag.rag_id for brag in self._configuration.byok_rag
store.vector_db_id: store.rag_id
for store in self._configuration.rag.byok.stores
}

rag = self._configuration.rag
retrieval = self._configuration.rag.retrieval
okp_id = constants.OKP_RAG_ID
okp_enabled = okp_id in (rag.inline or []) or okp_id in (rag.tool or [])
okp_enabled = okp_id in (retrieval.inline.sources or []) or okp_id in (
retrieval.tool.sources or []
)
okp_mapping = (
{constants.SOLR_DEFAULT_VECTOR_STORE_ID: okp_id} if okp_enabled else {}
)
@@ -527,8 +530,8 @@ def score_multiplier_mapping(self) -> dict[str, float]:
if self._configuration is None:
raise LogicError("logic error: configuration is not loaded")
return {
brag.vector_db_id: brag.score_multiplier
for brag in self._configuration.byok_rag
store.vector_db_id: store.score_multiplier
for store in self._configuration.rag.byok.stores
}

@property
@@ -543,7 +546,7 @@ def inline_solr_enabled(self) -> bool:
"""
if self._configuration is None:
raise LogicError("logic error: configuration is not loaded")
return constants.OKP_RAG_ID in self._configuration.rag.inline
return constants.OKP_RAG_ID in self._configuration.rag.retrieval.inline.sources

def resolve_index_name(
self, vector_store_id: str, rag_id_mapping: Optional[dict[str, str]] = None