lightspeed-core · are-ces · Jun 3, 2026
diff --git a/docs/byok_guide.md b/docs/byok_guide.md
@@ -10,6 +10,7 @@ The BYOK (Bring Your Own Knowledge) feature in Lightspeed Core enables users to
 
 * [What is BYOK?](#what-is-byok)
 * [How BYOK Works](#how-byok-works)
+  * [Prioritization of BYOK content](#prioritization-of-byok-content)
 * [Prerequisites](#prerequisites)
 * [Configuration Guide](#configuration-guide)
   * [Step 1: Prepare Your Knowledge Sources](#step-1-prepare-your-knowledge-sources)
@@ -77,17 +78,45 @@ Both modes rely on:
 - **Vector Database**: Your indexed knowledge sources stored as vector embeddings
 - **Embedding Model**: Converts queries and documents into vector representations for similarity matching
 
-Inline RAG additionally supports:
-- **Score Multiplier**: Optional weight applied per BYOK vector store when mixing multiple sources. Allows custom prioritization of content.
+### Prioritization of BYOK content
 
-> [!NOTE]
-> OKP and BYOK scores are not directly comparable (different scoring systems), so
-> `score_multiplier` does not apply to OKP results. To control the amount of retrieved
-> context, set the `BYOK_RAG_MAX_CHUNKS` and `OKP_RAG_MAX_CHUNKS` constants in `src/constants.py`
-> (defaults: 10 and 5 respectively). For Tool RAG, use `TOOL_RAG_MAX_CHUNKS` (default: 10).
-> The `INLINE_RAG_MAX_CHUNKS` constant (value: 10) caps the final merged inline RAG
-> chunks (BYOK + OKP) delivered to the LLM. Tool RAG is controlled independently
-> by `TOOL_RAG_MAX_CHUNKS`.
+When multiple BYOK stores are configured for Inline RAG, their results are merged and ranked. Two mechanisms control prioritization:
+
+- **Score Multiplier** (`score_multiplier`): A per-store weight applied to raw similarity scores during Inline RAG. Values > 1.0 boost a store's results; values < 1.0 reduce them. It affects only BYOK stores: OKP uses a different scoring system, so its scores are not directly comparable.
+
+- **Reranker**: When enabled, a cross-encoder model re-scores the merged chunk pool (BYOK + OKP) using semantic similarity to the query. This normalizes scores across sources, making OKP and BYOK results directly comparable. BYOK score boosts are applied after reranking.
+
+**Chunk limits** control how many chunks flow through the pipeline. Configure them in `lightspeed-stack.yaml`:
+
+| Config path | Default | Description |
+|-------------|---------|-------------|
+| `rag.byok.max_chunks` | 10 | Total chunks fetched across all BYOK stores |
+| `rag.okp.max_chunks` | 5 | Chunks fetched from OKP |
+| `rag.retrieval.inline.max_chunks` | 10 | Final cap on merged inline RAG chunks delivered to the LLM |
+| `rag.retrieval.tool.max_chunks` | 10 | Max chunks retrieved via Tool RAG (`file_search`) |
+
+```mermaid
+flowchart TD
+    subgraph Sources["Source Fetching"]
+        B1["BYOK Store 1"] --> BPool
+        B2["BYOK Store 2"] --> BPool
+        BN["BYOK Store N"] --> BPool
+        BPool["BYOK Pool\ncapped at rag.byok.max_chunks"]
+        OKP["OKP (Solr)\ncapped at rag.okp.max_chunks"]
+    end
+
+    BPool --> Pool["Merged Pool\n(all chunks, sorted by score)"]
+    OKP --> Pool
+
+    Pool --> Decision{Reranker\nenabled?}
+
+    Decision -->|Yes| Rerank["Cross-Encoder Rerank\n+ BYOK score boost"]
+    Decision -->|No| Cut
+
+    Rerank --> Cut["Top K cut\nrag.retrieval.inline.max_chunks"]
+
+    Cut --> Context["Final Inline RAG Context"]
+```
 
 ---
 
@@ -288,7 +317,7 @@ byok_rag:
 
 > [!NOTE]
 > pgvector is not yet supported via `byok_rag` in `lightspeed-stack.yaml` (see [LCORE-2437](https://redhat.atlassian.net/browse/LCORE-2437)).
-> It must be configured directly in the Llama Stack configuration file.
+> It must be configured directly in the `run.yaml` configuration file.
 
 ```yaml
 vector_io:
@@ -342,11 +371,11 @@ rag:
 
 ### Example 2: Multiple Knowledge Sources with pgvector
 
-A configuration combining a local FAISS store (via `byok_rag`) with a remote pgvector store (configured directly in the Llama Stack configuration file):
+A configuration combining a local FAISS store (via `byok_rag`) with a remote pgvector store (configured directly in the `run.yaml` configuration file):
 
 > [!NOTE]
 > pgvector is not yet supported via `byok_rag` in `lightspeed-stack.yaml` (see [LCORE-2437](https://redhat.atlassian.net/browse/LCORE-2437)).
-> The pgvector provider must be configured directly in the Llama Stack configuration file.
+> The pgvector provider must be configured directly in the `run.yaml` configuration file.
 
 **`lightspeed-stack.yaml`** — FAISS store and RAG strategy:
 
@@ -373,7 +402,7 @@ rag:
     - local-docs
 ```
 
-**Llama Stack configuration file** — pgvector provider:
+**`run.yaml` configuration file** — pgvector provider:
 
 ```yaml
 vector_io:

diff --git a/docs/openapi.json b/docs/openapi.json
diff --git a/docs/rag_guide.md b/docs/rag_guide.md
@@ -34,6 +34,38 @@ Lightspeed Core Stack (LCS) supports two complementary RAG strategies:
 
 Both strategies can be enabled independently via the `rag` section of `lightspeed-stack.yaml`. See [BYOK Feature Documentation](byok_guide.md) for configuration details.
 
+> [!NOTE]
+> **Backward compatibility:** if neither `retrieval.inline.sources` nor `retrieval.tool.sources` is
+> configured, all registered vector stores (BYOK and OKP) are automatically exposed as
+> Tool RAG (`file_search`). Inline RAG is **not** enabled in this fallback — only Tool RAG.
+
+### Inline RAG chunk flow
+
+```mermaid
+flowchart TD
+    subgraph Sources["Source Fetching"]
+        B1["BYOK Store 1"] --> BPool
+        B2["BYOK Store 2"] --> BPool
+        BN["BYOK Store N"] --> BPool
+        BPool["BYOK Pool\ncapped at byok.max_chunks"]
+        OKP["OKP (Solr)\ncapped at okp.max_chunks"]
+    end
+
+    BPool --> Pool["Merged Pool\n(all chunks, sorted by score)"]
+    OKP --> Pool
+
+    Pool --> Decision{Reranker\nenabled?}
+
+    Decision -->|Yes| Rerank["Cross-Encoder Rerank\n+ BYOK score boost"]
+    Decision -->|No| Cut
+
+    Rerank --> Cut["Top K cut\nretrieval.inline.max_chunks"]
+
+    Cut --> Context["Final Inline RAG Context"]
+```
+
+Each BYOK store is queried in parallel, and the merged BYOK results are capped at `byok.max_chunks` total. OKP fetches up to `okp.max_chunks`. Together these form the merged pool, sorted by score. If the reranker is enabled, the full pool is reranked with a cross-encoder and BYOK score boosts are applied; otherwise the pool goes straight to the final cut. In both cases the result is capped at `retrieval.inline.max_chunks`.
+
 The **Embedding Model** is used to convert queries and documents into vector representations for similarity matching.
 
 > [!NOTE]
@@ -90,7 +122,7 @@ This example shows how to configure a remote PostgreSQL database with the [pgvec
 
 > [!NOTE]
 > pgvector is not yet supported via `byok_rag` in `lightspeed-stack.yaml` (see [LCORE-2437](https://redhat.atlassian.net/browse/LCORE-2437)).
-> It must be configured directly in the Llama Stack configuration file.
+> It must be configured directly in the `run.yaml` configuration file.
 
 > You will need to install PostgreSQL with a matching version to pgvector, then log in with `psql` and enable the extension with:
 > ```sql
@@ -313,15 +345,16 @@ Example:
 **Chunk volume:**
 
 OKP and BYOK scores are not directly comparable (different scoring systems), so
-`score_multiplier` (a BYOK-only concept) does not apply to OKP results. To control
-the number of retrieved chunks, set the constants in `src/constants.py`:
-
-| Constant | Value | Description |
-|----------|-------|-------------|
-| `INLINE_RAG_MAX_CHUNKS` | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
-| `OKP_RAG_MAX_CHUNKS` | 5 | Fetch hint for OKP (Inline RAG); controls how many chunks enter the reranking pool |
-| `BYOK_RAG_MAX_CHUNKS` | 10 | Fetch hint for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
-| `TOOL_RAG_MAX_CHUNKS` | 10 | Max chunks retrieved via Tool RAG (`file_search`); independent from `INLINE_RAG_MAX_CHUNKS` |
+`score_multiplier` (a BYOK-only concept) does not apply to OKP results. However, when
+the reranker is enabled, it normalizes scores across sources using a cross-encoder model.
+To control the number of retrieved chunks, configure `max_chunks` in `lightspeed-stack.yaml`:
+
+| Config path | Default | Description |
+|-------------|---------|-------------|
+| `rag.retrieval.inline.max_chunks` | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
+| `rag.okp.max_chunks` | 5 | Fetch limit for OKP (Inline RAG); controls how many chunks enter the reranking pool |
+| `rag.byok.max_chunks` | 10 | Fetch limit for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
+| `rag.retrieval.tool.max_chunks` | 10 | Max chunks retrieved via Tool RAG (`file_search`); independent of `rag.retrieval.inline.max_chunks` |
 
 **Limitations:**
 

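The inline chunk flow documented in `docs/rag_guide.md` can be approximated with a small sketch of the reranker-disabled path. This is an illustration only, not the project's implementation: the `Chunk` type, the `top_k` and `inline_chunk_flow` names, and the choice to apply `score_multiplier` before the BYOK cap are all assumptions made for the example. The default limits mirror the documented defaults (10, 5, 10).

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record used only in this sketch."""

    text: str
    score: float
    source: str  # a BYOK rag_id, or "okp"


def top_k(chunks: list[Chunk], k: int) -> list[Chunk]:
    """Return the k highest-scoring chunks, best first."""
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:k]


def inline_chunk_flow(
    byok_results: dict[str, list[Chunk]],
    okp_results: list[Chunk],
    score_multipliers: dict[str, float],
    byok_max_chunks: int = 10,
    okp_max_chunks: int = 5,
    inline_max_chunks: int = 10,
) -> list[Chunk]:
    """Reranker-disabled path: weight BYOK scores, cap each source, merge, cut."""
    # Assumption: the per-store multiplier is applied before the BYOK cap.
    weighted = [
        Chunk(c.text, c.score * score_multipliers.get(rag_id, 1.0), c.source)
        for rag_id, chunks in byok_results.items()
        for c in chunks
    ]
    byok_pool = top_k(weighted, byok_max_chunks)  # rag.byok.max_chunks
    okp_pool = top_k(okp_results, okp_max_chunks)  # rag.okp.max_chunks
    # Merged pool sorted by score, then the final cut
    # (rag.retrieval.inline.max_chunks).
    return top_k(byok_pool + okp_pool, inline_max_chunks)
```

With the reranker enabled, the merged pool would instead be re-scored by the cross-encoder before the final cut; that step is omitted here because it depends on the model in use.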
diff --git a/examples/lightspeed-stack-byok-okp-rag.yaml b/examples/lightspeed-stack-byok-okp-rag.yaml
@@ -34,40 +34,45 @@ quota_handlers:
   scheduler:
     # scheduler ticks in seconds
     period: 10
-byok_rag:
-  - rag_id: ocp-docs           # referenced in rag.inline / rag.tool
-    rag_type: inline::faiss
-    embedding_model: sentence-transformers/all-mpnet-base-v2
-    embedding_dimension: 768
-    vector_db_id: vs_123       # Vector store ID (from index generation)
-    db_path: /tmp/ocp.faiss
-    score_multiplier: 1.0      # Weight for this vector store's results (Inline RAG only)
-  - rag_id: knowledge-base     # referenced in rag.inline / rag.tool
-    rag_type: inline::faiss
-    embedding_model: sentence-transformers/all-mpnet-base-v2
-    embedding_dimension: 768
-    vector_db_id: vs_456       # Vector store ID (from index generation)
-    db_path: /tmp/kb.faiss
-    score_multiplier: 1.2      # Weight for this vector store's results (Inline RAG only)
-
 # RAG configuration
 rag:
-  # Inline RAG: context injected before the LLM request from the listed sources
-  # List rag_ids from byok_rag, or 'okp' to include OKP
-  inline:
-    - ocp-docs
-    - knowledge-base
-    - okp
-  # Tool RAG: LLM can call file_search on demand to retrieve context
-  # List rag_ids from byok_rag, or 'okp' to include OKP
-  # Omit to use all registered BYOK stores (backward compatibility)
-  tool:
-    - ocp-docs
-    - knowledge-base
+  byok:
+    max_chunks: 10               # Max total chunks across all BYOK stores
+    stores:
+      - rag_id: ocp-docs         # Referenced in retrieval.inline / retrieval.tool
+        backend: faiss
+        embedding_dimension: 1024
+        vector_db_id: vs_123     # Llama-stack vector_store_id
+        db_path: /tmp/ocp.faiss
+        score_multiplier: 1.0    # Weight for this vector store's results (Inline RAG only)
+      - rag_id: knowledge-base   # Referenced in retrieval.inline / retrieval.tool
+        backend: faiss
+        embedding_dimension: 384
+        vector_db_id: vs_456     # Llama-stack vector_store_id
+        db_path: /tmp/kb.faiss
+        score_multiplier: 1.2    # Weight for this vector store's results (Inline RAG only)
+
+  # OKP provider settings (only used when 'okp' is listed in retrieval sources)
+  okp:
+    offline: true                # true = use parent_id for source URLs, false = use reference_url
+    max_chunks: 5                # Max chunks fetched from OKP
+    # Additional Solr filter query applied to every OKP search request.
+    # Use Solr boolean syntax
+    # chunk_filter_query: "product:*ansible* AND product:*openshift*"
 
-# OKP provider settings (only used when 'okp' is listed in rag.inline or rag.tool)
-okp:
-  offline: true    # true = use parent_id for source URLs, false = use reference_url
-  # Additional Solr filter query applied to every OKP search request.
-  # Use Solr boolean syntax
-  # chunk_filter_query: "product:*ansible* AND product:*openshift*"
+  retrieval:
+    # Inline RAG: context injected before the LLM request from the listed sources
+    # List rag_ids from byok stores, or 'okp' to include OKP
+    inline:
+      sources:
+        - ocp-docs
+        - knowledge-base
+        - okp
+      max_chunks: 10             # Cap on merged inline result
+    # Tool RAG: LLM can call file_search on demand to retrieve context
+    # List rag_ids from byok stores, or 'okp' to include OKP
+    tool:
+      sources:
+        - ocp-docs
+        - knowledge-base
+      max_chunks: 10             # Tool RAG limit
diff --git a/examples/quota-limiter-configuration-sqlite.yaml b/examples/quota-limiter-configuration-sqlite.yaml
@@ -33,17 +33,19 @@ conversation_cache:
     ssl_mode: disable
     gss_encmode: disable
 
-#byok_rag:
-#  - rag_id: ocp_docs
-#    rag_type: inline::faiss
-#    embedding_dimension: 1024
-#    vector_db_id: vector_byok_1
-#    db_path: /tmp/ocp.faiss
-#  - rag_id: knowledge_base
-#    rag_type: inline::faiss
-#    embedding_dimension: 384
-#    vector_db_id: vector_byok_2
-#    db_path: /tmp/kb.faiss
+#rag:
+#  byok:
+#    stores:
+#      - rag_id: ocp_docs
+#        backend: faiss
+#        embedding_dimension: 1024
+#        vector_db_id: vector_byok_1
+#        db_path: /tmp/ocp.faiss
+#      - rag_id: knowledge_base
+#        backend: faiss
+#        embedding_dimension: 384
+#        vector_db_id: vector_byok_2
+#        db_path: /tmp/kb.faiss
 
 quota_handlers:
   sqlite:

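Both example files move from top-level `byok_rag` / `okp` keys and flat `rag.inline` / `rag.tool` lists to a nested `rag` section. A rough migration helper, inferred only from the before/after YAML in this diff, might look like the following. The function name `migrate_rag_config` is hypothetical, and the rule that `rag_type: inline::faiss` becomes `backend: faiss` (the part after `::`) is an assumption based on the single example shown; other fields are passed through unchanged.

```python
def migrate_rag_config(old: dict) -> dict:
    """Convert the old flat layout into the nested `rag` layout (sketch)."""
    stores = []
    for entry in old.get("byok_rag", []):
        store = dict(entry)
        rag_type = store.pop("rag_type", None)
        if rag_type is not None:
            # Assumption: the backend name is the suffix of the old rag_type.
            store["backend"] = rag_type.split("::")[-1]
        stores.append(store)

    old_rag = old.get("rag") or {}
    new_rag: dict = {"byok": {"stores": stores}}
    if "okp" in old:
        new_rag["okp"] = dict(old["okp"])

    # Only emit `sources` for lists that were actually configured, so an
    # omitted list stays omitted rather than turning into an empty one.
    retrieval = {}
    for mode in ("inline", "tool"):
        if old_rag.get(mode) is not None:
            retrieval[mode] = {"sources": list(old_rag[mode])}
    if retrieval:
        new_rag["retrieval"] = retrieval

    # Keep every unrelated top-level key as it was.
    new = {k: v for k, v in old.items() if k not in ("byok_rag", "rag", "okp")}
    new["rag"] = new_rag
    return new
```

The `max_chunks` settings are not produced by this helper, since the old layout kept those values as constants in `src/constants.py` rather than in the YAML.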
diff --git a/src/app/endpoints/rags.py b/src/app/endpoints/rags.py
@@ -24,7 +24,7 @@
     RAGInfoResponse,
     RAGListResponse,
 )
-from models.config import Action, ByokRag
+from models.config import Action, RagStore
 from utils.endpoints import check_configuration_loaded
 
 logger = get_logger(__name__)
@@ -107,7 +107,7 @@ async def rags_endpoint_handler(
         raise HTTPException(**response.model_dump()) from e
 
 
-def _resolve_rag_id_to_vector_db_id(rag_id: str, byok_rags: list[ByokRag]) -> str:
+def _resolve_rag_id_to_vector_db_id(rag_id: str, byok_rags: list[RagStore]) -> str:
     """Resolve a user-facing rag_id to the llama-stack vector_db_id.
 
     Checks if the given ID matches a rag_id in the BYOK config and returns
@@ -166,7 +166,7 @@ async def get_rag_endpoint_handler(
 
     # Resolve user-facing rag_id to llama-stack vector_db_id
     vector_db_id = _resolve_rag_id_to_vector_db_id(
-        rag_id, configuration.configuration.byok_rag
+        rag_id, configuration.configuration.rag.byok.stores
     )
 
     try:

diff --git a/src/client.py b/src/client.py
@@ -91,10 +91,14 @@ def _enrich_library_config(self, input_config_path: str) -> str:
         config = configuration.configuration
 
         # Enrichment: BYOK RAG
-        enrich_byok_rag(ls_config, [b.model_dump() for b in config.byok_rag])
+        enrich_byok_rag(ls_config, [s.model_dump() for s in config.rag.byok.stores])
 
         # Enrichment: Solr - enabled when "okp" appears in either inline or tool list
-        enrich_solr(ls_config, config.rag.model_dump(), config.okp.model_dump())
+        rag_config_for_solr = {
+            "inline": config.rag.retrieval.inline.sources,
+            "tool": config.rag.retrieval.tool.sources,
+        }
+        enrich_solr(ls_config, rag_config_for_solr, config.rag.okp.model_dump())
 
         # Enrichment: Azure Entra ID deferred auth
         entra_id_config = (

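The accessors touched in this change read from `rag.byok.stores`, `rag.retrieval.inline.sources`, and `rag.retrieval.tool.sources`. A minimal stand-in for that shape, with the lookups that `src/configuration.py` derives from it, could be sketched as below. The dataclasses are simplified substitutes for the real config models (the retrieval section is flattened into two lists), `OKP_RAG_ID = "okp"` is assumed from the source name used in the example YAML, `SOLR_STORE_ID` is a placeholder value, and merging the OKP entry into the same dictionary is an assumption about how the two mappings are combined.

```python
from dataclasses import dataclass, field

OKP_RAG_ID = "okp"  # assumed from the 'okp' source name in the example YAML
SOLR_STORE_ID = "solr-store-placeholder"  # hypothetical stand-in value


@dataclass
class RagStore:
    """Simplified stand-in for one entry of rag.byok.stores."""

    rag_id: str
    vector_db_id: str
    score_multiplier: float = 1.0


@dataclass
class Retrieval:
    """Flattened stand-in for rag.retrieval.{inline,tool}.sources."""

    inline_sources: list[str] = field(default_factory=list)
    tool_sources: list[str] = field(default_factory=list)


def okp_enabled(retrieval: Retrieval) -> bool:
    """OKP is active when 'okp' is listed as an inline or tool source."""
    return (
        OKP_RAG_ID in retrieval.inline_sources
        or OKP_RAG_ID in retrieval.tool_sources
    )


def rag_id_mapping(stores: list[RagStore], retrieval: Retrieval) -> dict[str, str]:
    """Map each vector store ID back to its user-facing rag_id."""
    mapping = {s.vector_db_id: s.rag_id for s in stores}
    if okp_enabled(retrieval):
        mapping[SOLR_STORE_ID] = OKP_RAG_ID
    return mapping


def score_multiplier_mapping(stores: list[RagStore]) -> dict[str, float]:
    """Map each vector store ID to its configured score multiplier."""
    return {s.vector_db_id: s.score_multiplier for s in stores}
```

For example, two stores `vs_123` and `vs_456` with `okp` listed only under the inline sources would yield a three-entry ID mapping and a two-entry multiplier mapping, since OKP has no `score_multiplier`.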
diff --git a/src/configuration.py b/src/configuration.py
@@ -479,7 +479,7 @@ def okp(self) -> "OkpConfiguration":
         """Return OKP configuration."""
         if self._configuration is None:
             raise LogicError("logic error: configuration is not loaded")
-        return self._configuration.okp
+        return self._configuration.rag.okp
 
     @property
     def reranker(self) -> "RerankerConfiguration":
@@ -502,12 +502,15 @@ def rag_id_mapping(self) -> dict[str, str]:
         if self._configuration is None:
             raise LogicError("logic error: configuration is not loaded")
         byok_mapping = {
-            brag.vector_db_id: brag.rag_id for brag in self._configuration.byok_rag
+            store.vector_db_id: store.rag_id
+            for store in self._configuration.rag.byok.stores
         }
 
-        rag = self._configuration.rag
+        retrieval = self._configuration.rag.retrieval
         okp_id = constants.OKP_RAG_ID
-        okp_enabled = okp_id in (rag.inline or []) or okp_id in (rag.tool or [])
+        okp_enabled = okp_id in (retrieval.inline.sources or []) or okp_id in (
+            retrieval.tool.sources or []
+        )
         okp_mapping = (
             {constants.SOLR_DEFAULT_VECTOR_STORE_ID: okp_id} if okp_enabled else {}
         )
@@ -527,8 +530,8 @@ def score_multiplier_mapping(self) -> dict[str, float]:
         if self._configuration is None:
             raise LogicError("logic error: configuration is not loaded")
         return {
-            brag.vector_db_id: brag.score_multiplier
-            for brag in self._configuration.byok_rag
+            store.vector_db_id: store.score_multiplier
+            for store in self._configuration.rag.byok.stores
         }
 
     @property
@@ -543,7 +546,7 @@ def inline_solr_enabled(self) -> bool:
         """
         if self._configuration is None:
             raise LogicError("logic error: configuration is not loaded")
-        return constants.OKP_RAG_ID in self._configuration.rag.inline
+        return constants.OKP_RAG_ID in self._configuration.rag.retrieval.inline.sources
 
     def resolve_index_name(
         self, vector_store_id: str, rag_id_mapping: Optional[dict[str, str]] = None