---
title: Generative APIs - Concepts
description: This page explains all the concepts related to Generative APIs
tags:
dates:
  validation: 2026-04-16
---

## Allowed IPs {/* Dedicated */}

The **Allowed IPs** feature is no longer available for Generative APIs - Dedicated Deployment. Use one of the alternative methods detailed in our [documentation about access management](/generative-apis/how-to/manage-allowed-ips/) to restrict access to your dedicated Generative APIs deployments.

## API rate limits {/* Serverless */}

API rate limits define the maximum number of requests a user can make to the Generative APIs - Serverless endpoints within a specific time frame. Rate limiting helps manage resource allocation, prevent abuse, and ensure fair access for all users. Understanding and adhering to these limits is essential for maintaining optimal performance in applications that use these APIs.

Refer to the [Rate limits](/generative-apis/reference-content/rate-limits/) documentation for more information.

## Batch processing {/* Serverless */}

Batch jobs are processed asynchronously, offering reduced costs (see [pricing page](https://www.scaleway.com/en/pricing/model-as-a-service/)) and no rate limits. They are designed for high-volume workloads and are typically completed within 24 hours.

## Context window {/* Serverless + Dedicated (Context size) */}

A context window is the maximum amount of prompt data the model considers when generating a response. Using models with a high context length, you can provide more information and obtain more relevant responses. Context is measured in tokens.

## Deployment {/* Dedicated */}

A deployment makes a trained language model available for real-world applications. It encompasses tasks such as integrating the model into existing systems, optimizing its performance, and ensuring scalability and reliability.

## Embeddings {/* Serverless + Dedicated (Embedding models) */}

Embeddings are numerical representations of text data that capture semantic information in a dense vector format. In Generative APIs, embeddings are essential for tasks such as similarity matching and clustering, and serve as input for downstream models or algorithms. These vectors enable the model to understand and generate text based on the underlying meaning rather than just the surface-level words.

Refer to [How to query embedding models](/generative-apis/how-to/query-embedding-models/) for more information.
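
As an illustration, the sketch below queries an embedding model through an OpenAI-compatible client and compares two sentences by cosine similarity. The base URL, API key, and model name are placeholder assumptions; substitute the values from your own configuration.

```python
import math
from openai import OpenAI

# Placeholder endpoint, key, and model name: substitute your own values.
client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

def embed(text: str) -> list[float]:
    # One dense vector is returned per input string.
    return client.embeddings.create(model="bge-multilingual-gemma2", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Semantically close sentences score higher than unrelated ones.
print(cosine(embed("A cat sleeps on the sofa."), embed("A kitten naps on the couch.")))
```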

## Endpoint {/* Dedicated */}

In the context of LLMs, an endpoint refers to a network-accessible URL or interface through which clients can interact with the model for inference tasks. It exposes methods for sending input data and receiving model predictions or responses.

## Error handling {/* Serverless */}

Error handling refers to the strategies and mechanisms in place to manage and respond to errors during API requests. This includes handling network issues, invalid inputs, or server-side errors. Proper error handling ensures that applications using Generative APIs can gracefully recover from failures and provide meaningful feedback to users.

Refer to [Understanding errors](/generative-apis/api-cli/understanding-errors/) for more information.
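
For instance, a minimal retry loop with exponential backoff could look like the sketch below. It assumes an OpenAI-compatible client and placeholder endpoint and model names, and treats rate-limit and server-side errors as retryable while surfacing everything else.

```python
import time
from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

def chat_with_retries(prompt: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="llama-3.1-8b-instruct",  # assumed model name
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except APIConnectionError:
            pass  # network issue: worth retrying
        except APIStatusError as err:
            if err.status_code not in (429, 500, 502, 503, 504):
                raise  # invalid input or authentication errors are not retryable
        time.sleep(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...
    raise RuntimeError("request failed after all retries")
```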

## Fine-tuning {/* Dedicated */}

Fine-tuning involves further training a pre-trained language model on domain-specific or task-specific data to improve performance on a particular task. This process often includes updating the model's parameters using a smaller, task-specific dataset.

## Few-shot prompting {/* Dedicated */}

Few-shot prompting guides a language model with only a handful of examples or prompts.
It relies on the model's ability to generalize from limited in-context input to produce coherent and contextually relevant outputs.
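
As a sketch, few-shot prompting amounts to placing a few worked exchanges before the real query; the endpoint and model name below are assumptions.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

# Two worked examples are enough to teach the expected format.
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    {"role": "user", "content": "The battery lasts for days."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "It broke after a week."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Setup was effortless and fast."},  # the actual query
]

response = client.chat.completions.create(model="llama-3.1-8b-instruct", messages=messages)
print(response.choices[0].message.content)  # expected: "positive"
```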

## Function calling {/* Serverless + Dedicated */}

Function calling allows a large language model (LLM) to interact with external tools or APIs, executing specific tasks based on user requests. The LLM identifies the appropriate function, extracts the required parameters, and returns the results as structured data, typically in JSON format.

Refer to [How to use function calling](/generative-apis/how-to/use-function-calling/) for more information.
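
The flow can be sketched with an OpenAI-compatible client as below: a tool is declared with a JSON schema, and the model replies with the function name and arguments instead of free text. The endpoint, model name, and `get_weather` tool are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

# Declare the tool the model may call, with a JSON schema for its parameters.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

# The model answers with a structured tool call rather than prose.
call = response.choices[0].message.tool_calls[0]
print(call.function.name)                   # "get_weather"
print(json.loads(call.function.arguments))  # {"city": "Paris"}
```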

## Hallucinations {/* Dedicated */}

Hallucinations in LLMs are instances where generative AI models produce responses that, while grammatically coherent, contain inaccuracies or nonsensical information. Such responses are termed "hallucinations" because the models create false or misleading content. Hallucinations can occur because of constraints in the training data, biases embedded within the models, or the complex nature of language itself.

## Inference {/* Dedicated */}

Inference is the process of deriving logical conclusions or predictions from available data. This concept involves using statistical methods, machine learning algorithms, and reasoning techniques to make decisions or draw insights based on observed patterns or evidence.
Inference is fundamental in various AI applications, including natural language processing, image recognition, and autonomous systems.

## Inter-token Latency (ITL) {/* Serverless */}

Inter-token latency (ITL) corresponds to the average time elapsed between two generated tokens. It is usually expressed in milliseconds.
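
ITL can be estimated client-side by timing the gaps between streamed chunks, as in this sketch (endpoint and model name are assumptions, and each chunk is treated as roughly one token):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)

# Record one timestamp per streamed chunk, then average the gaps.
timestamps = [time.perf_counter() for _ in stream]
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
print(f"average ITL: {sum(gaps) / len(gaps) * 1000:.1f} ms")
```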

## JSON mode {/* Serverless + Dedicated */}

JSON mode allows you to guide the language model in outputting well-structured JSON data.
To activate JSON mode, provide the `response_format` parameter with `{"type": "json_object"}`.
JSON mode is useful for applications such as chatbots or APIs, where a machine-readable format is essential for easy processing.
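
A minimal request using JSON mode might look like the following sketch (endpoint and model name assumed):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "List three European capitals in a JSON array under the key 'capitals'."}],
    response_format={"type": "json_object"},  # activates JSON mode
)

data = json.loads(response.choices[0].message.content)  # should parse cleanly
print(data["capitals"])
```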

## Large Language Models (LLMs) {/* Dedicated */}

LLMs are advanced artificial intelligence systems capable of understanding and generating human-like text on various topics.
These models, such as Llama-3, are trained on vast amounts of data to learn the patterns and structures of language, enabling them to generate coherent and contextually relevant responses to queries or prompts.
LLMs have applications in natural language processing, text generation, translation, and other tasks requiring sophisticated language understanding and production.

## Large Language Model Applications {/* Dedicated */}

LLM applications are software tools that leverage the capabilities of LLMs for various tasks, such as text generation, summarization, or translation. These applications provide user-friendly interfaces for interacting with the models and accessing their functionality.

## Node number {/* Dedicated */}

The node number (or node count) defines the number of nodes, or Instances, that are running your dedicated Generative APIs deployment. [Increasing the node number](/generative-apis/how-to/configure-autoscaling/) scales your deployment, so that it can handle more load.

## Parameters {/* Serverless */}

Parameters are settings that control the behavior and performance of generative models. These include temperature, max tokens, and top-p sampling, among others. Adjusting parameters allows users to tweak the model's output, balancing factors like creativity, accuracy, and response length to suit specific use cases.
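
For example, a single request can combine several of these parameters; the sketch below assumes an OpenAI-compatible endpoint and a placeholder model name.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
    temperature=0.8,  # higher values increase creativity
    top_p=0.9,        # sample only from the top 90% of probability mass
    max_tokens=64,    # cap the length of the response
)
print(response.choices[0].message.content)
```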

## Prompt {/* Dedicated */}

In the context of generative AI models, a prompt refers to the input provided to the model to generate a desired response.
It typically consists of a sentence, paragraph, or series of keywords or instructions that guide the model in producing text relevant to the given context or task.
The quality and specificity of the prompt greatly influence the generated output, as the model uses it to understand the user's intent and create responses accordingly.

## Prompt Engineering {/* Serverless */}

Prompt engineering involves crafting specific and well-structured inputs (prompts) to guide the model toward generating the desired output. Effective prompt design is crucial for generating relevant responses, particularly in complex or creative tasks. It often requires experimentation to find the right balance between specificity and flexibility.

## Quantization {/* Dedicated */}

Quantization is a technique used to reduce the precision of numerical values in a model's parameters or activations to improve efficiency and reduce memory footprint during inference. It involves representing floating-point values with fewer bits while minimizing the loss of accuracy.
AI models provided for deployment are named with suffixes that denote their quantization levels, such as `:int8`, `:fp8`, and `:fp16`.
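
The idea can be illustrated with a toy symmetric int8 scheme, a deliberate simplification of what inference engines actually do:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Map the float range onto the signed 8-bit range [-127, 127] with one scale factor.
    scale = float(np.abs(weights).max()) / 127.0
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized.astype(np.float32) * scale

weights = np.random.randn(4).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights)               # original fp32 values
print(dequantize(q, scale))  # close, but slightly less precise
```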

## Retrieval Augmented Generation (RAG) {/* Serverless + Dedicated */}

Retrieval Augmented Generation (RAG) is a technique that combines information retrieval with language generation to enhance the capabilities of LLMs. By fetching relevant data from external sources before generating a response, RAG ensures that the output is more accurate and contextually relevant, especially in scenarios requiring up-to-date or specific information.
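
A minimal RAG loop, sketched below, embeds a small document set, retrieves the closest match to the question, and injects it into the prompt. The endpoint, model names, and documents are illustrative assumptions; real systems would use a vector database and normalized similarity.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

documents = [
    "Our support desk is open Monday to Friday, 9:00 to 18:00 CET.",
    "Invoices are issued on the first day of each month.",
]

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="bge-multilingual-gemma2", input=text).data[0].embedding

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

question = "When can I reach support?"
q_vec = embed(question)

# Retrieve: pick the document whose embedding is closest to the question's.
best = max(documents, key=lambda doc: dot(embed(doc), q_vec))

# Augment: prepend the retrieved context to the prompt before generating.
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": f"Context: {best}\n\nQuestion: {question}"}],
)
print(response.choices[0].message.content)
```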

## Stop words {/* Serverless */}

Stop words are a parameter that tells the model to stop generating further tokens once any of the chosen strings has been produced. This is useful for controlling where the model output ends, as generation is cut off at the first occurrence of any of these strings.
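
For example, passing a `stop` parameter ends generation at the first matching string; the endpoint and model name below are assumptions.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Count upward from 1, one number per line."}],
    stop=["5"],  # generation halts at the first occurrence of "5"
)
print(response.choices[0].message.content)  # ends before "5" appears
```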

## Streaming {/* Serverless */}

Streaming is a parameter that allows responses to be delivered in real time, showing parts of the output as they are generated rather than waiting for the full response. Scaleway follows the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events) standard. This behavior usually enhances the user experience by providing immediate feedback and a more interactive conversation.
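
With an OpenAI-compatible client, streaming is enabled with a single flag and the chunks are consumed as they arrive (endpoint and model name assumed):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,  # deliver the response as server-sent events
)

for chunk in stream:
    # Each chunk carries a small delta of the final text.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```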

## Structured outputs {/* Serverless + Dedicated */}

Structured outputs enable you to format the model's responses to suit specific use cases. To activate structured outputs, provide the `response_format` parameter with `"type": "json_schema"` and define its `"json_schema": {}`.
By customizing the structure, such as using lists, tables, or key-value pairs, you ensure that the data returned is in a form that is easy to extract and process.
By specifying the expected response format through the API, you can make the model consistently deliver the output your system requires.

Refer to [How to use structured outputs](/generative-apis/how-to/use-structured-outputs/) for more information.
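
As a sketch, the request below asks for output matching a small JSON schema; the endpoint, model name, and schema contents are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

# A small schema describing the expected shape of the answer.
schema = {
    "name": "book",
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
        },
        "required": ["title", "year"],
    },
}

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Give me one classic science-fiction novel."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(json.loads(response.choices[0].message.content))  # e.g. {"title": ..., "year": ...}
```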

## Temperature {/* Serverless */}

Temperature is a parameter that controls the randomness of the model's output during text generation. A higher temperature produces more creative and diverse outputs, while a lower temperature makes the model's responses more deterministic and focused. Adjusting the temperature allows users to balance creativity with coherence in the generated text.
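
The effect is easy to observe by sending the same prompt at two temperatures (endpoint and model name assumed):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

for temperature in (0.0, 1.0):
    response = client.chat.completions.create(
        model="llama-3.1-8b-instruct",  # assumed model name
        messages=[{"role": "user", "content": "Name a color."}],
        temperature=temperature,  # 0.0 is near-deterministic, 1.0 is more varied
    )
    print(temperature, response.choices[0].message.content)
```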

## Time to First Token (TTFT) {/* Serverless */}

Time to First Token (TTFT) measures the time elapsed from the moment a request is made to the point when the first token of the generated text is returned. TTFT is a crucial performance metric for evaluating the responsiveness of generative models, especially in interactive applications where users expect immediate feedback.
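
TTFT can be measured client-side by streaming the response and timing the arrival of the first chunk, as in this sketch (endpoint and model name assumed):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
first_chunk = next(iter(stream))  # blocks until the first token arrives
print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
```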

## Tokens {/* Serverless */}

Tokens are the basic units of text that a generative model processes. Depending on the tokenization strategy, these can be words, subwords, or even characters. The number of tokens directly affects the context window size and the computational cost of using the model. Understanding token usage is essential for optimizing API requests and managing costs effectively.
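
Each API response reports its own token consumption, which is the number that drives billing and context-window limits; the sketch below reads it back (endpoint and model name assumed).

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.scaleway.ai/v1", api_key="<SCW_SECRET_KEY>")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model name
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}],
)

usage = response.usage  # token counts for this request
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```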