25 changes: 14 additions & 11 deletions .github/workflows/ci.yml
@@ -14,7 +14,7 @@ jobs:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6
with:
fetch-depth: 0

@@ -68,7 +68,7 @@ jobs:
lint-format:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6

- name: Run Ruff
uses: chartboost/ruff-action@v1
@@ -106,17 +106,18 @@ jobs:
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}

- name: Install uv
- uses: astral-sh/setup-uv@v4
+ uses: astral-sh/setup-uv@v8.1.0
with:
- version: "latest"
+ version: "0.11.15"
+ enable-cache: false

- name: Install dependencies
working-directory: packages/prime-sandboxes
@@ -135,17 +136,18 @@ jobs:
matrix:
python-version: ["3.11", "3.12", "3.13", "3.14"]
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}

- name: Install uv
- uses: astral-sh/setup-uv@v4
+ uses: astral-sh/setup-uv@v8.1.0
with:
- version: "latest"
+ version: "0.11.15"
+ enable-cache: false

- name: Install dependencies
working-directory: packages/prime
@@ -161,17 +163,18 @@ jobs:
type-check:
runs-on: ubuntu-latest
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6

- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.11'

- name: Install uv
- uses: astral-sh/setup-uv@v4
+ uses: astral-sh/setup-uv@v8.1.0
with:
- version: "latest"
+ version: "0.11.15"
+ enable-cache: false

- name: Install dependencies
working-directory: packages/prime
35 changes: 33 additions & 2 deletions README.md
@@ -152,17 +152,22 @@ prime env push my-environment
Prime Lab connects verifiers environments to evaluations, GEPA prompt optimization, and Hosted Training. Start with `prime lab setup` to create a local workspace with starter configs, then use `prime train models` to choose a Hosted Training model with current capacity and pricing.

```bash
- # Set up a Lab workspace
+ # Set up a Lab workspace.
+ # If authenticated, setup creates an active project named after this folder.
prime lab setup
+ prime project current

# List trainable models, capacity, and token pricing
prime train models

# Generate a Hosted Training config
prime train init

- # Launch the run from the generated config
+ # Launch the run from the generated config.
+ # Runs attach to the active project by default.
prime train rl.toml
+ prime train rl.toml --project <project-id>
+ prime train rl.toml --no-project

# Inspect and manage Hosted Training runs
prime train list
@@ -171,6 +176,30 @@ prime train metrics <run-id>
prime train checkpoints <run-id>
```

Lab projects group related training runs, evaluations, and adapters. By default,
`prime lab setup` creates an active project named after the workspace folder.
Use `prime lab setup --project <project-id>` to bind an existing project,
`prime lab setup --project-name "Alphabet Sort Baselines"` to choose the default
project name, or `prime lab setup --no-project` to keep setup local-only. Later,
use `prime project use <project-id>` to switch the active workspace project, or
`prime project clear` to stop using one by default. Existing runs and adapters
support project add/remove/clear; evaluations support assign/clear.

```bash
# Manage projects
prime project list
prime project show <project-id>
prime project update <project-id> --description "Baseline alphabet sort runs"

# Attach existing artifacts
prime project assign run <run-id> <project-id>
prime project remove run <run-id> <project-id>
prime project assign adapter <adapter-id> <project-id>
prime project remove adapter <adapter-id> # clear all adapter project memberships
prime project assign eval <eval-id> <project-id>
prime project remove eval <eval-id> # clear the evaluation project
```

### GPU Resources

```bash
@@ -211,6 +240,8 @@ prime eval push

# Push specific eval directory (verifiers format)
prime eval push outputs/evals/gsm8k--gpt-4/abc123
prime eval push outputs/evals/gsm8k--gpt-4/abc123 --project <project-id>
prime eval push outputs/evals/gsm8k--gpt-4/abc123 --no-project

# Push a public evaluation (default is private)
prime eval push --public
26 changes: 25 additions & 1 deletion packages/prime-evals/README.md
@@ -37,6 +37,7 @@ eval_response = client.create_evaluation(
model_name="gpt-4o-mini",
dataset="gsm8k",
framework="verifiers",
project_id="project-id",
metadata={
"version": "1.0",
"num_examples": 10,
@@ -220,6 +221,29 @@ client.finalize_evaluation(eval_id, metrics=eval_data.get("metrics"))
print(f"Successfully pushed evaluation: {eval_id}")
```

## Project Attachment

Evaluations can be created inside a Lab project, moved to another project, or
cleared from their project. Evaluation assignment is set/clear; targeted removal
from one project is not supported for evaluations.

```python
eval_response = client.create_evaluation(
name="gsm8k-project-baseline",
environments=[{"id": "gsm8k"}],
model_name="gpt-4o-mini",
project_id="project-id",
)

eval_id = eval_response["evaluation_id"]

# Move the evaluation to another project
client.update_evaluation(eval_id, project_id="another-project-id")

# Clear the evaluation project
client.update_evaluation(eval_id, clear_project=True)
```
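
A minimal sketch of how these set/clear semantics map onto the request payload, mirroring the filtering this change adds to `update_evaluation()`. `build_update_payload` is a hypothetical standalone helper written for illustration, not part of `prime_evals`, and the idea that the server leaves omitted fields unchanged is an assumption the diff does not confirm.

```python
from typing import Any, Dict, Optional


def build_update_payload(
    *,
    project_id: Optional[str] = None,
    clear_project: bool = False,
    **fields: Any,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the filter in update_evaluation()."""
    payload: Dict[str, Any] = {**fields, "project_id": project_id}
    # Drop unset (None) fields, but keep an explicit null project_id
    # when the caller asked to clear the project.
    return {
        k: v
        for k, v in payload.items()
        if v is not None or (clear_project and k == "project_id")
    }


# Move: only the new project id is sent.
move = build_update_payload(project_id="another-project-id")

# Clear: project_id is sent as an explicit None (JSON null).
clear = build_update_payload(clear_project=True)

# Unrelated update: project_id is omitted from the payload entirely.
rename = build_update_payload(name="gsm8k-v2")

# Both given: the non-null project_id survives the filter.
both = build_update_payload(project_id="p-1", clear_project=True)
```

With this filter, passing both `project_id` and `clear_project=True` keeps the non-null `project_id`, so the request reads as a move rather than a clear.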

## API Reference

### EvalsClient
@@ -232,6 +256,7 @@ Main client for interacting with the Prime Evals API.
- `push_samples()` - Push evaluation samples
- `finalize_evaluation()` - Finalize an evaluation with final metrics
- `get_evaluation()` - Get evaluation details by ID
- `update_evaluation()` - Update evaluation details or assign/clear a project
- `list_evaluations()` - List evaluations with optional filters
- `get_samples()` - Get samples for an evaluation

@@ -276,4 +301,3 @@ except EvalsAPIError as e:
## License

MIT License - see LICENSE file for details

26 changes: 22 additions & 4 deletions packages/prime-evals/src/prime_evals/evals.py
@@ -161,6 +161,7 @@ def create_evaluation(
task_type: Optional[str] = None,
description: Optional[str] = None,
tags: Optional[List[str]] = None,
project_id: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None,
metrics: Optional[Dict[str, Any]] = None,
is_public: Optional[bool] = None,
@@ -204,6 +205,7 @@ def create_evaluation(
"task_type": task_type,
"description": description,
"tags": tags or [],
"project_id": project_id,
"metadata": metadata,
"metrics": metrics,
}
@@ -367,6 +369,8 @@ def update_evaluation(
task_type: Optional[str] = None,
description: Optional[str] = None,
tags: Optional[List[str]] = None,
project_id: Optional[str] = None,
clear_project: bool = False,
metadata: Optional[Dict[str, Any]] = None,
metrics: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
@@ -377,11 +381,16 @@ def update_evaluation(
"framework": framework,
"task_type": task_type,
"description": description,
- "tags": tags if tags is not None else [],
+ "tags": tags,
+ "project_id": project_id,
"metadata": metadata,
"metrics": metrics,
}
- payload = {k: v for k, v in payload.items() if v is not None or k in ["tags"]}
+ payload = {
+     k: v
+     for k, v in payload.items()
+     if v is not None or (clear_project and k == "project_id")
+ }

response = self.client.request("PUT", f"/evaluations/{evaluation_id}", json=payload)
return response
@@ -519,6 +528,7 @@ async def create_evaluation(
task_type: Optional[str] = None,
description: Optional[str] = None,
tags: Optional[List[str]] = None,
project_id: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None,
metrics: Optional[Dict[str, Any]] = None,
is_public: Optional[bool] = None,
@@ -562,6 +572,7 @@ async def create_evaluation(
"task_type": task_type,
"description": description,
"tags": tags or [],
"project_id": project_id,
"metadata": metadata,
"metrics": metrics,
}
@@ -719,6 +730,8 @@ async def update_evaluation(
task_type: Optional[str] = None,
description: Optional[str] = None,
tags: Optional[List[str]] = None,
project_id: Optional[str] = None,
clear_project: bool = False,
metadata: Optional[Dict[str, Any]] = None,
metrics: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
@@ -729,11 +742,16 @@ async def update_evaluation(
"framework": framework,
"task_type": task_type,
"description": description,
- "tags": tags if tags is not None else [],
+ "tags": tags,
+ "project_id": project_id,
"metadata": metadata,
"metrics": metrics,
}
- payload = {k: v for k, v in payload.items() if v is not None or k in ["tags"]}
+ payload = {
+     k: v
+     for k, v in payload.items()
+     if v is not None or (clear_project and k == "project_id")
+ }

response = await self.client.request("PUT", f"/evaluations/{evaluation_id}", json=payload)
return response
2 changes: 2 additions & 0 deletions packages/prime-evals/src/prime_evals/models.py
@@ -32,6 +32,7 @@ class Evaluation(BaseModel):
run_id: Optional[str] = Field(None, alias="runId")
version_id: Optional[str] = Field(None, alias="versionId")
tags: List[str] = Field(default_factory=list)
project_id: Optional[str] = Field(None, alias="projectId")
metadata: Optional[Dict[str, Any]] = None
metrics: Optional[Dict[str, Any]] = None
total_samples: Optional[int] = Field(None, alias="totalSamples")
@@ -66,6 +67,7 @@ class CreateEvaluationRequest(BaseModel):
task_type: Optional[str] = None
description: Optional[str] = None
tags: List[str] = Field(default_factory=list)
project_id: Optional[str] = None
metadata: Optional[Dict[str, Any]] = None
metrics: Optional[Dict[str, Any]] = None

55 changes: 55 additions & 0 deletions packages/prime-evals/tests/test_evals.py
@@ -232,6 +232,61 @@ def test_evals_client_context_manager():
pass # Expected to fail without proper initialization


def test_create_evaluation_sends_project_id_payload():
captured = {}

class DummyConfig:
team_id = None

class DummyHTTPClient:
config = DummyConfig()

def request(self, method, endpoint, json=None, params=None):
captured["method"] = method
captured["endpoint"] = endpoint
captured["json"] = json
captured["params"] = params
return {"evaluation_id": "eval-123"}

client = EvalsClient.__new__(EvalsClient)
client.client = DummyHTTPClient()

response = client.create_evaluation(
name="gsm8k",
run_id="run-123",
model_name="gpt-4o-mini",
project_id="project-123",
)

assert response == {"evaluation_id": "eval-123"}
assert captured["method"] == "POST"
assert captured["endpoint"] == "/evaluations/"
assert captured["json"]["project_id"] == "project-123"
assert "projectId" not in captured["json"]


def test_update_evaluation_clear_project_sends_null_project_id():
captured = {}

class DummyHTTPClient:
def request(self, method, endpoint, json=None, params=None):
captured["method"] = method
captured["endpoint"] = endpoint
captured["json"] = json
captured["params"] = params
return {"evaluation_id": "eval-123"}

client = EvalsClient.__new__(EvalsClient)
client.client = DummyHTTPClient()

response = client.update_evaluation("eval-123", clear_project=True)

assert response == {"evaluation_id": "eval-123"}
assert captured["method"] == "PUT"
assert captured["endpoint"] == "/evaluations/eval-123"
assert captured["json"] == {"project_id": None}
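
The async client receives the same payload filtering, while the two tests above exercise only the sync path. A rough sketch of an equivalent async check follows. It assumes a stand-in coroutine (`update_evaluation_sketch`) and a recording `DummyAsyncHTTPClient`; both names are hypothetical and reproduce the diff's filter rather than importing the real async client.

```python
import asyncio
from typing import Any, Dict, List, Optional


class DummyAsyncHTTPClient:
    """Hypothetical test double: records the last request instead of sending it."""

    def __init__(self) -> None:
        self.captured: Dict[str, Any] = {}

    async def request(
        self,
        method: str,
        endpoint: str,
        json: Optional[Dict[str, Any]] = None,
        params: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        self.captured = {
            "method": method,
            "endpoint": endpoint,
            "json": json,
            "params": params,
        }
        return {"evaluation_id": "eval-123"}


async def update_evaluation_sketch(
    http: DummyAsyncHTTPClient,
    evaluation_id: str,
    *,
    tags: Optional[List[str]] = None,
    project_id: Optional[str] = None,
    clear_project: bool = False,
) -> Dict[str, Any]:
    """Stand-in for the async update_evaluation(); not the real client."""
    payload = {"tags": tags, "project_id": project_id}
    # Same filter as in the diff: drop None values unless clearing the project.
    payload = {
        k: v
        for k, v in payload.items()
        if v is not None or (clear_project and k == "project_id")
    }
    return await http.request("PUT", f"/evaluations/{evaluation_id}", json=payload)


http = DummyAsyncHTTPClient()
response = asyncio.run(update_evaluation_sketch(http, "eval-123", clear_project=True))
```

Because `tags` is no longer forced to `[]`, the captured payload here contains only `project_id`.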


def test_evaluation_model_minimal():
"""Test Evaluation model with minimal data"""
data = {