
Support vector-valued combined_score for multi-objective optimization#458

Open
yuxuan-z19 wants to merge 3 commits into algorithmicsuperintelligence:main from yuxuan-z19:zyx-moe-lb

Conversation

Contributor

@yuxuan-z19 yuxuan-z19 commented Apr 22, 2026

This PR introduces support for vector-valued combined_score (e.g., tuple[float, ...]) and generalizes its handling across the OpenEvolve pipeline.

Related issues:


Changes

Fitness Representation

  • combined_score now supports: float | tuple | list
  • Removed implicit casting to float

Comparison Semantics

  • All comparisons and deltas are handled via NumPy:
    • np.subtract
    • np.allclose
  • Enables consistent element-wise behavior for non-scalar scores
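A minimal sketch of what this comparison path looks like, using the np.subtract / np.allclose operations named above (the helper names here are hypothetical; the PR's internal functions may differ):

```python
import numpy as np

def score_delta(new_score, old_score):
    """Delta between two scores, element-wise for vectors (hypothetical helper)."""
    return np.subtract(new_score, old_score)

def scores_equal(a, b, rtol=1e-5, atol=1e-8):
    """True when two scores match within tolerance, element-wise."""
    return bool(np.allclose(a, b, rtol=rtol, atol=atol))

# The same code path handles scalar and tuple scores:
print(score_delta(0.9, 0.7))                 # ≈ 0.2
print(score_delta((0.9, 0.5), (0.7, 0.4)))   # ≈ [0.2, 0.1], element-wise
print(scores_equal((0.9, 0.5), (0.9, 0.5)))  # True
```

Because np.subtract and np.allclose accept any array-like, no isinstance branching on float vs. tuple is needed at the call sites.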

Early Stopping

  • Thresholds are broadcast to match vector-valued scores
  • Event-based stopping uses np.allclose() for convergence checks
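The broadcast-then-compare pattern for thresholds can be sketched as follows (illustrative only; `reached_threshold` is a made-up name, not the PR's actual function):

```python
import numpy as np

def reached_threshold(score, threshold):
    """Broadcast a scalar threshold to the score's shape, then compare.

    A scalar score keeps scalar semantics; a vector score requires every
    objective to clear the threshold.
    """
    score = np.asarray(score, dtype=float)
    threshold = np.broadcast_to(np.asarray(threshold, dtype=float), score.shape)
    return bool(np.all(score >= threshold))

print(reached_threshold(0.95, 0.9))          # True
print(reached_threshold((0.95, 0.80), 0.9))  # False: second objective below 0.9
```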

Logging and Formatting

  • Introduced format_score() for unified scalar/vector formatting
  • Removed assumptions of scalar formatting (%.4f, :.4f)
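The unified formatter might look roughly like this (the name format_score comes from the PR; this signature and precision default are assumptions, not the actual code in openevolve/utils/format_utils.py):

```python
def format_score(score, precision=4):
    """Format a scalar or vector combined_score for logs and prompts."""
    if isinstance(score, (tuple, list)):
        inner = ", ".join(f"{s:.{precision}f}" for s in score)
        return f"({inner})"
    return f"{score:.{precision}f}"

print(format_score(0.87654321))        # 0.8765
print(format_score((0.9, 0.1234567)))  # (0.9000, 0.1235)
```

Centralizing this removes the scattered %.4f / :.4f format strings that silently assumed a float.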

Aggregation / Statistics

  • Island statistics (e.g., mean) are computed element-wise for tuple/list scores
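Element-wise aggregation falls out naturally from stacking scores along axis 0 (a sketch under the assumption that all scores in an island share one shape; not the PR's literal code):

```python
import numpy as np

def island_mean(scores):
    """Mean over an island's scores: per-objective for vector scores."""
    return np.mean(np.asarray(scores, dtype=float), axis=0)

print(island_mean([0.2, 0.4, 0.6]))           # ≈ 0.4
print(island_mean([(0.2, 1.0), (0.4, 0.0)]))  # ≈ [0.3, 0.5], per objective
```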

Pipeline Integration

  • Changes are applied across:
    • evolution loop
    • logging
    • prompt rendering

Testing

  • Added/updated unit tests covering tuple/list combined_score
  • All existing tests pass

Backward Compatibility

  • Scalar combined_score remains fully supported
  • Existing workflows continue to function without modification

  • Comparison logic is extended to support vector semantics when applicable


Fallback Behavior

When combined_score is absent, the existing fallback is preserved: safe_numeric_average(p.metrics). This remains scalar (no implicit broadcasting).

Broadcasting this value to match vector-valued comparisons was considered, but not included to avoid introducing implicit behavior. This can be revisited if stricter consistency is preferred.
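The resolution order described above can be summarized in a small sketch (hypothetical; safe_numeric_average is emulated here as a plain mean over numeric metric values, which may not match its exact semantics):

```python
import numpy as np

def fitness_of(metrics, combined_score=None):
    """Prefer combined_score (scalar or vector, passed through as-is);
    otherwise fall back to a scalar average of numeric metrics."""
    if combined_score is not None:
        return combined_score
    numeric = [v for v in metrics.values() if isinstance(v, (int, float))]
    return float(np.mean(numeric)) if numeric else 0.0
```

Note the `is not None` check: a legitimate score of 0.0 (or an all-zero tuple) must not trigger the fallback.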


Summary

This PR extends combined_score from a scalar assumption to a general vector-compatible representation, with consistent behavior across evaluation, comparison, logging, and aggregation.

Regards,
@yuxuan-z19 (w/ Baidu FM-Agent @baidubce)

Copilot AI review requested due to automatic review settings April 22, 2026 14:50

Copilot AI left a comment


Pull request overview

Adds first-class support for vector-valued combined_score (e.g., tuple[float, ...] / list[float]) to enable multi-objective optimization semantics across scoring, logging, early stopping, and island statistics.

Changes:

  • Generalize combined_score handling (fitness extraction, formatting, and prompt rendering) to support scalar and vector scores.
  • Update comparison/delta and aggregation paths to use NumPy operations for non-scalar scores.
  • Add unit tests and a MoE load-balancing example workload demonstrating tuple-based scoring.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_tuple_score.py Adds tests validating get_fitness_score() behavior for float/list/tuple combined_score.
tests/test_database.py Adds tests for island stats aggregation with float/list/tuple scores; minor formatting cleanups.
openevolve/utils/metrics_utils.py Returns list/tuple combined_score as-is; adds broadcast_value() helper.
openevolve/utils/format_utils.py Introduces format_score() and uses NumPy subtraction for improvement formatting.
openevolve/prompts/defaults/fragments.json Removes scalar-only :.4f formatting placeholders for fitness fragments.
openevolve/prompt/sampler.py Uses format_score() in prompt rendering; adds NumPy allclose for “stable” fitness; broadcasts thresholds for vector scores.
openevolve/process_parallel.py Updates early-stopping logging/threshold handling to support vector-valued scores.
openevolve/database.py Uses format_score() for logging; uses np.subtract for score diffs; computes island averages element-wise via NumPy.
openevolve/api.py Widens EvolutionResult.best_score type and updates __repr__ formatting via format_score().
examples/moe_lb/run.sh Adds a runnable script for the MoE load-balancing example.
examples/moe_lb/requirements.txt Adds dependencies for the MoE load-balancing example.
examples/moe_lb/initial_program.py Provides a baseline implementation for the MoE load-balancing task.
examples/moe_lb/evaluator.py Adds an evaluator emitting tuple combined_score for multi-objective scoring.
examples/moe_lb/config.yaml Adds an example config emphasizing lexicographic multi-objective scoring.
examples/moe_lb/archives/task_config.yaml.bak Stores archived upstream task config reference.
examples/moe_lb/archives/prompt_multi_obj.md Archived prompt text for multi-objective variant.
examples/moe_lb/archives/prompt_lp.md Archived prompt text for LP-focused variant.
examples/moe_lb/archives/prompt_common.md Archived prompt text for common/heuristic variant.
examples/moe_lb/archives/loongflow_best_lp.py Archived “best” LP solution reference implementation.
examples/moe_lb/archives/loongflow_best_common.py Archived “best” heuristic solution reference implementation.
.pre-commit-config.yaml Updates isort and black hook revisions.

Comment on lines +548 to +553
  HIGH_SCORE = broadcast_value(0.8, score)
  MED_SCORE = broadcast_value(0.6, score)
  LOW_SCORE = broadcast_value(0.4, score)

  # Classify based on score ranges
- if score >= 0.8:
+ if score >= HIGH_SCORE:

Copilot AI Apr 22, 2026


For vector-valued score, the score >= HIGH_SCORE / MED_SCORE / LOW_SCORE checks rely on Python sequence comparisons, which are lexicographic rather than element-wise. This can misclassify programs (e.g., a high first objective but very low secondary objectives still counts as “high performer”). If the intent is element-wise thresholds, use NumPy comparisons (e.g., np.all(np.greater_equal(score, HIGH_SCORE)) after coercing to arrays); if the intent is lexicographic ordering, consider comparing only the primary objective explicitly to make the rule unambiguous.
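The element-wise interpretation suggested in this review could be sketched like this (illustrative; the threshold values come from the diff above, the function name is made up):

```python
import numpy as np

def classify(score, high=0.8, med=0.6):
    """Classify a scalar or vector score by element-wise thresholds.

    np.all(...) requires *every* objective to clear a threshold, avoiding
    Python's lexicographic tuple comparison, under which (0.9, 0.1) would
    wrongly count as a high performer.
    """
    s = np.asarray(score, dtype=float)
    if np.all(s >= high):
        return "high"
    if np.all(s >= med):
        return "medium"
    return "low"

print(classify(0.9))          # high
print(classify((0.9, 0.65)))  # medium: second objective below 0.8
print(classify((0.9, 0.1)))   # low
```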

Additional comment threads: openevolve/database.py, openevolve/utils/metrics_utils.py, examples/moe_lb/config.yaml, openevolve/process_parallel.py
Comment on lines 751 to 754
  else:
      # Event-based early stopping
-     if current_score == self.config.convergence_threshold:
+     if np.allclose(current_score, convergence_threshold):
          best_score = current_score

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convergence_threshold is defined only inside the early_stopping_patience > 0 branch, but it’s referenced in the event-based early stopping branch. If early_stopping_patience <= 0, this will raise an UnboundLocalError at runtime. Define/broadcast convergence_threshold before the patience check (or reference self.config.convergence_threshold directly) so both modes work.
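The fix suggested here amounts to hoisting the broadcast above the patience branch, so both stopping modes see a threshold shaped like the score (a sketch of the suggested structure, not the PR's actual code):

```python
import numpy as np

def should_stop(current_score, convergence_threshold, patience):
    """Decide early stopping for scalar or vector scores.

    The threshold is defined and broadcast *before* branching on patience,
    so the event-based branch can never hit an UnboundLocalError.
    """
    score = np.asarray(current_score, dtype=float)
    threshold = np.broadcast_to(
        np.asarray(convergence_threshold, dtype=float), score.shape
    )
    if patience > 0:
        # Patience-based: stop once every objective clears its threshold
        return bool(np.all(score >= threshold))
    # Event-based: stop on convergence to the threshold
    return bool(np.allclose(score, threshold))
```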



Development

Successfully merging this pull request may close these issues.

Use multiple metrics in more of a Pareto optimization way? Automatic stopping when a certain metric reaches a certain value?

2 participants