
Support vector-valued combined_score for multi-objective optimization#458

Open
yuxuan-z19 wants to merge 3 commits into algorithmicsuperintelligence:main from yuxuan-z19:zyx-moe-lb

Conversation

Contributor

@yuxuan-z19 yuxuan-z19 commented Apr 22, 2026

This PR introduces support for vector-valued combined_score (e.g., tuple[float, ...]) and generalizes its handling across the OpenEvolve pipeline.

Related issues:


Changes

Fitness Representation

  • combined_score now supports: float | tuple | list
  • Removed implicit casting to float

Comparison Semantics

  • All comparisons and deltas are handled via NumPy:
    • np.subtract
    • np.allclose
  • Enables consistent element-wise behavior for non-scalar scores
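A minimal sketch of what this comparison path looks like, using the np.subtract / np.allclose operations named above (the helper names here are hypothetical; the PR's internal functions may differ):

```python
import numpy as np

def score_delta(new_score, old_score):
    """Delta between two scores, element-wise for vectors (hypothetical helper)."""
    return np.subtract(new_score, old_score)

def scores_equal(a, b, rtol=1e-5, atol=1e-8):
    """True when two scores match within tolerance, element-wise."""
    return bool(np.allclose(a, b, rtol=rtol, atol=atol))

# The same code path handles scalar and tuple scores:
print(score_delta(0.9, 0.7))                 # ≈ 0.2
print(score_delta((0.9, 0.5), (0.7, 0.4)))   # ≈ [0.2, 0.1], element-wise
print(scores_equal((0.9, 0.5), (0.9, 0.5)))  # True
```

Because np.subtract and np.allclose accept any array-like, no isinstance branching on float vs. tuple is needed at the call sites.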

Early Stopping

  • Thresholds are broadcast to match vector-valued scores
  • Event-based stopping uses np.allclose() for convergence checks
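The broadcast-then-compare pattern for thresholds can be sketched as follows (illustrative only; `reached_threshold` is a made-up name, not the PR's actual function):

```python
import numpy as np

def reached_threshold(score, threshold):
    """Broadcast a scalar threshold to the score's shape, then compare.

    A scalar score keeps scalar semantics; a vector score requires every
    objective to clear the threshold.
    """
    score = np.asarray(score, dtype=float)
    threshold = np.broadcast_to(np.asarray(threshold, dtype=float), score.shape)
    return bool(np.all(score >= threshold))

print(reached_threshold(0.95, 0.9))          # True
print(reached_threshold((0.95, 0.80), 0.9))  # False: second objective below 0.9
```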

Logging and Formatting

  • Introduced format_score() for unified scalar/vector formatting
  • Removed assumptions of scalar formatting (%.4f, :.4f)
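The unified formatter might look roughly like this (the name format_score comes from the PR; this signature and precision default are assumptions, not the actual code in openevolve/utils/format_utils.py):

```python
def format_score(score, precision=4):
    """Format a scalar or vector combined_score for logs and prompts."""
    if isinstance(score, (tuple, list)):
        inner = ", ".join(f"{s:.{precision}f}" for s in score)
        return f"({inner})"
    return f"{score:.{precision}f}"

print(format_score(0.87654321))        # 0.8765
print(format_score((0.9, 0.1234567)))  # (0.9000, 0.1235)
```

Centralizing this removes the scattered %.4f / :.4f format strings that silently assumed a float.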

Aggregation / Statistics

  • Island statistics (e.g., mean) are computed element-wise for tuple/list scores
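Element-wise aggregation falls out naturally from stacking scores along axis 0 (a sketch under the assumption that all scores in an island share one shape; not the PR's literal code):

```python
import numpy as np

def island_mean(scores):
    """Mean over an island's scores: per-objective for vector scores."""
    return np.mean(np.asarray(scores, dtype=float), axis=0)

print(island_mean([0.2, 0.4, 0.6]))           # ≈ 0.4
print(island_mean([(0.2, 1.0), (0.4, 0.0)]))  # ≈ [0.3, 0.5], per objective
```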

Pipeline Integration

  • Changes are applied across:
    • evolution loop
    • logging
    • prompt rendering

Testing

  • Added/updated unit tests covering tuple/list combined_score
  • All existing tests pass

Backward Compatibility

  • Scalar combined_score remains fully supported
  • Existing workflows continue to function without modification

  • Comparison logic is extended to support vector semantics when applicable


Fallback Behavior

When combined_score is absent, the existing fallback is preserved: safe_numeric_average(p.metrics). This remains scalar (no implicit broadcasting).

Broadcasting this value to match vector-valued comparisons was considered, but not included to avoid introducing implicit behavior. This can be revisited if stricter consistency is preferred.
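The resolution order described above can be summarized in a small sketch (hypothetical; safe_numeric_average is emulated here as a plain mean over numeric metric values, which may not match its exact semantics):

```python
import numpy as np

def fitness_of(metrics, combined_score=None):
    """Prefer combined_score (scalar or vector, passed through as-is);
    otherwise fall back to a scalar average of numeric metrics."""
    if combined_score is not None:
        return combined_score
    numeric = [v for v in metrics.values() if isinstance(v, (int, float))]
    return float(np.mean(numeric)) if numeric else 0.0
```

Note the `is not None` check: a legitimate score of 0.0 (or an all-zero tuple) must not trigger the fallback.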


Summary

This PR extends combined_score from a scalar assumption to a general vector-compatible representation, with consistent behavior across evaluation, comparison, logging, and aggregation.

Regards,
@yuxuan-z19 (w/ Baidu FM-Agent @baidubce)

Copilot AI review requested due to automatic review settings April 22, 2026 14:50

Copilot AI left a comment


Pull request overview

Adds first-class support for vector-valued combined_score (e.g., tuple[float, ...] / list[float]) to enable multi-objective optimization semantics across scoring, logging, early stopping, and island statistics.

Changes:

  • Generalize combined_score handling (fitness extraction, formatting, and prompt rendering) to support scalar and vector scores.
  • Update comparison/delta and aggregation paths to use NumPy operations for non-scalar scores.
  • Add unit tests and a MoE load-balancing example workload demonstrating tuple-based scoring.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_tuple_score.py Adds tests validating get_fitness_score() behavior for float/list/tuple combined_score.
tests/test_database.py Adds tests for island stats aggregation with float/list/tuple scores; minor formatting cleanups.
openevolve/utils/metrics_utils.py Returns list/tuple combined_score as-is; adds broadcast_value() helper.
openevolve/utils/format_utils.py Introduces format_score() and uses NumPy subtraction for improvement formatting.
openevolve/prompts/defaults/fragments.json Removes scalar-only :.4f formatting placeholders for fitness fragments.
openevolve/prompt/sampler.py Uses format_score() in prompt rendering; adds NumPy allclose for “stable” fitness; broadcasts thresholds for vector scores.
openevolve/process_parallel.py Updates early-stopping logging/threshold handling to support vector-valued scores.
openevolve/database.py Uses format_score() for logging; uses np.subtract for score diffs; computes island averages element-wise via NumPy.
openevolve/api.py Widens EvolutionResult.best_score type and updates __repr__ formatting via format_score().
examples/moe_lb/run.sh Adds a runnable script for the MoE load-balancing example.
examples/moe_lb/requirements.txt Adds dependencies for the MoE load-balancing example.
examples/moe_lb/initial_program.py Provides a baseline implementation for the MoE load-balancing task.
examples/moe_lb/evaluator.py Adds an evaluator emitting tuple combined_score for multi-objective scoring.
examples/moe_lb/config.yaml Adds an example config emphasizing lexicographic multi-objective scoring.
examples/moe_lb/archives/task_config.yaml.bak Stores archived upstream task config reference.
examples/moe_lb/archives/prompt_multi_obj.md Archived prompt text for multi-objective variant.
examples/moe_lb/archives/prompt_lp.md Archived prompt text for LP-focused variant.
examples/moe_lb/archives/prompt_common.md Archived prompt text for common/heuristic variant.
examples/moe_lb/archives/loongflow_best_lp.py Archived “best” LP solution reference implementation.
examples/moe_lb/archives/loongflow_best_common.py Archived “best” heuristic solution reference implementation.
.pre-commit-config.yaml Updates isort and black hook revisions.

Comment on lines +548 to +553
  HIGH_SCORE = broadcast_value(0.8, score)
  MED_SCORE = broadcast_value(0.6, score)
  LOW_SCORE = broadcast_value(0.4, score)

  # Classify based on score ranges
- if score >= 0.8:
+ if score >= HIGH_SCORE:

Copilot AI Apr 22, 2026


For vector-valued score, the score >= HIGH_SCORE / MED_SCORE / LOW_SCORE checks rely on Python sequence comparisons, which are lexicographic rather than element-wise. This can misclassify programs (e.g., a high first objective but very low secondary objectives still counts as “high performer”). If the intent is element-wise thresholds, use NumPy comparisons (e.g., np.all(np.greater_equal(score, HIGH_SCORE)) after coercing to arrays); if the intent is lexicographic ordering, consider comparing only the primary objective explicitly to make the rule unambiguous.
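The element-wise interpretation suggested in this review could be sketched like this (illustrative; the threshold values come from the diff above, the function name is made up):

```python
import numpy as np

def classify(score, high=0.8, med=0.6):
    """Classify a scalar or vector score by element-wise thresholds.

    np.all(...) requires *every* objective to clear a threshold, avoiding
    Python's lexicographic tuple comparison, under which (0.9, 0.1) would
    wrongly count as a high performer.
    """
    s = np.asarray(score, dtype=float)
    if np.all(s >= high):
        return "high"
    if np.all(s >= med):
        return "medium"
    return "low"

print(classify(0.9))          # high
print(classify((0.9, 0.65)))  # medium: second objective below 0.8
print(classify((0.9, 0.1)))   # low
```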

Additional comment threads: openevolve/database.py, openevolve/utils/metrics_utils.py, examples/moe_lb/config.yaml, openevolve/process_parallel.py
Comment on lines 751 to 754
  else:
      # Event-based early stopping
-     if current_score == self.config.convergence_threshold:
+     if np.allclose(current_score, convergence_threshold):
          best_score = current_score

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convergence_threshold is defined only inside the early_stopping_patience > 0 branch, but it’s referenced in the event-based early stopping branch. If early_stopping_patience <= 0, this will raise an UnboundLocalError at runtime. Define/broadcast convergence_threshold before the patience check (or reference self.config.convergence_threshold directly) so both modes work.
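The fix suggested here amounts to hoisting the broadcast above the patience branch, so both stopping modes see a threshold shaped like the score (a sketch of the suggested structure, not the PR's actual code):

```python
import numpy as np

def should_stop(current_score, convergence_threshold, patience):
    """Decide early stopping for scalar or vector scores.

    The threshold is defined and broadcast *before* branching on patience,
    so the event-based branch can never hit an UnboundLocalError.
    """
    score = np.asarray(current_score, dtype=float)
    threshold = np.broadcast_to(
        np.asarray(convergence_threshold, dtype=float), score.shape
    )
    if patience > 0:
        # Patience-based: stop once every objective clears its threshold
        return bool(np.all(score >= threshold))
    # Event-based: stop on convergence to the threshold
    return bool(np.allclose(score, threshold))
```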



Development

Successfully merging this pull request may close these issues.

Use multiple metrics in more of a Pareto optimization way? Automatic stopping when a certain metric reaches a certain value?

2 participants