
feat: migrate to Talos v1.13 machinery and gate config against target talosVersion #213

@lexfrei

Description

Summary

talm pins its Talos machinery to a cozystack/talos fork snapshot from 2026-01-26, which predates Talos v1.13. As a result, config documents introduced in v1.13 are missing from the document registry and fail to load with the error "<Kind>" "v1alpha1": not registered. The first user-reported case is RoutingRuleConfig (Linux policy routing, needed for multi-NIC setups), but seven other documents are affected as well.

This epic tracks two things:

  1. Migrating the machinery dependency to Talos v1.13 so these documents load.
  2. Adding version-skew gates so a document or field introduced in a newer Talos release cannot silently leak into a config that targets an older talosVersion (e.g. a v1.13-only document in a config aimed at v1.11), surfacing a clear error at talm template / talm apply time instead of a cryptic node-side rejection.

Current state

  • go.mod nominally requires pkg/machinery v1.13.0-beta.1, but a replace overrides it with cozystack/talos[/pkg/machinery] v0.0.0-20260126122716-d18a185e3680 (2026-01-26).
  • That fork snapshot is a pre-v1.13 upstream tree plus a single patch: --skip-verify (upstream PR #12652, which upstream declined, so the fork is required for as long as we need the flag).
  • Upstream added routing rules on 2026-03-13; the change shipped in v1.13.0-beta.1 (2026-03-27) and is present through v1.13.3 (2026-05-25). The fork is ~7 weeks older than that commit, so the documents below are simply absent.

What we gain from v1.13

Re-basing the fork onto v1.13.3 registers these documents automatically (the document registry drives loading):

Kind                     Purpose
RoutingRuleConfig        Linux policy routing rules (multi-NIC, source-based routing)
VRFConfig                Virtual routing and forwarding
BlackholeRouteConfig     Blackhole routes
KubeSpanConfig           KubeSpan as a standalone document
TCPProbeConfig           TCP health probe
ExternalVolumeConfig     External volumes
EnvironmentConfig        Environment variables as a document
ImageVerificationConfig  Image signature verification
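
To illustrate why a re-base is enough to make these Kinds load, here is a toy model of a document registry. It is a sketch only: the registry type, its register/lookup methods, and the routingRuleConfig placeholder are hypothetical and are not the Talos machinery API. The only detail taken from this issue is the shape of the error message.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a document registry: it maps a
// kind/apiVersion pair to a constructor. It is NOT the Talos machinery API.
type registry map[string]func() any

func key(kind, version string) string { return kind + "/" + version }

// register adds a constructor for the given kind and version.
func (r registry) register(kind, version string, ctor func() any) {
	r[key(kind, version)] = ctor
}

// lookup returns a fresh document, or an error shaped like the one quoted
// in the Summary when the kind was never registered.
func (r registry) lookup(kind, version string) (any, error) {
	ctor, ok := r[key(kind, version)]
	if !ok {
		return nil, fmt.Errorf("%q %q: not registered", kind, version)
	}
	return ctor(), nil
}

// routingRuleConfig is a placeholder type; the real document's fields are omitted.
type routingRuleConfig struct{}

func main() {
	// Models the 2026-01-26 snapshot: the kind is simply absent.
	old := registry{}
	_, err := old.lookup("RoutingRuleConfig", "v1alpha1")
	fmt.Println("old snapshot:", err)

	// Models the re-based fork: the kind registers itself, so loading works
	// without any change on the caller's side.
	rebased := registry{}
	rebased.register("RoutingRuleConfig", "v1alpha1", func() any { return &routingRuleConfig{} })
	doc, err := rebased.lookup("RoutingRuleConfig", "v1alpha1")
	fmt.Printf("re-based: %T, err=%v\n", doc, err)
}
```

In this model the loader never changes; only the set of registered constructors does, which is the sense in which the registry "drives loading".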

Workstreams

A. Re-derive the fork on a stable base

  • Re-derive cozystack/talos from upstream tag v1.13.3 and cherry-pick the --skip-verify patch on top, replacing the long-lived diverged branch that drifted to 2026-01-26.
  • Publish the new revision (tag / pseudo-version) for both talos and talos/pkg/machinery.
  • Policy: re-base onto the latest stable v1.13 patch tag rather than tracking main (alpha), so talm's machinery line matches the Talos line actually running on nodes.

B. Bump talm

  • Update the require and both replace lines in go.mod to the new fork revision; go mod tidy; confirm the build.
  • Confirm RoutingRuleConfig and the other seven documents load via the engine path.

C. Version-skew gates

  • Gate rendered config against the project's target talosVersion: reject (or warn on) any document/field that does not exist in the target release, with an actionable hint, instead of letting the node reject it.
  • Primary direction: newer document/field in an older-targeted config (v1.13 document in a v1.11 target).
  • Same skew family as the previously observed grubUseUKICmdline rejection on older targets.
  • Design decision to resolve in this issue: how to source per-Kind / per-field minimum-version data (machinery metadata vs a maintained compatibility matrix in talm), and whether gating is document-level only or also field-level.
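
As a starting point for that discussion, a document-level gate backed by a maintained matrix could look roughly like the sketch below. Everything here is an assumption for illustration: the minVersion map, parseMajorMinor, and checkKinds are hypothetical names, the comparison is at major.minor granularity only (so v1.13.0-beta.1 counts as v1.13), and Kinds missing from the matrix are treated as available everywhere. The eight entries reflect the documents this issue lists as introduced in v1.13.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minVersion is a hypothetical compatibility matrix: the first Talos release
// line in which each document Kind exists.
var minVersion = map[string]string{
	"RoutingRuleConfig":       "v1.13",
	"VRFConfig":               "v1.13",
	"BlackholeRouteConfig":    "v1.13",
	"KubeSpanConfig":          "v1.13",
	"TCPProbeConfig":          "v1.13",
	"ExternalVolumeConfig":    "v1.13",
	"EnvironmentConfig":       "v1.13",
	"ImageVerificationConfig": "v1.13",
}

// parseMajorMinor parses "v1.13", "1.13" or "v1.13.3" into (major, minor).
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("invalid version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid version %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid version %q: %w", v, err)
	}
	return major, minor, nil
}

// checkKinds returns an actionable error for the first Kind that does not
// exist in the target release, or nil when every Kind is allowed.
func checkKinds(target string, kinds []string) error {
	tMaj, tMin, err := parseMajorMinor(target)
	if err != nil {
		return err
	}
	for _, kind := range kinds {
		need, gated := minVersion[kind]
		if !gated {
			continue // not in the matrix: assumed available on all targets
		}
		nMaj, nMin, err := parseMajorMinor(need)
		if err != nil {
			return err
		}
		if tMaj < nMaj || (tMaj == nMaj && tMin < nMin) {
			return fmt.Errorf("%s requires Talos %s or newer, but talosVersion is %s: remove the document or raise talosVersion",
				kind, need, target)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkKinds("v1.11", []string{"RoutingRuleConfig"}))
	fmt.Println(checkKinds("v1.13", []string{"RoutingRuleConfig", "VRFConfig"}))
}
```

A matrix like this is easy to test and to read, but it has to be maintained by hand; sourcing the same data from machinery metadata would avoid drift at the cost of depending on what upstream exposes. Field-level gating (the grubUseUKICmdline case) would need a richer key than the Kind alone.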

D. Tests

  • Contract tests covering each newly-registered Kind through the engine (the custom LookupFunc path that unit charts can't reach).
  • Unit tests for the version-skew gate (both pass and reject cases).
  • Extend docs/manual-test-plan.md in the same commits as the code (forward-looking "do X, expect Y" steps).
  • Validate end-to-end on a live Talos v1.13 cluster.

E. Docs

  • Update operator-facing docs (cozystack/website) for the newly supported documents and the gate behavior.

F. Optional / low priority

  • Investigate whether --skip-verify can be upstreamed or replaced, which would let us drop the fork entirely and depend on plain upstream machinery.

Open questions

  • Source of per-Kind/per-field minimum-version data for the gate (machinery vs maintained matrix).
  • Gate depth: document-level vs field-level.
  • Cadence for re-basing the fork on new v1.13 patch releases.

Labels: area/engine, area/networking, kind/api-change, kind/feature, priority/important-soon