feat(repo): add scripts to synthesize and consume azl repodata #17139

Open

reubeno wants to merge 1 commit into microsoft:tomls/base/main from reubeno:just-repo-tools

Conversation


@reubeno reubeno commented May 11, 2026

Adds three new tools under `scripts/repo/` for working with Azure Linux package repositories:

* `synthesize-repodata.py` — given one or more upstream repo prefixes (Standard Azure Linux Repo Layout: per-channel main/debuginfo/srpms sub-repos) and/or explicit per-repo overrides, synthesize a fresh set of per-destination repodata trees that route each package to its intended channel based on azldev component metadata. Local repo overrides take precedence over upstream when NEVRAs collide (CLI order is preserved). The output is a static directory tree of standard `createrepo_c` repodata with absolute upstream URLs in package locations, so the synthesized repodata can be served from anywhere without needing to mirror the RPM content.

* `dnf-with-azl-repos` — thin wrapper around `dnf` that probes one or more URL prefixes for the Standard Azure Linux Repo Layout, enables every sub-repo it discovers (silently skipping ones that don't exist), and execs `dnf` with those repos added on the command line.

* `_repo_layout.py` — shared definition of the Standard Azure Linux Repo Layout (channels, sub-repo kinds, per-kind URL template) consumed by both scripts.
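The override-precedence rule can be illustrated in miniature. This is a hypothetical sketch, not the script's actual internals: package maps are merged in CLI order, with local override repos listed before upstream ones, so the first repo listing a NEVRA wins.

```python
def merge_by_nevra(repos):
    """Merge ordered {nevra: location_url} maps; the first repo listing a NEVRA wins."""
    merged = {}
    for repo in repos:
        for nevra, url in repo.items():
            merged.setdefault(nevra, url)  # earlier (local override) repos take precedence
    return merged

# Hypothetical data: a local override colliding with upstream on one NEVRA.
local = {"pkg-1.0-1.azl3.x86_64": "file:///overrides/pkg-1.0-1.azl3.x86_64.rpm"}
upstream = {
    "pkg-1.0-1.azl3.x86_64": "https://example.invalid/main/pkg-1.0-1.azl3.x86_64.rpm",
    "other-2.0-1.azl3.x86_64": "https://example.invalid/main/other-2.0-1.azl3.x86_64.rpm",
}
merged = merge_by_nevra([local, upstream])  # local listed first, so it wins the collision
```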


binujp commented May 12, 2026

This looks good, Reuben, thanks! Can you please add a few workflow examples and use cases? Mapping each command-line option to what it does took some effort. That could be a me problem, but examples can help quite a bit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@reubeno reubeno force-pushed the just-repo-tools branch from 233e068 to d3edddf on May 14, 2026 06:00
@reubeno reubeno marked this pull request as ready for review May 14, 2026 06:25
Copilot AI review requested due to automatic review settings May 14, 2026 06:25

Copilot AI left a comment

Pull request overview

Adds three new scripts under scripts/repo/ for synthesizing Azure Linux per-channel/per-arch repodata trees from upstream RPM repositories and for invoking dnf against discovered Azure Linux repos. The scripts share a common layout definition (_repo_layout.py) that encodes the fixed channel × kind × arch matrix.

Changes:

  • New synthesize-repodata.py that fetches upstream repodata, queries azldev package list to assign packages to channels (with sibling-rpm inheritance fallback), and emits routed createrepo_c repodata referencing the original upstream URLs.
  • New dnf-with-azl-repos wrapper that probes URL prefixes for the standard layout (silently skipping 404s) and execs dnf with the discovered sub-repos enabled.
  • New _repo_layout.py shared module defining the standard CHANNELS, KIND_* constants, and SUBREPOS table.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/repo/synthesize-repodata.py | Main synth tool: download repodata, build NEVRA universe, query azldev, decide routing per package, emit per-destination repodata + unpublished/fallback reports. |
| scripts/repo/dnf-with-azl-repos | Thin dnf wrapper: HEAD-probe sub-repos under each --repo-prefix, build --repofrompath/--enablerepo args, exec dnf. |
| scripts/repo/_repo_layout.py | Shared constants/dataclass describing the six standard sub-repos. |
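A rough sketch of the shared-layout idea: the channel names and URL template below are placeholders invented for illustration (the real values live in `_repo_layout.py`), but the shape is the same fixed channel × kind matrix yielding six sub-repos.

```python
from dataclasses import dataclass
from itertools import product

CHANNELS = ("base", "preview")          # placeholder channel names, not the real ones
KINDS = ("main", "debuginfo", "srpms")  # sub-repo kinds named in the PR description

@dataclass(frozen=True)
class SubRepo:
    channel: str
    kind: str

    def url(self, prefix: str) -> str:
        # Hypothetical per-kind URL template; `$basearch` is expanded later.
        return f"{prefix.rstrip('/')}/{self.channel}/{self.kind}/$basearch"

# 2 channels x 3 kinds = the six standard sub-repos.
SUBREPOS = [SubRepo(c, k) for c, k in product(CHANNELS, KINDS)]
```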

Comment on lines +315 to +323

We pull primary/filelists/other for the package universe AND every
auxiliary record (updateinfo, group, modules, ...) so phase 6 can
copy non-package metadata through to routed destinations.

Returns the path to the dir containing ``repodata/``, or None if
the repo's ``repomd.xml`` returned 404 and *repo* was prefix-derived
(silent skip). Other HTTP errors and explicit-origin 404s raise.
"""
Comment on lines +121 to +159
"""HEAD ``<probe_url>/repodata/repomd.xml``.

Returns ``(_PROBE_OK, None)`` on 2xx (or successful non-HTTP
responses such as ``file://``), ``(_PROBE_MISSING, None)`` on 404,
and ``(_PROBE_FAIL, "...")`` on any other transport error or
non-2xx HTTP status. The error string is suitable for inclusion in
a fatal-error message so the user can see the underlying cause.
"""
url = f"{probe_url.rstrip('/')}/repodata/repomd.xml"
req = urllib.request.Request(
url, method="HEAD", headers={"User-Agent": USER_AGENT}
)
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
# ``status`` is the HTTP status code for http(s); for
# ``file://`` and other non-HTTP schemes urllib's response
# has no status attribute -- a successful urlopen there
# already proved the file exists.
status = getattr(resp, "status", None)
if status is None or 200 <= status < 300:
return _PROBE_OK, None
return _PROBE_FAIL, f"HTTP {status}"
except urllib.error.HTTPError as e:
if e.code == 404:
return _PROBE_MISSING, None
return _PROBE_FAIL, f"HTTP {e.code}"
except urllib.error.URLError as e:
# urllib wraps a `file://` ENOENT as URLError(FileNotFoundError);
# treat that as MISSING so local fixtures behave like the HTTP 404
# case.
if isinstance(e.reason, FileNotFoundError):
return _PROBE_MISSING, None
return _PROBE_FAIL, f"URL error: {e.reason}"
except TimeoutError:
return _PROBE_FAIL, f"timed out after {timeout:.0f}s"
except OSError as e:
return _PROBE_FAIL, f"OS error: {e}"


Comment on lines +209 to +211
    if found_here == 0 and not failures:
        log(f"{PROG}: warning: no repos discovered under {prefix_trim}")
    total_found += found_here
Comment on lines +199 to +216
    else:
        # No $basearch: caller is asserting "this URL is for one specific
        # arch". We can't tell which from the URL alone, so we infer from the
        # last path component if it matches a known arch; otherwise refuse.
        # Strip query/fragment first so signed URLs (`...?sig=...`) don't
        # poison the inference.
        parts = urllib.parse.urlsplit(url)
        path = parts.path.rstrip("/")
        last = path.rsplit("/", 1)[-1] if path else ""
        if last in arches:
            out.append(InputRepo(kind, last, url.rstrip("/"), "explicit"))
        else:
            raise ValueError(
                f"--repo {spec!r}: URL has no `$basearch` and its final path "
                f"component {last!r} is not a known arch ({', '.join(arches)}); "
                f"cannot determine arch"
            )
    return out
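The inference step can be replicated in isolation. The arch list below is illustrative (the script takes its known arches from configuration, e.g. DEFAULT_ARCHES); note how stripping the query keeps signed URLs from poisoning the result:

```python
import urllib.parse

ARCHES = ("x86_64", "aarch64")  # illustrative known-arch set

def infer_arch(url, arches=ARCHES):
    """Infer the arch from a URL's final path component, ignoring query/fragment."""
    parts = urllib.parse.urlsplit(url)
    path = parts.path.rstrip("/")
    last = path.rsplit("/", 1)[-1] if path else ""
    return last if last in arches else None

assert infer_arch("https://example.invalid/repo/x86_64/") == "x86_64"
assert infer_arch("https://example.invalid/repo/x86_64?sig=abc") == "x86_64"  # query ignored
assert infer_arch("https://example.invalid/repo/latest") is None  # would raise in the script
```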
Comment on lines +358 to +378
    for record in repomd.records:
        # Only fetch the records we'll actually consume (primary,
        # filelists, other, plus their _db variants). See
        # PACKAGE_RECORD_TYPES above for why we skip aux records.
        if record.type not in PACKAGE_RECORD_TYPES:
            continue
        href = record.location_href or ""
        if not href:
            continue
        url = urllib.parse.urljoin(base, href)
        # Constrain the cache destination path so a hostile/malformed
        # repomd can't write outside cache_dir.
        safe_rel = href.lstrip("/")
        if ".." in Path(safe_rel).parts:
            raise RuntimeError(
                f"refusing to write metadata record outside cache: {href!r}"
            )
        dest = cache_dir / safe_rel
        log(f"  fetching {url}")
        _http_get(url, dest, ssl_context)
    return cache_dir
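The `..` guard is easy to verify in isolation. A standalone version of the same check, using `Path.parts` so that sneaky embedded segments are caught regardless of where they appear:

```python
from pathlib import Path

def is_safe_href(href):
    """Reject hrefs that would escape the cache dir via a `..` path component."""
    return ".." not in Path(href.lstrip("/")).parts

assert is_safe_href("repodata/abc-primary.xml.gz")
assert is_safe_href("/repodata/abc-primary.xml.gz")  # leading slash is stripped first
assert not is_safe_href("../../etc/passwd")
assert not is_safe_href("repodata/../../secret")     # embedded `..` also rejected
```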
Comment on lines +120 to +158
def probe_repo(probe_url: str, *, timeout: float = PROBE_TIMEOUT) -> tuple[str, str | None]:
    """HEAD ``<probe_url>/repodata/repomd.xml``.

    Returns ``(_PROBE_OK, None)`` on 2xx (or successful non-HTTP
    responses such as ``file://``), ``(_PROBE_MISSING, None)`` on 404,
    and ``(_PROBE_FAIL, "...")`` on any other transport error or
    non-2xx HTTP status. The error string is suitable for inclusion in
    a fatal-error message so the user can see the underlying cause.
    """
    url = f"{probe_url.rstrip('/')}/repodata/repomd.xml"
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # ``status`` is the HTTP status code for http(s); for
            # ``file://`` and other non-HTTP schemes urllib's response
            # has no status attribute -- a successful urlopen there
            # already proved the file exists.
            status = getattr(resp, "status", None)
            if status is None or 200 <= status < 300:
                return _PROBE_OK, None
            return _PROBE_FAIL, f"HTTP {status}"
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return _PROBE_MISSING, None
        return _PROBE_FAIL, f"HTTP {e.code}"
    except urllib.error.URLError as e:
        # urllib wraps a `file://` ENOENT as URLError(FileNotFoundError);
        # treat that as MISSING so local fixtures behave like the HTTP 404
        # case.
        if isinstance(e.reason, FileNotFoundError):
            return _PROBE_MISSING, None
        return _PROBE_FAIL, f"URL error: {e.reason}"
    except TimeoutError:
        return _PROBE_FAIL, f"timed out after {timeout:.0f}s"
    except OSError as e:
        return _PROBE_FAIL, f"OS error: {e}"

Comment on lines +1192 to +1193
    routing = query_azldev(
        args.repo_root, src_map, output_dir, known_components
Comment on lines +1105 to +1106
f"Arch to expand `$basearch` into (default: "
f"{', '.join(DEFAULT_ARCHES)}). Repeatable."
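Expansion of a `$basearch`-templated URL is then a plain string substitution, one concrete URL per configured arch. A sketch (the default arch list here is illustrative):

```python
DEFAULT_ARCHES = ("x86_64", "aarch64")  # illustrative default, not the script's actual value

def expand_basearch(url, arches=DEFAULT_ARCHES):
    """Expand a `$basearch`-templated URL into one concrete URL per arch."""
    return [url.replace("$basearch", arch) for arch in arches]

urls = expand_basearch("https://example.invalid/base/main/$basearch")
```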