feat(repo): add scripts to synthesize and consume azl repodata #17139

Open

reubeno wants to merge 1 commit into microsoft:tomls/base/main from reubeno:just-repo-tools

Conversation


@reubeno reubeno commented May 11, 2026

Adds three new tools under `scripts/repo/` for working with Azure Linux package repositories:

* `synthesize-repodata.py` — given one or more upstream repo prefixes (Standard Azure Linux Repo Layout: per-channel main/debuginfo/srpms sub-repos) and/or explicit per-repo overrides, synthesize a fresh set of per-destination repodata trees that route each package to its intended channel based on azldev component metadata. Local repo overrides take precedence over upstream when NEVRAs collide (CLI order is preserved). The output is a static directory tree of standard `createrepo_c` repodata with absolute upstream URLs in package locations, so the synthesized repodata can be served from anywhere without needing to mirror the RPM content.

* `dnf-with-azl-repos` — thin wrapper around `dnf` that probes one or more URL prefixes for the Standard Azure Linux Repo Layout, enables every sub-repo it discovers (silently skipping ones that don't exist), and execs `dnf` with those repos added on the command line.

* `_repo_layout.py` — shared definition of the Standard Azure Linux Repo Layout (channels, sub-repo kinds, per-kind URL template) consumed by both scripts.
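The override-precedence rule can be illustrated in miniature. This is a hypothetical sketch, not the script's actual internals: package maps are merged in CLI order, with local override repos listed before upstream ones, so the first repo listing a NEVRA wins.

```python
def merge_by_nevra(repos):
    """Merge ordered {nevra: location_url} maps; the first repo listing a NEVRA wins."""
    merged = {}
    for repo in repos:
        for nevra, url in repo.items():
            merged.setdefault(nevra, url)  # earlier (local override) repos take precedence
    return merged

# Hypothetical data: a local override colliding with upstream on one NEVRA.
local = {"pkg-1.0-1.azl3.x86_64": "file:///overrides/pkg-1.0-1.azl3.x86_64.rpm"}
upstream = {
    "pkg-1.0-1.azl3.x86_64": "https://example.invalid/main/pkg-1.0-1.azl3.x86_64.rpm",
    "other-2.0-1.azl3.x86_64": "https://example.invalid/main/other-2.0-1.azl3.x86_64.rpm",
}
merged = merge_by_nevra([local, upstream])  # local listed first, so it wins the collision
```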


binujp commented May 12, 2026

This looks good, Reuben, thanks! Can you please add a few workflow examples and use cases? Mapping each command-line option to what it does took some effort. That could be a me problem, but examples can help quite a bit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@reubeno reubeno force-pushed the just-repo-tools branch from 233e068 to d3edddf on May 14, 2026 06:00
@reubeno reubeno marked this pull request as ready for review May 14, 2026 06:25
Copilot AI review requested due to automatic review settings May 14, 2026 06:25

Copilot AI left a comment

Pull request overview

Adds three new scripts under scripts/repo/ for synthesizing Azure Linux per-channel/per-arch repodata trees from upstream RPM repositories and for invoking dnf against discovered Azure Linux repos. The scripts share a common layout definition (_repo_layout.py) that encodes the fixed channel × kind × arch matrix.

Changes:

  • New synthesize-repodata.py that fetches upstream repodata, queries azldev package list to assign packages to channels (with sibling-rpm inheritance fallback), and emits routed createrepo_c repodata referencing the original upstream URLs.
  • New dnf-with-azl-repos wrapper that probes URL prefixes for the standard layout (silently skipping 404s) and execs dnf with the discovered sub-repos enabled.
  • New _repo_layout.py shared module defining the standard CHANNELS, KIND_* constants, and SUBREPOS table.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/repo/synthesize-repodata.py | Main synth tool: download repodata, build NEVRA universe, query azldev, decide routing per package, emit per-destination repodata + unpublished/fallback reports. |
| scripts/repo/dnf-with-azl-repos | Thin dnf wrapper: HEAD-probe sub-repos under each --repo-prefix, build --repofrompath/--enablerepo args, exec dnf. |
| scripts/repo/_repo_layout.py | Shared constants/dataclass describing the six standard sub-repos. |
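A rough sketch of the shared-layout idea: the channel names and URL template below are placeholders invented for illustration (the real values live in `_repo_layout.py`), but the shape is the same fixed channel × kind matrix yielding six sub-repos.

```python
from dataclasses import dataclass
from itertools import product

CHANNELS = ("base", "preview")          # placeholder channel names, not the real ones
KINDS = ("main", "debuginfo", "srpms")  # sub-repo kinds named in the PR description

@dataclass(frozen=True)
class SubRepo:
    channel: str
    kind: str

    def url(self, prefix: str) -> str:
        # Hypothetical per-kind URL template; `$basearch` is expanded later.
        return f"{prefix.rstrip('/')}/{self.channel}/{self.kind}/$basearch"

# 2 channels x 3 kinds = the six standard sub-repos.
SUBREPOS = [SubRepo(c, k) for c, k in product(CHANNELS, KINDS)]
```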

Comment on lines +315 to +323

We pull primary/filelists/other for the package universe AND every
auxiliary record (updateinfo, group, modules, ...) so phase 6 can
copy non-package metadata through to routed destinations.

Returns the path to the dir containing ``repodata/``, or None if
the repo's ``repomd.xml`` returned 404 and *repo* was prefix-derived
(silent skip). Other HTTP errors and explicit-origin 404s raise.
"""
Comment on lines +121 to +159
"""HEAD ``<probe_url>/repodata/repomd.xml``.

Returns ``(_PROBE_OK, None)`` on 2xx (or successful non-HTTP
responses such as ``file://``), ``(_PROBE_MISSING, None)`` on 404,
and ``(_PROBE_FAIL, "...")`` on any other transport error or
non-2xx HTTP status. The error string is suitable for inclusion in
a fatal-error message so the user can see the underlying cause.
"""
url = f"{probe_url.rstrip('/')}/repodata/repomd.xml"
req = urllib.request.Request(
url, method="HEAD", headers={"User-Agent": USER_AGENT}
)
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
# ``status`` is the HTTP status code for http(s); for
# ``file://`` and other non-HTTP schemes urllib's response
# has no status attribute -- a successful urlopen there
# already proved the file exists.
status = getattr(resp, "status", None)
if status is None or 200 <= status < 300:
return _PROBE_OK, None
return _PROBE_FAIL, f"HTTP {status}"
except urllib.error.HTTPError as e:
if e.code == 404:
return _PROBE_MISSING, None
return _PROBE_FAIL, f"HTTP {e.code}"
except urllib.error.URLError as e:
# urllib wraps a `file://` ENOENT as URLError(FileNotFoundError);
# treat that as MISSING so local fixtures behave like the HTTP 404
# case.
if isinstance(e.reason, FileNotFoundError):
return _PROBE_MISSING, None
return _PROBE_FAIL, f"URL error: {e.reason}"
except TimeoutError:
return _PROBE_FAIL, f"timed out after {timeout:.0f}s"
except OSError as e:
return _PROBE_FAIL, f"OS error: {e}"


Comment on lines +209 to +211
    if found_here == 0 and not failures:
        log(f"{PROG}: warning: no repos discovered under {prefix_trim}")
    total_found += found_here
Comment on lines +199 to +216
    else:
        # No $basearch: caller is asserting "this URL is for one specific
        # arch". We can't tell which from the URL alone, so we infer from the
        # last path component if it matches a known arch; otherwise refuse.
        # Strip query/fragment first so signed URLs (`...?sig=...`) don't
        # poison the inference.
        parts = urllib.parse.urlsplit(url)
        path = parts.path.rstrip("/")
        last = path.rsplit("/", 1)[-1] if path else ""
        if last in arches:
            out.append(InputRepo(kind, last, url.rstrip("/"), "explicit"))
        else:
            raise ValueError(
                f"--repo {spec!r}: URL has no `$basearch` and its final path "
                f"component {last!r} is not a known arch ({', '.join(arches)}); "
                f"cannot determine arch"
            )
    return out
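The inference step can be replicated in isolation. The arch list below is illustrative (the script takes its known arches from configuration, e.g. DEFAULT_ARCHES); note how stripping the query keeps signed URLs from poisoning the result:

```python
import urllib.parse

ARCHES = ("x86_64", "aarch64")  # illustrative known-arch set

def infer_arch(url, arches=ARCHES):
    """Infer the arch from a URL's final path component, ignoring query/fragment."""
    parts = urllib.parse.urlsplit(url)
    path = parts.path.rstrip("/")
    last = path.rsplit("/", 1)[-1] if path else ""
    return last if last in arches else None

assert infer_arch("https://example.invalid/repo/x86_64/") == "x86_64"
assert infer_arch("https://example.invalid/repo/x86_64?sig=abc") == "x86_64"  # query ignored
assert infer_arch("https://example.invalid/repo/latest") is None  # would raise in the script
```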
Comment on lines +358 to +378
    for record in repomd.records:
        # Only fetch the records we'll actually consume (primary,
        # filelists, other, plus their _db variants). See
        # PACKAGE_RECORD_TYPES above for why we skip aux records.
        if record.type not in PACKAGE_RECORD_TYPES:
            continue
        href = record.location_href or ""
        if not href:
            continue
        url = urllib.parse.urljoin(base, href)
        # Constrain the cache destination path so a hostile/malformed
        # repomd can't write outside cache_dir.
        safe_rel = href.lstrip("/")
        if ".." in Path(safe_rel).parts:
            raise RuntimeError(
                f"refusing to write metadata record outside cache: {href!r}"
            )
        dest = cache_dir / safe_rel
        log(f"  fetching {url}")
        _http_get(url, dest, ssl_context)
    return cache_dir
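The `..` guard is easy to verify in isolation. A standalone version of the same check, using `Path.parts` so that sneaky embedded segments are caught regardless of where they appear:

```python
from pathlib import Path

def is_safe_href(href):
    """Reject hrefs that would escape the cache dir via a `..` path component."""
    return ".." not in Path(href.lstrip("/")).parts

assert is_safe_href("repodata/abc-primary.xml.gz")
assert is_safe_href("/repodata/abc-primary.xml.gz")  # leading slash is stripped first
assert not is_safe_href("../../etc/passwd")
assert not is_safe_href("repodata/../../secret")     # embedded `..` also rejected
```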
Comment on lines +120 to +158
def probe_repo(probe_url: str, *, timeout: float = PROBE_TIMEOUT) -> tuple[str, str | None]:
    """HEAD ``<probe_url>/repodata/repomd.xml``.

    Returns ``(_PROBE_OK, None)`` on 2xx (or successful non-HTTP
    responses such as ``file://``), ``(_PROBE_MISSING, None)`` on 404,
    and ``(_PROBE_FAIL, "...")`` on any other transport error or
    non-2xx HTTP status. The error string is suitable for inclusion in
    a fatal-error message so the user can see the underlying cause.
    """
    url = f"{probe_url.rstrip('/')}/repodata/repomd.xml"
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # ``status`` is the HTTP status code for http(s); for
            # ``file://`` and other non-HTTP schemes urllib's response
            # has no status attribute -- a successful urlopen there
            # already proved the file exists.
            status = getattr(resp, "status", None)
            if status is None or 200 <= status < 300:
                return _PROBE_OK, None
            return _PROBE_FAIL, f"HTTP {status}"
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return _PROBE_MISSING, None
        return _PROBE_FAIL, f"HTTP {e.code}"
    except urllib.error.URLError as e:
        # urllib wraps a `file://` ENOENT as URLError(FileNotFoundError);
        # treat that as MISSING so local fixtures behave like the HTTP 404
        # case.
        if isinstance(e.reason, FileNotFoundError):
            return _PROBE_MISSING, None
        return _PROBE_FAIL, f"URL error: {e.reason}"
    except TimeoutError:
        return _PROBE_FAIL, f"timed out after {timeout:.0f}s"
    except OSError as e:
        return _PROBE_FAIL, f"OS error: {e}"

Comment on lines +1192 to +1193
    routing = query_azldev(
        args.repo_root, src_map, output_dir, known_components
Comment on lines +1105 to +1106
f"Arch to expand `$basearch` into (default: "
f"{', '.join(DEFAULT_ARCHES)}). Repeatable."
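Expansion of a `$basearch`-templated URL is then a plain string substitution, one concrete URL per configured arch. A sketch (the default arch list here is illustrative):

```python
DEFAULT_ARCHES = ("x86_64", "aarch64")  # illustrative default, not the script's actual value

def expand_basearch(url, arches=DEFAULT_ARCHES):
    """Expand a `$basearch`-templated URL into one concrete URL per arch."""
    return [url.replace("$basearch", arch) for arch in arches]

urls = expand_basearch("https://example.invalid/base/main/$basearch")
```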