Domains by siddhsuresh · Pull Request #3 · flightcontrolhq/modules

siddhsuresh · 2026-05-20T20:55:38Z

Greptile Summary

This PR introduces a _shared/ravion_cert_groups child module that centralises wildcard-cert provisioning and DNS routing for both ECS clusters (parent mode) and ECS services (leaf mode), replacing the older inline ravion_domain resources. It also wires in the ravion Terraform provider across the affected modules and converts the variable-validation patterns in autoscaling, ecs_cluster, ecs_service, alb, and nlb from var.X == null || ... to try(..., true).

New _shared/ravion_cert_groups module — dispatches three cert-group kinds (ravion_auto, customer, inherit) with variant-aware DNS-record writes (Route53-Ravion, Route53 direct, Cloudflare) and listener-rule creation using sha256-derived priority buckets.
Cluster integration — ecs_cluster/ravion_domains.tf instantiates the module in parent mode, issuing one wildcard ACM cert per group; the cert ARNs are appended to the public ALB's listener cert list via the new local.public_alb_effective_certificate_arns local.
Service integration — ecs_service/ravion_domains.tf instantiates the module in leaf mode, receiving the cluster's ravion_certificate_groups output so inherit groups can nest under the right wildcard cert without duplicating state.

Confidence Score: 3/5

Two definite breakage paths need to be resolved before merging: the six module.yml output entries that have no backing Terraform output blocks will fail for any caller that references them, and inherit groups with a non-matching parent_group_name are silently dropped rather than erroring.

The module.yml / outputs.tf mismatch means the Ravion UI and any downstream module that reads the advertised cluster outputs will hit a plan-time error on every deployment. The silent drop of inherit groups with a mistyped parent_group_name produces a successful apply with zero domain allocations — a live service ends up with no URL and no diagnostic. Both issues are on the critical domain-provisioning path introduced by this PR.

compute/ecs_cluster/module.yml and compute/ecs_cluster/outputs.tf (output name mismatch); _shared/ravion_cert_groups/main.tf (silent inherit-group drop and unguarded domains[0] access).

Important Files Changed

Filename	Overview
_shared/ravion_cert_groups/main.tf	New shared module dispatching cert-group kinds (inherit, customer). Two issues: inherit groups with a mismatched parent_group_name are silently dropped, and aws_acm_certificate.customer will panic on domains[0] if the shared module is called without the ECS-service-level min-domains guard.
compute/ecs_cluster/module.yml	Adds 6 new Ravion outputs (ravion_managed_domains_enabled, ravion_cluster_fqdn, etc.) that have no corresponding output blocks in outputs.tf; the one block that was added (ravion_certificate_groups) is absent from module.yml — all 6 advertised outputs will fail at plan time.
_shared/ravion_cert_groups/parent.tf	New parent-mode logic issuing wildcard ACM certs per group and variant-dispatched DNS validation. Logic looks correct; tolist()[0] on domain_validation_options is safe for wildcard certs which produce exactly one validation option.
compute/ecs_cluster/ravion_domains.tf	New file wiring the shared cert-groups module in parent mode; straightforward delegation with correct mode and listener_arn gating.
compute/ecs_service/ravion_domains.tf	New file wiring the shared cert-groups module in leaf mode with routing target and listener ARN forwarded from cluster outputs.
compute/ecs_cluster/outputs.tf	Adds ravion_certificate_groups output but module.yml advertises six different output names that don't exist as Terraform output blocks.
compute/ecs_service/variables.tf	Adds ravion_certificate_groups and supporting domain-wiring variables with thorough per-kind validation; try() wrapping on existing validations consistent with broader PR pattern.
compute/autoscaling/variables.tf	Replaces null-guard short-circuit patterns with try(..., true) across all complex object validations; addresses OpenTofu evaluation-order issues with optional object attributes.
_shared/ravion_cert_groups/variables.tf	Shared module inputs well-documented; missing a min-domains validation for customer kind that the callers have but the module itself does not enforce.

Sequence Diagram

sequenceDiagram
    participant Cluster as ecs_cluster (parent mode)
    participant SharedModule as _shared/ravion_cert_groups
    participant ALB as AWS ALB
    participant ACM as AWS ACM
    participant DNS as DNS Provider (Route53/CF)
    participant Service as ecs_service (leaf mode)

    Cluster->>SharedModule: "mode=parent, cert_groups=[ravion_auto/customer]"
    SharedModule->>DNS: Lookup DnsProvider (platform_apex / customer)
    SharedModule->>ACM: "aws_acm_certificate (wildcard *.fqdn)"
    SharedModule->>DNS: Write validation records
    SharedModule->>ACM: aws_acm_certificate_validation
    SharedModule->>ALB: Attach wildcard cert (SNI)
    SharedModule-->>Cluster: "parent_groups output {parent_allocation_id, cert_arn, wildcard_fqdn}"

    Service->>SharedModule: "mode=leaf, cert_groups=[inherit/customer], cluster_groups=parent_groups"
    Note over SharedModule: inherit kind - nest under cluster parent allocation
    SharedModule->>DNS: Lookup DnsProvider (customer groups)
    SharedModule->>ACM: aws_acm_certificate (customer FQDNs)
    SharedModule->>DNS: Write routing + validation records
    SharedModule->>ALB: aws_lb_listener_rule (host-header per FQDN)
    SharedModule-->>Service: domain_fqdns, domain_allocation_ids

Comments Outside Diff (2)

compute/ecs_service/variables.tf, line 1032-1036 (link)

ravion_service_slug is declared but never used

var.ravion_service_slug is defined and documented as "the slug used to derive the service FQDN under the cluster wildcard", but ravion_domains.tf assigns slug = each.value (entries from var.ravion_domains) for every ravion_domain.this allocation — ravion_service_slug is never referenced. Any caller who sets this variable expecting it to influence domain naming will get silently incorrect FQDNs built from var.ravion_domains entries instead.

Prompt To Fix With AI

This is a comment left during a code review.
Path: compute/ecs_service/variables.tf
Line: 1032-1036

Comment:
**`ravion_service_slug` is declared but never used**

`var.ravion_service_slug` is defined and documented as "the slug used to derive the service FQDN under the cluster wildcard", but `ravion_domains.tf` assigns `slug = each.value` (entries from `var.ravion_domains`) for every `ravion_domain.this` allocation — `ravion_service_slug` is never referenced. Any caller who sets this variable expecting it to influence domain naming will get silently incorrect FQDNs built from `var.ravion_domains` entries instead.

How can I resolve this? If you propose a fix, please make it concise.

compute/autoscaling/variables.tf, line 9-11 (link)

try(..., true) wrapping silently swallows validation bugs

Every validation condition across autoscaling/variables.tf, ecs_cluster/variables.tf, ecs_service/variables.tf, networking/alb/variables.tf, and networking/nlb/variables.tf is now wrapped in try(..., true). This catches not only expected null-dereference errors but also any accidental typos or wrong attribute names in the validation expression — all of which now return true (pass) instead of surfacing the problem. The original var.X == null || short-circuit pattern was both readable and safe; try buys very little here while hiding future expression bugs.

Prompt To Fix With AI

This is a comment left during a code review.
Path: compute/autoscaling/variables.tf
Line: 9-11

Comment:
**`try(..., true)` wrapping silently swallows validation bugs**

Every validation `condition` across `autoscaling/variables.tf`, `ecs_cluster/variables.tf`, `ecs_service/variables.tf`, `networking/alb/variables.tf`, and `networking/nlb/variables.tf` is now wrapped in `try(..., true)`. This catches not only expected null-dereference errors but also any accidental typos or wrong attribute names in the validation expression — all of which now return `true` (pass) instead of surfacing the problem. The original `var.X == null ||` short-circuit pattern was both readable and safe; `try` buys very little here while hiding future expression bugs.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
compute/ecs_cluster/module.yml:552-557
**Six new `module.yml` outputs are never defined in `outputs.tf`**

`module.yml` now declares `ravion_managed_domains_enabled`, `ravion_cluster_domain_allocation_id`, `ravion_cluster_managed_domain_id`, `ravion_cluster_fqdn`, `ravion_cluster_certificate_arn`, and `ravion_dns_provider_id` as outputs, but none of these have a corresponding `output` block in `outputs.tf`. The only block that was actually added there is `ravion_certificate_groups`, which is itself absent from `module.yml`. Any caller (or the Ravion UI) that references any of these six advertised outputs will fail at plan time with `This object does not have an attribute named "..."`. The output that IS defined (`ravion_certificate_groups`) should be listed here, and the six phantom entries should either be removed or backed by real output blocks.

### Issue 2 of 3
_shared/ravion_cert_groups/main.tf:53-56
**`inherit` group silently vanishes when `parent_group_name` is not in `var.cluster_groups`**

The `contains(keys(var.cluster_groups), coalesce(g.parent_group_name, ""))` guard silently drops any `inherit` group whose `parent_group_name` doesn't match an entry in the map — no `ravion_domain`, no `aws_lb_listener_rule`, no error. A typo in the group name (e.g. `"default"` vs `"Default"`) produces a successful apply with zero domain allocations and the service silently gets no URL. The ECS-service-level validation only checks that `parent_group_name` is non-empty, not that it resolves. A `check` block or `precondition` that asserts every `inherit` group's `parent_group_name` exists in `var.cluster_groups` would surface the failure early.

### Issue 3 of 3
_shared/ravion_cert_groups/main.tf:199-202
**`aws_acm_certificate.customer` panics on `domains[0]` when the shared module is called with an empty `domains` list**

`aws_acm_certificate.customer` iterates `local.customer_groups` (all customer-kind groups) and accesses `each.value.domains[0]`. The shared module's `variables.tf` has no validation requiring at least one domain for `customer` kind — only the ECS-service-level caller enforces that constraint. A module consumer that bypasses or precedes that guard will hit an index-out-of-bounds error at plan time with no useful message. Adding a `validation` block on `var.cert_groups` in the shared module directly would make the invariant self-contained.

_{Reviews (3): Last reviewed commit: "feat(cert_groups): TF passes cert_group_..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

…l plane Adds the Terraform module surface that's the customer-facing entry point for the Ravion domain control plane. Matches the V2 (DI) shape — three separate resources instead of the old bundled `ravion_domain`. compute/ecs_cluster/ravion_domains.tf: - ravion_domain.cluster — wildcard allocation (*.<fqdn>) - aws_acm_certificate.cluster — wildcard cert in customer's AWS - ravion_dns_records.cluster_validation — ACM validation CNAMEs into Ravion's Route53 (sync) - ravion_dns_records.cluster_routing — ALIAS to the cluster ALB - aws_acm_certificate_validation — blocks until ACM verifies (~30s) - ravion_managed_certificate.cluster — UI metadata sink compute/ecs_service/ravion_domains.tf: - ravion_domain.auto — child allocation under cluster - ravion_dns_records.auto_routing — service-leaf ALIAS to cluster ALB - aws_lb_listener_rule.ravion — host-header rule on cluster's HTTPS listener. No per-service cert — cluster wildcard covers via SNI. Cluster outputs four values the service module wires up to: - ravion_cluster_domain_allocation_id (the SNI parent) - ravion_cluster_managed_domain_id - ravion_cluster_fqdn - ravion_cluster_certificate_arn Service variables (all defaulted null — module is opt-in): - ravion_dns_zone_id - ravion_parent_domain_allocation_id - ravion_cluster_alb_dns_name / _zone_id / _https_listener_arn - ravion_service_slug / _listener_rule_priority versions.tf in both modules declares `ravion = ravion.com/ravion/domains >= 1.0.0`. terraform fmt clean. All AWS resources live in the customer's account; Ravion holds zero customer credentials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

compute/ecs_cluster/module.yml gets the certificate source toggle — "Ravion managed (wildcard cert + FQDN auto-allocated)" vs "Bring your own ACM cert". Ravion-managed mode reveals a DNS Zone dropdown (populated via \$values:ravion/dns-zones — picks the platform-managed Ravion apex or any customer-owned zone) and an optional FQDN slug. When the cluster is in Ravion-managed mode: - ravion_domain.cluster allocates `*.<slug>-<hash>.<apex>` - aws_acm_certificate.cluster issues the wildcard cert - ravion_dns_records.cluster_validation lands the ACM validation CNAMEs in the configured zone (synchronous via the RavionRoute53Writer) - aws_acm_certificate_validation blocks ~30s on ACM - The validated cert ARN gets prepended to the ALB listener's certificate_arns (default cert); customer BYO arns can still be added for SNI on custom domains - ravion_managed_certificate.cluster records the metadata for the UI compute/ecs_cluster outputs add: ravion_managed_domains_enabled (bool) ← gates the service module ravion_dns_zone_id ravion_cluster_domain_allocation_id ← parent for service alloc ravion_cluster_managed_domain_id ravion_cluster_fqdn ravion_cluster_certificate_arn compute/ecs_service: - New `ravion_domains` variable (list of slugs, DNS-validated). - ravion_domains.tf now iterates the list — one ravion_domain + ravion_dns_records + aws_lb_listener_rule per entry, all keyed by toset(domains) so add/remove is a clean diff. - Listener-rule priorities derived deterministically from sha256("\${var.name}:\${domain}") so two services in the same cluster never collide. Caller can pin a base via ravion_listener_rule_priority_base. - Empty list = service is reachable via the cluster's apex wildcard only; no per-service FQDN. Same gating as before (ravion_parent_domain_allocation_id null/empty). - Outputs: ravion_domain_fqdns map + ravion_domain_allocation_ids map. No per-service ACM cert — the cluster's wildcard covers every FQDN allocated under it via SNI. (api-go \$values endpoint lives in the matching commit on the flightcontrol branch.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Restores the validation conditions in compute/{autoscaling,ecs_cluster,ecs_service} and networking/{alb,nlb} variables.tf to their pre-03f87c3 state. The other changes in 03f87c3 (Ravion module.yml + listener wiring fixes) remain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…integration - Added support for the new Ravion DNS provider in both cluster and service modules. - Replaced `ravion_dns_zone_id` with `ravion_dns_provider_id` and `ravion_dns_provider_given_id` for improved flexibility. - Updated local variables and data sources to handle DNS provider lookups and routing logic. - Enhanced routing record management for Route53 and Cloudflare based on the selected DNS provider. - Adjusted ACM certificate issuance and validation processes to accommodate the new provider structure. - Updated module documentation and variable descriptions for clarity on the new configuration options.

siddhsuresh · 2026-05-20T21:09:01Z

@greptile

greptile-apps · 2026-05-20T21:13:18Z

+  # same cluster don't collide on listener-rule priority. SHA-256 →
+  # 16-bit hex digest → mod 49000 + offset 1000 to stay clear of the
+  # lower reserved range. When the caller pins a base, sort the domain
+  # slugs and assign +0, +1, +2, ... so re-applies stay stable.
+  ravion_sorted_domains = sort(var.ravion_domains)
+  ravion_priority_for_domain = {
+    for idx, d in local.ravion_sorted_domains :
+    d => (


Hash-based priority can collide, causing apply failures

substr(sha256(...), 0, 4) produces a 16-bit value (0–65535); % 49000 maps it to 0–48999. Because 65536 > 49000, the distribution is non-uniform (values 0–16535 appear twice), and — more critically — any two (var.name, domain-slug) combinations that hash to the same 4-hex prefix produce an identical listener-rule priority. When a second service hits the same priority on the same ALB listener, AWS rejects the aws_lb_listener_rule create with a DuplicatePriority error. The birthday bound is ~√(49000) ≈ 221 unique combinations before a 50% collision probability, but even a handful of services sharing a large domain list creates real risk at cluster scale.

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_service/ravion_domains.tf Line: 38-45 Comment: **Hash-based priority can collide, causing apply failures** `substr(sha256(...), 0, 4)` produces a 16-bit value (0–65535); `% 49000` maps it to 0–48999. Because 65536 > 49000, the distribution is non-uniform (values 0–16535 appear twice), and — more critically — any two `(var.name, domain-slug)` combinations that hash to the same 4-hex prefix produce an identical listener-rule priority. When a second service hits the same priority on the same ALB listener, AWS rejects the `aws_lb_listener_rule` create with a `DuplicatePriority` error. The birthday bound is ~√(49000) ≈ 221 unique combinations before a 50% collision probability, but even a handful of services sharing a large domain list creates real risk at cluster scale. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-20T21:13:19Z

  validation {
-    condition = var.instance_maintenance_policy == null || (
+    condition = try(
      coalesce(var.instance_maintenance_policy.min_healthy_percentage, 90) >= 0 &&
-      coalesce(var.instance_maintenance_policy.min_healthy_percentage, 90) <= 100
+      coalesce(var.instance_maintenance_policy.min_healthy_percentage, 90) <= 100,
+      true
    )
    error_message = "The min_healthy_percentage must be between 0 and 100."
  }

  validation {
-    condition = var.instance_maintenance_policy == null || (
+    condition = try(
      coalesce(var.instance_maintenance_policy.max_healthy_percentage, 120) >= 100 &&
-      coalesce(var.instance_maintenance_policy.max_healthy_percentage, 120) <= 200
+      coalesce(var.instance_maintenance_policy.max_healthy_percentage, 120) <= 200,
+      true
    )
    error_message = "The max_healthy_percentage must be between 100 and 200."
  }


try() without null guard silently swallows type errors

The previous var.instance_maintenance_policy == null || (condition) pattern was dropped; the refactored version relies entirely on try() to catch null-object access. While this works for the null case, try() returns true for any error — including a type mismatch where min_healthy_percentage or max_healthy_percentage is passed as a non-numeric value. Such inputs would silently pass validation instead of emitting the error_message. The same concern applies to the analogous try() wrapping used across compute/ecs_service/variables.tf and compute/ecs_cluster/variables.tf.

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/autoscaling/variables.tf Line: 1392-1408 Comment: **`try()` without null guard silently swallows type errors** The previous `var.instance_maintenance_policy == null || (condition)` pattern was dropped; the refactored version relies entirely on `try()` to catch null-object access. While this works for the null case, `try()` returns `true` for *any* error — including a type mismatch where `min_healthy_percentage` or `max_healthy_percentage` is passed as a non-numeric value. Such inputs would silently pass validation instead of emitting the `error_message`. The same concern applies to the analogous `try()` wrapping used across `compute/ecs_service/variables.tf` and `compute/ecs_cluster/variables.tf`. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-20T21:13:20Z

+    }
+  ]
+}
+


Redundant && var.enable_public_alb — local.enable_ravion_domain already requires var.enable_public_alb (defined in locals.tf), so this term is always true when the outer local is true and redundant when it's false.

Suggested change

count = local.enable_ravion_domain ? 1 : 0

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_cluster/ravion_domains.tf Line: 68 Comment: Redundant `&& var.enable_public_alb` — `local.enable_ravion_domain` already requires `var.enable_public_alb` (defined in `locals.tf`), so this term is always `true` when the outer local is `true` and redundant when it's `false`. ```suggestion count = local.enable_ravion_domain ? 1 : 0 ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-20T21:14:02Z

+  ravion_cluster_managed_domain_id: string
+  ravion_cluster_fqdn: string
+  ravion_cluster_certificate_arn: string
+  ravion_dns_zone_id: string


ravion_dns_zone_id declared but never defined

module.yml advertises ravion_dns_zone_id: string as a module output, but no matching output "ravion_dns_zone_id" block exists in outputs.tf. Any caller that references module.ecs_cluster.ravion_dns_zone_id will fail with This object does not have an attribute named "ravion_dns_zone_id" during plan. The output block that was likely intended is ravion_dns_provider_id (which is defined in outputs.tf but not listed in module.yml), suggesting these two were renamed out of sync.

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_cluster/module.yml Line: 539 Comment: **`ravion_dns_zone_id` declared but never defined** `module.yml` advertises `ravion_dns_zone_id: string` as a module output, but no matching `output "ravion_dns_zone_id"` block exists in `outputs.tf`. Any caller that references `module.ecs_cluster.ravion_dns_zone_id` will fail with `This object does not have an attribute named "ravion_dns_zone_id"` during plan. The output block that was likely intended is `ravion_dns_provider_id` (which *is* defined in `outputs.tf` but not listed in `module.yml`), suggesting these two were renamed out of sync. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-20T21:14:04Z

+resource "ravion_domain" "this" {
+  for_each = local.ravion_domain_set
+
+  dns_provider_id             = var.ravion_dns_provider_id
+  slug                        = each.value
+  parent_domain_allocation_id = var.ravion_parent_domain_allocation_id
+}


given_id path silently leaves dns_provider_id null on domain resources

ravion_domain.this only passes var.ravion_dns_provider_id, not the given_id alternative. When a caller wires up the service using only ravion_dns_provider_given_id, dns_provider_id is null on every allocated domain resource while the data source lookup still succeeds — the domain allocation will be sent to the API without a provider reference. The cluster module resolved this by outputting ravion_dns_provider_id (the resolved opaque id), but there is no guard that validates the opaque id is actually present before creating domain resources.

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_service/ravion_domains.tf Line: 54-60 Comment: **`given_id` path silently leaves `dns_provider_id` null on domain resources** `ravion_domain.this` only passes `var.ravion_dns_provider_id`, not the `given_id` alternative. When a caller wires up the service using only `ravion_dns_provider_given_id`, `dns_provider_id` is null on every allocated domain resource while the data source lookup still succeeds — the domain allocation will be sent to the API without a provider reference. The cluster module resolved this by outputting `ravion_dns_provider_id` (the resolved opaque id), but there is no guard that validates the opaque id is actually present before creating domain resources. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-20T21:14:06Z

+  ttl     = 1 # 1 == automatic, required when proxied = true; ALB doesn't accept CF proxying for arbitrary L7 so leave proxied false + ttl=60.
+  proxied = false


The ttl = 1 value and its inline comment are contradictory: 1 is Cloudflare's "automatic" TTL that is only meaningful when proxied = true. With proxied = false (the current value), a concrete TTL such as 60 seconds should be used — the same value the Ravion metadata record below uses for consistency.

Suggested change

ttl = 1 # 1 == automatic, required when proxied = true; ALB doesn't accept CF proxying for arbitrary L7 so leave proxied false + ttl=60.

proxied = false

ttl = 60

proxied = false

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_cluster/ravion_domains.tf Line: 191-192 Comment: The `ttl = 1` value and its inline comment are contradictory: `1` is Cloudflare's "automatic" TTL that is only meaningful when `proxied = true`. With `proxied = false` (the current value), a concrete TTL such as 60 seconds should be used — the same value the Ravion metadata record below uses for consistency. ```suggestion ttl = 60 proxied = false ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-20T21:14:08Z

+  ravion_priority_for_domain = {
+    for idx, d in local.ravion_sorted_domains :
+    d => (
+      var.ravion_listener_rule_priority_base > 0
+      ? var.ravion_listener_rule_priority_base + idx
+      : (parseint(substr(sha256("${var.name}:${d}"), 0, 4), 16) % 49000) + 1000
+    )
+  }


Hash-based priority space is too small for collision-free allocation

The expression (parseint(substr(sha256("${var.name}:${d}"), 0, 4), 16) % 49000) + 1000 uses only the first 4 hex characters (16-bit, 0–65535 range), then takes % 49000, collapsing to 49,000 buckets. By the birthday problem, the probability of at least one collision reaches ~50% around 261 (service, domain) pairs sharing the same ALB listener. A collision causes aws_lb_listener_rule creation to fail with a duplicate-priority error that is silent until apply time.

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_service/ravion_domains.tf Line: 43-50 Comment: **Hash-based priority space is too small for collision-free allocation** The expression `(parseint(substr(sha256("${var.name}:${d}"), 0, 4), 16) % 49000) + 1000` uses only the first 4 hex characters (16-bit, 0–65535 range), then takes `% 49000`, collapsing to 49,000 buckets. By the birthday problem, the probability of at least one collision reaches ~50% around 261 (service, domain) pairs sharing the same ALB listener. A collision causes `aws_lb_listener_rule` creation to fail with a duplicate-priority error that is silent until apply time. How can I resolve this? If you propose a fix, please make it concise.

…ard) The cluster wildcard cert covers every FQDN under the cluster apex via SNI, but services that need FQDNs outside that apex (multi-zone setups, vanity hostnames) had no way to get their own cert. Adds opt-in certificate groups. variables.tf: Adds var.ravion_certificate_groups — list of objects carrying: name — stable per-service identifier dns_provider_id (opt) — per-group provider override, falls back dns_provider_given_id — to the service-level provider when null domains — up to 10 slugs (ACM default SAN limit; validated; per-domain length + charset) Group names + domain slugs are validated; duplicate group names and groups with > 10 domains are rejected at plan time. data.tf: Per-group ravion_dns_provider lookup so each group can target a different zone than the cluster. Falls back to the service-level provider when the group's overrides are null. ravion_domains.tf: One ACM cert per group (primary + SANs derived from ravion_domain[group/slug].fqdn). Per-variant validation writers (route53_ravion / route53 / cloudflare) keyed by precomputed per-variant subsets of the (group, domain) pair map. ONE ravion_managed_certificate registration per group, scoped SERVICE. Per-domain routing record + aws_lb_listener_rule with deterministic priorities seeded "g:" to avoid collision with the existing ungrouped ravion_domains priority space. The cluster's HTTPS listener picks up each group's cert via aws_lb_listener_certificate; SNI handshake selects the most specific match at request time.

…omain files

Auto-mode zero-config URLs: Cluster: var.use_ravion_subdomain (default true). Looks up platform DnsProvider via stable given_id="ravion-platform-apex"; posts fqdn_override="<module_instance_id>.<apex>"; wildcard cert covers *.<module_instance_id>.<apex>. var.module_instance_id is the runner-injected identifier (null in standalone use → slug mode). Service: var.ravion_auto_subdomain (default true) + service_given_id. Allocates one ravion_domain with slug=service_given_id + parent_domain_allocation_id=cluster's allocation. Server derives <given-id>-<random>.<cluster-fqdn>. Per-variant routing record + listener rule for the auto allocation, same dispatch as the per- domain blocks (route53_ravion / route53 / cloudflare). Module form (cluster): use_ravion_subdomain boolean at the top of Managed Domains. ravion_dns_provider_id moved to show_when use_ravion_subdomain=false + sourced from $values:ravion/dns-providers-customer (no platform row in the dropdown — auto-mode handles that). ravion_dns_provider_given_id stays as the secondary input under the same false-gate.

Q3: service_given_id renamed to module_instance_given_id (matches the runner-injected TF_VAR_module_instance_given_id env var). Q4: cert groups drop the per-row use_ravion_subdomain toggle. Each group MUST specify its own customer-owned DNS provider; domains are full FQDNs posted via fqdn_override (validated under the provider's apex server-side). No fallback to the platform apex. Q7: cluster auto-mode local strips trailing dot from the resolved DnsProvider.domain_name so the literal FQDN written to state is canonical (no `<id>.example.com.`).

Replace stale ravion_dns_zone_id with ravion_dns_provider_id to match the actual output defined in outputs.tf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rovider_id Add lifecycle preconditions on ravion_auto_label and ravion_auto_auto domain resources so that a missing/empty ravion_dns_provider_id fails loudly at plan time instead of producing allocations with a null provider reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ecs_service/ravion_domains.tf became 480+ lines of dispatch logic; the same logic needs to be reused by ecs_cluster and future static_site modules. Move it into modules/_shared/ravion_cert_groups/, parameterised by routing-target (ALB DNS+zone+listener+target-group) so the same module instantiates cleanly for both ECS and CloudFront-fronted callers down the line. ecs_service now declares a single `module "ravion_cert_groups"` block and forwards its var.ravion_certificate_groups plus the wiring fields. Behavior identical (2 kinds: ravion_auto + customer); plan is a no-op when state is migrated with `terraform state mv`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ttl=1 is Cloudflare's "automatic" sentinel and is only meaningful when proxied=true. With proxied=false the value gets coerced server-side to a default that drifts plans on every apply. Match the explicit 60 the sibling ravion_dns_records.cluster_routing_metadata_cf already uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

try(..., true) returns true on ANY error in the inner expression — including attribute typos — which silently passes validation. Replace with the explicit null-guard pattern (var.X == null || <check>) that's short-circuit safe and surfaces real expression bugs. Two nlb conditions (dns_record_client_routing_policy + enforce_security_group_inbound_rules_on_private_link_traffic) gained the null guard since their contains() call would otherwise null-deref on the default. Files: autoscaling, ecs_cluster, ecs_service, networking/alb, networking/nlb. 68 conditions cleaned up. tofu validate passes on all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t mode) Replaces the single-wildcard cluster model with a list of cert groups, each issuing ONE wildcard ACM cert + ONE parent DomainAllocation. Services nest leaf FQDNs under any of these via their own cert groups (kind = cluster_wildcard, cluster_group_name = "<name>", coming in the next commit). Shared cert-groups module gains: - `var.mode = leaf | parent` discriminator - `var.platform_apex_provider_given_id` + `var.module_instance_id` - `parent.tf` — per-group wildcard cert + Ravion-managed parent alloc for ravion_auto kind, customer-supplied wildcard_fqdn for customer kind. Cert ARN + parent allocation id exposed via `parent_groups` output. SNI attach is the parent's job (cluster threads it through the ALB module's certificate_arns list). - `cloudflare_api_token` output so cluster's provider block can source the token from cert-group rows rather than a deleted top-level data source. Cluster module: - Drops `var.use_ravion_subdomain`, `var.ravion_dns_provider_id`, `var.ravion_dns_provider_given_id`, `var.ravion_cluster_slug`, the single `ravion_domain.cluster` resource, and all per-variant validation/routing blocks. The shared module handles the new shape end-to-end. - New `ravion_certificate_groups` variable + listener cert list concat (BYO ARNs first, cert-group certs as SNI extras). - Outputs collapse to a single `ravion_certificate_groups` map. `tofu validate` passes on both cluster + service. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Service cert groups rename `ravion_auto` kind → `cluster_wildcard` and add a required `cluster_group_name` field that picks WHICH cluster wildcard to nest under. Multiple wildcards per cluster (now possible after commit 921e2c2) means services must be explicit instead of implicitly picking the cluster's only allocation. Shared module: - Adds `var.cluster_groups` (map keyed by cluster cert-group name) consumed only by leaf-mode cluster_wildcard groups. - Drops the old `var.ravion_parent_domain_allocation_id` / `var.ravion_dns_provider_id` (replaced by per-cluster-group lookup). - `parent_groups` output now also exposes `dns_provider_id` so leaf allocations attach to the correct provider. - Renames the leaf resource blocks: ravion_auto_label → cluster_wildcard_label, ravion_auto_auto → cluster_wildcard_auto. - 32-bit hash retained on listener-rule priorities. ecs_service: - New `var.ravion_cluster_certificate_groups` (map output from cluster). - `ravion_certificate_groups` kind enum: `cluster_wildcard | customer`, plus per-row `cluster_group_name`. Validations enforce the wire-up. - Drops dead var.ravion_dns_provider_id + var.ravion_parent_domain_allocation_id (now sourced via the map). - locals.tf + data.tf stripped of dead dns_provider_lookup_key block. - provider.tf cloudflare api_token now sourced from the shared module. `tofu validate` passes on cluster + service. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…atch deferred) Form schema now accepts kind=external on both cluster + service cert groups. TF dispatch is NOT implemented yet — a `check` block fails the plan with a clear message when external is used. Reason: the Ravion TF provider needs `is_external` on ravion_domain (so dns_provider_id can be omitted) before we can model the metadata-only allocation. That extension lives in packages/terraform-provider-domains and ships as its own commit chain. Adding the kind to the validation enum NOW lets the api-go YAML publish + UI ship the radio option without churning the schema later. Operators who pick external will see a clear plan-time error referencing the README pointing at the follow-up work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…atform hashes These were generated on darwin_arm64 (local validate) but the runner is linux_amd64 → tofu init fails with "doesn't match any of the checksums previously recorded in the dependency lock file" because no linux_amd64 hash is recorded. Quickest unblock: drop the files so `tofu init` regenerates a fresh lock on each runner. Follow-up: regenerate cross-platform locks via `tofu providers lock -platform=linux_amd64 -platform=darwin_arm64` and re-commit so reproducibility is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lookup Plan failed with "missing lookup key — Provide either id or given_id" because the for_each iterated EVERY cert group including ravion_auto (which uses data.platform_apex), cluster_wildcard (which inherits from var.cluster_groups), and external (no provider row at all). Restrict the for_each to kind=customer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plan failed with "Invalid for_each argument — local.<map> will be known only after apply" because parent_validation_pairs_route53_* keyed each entry on `<group_name>/<opt.domain_name>` where opt.domain_name is the ACM cert's validation domain (known only after apply). TF can't iterate a map whose keys are unknown at plan time. Restructure: each parent group has ONE wildcard cert → ONE validation record. Key the per-variant subsets on group_name (known at plan time) and dereference `aws_acm_certificate.parent[name].domain_validation_options[0]` inside the resource body where unknown-at-plan VALUES are fine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`local.has_listener = var.listener_arn != null && var.listener_arn != ""` is unknown at plan time when the ALB listener is `(known after apply)` on first apply. The ternary `unknown ? local.X : {}` makes TF reject the for_each even when local.X is statically empty (parent mode w/ no customer groups, etc). Iterate the static maps directly. The listener_arn/target_group_arn references inside resource bodies are fine when unknown — TF plans them as `(known after apply)`. If the listener is genuinely null at apply time, those resources fail with a clear AWS error rather than being silently skipped (the gate was footgun-y anyway). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n_dns_records DnsRecordBatchCreate looks up by allocation id (dalc_*); passing a managed_domain_id (mdom_*) errors with ALLOCATION_NOT_FOUND. The existing customer leaf code already uses `ravion_domain.X.id` (allocation id) — parent.tf incorrectly used `.managed_domain_id`. Match the working pattern. The `managed_domain_id` argument name in the ravion provider is a misnomer (it predates the rename) — actual semantics is "the DomainAllocation this record set belongs to". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CloudflareWriter on the api-go side now handles real cloudflare-go/v6 SDK writes when ravion_dns_records resources land. Replace the HCL cloudflare_dns_record + ravion_dns_records-metadata-mirror pair with a single ravion_dns_records call that triggers the SDK write server-side. Three pairs collapse: parent_validation: cloudflare_dns_record.parent_validation_cf + ravion_dns_records.parent_validation_metadata_cf → ravion_dns_records.parent_validation_cf customer_validation: cloudflare_dns_record.customer_validation_cf + ravion_dns_records.customer_validation_metadata_cf → ravion_dns_records.customer_validation_cf customer_routing: cloudflare_dns_record.customer_routing_cf + ravion_dns_records.customer_routing_metadata_cf → ravion_dns_records.customer_routing_cf aws_acm_certificate_validation references point at the renamed ravion_dns_records. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The user's headline complaint: every cluster's tofu init paid the Cloudflare provider plugin download + load even when the customer was a Route53-only shop. With CloudflareWriter now wired server-side in api-go (see ravion-monorepo commits feat(domains)...), all Cloudflare DNS writes flow through the ravion_dns_records resource + the api-go's cloudflare-go/v6 SDK. The HCL provider is not needed. Dropped: - cluster + service versions.tf: cloudflare required_provider block - cluster + service provider.tf: provider "cloudflare" {...} block + api_token = module.ravion_cert_groups.cloudflare_api_token wiring - _shared/ravion_cert_groups/versions.tf: cloudflare required_provider - _shared/ravion_cert_groups/outputs.tf: cloudflare_api_token output (no consumers after the provider blocks went away) tofu validate passes on cluster + service. hashicorp/aws stays (still needed for ALB/ACM/aws_route53_record); ravion provider stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…up_name Per the domain/infra separation rule: service cert-group kind: cluster_wildcard → inherit service field: cluster_group_name → parent_group_name service input var: ravion_cluster_certificate_groups → ravion_parent_certificate_groups shared module locals/resources: cluster_wildcard_label → inherit_label, etc. tag value: ravion:kind = inherit The reason: the inherit pattern (leaf nests under an upstream parent cert) isn't cluster-specific. Future callers like a CloudFront / Lambda module will provide the same parent_groups output, and the service's inherit kind already works for them once names don't bake "cluster" into the wire-level model. User-facing form labels can still say "Inherit a cluster wildcard" where that's the dominant use case — wire-level kinds stay generic. tofu validate passes on cluster + service. Go build green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These were locally generated by tofu init during validation. We deleted them earlier (8ff8246) because they only had darwin_arm64 hashes which broke the linux_amd64 runner. Stay out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ry ravion_domain Every ravion_domain in the shared module now sets the two denormalisation attributes (added in ravion-provider v2.2.0) so the Domains tab can render groups by joining ManagedDomain rows directly: parent.tf: ravion_domain.parent_ravion_auto → kind=ravion_auto ravion_domain.parent_customer → kind=customer main.tf: ravion_domain.inherit_label / inherit_auto → kind=inherit ravion_domain.customer → kind=customer Group name = the operator's cert_group.name in every case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

siddhsuresh · 2026-05-21T12:39:20Z

@greptile

greptile-apps · 2026-05-21T12:45:28Z

+  ravion_managed_domains_enabled: boolean
+  ravion_cluster_domain_allocation_id: string
+  ravion_cluster_managed_domain_id: string
+  ravion_cluster_fqdn: string
+  ravion_cluster_certificate_arn: string
+  ravion_dns_provider_id: string


Six new module.yml outputs are never defined in outputs.tf

module.yml now declares ravion_managed_domains_enabled, ravion_cluster_domain_allocation_id, ravion_cluster_managed_domain_id, ravion_cluster_fqdn, ravion_cluster_certificate_arn, and ravion_dns_provider_id as outputs, but none of these have a corresponding output block in outputs.tf. The only block that was actually added there is ravion_certificate_groups, which is itself absent from module.yml. Any caller (or the Ravion UI) that references any of these six advertised outputs will fail at plan time with This object does not have an attribute named "...". The output that IS defined (ravion_certificate_groups) should be listed here, and the six phantom entries should either be removed or backed by real output blocks.

Prompt To Fix With AI

This is a comment left during a code review. Path: compute/ecs_cluster/module.yml Line: 552-557 Comment: **Six new `module.yml` outputs are never defined in `outputs.tf`** `module.yml` now declares `ravion_managed_domains_enabled`, `ravion_cluster_domain_allocation_id`, `ravion_cluster_managed_domain_id`, `ravion_cluster_fqdn`, `ravion_cluster_certificate_arn`, and `ravion_dns_provider_id` as outputs, but none of these have a corresponding `output` block in `outputs.tf`. The only block that was actually added there is `ravion_certificate_groups`, which is itself absent from `module.yml`. Any caller (or the Ravion UI) that references any of these six advertised outputs will fail at plan time with `This object does not have an attribute named "..."`. The output that IS defined (`ravion_certificate_groups`) should be listed here, and the six phantom entries should either be removed or backed by real output blocks. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-21T12:45:29Z

+  inherit_groups = local.leaf_mode ? {
+    for g in var.cert_groups :
+    g.name => g if g.kind == "inherit" && contains(keys(var.cluster_groups), coalesce(g.parent_group_name, ""))
+  } : {}


inherit group silently vanishes when parent_group_name is not in var.cluster_groups

The contains(keys(var.cluster_groups), coalesce(g.parent_group_name, "")) guard silently drops any inherit group whose parent_group_name doesn't match an entry in the map — no ravion_domain, no aws_lb_listener_rule, no error. A typo in the group name (e.g. "default" vs "Default") produces a successful apply with zero domain allocations and the service silently gets no URL. The ECS-service-level validation only checks that parent_group_name is non-empty, not that it resolves. A check block or precondition that asserts every inherit group's parent_group_name exists in var.cluster_groups would surface the failure early.

Prompt To Fix With AI

This is a comment left during a code review. Path: _shared/ravion_cert_groups/main.tf Line: 53-56 Comment: **`inherit` group silently vanishes when `parent_group_name` is not in `var.cluster_groups`** The `contains(keys(var.cluster_groups), coalesce(g.parent_group_name, ""))` guard silently drops any `inherit` group whose `parent_group_name` doesn't match an entry in the map — no `ravion_domain`, no `aws_lb_listener_rule`, no error. A typo in the group name (e.g. `"default"` vs `"Default"`) produces a successful apply with zero domain allocations and the service silently gets no URL. The ECS-service-level validation only checks that `parent_group_name` is non-empty, not that it resolves. A `check` block or `precondition` that asserts every `inherit` group's `parent_group_name` exists in `var.cluster_groups` would surface the failure early. How can I resolve this? If you propose a fix, please make it concise.

- drop cert_group_name/cert_group_kind from ravion_domain calls - add name + kind to every ravion_managed_certificate (customer, parent, plus new placeholder rows for inherit + external) - external groups: synthetic external:<name>:<group> ARN + PENDING status anchor the cert-group identity even when Ravion provisions no real cert - inherit groups: synthetic inherit:<name>:<group> placeholder cert with WILDCARD scope so the Domains tab projector groups inherit MDs under their named cert group - external_dns_provider_given_id variable + data lookup for the synthetic per-org EXTERNAL DnsProvider

…ispatch - customer_validation_pairs now keyed by static (group, domain) instead of post-apply domain_validation_options, so for_each keys are known at plan time - validation/routing resources resolve the per-domain validation option inside the resource body via list-filter on the cert's domain_validation_options (terraform tolerates unknown body values, only the for_each keys must be static) - customer_provider_kind + parent_provider_kind locals encode the provider variant as a plain string via nonsensitive() so the cloudflare api_token sensitivity marker no longer taints for_each filter conditions - all filter conditions switched from `provider.X != null` to `kind == "X"` so cloudflare/route53/route53_ravion dispatch survives plan

…alidation resources

siddhsuresh and others added 6 commits May 20, 2026 04:54

ecs_service: wrap null-tolerant validations in try() for TF 1.10

0da47ca

ecs_service/locals: wrap remaining null-attribute access in try()

a701eb6

ecs_service/auto_scaling: try() around null var.auto_scaling.scheduled

140000c

siddhsuresh marked this pull request as ready for review May 20, 2026 21:08

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

siddhsuresh and others added 19 commits May 21, 2026 02:48

refactor: clean up comments and improve readability in provider and d…

b61ee3c

…omain files

ecs_cluster: add ravion_dns_provider_given_id form field

266f009

fix(ecs_cluster): correct ravion dns output name in module.yml

a1a2e69

Replace stale ravion_dns_zone_id with ravion_dns_provider_id to match the actual output defined in outputs.tf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

siddhsuresh and others added 4 commits May 21, 2026 16:08

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

siddhsuresh added 3 commits May 21, 2026 18:37

fix(cert_groups): swap each.value.domain_key → each.key in customer v…

785f6b3

…alidation resources

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Domains#3

Domains#3
siddhsuresh wants to merge 33 commits into
mainfrom
domains-v2-di

siddhsuresh commented May 20, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

siddhsuresh commented May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

siddhsuresh commented May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		ttl = 1 # 1 == automatic, required when proxied = true; ALB doesn't accept CF proxying for arbitrary L7 so leave proxied false + ttl=60.
		proxied = false

Conversation

siddhsuresh commented May 20, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

siddhsuresh commented May 20, 2026

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

siddhsuresh commented May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

siddhsuresh commented May 20, 2026 •

edited by greptile-apps Bot

Loading