Skip to content

Domains#3

Open
siddhsuresh wants to merge 33 commits into
mainfrom
domains-v2-di
Open

Domains#3
siddhsuresh wants to merge 33 commits into
mainfrom
domains-v2-di

Conversation

@siddhsuresh
Copy link
Copy Markdown

@siddhsuresh siddhsuresh commented May 20, 2026

Greptile Summary

This PR introduces a _shared/ravion_cert_groups child module that centralises wildcard-cert provisioning and DNS routing for both ECS clusters (parent mode) and ECS services (leaf mode), replacing the older inline ravion_domain resources. It also wires in the ravion Terraform provider across the affected modules and converts the variable-validation patterns in autoscaling, ecs_cluster, ecs_service, alb, and nlb from var.X == null || ... to try(..., true).

  • New _shared/ravion_cert_groups module — dispatches three cert-group kinds (ravion_auto, customer, inherit) with variant-aware DNS-record writes (Route53-Ravion, Route53 direct, Cloudflare) and listener-rule creation using sha256-derived priority buckets.
  • Cluster integrationecs_cluster/ravion_domains.tf instantiates the module in parent mode, issuing one wildcard ACM cert per group; the cert ARNs are appended to the public ALB's listener cert list via the new local.public_alb_effective_certificate_arns local.
  • Service integrationecs_service/ravion_domains.tf instantiates the module in leaf mode, receiving the cluster's ravion_certificate_groups output so inherit groups can nest under the right wildcard cert without duplicating state.

Confidence Score: 3/5

Two definite breakage paths need to be resolved before merging: the six module.yml output entries that have no backing Terraform output blocks will fail for any caller that references them, and inherit groups with a non-matching parent_group_name are silently dropped rather than erroring.

The module.yml / outputs.tf mismatch means the Ravion UI and any downstream module that reads the advertised cluster outputs will hit a plan-time error on every deployment. The silent drop of inherit groups with a mistyped parent_group_name produces a successful apply with zero domain allocations — a live service ends up with no URL and no diagnostic. Both issues are on the critical domain-provisioning path introduced by this PR.

compute/ecs_cluster/module.yml and compute/ecs_cluster/outputs.tf (output name mismatch); _shared/ravion_cert_groups/main.tf (silent inherit-group drop and unguarded domains[0] access).

Important Files Changed

Filename Overview
_shared/ravion_cert_groups/main.tf New shared module dispatching cert-group kinds (inherit, customer). Two issues: inherit groups with a mismatched parent_group_name are silently dropped, and aws_acm_certificate.customer will panic on domains[0] if the shared module is called without the ECS-service-level min-domains guard.
compute/ecs_cluster/module.yml Adds 6 new Ravion outputs (ravion_managed_domains_enabled, ravion_cluster_fqdn, etc.) that have no corresponding output blocks in outputs.tf; the one block that was added (ravion_certificate_groups) is absent from module.yml — all 6 advertised outputs will fail at plan time.
_shared/ravion_cert_groups/parent.tf New parent-mode logic issuing wildcard ACM certs per group and variant-dispatched DNS validation. Logic looks correct; tolist()[0] on domain_validation_options is safe for wildcard certs which produce exactly one validation option.
compute/ecs_cluster/ravion_domains.tf New file wiring the shared cert-groups module in parent mode; straightforward delegation with correct mode and listener_arn gating.
compute/ecs_service/ravion_domains.tf New file wiring the shared cert-groups module in leaf mode with routing target and listener ARN forwarded from cluster outputs.
compute/ecs_cluster/outputs.tf Adds ravion_certificate_groups output but module.yml advertises six different output names that don't exist as Terraform output blocks.
compute/ecs_service/variables.tf Adds ravion_certificate_groups and supporting domain-wiring variables with thorough per-kind validation; try() wrapping on existing validations consistent with broader PR pattern.
compute/autoscaling/variables.tf Replaces null-guard short-circuit patterns with try(..., true) across all complex object validations; addresses OpenTofu evaluation-order issues with optional object attributes.
_shared/ravion_cert_groups/variables.tf Shared module inputs well-documented; missing a min-domains validation for customer kind that the callers have but the module itself does not enforce.

Sequence Diagram

sequenceDiagram
    participant Cluster as ecs_cluster (parent mode)
    participant SharedModule as _shared/ravion_cert_groups
    participant ALB as AWS ALB
    participant ACM as AWS ACM
    participant DNS as DNS Provider (Route53/CF)
    participant Service as ecs_service (leaf mode)

    Cluster->>SharedModule: "mode=parent, cert_groups=[ravion_auto/customer]"
    SharedModule->>DNS: Lookup DnsProvider (platform_apex / customer)
    SharedModule->>ACM: "aws_acm_certificate (wildcard *.fqdn)"
    SharedModule->>DNS: Write validation records
    SharedModule->>ACM: aws_acm_certificate_validation
    SharedModule->>ALB: Attach wildcard cert (SNI)
    SharedModule-->>Cluster: "parent_groups output {parent_allocation_id, cert_arn, wildcard_fqdn}"

    Service->>SharedModule: "mode=leaf, cert_groups=[inherit/customer], cluster_groups=parent_groups"
    Note over SharedModule: inherit kind - nest under cluster parent allocation
    SharedModule->>DNS: Lookup DnsProvider (customer groups)
    SharedModule->>ACM: aws_acm_certificate (customer FQDNs)
    SharedModule->>DNS: Write routing + validation records
    SharedModule->>ALB: aws_lb_listener_rule (host-header per FQDN)
    SharedModule-->>Service: domain_fqdns, domain_allocation_ids
Loading

Comments Outside Diff (2)

  1. compute/ecs_service/variables.tf, line 1032-1036 (link)

    P1 ravion_service_slug is declared but never used

    var.ravion_service_slug is defined and documented as "the slug used to derive the service FQDN under the cluster wildcard", but ravion_domains.tf assigns slug = each.value (entries from var.ravion_domains) for every ravion_domain.this allocation — ravion_service_slug is never referenced. Any caller who sets this variable expecting it to influence domain naming will get silently incorrect FQDNs built from var.ravion_domains entries instead.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: compute/ecs_service/variables.tf
    Line: 1032-1036
    
    Comment:
    **`ravion_service_slug` is declared but never used**
    
    `var.ravion_service_slug` is defined and documented as "the slug used to derive the service FQDN under the cluster wildcard", but `ravion_domains.tf` assigns `slug = each.value` (entries from `var.ravion_domains`) for every `ravion_domain.this` allocation — `ravion_service_slug` is never referenced. Any caller who sets this variable expecting it to influence domain naming will get silently incorrect FQDNs built from `var.ravion_domains` entries instead.
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. compute/autoscaling/variables.tf, line 9-11 (link)

    P2 try(..., true) wrapping silently swallows validation bugs

    Every validation condition across autoscaling/variables.tf, ecs_cluster/variables.tf, ecs_service/variables.tf, networking/alb/variables.tf, and networking/nlb/variables.tf is now wrapped in try(..., true). This catches not only expected null-dereference errors but also any accidental typos or wrong attribute names in the validation expression — all of which now return true (pass) instead of surfacing the problem. The original var.X == null || short-circuit pattern was both readable and safe; try buys very little here while hiding future expression bugs.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: compute/autoscaling/variables.tf
    Line: 9-11
    
    Comment:
    **`try(..., true)` wrapping silently swallows validation bugs**
    
    Every validation `condition` across `autoscaling/variables.tf`, `ecs_cluster/variables.tf`, `ecs_service/variables.tf`, `networking/alb/variables.tf`, and `networking/nlb/variables.tf` is now wrapped in `try(..., true)`. This catches not only expected null-dereference errors but also any accidental typos or wrong attribute names in the validation expression — all of which now return `true` (pass) instead of surfacing the problem. The original `var.X == null ||` short-circuit pattern was both readable and safe; `try` buys very little here while hiding future expression bugs.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
compute/ecs_cluster/module.yml:552-557
**Six new `module.yml` outputs are never defined in `outputs.tf`**

`module.yml` now declares `ravion_managed_domains_enabled`, `ravion_cluster_domain_allocation_id`, `ravion_cluster_managed_domain_id`, `ravion_cluster_fqdn`, `ravion_cluster_certificate_arn`, and `ravion_dns_provider_id` as outputs, but none of these have a corresponding `output` block in `outputs.tf`. The only block that was actually added there is `ravion_certificate_groups`, which is itself absent from `module.yml`. Any caller (or the Ravion UI) that references any of these six advertised outputs will fail at plan time with `This object does not have an attribute named "..."`. The output that IS defined (`ravion_certificate_groups`) should be listed here, and the six phantom entries should either be removed or backed by real output blocks.

### Issue 2 of 3
_shared/ravion_cert_groups/main.tf:53-56
**`inherit` group silently vanishes when `parent_group_name` is not in `var.cluster_groups`**

The `contains(keys(var.cluster_groups), coalesce(g.parent_group_name, ""))` guard silently drops any `inherit` group whose `parent_group_name` doesn't match an entry in the map — no `ravion_domain`, no `aws_lb_listener_rule`, no error. A typo in the group name (e.g. `"default"` vs `"Default"`) produces a successful apply with zero domain allocations and the service silently gets no URL. The ECS-service-level validation only checks that `parent_group_name` is non-empty, not that it resolves. A `check` block or `precondition` that asserts every `inherit` group's `parent_group_name` exists in `var.cluster_groups` would surface the failure early.

### Issue 3 of 3
_shared/ravion_cert_groups/main.tf:199-202
**`aws_acm_certificate.customer` panics on `domains[0]` when the shared module is called with an empty `domains` list**

`aws_acm_certificate.customer` iterates `local.customer_groups` (all customer-kind groups) and accesses `each.value.domains[0]`. The shared module's `variables.tf` has no validation requiring at least one domain for `customer` kind — only the ECS-service-level caller enforces that constraint. A module consumer that bypasses or precedes that guard will hit an index-out-of-bounds error at plan time with no useful message. Adding a `validation` block on `var.cert_groups` in the shared module directly would make the invariant self-contained.

Reviews (3): Last reviewed commit: "feat(cert_groups): TF passes cert_group_..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

siddhsuresh and others added 6 commits May 20, 2026 04:54
…l plane

Adds the Terraform module surface that's the customer-facing entry point
for the Ravion domain control plane. Matches the V2 (DI) shape — three
separate resources instead of the old bundled `ravion_domain`.

compute/ecs_cluster/ravion_domains.tf:
- ravion_domain.cluster              — wildcard allocation (*.<fqdn>)
- aws_acm_certificate.cluster        — wildcard cert in customer's AWS
- ravion_dns_records.cluster_validation — ACM validation CNAMEs into
                                          Ravion's Route53 (sync)
- ravion_dns_records.cluster_routing — ALIAS to the cluster ALB
- aws_acm_certificate_validation     — blocks until ACM verifies (~30s)
- ravion_managed_certificate.cluster — UI metadata sink

compute/ecs_service/ravion_domains.tf:
- ravion_domain.auto              — child allocation under cluster
- ravion_dns_records.auto_routing — service-leaf ALIAS to cluster ALB
- aws_lb_listener_rule.ravion     — host-header rule on cluster's HTTPS
                                    listener. No per-service cert —
                                    cluster wildcard covers via SNI.

Cluster outputs four values the service module wires up to:
- ravion_cluster_domain_allocation_id  (the SNI parent)
- ravion_cluster_managed_domain_id
- ravion_cluster_fqdn
- ravion_cluster_certificate_arn

Service variables (all defaulted null — module is opt-in):
- ravion_dns_zone_id
- ravion_parent_domain_allocation_id
- ravion_cluster_alb_dns_name / _zone_id / _https_listener_arn
- ravion_service_slug / _listener_rule_priority

versions.tf in both modules declares `ravion = ravion.com/ravion/domains
>= 1.0.0`. terraform fmt clean.

All AWS resources live in the customer's account; Ravion holds zero
customer credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute/ecs_cluster/module.yml gets the certificate source toggle —
"Ravion managed (wildcard cert + FQDN auto-allocated)" vs
"Bring your own ACM cert". Ravion-managed mode reveals a DNS Zone
dropdown (populated via \$values:ravion/dns-zones — picks the
platform-managed Ravion apex or any customer-owned zone) and an
optional FQDN slug.

When the cluster is in Ravion-managed mode:
- ravion_domain.cluster allocates `*.<slug>-<hash>.<apex>`
- aws_acm_certificate.cluster issues the wildcard cert
- ravion_dns_records.cluster_validation lands the ACM validation
  CNAMEs in the configured zone (synchronous via the
  RavionRoute53Writer)
- aws_acm_certificate_validation blocks ~30s on ACM
- The validated cert ARN gets prepended to the ALB listener's
  certificate_arns (default cert); customer BYO arns can still be
  added for SNI on custom domains
- ravion_managed_certificate.cluster records the metadata for the UI

compute/ecs_cluster outputs add:
  ravion_managed_domains_enabled (bool)  ← gates the service module
  ravion_dns_zone_id
  ravion_cluster_domain_allocation_id    ← parent for service alloc
  ravion_cluster_managed_domain_id
  ravion_cluster_fqdn
  ravion_cluster_certificate_arn

compute/ecs_service:
- New `ravion_domains` variable (list of slugs, DNS-validated).
- ravion_domains.tf now iterates the list — one ravion_domain +
  ravion_dns_records + aws_lb_listener_rule per entry, all keyed by
  toset(domains) so add/remove is a clean diff.
- Listener-rule priorities derived deterministically from
  sha256("\${var.name}:\${domain}") so two services in the same cluster
  never collide. Caller can pin a base via
  ravion_listener_rule_priority_base.
- Empty list = service is reachable via the cluster's apex wildcard
  only; no per-service FQDN. Same gating as before
  (ravion_parent_domain_allocation_id null/empty).
- Outputs: ravion_domain_fqdns map + ravion_domain_allocation_ids map.

No per-service ACM cert — the cluster's wildcard covers every FQDN
allocated under it via SNI.

(api-go \$values endpoint lives in the matching commit on the
flightcontrol branch.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores the validation conditions in compute/{autoscaling,ecs_cluster,ecs_service}
and networking/{alb,nlb} variables.tf to their pre-03f87c3 state. The other
changes in 03f87c3 (Ravion module.yml + listener wiring fixes) remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@siddhsuresh siddhsuresh marked this pull request as ready for review May 20, 2026 21:08
…integration

- Added support for the new Ravion DNS provider in both cluster and service modules.
- Replaced `ravion_dns_zone_id` with `ravion_dns_provider_id` and `ravion_dns_provider_given_id` for improved flexibility.
- Updated local variables and data sources to handle DNS provider lookups and routing logic.
- Enhanced routing record management for Route53 and Cloudflare based on the selected DNS provider.
- Adjusted ACM certificate issuance and validation processes to accommodate the new provider structure.
- Updated module documentation and variable descriptions for clarity on the new configuration options.
@siddhsuresh
Copy link
Copy Markdown
Author

@greptile

Comment thread compute/ecs_service/ravion_domains.tf Outdated
Comment on lines +38 to +45
# same cluster don't collide on listener-rule priority. SHA-256 →
# 16-bit hex digest → mod 49000 + offset 1000 to stay clear of the
# lower reserved range. When the caller pins a base, sort the domain
# slugs and assign +0, +1, +2, ... so re-applies stay stable.
ravion_sorted_domains = sort(var.ravion_domains)
ravion_priority_for_domain = {
for idx, d in local.ravion_sorted_domains :
d => (
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Hash-based priority can collide, causing apply failures

substr(sha256(...), 0, 4) produces a 16-bit value (0–65535); % 49000 maps it to 0–48999. Because 65536 > 49000, the distribution is non-uniform (values 0–16535 appear twice), and — more critically — any two (var.name, domain-slug) combinations that hash to the same 4-hex prefix produce an identical listener-rule priority. When a second service hits the same priority on the same ALB listener, AWS rejects the aws_lb_listener_rule create with a DuplicatePriority error. The birthday bound is ~√(49000) ≈ 221 unique combinations before a 50% collision probability, but even a handful of services sharing a large domain list creates real risk at cluster scale.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ravion_domains.tf
Line: 38-45

Comment:
**Hash-based priority can collide, causing apply failures**

`substr(sha256(...), 0, 4)` produces a 16-bit value (0–65535); `% 49000` maps it to 0–48999. Because 65536 > 49000, the distribution is non-uniform (values 0–16535 appear twice), and — more critically — any two `(var.name, domain-slug)` combinations that hash to the same 4-hex prefix produce an identical listener-rule priority. When a second service hits the same priority on the same ALB listener, AWS rejects the `aws_lb_listener_rule` create with a `DuplicatePriority` error. The birthday bound is ~√(49000) ≈ 221 unique combinations before a 50% collision probability, but even a handful of services sharing a large domain list creates real risk at cluster scale.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 1392 to 1408
validation {
condition = var.instance_maintenance_policy == null || (
condition = try(
coalesce(var.instance_maintenance_policy.min_healthy_percentage, 90) >= 0 &&
coalesce(var.instance_maintenance_policy.min_healthy_percentage, 90) <= 100
coalesce(var.instance_maintenance_policy.min_healthy_percentage, 90) <= 100,
true
)
error_message = "The min_healthy_percentage must be between 0 and 100."
}

validation {
condition = var.instance_maintenance_policy == null || (
condition = try(
coalesce(var.instance_maintenance_policy.max_healthy_percentage, 120) >= 100 &&
coalesce(var.instance_maintenance_policy.max_healthy_percentage, 120) <= 200
coalesce(var.instance_maintenance_policy.max_healthy_percentage, 120) <= 200,
true
)
error_message = "The max_healthy_percentage must be between 100 and 200."
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 try() without null guard silently swallows type errors

The previous var.instance_maintenance_policy == null || (condition) pattern was dropped; the refactored version relies entirely on try() to catch null-object access. While this works for the null case, try() returns true for any error — including a type mismatch where min_healthy_percentage or max_healthy_percentage is passed as a non-numeric value. Such inputs would silently pass validation instead of emitting the error_message. The same concern applies to the analogous try() wrapping used across compute/ecs_service/variables.tf and compute/ecs_cluster/variables.tf.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/autoscaling/variables.tf
Line: 1392-1408

Comment:
**`try()` without null guard silently swallows type errors**

The previous `var.instance_maintenance_policy == null || (condition)` pattern was dropped; the refactored version relies entirely on `try()` to catch null-object access. While this works for the null case, `try()` returns `true` for *any* error — including a type mismatch where `min_healthy_percentage` or `max_healthy_percentage` is passed as a non-numeric value. Such inputs would silently pass validation instead of emitting the `error_message`. The same concern applies to the analogous `try()` wrapping used across `compute/ecs_service/variables.tf` and `compute/ecs_cluster/variables.tf`.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread compute/ecs_cluster/ravion_domains.tf Outdated
}
]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant && var.enable_public_alblocal.enable_ravion_domain already requires var.enable_public_alb (defined in locals.tf), so this term is always true when the outer local is true and redundant when it's false.

Suggested change
count = local.enable_ravion_domain ? 1 : 0
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_cluster/ravion_domains.tf
Line: 68

Comment:
Redundant `&& var.enable_public_alb``local.enable_ravion_domain` already requires `var.enable_public_alb` (defined in `locals.tf`), so this term is always `true` when the outer local is `true` and redundant when it's `false`.

```suggestion
  count             = local.enable_ravion_domain ? 1 : 0
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread compute/ecs_cluster/module.yml Outdated
ravion_cluster_managed_domain_id: string
ravion_cluster_fqdn: string
ravion_cluster_certificate_arn: string
ravion_dns_zone_id: string
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ravion_dns_zone_id declared but never defined

module.yml advertises ravion_dns_zone_id: string as a module output, but no matching output "ravion_dns_zone_id" block exists in outputs.tf. Any caller that references module.ecs_cluster.ravion_dns_zone_id will fail with This object does not have an attribute named "ravion_dns_zone_id" during plan. The output block that was likely intended is ravion_dns_provider_id (which is defined in outputs.tf but not listed in module.yml), suggesting these two were renamed out of sync.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_cluster/module.yml
Line: 539

Comment:
**`ravion_dns_zone_id` declared but never defined**

`module.yml` advertises `ravion_dns_zone_id: string` as a module output, but no matching `output "ravion_dns_zone_id"` block exists in `outputs.tf`. Any caller that references `module.ecs_cluster.ravion_dns_zone_id` will fail with `This object does not have an attribute named "ravion_dns_zone_id"` during plan. The output block that was likely intended is `ravion_dns_provider_id` (which *is* defined in `outputs.tf` but not listed in `module.yml`), suggesting these two were renamed out of sync.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread compute/ecs_service/ravion_domains.tf Outdated
Comment on lines +54 to +60
resource "ravion_domain" "this" {
for_each = local.ravion_domain_set

dns_provider_id = var.ravion_dns_provider_id
slug = each.value
parent_domain_allocation_id = var.ravion_parent_domain_allocation_id
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 given_id path silently leaves dns_provider_id null on domain resources

ravion_domain.this only passes var.ravion_dns_provider_id, not the given_id alternative. When a caller wires up the service using only ravion_dns_provider_given_id, dns_provider_id is null on every allocated domain resource while the data source lookup still succeeds — the domain allocation will be sent to the API without a provider reference. The cluster module resolved this by outputting ravion_dns_provider_id (the resolved opaque id), but there is no guard that validates the opaque id is actually present before creating domain resources.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ravion_domains.tf
Line: 54-60

Comment:
**`given_id` path silently leaves `dns_provider_id` null on domain resources**

`ravion_domain.this` only passes `var.ravion_dns_provider_id`, not the `given_id` alternative. When a caller wires up the service using only `ravion_dns_provider_given_id`, `dns_provider_id` is null on every allocated domain resource while the data source lookup still succeeds — the domain allocation will be sent to the API without a provider reference. The cluster module resolved this by outputting `ravion_dns_provider_id` (the resolved opaque id), but there is no guard that validates the opaque id is actually present before creating domain resources.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread compute/ecs_cluster/ravion_domains.tf Outdated
Comment on lines +191 to +192
ttl = 1 # 1 == automatic, required when proxied = true; ALB doesn't accept CF proxying for arbitrary L7 so leave proxied false + ttl=60.
proxied = false
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The ttl = 1 value and its inline comment are contradictory: 1 is Cloudflare's "automatic" TTL that is only meaningful when proxied = true. With proxied = false (the current value), a concrete TTL such as 60 seconds should be used — the same value the Ravion metadata record below uses for consistency.

Suggested change
ttl = 1 # 1 == automatic, required when proxied = true; ALB doesn't accept CF proxying for arbitrary L7 so leave proxied false + ttl=60.
proxied = false
ttl = 60
proxied = false
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_cluster/ravion_domains.tf
Line: 191-192

Comment:
The `ttl = 1` value and its inline comment are contradictory: `1` is Cloudflare's "automatic" TTL that is only meaningful when `proxied = true`. With `proxied = false` (the current value), a concrete TTL such as 60 seconds should be used — the same value the Ravion metadata record below uses for consistency.

```suggestion
  ttl     = 60
  proxied = false
```

How can I resolve this? If you propose a fix, please make it concise.

Comment thread compute/ecs_service/ravion_domains.tf Outdated
Comment on lines +43 to +50
ravion_priority_for_domain = {
for idx, d in local.ravion_sorted_domains :
d => (
var.ravion_listener_rule_priority_base > 0
? var.ravion_listener_rule_priority_base + idx
: (parseint(substr(sha256("${var.name}:${d}"), 0, 4), 16) % 49000) + 1000
)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Hash-based priority space is too small for collision-free allocation

The expression (parseint(substr(sha256("${var.name}:${d}"), 0, 4), 16) % 49000) + 1000 uses only the first 4 hex characters (16-bit, 0–65535 range), then takes % 49000, collapsing to 49,000 buckets. By the birthday problem, the probability of at least one collision reaches ~50% around 261 (service, domain) pairs sharing the same ALB listener. A collision causes aws_lb_listener_rule creation to fail with a duplicate-priority error that is silent until apply time.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ravion_domains.tf
Line: 43-50

Comment:
**Hash-based priority space is too small for collision-free allocation**

The expression `(parseint(substr(sha256("${var.name}:${d}"), 0, 4), 16) % 49000) + 1000` uses only the first 4 hex characters (16-bit, 0–65535 range), then takes `% 49000`, collapsing to 49,000 buckets. By the birthday problem, the probability of at least one collision reaches ~50% around 261 (service, domain) pairs sharing the same ALB listener. A collision causes `aws_lb_listener_rule` creation to fail with a duplicate-priority error that is silent until apply time.

How can I resolve this? If you propose a fix, please make it concise.

siddhsuresh and others added 19 commits May 21, 2026 02:48
…ard)

The cluster wildcard cert covers every FQDN under the cluster apex
via SNI, but services that need FQDNs outside that apex (multi-zone
setups, vanity hostnames) had no way to get their own cert. Adds
opt-in certificate groups.

variables.tf:
  Adds var.ravion_certificate_groups — list of objects carrying:
    name                   — stable per-service identifier
    dns_provider_id (opt)  — per-group provider override, falls back
    dns_provider_given_id  — to the service-level provider when null
    domains                — up to 10 slugs (ACM default SAN limit;
                              validated; per-domain length + charset)
  Group names + domain slugs are validated; duplicate group names and
  groups with > 10 domains are rejected at plan time.

data.tf:
  Per-group ravion_dns_provider lookup so each group can target a
  different zone than the cluster. Falls back to the service-level
  provider when the group's overrides are null.

ravion_domains.tf:
  One ACM cert per group (primary + SANs derived from
  ravion_domain[group/slug].fqdn). Per-variant validation writers
  (route53_ravion / route53 / cloudflare) keyed by precomputed
  per-variant subsets of the (group, domain) pair map. ONE
  ravion_managed_certificate registration per group, scoped SERVICE.
  Per-domain routing record + aws_lb_listener_rule with
  deterministic priorities seeded "g:" to avoid collision with the
  existing ungrouped ravion_domains priority space. The cluster's
  HTTPS listener picks up each group's cert via
  aws_lb_listener_certificate; SNI handshake selects the most
  specific match at request time.
Auto-mode zero-config URLs:
  Cluster: var.use_ravion_subdomain (default true). Looks up platform
    DnsProvider via stable given_id="ravion-platform-apex"; posts
    fqdn_override="<module_instance_id>.<apex>"; wildcard cert covers
    *.<module_instance_id>.<apex>. var.module_instance_id is the
    runner-injected identifier (null in standalone use → slug mode).
  Service: var.ravion_auto_subdomain (default true) + service_given_id.
    Allocates one ravion_domain with slug=service_given_id +
    parent_domain_allocation_id=cluster's allocation. Server derives
    <given-id>-<random>.<cluster-fqdn>. Per-variant routing record +
    listener rule for the auto allocation, same dispatch as the per-
    domain blocks (route53_ravion / route53 / cloudflare).

Module form (cluster):
  use_ravion_subdomain boolean at the top of Managed Domains.
  ravion_dns_provider_id moved to show_when use_ravion_subdomain=false
    + sourced from $values:ravion/dns-providers-customer (no platform
    row in the dropdown — auto-mode handles that).
  ravion_dns_provider_given_id stays as the secondary input under the
    same false-gate.
Q3: service_given_id renamed to module_instance_given_id (matches
the runner-injected TF_VAR_module_instance_given_id env var).

Q4: cert groups drop the per-row use_ravion_subdomain toggle. Each
group MUST specify its own customer-owned DNS provider; domains are
full FQDNs posted via fqdn_override (validated under the provider's
apex server-side). No fallback to the platform apex.

Q7: cluster auto-mode local strips trailing dot from the resolved
DnsProvider.domain_name so the literal FQDN written to state is
canonical (no `<id>.example.com.`).
Replace stale ravion_dns_zone_id with ravion_dns_provider_id to match
the actual output defined in outputs.tf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rovider_id

Add lifecycle preconditions on ravion_auto_label and ravion_auto_auto
domain resources so that a missing/empty ravion_dns_provider_id fails
loudly at plan time instead of producing allocations with a null
provider reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ecs_service/ravion_domains.tf became 480+ lines of dispatch logic; the
same logic needs to be reused by ecs_cluster and future static_site
modules. Move it into modules/_shared/ravion_cert_groups/, parameterised
by routing-target (ALB DNS+zone+listener+target-group) so the same
module instantiates cleanly for both ECS and CloudFront-fronted callers
down the line.

ecs_service now declares a single `module "ravion_cert_groups"` block
and forwards its var.ravion_certificate_groups plus the wiring fields.
Behavior identical (2 kinds: ravion_auto + customer); plan is a no-op
when state is migrated with `terraform state mv`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ttl=1 is Cloudflare's "automatic" sentinel and is only meaningful when
proxied=true. With proxied=false the value gets coerced server-side
to a default that drifts plans on every apply. Match the explicit 60
the sibling ravion_dns_records.cluster_routing_metadata_cf already uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
try(..., true) returns true on ANY error in the inner expression —
including attribute typos — which silently passes validation. Replace
with the explicit null-guard pattern (var.X == null || <check>) that's
short-circuit safe and surfaces real expression bugs.

Two nlb conditions (dns_record_client_routing_policy +
enforce_security_group_inbound_rules_on_private_link_traffic) gained
the null guard since their contains() call would otherwise null-deref
on the default.

Files: autoscaling, ecs_cluster, ecs_service, networking/alb,
networking/nlb. 68 conditions cleaned up. tofu validate passes on all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t mode)

Replaces the single-wildcard cluster model with a list of cert groups,
each issuing ONE wildcard ACM cert + ONE parent DomainAllocation.
Services nest leaf FQDNs under any of these via their own cert groups
(kind = cluster_wildcard, cluster_group_name = "<name>", coming in
the next commit).

Shared cert-groups module gains:
- `var.mode = leaf | parent` discriminator
- `var.platform_apex_provider_given_id` + `var.module_instance_id`
- `parent.tf` — per-group wildcard cert + Ravion-managed parent alloc
  for ravion_auto kind, customer-supplied wildcard_fqdn for customer
  kind. Cert ARN + parent allocation id exposed via `parent_groups`
  output. SNI attach is the parent's job (cluster threads it through
  the ALB module's certificate_arns list).
- `cloudflare_api_token` output so cluster's provider block can source
  the token from cert-group rows rather than a deleted top-level
  data source.

Cluster module:
- Drops `var.use_ravion_subdomain`, `var.ravion_dns_provider_id`,
  `var.ravion_dns_provider_given_id`, `var.ravion_cluster_slug`,
  the single `ravion_domain.cluster` resource, and all
  per-variant validation/routing blocks. The shared module handles
  the new shape end-to-end.
- New `ravion_certificate_groups` variable + listener cert list
  concat (BYO ARNs first, cert-group certs as SNI extras).
- Outputs collapse to a single `ravion_certificate_groups` map.

`tofu validate` passes on both cluster + service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Service cert groups rename `ravion_auto` kind → `cluster_wildcard` and
add a required `cluster_group_name` field that picks WHICH cluster
wildcard to nest under. Multiple wildcards per cluster (now possible
after commit 921e2c2) means services must be explicit instead of
implicitly picking the cluster's only allocation.

Shared module:
- Adds `var.cluster_groups` (map keyed by cluster cert-group name)
  consumed only by leaf-mode cluster_wildcard groups.
- Drops the old `var.ravion_parent_domain_allocation_id` /
  `var.ravion_dns_provider_id` (replaced by per-cluster-group lookup).
- `parent_groups` output now also exposes `dns_provider_id` so leaf
  allocations attach to the correct provider.
- Renames the leaf resource blocks: ravion_auto_label →
  cluster_wildcard_label, ravion_auto_auto → cluster_wildcard_auto.
- 32-bit hash retained on listener-rule priorities.

ecs_service:
- New `var.ravion_cluster_certificate_groups` (map output from cluster).
- `ravion_certificate_groups` kind enum: `cluster_wildcard | customer`,
  plus per-row `cluster_group_name`. Validations enforce the wire-up.
- Drops dead var.ravion_dns_provider_id +
  var.ravion_parent_domain_allocation_id (now sourced via the map).
- locals.tf + data.tf stripped of dead dns_provider_lookup_key block.
- provider.tf cloudflare api_token now sourced from the shared module.

`tofu validate` passes on cluster + service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch deferred)

Form schema now accepts kind=external on both cluster + service cert
groups. TF dispatch is NOT implemented yet — a `check` block fails
the plan with a clear message when external is used. Reason: the
Ravion TF provider needs `is_external` on ravion_domain (so
dns_provider_id can be omitted) before we can model the metadata-only
allocation. That extension lives in packages/terraform-provider-domains
and ships as its own commit chain.

Adding the kind to the validation enum NOW lets the api-go YAML
publish + UI ship the radio option without churning the schema later.
Operators who pick external will see a clear plan-time error
referencing the README pointing at the follow-up work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atform hashes

These were generated on darwin_arm64 (local validate) but the runner is
linux_amd64 → tofu init fails with "doesn't match any of the checksums
previously recorded in the dependency lock file" because no linux_amd64
hash is recorded.

Quickest unblock: drop the files so `tofu init` regenerates a fresh
lock on each runner. Follow-up: regenerate cross-platform locks via
`tofu providers lock -platform=linux_amd64 -platform=darwin_arm64`
and re-commit so reproducibility is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lookup

Plan failed with "missing lookup key — Provide either id or given_id"
because the for_each iterated EVERY cert group including ravion_auto
(which uses data.platform_apex), cluster_wildcard (which inherits from
var.cluster_groups), and external (no provider row at all). Restrict
the for_each to kind=customer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan failed with "Invalid for_each argument — local.<map> will be known
only after apply" because parent_validation_pairs_route53_* keyed each
entry on `<group_name>/<opt.domain_name>` where opt.domain_name is the
ACM cert's validation domain (known only after apply). TF can't iterate
a map whose keys are unknown at plan time.

Restructure: each parent group has ONE wildcard cert → ONE validation
record. Key the per-variant subsets on group_name (known at plan time)
and dereference `aws_acm_certificate.parent[name].domain_validation_options[0]`
inside the resource body where unknown-at-plan VALUES are fine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`local.has_listener = var.listener_arn != null && var.listener_arn != ""`
is unknown at plan time when the ALB listener is `(known after apply)`
on first apply. The ternary `unknown ? local.X : {}` makes TF reject
the for_each even when local.X is statically empty (parent mode w/ no
customer groups, etc).

Iterate the static maps directly. The listener_arn/target_group_arn
references inside resource bodies are fine when unknown — TF plans
them as `(known after apply)`. If the listener is genuinely null at
apply time, those resources fail with a clear AWS error rather than
being silently skipped (the gate was footgun-y anyway).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n_dns_records

DnsRecordBatchCreate looks up by allocation id (dalc_*); passing a
managed_domain_id (mdom_*) errors with ALLOCATION_NOT_FOUND. The
existing customer leaf code already uses `ravion_domain.X.id`
(allocation id) — parent.tf incorrectly used `.managed_domain_id`.
Match the working pattern.

The `managed_domain_id` argument name in the ravion provider is a
misnomer (it predates the rename) — actual semantics is "the
DomainAllocation this record set belongs to".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CloudflareWriter on the api-go side now handles real cloudflare-go/v6
SDK writes when ravion_dns_records resources land. Replace the
HCL cloudflare_dns_record + ravion_dns_records-metadata-mirror pair
with a single ravion_dns_records call that triggers the SDK write
server-side.

Three pairs collapse:
  parent_validation:    cloudflare_dns_record.parent_validation_cf       + ravion_dns_records.parent_validation_metadata_cf       → ravion_dns_records.parent_validation_cf
  customer_validation:  cloudflare_dns_record.customer_validation_cf     + ravion_dns_records.customer_validation_metadata_cf     → ravion_dns_records.customer_validation_cf
  customer_routing:     cloudflare_dns_record.customer_routing_cf        + ravion_dns_records.customer_routing_metadata_cf        → ravion_dns_records.customer_routing_cf

aws_acm_certificate_validation references point at the renamed
ravion_dns_records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
siddhsuresh and others added 4 commits May 21, 2026 16:08
The user's headline complaint: every cluster's tofu init paid the
Cloudflare provider plugin download + load even when the customer
was a Route53-only shop. With CloudflareWriter now wired server-side
in api-go (see ravion-monorepo commits feat(domains)...), all
Cloudflare DNS writes flow through the ravion_dns_records resource +
the api-go's cloudflare-go/v6 SDK. The HCL provider is not needed.

Dropped:
  - cluster + service versions.tf: cloudflare required_provider block
  - cluster + service provider.tf: provider "cloudflare" {...} block +
    api_token = module.ravion_cert_groups.cloudflare_api_token wiring
  - _shared/ravion_cert_groups/versions.tf: cloudflare required_provider
  - _shared/ravion_cert_groups/outputs.tf: cloudflare_api_token output
    (no consumers after the provider blocks went away)

tofu validate passes on cluster + service. hashicorp/aws stays
(still needed for ALB/ACM/aws_route53_record); ravion provider stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…up_name

Per the domain/infra separation rule:
  service cert-group kind: cluster_wildcard → inherit
  service field:           cluster_group_name → parent_group_name
  service input var:       ravion_cluster_certificate_groups → ravion_parent_certificate_groups
  shared module locals/resources: cluster_wildcard_label → inherit_label, etc.
  tag value:               ravion:kind = inherit

The reason: the inherit pattern (leaf nests under an upstream parent
cert) isn't cluster-specific. Future callers like a CloudFront /
Lambda module will provide the same parent_groups output, and the
service's inherit kind already works for them once names don't bake
"cluster" into the wire-level model.

User-facing form labels can still say "Inherit a cluster wildcard"
where that's the dominant use case — wire-level kinds stay generic.

tofu validate passes on cluster + service. Go build green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were locally generated by tofu init during validation. We
deleted them earlier (8ff8246) because they only had darwin_arm64
hashes which broke the linux_amd64 runner. Stay out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ry ravion_domain

Every ravion_domain in the shared module now sets the two
denormalisation attributes (added in ravion-provider v2.2.0) so the
Domains tab can render groups by joining ManagedDomain rows directly:

  parent.tf:
    ravion_domain.parent_ravion_auto → kind=ravion_auto
    ravion_domain.parent_customer    → kind=customer
  main.tf:
    ravion_domain.inherit_label / inherit_auto → kind=inherit
    ravion_domain.customer                     → kind=customer

Group name = the operator's cert_group.name in every case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@siddhsuresh
Copy link
Copy Markdown
Author

@greptile

Comment on lines +552 to +557
ravion_managed_domains_enabled: boolean
ravion_cluster_domain_allocation_id: string
ravion_cluster_managed_domain_id: string
ravion_cluster_fqdn: string
ravion_cluster_certificate_arn: string
ravion_dns_provider_id: string
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Six new module.yml outputs are never defined in outputs.tf

module.yml now declares ravion_managed_domains_enabled, ravion_cluster_domain_allocation_id, ravion_cluster_managed_domain_id, ravion_cluster_fqdn, ravion_cluster_certificate_arn, and ravion_dns_provider_id as outputs, but none of these have a corresponding output block in outputs.tf. The only block that was actually added there is ravion_certificate_groups, which is itself absent from module.yml. Any caller (or the Ravion UI) that references any of these six advertised outputs will fail at plan time with This object does not have an attribute named "...". The output that IS defined (ravion_certificate_groups) should be listed here, and the six phantom entries should either be removed or backed by real output blocks.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_cluster/module.yml
Line: 552-557

Comment:
**Six new `module.yml` outputs are never defined in `outputs.tf`**

`module.yml` now declares `ravion_managed_domains_enabled`, `ravion_cluster_domain_allocation_id`, `ravion_cluster_managed_domain_id`, `ravion_cluster_fqdn`, `ravion_cluster_certificate_arn`, and `ravion_dns_provider_id` as outputs, but none of these have a corresponding `output` block in `outputs.tf`. The only block that was actually added there is `ravion_certificate_groups`, which is itself absent from `module.yml`. Any caller (or the Ravion UI) that references any of these six advertised outputs will fail at plan time with `This object does not have an attribute named "..."`. The output that IS defined (`ravion_certificate_groups`) should be listed here, and the six phantom entries should either be removed or backed by real output blocks.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +53 to +56
inherit_groups = local.leaf_mode ? {
for g in var.cert_groups :
g.name => g if g.kind == "inherit" && contains(keys(var.cluster_groups), coalesce(g.parent_group_name, ""))
} : {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 inherit group silently vanishes when parent_group_name is not in var.cluster_groups

The contains(keys(var.cluster_groups), coalesce(g.parent_group_name, "")) guard silently drops any inherit group whose parent_group_name doesn't match an entry in the map — no ravion_domain, no aws_lb_listener_rule, no error. A typo in the group name (e.g. "default" vs "Default") produces a successful apply with zero domain allocations and the service silently gets no URL. The ECS-service-level validation only checks that parent_group_name is non-empty, not that it resolves. A check block or precondition that asserts every inherit group's parent_group_name exists in var.cluster_groups would surface the failure early.

Prompt To Fix With AI
This is a comment left during a code review.
Path: _shared/ravion_cert_groups/main.tf
Line: 53-56

Comment:
**`inherit` group silently vanishes when `parent_group_name` is not in `var.cluster_groups`**

The `contains(keys(var.cluster_groups), coalesce(g.parent_group_name, ""))` guard silently drops any `inherit` group whose `parent_group_name` doesn't match an entry in the map — no `ravion_domain`, no `aws_lb_listener_rule`, no error. A typo in the group name (e.g. `"default"` vs `"Default"`) produces a successful apply with zero domain allocations and the service silently gets no URL. The ECS-service-level validation only checks that `parent_group_name` is non-empty, not that it resolves. A `check` block or `precondition` that asserts every `inherit` group's `parent_group_name` exists in `var.cluster_groups` would surface the failure early.

How can I resolve this? If you propose a fix, please make it concise.

- drop cert_group_name/cert_group_kind from ravion_domain calls
- add name + kind to every ravion_managed_certificate (customer, parent, plus new placeholder rows for inherit + external)
- external groups: synthetic external:<name>:<group> ARN + PENDING status anchor the cert-group identity even when Ravion provisions no real cert
- inherit groups: synthetic inherit:<name>:<group> placeholder cert with WILDCARD scope so the Domains tab projector groups inherit MDs under their named cert group
- external_dns_provider_given_id variable + data lookup for the synthetic per-org EXTERNAL DnsProvider
…ispatch

- customer_validation_pairs now keyed by static (group, domain) instead of post-apply domain_validation_options, so for_each keys are known at plan time
- validation/routing resources resolve the per-domain validation option inside the resource body via list-filter on the cert's domain_validation_options (terraform tolerates unknown body values, only the for_each keys must be static)
- customer_provider_kind + parent_provider_kind locals encode the provider variant as a plain string via nonsensitive() so the cloudflare api_token sensitivity marker no longer taints for_each filter conditions
- all filter conditions switched from `provider.X != null` to `kind == "X"` so cloudflare/route53/route53_ravion dispatch survives plan
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant