docs: move AWS DR guide to dedicated subpage by hanzei · Pull Request #8893 · mattermost/docs

hanzei · 2026-04-15T11:04:52Z

Summary

The old page was about 90% AWS guide. I've split it into two: a general overview page on how we do DR at Mattermost and a subpage for the AWS guide. I've also added clarifications to the main page on what is and isn't supported.

AI Summary

- Extracts the AWS-specific active/passive DR deployment steps from backup-disaster-recovery.rst into a new disaster-recovery-aws.rst subpage
- The main page now links to it via toctree, keeping the overview concise and making room for future platform-specific guides
- Adds a note clarifying that Mattermost does not support active/active DR deployments (already landed on master)

Preview

http://mattermost-docs-preview-pulls.s3-website-us-east-1.amazonaws.com/8893/deployment-guide/backup-disaster-recovery.html#

🤖 Generated with Claude Code

…recovery

Distinguish high availability (single-site clustering) from disaster recovery (multi-site failover), clarify that Mattermost supports active/passive DR only and does not support active/active deployments, and rename the "High Availability deployment" section to "Active/passive DR deployment" for accuracy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extract the AWS-specific active/passive DR deployment steps from backup-disaster-recovery.rst into a new disaster-recovery-aws.rst subpage. The main page now links to it via toctree, keeping the overview page concise and making room for future platform-specific guides.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-04-15T11:07:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a52bfe8-a743-411f-9bd3-65ee08fd40b3

📥 Commits

Reviewing files that changed from the base of the PR and between 7fd420e and 995626c.

📒 Files selected for processing (2)

source/deployment-guide/backup-disaster-recovery.rst
source/deployment-guide/disaster-recovery-aws.rst

✅ Files skipped from review due to trivial changes (1)

source/deployment-guide/disaster-recovery-aws.rst

📝 Walkthrough

Documentation restructured: the main disaster recovery guide was simplified to distinguish HA vs DR and introduce an active/passive DR approach; detailed AWS-specific active/passive DR implementation moved into a new platform-specific guide with end-to-end replication and failover steps.

Changes

Cohort / File(s)	Summary
General DR Guide `source/deployment-guide/backup-disaster-recovery.rst`	Removed prior AWS-specific step-by-step HA/DR content and images; normalized headings; replaced HA section with an Active/passive DR overview and added a toctree entry pointing to the AWS-specific DR guide.
AWS Platform-Specific DR Guide `source/deployment-guide/disaster-recovery-aws.rst`	Added new comprehensive AWS active/passive DR doc covering prerequisites, RDS Aurora global cluster setup (with write forwarding caveats), S3 replication configuration, OpenSearch cross-cluster replication (pull model, FGAC, IAM requirements), job server scheduler handling, failover switchover and OpenSearch replication reversal steps, testing, and DNS/app node guidance.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant DNS as DNS
    participant Primary as PrimaryRegion\n(App, RDS, S3, OpenSearch)
    participant Secondary as SecondaryRegion\n(App replica, RDS replica, S3 replica, OpenSearch replica)
    participant JobServer as JobServer

    User->>DNS: Resolve app endpoint
    DNS->>Primary: Direct traffic to Primary App nodes
    User->>Primary: App requests (reads/writes)
    Primary->>RDS: DB writes/reads
    Primary->>S3: Object writes (replicated)
    Primary->>OpenSearch: Index writes (replicated)

    Note over Primary,Secondary: Continuous replication configured\n(RDS global cluster, S3 replication, OpenSearch CCR)

    %% Failover initiation
    alt Primary failure detected
        Admin->>DNS: Switch endpoint to Secondary
        DNS->>Secondary: Route users to Secondary App nodes
        Admin->>RDS: Promote secondary as writer
        Admin->>JobServer: Disable scheduler on Secondary until failover complete
        Admin->>OpenSearch: Reverse replication direction, make indices writable
        Secondary->>S3: Accept replicated objects / sync
        Admin->>JobServer: Enable scheduler on Secondary
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: extracting AWS-specific disaster recovery content into a dedicated subpage, which aligns with the substantial restructuring shown in the changeset.
Description check	✅ Passed	The description clearly explains the purpose of the PR—splitting the monolithic page into a general overview and AWS-specific subpage—and relates directly to the changeset modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

source/deployment-guide/disaster-recovery-aws.rst (2)
95-117: Use json instead of sh for the IAM policy block.

This block is a JSON policy document, not a shell command. Language labelling should match content.
Suggested minimal diff
-  .. code-block:: sh
+  .. code-block:: json
As per coding guidelines, "Require code fences or code directives to identify the language when practical."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 95 - 117, The
code block showing the IAM policy is labeled as a shell snippet ("code-block::
sh") but contains JSON; update the directive to "code-block:: json" so the IAM
policy document is correctly identified and syntax-highlighted; locate the block
that currently begins with code-block:: sh and change that directive to
code-block:: json (the JSON policy object with keys "Version" and "Statement")
to match the content.
7-10: Add a short prerequisites block before procedural steps.

This page jumps into execution quickly. A compact prerequisites list (AWS account access, region pair selected, existing Mattermost primary deployment, DNS ownership, OpenSearch/RDS permissions) would reduce operator error for novice admins.

As per coding guidelines, "List prerequisites clearly at the top of documentation sections."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 7 - 10, Add a
short "Prerequisites" block at the top of the Mattermost AWS disaster recovery
guide (before the procedural steps that start with the current introductory
paragraphs) listing required items: AWS account access and IAM permissions,
chosen region pair for failover, an existing Mattermost primary deployment,
control/ownership of DNS for failover updates, required OpenSearch/RDS
permissions and backups, and any tooling/CLI versions; ensure the block uses a
clear bullet list and a brief note about verifying backups and network
connectivity so novice operators see these checks before executing the
procedure.
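
As a rough illustration of the first nitpick (label a block by what it actually contains), the sketch below guesses a directive language by trying to parse the block body as JSON. The function name `guess_code_block_language` and the fallback to `sh` are assumptions made for this example; they are not part of the docs tooling or of CodeRabbit.

```python
import json


def guess_code_block_language(body: str, fallback: str = "sh") -> str:
    """Return "json" if the block body parses as a JSON object or array, else the fallback."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return fallback
    # Bare scalars such as 42 or "text" also parse as JSON; only treat
    # objects and arrays as JSON documents.
    return "json" if isinstance(parsed, (dict, list)) else fallback


# Hypothetical policy skeleton with the "Version" and "Statement" keys named in the review.
policy = '{"Version": "2012-10-17", "Statement": []}'
print(guess_code_block_language(policy))
print(guess_code_block_language("curl -u '<USERNAME>:<PASSWORD>' 'https://<HOSTNAME>/'"))
```

A check like this only covers the JSON case; anything that fails to parse keeps the fallback label and still needs a human decision.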

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@source/deployment-guide/disaster-recovery-aws.rst`:
- Line 13: Fix the typo in the cross-reference sentence by replacing
"documenation" with "documentation" in the sentence that references the
Upgrading Mattermost in Kubernetes and High Availability Environments doc (the
string containing ":doc:`Upgrading Mattermost in Kubernetes and High
Availability Environments
</administration-guide/upgrade/upgrade-mattermost-kubernetes-ha>`"). Ensure the
corrected sentence reads "...see the ... documentation." and keep the rest of
the cross-reference unchanged.
- Around line 235-237: Duplicate curl command checking the _status for
posts_<DATE> appears twice; remove the redundant line so only one curl -H
'Content-Type: application/json' -u '<USERNAME>:<PASSWORD>'
'https://<HOSTNAME>/_plugins/_replication/posts_<DATE>/_status?pretty' remains,
preserving the Sample output line that follows and keeping steps atomic and
numbered.
- Line 175: Replace the incorrect curl credential separator and clarify the host
placeholder: change the curl -u argument from "username/password" to the
required "username:password" format and update the URL placeholder (e.g., use a
clearer <hostname[:port]> or <elasticsearch-host>) in the example command string
shown in the diff so readers can substitute a real host.

---

Nitpick comments:
In `@source/deployment-guide/disaster-recovery-aws.rst`:
- Around line 95-117: The code block showing the IAM policy is labeled as a
shell snippet ("code-block:: sh") but contains JSON; update the directive to
"code-block:: json" so the IAM policy document is correctly identified and
syntax-highlighted; locate the block that currently begins with code-block:: sh
and change that directive to code-block:: json (the JSON policy object with keys
"Version" and "Statement") to match the content.
- Around line 7-10: Add a short "Prerequisites" block at the top of the
Mattermost AWS disaster recovery guide (before the procedural steps that start
with the current introductory paragraphs) listing required items: AWS account
access and IAM permissions, chosen region pair for failover, an existing
Mattermost primary deployment, control/ownership of DNS for failover updates,
required OpenSearch/RDS permissions and backups, and any tooling/CLI versions;
ensure the block uses a clear bullet list and a brief note about verifying
backups and network connectivity so novice operators see these checks before
executing the procedure.
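
The credential fix from the inline comments can be sketched in Python: `curl -u` expects `username:password`, which is sent as an HTTP Basic `Authorization` header. The sketch builds, but does not send, the status request for one index. Only the `_plugins/_replication/<index>/_status` path comes from the review above; the helper name `replication_status_request` and the host, index, and credentials are placeholders.

```python
import base64
import urllib.request


def replication_status_request(
    hostname: str, index: str, username: str, password: str
) -> urllib.request.Request:
    """Build a GET request equivalent to:
    curl -u 'user:pass' 'https://<host>/_plugins/_replication/<index>/_status?pretty'
    """
    url = f"https://{hostname}/_plugins/_replication/{index}/_status?pretty"
    # Basic auth joins the credentials with a colon, not a slash.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="GET",
    )


# Placeholder values; substitute the real domain endpoint and index name.
req = replication_status_request("search.example.com", "posts_example", "admin", "secret")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen` would perform the call; that step is left out so the example runs without a cluster.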

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca13014e-776c-4c71-a691-88e42d103ad0

📥 Commits

Reviewing files that changed from the base of the PR and between 80e869f and 7fd420e.

📒 Files selected for processing (2)

source/deployment-guide/backup-disaster-recovery.rst
source/deployment-guide/disaster-recovery-aws.rst

github-actions · 2026-04-15T11:08:13Z

Newest code from mattermost has been published to preview environment for Git SHA 7fd420e

- Fix typo: "documenation" → "documentation"
- Fix curl credentials: "username/password" → "<USERNAME>:<PASSWORD>" and empty host placeholder
- Remove duplicate posts_<DATE> status curl command
- Change IAM policy code block language from sh to json
- Add prerequisites section to disaster-recovery-aws.rst
- Wrap HA vs DR explanation in a note directive

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-15T11:52:30Z

Newest code from mattermost has been published to preview environment for Git SHA dde9aab

Change the three SSO failover sub-sections from ~~~~ to ^^^^^ so they render as children of "Failover from Single Sign-On outage" in the sidebar TOC rather than at the same level.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-15T11:58:55Z

Newest code from mattermost has been published to preview environment for Git SHA 995626c

hanzei · 2026-04-15T14:08:22Z

cc @mrckndt

ewwollesen · 2026-04-21T14:53:34Z

+Set up in one data center
+--------------------------
+
+As a first step, set up Mattermost in a single data center. At a very basic high level, this would be something like below:


The phrase "At a very basic high level, this would be something like below" reads a little awkward, imo. Something like "The following diagram illustrates a basic single data center architecture:" would read a bit cleaner.

ewwollesen · 2026-04-21T15:04:26Z

+.. tip::
+
+  All you need is a recent OpenSearch version with fine-grained access control enabled. Node-to-node encryption is automatically enabled once you enable fine-grained access control.
+


"All you need is a recent OpenSearch version with fine-grained access control enabled." This reads a bit weird because the bullet just above it also lists Elasticsearch 7.10 as supported. The tip should either acknowledge both or be scoped appropriately. I get what you're trying to do, and it's probably not a big deal. Maybe reword it to something like "If you are already running OpenSearch 2.x, all you need to do is turn on fine-grained access control; node-to-node encryption is then enabled automatically" so it reads a little more tip-like?

ewwollesen · 2026-04-21T15:26:48Z

+
+For simplicity, let's say ``site1`` is primary, and ``site2`` is secondary. Therefore, OS in ``site1`` is the leader domain, and in ``site2`` is the follower. The follower pulls from the leader. To switch the direction where ``site2`` becomes leader, and ``site1`` becomes follower.
+
+1. Remove the rule from ``site1`` > ``site 2`` in AWS Console. This will auto-pause the replication, but the indices in ``site2`` will still be read-only. Remove the replication rules for that.


"site 2" here versus "site2" everywhere else.
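
One possible reading of "remove the replication rules" for the read-only follower indices is the cross-cluster replication plugin's stop call, which ends replication for an index and turns it into a regular writable index. That reading is an assumption; the quoted step does not name the call. The sketch below only builds the requests as data and sends nothing. The helper name and index names are hypothetical.

```python
import json
from typing import List, Tuple


def stop_replication_requests(follower_indices: List[str]) -> List[Tuple[str, str, str]]:
    """Return (method, path, body) tuples that would stop replication per follower index."""
    return [
        # The stop call takes an empty JSON object as its body.
        ("POST", f"/_plugins/_replication/{index}/_stop", json.dumps({}))
        for index in follower_indices
    ]


# Hypothetical follower indices on site2.
for method, path, body in stop_replication_requests(["posts_example", "channels"]):
    print(method, path, body)
```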

ewwollesen · 2026-04-21T15:29:36Z

+S3 bucket is auto-replicated both ways
+----------------------------------------
+
+There's nothing you need to do to ensure the S3 bucket is auto-replicating both ways.


Giving this section a whole header for just one sentence isn't wrong, but it seems kind of weird. Maybe make it a Note or Tip instead?

ewwollesen · 2026-04-21T15:34:54Z

+.. tip::
+  Websockets will still point to the old data center even if you have switched DNS. You need to roll over each app node gradually to move those connections to the new data center. If all your nodes are down, no action is necessary and the clients will automatically re-connect to the new data center.
+
+The S3 bucket is replicated bi-directionally while the database and ES/OS is replicated uni-directionally.


Should we add a note or section at the end here on what to do when the disaster event is over? Even if it's just "perform these operations the same way to restore functionality back to the primary data center".

ewwollesen · 2026-04-21T15:49:24Z

+----------------------
+
+If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region.
+


Should we tell them how to do this, or are we assuming if they are this far they probably know how to change settings?
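
To make the scheduler step concrete, here is a minimal sketch, assuming each region's configuration is available as a JSON-style dictionary. The key `JobSettings.RunScheduler` comes from the quoted text; the helpers `set_run_scheduler` and `failover` and the in-memory dictionaries are illustrative only, and how the change is actually applied depends on the deployment.

```python
import copy
from typing import Tuple


def set_run_scheduler(config: dict, enabled: bool) -> dict:
    """Return a copy of the config with JobSettings.RunScheduler set to `enabled`."""
    updated = copy.deepcopy(config)
    updated.setdefault("JobSettings", {})["RunScheduler"] = enabled
    return updated


def failover(primary_cfg: dict, secondary_cfg: dict) -> Tuple[dict, dict]:
    """Swap scheduler roles: the old secondary starts scheduling, the old primary stops."""
    new_primary = set_run_scheduler(secondary_cfg, True)
    new_secondary = set_run_scheduler(primary_cfg, False)
    return new_primary, new_secondary


site1 = {"JobSettings": {"RunScheduler": True}}   # primary region before failover
site2 = {"JobSettings": {"RunScheduler": False}}  # secondary region before failover
new_primary, new_secondary = failover(site1, site2)
print(new_primary["JobSettings"]["RunScheduler"], new_secondary["JobSettings"]["RunScheduler"])
```

The helpers return copies rather than mutating their inputs, so the pre-failover state stays available for the later switch back.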

hanzei and others added 2 commits April 15, 2026 11:29

coderabbitai Bot requested changes Apr 15, 2026

hanzei requested a review from neillcollie April 15, 2026 11:49

hanzei added the labels "1: Dev Review" (requires review by a core committer) and "2: Editor Review" (requires review by an editor) on Apr 15, 2026

coderabbitai Bot approved these changes Apr 15, 2026

hanzei requested a review from ewwollesen April 21, 2026 08:10

ewwollesen requested changes Apr 21, 2026
