docs: move AWS DR guide to dedicated subpage #8893

Open
hanzei wants to merge 4 commits into master from docs/disaster-recovery-updates

@hanzei hanzei commented Apr 15, 2026

Summary

The old page was 90% an AWS guide. I've split it into two pages: a general overview of how we do DR at Mattermost and a subpage for the AWS guide. I've also added clarifications to the main page on what is supported and what isn't.

AI Summary

  • Extracts the AWS-specific active/passive DR deployment steps from backup-disaster-recovery.rst into a new disaster-recovery-aws.rst subpage
  • The main page now links to it via toctree, keeping the overview concise and making room for future platform-specific guides
  • Adds a note clarifying that Mattermost does not support active/active DR deployments (already landed on master)

Preview

http://mattermost-docs-preview-pulls.s3-website-us-east-1.amazonaws.com/8893/deployment-guide/backup-disaster-recovery.html#

🤖 Generated with Claude Code

hanzei and others added 2 commits April 15, 2026 11:29
…recovery

Distinguish high availability (single-site clustering) from disaster
recovery (multi-site failover), clarify that Mattermost supports
active/passive DR only and does not support active/active deployments,
and rename the "High Availability deployment" section to
"Active/passive DR deployment" for accuracy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract the AWS-specific active/passive DR deployment steps from
backup-disaster-recovery.rst into a new disaster-recovery-aws.rst
subpage. The main page now links to it via toctree, keeping the
overview page concise and making room for future platform-specific
guides.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a52bfe8-a743-411f-9bd3-65ee08fd40b3

📥 Commits

Reviewing files that changed from the base of the PR and between 7fd420e and 995626c.

📒 Files selected for processing (2)
  • source/deployment-guide/backup-disaster-recovery.rst
  • source/deployment-guide/disaster-recovery-aws.rst
✅ Files skipped from review due to trivial changes (1)
  • source/deployment-guide/disaster-recovery-aws.rst

📝 Walkthrough

Documentation restructured: the main disaster recovery guide was simplified to distinguish HA vs DR and introduce an active/passive DR approach; detailed AWS-specific active/passive DR implementation moved into a new platform-specific guide with end-to-end replication and failover steps.

Changes

| Cohort / File(s) | Summary |
|---|---|
| General DR Guide: `source/deployment-guide/backup-disaster-recovery.rst` | Removed prior AWS-specific step-by-step HA/DR content and images; normalized headings; replaced HA section with an Active/passive DR overview and added a toctree entry pointing to the AWS-specific DR guide. |
| AWS Platform-Specific DR Guide: `source/deployment-guide/disaster-recovery-aws.rst` | Added new comprehensive AWS active/passive DR doc covering prerequisites, RDS Aurora global cluster setup (with write forwarding caveats), S3 replication configuration, OpenSearch cross-cluster replication (pull model, FGAC, IAM requirements), job server scheduler handling, failover switchover and OpenSearch replication reversal steps, testing, and DNS/app node guidance. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant DNS as DNS
    participant Primary as PrimaryRegion\n(App, RDS, S3, OpenSearch)
    participant Secondary as SecondaryRegion\n(App replica, RDS replica, S3 replica, OpenSearch replica)
    participant JobServer as JobServer

    User->>DNS: Resolve app endpoint
    DNS->>Primary: Direct traffic to Primary App nodes
    User->>Primary: App requests (reads/writes)
    Primary->>RDS: DB writes/reads
    Primary->>S3: Object writes (replicated)
    Primary->>OpenSearch: Index writes (replicated)

    Note over Primary,Secondary: Continuous replication configured\n(RDS global cluster, S3 replication, OpenSearch CCR)

    %% Failover initiation
    alt Primary failure detected
        Admin->>DNS: Switch endpoint to Secondary
        DNS->>Secondary: Route users to Secondary App nodes
        Admin->>RDS: Promote secondary as writer
        Admin->>JobServer: Disable scheduler on Secondary until failover complete
        Admin->>OpenSearch: Reverse replication direction, make indices writable
        Secondary->>S3: Accept replicated objects / sync
        Admin->>JobServer: Enable scheduler on Secondary
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: extracting AWS-specific disaster recovery content into a dedicated subpage, which aligns with the substantial restructuring shown in the changeset. |
| Description check | ✅ Passed | The description clearly explains the purpose of the PR (splitting the monolithic page into a general overview and AWS-specific subpage) and relates directly to the changeset modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
source/deployment-guide/disaster-recovery-aws.rst (2)

95-117: Use json instead of sh for the IAM policy block.

This block is a JSON policy document, not a shell command. Language labelling should match content.

Suggested minimal diff
-  .. code-block:: sh
+  .. code-block:: json

As per coding guidelines, "Require code fences or code directives to identify the language when practical."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 95 - 117, The
code block showing the IAM policy is labeled as a shell snippet ("code-block::
sh") but contains JSON; update the directive to "code-block:: json" so the IAM
policy document is correctly identified and syntax-highlighted; locate the block
that currently begins with code-block:: sh and change that directive to
code-block:: json (the JSON policy object with keys "Version" and "Statement")
to match the content.

7-10: Add a short prerequisites block before procedural steps.

This page jumps into execution quickly. A compact prerequisites list (AWS account access, region pair selected, existing Mattermost primary deployment, DNS ownership, OpenSearch/RDS permissions) would reduce operator error for novice admins.

As per coding guidelines, "List prerequisites clearly at the top of documentation sections."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/deployment-guide/disaster-recovery-aws.rst` around lines 7 - 10, Add a
short "Prerequisites" block at the top of the Mattermost AWS disaster recovery
guide (before the procedural steps that start with the current introductory
paragraphs) listing required items: AWS account access and IAM permissions,
chosen region pair for failover, an existing Mattermost primary deployment,
control/ownership of DNS for failover updates, required OpenSearch/RDS
permissions and backups, and any tooling/CLI versions; ensure the block uses a
clear bullet list and a brief note about verifying backups and network
connectivity so novice operators see these checks before executing the
procedure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@source/deployment-guide/disaster-recovery-aws.rst`:
- Line 13: Fix the typo in the cross-reference sentence by replacing
"documenation" with "documentation" in the sentence that references the
Upgrading Mattermost in Kubernetes and High Availability Environments doc (the
string containing ":doc:`Upgrading Mattermost in Kubernetes and High
Availability Environments
</administration-guide/upgrade/upgrade-mattermost-kubernetes-ha>`"). Ensure the
corrected sentence reads "...see the ... documentation." and keep the rest of
the cross-reference unchanged.
- Around line 235-237: Duplicate curl command checking the _status for
posts_<DATE> appears twice; remove the redundant line so only one curl -H
'Content-Type: application/json' -u '<USERNAME>:<PASSWORD>'
'https://<HOSTNAME>/_plugins/_replication/posts_<DATE>/_status?pretty' remains,
preserving the Sample output line that follows and keeping steps atomic and
numbered.
- Line 175: Replace the incorrect curl credential separator and clarify the host
placeholder: change the curl -u argument from "username/password" to the
required "username:password" format and update the URL placeholder (e.g., use a
clearer <hostname[:port]> or <elasticsearch-host>) in the example command string
shown in the diff so readers can substitute a real host.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ca13014e-776c-4c71-a691-88e42d103ad0

📥 Commits

Reviewing files that changed from the base of the PR and between 80e869f and 7fd420e.

📒 Files selected for processing (2)
  • source/deployment-guide/backup-disaster-recovery.rst
  • source/deployment-guide/disaster-recovery-aws.rst


Newest code from mattermost has been published to preview environment for Git SHA 7fd420e

- Fix typo: "documenation" → "documentation"
- Fix curl credentials: "username/password" → "<USERNAME>:<PASSWORD>" and empty host placeholder
- Remove duplicate posts_<DATE> status curl command
- Change IAM policy code block language from sh to json
- Add prerequisites section to disaster-recovery-aws.rst
- Wrap HA vs DR explanation in a note directive

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hanzei hanzei requested a review from neillcollie April 15, 2026 11:49

Newest code from mattermost has been published to preview environment for Git SHA dde9aab

@hanzei hanzei added the "1: Dev Review" (Requires review by a core commiter) and "2: Editor Review" (Requires review by an editor) labels Apr 15, 2026
Change the three SSO failover sub-sections from ~~~~ to ^^^^^ so they
render as children of "Failover from Single Sign-On outage" in the
sidebar TOC rather than at the same level.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Newest code from mattermost has been published to preview environment for Git SHA 995626c

hanzei commented Apr 15, 2026

cc @mrckndt

@hanzei hanzei requested a review from ewwollesen April 21, 2026 08:10
Set up in one data center
--------------------------

As a first step, set up Mattermost in a single data center. At a very basic high level, this would be something like below:

The description "At a very basic high level, this would be something like below" reads a little awkwardly, imo. Maybe something like "The following diagram illustrates a basic single data center architecture:" would read a bit cleaner.

.. tip::

  All you need is a recent OpenSearch version with fine-grained access control enabled. Node-to-node encryption is automatically enabled once you enable fine-grained access control.

"All you need is a recent OpenSearch version with fine-grained access control enabled." This reads a bit weird because the bullet just above it also lists Elasticsearch 7.10 as supported. The tip should either acknowledge both or be scoped appropriately. I get what you're trying to do, and it's probably not a big deal. Maybe reword it like "If you are already running OpenSearch 2.x, all you need to do is turn on fine-grained access control, and node-to-node encryption will be enabled automatically" so it reads a little more tip-like?


For simplicity, let's say ``site1`` is primary, and ``site2`` is secondary. Therefore, OpenSearch in ``site1`` is the leader domain, and OpenSearch in ``site2`` is the follower. The follower pulls from the leader. To switch the direction so that ``site2`` becomes the leader and ``site1`` becomes the follower:

1. Remove the rule from ``site1`` > ``site 2`` in AWS Console. This will auto-pause the replication, but the indices in ``site2`` will still be read-only. Remove the replication rules for that.

"site 2" here versus "site2" everywhere else.
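
For anyone testing the replication reversal discussed in this thread, here is a minimal sketch (not taken from the guide) of how the per-index replication calls could be prepared in Python. The `_status` path matches the curl command quoted in the bot review above; the `_stop` call is the OpenSearch cross-cluster replication API for turning a follower index back into a regular index, and whether the guide uses it at this step is an assumption. The helper name, host, index name, and credentials are all hypothetical placeholders. The function only builds the request and does not send it.

```python
import base64
import urllib.request


def replication_request(host, index, action, username, password):
    """Build (but do not send) a cross-cluster replication request for one follower index."""
    if action == "status":
        method, suffix, body = "GET", "_status?pretty", None
    elif action == "stop":
        # Assumption: stopping replication on the follower index is the step
        # that lets it accept writes again after the connection is removed.
        method, suffix, body = "POST", "_stop", b"{}"
    else:
        raise ValueError(f"unsupported action: {action}")

    url = f"https://{host}/_plugins/_replication/{index}/{suffix}"
    request = urllib.request.Request(url, data=body, method=method)

    # Basic auth uses "username:password", the same separator curl -u expects.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    return request


# Hypothetical values; substitute the follower domain in site2 and a real index.
status = replication_request("search.example.com", "posts_example", "status", "admin", "secret")
print(status.get_method(), status.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it; keeping construction separate makes it easy to review the URL and method before touching a live domain.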

S3 bucket is auto-replicated both ways
----------------------------------------

There's nothing you need to do to ensure the S3 bucket is auto-replicating both ways.

Having a whole header for just one sentence is not wrong but, I don't know, it just seems kind of weird. Maybe make it a Note or Tip style section instead?

.. tip::
  Websockets will still point to the old data center even if you have switched DNS. You need to roll over each app node gradually to move those connections to the new data center. If all your nodes are down, no action is necessary and the clients will automatically re-connect to the new data center.

The S3 bucket is replicated bi-directionally while the database and ES/OS are replicated uni-directionally.

Should we add a note or section here at the end on what to do when the disaster event or whatever is over? Even if it's just "perform these operations the same way to restore functionality back to the primary data center".

----------------------

If the job scheduler is left running in the secondary region, it will pick up jobs and start running them. Therefore, set ``JobSettings.RunScheduler`` to ``false`` on all nodes in the secondary region. When a failover happens, you need to enable it for the new primary region, and deactivate it for the new secondary region.

Should we tell them how to do this, or are we assuming if they are this far they probably know how to change settings?
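
If the guide does spell this out, one possible illustration is below. It is only a sketch, assuming the nodes use a file-based `config.json`; deployments that keep configuration in the database or override it through environment variables would need a different approach. The `JobSettings.RunScheduler` key comes from the paragraph quoted above, while the helper name and the sample JSON are hypothetical.

```python
import json


def set_run_scheduler(config_json, enabled):
    """Return config JSON text with JobSettings.RunScheduler set to `enabled`."""
    config = json.loads(config_json)
    # Create the JobSettings section if it is missing, then set the flag.
    config.setdefault("JobSettings", {})["RunScheduler"] = bool(enabled)
    return json.dumps(config, indent=2)


# Hypothetical excerpt of a node's config.json in the secondary region.
secondary = set_run_scheduler('{"JobSettings": {"RunScheduler": true}}', enabled=False)
print(secondary)
```

On failover the same helper would be called with `enabled=True` for nodes in the new primary region and `enabled=False` for nodes in the new secondary region, matching the behavior the paragraph describes.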
