Feat/336/wal-recover by xiao-77 · Pull Request #35200 · taosdata/TDengine

xiao-77 · 2026-04-22T06:14:02Z

Add syncNotifyWalTruncated interface in sync module
Call notification after three-replica recovery
Enable Raft to detect and sync missing logs

Description

Issue(s)

Close/close/Fix/fix/Resolve/resolve: Issue Link

Checklist

Please check the items in the checklist if applicable.

Is the user manual updated?
Are the test cases passed and automated?
Is there no significant decrease in test coverage?

- Add tsWalRecoveryPolicy global variable (default 0) - Only affects single replica: 0=refuse, 1=delete corrupted - Three replicas always auto-recover regardless of this config

- Add replica parameter to walCheckAndRepairMeta function - Temporarily pass replica=1 in walOpen (will be updated in next task)

- Three replicas: always auto-recover (ignore walRecoveryPolicy) - Single replica: controlled by walRecoveryPolicy - 0 (default): refuse to start, preserve corrupted WAL - 1: delete corrupted part and try to start - Add detailed error messages for single replica refusal

- Add replica parameter to walOpen function - Pass actual replica count from vnode configuration - Pass replica count from mnode syncMgmt - Enable replica-aware recovery policy decision

- Add syncNotifyWalTruncated interface in sync module - Call notification after three-replica recovery - Enable Raft to detect and sync missing logs

gemini-code-assist

Code Review

This pull request introduces a WAL recovery policy and enhances corruption handling by incorporating replica counts into the WAL opening process. Key changes include the addition of the tsWalRecoveryPolicy configuration and a notification mechanism for the sync module when WAL truncation occurs. Review feedback identifies critical issues where the notification logic is unreachable due to early returns and the recovery logic is overly specific to three-replica setups. Improvements are also suggested regarding the use of header files instead of extern declarations and the need for a more functional implementation of the sync notification to properly handle log synchronization after truncation.

gemini-code-assist · 2026-04-22T06:15:56Z

+        if (shouldRecover && tsWalDeleteOnCorruption) {
          TAOS_RETURN(walRenameCorruptedDir(pWal));
        }
        goto _exit;


There are two issues here:

The notification to the sync module (intended at line 680) is unreachable because this block either returns via TAOS_RETURN or jumps to _exit before reaching the end of the function.

For multi-replica setups, the log at line 620 claims auto-recovery is enabled, but the actual recovery is still gated by tsWalDeleteOnCorruption (which defaults to false). If recovery is intended to be automatic for clusters, this guard should be adjusted.

if (shouldRecover && (tsWalDeleteOnCorruption || replica > 1)) { if (replica > 1) { syncNotifyWalTruncated(pWal->cfg.vgId, truncatedVer); } TAOS_RETURN(walRenameCorruptedDir(pWal)); } goto _exit;

gemini-code-assist · 2026-04-22T06:15:56Z

+  if (walTruncated && replica == 3) {
+    int32_t syncCode = syncNotifyWalTruncated(pWal->cfg.vgId, truncatedVer);
+    if (syncCode != TSDB_CODE_SUCCESS) {
+      wWarn("vgId:%d, failed to notify sync module, code:0x%x", pWal->cfg.vgId, syncCode);
+    }
+  }


This block is unreachable when walTruncated is true. As noted in the previous comment, any path that detects corruption and sets walTruncated = true results in an early exit from the function. The notification logic should be moved into the recovery block where the actual truncation/renaming occurs.

gemini-code-assist · 2026-04-22T06:15:56Z

 #include "tutil.h"
 #include "walInt.h"

+extern int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);


Instead of using an extern declaration inside a source file, it is better practice to include the header file where the function is declared. This ensures type safety and a single source of truth for the function signature.

#include "sync.h"

gemini-code-assist · 2026-04-22T06:15:56Z

-        if (tsWalDeleteOnCorruption) {
+
+        bool shouldRecover = false;
+        if (replica == 3) {


The check for replica == 3 is too specific. TDengine supports other replica counts (e.g., 2 or 5). It is better to check if the node is part of a multi-replica group.

if (replica > 1) {

gemini-code-assist · 2026-04-22T06:15:57Z

+int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer) {
+  sInfo("vgId:%d, notified sync module: WAL truncated to ver:%" PRId64, vgId, truncatedVer);
+  return TSDB_CODE_SUCCESS;
+}


The implementation of syncNotifyWalTruncated currently only logs a message. To fulfill the PR's objective of 'Enabling Raft to detect and sync missing logs', this function should update the state of the corresponding sync node (e.g., resetting its applied index or marking it as needing a snapshot/sync).

Copilot

Pull request overview

This PR extends WAL open/repair behavior with replica-aware recovery handling and introduces a sync-module notification hook intended to let Raft detect/sync missing logs after WAL truncation during recovery.

Changes:

Extend walOpen() / walCheckAndRepairMeta() to take a replica count and use it to decide corruption recovery behavior.
Add walRecoveryPolicy server config to control single-replica corruption recovery behavior.
Add syncNotifyWalTruncated() API and invoke it from WAL repair logic (replica=3 path).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
source/libs/wal/src/walMgmt.c	Extends `walOpen()` to accept replica count and passes it into meta repair.
source/libs/wal/src/walMeta.c	Adds replica-aware corruption handling and calls `syncNotifyWalTruncated()` after truncation.
source/libs/wal/inc/walInt.h	Updates internal prototype for `walCheckAndRepairMeta()`.
include/libs/wal/wal.h	Updates public `walOpen()` signature to include replica count.
source/dnode/vnode/src/vnd/vnodeOpen.c	Passes vnode replica count into `walOpen()`.
source/dnode/mnode/impl/src/mndMain.c	Passes mnode replica count into `walOpen()`.
source/common/src/tglobal.c	Adds `walRecoveryPolicy` global + config plumbing.
include/common/tglobal.h	Exposes `tsWalRecoveryPolicy` with documentation.
source/libs/sync/src/syncMain.c	Adds `syncNotifyWalTruncated()` implementation (currently logs only).
source/libs/sync/inc/syncInt.h	Declares internal `syncNotifyWalTruncated()` symbol.
include/libs/sync/sync.h	Exposes `syncNotifyWalTruncated()` in public sync API.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-22T06:19:44Z

+          }
+        }
+
+        if (shouldRecover && tsWalDeleteOnCorruption) {


The new "auto-recovery enabled for replica=3" path is inconsistent: when replica==3 and WAL scan fails, the code only recovers if tsWalDeleteOnCorruption is enabled; otherwise it falls through to _exit with an error code, causing walOpen() to fail despite logging that auto-recovery is enabled. Either perform the recovery action unconditionally for replica==3 (still preserving the corrupted dir), or make the log/error handling reflect the actual required config and set a deterministic return code.

Suggested change

if (shouldRecover && tsWalDeleteOnCorruption) {

if (replica == 3 || (shouldRecover && tsWalDeleteOnCorruption)) {

Copilot · 2026-04-22T06:19:44Z

+        }
+
+        if (shouldRecover && tsWalDeleteOnCorruption) {
          TAOS_RETURN(walRenameCorruptedDir(pWal));


When recovery triggers TAOS_RETURN(walRenameCorruptedDir(pWal)), the function returns before reaching the later syncNotifyWalTruncated() block, so the sync module is never notified in the main recovery path. If notification is required for replica=3 recovery, move the notification before the early return (or avoid returning early).

Suggested change

TAOS_RETURN(walRenameCorruptedDir(pWal));

code = walRenameCorruptedDir(pWal);

Copilot · 2026-04-22T06:19:45Z

 extern int64_t tsWalFsyncDataSizeLimit;
 extern bool    tsWalForceRepair;
 extern bool    tsWalDeleteOnCorruption;
+// WAL recovery policy (only affects single replica)


Header comment says walRecoveryPolicy "only affects single replica", but the new logic checks it for all replica != 3 cases (so it would also affect replica=2 if that configuration exists). Please align the comment (and possibly the config semantics) with actual behavior to avoid misconfiguration.

Suggested change

// WAL recovery policy (only affects single replica)

// WAL recovery policy (applies when replica count is not 3, including single-replica deployments)

Copilot · 2026-04-22T06:19:45Z

+  if (walTruncated && replica == 3) {
+    int32_t syncCode = syncNotifyWalTruncated(pWal->cfg.vgId, truncatedVer);
+    if (syncCode != TSDB_CODE_SUCCESS) {
+      wWarn("vgId:%d, failed to notify sync module, code:0x%x", pWal->cfg.vgId, syncCode);
+    }
+  }


walCheckAndRepairMeta() now calls syncNotifyWalTruncated(), but the wal static library does not link against sync (and sync already links against wal), so this introduces a circular/undefined-symbol risk (e.g. walTest links wal without sync). Replace this direct call with a decoupled mechanism such as a callback function pointer registered by sync at runtime, or a weak/no-op default implementation compiled into wal.

- Remove syncNotifyWalTruncated function and calls - Three-replica recovery works without explicit sync notification - Raft will automatically detect and sync missing logs

…L truncation - Remove tsWalDeleteOnCorruption parameter (no longer needed) - Implement proper WAL truncation instead of renaming entire directory - Delete corrupted WAL files from corruption point onwards - Update WAL metadata to reflect truncation - Ensure Raft can write with correct version after recovery

…engine into feat/336/wal-recover

- Add walTruncateCorruptedFiles() function to handle all truncation logic - Consolidate replica-based recovery policy decision in one place - Call truncation function from both walCheckAndRepairMeta and walLogEntriesComplete - Ensure consistent truncation behavior across all corruption scenarios

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-22T07:08:42Z

+  bool        walTruncated = false;
+  int64_t     truncatedVer = -1;


walTruncated and truncatedVer are declared but never used, which is dead code and can trigger -Wunused-variable warnings in some build configurations. Either remove them or use them for the intended truncation notification/metadata tracking (e.g., record truncatedVer and propagate it to the sync module).

Suggested change

bool walTruncated = false;

int64_t truncatedVer = -1;

Copilot · 2026-04-22T07:08:43Z

+        code = walTruncateCorruptedFiles(pWal, fileIdx, replica);
+        if (code != TSDB_CODE_SUCCESS) {
+          goto _exit;
        }
-        goto _exit;
-      }
-      // empty log file
-      lastVer = pFileInfo->firstVer - 1;

-      code = TSDB_CODE_SUCCESS;
+        // After truncation, set lastVer based on remaining files
+        lastVer = (fileIdx > 0) ? ((SWalFileInfo*)taosArrayGet(pWal->fileInfoSet, fileIdx - 1))->lastVer : -1;
+        wInfo("vgId:%d, WAL truncated, new lastVer:%" PRId64, pWal->cfg.vgId, lastVer);
+        updateMeta = true;
+        code = TSDB_CODE_SUCCESS;
+      } else {
+        // empty log file
+        lastVer = pFileInfo->firstVer - 1;
+        code = TSDB_CODE_SUCCESS;
+      }
    }
-    wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
-          pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);

-    // update lastVer
-    pFileInfo->lastVer = lastVer;
-    totSize += pFileInfo->fileSize;
+    if (code == TSDB_CODE_SUCCESS && lastVer >= 0) {
+      wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
+            pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);
+
+      // update lastVer
+      pFileInfo->lastVer = lastVer;
+      totSize += pFileInfo->fileSize;
+    }


In the lastVer<0 && code!=TSDB_CODE_WAL_LOG_NOT_EXIST branch, walTruncateCorruptedFiles() mutates pWal->fileInfoSet (removes entries starting at fileIdx). After that, pFileInfo still points to the removed element, but the code can still enter the update block and write pFileInfo->lastVer / use pFileInfo->fileSize, which becomes invalid memory access. Also, totSize may already include sizes of files that just got deleted. After truncation, avoid using the old pFileInfo pointer and recompute totSize/versions from the remaining fileInfoSet (e.g., restart the scan or break and rebuild totals).

Copilot · 2026-04-22T07:08:43Z

+
+int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer) {
+  sInfo("vgId:%d, notified sync module: WAL truncated to ver:%" PRId64, vgId, truncatedVer);
+  return TSDB_CODE_SUCCESS;
+}


syncNotifyWalTruncated() is added to public headers, but there are no call sites in the repo. If the intent is to notify Raft after WAL truncation/recovery, wire this into the WAL recovery path (e.g., after walTruncateCorruptedFiles succeeds) and pass the actual truncated version; otherwise keep the symbol internal or remove it to avoid exposing an unused API surface.

Suggested change

int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer) {

sInfo("vgId:%d, notified sync module: WAL truncated to ver:%" PRId64, vgId, truncatedVer);

return TSDB_CODE_SUCCESS;

}

Copilot · 2026-04-22T07:08:44Z

-    } else {
-      TAOS_RETURN(TSDB_CODE_WAL_LOG_INCOMPLETE);
-    }
+           ", snaphot index:%" PRId64 ", fileIdx:%d",


Typo in log message: "snaphot" should be "snapshot".

Suggested change

", snaphot index:%" PRId64 ", fileIdx:%d",

", snapshot index:%" PRId64 ", fileIdx:%d",

Copilot · 2026-04-22T07:08:44Z

+  if (replica == 3) {
+    shouldRecover = true;
+    SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+    wInfo("vgId:%d, WAL corrupted at ver:%" PRId64 ", auto-recovery enabled for replica=3",
+          pWal->cfg.vgId, pFileInfo->firstVer);
+  } else {
+    shouldRecover = (tsWalRecoveryPolicy == 1);
+    if (shouldRecover) {
+      SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+      wWarn("vgId:%d, WAL corrupted at ver:%" PRId64 ", force recovery enabled by walRecoveryPolicy=1",
+            pWal->cfg.vgId, pFileInfo->firstVer);
+    } else {
+      SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+      wError("vgId:%d, WAL corrupted at ver:%" PRId64 ", refusing to start to prevent data loss",
+             pWal->cfg.vgId, pFileInfo->firstVer);
+      wError("vgId:%d, corrupted WAL files are preserved for manual inspection", pWal->cfg.vgId);
+      wError("vgId:%d, to force recovery with data loss, set 'walRecoveryPolicy 1' in taos.cfg and restart",
+             pWal->cfg.vgId);


walTruncateCorruptedFiles() doesn't validate fileIdx before calling taosArrayGet(pWal->fileInfoSet, fileIdx) and before computing the remove batch length (size - fileIdx). If fileIdx is negative or >= array size, this can lead to invalid memory access or a huge batch removal due to underflow. Add defensive checks at the start of this function and return an appropriate error when fileIdx is out of range.

Copilot · 2026-04-22T07:08:44Z

-    }
+           ", snaphot index:%" PRId64 ", fileIdx:%d",
+           pWal->cfg.vgId, pWal->vers.firstVer, pWal->vers.lastVer, index, pWal->vers.snapshotVer, fileIdx);
+    TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));


walTruncateCorruptedFiles() assumes fileIdx is a valid index into pWal->fileInfoSet (taosArrayGet/taosArrayRemoveBatch). In walLogEntriesComplete(), fileIdx can become == sz when the loop completes without breaking, which would make taosArrayGet out-of-bounds and can crash. Add explicit range checks (fileIdx >= 0 && fileIdx < taosArrayGetSize(...)) before using fileIdx, and decide a safe truncation point (or return an error) when the inconsistency is only in meta/vers rather than a specific corrupted file.

Suggested change

TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));

if (fileIdx >= 0 && fileIdx < sz) {

TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));

}

wError("vgId:%d, WAL metadata/version inconsistency detected without a valid corrupted file index, "

"skip truncation, fileCnt:%d, fileIdx:%d",

pWal->cfg.vgId, sz, fileIdx);

TAOS_RETURN(TSDB_CODE_WAL_FILE_CORRUPTED);

…eting - Read idx file to locate last valid log entry - Read log entry header to calculate exact truncation offset - Truncate both log and idx files to preserve valid data - Update fileInfo metadata after truncation - Ensure Raft can continue writing from correct position

- test_single_replica_refuse_to_start: verify default behavior refuses to start - test_single_replica_force_recovery: verify walRecoveryPolicy=1 enables recovery - test_three_replica_auto_recovery: verify 3-replica auto-recovery via Raft - test_wal_truncation_preserves_valid_data: verify valid data preservation - Simulate WAL corruption by truncating log files - Verify recovery behavior for different replica configurations

- WalRecoveryPolicy test fixture for recovery scenarios - singleReplicaRefuseToStart: verify default behavior refuses corrupted WAL - singleReplicaForceRecovery: verify walRecoveryPolicy=1 enables recovery - threeReplicaAutoRecovery: verify 3-replica ignores policy and auto-recovers - truncationPreservesValidData: verify valid data preserved after truncation - Simulate corruption by truncating log files at various points

… case

…d fix test crashes - walTruncateCorruptedFiles now only deletes files after the corrupted file; the corrupted file itself is already truncated by walTrimIdxFile and walScanLogGetLastVer before this function is called - fix singleReplicaRefuseToStart crash caused by passing NULL cfg to walOpen, which derefs pCfg before any policy check - rewrite corruptWalFile to produce 3 WAL segments via walBeginSnapshot/walEndSnapshot rolls and corrupt the middle file Made-with: Cursor

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 14 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-28T02:10:34Z


 // handle open and ctl
-SWal   *walOpen(const char *path, SWalCfg *pCfg);
+SWal   *walOpen(const char *path, SWalCfg *pCfg, int32_t replica);


walOpen signature now requires a replica argument, but there are still call sites in the repo (e.g. source/libs/sync/test/*.cpp) using the old 2-arg form. This will break compilation; please update all remaining callers (or provide a backward-compatible wrapper/default) so the tree builds cleanly.

Copilot · 2026-04-28T02:10:34Z

+  // Save old value
+  int32_t oldPolicy = tsWalRecoveryPolicy;
+  tsWalRecoveryPolicy = 1;  // Force recovery
+
+  // Open WAL - should succeed with truncation
+  SWalCfg cfg;
+  cfg.rollPeriod = -1;
+  cfg.segSize = -1;
+  cfg.committed = -1;
+  cfg.retentionPeriod = -1;
+  cfg.retentionSize = 0;
+  cfg.rollPeriod = 0;
+  cfg.vgId = 1;
+  cfg.level = TAOS_WAL_FSYNC;
+  pWal = walOpen(pathName, &cfg, 1);
  ASSERT_NE(pWal, nullptr);
-  ASSERT_EQ(pWal->vers.firstVer, -1);
-  ASSERT_EQ(pWal->vers.lastVer, -1);
-
-  tsWalDeleteOnCorruption = oldVal;
+
+  // Verify WAL is accessible
+  ASSERT_GE(pWal->vers.lastVer, -1);
+
+  // Restore
+  tsWalRecoveryPolicy = oldPolicy;


This test mutates the global tsWalRecoveryPolicy and only restores it after assertions. If an ASSERT_* fails, the function aborts early and the global policy remains modified, which can cascade failures into subsequent tests. Prefer a scope guard/RAII restore or make the assertions non-fatal so the restore always runs.

Copilot · 2026-04-28T02:10:35Z

+// walTrimIdxFile + walScanLogGetLastVer already truncate the corrupted file itself.
+// This function only needs to delete subsequent files that are entirely beyond the corruption point.
+static int32_t walTruncateCorruptedFiles(SWal* pWal, int32_t fileIdx, int32_t replica) {
+  int32_t code = TSDB_CODE_SUCCESS;
+  bool shouldRecover = false;
+
+  SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+  if (replica >= 3) {
+    shouldRecover = true;
+    wInfo("vgId:%d, WAL corrupted at ver:%" PRId64 ", auto-recovery enabled for replica=%d",
+          pWal->cfg.vgId, pFileInfo->firstVer, replica);
+  } else {
+    shouldRecover = (tsWalRecoveryPolicy == 1);
+    if (shouldRecover) {
+      wWarn("vgId:%d, WAL corrupted at ver:%" PRId64 ", force recovery enabled by walRecoveryPolicy=1",
+            pWal->cfg.vgId, pFileInfo->firstVer);
+    } else {
+      wError("vgId:%d, WAL corrupted at ver:%" PRId64 ", refusing to start to prevent data loss",
+             pWal->cfg.vgId, pFileInfo->firstVer);
+      wError("vgId:%d, to force recovery with data loss, set 'walRecoveryPolicy 1' in taos.cfg and restart",
+             pWal->cfg.vgId);
+      TAOS_RETURN(TSDB_CODE_WAL_FILE_CORRUPTED);
+    }
+  }
+
+  // Delete all files after the corrupted file (the corrupted file itself is already truncated
+  // by walTrimIdxFile + walScanLogGetLastVer before this function is called)
+  int32_t deleteFrom = fileIdx + 1;
+  wInfo("vgId:%d, removing WAL files from index %d onwards", pWal->cfg.vgId, deleteFrom);
+


walTruncateCorruptedFiles() assumes the corrupted file itself has already been truncated by walScanLogGetLastVer, but walScanLogGetLastVer() intentionally skips truncation when fileIdx == sz - 1 (last file). That means corruption in the last WAL segment may not actually be removed, and the node may keep re-entering repair logic (or fail later) even though walTruncateCorruptedFiles() returns success. Consider truncating the corrupted file here as well (or adjusting walScanLogGetLastVer/repair flow) so last-segment corruption is recoverable under the intended policy/replica behavior.

Copilot · 2026-04-28T02:10:35Z

+  // Save old value
+  int32_t oldPolicy = tsWalRecoveryPolicy;
+  tsWalRecoveryPolicy = 0;  // Even with refuse policy, 3 replicas should auto-recover
+
+  // Open WAL with 3 replicas - should succeed
+  SWalCfg cfg;
+  cfg.rollPeriod = -1;
+  cfg.segSize = -1;
+  cfg.committed = -1;
+  cfg.retentionPeriod = -1;
+  cfg.retentionSize = 0;
+  cfg.rollPeriod = 0;
+  cfg.vgId = 1;
+  cfg.level = TAOS_WAL_FSYNC;
+  pWal = walOpen(pathName, &cfg, 3);  // Three replicas
+  ASSERT_NE(pWal, nullptr);

-  // Write logs to create multiple wal files
-  for (int i = 0; i < 100; i++) {
-    char newStr[100];
-    sprintf(newStr, "%s-%d", ranStr, i);
-    int len = strlen(newStr);
-    code = walAppendLog(pWal, i, 0, syncMeta, newStr, len, NULL);
-    ASSERT_EQ(code, 0);
-  }
-
-  code = walRollImpl(pWal);
-  ASSERT_EQ(code, 0);
-
-  for (int i = 100; i < 200; i++) {
-    char newStr[100];
-    sprintf(newStr, "%s-%d", ranStr, i);
-    int len = strlen(newStr);
-    code = walAppendLog(pWal, i, 0, syncMeta, newStr, len, NULL);
+  // Verify WAL is accessible
+  ASSERT_GE(pWal->vers.lastVer, -1);
+
+  // Restore
+  tsWalRecoveryPolicy = oldPolicy;


This test mutates the global tsWalRecoveryPolicy and only restores it after assertions. If an ASSERT_* fails, the function aborts early and the global policy remains modified, which can cascade failures into subsequent tests. Please ensure restore happens even on failure (scope guard/RAII or non-fatal assertions).

Copilot · 2026-04-28T02:10:35Z

+        # Dnode should fail to bring vnode online; vnode status should be offline/error
+        tdSql.query("show dnodes")
+        tdLog.info(f"dnode status after corrupted WAL start: {tdSql.queryResult}")
+
+        tdSql.execute("drop database if exists test_refuse")


This test claims to verify that a single-replica node refuses to start on WAL corruption, but it never asserts any failure condition (it only logs show dnodes output). As written, it can pass even if the dnode starts successfully. Add an explicit assertion (e.g., check the dnode/vnode status is offline/error, or that queries against the affected vnode fail) so the test actually validates the policy.

Suggested change

# Dnode should fail to bring vnode online; vnode status should be offline/error

tdSql.query("show dnodes")

tdLog.info(f"dnode status after corrupted WAL start: {tdSql.queryResult}")

tdSql.execute("drop database if exists test_refuse")

try:

# Dnode should fail to bring vnode online; log status for debugging

tdSql.query("show dnodes")

tdLog.info(f"dnode status after corrupted WAL start: {tdSql.queryResult}")

# Explicitly verify the affected single-replica vnode is not usable.

# If the query succeeds, the dnode/vnode came online despite WAL corruption.

query_failed = False

try:

tdSql.query("select count(*) from t1")

except Exception as err:

query_failed = True

tdLog.info(f"Query failed as expected after corrupted WAL restart: {err}")

assert query_failed, (

"Expected queries against test_refuse.t1 to fail after restarting with "

"corrupted WAL and walRecoveryPolicy=0, but the query succeeded."

)

finally:

tdSql.execute("drop database if exists test_refuse")

Copilot · 2026-04-28T02:10:36Z

+    ASSERT_GE((int)logFiles.size(), 3);
+    std::sort(logFiles.begin(), logFiles.end());
+


This file uses std::sort (and other algorithms) but does not include <algorithm>. Relying on transitive includes is non-portable and can fail to compile on some toolchains; please add the proper header include.

Copilot · 2026-04-28T02:10:36Z

@@ -431,7 +431,69 @@ static int32_t walRenameCorruptedDir(SWal* pWal) {
  TAOS_RETURN(code);
 }



After switching the corruption path to walTruncateCorruptedFiles(), the old static helper walRenameCorruptedDir() is no longer referenced anywhere in this file. If the build enables -Wunused-function (or treats warnings as errors), this will cause a build break. Consider removing it or gating it behind the same feature/ifdef as any remaining users.

Suggested change

static int32_t walRenameCorruptedDir(SWal* pWal) __attribute__((unused));

Copilot · 2026-04-28T02:10:37Z

-    } else {
-      TAOS_RETURN(TSDB_CODE_WAL_LOG_INCOMPLETE);
-    }
+           ", snaphot index:%" PRId64 ", fileIdx:%d",


Typo in log message: snaphot index should be snapshot index for consistency/searchability in logs.

Suggested change

", snaphot index:%" PRId64 ", fileIdx:%d",

", snapshot index:%" PRId64 ", fileIdx:%d",

Copilot · 2026-04-28T02:10:37Z

+  // Save old value
+  int32_t oldPolicy = tsWalRecoveryPolicy;
+  tsWalRecoveryPolicy = 0;  // Refuse to start
+
+  // Try to open WAL - should fail due to corrupted file and refuse-to-start policy
+  SWalCfg cfg;
+  cfg.rollPeriod = -1;
+  cfg.segSize = -1;
+  cfg.committed = -1;
+  cfg.retentionPeriod = -1;
+  cfg.retentionSize = 0;
+  cfg.rollPeriod = 0;
+  cfg.vgId = 1;
+  cfg.level = TAOS_WAL_FSYNC;
+  pWal = walOpen(pathName, &cfg, 1);
+  ASSERT_EQ(pWal, nullptr);
+
+  // Restore
+  tsWalRecoveryPolicy = oldPolicy;


This test mutates the global tsWalRecoveryPolicy and only restores it after assertions. If an ASSERT_* fails, the function aborts early and the global policy remains modified, which can cascade failures into subsequent tests. Use EXPECT_* plus unconditional restore, or a scope guard/RAII helper to restore tsWalRecoveryPolicy in all exit paths.

Copilot · 2026-04-28T02:10:37Z

+  // Save old value
+  int32_t oldPolicy = tsWalRecoveryPolicy;
+  tsWalRecoveryPolicy = 1;  // Force recovery
+
+  // Reopen with recovery
+  SWalCfg cfg;
+  cfg.rollPeriod = -1;
+  cfg.segSize = -1;
+  cfg.committed = -1;
+  cfg.retentionPeriod = -1;
+  cfg.retentionSize = 0;
+  cfg.rollPeriod = 0;
+  cfg.vgId = 1;
+  cfg.level = TAOS_WAL_FSYNC;
+  pWal = walOpen(pathName, &cfg, 1);
  ASSERT_NE(pWal, nullptr);
-  ASSERT_EQ(pWal->vers.firstVer, 0);
-  ASSERT_EQ(pWal->vers.lastVer, 199);
-
-  tsWalDeleteOnCorruption = oldVal;
-}
+
+  // Verify some data is preserved (at least first batch)
+  ASSERT_GE(pWal->vers.lastVer, 0);
+  ASSERT_LT(pWal->vers.lastVer, lastVerBeforeCorruption);
+
+  // Restore
+  tsWalRecoveryPolicy = oldPolicy;


This test mutates the global tsWalRecoveryPolicy and restores it only at the end. If any ASSERT_* fails before the restore, later tests may observe the wrong policy value. Please restore via a scope guard/RAII or use non-fatal expectations so the cleanup always runs.

Copilot

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-28T05:53:16Z

+  // Delete all files after the corrupted file (the corrupted file itself is already truncated
+  // by walTrimIdxFile + walScanLogGetLastVer before this function is called)
+  int32_t deleteFrom = fileIdx + 1;
+  wInfo("vgId:%d, removing WAL files from index %d onwards", pWal->cfg.vgId, deleteFrom);
+
+  // Delete all files from fileIdx onwards
+  for (int32_t i = deleteFrom; i < taosArrayGetSize(pWal->fileInfoSet); i++) {
+    SWalFileInfo* pDelFileInfo = taosArrayGet(pWal->fileInfoSet, i);


walTruncateCorruptedFiles() currently deletes files starting at fileIdx + 1, but it’s also called from walLogEntriesComplete() when a non-consecutive firstVer is detected. In that gap/mismatch case, keeping the fileIdx file leaves the WAL still inconsistent (the gap remains). Consider deleting from fileIdx (or otherwise handling the gap case separately) and then recomputing vers.lastVer/writeCur to reflect the remaining prefix.

Copilot · 2026-04-28T05:53:17Z

  if (!complete) {
    wError("vgId:%d, WAL log entries incomplete in range [%" PRId64 ", %" PRId64 "], index:%" PRId64
-           ", snaphot index:%" PRId64,
-           pWal->cfg.vgId, pWal->vers.firstVer, pWal->vers.lastVer, index, pWal->vers.snapshotVer);
-    if (tsWalDeleteOnCorruption) {
-      TAOS_RETURN(walRenameCorruptedDir(pWal));
-    } else {
-      TAOS_RETURN(TSDB_CODE_WAL_LOG_INCOMPLETE);
+           ", snaphot index:%" PRId64 ", fileIdx:%d",
+           pWal->cfg.vgId, pWal->vers.firstVer, pWal->vers.lastVer, index, pWal->vers.snapshotVer, fileIdx);
+    if (fileIdx >= sz) {
+      // All file firstVers are consecutive but vers.lastVer doesn't match the last file's lastVer.
+      // There is no specific corrupted file — this is a meta-only inconsistency. Correct vers.lastVer
+      // in-place; the caller is responsible for persisting the corrected meta.
+      wWarn("vgId:%d, WAL vers.lastVer mismatch: recorded %" PRId64 " actual %" PRId64 ", correcting meta",
+            pWal->cfg.vgId, pWal->vers.lastVer, index);
+      pWal->vers.lastVer = index;
+      TAOS_RETURN(TSDB_CODE_SUCCESS);
    }
+    TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));
  } else {


After walLogEntriesComplete() calls walTruncateCorruptedFiles(), the code returns success without updating pWal->vers/totSize or persisting meta. If any files are removed, the in-memory and on-disk meta can remain stale, and subsequent components may observe an incorrect lastVer. Ensure the truncation path recomputes versions and saves meta (or returns a code that forces the caller to persist the repaired meta).

Copilot · 2026-04-28T05:53:17Z

+  if (replica >= 3) {
+    shouldRecover = true;
+    wInfo("vgId:%d, WAL corrupted at ver:%" PRId64 ", auto-recovery enabled for replica=%d",
+          pWal->cfg.vgId, pFileInfo->firstVer, replica);
+  } else {


PR description mentions notifying sync/Raft after three-replica recovery, but walTruncateCorruptedFiles() (the new truncation/recovery path) does not call syncNotifyWalTruncated() or any equivalent hook when replica >= 3. If Raft relies on this notification to resync missing logs, add the call here (or in the caller) at the point where truncation is committed.

Copilot · 2026-04-28T05:53:17Z

+  // Save old value
+  int32_t oldPolicy = tsWalRecoveryPolicy;
+  tsWalRecoveryPolicy = 0;  // Refuse to start
+


These tests mutate the global tsWalRecoveryPolicy and restore it at the end, but any failing ASSERT_* after the mutation will early-return and skip the restore, polluting global state for later tests. Use a guard/RAII helper or move the restore into TearDown() so it always runs (even on assertion failures).

Copilot · 2026-04-28T05:53:17Z

  TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "syncRoutineReportInterval", tsRoutineReportInterval, 5, 600, CFG_SCOPE_SERVER, CFG_DYN_SERVER,CFG_CATEGORY_LOCAL));
  TAOS_CHECK_RETURN(cfgAddBool(pCfg, "syncLogHeartbeat", tsSyncLogHeartbeat, CFG_SCOPE_SERVER, CFG_DYN_SERVER,CFG_CATEGORY_LOCAL));
-  TAOS_CHECK_RETURN(cfgAddBool(pCfg, "walDeleteOnCorruption", tsWalDeleteOnCorruption, CFG_SCOPE_SERVER, CFG_DYN_NONE,CFG_CATEGORY_LOCAL));
+  TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));


Config key walDeleteOnCorruption is removed and replaced with walRecoveryPolicy, which is a breaking change for existing deployments/config files and docs. If backward compatibility is required, consider continuing to accept walDeleteOnCorruption as an alias (or at least emit a clear warning) and update documentation accordingly.

Suggested change

TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));

TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));

TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walDeleteOnCorruption", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));

Copilot · 2026-04-28T05:53:18Z

+int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);
+


syncNotifyWalTruncated is declared in the public sync header, but there is no implementation in the repo (no definition found under source/libs/sync/src). Any new call site will fail to link, and the interface addition is incomplete as-is. Add the corresponding implementation (and ensure it’s exported from the sync library) or remove the declaration until it’s implemented.

Suggested change

int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);

Copilot · 2026-04-28T05:53:18Z

+    if (code == TSDB_CODE_SUCCESS && lastVer >= 0) {
+      wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
+            pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);
+
+      // update lastVer


In walCheckAndRepairMeta(), pFileInfo->lastVer is only updated when lastVer >= 0. For an empty first log file where firstVer == 0, the computed lastVer becomes -1 and this branch is skipped, leaving pFileInfo->lastVer potentially stale. Update pFileInfo->lastVer for the empty-file case as well (even when lastVer < 0).

Suggested change

if (code == TSDB_CODE_SUCCESS && lastVer >= 0) {

wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,

pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);

// update lastVer

if (code == TSDB_CODE_SUCCESS) {

if (lastVer >= 0) {

wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,

pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);

}

// update lastVer, including the empty-file case where lastVer == firstVer - 1

xiao-77 added 5 commits April 22, 2026 13:42

feat(wal): add walRecoveryPolicy config parameter

50b5164

- Add tsWalRecoveryPolicy global variable (default 0) - Only affects single replica: 0=refuse, 1=delete corrupted - Three replicas always auto-recover regardless of this config

refactor(wal): add replica parameter to walCheckAndRepairMeta

4eeeb7b

- Add replica parameter to walCheckAndRepairMeta function - Temporarily pass replica=1 in walOpen (will be updated in next task)

feat(wal): pass replica count from vnode to WAL layer

1f23a8a

- Add replica parameter to walOpen function - Pass actual replica count from vnode configuration - Pass replica count from mnode syncMgmt - Enable replica-aware recovery policy decision

feat(wal): add sync notification for WAL truncation

8d8b4f5

- Add syncNotifyWalTruncated interface in sync module - Call notification after three-replica recovery - Enable Raft to detect and sync missing logs

xiao-77 requested a review from guanshengliang as a code owner April 22, 2026 06:14

Copilot AI review requested due to automatic review settings April 22, 2026 06:14

xiao-77 requested a review from dapan1121 as a code owner April 22, 2026 06:14

taosdata-bot Bot added the team engine label Apr 22, 2026

Copilot started reviewing on behalf of xiao-77 April 22, 2026 06:14 View session

gemini-code-assist Bot reviewed Apr 22, 2026

View reviewed changes

Copilot AI reviewed Apr 22, 2026

View reviewed changes

xiao-77 added 4 commits April 22, 2026 14:21

feat(wal): remove sync notification (not needed)

dbf6f82

- Remove syncNotifyWalTruncated function and calls - Three-replica recovery works without explicit sync notification - Raft will automatically detect and sync missing logs

Merge branch 'feat/336/wal-recover' of https://github.com/taosdata/TD…

224b376

…engine into feat/336/wal-recover

Copilot AI review requested due to automatic review settings April 22, 2026 07:02

Copilot started reviewing on behalf of xiao-77 April 22, 2026 07:03 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

xiao-77 added 6 commits April 23, 2026 10:18

fix(test): fix WAL create database syntax, wal 1 -> wal_level 1

93d55a5

refactor(test): split WAL recovery tests into separate files per test…

c98c398

… case

Copilot AI review requested due to automatic review settings April 28, 2026 02:02

xiao-77 requested a review from a team as a code owner April 28, 2026 02:02

Copilot started reviewing on behalf of xiao-77 April 28, 2026 02:03 View session

Copilot AI reviewed Apr 28, 2026

View reviewed changes

fix: ci problems.

e970734

feat: remove unused test.

7fb598d

Copilot AI review requested due to automatic review settings April 28, 2026 05:47

Copilot started reviewing on behalf of xiao-77 April 28, 2026 05:48 View session

Copilot AI reviewed Apr 28, 2026

View reviewed changes

	if (shouldRecover && tsWalDeleteOnCorruption) {
	if (replica == 3 \|\| (shouldRecover && tsWalDeleteOnCorruption)) {

	TAOS_RETURN(walRenameCorruptedDir(pWal));
	code = walRenameCorruptedDir(pWal);

	// WAL recovery policy (only affects single replica)
	// WAL recovery policy (applies when replica count is not 3, including single-replica deployments)

	", snaphot index:%" PRId64 ", fileIdx:%d",
	", snapshot index:%" PRId64 ", fileIdx:%d",

-    TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));
+    if (fileIdx >= 0 && fileIdx < sz) {
+      TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));
+    }
+    wError("vgId:%d, WAL metadata/version inconsistency detected without a valid corrupted file index, "
+           "skip truncation, fileCnt:%d, fileIdx:%d",
+           pWal->cfg.vgId, sz, fileIdx);
+    TAOS_RETURN(TSDB_CODE_WAL_FILE_CORRUPTED);

-        # Dnode should fail to bring vnode online; vnode status should be offline/error
-        tdSql.query("show dnodes")
-        tdLog.info(f"dnode status after corrupted WAL start: {tdSql.queryResult}")
-        tdSql.execute("drop database if exists test_refuse")
+        try:
+            # Dnode should fail to bring vnode online; log status for debugging
+            tdSql.query("show dnodes")
+            tdLog.info(f"dnode status after corrupted WAL start: {tdSql.queryResult}")
+            # Explicitly verify the affected single-replica vnode is not usable.
+            # If the query succeeds, the dnode/vnode came online despite WAL corruption.
+            query_failed = False
+            try:
+                tdSql.query("select count(*) from t1")
+            except Exception as err:
+                query_failed = True
+                tdLog.info(f"Query failed as expected after corrupted WAL restart: {err}")
+            assert query_failed, (
+                "Expected queries against test_refuse.t1 to fail after restarting with "
+                "corrupted WAL and walRecoveryPolicy=0, but the query succeeded."
+            )
+        finally:
+            tdSql.execute("drop database if exists test_refuse")

		ASSERT_GE((int)logFiles.size(), 3);
		std::sort(logFiles.begin(), logFiles.end());

		@@ -431,7 +431,69 @@ static int32_t walRenameCorruptedDir(SWal* pWal) {
		TAOS_RETURN(code);
		}



	static int32_t walRenameCorruptedDir(SWal* pWal) __attribute__((unused));

	TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));
	TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));
	TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walDeleteOnCorruption", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));

		int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);

Conversation

xiao-77 commented Apr 22, 2026

Description

Issue(s)

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 28, 2026