4 changes: 3 additions & 1 deletion include/common/tglobal.h
@@ -291,7 +291,9 @@ extern int64_t tsmaDataDeleteMark;
// wal
extern int64_t tsWalFsyncDataSizeLimit;
extern bool tsWalForceRepair;
-extern bool tsWalDeleteOnCorruption;
+// WAL recovery policy (only affects single replica)
Copilot AI Apr 22, 2026

Header comment says walRecoveryPolicy "only affects single replica", but the new logic checks it for all replica != 3 cases (so it would also affect replica=2 if that configuration exists). Please align the comment (and possibly the config semantics) with actual behavior to avoid misconfiguration.

Suggested change
-// WAL recovery policy (only affects single replica)
+// WAL recovery policy (applies when replica count is not 3, including single-replica deployments)
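
For reference, the behavior described in this comment can be condensed into a single predicate. This is only a sketch: `walShouldAutoRecover` is a hypothetical helper that does not exist in the PR, and it assumes the branch structure shown in `walTruncateCorruptedFiles()` (replica == 3 always recovers, every other replica count defers to `walRecoveryPolicy`).

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper (not part of the PR): mirrors the decision made at the
// top of walTruncateCorruptedFiles() so the effective semantics are explicit.
static bool walShouldAutoRecover(int32_t replica, int32_t recoveryPolicy) {
  if (replica == 3) {
    return true;  // three-replica groups always truncate and continue
  }
  // Every other replica count (1, 2, ...) follows walRecoveryPolicy:
  // 0 = refuse to start, 1 = delete corrupted files and start.
  return recoveryPolicy == 1;
}
```

Written this way, it is visible that a replica count of 2 takes the same path as a single replica, which is the mismatch with the header comment.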

+// 0 = refuse to start (default), 1 = delete corrupted and start
+extern int32_t tsWalRecoveryPolicy;

// internal
extern bool tsDiskIDCheckEnabled;
2 changes: 2 additions & 0 deletions include/libs/sync/sync.h
@@ -303,6 +303,8 @@ const char* syncStr(ESyncState state);

int32_t syncNodeGetConfig(int64_t rid, SSyncCfg* cfg);

+int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);
+
Comment on lines +306 to +307
Copilot AI Apr 28, 2026

syncNotifyWalTruncated is declared in the public sync header, but there is no implementation in the repo (no definition found under source/libs/sync/src). Any new call site will fail to link, and the interface addition is incomplete as-is. Add the corresponding implementation (and ensure it’s exported from the sync library) or remove the declaration until it’s implemented.

Suggested change
-int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);

// util
int32_t syncSnapInfoDataRealloc(SSnapshot* pSnap, int32_t size);

2 changes: 1 addition & 1 deletion include/libs/wal/wal.h
@@ -166,7 +166,7 @@ int32_t walInit(stopDnodeFn stopDnode);
void walCleanUp();

// handle open and ctl
-SWal *walOpen(const char *path, SWalCfg *pCfg);
+SWal *walOpen(const char *path, SWalCfg *pCfg, int32_t replica);
Comment thread
xiao-77 marked this conversation as resolved.
Copilot AI Apr 28, 2026

walOpen signature now requires a replica argument, but there are still call sites in the repo (e.g. source/libs/sync/test/*.cpp) using the old 2-arg form. This will break compilation; please update all remaining callers (or provide a backward-compatible wrapper/default) so the tree builds cleanly.
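
One way to read the "backward-compatible wrapper" option is sketched below. Everything here is an assumption made for illustration: `walOpenCompat` is a hypothetical name, the `SWal` and `SWalCfg` definitions are minimal stand-ins rather than the real structs, and the choice of `1` as the default replica count is a guess at the most conservative setting (with the default `walRecoveryPolicy` of 0 it refuses to start on corruption).

```c
#include <stddef.h>
#include <stdint.h>

// Minimal stand-ins so the sketch compiles on its own; the real SWal and
// SWalCfg in include/libs/wal/wal.h have many more fields.
typedef struct SWalCfg { int32_t vgId; } SWalCfg;
typedef struct SWal { int32_t vgId; int32_t replica; } SWal;

static SWal gSketchWal;  // stand-in storage instead of a heap allocation

// Stand-in for the new three-argument walOpen() from this PR.
static SWal *walOpen(const char *path, SWalCfg *pCfg, int32_t replica) {
  if (path == NULL || pCfg == NULL) return NULL;
  gSketchWal.vgId = pCfg->vgId;
  gSketchWal.replica = replica;
  return &gSketchWal;
}

// Hypothetical wrapper for callers still using the two-argument form
// (for example the tests under source/libs/sync/test). Assumes replica = 1.
static SWal *walOpenCompat(const char *path, SWalCfg *pCfg) {
  return walOpen(path, pCfg, 1);
}
```

Updating every call site directly avoids carrying a second entry point; the wrapper is mainly useful if out-of-tree callers also need to keep building.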

int32_t walAlter(SWal *, SWalCfg *pCfg);
int32_t walPersist(SWal *);
void walClose(SWal *);
8 changes: 4 additions & 4 deletions source/common/src/tglobal.c
@@ -356,7 +356,7 @@ bool tsStartUdfd = true;
// wal
int64_t tsWalFsyncDataSizeLimit = (100 * 1024 * 1024L);
bool tsWalForceRepair = 0;
-bool tsWalDeleteOnCorruption = false;
+int32_t tsWalRecoveryPolicy = 0; // Default: refuse to start for single replica

// ttl
bool tsTtlChangeOnWrite = false; // if true, ttl delete time changes on last write
@@ -981,7 +981,7 @@ static int32_t taosAddServerCfg(SConfig *pCfg) {
TAOS_CHECK_RETURN(cfgAddInt64(pCfg, "syncApplyQueueSize", tsSyncApplyQueueSize, 32, 2048, CFG_SCOPE_SERVER, CFG_DYN_SERVER,CFG_CATEGORY_GLOBAL));
TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "syncRoutineReportInterval", tsRoutineReportInterval, 5, 600, CFG_SCOPE_SERVER, CFG_DYN_SERVER,CFG_CATEGORY_LOCAL));
TAOS_CHECK_RETURN(cfgAddBool(pCfg, "syncLogHeartbeat", tsSyncLogHeartbeat, CFG_SCOPE_SERVER, CFG_DYN_SERVER,CFG_CATEGORY_LOCAL));
-TAOS_CHECK_RETURN(cfgAddBool(pCfg, "walDeleteOnCorruption", tsWalDeleteOnCorruption, CFG_SCOPE_SERVER, CFG_DYN_NONE,CFG_CATEGORY_LOCAL));
+TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));
Copilot AI Apr 28, 2026

Config key walDeleteOnCorruption is removed and replaced with walRecoveryPolicy, which is a breaking change for existing deployments/config files and docs. If backward compatibility is required, consider continuing to accept walDeleteOnCorruption as an alias (or at least emit a clear warning) and update documentation accordingly.

Suggested change
-TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));
+TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walRecoveryPolicy", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));
+TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "walDeleteOnCorruption", tsWalRecoveryPolicy, 0, 1, CFG_SCOPE_SERVER, CFG_DYN_NONE, CFG_CATEGORY_LOCAL));
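
If the alias route is taken, the two keys also need a precedence rule. The sketch below shows one possible rule under stated assumptions: `WalPolicyInput` and `resolveWalRecoveryPolicy` are hypothetical names that do not exist in the PR, and the mapping `true -> 1` is only approximate, because the old flag renamed the corrupted directory (`walRenameCorruptedDir`) while policy 1 truncates files.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical view of what was found in the config file.
typedef struct {
  bool    hasPolicy;        // "walRecoveryPolicy" was set explicitly
  int32_t policy;           // its value, 0 or 1
  bool    hasLegacy;        // "walDeleteOnCorruption" was set explicitly
  bool    legacyDeleteFlag; // its value
} WalPolicyInput;

// The new key wins; the legacy key is only consulted when the new key is
// absent. *pUsedLegacy tells the caller to emit a deprecation warning.
static int32_t resolveWalRecoveryPolicy(const WalPolicyInput *in, bool *pUsedLegacy) {
  *pUsedLegacy = false;
  if (in->hasPolicy) {
    return in->policy;
  }
  if (in->hasLegacy) {
    *pUsedLegacy = true;
    return in->legacyDeleteFlag ? 1 : 0;
  }
  return 0;  // default: refuse to start
}
```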


TAOS_CHECK_RETURN(cfgAddInt32(pCfg, "syncTimeout", tsSyncTimeout, 0, 60 * 24 * 2 * 1000, CFG_SCOPE_SERVER, CFG_DYN_SERVER,CFG_CATEGORY_GLOBAL));

@@ -2143,8 +2143,8 @@ static int32_t taosSetServerCfg(SConfig *pCfg) {
tsRpcRecvLogThreshold = pItem->i32;
// GRANT_CFG_GET;

-TAOS_CHECK_GET_CFG_ITEM(pCfg, pItem, "walDeleteOnCorruption");
-tsWalDeleteOnCorruption = pItem->bval;
+TAOS_CHECK_GET_CFG_ITEM(pCfg, pItem, "walRecoveryPolicy");
+tsWalRecoveryPolicy = pItem->i32;

TAOS_RETURN(TSDB_CODE_SUCCESS);
}
2 changes: 1 addition & 1 deletion source/dnode/mnode/impl/src/mndMain.c
@@ -632,7 +632,7 @@ static int32_t mndInitWal(SMnode *pMnode) {
}
#endif

-pMnode->pWal = walOpen(path, &cfg);
+pMnode->pWal = walOpen(path, &cfg, pMnode->syncMgmt.numOfReplicas);
if (pMnode->pWal == NULL) {
code = TSDB_CODE_MND_RETURN_VALUE_NULL;
if (terrno != 0) code = terrno;
2 changes: 1 addition & 1 deletion source/dnode/vnode/src/vnd/vnodeOpen.c
@@ -490,7 +490,7 @@ SVnode *vnodeOpen(const char *path, int32_t diskPrimary, STfs *pTfs, SMsgCb msgC
TAOS_UNUSED(ret);

vInfo("vgId:%d, start to open vnode wal", TD_VID(pVnode));
-pVnode->pWal = walOpen(tdir, &(pVnode->config.walCfg));
+pVnode->pWal = walOpen(tdir, &(pVnode->config.walCfg), pVnode->config.syncCfg.replicaNum);
if (pVnode->pWal == NULL) {
vError("vgId:%d, failed to open vnode wal since %s. wal:%s", TD_VID(pVnode), tstrerror(terrno), tdir);
goto _err;
2 changes: 2 additions & 0 deletions source/libs/sync/inc/syncInt.h
@@ -341,6 +341,8 @@ int32_t syncNodeDynamicQuorum(const SSyncNode* pSyncNode);
bool syncNodeIsMnode(SSyncNode* pSyncNode);
int32_t syncNodePeerStateInit(SSyncNode* pSyncNode);

+int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer);
+
#ifdef __cplusplus
}
#endif
5 changes: 5 additions & 0 deletions source/libs/sync/src/syncMain.c
@@ -4090,3 +4090,8 @@ bool syncNodeCanChange(SSyncNode* pSyncNode) {
return true;
}
#endif
+
+int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer) {
+sInfo("vgId:%d, notified sync module: WAL truncated to ver:%" PRId64, vgId, truncatedVer);
+return TSDB_CODE_SUCCESS;
+}
Contributor

medium

The implementation of syncNotifyWalTruncated currently only logs a message. To fulfill the PR's objective of 'Enabling Raft to detect and sync missing logs', this function should update the state of the corresponding sync node (e.g., resetting its applied index or marking it as needing a snapshot/sync).

Copilot AI Apr 22, 2026

syncNotifyWalTruncated() is added to public headers, but there are no call sites in the repo. If the intent is to notify Raft after WAL truncation/recovery, wire this into the WAL recovery path (e.g., after walTruncateCorruptedFiles succeeds) and pass the actual truncated version; otherwise keep the symbol internal or remove it to avoid exposing an unused API surface.

Suggested change
-int32_t syncNotifyWalTruncated(int32_t vgId, int64_t truncatedVer) {
-sInfo("vgId:%d, notified sync module: WAL truncated to ver:%" PRId64, vgId, truncatedVer);
-return TSDB_CODE_SUCCESS;
-}

Copilot AI Apr 28, 2026

syncNotifyWalTruncated() is added as a public/internal API but is not called anywhere in this PR, and currently only logs a message without informing any SSyncNode state. This doesn’t match the PR description (“Call notification after three-replica recovery / Enable Raft to detect and sync missing logs”). Either wire this into the WAL truncation/recovery path (and have it trigger whatever raft re-sync behavior is required) or keep it internal until there’s an actual caller/behavior.

2 changes: 1 addition & 1 deletion source/libs/wal/inc/walInt.h
@@ -168,7 +168,7 @@ int32_t walRemoveMeta(SWal* pWal);
int32_t walRollImpl(SWal* pWal);
int32_t walRollFileInfo(SWal* pWal);
int32_t walScanLogGetLastVer(SWal* pWal, int32_t fileIdx, int64_t* lastVer);
-int32_t walCheckAndRepairMeta(SWal* pWal);
+int32_t walCheckAndRepairMeta(SWal* pWal, int32_t replica);
int64_t walChangeWrite(SWal* pWal, int64_t ver);

int32_t walCheckAndRepairIdx(SWal* pWal);
114 changes: 92 additions & 22 deletions source/libs/wal/src/walMeta.c
@@ -431,7 +431,69 @@ static int32_t walRenameCorruptedDir(SWal* pWal) {
TAOS_RETURN(code);
}

Copilot AI Apr 28, 2026

After switching the corruption path to walTruncateCorruptedFiles(), the old static helper walRenameCorruptedDir() is no longer referenced anywhere in this file. If the build enables -Wunused-function (or treats warnings as errors), this will cause a build break. Consider removing it or gating it behind the same feature/ifdef as any remaining users.

Suggested change
static int32_t walRenameCorruptedDir(SWal* pWal) __attribute__((unused));

-static int32_t walLogEntriesComplete(SWal* pWal) {
+static int32_t walTruncateCorruptedFiles(SWal* pWal, int32_t fileIdx, int32_t replica) {
+int32_t code = TSDB_CODE_SUCCESS;
Copilot AI Apr 28, 2026

code is declared in walTruncateCorruptedFiles() but never used. This will trigger an unused-variable warning on stricter builds; please remove it or use it for error propagation.

Suggested change
-int32_t code = TSDB_CODE_SUCCESS;

+bool shouldRecover = false;
+
+if (replica == 3) {
+shouldRecover = true;
+SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+wInfo("vgId:%d, WAL corrupted at ver:%" PRId64 ", auto-recovery enabled for replica=3",
+pWal->cfg.vgId, pFileInfo->firstVer);
+} else {
Comment on lines +447 to +451
Copilot AI Apr 28, 2026

PR description mentions notifying sync/Raft after three-replica recovery, but walTruncateCorruptedFiles() (the new truncation/recovery path) does not call syncNotifyWalTruncated() or any equivalent hook when replica >= 3. If Raft relies on this notification to resync missing logs, add the call here (or in the caller) at the point where truncation is committed.
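
A possible shape for that hook is sketched below. It is written as a registered callback on the assumption that the WAL library should not call into the sync library directly, which this PR does not confirm. `WalTruncateNotifyFn`, `walSetTruncateNotify`, `walNotifyTruncated`, and `recordTruncation` are all hypothetical names; `recordTruncation` stands in for `syncNotifyWalTruncated()` so the sketch is self-contained. The check uses `replica == 3` to match the code in this hunk.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical callback type with the same shape as syncNotifyWalTruncated().
typedef int32_t (*WalTruncateNotifyFn)(int32_t vgId, int64_t truncatedVer);

static WalTruncateNotifyFn gTruncateNotify = NULL;

// Hypothetical registration point, e.g. called once during startup.
static void walSetTruncateNotify(WalTruncateNotifyFn fn) { gTruncateNotify = fn; }

// Intended to be called where truncation is committed. Only three-replica
// groups are notified, since only they can re-fetch the missing log entries
// from peers. Returns 0 when there is nothing to do.
static int32_t walNotifyTruncated(int32_t vgId, int32_t replica, int64_t truncatedVer) {
  if (replica != 3 || gTruncateNotify == NULL) {
    return 0;
  }
  return gTruncateNotify(vgId, truncatedVer);
}

// Stand-in for syncNotifyWalTruncated(): records what it was told.
static int32_t gLastVgId = -1;
static int64_t gLastTruncatedVer = -1;

static int32_t recordTruncation(int32_t vgId, int64_t truncatedVer) {
  gLastVgId = vgId;
  gLastTruncatedVer = truncatedVer;
  return 0;
}
```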

+shouldRecover = (tsWalRecoveryPolicy == 1);
+if (shouldRecover) {
+SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+wWarn("vgId:%d, WAL corrupted at ver:%" PRId64 ", force recovery enabled by walRecoveryPolicy=1",
+pWal->cfg.vgId, pFileInfo->firstVer);
+} else {
+SWalFileInfo* pFileInfo = taosArrayGet(pWal->fileInfoSet, fileIdx);
+wError("vgId:%d, WAL corrupted at ver:%" PRId64 ", refusing to start to prevent data loss",
+pWal->cfg.vgId, pFileInfo->firstVer);
+wError("vgId:%d, corrupted WAL files are preserved for manual inspection", pWal->cfg.vgId);
+wError("vgId:%d, to force recovery with data loss, set 'walRecoveryPolicy 1' in taos.cfg and restart",
+pWal->cfg.vgId);
Copilot AI Apr 22, 2026

walTruncateCorruptedFiles() doesn't validate fileIdx before calling taosArrayGet(pWal->fileInfoSet, fileIdx) and before computing the remove batch length (size - fileIdx). If fileIdx is negative or >= array size, this can lead to invalid memory access or a huge batch removal due to underflow. Add defensive checks at the start of this function and return an appropriate error when fileIdx is out of range.
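
The defensive check could look roughly like the following. This is a standalone sketch, not the PR's code: `walValidateTruncateIdx` is a hypothetical helper, and `WAL_SKETCH_ERR_CORRUPTED` is a placeholder for whichever error code (for example `TSDB_CODE_WAL_FILE_CORRUPTED`) the real function would return.

```c
#include <stdint.h>

#define WAL_SKETCH_OK            0
#define WAL_SKETCH_ERR_CORRUPTED (-1)  // placeholder for the real error code

// Hypothetical guard to run before taosArrayGet()/taosArrayRemoveBatch().
// On success, *pRemoveCnt is the number of entries from fileIdx to the end,
// which is guaranteed to be in [1, fileCnt] and therefore cannot underflow.
static int32_t walValidateTruncateIdx(int32_t fileIdx, int32_t fileCnt, int32_t *pRemoveCnt) {
  if (fileCnt <= 0 || fileIdx < 0 || fileIdx >= fileCnt) {
    return WAL_SKETCH_ERR_CORRUPTED;
  }
  *pRemoveCnt = fileCnt - fileIdx;
  return WAL_SKETCH_OK;
}
```

Computing the batch length once, after the range check, also keeps the value passed to the batch removal consistent with the loop bound used when deleting files.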

+TAOS_RETURN(TSDB_CODE_WAL_FILE_CORRUPTED);
+}
+}
+
+if (!shouldRecover) {
+TAOS_RETURN(TSDB_CODE_WAL_FILE_CORRUPTED);
+}
+
+wInfo("vgId:%d, truncating WAL at corrupted file index %d", pWal->cfg.vgId, fileIdx);
+
Comment on lines +434 to +469
Copilot AI Apr 28, 2026

walTruncateCorruptedFiles() assumes the corrupted file itself has already been truncated by walScanLogGetLastVer, but walScanLogGetLastVer() intentionally skips truncation when fileIdx == sz - 1 (last file). That means corruption in the last WAL segment may not actually be removed, and the node may keep re-entering repair logic (or fail later) even though walTruncateCorruptedFiles() returns success. Consider truncating the corrupted file here as well (or adjusting walScanLogGetLastVer/repair flow) so last-segment corruption is recoverable under the intended policy/replica behavior.

+// Delete all files from fileIdx onwards
+for (int32_t i = fileIdx; i < taosArrayGetSize(pWal->fileInfoSet); i++) {
+SWalFileInfo* pDelFileInfo = taosArrayGet(pWal->fileInfoSet, i);
Comment on lines +465 to +472
Copilot AI Apr 28, 2026

walTruncateCorruptedFiles() currently deletes files starting at fileIdx + 1, but it’s also called from walLogEntriesComplete() when a non-consecutive firstVer is detected. In that gap/mismatch case, keeping the fileIdx file leaves the WAL still inconsistent (the gap remains). Consider deleting from fileIdx (or otherwise handling the gap case separately) and then recomputing vers.lastVer/writeCur to reflect the remaining prefix.

+char delLogName[WAL_FILE_LEN];
+char delIdxName[WAL_FILE_LEN];
+
+walBuildLogName(pWal, pDelFileInfo->firstVer, delLogName);
+walBuildIdxName(pWal, pDelFileInfo->firstVer, delIdxName);
+
+if (taosRemoveFile(delLogName) != 0) {
+wWarn("vgId:%d, failed to remove corrupted log file %s", pWal->cfg.vgId, delLogName);
+} else {
+wInfo("vgId:%d, removed corrupted log file %s", pWal->cfg.vgId, delLogName);
+}
+
+if (taosRemoveFile(delIdxName) != 0) {
+wWarn("vgId:%d, failed to remove corrupted idx file %s", pWal->cfg.vgId, delIdxName);
+} else {
+wInfo("vgId:%d, removed corrupted idx file %s", pWal->cfg.vgId, delIdxName);
+}
+}
+
+// Remove deleted files from fileInfoSet
+taosArrayRemoveBatch(pWal->fileInfoSet, fileIdx, taosArrayGetSize(pWal->fileInfoSet) - fileIdx, NULL);
+
+wInfo("vgId:%d, WAL truncated successfully", pWal->cfg.vgId);
+
+TAOS_RETURN(TSDB_CODE_SUCCESS);
+}
+
+static int32_t walLogEntriesComplete(SWal* pWal, int32_t replica) {
int32_t sz = taosArrayGetSize(pWal->fileInfoSet);
bool complete = true;
int32_t fileIdx = -1;
@@ -451,13 +513,9 @@ static int32_t walLogEntriesComplete(SWal* pWal) {

if (!complete) {
wError("vgId:%d, WAL log entries incomplete in range [%" PRId64 ", %" PRId64 "], index:%" PRId64
-", snaphot index:%" PRId64,
-pWal->cfg.vgId, pWal->vers.firstVer, pWal->vers.lastVer, index, pWal->vers.snapshotVer);
-if (tsWalDeleteOnCorruption) {
-TAOS_RETURN(walRenameCorruptedDir(pWal));
-} else {
-TAOS_RETURN(TSDB_CODE_WAL_LOG_INCOMPLETE);
-}
+", snaphot index:%" PRId64 ", fileIdx:%d",
Copilot AI Apr 22, 2026

Typo in log message: "snaphot" should be "snapshot".

Suggested change
-", snaphot index:%" PRId64 ", fileIdx:%d",
+", snapshot index:%" PRId64 ", fileIdx:%d",

Copilot AI Apr 28, 2026

Typo in log message: snaphot index should be snapshot index for consistency/searchability in logs.

Suggested change
-", snaphot index:%" PRId64 ", fileIdx:%d",
+", snapshot index:%" PRId64 ", fileIdx:%d",

+pWal->cfg.vgId, pWal->vers.firstVer, pWal->vers.lastVer, index, pWal->vers.snapshotVer, fileIdx);
+TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));
Copilot AI Apr 22, 2026

walTruncateCorruptedFiles() assumes fileIdx is a valid index into pWal->fileInfoSet (taosArrayGet/taosArrayRemoveBatch). In walLogEntriesComplete(), fileIdx can become == sz when the loop completes without breaking, which would make taosArrayGet out-of-bounds and can crash. Add explicit range checks (fileIdx >= 0 && fileIdx < taosArrayGetSize(...)) before using fileIdx, and decide a safe truncation point (or return an error) when the inconsistency is only in meta/vers rather than a specific corrupted file.

Suggested change
-TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));
+if (fileIdx >= 0 && fileIdx < sz) {
+TAOS_RETURN(walTruncateCorruptedFiles(pWal, fileIdx, replica));
+}
+wError("vgId:%d, WAL metadata/version inconsistency detected without a valid corrupted file index, "
+"skip truncation, fileCnt:%d, fileIdx:%d",
+pWal->cfg.vgId, sz, fileIdx);
+TAOS_RETURN(TSDB_CODE_WAL_FILE_CORRUPTED);

} else {
Comment on lines 520 to 534
Copilot AI Apr 28, 2026

walLogEntriesComplete() can call walTruncateCorruptedFiles(pWal, fileIdx, ...) with fileIdx == sz when the while loop finishes without hitting break but complete is still false. That makes taosArrayGet(pWal->fileInfoSet, fileIdx) out-of-bounds inside walTruncateCorruptedFiles, which can crash during WAL open/repair. Clamp fileIdx to a valid range (e.g., sz - 1) or handle the fileIdx >= sz case explicitly before calling the truncation routine.

Comment on lines 520 to 534
Copilot AI Apr 28, 2026

After walLogEntriesComplete() calls walTruncateCorruptedFiles(), the code returns success without updating pWal->vers/totSize or persisting meta. If any files are removed, the in-memory and on-disk meta can remain stale, and subsequent components may observe an incorrect lastVer. Ensure the truncation path recomputes versions and saves meta (or returns a code that forces the caller to persist the repaired meta).

TAOS_RETURN(TSDB_CODE_SUCCESS);
}
@@ -517,7 +575,7 @@ void walRegfree(regex_t* ptr) {
regfree(ptr);
}

-int32_t walCheckAndRepairMeta(SWal* pWal) {
+int32_t walCheckAndRepairMeta(SWal* pWal, int32_t replica) {
// load log files, get first/snapshot/last version info
int32_t code = 0;
int32_t lino = 0;
@@ -526,6 +584,8 @@ int32_t walCheckAndRepairMeta(SWal* pWal) {
regex_t logRegPattern, idxRegPattern;
TdDirPtr pDir = NULL;
SArray* actualLog = NULL;
+bool walTruncated = false;
+int64_t truncatedVer = -1;
Copilot AI Apr 22, 2026

walTruncated and truncatedVer are declared but never used, which is dead code and can trigger -Wunused-variable warnings in some build configurations. Either remove them or use them for the intended truncation notification/metadata tracking (e.g., record truncatedVer and propagate it to the sync module).

Suggested change
-bool walTruncated = false;
-int64_t truncatedVer = -1;


wInfo("vgId:%d, begin to repair meta, wal path:%s, first index:%" PRId64 ", last index:%" PRId64
", snapshot index:%" PRId64,
@@ -607,22 +667,32 @@ int32_t walCheckAndRepairMeta(SWal* pWal) {
if (lastVer < 0) {
if (code != TSDB_CODE_WAL_LOG_NOT_EXIST) {
wError("vgId:%d, failed to scan wal last index since %s", pWal->cfg.vgId, tstrerror(code));
-if (tsWalDeleteOnCorruption) {
-TAOS_RETURN(walRenameCorruptedDir(pWal));
+
+code = walTruncateCorruptedFiles(pWal, fileIdx, replica);
+if (code != TSDB_CODE_SUCCESS) {
+goto _exit;
}
-goto _exit;
-}
-// empty log file
-lastVer = pFileInfo->firstVer - 1;

-code = TSDB_CODE_SUCCESS;
+// After truncation, set lastVer based on remaining files
+lastVer = (fileIdx > 0) ? ((SWalFileInfo*)taosArrayGet(pWal->fileInfoSet, fileIdx - 1))->lastVer : -1;
+wInfo("vgId:%d, WAL truncated, new lastVer:%" PRId64, pWal->cfg.vgId, lastVer);
+updateMeta = true;
+code = TSDB_CODE_SUCCESS;
+} else {
+// empty log file
+lastVer = pFileInfo->firstVer - 1;
+code = TSDB_CODE_SUCCESS;
+}
}
-wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
-pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);

-// update lastVer
-pFileInfo->lastVer = lastVer;
-totSize += pFileInfo->fileSize;
+if (code == TSDB_CODE_SUCCESS && lastVer >= 0) {
+wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
+pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);
+
+// update lastVer
Comment on lines +702 to +706
Copilot AI Apr 28, 2026

In walCheckAndRepairMeta(), pFileInfo->lastVer is only updated when lastVer >= 0. For an empty first log file where firstVer == 0, the computed lastVer becomes -1 and this branch is skipped, leaving pFileInfo->lastVer potentially stale. Update pFileInfo->lastVer for the empty-file case as well (even when lastVer < 0).

Suggested change
-if (code == TSDB_CODE_SUCCESS && lastVer >= 0) {
-wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
-pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);
-// update lastVer
+if (code == TSDB_CODE_SUCCESS) {
+if (lastVer >= 0) {
+wInfo("vgId:%d, repaired file %s, last index:%" PRId64 ", fileSize:%" PRId64 ", fileSize in meta:%" PRId64,
+pWal->cfg.vgId, fnameStr, lastVer, fileSize, pFileInfo->fileSize);
+}
+// update lastVer, including the empty-file case where lastVer == firstVer - 1

+pFileInfo->lastVer = lastVer;
+totSize += pFileInfo->fileSize;
+}
Comment on lines +684 to +709
Copilot AI Apr 22, 2026

In the lastVer<0 && code!=TSDB_CODE_WAL_LOG_NOT_EXIST branch, walTruncateCorruptedFiles() mutates pWal->fileInfoSet (removes entries starting at fileIdx). After that, pFileInfo still points to the removed element, but the code can still enter the update block and write pFileInfo->lastVer / use pFileInfo->fileSize, which becomes invalid memory access. Also, totSize may already include sizes of files that just got deleted. After truncation, avoid using the old pFileInfo pointer and recompute totSize/versions from the remaining fileInfoSet (e.g., restart the scan or break and rebuild totals).
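
The "rebuild totals from what is left" idea can be sketched independently of the WAL types. The code below is an illustration only: `WalFileInfoSketch` is a simplified stand-in for `SWalFileInfo`, and `walRecomputeAfterTruncate` is a hypothetical helper. It reads only entries before `fileIdx`, so it never touches an element that the truncation removed.

```c
#include <stddef.h>
#include <stdint.h>

// Simplified stand-in for SWalFileInfo; the real struct has more fields.
typedef struct {
  int64_t firstVer;
  int64_t lastVer;
  int64_t fileSize;
} WalFileInfoSketch;

// Hypothetical helper: after entries [fileIdx, fileCnt) have been dropped,
// derive totSize and lastVer from the surviving prefix [0, fileIdx) instead
// of reusing a pointer or a running total that predates the truncation.
// Returns the new file count, or -1 on invalid input.
static int32_t walRecomputeAfterTruncate(const WalFileInfoSketch *files, int32_t fileCnt, int32_t fileIdx,
                                         int64_t *pTotSize, int64_t *pLastVer) {
  if (files == NULL || fileIdx < 0 || fileIdx > fileCnt) {
    return -1;
  }
  int64_t totSize = 0;
  for (int32_t i = 0; i < fileIdx; i++) {
    totSize += files[i].fileSize;
  }
  *pTotSize = totSize;
  *pLastVer = (fileIdx > 0) ? files[fileIdx - 1].lastVer : -1;
  return fileIdx;
}
```

In the repair loop this would correspond to breaking out of the scan once truncation succeeds and then deriving the totals, rather than continuing with the loop-local `pFileInfo`.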

}

// reset vers info and so on
@@ -647,7 +717,7 @@ int32_t walCheckAndRepairMeta(SWal* pWal) {
TAOS_CHECK_EXIT(walSaveMeta(pWal));
}

-TAOS_CHECK_EXIT(walLogEntriesComplete(pWal));
+TAOS_CHECK_EXIT(walLogEntriesComplete(pWal, replica));

Comment thread
xiao-77 marked this conversation as resolved.
wInfo("vgId:%d, success to repair meta, wal path:%s, first index:%" PRId64 ", last index:%" PRId64
", snapshot index:%" PRId64,
4 changes: 2 additions & 2 deletions source/libs/wal/src/walMgmt.c
@@ -129,7 +129,7 @@ int32_t walInitWriteFileForSkip(SWal *pWal) {
TAOS_RETURN(code);
}

-SWal *walOpen(const char *path, SWalCfg *pCfg) {
+SWal *walOpen(const char *path, SWalCfg *pCfg, int32_t replica) {
int32_t code = 0;
SWal *pWal = taosMemoryCalloc(1, sizeof(SWal));
if (pWal == NULL) {
@@ -206,7 +206,7 @@ SWal *walOpen(const char *path, SWalCfg *pCfg) {
wWarn("vgId:%d, failed to load meta, code:0x%x", pWal->cfg.vgId, code);
}
if (pWal->cfg.level != TAOS_WAL_SKIP) {
-code = walCheckAndRepairMeta(pWal);
+code = walCheckAndRepairMeta(pWal, replica);
if (code < 0) {
wError("vgId:%d, cannot open wal since repair meta file failed since %s", pWal->cfg.vgId, tstrerror(code));
goto _err;