Draft
Changes from 7 commits
@@ -373,6 +373,14 @@ public class DatanodeConfiguration extends ReconfigurableConfig {
)
private boolean isDiskCheckEnabled = true;

@Config(key = "hdds.datanode.rocksdb.disk.check.io.test.enabled",
Contributor

How about we rename it to "hdds.datanode.disk.check.rocksdb.io.test.enabled", so that all the disk check properties share the "hdds.datanode.disk.check" prefix?

Contributor Author

I have moved the switch to its own PR in #10149.

I have updated it there.

defaultValue = "true",
type = ConfigType.BOOLEAN,
tags = {DATANODE},
description = "The configuration to enable or disable RocksDb disk IO checks."
)
private boolean isRocksDbDiskCheckEnabled = true;

@Config(key = "hdds.datanode.disk.check.io.failures.tolerated",
defaultValue = "1",
type = ConfigType.INT,
@@ -936,6 +944,10 @@ public boolean isDiskCheckEnabled() {
return isDiskCheckEnabled;
}

public boolean isRocksDbDiskCheckEnabled() {
return isRocksDbDiskCheckEnabled;
}

public Duration getDiskCheckSlidingWindowTimeout() {
return diskCheckSlidingWindowTimeout;
}
@@ -306,22 +306,25 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused)

@VisibleForTesting
public VolumeCheckResult checkDbHealth(File dbFile) throws InterruptedException {
Contributor

Let's add javadoc to this method to clarify that our goal is to check global files like CURRENT and MANIFEST, and that verifying the contents of all SST files/values in the DB is done by the container data scanner.
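
To illustrate what "global files" refers to: a RocksDB directory contains a CURRENT file whose content is the name of the active MANIFEST file. The sketch below checks only that relationship, using the JDK alone. It is not what this PR does (the PR opens the DB as a secondary instance), and `GlobalDbFilesSketch` and `globalFilesLookHealthy` are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the PR: inspects only the DB-wide
// metadata files and never reads SST files or values.
final class GlobalDbFilesSketch {

  private GlobalDbFilesSketch() {
  }

  /**
   * Returns true if CURRENT is readable and names an existing, readable
   * MANIFEST file inside dbDir. Any I/O error is treated as unhealthy.
   */
  static boolean globalFilesLookHealthy(Path dbDir) {
    Path current = dbDir.resolve("CURRENT");
    try {
      // CURRENT holds a single line such as "MANIFEST-000005".
      String manifestName =
          new String(Files.readAllBytes(current), StandardCharsets.UTF_8).trim();
      if (!manifestName.startsWith("MANIFEST-")) {
        return false;
      }
      Path manifest = dbDir.resolve(manifestName);
      return Files.isRegularFile(manifest) && Files.isReadable(manifest);
    } catch (IOException e) {
      return false;
    }
  }
}
```

Opening the DB exercises these same files and more, which is presumably why the javadoc should state that per-SST verification is left to the container data scanner.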

- if (!getDiskCheckEnabled()) {
+ if (!(getDiskCheckEnabled() && getDatanodeConfig().isRocksDbDiskCheckEnabled())) {
return VolumeCheckResult.HEALTHY;
}

try (ManagedOptions managedOptions = new ManagedOptions();
- ManagedRocksDB ignored = ManagedRocksDB.openReadOnly(managedOptions, dbFile.toString())) {
+ ManagedRocksDB ignored =
+ ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
ptlrs marked this conversation as resolved.
Contributor

In this comment it looks like the logs are being written to the disk check dir, but it seems like the code doesn't match.

Suggested change
- ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) {
+ ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getDiskCheckDir().getPath())) {

Contributor

> I have updated the PR to remove these log files every time after closing the secondary instance

I'm also not seeing this. Are there still some commits pending?
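
One possible shape for the cleanup being discussed, as a minimal sketch: give the secondary instance a dedicated log directory and delete it in a `finally` block once the check is done. The RocksDB open/close is stood in for by a callback so the example needs only the JDK. `SecondaryLogCleanupSketch`, `DbCheck`, and `checkAndCleanUp` are hypothetical names and are not taken from this PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the PR.
final class SecondaryLogCleanupSketch {

  /** Stand-in for "open the DB as a secondary instance, then close it". */
  interface DbCheck {
    void run(Path logDir) throws Exception;
  }

  private SecondaryLogCleanupSketch() {
  }

  /**
   * Creates a fresh log directory under parentDir, runs the check against
   * it, and removes the directory afterwards whether or not the check threw.
   */
  static void checkAndCleanUp(Path parentDir, DbCheck check) throws Exception {
    Path logDir = Files.createTempDirectory(parentDir, "rocksdb-secondary-");
    try {
      check.run(logDir);
    } finally {
      deleteRecursively(logDir);
    }
  }

  /** Deletes root and everything below it, children before parents. */
  static void deleteRecursively(Path root) throws IOException {
    if (!Files.exists(root)) {
      return;
    }
    List<Path> deepestFirst;
    try (Stream<Path> paths = Files.walk(root)) {
      deepestFirst = paths.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : deepestFirst) {
      Files.deleteIfExists(p);
    }
  }
}
```

In the real method the callback body would correspond to the try-with-resources block that opens the secondary instance, so the directory is removed only after the instance has been closed.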

// Do nothing. Only check if rocksdb is accessible.
LOG.debug("Successfully opened the database at \"{}\" for HDDS volume {}.", dbFile, getStorageDir());
} catch (Exception e) {
if (Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Check of database for volume " + this + " interrupted.");
}
- LOG.warn("Could not open Volume DB located at {}", dbFile, e);

+ LOG.error("Could not open Volume DB located at {}", dbFile, e);
getIoTestSlidingWindow().add();
}


if (getIoTestSlidingWindow().isExceeded()) {
LOG.error("Failed to open the database at \"{}\" for HDDS volume {}: " +
"encountered more than the {} tolerated failures.",
@@ -75,6 +75,14 @@ public static ManagedRocksDB openReadOnly(
);
}

public static ManagedRocksDB openAsSecondary(
Contributor

@ptlrs thanks for all the research you've done on secondary instances. It would be great to put a summary as a javadoc above this method about how they work and what we can expect vs. readonly instances.

final ManagedOptions options,
final String dbPath,
final String secondaryDbLogFilePath)
throws RocksDBException {
return new ManagedRocksDB(RocksDB.openAsSecondary(options, dbPath, secondaryDbLogFilePath));
ptlrs marked this conversation as resolved.
}

public static ManagedRocksDB open(
final DBOptions options, final String path,
final List<ColumnFamilyDescriptor> columnFamilyDescriptors,