32 commits
97fdf09
feat: add NVLink data extraction via NVML API
danbedford Apr 29, 2026
e47bc48
feat: add NVLink display to nvtop interface
danbedford Apr 29, 2026
2ef64f0
fix: replace non-existent NVML symbols with correct API calls
danbedford Apr 29, 2026
110a9cb
refactor: NVLink flat struct, CLI throughput, conditional layout
danbedford Apr 30, 2026
1b98806
feat: add NVLink error/correction counters, fix display layout, renam…
danbedford May 1, 2026
3f0face
fix: baseline reset for NVLink error counters, fix nvmlFieldValue_t l…
danbedford May 2, 2026
02464d6
feat: switch NVLink throughput from data-only to raw counters (--gett…
danbedford May 2, 2026
336ee07
docs: add TODO for datacenter GPU NVML API path, remove stale EMA com…
danbedford May 2, 2026
fc9e548
revert: remove EMA smoothing from NVLink throughput, prioritize raw a…
danbedford May 2, 2026
4b1f9f2
feat: increase NVLink max links from 18 to 36, hardcode 2s CLI poll i…
danbedford May 2, 2026
61a31dc
fix: add missing delwin for shader_cores/l2_cache_size/exec_engines, …
danbedford May 2, 2026
baa21a2
fix: priority-2 code quality improvements
danbedford May 2, 2026
737d4f0
feat: cache NVLink link count and version to avoid re-probing every r…
danbedford May 2, 2026
6f1183d
perf: skip NVLink re-probing when already detected, with monitored-se…
danbedford May 2, 2026
1d11cda
perf: cache full nvlink_info struct in refresh path, optimize draw pa…
danbedford May 2, 2026
b480c24
feat: add NVLink 6.0 (Rubin) version mapping for future-proofing
danbedford May 2, 2026
007c4ee
feat: detect and display NVLink-supported GPUs with 0 active links
danbedford May 3, 2026
4e2ff56
fix: count only active NVLink links, not physical slots with inactive…
danbedford May 3, 2026
08ca032
fix: account for NVLink window width in device_length() even with 0 a…
danbedford May 3, 2026
a527583
Fix: nvtop_set_nvlink_probe() overwrites any_device_has_nvlink_active
danbedford May 3, 2026
344e5a6
Fix: device_length() should not expand panel for NVLink 0-link case
danbedford May 3, 2026
4c29a4a
Fix: fan display format should use any_device_has_nvlink_active, not …
danbedford May 3, 2026
7b4987a
fix: guard NVLink functions with vendor check to avoid container_of o…
danbedford May 3, 2026
139d585
fix: device_length() should not expand panel for NVLink active links
danbedford May 3, 2026
50d51b7
fix: two-tier flag consistency and monitored-set-change state reset
danbedford May 3, 2026
03a65f3
# NVTop NVLink Fork - Changelog
danbedford May 3, 2026
f6955ee
Merge branch 'master' into nvlink
Syllo May 6, 2026
666ffed
feat: address PR #469 maintainer review comments
danbedford May 6, 2026
47a6cf8
fix: replace #include nvml.h with manual nvmlFieldValue_t typedef
danbedford May 7, 2026
b97dac8
fix: use {0} initializer for nvl_info in draw_devices() NVLink display
danbedford May 7, 2026
a226ed7
feat: add NVML field 160 ECC data errors to batched NVLink query
danbedford May 7, 2026
df42cc2
ui: use 2-letter labels for NVLink counters (FL/EE/CR)
danbedford May 7, 2026
31 changes: 31 additions & 0 deletions include/nvtop/extract_gpuinfo_common.h
@@ -240,4 +240,35 @@ inline unsigned busy_usage_from_time_usage_round(uint64_t current_use_ns, uint64

unsigned nvtop_pcie_gen_from_link_speed(unsigned linkSpeed);

// NVLink support
#define NVTOP_NVLINK_MAX_LINKS 36

struct nvlink_info {
  unsigned num_links;                   // Number of NVLink links on this device
  unsigned version;                     // NVLink version (e.g. 3 for NVLink 3.0)
  bool supported;                       // NVLink is supported on this device
  bool has_throughput;                  // Whether throughput data was available this cycle
  unsigned long long aggregate_tx;      // Aggregate TX throughput across all links (KiB/s)
  unsigned long long aggregate_rx;      // Aggregate RX throughput across all links (KiB/s)
  unsigned long long total_errors;      // Cumulative-since-launch errors across all links
  unsigned long long total_corrections; // Cumulative-since-launch CRC corrections across all links
  unsigned long long total_ecc_errors;  // Cumulative-since-launch ECC data errors across all links
};

unsigned nvtop_get_nvlink_info(struct gpu_info *gpu_info, struct nvlink_info *nvlink_info);
Owner


You need to define these functions in another file (fallback ones returning unsupported/false/0) for when nvtop is not being compiled with NVIDIA support enabled. Currently they are only available in extract_gpuinfo_nvidia.c.

I would advise creating a new .c file and compiling it only when NVIDIA_SUPPORT is OFF.

Contributor Author


Done. Created src/nvlink_nvidia_disabled.c with 4 no-op stub functions:

  • nvtop_get_nvlink_info() — returns 0
  • nvtop_get_nvlink_error_counts() — returns false
  • nvtop_probe_nvlink_list() — returns false
  • nvtop_reset_nvlink_cache() — no-op

The stub file is wired into src/CMakeLists.txt in the else(NVIDIA_SUPPORT) branch so it is only compiled when NVIDIA support is disabled. You originally mentioned 5 functions in your review, but the 5th (nvtop_set_nvlink_probe) was removed entirely per your suggestion in Comment 7, so there are now 4.

See commit 666ffed.
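
For readers following the thread, a fallback translation unit of this kind could look roughly like the sketch below. It is not copied from commit 666ffed: the four signatures are taken from the declarations in extract_gpuinfo_common.h in this diff, while the forward declarations and the choice to leave the output pointers untouched are assumptions made so the sketch compiles on its own. The real file presumably includes the nvtop header instead.

```c
/* Sketch of NVLink stubs for builds without NVIDIA support.
 * Assumption: the actual file includes nvtop/extract_gpuinfo_common.h;
 * the opaque forward declarations below only keep this sketch self-contained. */
#include <stdbool.h>

struct gpu_info;    /* defined by nvtop; opaque here */
struct list_head;   /* defined by nvtop; opaque here */
struct nvlink_info; /* defined in extract_gpuinfo_common.h; opaque here */

/* No NVLink data without NVIDIA support: report 0. */
unsigned nvtop_get_nvlink_info(struct gpu_info *gpu_info, struct nvlink_info *nvlink_info) {
  (void)gpu_info;
  (void)nvlink_info;
  return 0;
}

/* No baseline can ever be established, so report false.
 * Assumption: the output arguments are left unmodified. */
bool nvtop_get_nvlink_error_counts(struct gpu_info *gpu_info,
                                   unsigned long long *out_errors,
                                   unsigned long long *out_corrections,
                                   unsigned long long *out_ecc) {
  (void)gpu_info;
  (void)out_errors;
  (void)out_corrections;
  (void)out_ecc;
  return false;
}

/* No device in the list can have NVLink: report false. */
bool nvtop_probe_nvlink_list(struct list_head *devices) {
  (void)devices;
  return false;
}

/* Nothing is cached in this configuration, so there is nothing to reset. */
void nvtop_reset_nvlink_cache(struct gpu_info *gpu_info) {
  (void)gpu_info;
}
```

Keeping the stubs in their own file means the interface code can call the NVLink functions unconditionally, without `#ifdef` guards at each call site.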


// Get display-ready NVLink error/correction/ECC counts from the per-device persistent struct.
// Returns true if baseline has been established at least once.
bool nvtop_get_nvlink_error_counts(struct gpu_info *gpu_info,
                                   unsigned long long *out_errors,
                                   unsigned long long *out_corrections,
                                   unsigned long long *out_ecc);

// NVLink probe — call before initialize_curses to set layout mode
bool nvtop_probe_nvlink_list(struct list_head *devices);

// Reset per-GPU NVLink cache (probed flag, cached linkcount/version, cached info struct).
// Call when the monitored device set changes so newly-monitored NVLink GPUs get probed fresh.
void nvtop_reset_nvlink_cache(struct gpu_info *gpu_info);

#endif // EXTRACT_GPUINFO_COMMON_H__
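
The header comments above describe the error counters as "cumulative-since-launch" and tie `nvtop_get_nvlink_error_counts()` to a baseline being established. The diff shown here does not include the implementation, so the following is only a minimal sketch of one way a baseline-relative counter could work. The `counter_baseline` struct and `counter_since_baseline()` helper are hypothetical names, and re-baselining when the raw value goes backwards is an assumption, not something the PR states.

```c
/* Hypothetical baseline-relative counter; names and behaviour are
 * illustrative assumptions, not taken from the PR. */
#include <stdbool.h>

struct counter_baseline {
  bool established;        /* true once a first raw reading has been recorded */
  unsigned long long base; /* raw hardware counter value at baseline time */
};

/* Returns how far the raw counter has advanced since the baseline.
 * The first reading becomes the baseline, so the first result is 0.
 * If the raw counter drops below the baseline (for example after a
 * driver-side reset), the baseline is re-established at the new value. */
unsigned long long counter_since_baseline(struct counter_baseline *b, unsigned long long raw) {
  if (!b->established || raw < b->base) {
    b->base = raw;
    b->established = true;
  }
  return raw - b->base;
}
```

With a zero-initialized `struct counter_baseline`, feeding in raw readings of 100 and then 103 yields 0 and then 3, so the display shows only errors observed while the monitor has been running.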
3 changes: 3 additions & 0 deletions include/nvtop/interface_internal_common.h
@@ -70,9 +70,11 @@ struct device_window {
  WINDOW *gpu_clock_info;
  WINDOW *mem_clock_info;
  WINDOW *pcie_info;
  WINDOW *nvlink_info;
  WINDOW *shader_cores;
  WINDOW *l2_cache_size;
  WINDOW *exec_engines;
  WINDOW *nvlink_errors;
  bool enc_was_visible;
  bool dec_was_visible;
  nvtop_time last_decode_seen;
@@ -154,6 +156,7 @@ enum device_field {
  device_shadercores,
  device_l2features,
  device_execengines,
  device_nvlink_errors,
  device_field_count,
};

2 changes: 2 additions & 0 deletions src/CMakeLists.txt
@@ -52,6 +52,8 @@ endif()

if(NVIDIA_SUPPORT)
  target_sources(nvtop PRIVATE extract_gpuinfo_nvidia.c)
else()
  target_sources(nvtop PRIVATE nvlink_nvidia_disabled.c)
endif()

if(ASCEND_SUPPORT)