Fix RNG seeding bug in multi-GPU KMeans #2029
Open
vinaydes wants to merge 24 commits into rapidsai:main from vinaydes:fix-mg-kmeans-seeding
Changes from 23 commits
Commits
2de9c50 Save point in debugging (vinaydes)
46d7ae0 Adding the reproducer file provided in the bug (vinaydes)
ffd9a4c Adding reproducer builder (vinaydes)
9b00642 Extracted weights and wrote a RAFT discrete() reproducer (vinaydes)
c9f281b Compare discrete result with weights array (vinaydes)
c2e801c Some more debugging (vinaydes)
5207d9c Correctly seeding at multiple places (vinaydes)
584fd32 Randomized seeeding (vinaydes)
5ce4d7e Removing debug code (vinaydes)
6cf0df0 Removing trailing spaces (vinaydes)
4fa2009 Removing debug prints (vinaydes)
aa1db02 Removing debug code (vinaydes)
cd3949a Undoing cmake change (vinaydes)
ffeaef2 Merge branch 'main' into kmeans_mg_rng_bug (vinaydes)
635ccac Removing irrelevant changes (vinaydes)
c0a8705 Removing irrelevant changes 2 (vinaydes)
59cf6e5 Removing irrelevant changes 3 (vinaydes)
33649f7 Removing irrelevant changes 4 (vinaydes)
2761b82 Correcting the name of the rng type (vinaydes)
2dbb610 Merge branch 'main' into kmeans_mg_rng_bug (vinaydes)
5c57366 Merge branch 'main' into kmeans_mg_rng_bug (vinaydes)
abff264 Removing debuggin files (vinaydes)
be7bac1 Removing debug print (vinaydes)
5fb247c Merge branch 'main' into fix-mg-kmeans-seeding (aamijar)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -58,6 +58,7 @@ static cuvs::cluster::kmeans::params default_params; | |
| template <typename DataT, typename IndexT> | ||
| void initRandom(const raft::resources& handle, | ||
| const cuvs::cluster::kmeans::params& params, | ||
| std::mt19937_64& gen_64, | ||
| raft::device_matrix_view<const DataT, IndexT> X, | ||
| raft::device_matrix_view<DataT, IndexT> centroids) | ||
| { | ||
|
|
@@ -96,8 +97,9 @@ void initRandom(const raft::resources& handle, | |
| auto centroidsSampledInRank = | ||
| raft::make_device_matrix<DataT, IndexT>(handle, nCentroidsSampledInRank, n_features); | ||
|
|
||
| uint64_t gpu_seed = gen_64(); | ||
| cuvs::cluster::kmeans::shuffle_and_gather( | ||
| handle, X, centroidsSampledInRank.view(), nCentroidsSampledInRank, params.rng_state.seed); | ||
| handle, X, centroidsSampledInRank.view(), nCentroidsSampledInRank, gpu_seed); | ||
|
|
||
| std::vector<size_t> displs(n_ranks); | ||
| std::exclusive_scan(nCentroidsElementsToReceiveFromRank.begin(), | ||
|
|
@@ -130,6 +132,7 @@ void initRandom(const raft::resources& handle, | |
| template <typename DataT, typename IndexT> | ||
| void initKMeansPlusPlus(const raft::resources& handle, | ||
| const cuvs::cluster::kmeans::params& params, | ||
| std::mt19937_64& gen_64, | ||
| raft::device_matrix_view<const DataT, IndexT> X, | ||
| raft::device_matrix_view<DataT, IndexT> centroidsRawData, | ||
| rmm::device_uvector<char>& workspace) | ||
|
|
@@ -144,7 +147,6 @@ void initKMeansPlusPlus(const raft::resources& handle, | |
| auto n_clusters = params.n_clusters; | ||
| auto metric = params.metric; | ||
|
|
||
| raft::random::RngState rng(params.rng_state.seed, raft::random::GeneratorType::GenPhilox); | ||
|
|
||
| // <<<< Step-1 >>> : C <- sample a point uniformly at random from X | ||
| // 1.1 - Select a rank r' at random from the available n_rank ranks with a | ||
|
|
@@ -157,9 +159,8 @@ void initKMeansPlusPlus(const raft::resources& handle, | |
| // Choose rp on rank 0 and broadcast to all ranks to guarantee agreement | ||
| int rp = 0; | ||
| if (my_rank == KMEANS_COMM_ROOT) { | ||
| std::mt19937 gen(params.rng_state.seed); | ||
| std::uniform_int_distribution<> dis(0, n_rank - 1); | ||
| rp = dis(gen); | ||
| rp = dis(gen_64); | ||
| } | ||
| { | ||
| rmm::device_scalar<int> rp_d(stream); | ||
|
|
@@ -182,10 +183,9 @@ void initKMeansPlusPlus(const raft::resources& handle, | |
| // 1.2 - Rank r' samples a point uniformly at random from the local dataset | ||
| // X which will be used as the initial centroid for kmeans++ | ||
| if (my_rank == rp) { | ||
| std::mt19937 gen(params.rng_state.seed); | ||
| std::uniform_int_distribution<> dis(0, n_samples - 1); | ||
|
|
||
| int cIdx = dis(gen); | ||
| int cIdx = dis(gen_64); | ||
| auto centroidsView = raft::make_device_matrix_view<const DataT, IndexT>( | ||
| X.data_handle() + cIdx * n_features, 1, n_features); | ||
|
|
||
|
|
@@ -316,6 +316,9 @@ void initKMeansPlusPlus(const raft::resources& handle, | |
|
|
||
| // <<<< Step-4 >>> : Sample each point x in X independently and identify new | ||
| // potentialCentroids | ||
| uint64_t gpu_seed; | ||
| gpu_seed = gen_64(); | ||
| raft::random::RngState rng(gpu_seed, params.rng_state.type); | ||
| raft::random::uniform( | ||
| handle, rng, uniformRands.data_handle(), uniformRands.extent(0), (DataT)0, (DataT)1); | ||
| cuvs::cluster::kmeans::SamplingOp<DataT, IndexT> select_op(psi, | ||
|
|
@@ -404,16 +407,17 @@ void initKMeansPlusPlus(const raft::resources& handle, | |
| // seed they should generate the same potentialCentroids | ||
| auto const_centroids = raft::make_device_matrix_view<const DataT, IndexT>( | ||
| potentialCentroids.data_handle(), potentialCentroids.extent(0), potentialCentroids.extent(1)); | ||
| auto params_copy = params; | ||
| params_copy.rng_state.seed = gen_64(); | ||
| cuvs::cluster::kmeans::init_plus_plus( | ||
| handle, params, const_centroids, centroidsRawData, workspace); | ||
| handle, params_copy, const_centroids, centroidsRawData, workspace); | ||
|
|
||
| auto inertia = raft::make_host_scalar<DataT>(0); | ||
| auto n_iter = raft::make_host_scalar<IndexT>(0); | ||
| auto weight_view = | ||
| raft::make_device_vector_view<const DataT, IndexT>(weight.data_handle(), weight.extent(0)); | ||
| cuvs::cluster::kmeans::params params_copy = params; | ||
| params_copy.rng_state = default_params.rng_state; | ||
|
|
||
| // Update the seed one more time | ||
| params_copy.rng_state.seed = gen_64(); | ||
| cuvs::cluster::kmeans::fit_main<DataT, IndexT>(handle, | ||
| params_copy, | ||
| const_centroids, | ||
|
|
@@ -436,10 +440,10 @@ void initKMeansPlusPlus(const raft::resources& handle, | |
|
|
||
| // generate `n_random_clusters` centroids | ||
| cuvs::cluster::kmeans::params rand_params = params; | ||
| rand_params.rng_state = default_params.rng_state; | ||
| rand_params.rng_state.seed = gen_64(); | ||
|
Contributor (review comment):
Could be misleading to set this
||
| rand_params.init = cuvs::cluster::kmeans::params::InitMethod::Random; | ||
| rand_params.n_clusters = n_random_clusters; | ||
| initRandom(handle, rand_params, X, centroidsRawData); | ||
| initRandom(handle, rand_params, gen_64, X, centroidsRawData); | ||
|
|
||
| // copy centroids generated during kmeans|| iteration to the buffer | ||
| raft::copy( | ||
|
|
@@ -514,6 +518,11 @@ void fit(const raft::resources& handle, | |
| auto n_clusters = params.n_clusters; | ||
| auto metric = params.metric; | ||
|
|
||
| const int my_rank = comm.get_rank(); | ||
| const int n_ranks = comm.get_size(); | ||
|
|
||
| std::mt19937_64 gen_64(params.rng_state.seed + (uint64_t(my_rank) << 32)); | ||
|
|
||
| auto weight = raft::make_device_vector<DataT, IndexT>(handle, n_samples); | ||
| if (sample_weight) { | ||
| raft::copy(handle, weight.view(), sample_weight.value()); | ||
|
|
@@ -529,11 +538,11 @@ void fit(const raft::resources& handle, | |
| CUVS_LOG_KMEANS(handle, | ||
| "KMeans.fit: initialize cluster centers by randomly choosing from the " | ||
| "input data.\n"); | ||
| initRandom<DataT, IndexT>(handle, params, X, centroids); | ||
| initRandom<DataT, IndexT>(handle, params, gen_64, X, centroids); | ||
| } else if (params.init == cuvs::cluster::kmeans::params::InitMethod::KMeansPlusPlus) { | ||
| // default method to initialize is kmeans++ | ||
| CUVS_LOG_KMEANS(handle, "KMeans.fit: initialize cluster centers using k-means++ algorithm.\n"); | ||
| initKMeansPlusPlus<DataT, IndexT>(handle, params, X, centroids, workspace); | ||
| initKMeansPlusPlus<DataT, IndexT>(handle, params, gen_64, X, centroids, workspace); | ||
| } else if (params.init == cuvs::cluster::kmeans::params::InitMethod::Array) { | ||
| CUVS_LOG_KMEANS(handle, | ||
| "KMeans.fit: initialize cluster centers from the ndarray array input " | ||
|
|
||
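The seeding scheme the diff introduces in `fit` (one host-side `std::mt19937_64` per rank, seeded with `params.rng_state.seed + (uint64_t(my_rank) << 32)`, from which later steps draw fresh seeds via `gen_64()`) can be sketched in isolation. This is a minimal host-only sketch; the helper names `make_rank_generator` and `draw_seeds` are hypothetical and do not exist in cuVS.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of cuVS): build the host generator for a rank
// the same way the diff does, by adding the rank shifted into the upper 32 bits.
inline std::mt19937_64 make_rank_generator(std::uint64_t base_seed, int rank)
{
  assert(rank >= 0);
  return std::mt19937_64(base_seed + (static_cast<std::uint64_t>(rank) << 32));
}

// Draw `n` successive 64-bit seeds, mirroring the repeated gen_64() calls in the diff.
inline std::vector<std::uint64_t> draw_seeds(std::mt19937_64& gen, int n)
{
  assert(n >= 0);
  std::vector<std::uint64_t> seeds;
  seeds.reserve(static_cast<std::size_t>(n));
  for (int i = 0; i < n; ++i) { seeds.push_back(gen()); }
  return seeds;
}
```

With this scheme a given `(base_seed, rank)` pair reproduces the same sequence of seeds on every run, while different ranks draw different sequences. That is the intent for per-rank sampling steps such as `shuffle_and_gather`, but not for steps that every rank must compute identically.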
This is an issue. When `initKMeansPlusPlus` oversamples (`potentialCentroids.extent(0) > n_clusters`), a mini single-GPU KMeans is run locally on each rank to reduce the candidates down to `n_clusters`. Since `potentialCentroids` and `weight` are already identical across ranks (via prior `allgatherv`/`allreduce`), and no communication happens after this reduction, all ranks must use the same RNG seed to produce identical results. Currently, `init_plus_plus` is given a per-rank-divergent seed from `gen_64()`, which breaks this invariant.
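The invariant described in this comment can be illustrated with a small host-only sketch. It assumes a stand-in for the local reduction (`reduce_candidates`, a hypothetical function that picks `k` candidates with a seeded partial Fisher-Yates shuffle); the real step runs `init_plus_plus` and `fit_main` on the GPU, so this only shows why identical inputs plus different seeds lead to different outputs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Identical candidate ids on every rank, as after the allgatherv/allreduce.
inline std::vector<int> make_candidates(int n)
{
  assert(n >= 0);
  std::vector<int> ids(static_cast<std::size_t>(n));
  std::iota(ids.begin(), ids.end(), 0);
  return ids;
}

// Hypothetical stand-in for the local recluster step: select k of the
// candidates with a partial Fisher-Yates shuffle driven only by `seed`.
inline std::vector<int> reduce_candidates(std::vector<int> candidates,
                                          std::size_t k,
                                          std::uint64_t seed)
{
  assert(k <= candidates.size());
  std::mt19937_64 gen(seed);
  for (std::size_t i = 0; i < k; ++i) {
    // Choose an index in [i, size) and move that candidate into slot i.
    const std::size_t j = i + static_cast<std::size_t>(gen() % (candidates.size() - i));
    std::swap(candidates[i], candidates[j]);
  }
  candidates.resize(k);
  return candidates;
}
```

Two ranks calling `reduce_candidates` with the same candidates and the same seed end up with the same selection. With a seed that differs per rank, the selections will almost certainly differ, and since no communication follows this step, nothing would reconcile them.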
Could we generate a seed on rank 0 and broadcast it to the other ranks? Would that be an acceptable solution? What implications would a sync here have for load balancing?
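The proposal above can be sketched as follows. This is an in-process simulation only: `fake_bcast` and `shared_recluster_seed` are hypothetical names, and the loop over ranks stands in for separate processes. A real implementation would use the communicator's broadcast, in the same way the existing code broadcasts `rp` from rank 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical in-process stand-in for a collective broadcast: every slot of
// `per_rank` receives the root's value. This is not the raft::comms API.
inline void fake_bcast(std::vector<std::uint64_t>& per_rank, std::size_t root)
{
  assert(root < per_rank.size());
  const std::uint64_t root_value = per_rank[root];
  for (auto& v : per_rank) { v = root_value; }
}

// Each simulated rank draws from its own generator, so all generators advance
// by the same number of calls, but only the root's draw survives the broadcast.
inline std::vector<std::uint64_t> shared_recluster_seed(std::uint64_t base_seed, int n_ranks)
{
  assert(n_ranks > 0);
  std::vector<std::uint64_t> seeds(static_cast<std::size_t>(n_ranks));
  for (int rank = 0; rank < n_ranks; ++rank) {
    std::mt19937_64 gen_64(base_seed + (static_cast<std::uint64_t>(rank) << 32));
    seeds[static_cast<std::size_t>(rank)] = gen_64();  // per-rank-divergent draw
  }
  fake_bcast(seeds, 0);  // rank 0's seed replaces the others
  return seeds;
}
```

After the broadcast every rank holds rank 0's seed, so a seeded local step run on identical inputs would be deterministic across ranks. The cost is one extra collective on a single 64-bit value.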
At this step the workers have just completed the allreduce operation, so broadcasting a seed should be relatively cheap.
I think this solution could work. That said, since earlier steps have already introduced some randomness, we could in theory perform this recluster step with a constant seed.
It would also be great to double-check that everything works as expected before we merge the PR, in particular that the KMeans algorithm actually starts from the same centroid initialization on all workers, whatever the init mode chosen and whether or not there is a recluster step.
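One way to run the check requested above is to copy each worker's initial centroids to the host and compare them against rank 0. The sketch below assumes the buffers have already been gathered into one vector per rank (row-major, `n_clusters x n_features`); `first_divergent_rank` is a hypothetical helper, not an existing cuVS test utility.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: return the first rank whose initial centroids differ
// from rank 0's, or -1 if every rank holds exactly the same values.
inline int first_divergent_rank(const std::vector<std::vector<float>>& centroids_per_rank)
{
  assert(!centroids_per_rank.empty());
  const auto& reference = centroids_per_rank.front();
  for (std::size_t rank = 1; rank < centroids_per_rank.size(); ++rank) {
    if (centroids_per_rank[rank] != reference) { return static_cast<int>(rank); }
  }
  return -1;
}
```

The comparison is exact. If the initialization involves floating-point reductions whose order can vary between runs, a tolerance-based comparison may be more appropriate. The check would need to be repeated for each init mode, with and without the recluster step.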