-
Notifications
You must be signed in to change notification settings - Fork 184
[Cleanup] Combine Batched and Regular KMeans Impl #2015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 40 commits
66d7fd3
0a09e6f
99a5730
a077406
d659875
ec2e8b7
03a6473
42a8d9d
86af2fa
d4e4e2c
0819af5
e0f079c
c2f7390
b9c3102
e3956c1
986d78a
a8e1d26
384d054
455b286
5462809
6ba759c
e76eaac
afbefdf
e62a63c
e4f08bf
6e4a8f0
4a8a85c
bbf2a9f
410092c
c515c1e
e8e63ab
30c457c
ab96623
269f23c
80a22ca
ac06b05
0569340
8cac63a
f6df4ae
9fc74b1
dec3dc4
0d030a2
b1c034e
a482495
8ecfdc1
1e1525e
ec22e07
d2e410d
b791c38
a05a006
73293cf
880c7b9
e2035ec
e28c200
55bbdad
9a9b8ee
a800b27
3db8582
c048352
2f968f8
affe85a
c6dea64
7dfab3e
7a383da
ce6c4b5
5a06a44
35cbf50
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -39,6 +39,8 @@ typedef enum { | |
|
|
||
| /** | ||
| * @brief Hyper-parameters for the kmeans algorithm | ||
| * NB: The inertia_check field is kept for ABI compatibility. Removed in cuvsKMeansParams_v1. | ||
| * CalVer for the replacement: 26.08 | ||
| */ | ||
| struct cuvsKMeansParams { | ||
| cuvsDistanceType metric; | ||
|
|
@@ -91,7 +93,7 @@ struct cuvsKMeansParams { | |
| */ | ||
| int batch_centroids; | ||
|
|
||
| /** Check inertia during iterations for early convergence. */ | ||
| /** Deprecated, ignored. Kept for ABI compatibility. */ | ||
| bool inertia_check; | ||
|
|
||
| /** | ||
|
|
@@ -108,7 +110,92 @@ struct cuvsKMeansParams { | |
| * Number of samples to process per GPU batch for the batched (host-data) API. | ||
| * When set to 0, defaults to n_samples (process all at once). | ||
| */ | ||
| int64_t streaming_batch_size; | ||
| int64_t streaming_batch_size; | ||
|
|
||
| /** | ||
| * Number of samples to draw for KMeansPlusPlus initialization. | ||
| * When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data, | ||
| * or n_samples for device data. | ||
| */ | ||
| int64_t init_size; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: adding In practice cVS consumers go through
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes I was just following the instructions in the doc that we have as a guide for ABI stability. And we also have a CI check for the same, which will fail if there is a breaking change. I also added |
||
| }; | ||
|
|
||
| /** | ||
| * @brief Hyper-parameters for the kmeans algorithm | ||
| */ | ||
| struct cuvsKMeansParams_v1 { | ||
|
coderabbitai[bot] marked this conversation as resolved.
Outdated
tarang-jain marked this conversation as resolved.
Outdated
|
||
| cuvsDistanceType metric; | ||
|
|
||
| /** | ||
| * The number of clusters to form as well as the number of centroids to generate (default:8). | ||
| */ | ||
| int n_clusters; | ||
|
|
||
| /** | ||
| * Method for initialization, defaults to k-means++: | ||
| * - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm | ||
| * to select the initial cluster centers. | ||
| * - cuvsKMeansInitMethod::Random (random): Choose 'n_clusters' observations (rows) at | ||
| * random from the input data for the initial centroids. | ||
| * - cuvsKMeansInitMethod::Array (ndarray): Use 'centroids' as initial cluster centers. | ||
| */ | ||
| cuvsKMeansInitMethod init; | ||
|
|
||
| /** | ||
| * Maximum number of iterations of the k-means algorithm for a single run. | ||
| */ | ||
| int max_iter; | ||
|
|
||
| /** | ||
| * Relative tolerance with regards to inertia to declare convergence. | ||
| */ | ||
| double tol; | ||
|
|
||
| /** | ||
| * Number of instance k-means algorithm will be run with different seeds. | ||
| */ | ||
| int n_init; | ||
|
|
||
| /** | ||
| * Oversampling factor for use in the k-means|| algorithm | ||
| */ | ||
| double oversampling_factor; | ||
|
|
||
| /** | ||
| * batch_samples and batch_centroids are used to tile 1NN computation which is | ||
| * useful to optimize/control the memory footprint | ||
| * Default tile is [batch_samples x n_clusters] i.e. when batch_centroids is 0 | ||
| * then don't tile the centroids | ||
| */ | ||
| int batch_samples; | ||
|
|
||
| /** | ||
| * if 0 then batch_centroids = n_clusters | ||
| */ | ||
| int batch_centroids; | ||
|
|
||
| /** | ||
| * Whether to use hierarchical (balanced) kmeans or not | ||
| */ | ||
| bool hierarchical; | ||
|
|
||
| /** | ||
| * For hierarchical k-means , defines the number of training iterations | ||
| */ | ||
| int hierarchical_n_iters; | ||
|
|
||
| /** | ||
| * Number of samples to process per GPU batch for the batched (host-data) API. | ||
| * When set to 0, defaults to n_samples (process all at once). | ||
| */ | ||
| int64_t streaming_batch_size; | ||
|
|
||
| /** | ||
| * Number of samples to draw for KMeansPlusPlus initialization. | ||
| * When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data, | ||
| * or n_samples for device data. | ||
| */ | ||
| int64_t init_size; | ||
| }; | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
|
|
||
| typedef struct cuvsKMeansParams* cuvsKMeansParams_t; | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -113,9 +113,18 @@ struct params : base_params { | |
| int batch_centroids = 0; | ||
|
|
||
| /** | ||
| * If true, check inertia during iterations for early convergence. | ||
| * Number of samples to randomly draw for the KMeansPlusPlus initialization | ||
| * step. A random subset of this size is used for centroid seeding. | ||
| * | ||
| * Only applies when dataset is on host; for device data the full dataset | ||
| * is always used for seeding and this parameter is ignored. | ||
| * | ||
| * When set to 0 (default) with host data uses `min(3 * n_clusters, n_samples)` | ||
| * as a default. | ||
| * | ||
| * Default: 0. | ||
| */ | ||
| bool inertia_check = false; | ||
| int64_t init_size = 0; | ||
|
Comment on lines
115
to
+127
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Breaking removal of The Please either:
As per coding guidelines: "API changes flagged for docs/ updates" and "Breaking changes require deprecation warnings and migration guide updates". 🤖 Prompt for AI Agents
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I soflty agree with this rabbit comment, would be worth to add a not in the PR description and code, C++ callers using the C++ API directly will fail to compile, not warn. That's a real source-API break worth a release-notes line at least There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Concretely, I'd suggest three things:
/**
* `@deprecated` since 26.06 — ignored. Inertia-based convergence now always runs.
* Use `init_size` to control KMeans++ seeding sample count.
* Will be removed in a future release.
*/
[[deprecated("inertia_check is ignored since 26.06; inertia-based convergence always runs")]]
bool inertia_check = false;
Would you like me to plan the code changes for items 1 and 2 above? 🧠 Learnings used |
||
|
|
||
| /** | ||
| * Number of samples to process per GPU batch when fitting with host data. | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.