Dataops quick tour #2169

Open
jeromedockes wants to merge 18 commits into skrub-data:main from jeromedockes:dataops-quick-tour

@jeromedockes

Member

introductory notebook

comes after #2162

@jeromedockes added the documentation and data_ops labels Jun 15, 2026

Comment on lines +231 to +242
encoder = skrub.choose_from(
    {"lse": skrub.StringEncoder(), "minhash": skrub.MinHashEncoder()}, name="encoder"
)
pred = employee_data.skb.apply(
    skrub.TableVectorizer(high_cardinality=encoder)
).skb.apply(
    HistGradientBoostingRegressor(
        learning_rate=skrub.choose_float(0.01, 0.7, log=True, name="learning_rate")
    ),
    y=salary,
)
print(pred.skb.describe_param_grid())

Hey Jerome!
First of all, this as a whole is way clearer and more demonstrative than what we had before. For this specific part, what do you think about also adding a choose_from with 2 estimators (like HGB and Ridge, or any others you like) to show that this is possible as well? I know it might be too soon to show that option, but when I was learning about it, it really impressed me, so it would have been nice to see it from the start.

Member Author

Thanks @moujanrastgoo! You are right that this is important to show. I tried to show it by having a choice of encoder between StringEncoder and MinHashEncoder. Do you think it is enough to add a sentence to highlight that? I would like to keep the example simple. Also, while it is useful, tuning the choice of the final estimator could seem a bit odd, as one might consider those to be 2 different pipelines, evaluate them separately, and make the decision manually 🤔

I see what you mean! Yes, I agree, I think adding a sentence to highlight it, along with the link to the section explaining nested choices in the user guide is enough. Thank you very much!

@jeromedockes marked this pull request as ready for review June 23, 2026 13:50
Labels

data_ops Something related to the skrub DataOps documentation Add or improve the documentation

2 participants