We will use a single-level JSON for algorithm selection including device-specific algorithms. Remove the collective ADI for now. We'll add the mechanism of selecting device-level algorithms later. gen_coll.py is updated to skip calling MPID_ collectives. Device collective CVARs are removed.
We will add the mechanism of selecting device-layer algorithms later.
Temporarily comment out the composition code that calls netmod/shm collectives since we will remove these apis next. Some NULL composition functions are removed.
We will replace the device-algorithm selection later at the MPIR layer.
The auto selection should take care of restrictions. Algorithms should raise an error rather than fall back. If the user uses a CVAR to select a specific algorithm, we should check restrictions before jumping to the algorithm. We will design a common fallback handling there.
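The check-before-jump idea can be sketched as follows. This is a minimal illustration, not the MPICH implementation: `coll_sig_t`, `algo_entry_t`, `requires_pow2`, `select_forced_algo`, and the status codes are all hypothetical names. The point is that a user-forced algorithm is validated in one place and a distinct status is returned, so the caller can apply whatever common handling is eventually designed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not the MPICH types. */
typedef struct {
    int comm_size;
} coll_sig_t;

typedef bool (*restriction_fn) (const coll_sig_t * sig);

typedef struct {
    const char *name;
    restriction_fn check;       /* NULL means the algorithm has no restriction */
} algo_entry_t;

enum { SELECT_OK = 0, SELECT_ERR_RESTRICTION = 1 };

/* Example restriction: the algorithm only works on power-of-two comm sizes. */
static bool requires_pow2(const coll_sig_t * sig)
{
    int n = sig->comm_size;
    return n > 0 && (n & (n - 1)) == 0;
}

/* Validate a user-forced algorithm before jumping to it. On a failed
 * restriction, *chosen is left NULL and a distinct status is returned so
 * the caller can handle the failure in a single, common place. */
static int select_forced_algo(const algo_entry_t * algo, const coll_sig_t * sig,
                              const algo_entry_t ** chosen)
{
    *chosen = NULL;
    if (algo->check != NULL && !algo->check(sig)) {
        return SELECT_ERR_RESTRICTION;
    }
    *chosen = algo;
    return SELECT_OK;
}
```

Keeping the restriction function next to the algorithm entry means the same check can serve both the auto selection and the CVAR path.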
Current compositional algorithms call MPIR collectives. We will refactor them later. For now, generate wrapper MPIR functions that call the _impl functions.
Remove the old routines that are now unused.
Add MPIR_init_coll_sig and MPID_init_coll_sig so we can add arbitrary attr bits or additional fields without hacking maint/gen_coll.py.
Provide a simple mechanism for a rank to dump collective algorithm counters. Set MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS to the global rank of the process that should dump. It is undesirable for every process to dump, yet it does not always make sense for rank 0 to dump, especially when we don't always use comm world. Counting happens in the CSEL framework, so internal collectives are not counted when we internally use _fallback algorithms.
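A rough sketch of the rank-gated dump, under stated assumptions: the counter table, algorithm names, and helper functions below are hypothetical, and treating an unset or malformed value as "no rank dumps" (-1) is an assumption rather than something this commit states. Only the CVAR name comes from the commit; reading it with `getenv` relies on MPICH CVARs being settable through environment variables of the same name.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_ALGOS 3             /* hypothetical algorithm table size */

static const char *algo_names[NUM_ALGOS] = {
    "bcast_binomial", "bcast_scatter_ring_allgather", "allreduce_recursive_doubling"
};

static unsigned long algo_counters[NUM_ALGOS];

/* Called from the selection layer each time an algorithm is chosen. */
static void count_algo(int algo_id)
{
    if (algo_id >= 0 && algo_id < NUM_ALGOS) {
        algo_counters[algo_id]++;
    }
}

/* Read the rank that should dump; -1 (assumed default) disables dumping. */
static int dump_rank_from_env(void)
{
    const char *s = getenv("MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS");
    if (s == NULL || *s == '\0') {
        return -1;
    }
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v < 0 || v > INT_MAX) {
        return -1;
    }
    return (int) v;
}

/* Print non-zero counters only on the designated rank.
 * Returns the number of lines written. */
static int dump_counters(FILE * out, int my_rank, int dump_rank)
{
    if (dump_rank < 0 || my_rank != dump_rank) {
        return 0;
    }
    int lines = 0;
    for (int i = 0; i < NUM_ALGOS; i++) {
        if (algo_counters[i] > 0) {
            fprintf(out, "%s: %lu\n", algo_names[i], algo_counters[i]);
            lines++;
        }
    }
    return lines;
}
```

At finalize time a process would call something like `dump_counters(stderr, my_global_rank, dump_rank_from_env())`.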
A universal nb algorithm for blocking collectives.
They are replaced by MPIR_Coll_nb.
In coll_algorithms.txt, add an "inline" attribute to skip adding a prototype for the corresponding algorithm function, since it is inlined in the headers. Add "func_name" to directly specify the algorithm function name. Add "macro_guard" to specify a preprocessor condition for the algorithm function. For example, the ch4 posix algorithm function needs to be protected by "#if defined(MPIDI_CH4_SHM_POSIX)" (to be defined).
Add conditional condition - the condition function can only be called inside a preprocessor macro guard. We need to generate another header file, coll_autogen.h, that is loaded after mpidpost.h. "coll_algos.h" goes into mpir_coll.h, which is included between mpidpre.h and mpidpost.h. Refactor a bit so all the condition-parsing logic is wrapped in functions such as get_conditon_name, get_condition_func, etc., and they are defined together.
Sometimes we may want the restriction check and the condition check to behave differently. For example, an algorithm like release_gather normally gets selected only after the user calls the collective a certain number of times. But if the user selects the algorithm by CVAR, it does not make sense to do this repeat check in the restriction check.
Rather than add individual boolean flags, use bit mask "flags" instead. It is easier to make sure we zero-initialize all the flags that way.
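As an illustration of the bit-mask approach, the sketch below zero-initializes a signature struct in one `memset` and uses one flag bit to distinguish a restriction check from a normal condition check. The struct layout, flag names, and the call-count threshold are all hypothetical and chosen only for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits; the real names may differ. */
#define COLL_FLAG_RESTRICTION_CHECK (1u << 0)   /* algorithm was forced by CVAR */
#define COLL_FLAG_INTERNAL          (1u << 1)   /* collective issued internally */

/* Hypothetical, simplified signature struct. */
typedef struct {
    int coll_type;
    int call_count;             /* how many times this collective was called */
    uint32_t flags;             /* all boolean attributes live in one word */
} coll_sig_t;

/* One memset clears every flag at once, including any added later. */
static void init_coll_sig(coll_sig_t * sig, int coll_type)
{
    memset(sig, 0, sizeof(*sig));
    sig->coll_type = coll_type;
}

#define RELEASE_GATHER_MIN_CALLS 5      /* hypothetical threshold */

/* Condition check for a release_gather-style algorithm: skip the
 * repeat-count requirement when this is only a restriction check. */
static bool check_release_gather(const coll_sig_t * sig)
{
    if (sig->flags & COLL_FLAG_RESTRICTION_CHECK) {
        return true;
    }
    return sig->call_count >= RELEASE_GATHER_MIN_CALLS;
}
```

Adding a new attribute then only requires a new bit definition; the initializer does not need to change.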
Enable CVARs and JSONs to select ch4-posix layer release_gather algorithms. Select MPIDI_POSIX_mpi_bcast_release_gather if it passes the MPIDI_CH4_release_gather condition check, which only passes if comm is a posix intranode comm.
Extend the previous commit to activate release_gather algorithm for reduce, allreduce, and barrier.
In MPIDI_POSIX_check_release_gather we check comm's hierarchical fields to ensure the comm is a node-local comm, i.e. comm->num_external is 1. Set these fields for subcomms so we can run release_gather checks on these subcomms.
It is not a fatal condition if a comm is missing hierarchical information, but it is likely an oversight. Thus we add an assertion so we can catch such cases during CI testing. In production, with assertions turned off, it is okay to just return false in a hierarchical condition check.
Treat the fail path in check_hierarchy as if it's no_local. This simplifies lower-layer condition checks since we can always directly check the hierarchical info.
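One way to read the hierarchy commits above is sketched below. It assumes that "no_local" means every process is treated as its own node (`num_local` is 1 and `num_external` equals the comm size); that interpretation, the struct, and the helper names are assumptions made for illustration. Only the `num_external` field name and the "node-local means `num_external` is 1" rule come from the commit text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a communicator's hierarchical fields. */
typedef struct {
    int size;
    int rank;
    int num_local;              /* ranks sharing this rank's node */
    int num_external;           /* number of nodes spanned by the comm */
    int local_rank;
    int external_rank;
} comm_hier_t;

/* Assumed meaning of "no_local": every rank is its own node. */
static void set_no_local(comm_hier_t * c)
{
    c->num_local = 1;
    c->local_rank = 0;
    c->num_external = c->size;
    c->external_rank = c->rank;
}

/* discovered == NULL models the fail path of topology discovery. Either
 * way the fields end up populated, so lower layers can read them directly. */
static void fill_hierarchy(comm_hier_t * c, const comm_hier_t * discovered)
{
    if (discovered == NULL) {
        set_no_local(c);
        return;
    }
    c->num_local = discovered->num_local;
    c->num_external = discovered->num_external;
    c->local_rank = discovered->local_rank;
    c->external_rank = discovered->external_rank;
}

/* Node-local check in the spirit of the release_gather condition. */
static bool is_node_local(const comm_hier_t * c)
{
    return c->num_external == 1;
}
```

With this shape, a failed discovery on a multi-process comm simply makes `is_node_local` return false, so the shared-memory algorithm is not selected.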
Remove MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE. It is now replaced with MPIR_CVAR_COLL_SELECTION_JSON_FILE. We could have reused the same CVAR name, but since we altered the JSON syntax, using a different name prevents potential confusion.
Parse the json as a list of named subtrees such as:
{
"name=main": {...},
"name=bcast-intra-auto": {...},
...
}
Inside the subtree, we can refer to the named subtree using "call=name".
If the json does not contain named subtrees, treat it as a single tree
with the name "main".
Load src/mpi/coll/coll_selection.json as named subtrees. Add MPIR_Coll_run_tree, which runs the selection on a subtree. Replace MPIR_Coll_auto with MPIR_Coll_json, and add MPIR_Coll_run_tree(csel_tree_auto, coll_sig) to allow recursive algorithms such as compositional algorithms. csel_tree_auto will fall back to csel_tree_main if it is not defined in the json file. Similarly, we can easily introduce more predefined subtrees later, e.g. bcast-intra-auto. In CVAR selection, "auto" should be the default and its value should be 0. Thus it should automatically fall through and run on csel_tree_main.
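The named-subtree lookup and the fall back to "main" can be sketched without a JSON library by treating the top-level keys as plain strings. Everything here is illustrative: `named_tree_t`, `find_tree`, `find_tree_or_main`, and `resolve_call` are hypothetical helpers, and `tree_id` stands in for a parsed subtree. Only the "name=" and "call=" key syntax and the fall back to "main" come from the commit messages.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one top-level JSON entry such as "name=main". */
typedef struct {
    const char *key;            /* key exactly as written in the JSON file */
    int tree_id;                /* stand-in for the parsed subtree */
} named_tree_t;

/* Return the text after prefix, or NULL if s does not start with it. */
static const char *strip_prefix(const char *s, const char *prefix)
{
    size_t n = strlen(prefix);
    return strncmp(s, prefix, n) == 0 ? s + n : NULL;
}

/* Find the subtree whose key is "name=<name>". */
static const named_tree_t *find_tree(const named_tree_t * trees, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        const char *tname = strip_prefix(trees[i].key, "name=");
        if (tname != NULL && strcmp(tname, name) == 0) {
            return &trees[i];
        }
    }
    return NULL;
}

/* Look up a predefined subtree, falling back to "main" if it is absent. */
static const named_tree_t *find_tree_or_main(const named_tree_t * trees, size_t n,
                                             const char *name)
{
    const named_tree_t *t = find_tree(trees, n, name);
    return t != NULL ? t : find_tree(trees, n, "main");
}

/* Resolve a "call=<name>" directive found inside a subtree. */
static const named_tree_t *resolve_call(const named_tree_t * trees, size_t n,
                                        const char *directive)
{
    const char *name = strip_prefix(directive, "call=");
    return name != NULL ? find_tree(trees, n, name) : NULL;
}
```

A recursive run function would call `resolve_call` whenever it reaches a "call=name" node and then continue the selection in the returned subtree.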
Suppress a warning.
test:mpich/ch4/most
Pull Request Description
coll_algorithms.txt catalogs all collective algorithms and conditions
coll_selection.json specifies the decision tree
MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS for debug summary
[skip warnings]
Discussion
Reference: #7544
Also see comments in #7598 and #7666
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.