Skip to content

UCS/CONFIG: Support numeric ranges for devices#11340

Open
guy-ealey-morag wants to merge 14 commits intoopenucx:masterfrom
guy-ealey-morag:support-device-ranges-refactor
Open

UCS/CONFIG: Support numeric ranges for devices#11340
guy-ealey-morag wants to merge 14 commits intoopenucx:masterfrom
guy-ealey-morag:support-device-ranges-refactor

Conversation

@guy-ealey-morag
Copy link
Copy Markdown
Contributor

@guy-ealey-morag guy-ealey-morag commented Apr 14, 2026

What?

Add support numeric ranges for devices (e.g.. mlx5_[0-2]) with UCX_*_DEVICES environment variables (NET_DEVICES, SHM_DEVICES, ACC_DEVICES, SELF_DEVICES).

Why?

This makes it easier to specify multiple related devices easily, for example:
UCX_NET_DEVICES=mlx5_[1-3] instead of UCX_NET_DEVICES=mlx5_1,mlx5_2,mlx5_3

How?

Added a new config type UCS_CONFIG_TYPE_ALLOW_LIST_WITH_RANGES that detects the numeric range pattern in the list and expands it to the different items of the range.


Note:
Range expansion was originally planned to be implemented as a generic api in string and string_buffer (#11215), but due to increasing complexity in the generic implementation, it is put on hold for now and the non-generic design is prioritized instead.

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
@guy-ealey-morag guy-ealey-morag force-pushed the support-device-ranges-refactor branch from 1a6d926 to dcedc34 Compare April 14, 2026 12:49
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Copy link
Copy Markdown
Contributor

@ColinNV ColinNV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the tests pass...

Comment thread src/ucs/config/parser.c
Comment thread test/gtest/ucp/test_ucp_context.cc Outdated
unsigned expected_device_idx)
{
for (unsigned i = 0; i < count; ++i) {
std::string const device_name = build_mlx5_device_name(i);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

East const?

"a[0-A]b", "a[-]b", "[]", "[-1-2-]", "[--]",
"[1-2][3-4]", "[4-8][", "[5-6]]", "][4-5]", "[[0-4]]",
"a[0-4]b[6-8]c", "a[5-2]b", "[10-0]", "ab[100-1]suffix"
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const and "a[-4]", "[-4]", "foo[]]", "[]]", "a-]", "a[-", "a[4-]", "a[4-6" which is not exhaustive but surely sufficient (as is your list).

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Comment thread src/ucs/config/parser.c
Comment on lines +964 to +965
ucs_assertv(delim[0] != '\0', "delimiter cannot be empty");
ucs_assertv(strlen(delim) == 1, "delimiter must be a single character");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to have delim as char instead of const char *?

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This arg was actually only recently added (#11206)

Changing it to char would require putting it into a null-terminated string to be able to use it with strspn, strpbrk, ucs_string_buffer_for_each_token because they take a set of delimiters as argument.

The asserts are here to make sure there is a single delimiter given because we use it to build the output.

Let me know if you still think it's worth changing it.

Comment thread src/ucs/config/parser.c Outdated
ucs_free(str_dup);
return 0;
UCS_INIT_ONCE(&regex_init) {
ret = regcomp(&range_regex,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call regfree somewhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the regex object to the file scope and added it to the cleanup logic

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Comment thread src/ucs/config/parser.c Outdated
"\\[([0-9]+)-([0-9]+)\\]" /* range */
"([^][]*)$" /* suffix */,
REG_EXTENDED);
ucs_assertv(ret == 0, "failed to compile range regex: %d", ret);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always or fatal?

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect it to never fail if it passes in CI, but I changed it to a fatal error for robustness

Comment thread src/ucs/config/parser.c Outdated
pmatch[0].rm_so = 0;
pmatch[0].rm_eo = (regoff_t)token_len;

ret = regexec(&ucs_config_range_regex, token, 5, pmatch, REG_STARTEND);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucs_static_array_size() for all 5

Comment thread src/ucs/config/parser.c
break;
}

count_total += ucs_config_array_expand_range(strb, p, next - p, delim,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make any truncation (reaching max_elements) logged, to help quickly understanding over/mis use?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added warning log with matching tests

Comment thread src/ucs/config/parser.c Outdated

snprintf(buf, max,
"comma-separated list (use \"all\" for including all items, "
"\'^\' for negation, or \'prefix[start-end]suffix\' for ranges) "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe reverse order like: use 'all' for all items, prefic[s-e]suffix for ranges, start with ^ for negation?

Comment thread src/ucs/config/parser.c
return 1;
}

static void ucs_config_array_expand_ranges(ucs_string_buffer_t *strb,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some systems have: rocep124s0, rocep193s0, rocep125s0, rocep194s0. in that case worth noting that expansion could lead to needing many entries. rocep[124-193]s0 for instance.

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the best we can do right now is rocep[124-125]s0,rocep[193-194]s0.
If it's very common we can think of extending this feature to support multiple ranges,
something like rocep[123-124,193-194]s0.
What do you think?

Comment thread src/ucs/config/parser.c Outdated
ucs_assertv(strlen(delim) == 1, "delimiter must be a single character");

if (expand_ranges) {
ucs_config_array_expand_ranges(&strb, buf, delim, UCS_CONFIG_ARRAY_MAX);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my understanding, after expansion, does it lead to output logs stating that specified devices are unknown, in case they would not be there?

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's one of the requirements to keep these warnings for ranges as well.
I updated the test test_nonexistent_device_warning to include a range as well

UCS_TEST_P(test_ucp_devices_config_mlx5, duplicate_device_warning_two_ranges)
{
test_duplicate_device_warning({"mlx5_0:1", "mlx5_1:1", "mlx5_2:1"},
"mlx5_[0-1],mlx5_[1-2]", "mlx5_1");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my understanding does range support mlx5_[0-1]:2? if yes could we add test if not already there?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but do you know if these devices exist in the CI setup?

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
…sing

Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Signed-off-by: Guy Ealey Morag <gealeymorag@nvidia.com>
Comment thread src/ucs/config/parser.c
const char *input, const char *delim,
size_t max_elements)
{
size_t count_total = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minotr: num_elements would be more intuitive to be compared against max_elements?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants