
host: Add AVX2 support for uhd::convert #789

Open

anilgurses wants to merge 3 commits into EttusResearch:master from anilgurses:avx2-support

Conversation

@anilgurses

Pull Request Details

Description

This adds AVX2 support to uhd::convert, which was previously limited to SSE2. It improves performance for data type conversions.

Related Issue

N/A

Which devices/areas does this affect?

Affects uhd::convert data conversion performance.

Testing Done

Tested using the existing tests written by UHD developers. All of them pass, so no new tests were needed.

Checklist

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project. See CODING.md.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes, and all previous tests pass.
  • I have checked all compat numbers if they need updating (FPGA compat,
    MPM compat, noc_shell, specific RFNoC block, ...)

@github-actions

github-actions bot commented Mar 24, 2025

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@anilgurses
Author

I have read the CLA Document and I hereby sign the CLA

@anilgurses
Author

Hi! Is there anything else needed for this PR?

@mbr0wn
Contributor

mbr0wn commented Nov 4, 2025

Hey @anilgurses, sorry for never responding here. The problem is that AVX2 support is not ubiquitous, so we need a way to deploy it only on demand, something like glibc-style conditional dispatch.

I was also thinking of merging this, but leaving it disabled unless explicitly enabled at compile time (this would not, for example, be the case for .deb files we distribute). But that's also work.
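For reference, a minimal sketch of that dispatch idea (not code from this PR): GCC and Clang's target_clones attribute compiles one clone of a function per listed ISA and emits a glibc IFUNC resolver that selects the matching clone at load time, so a single binary can carry an AVX2 path that only ever executes on CPUs reporting AVX2. The function below is illustrative, not UHD code, and assumes an x86 Linux/glibc toolchain.

#include <cstddef>

// One clone is generated per listed target; the IFUNC resolver picks the
// best clone the running CPU supports, falling back to "default" otherwise.
__attribute__((target_clones("avx2", "default")))
void scale_samples(const float* in, float* out, std::size_t n, float scale)
{
    // The compiler auto-vectorizes this loop separately per clone, so the
    // avx2 clone uses 256-bit registers while the default clone stays portable.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale;
    }
}

A .deb built this way would keep the AVX2 code dormant on older machines instead of requiring a compile-time opt-in.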

########################################################################

# Check for SSE2 support
check_cxx_compiler_flag("-msse2" SSE2_SUPPORTED)
Contributor

All of this assumes the compiling machine has the same arch as the executing machine.

# Check for AVX512 support
check_cxx_compiler_flag("-mavx512" AVX512_SUPPORTED)
if(AVX512_SUPPORTED)
message(STATUS "AVX512 is supported")
Contributor

This means AVX512 is supported by the compiler, not that it's also supported by the CPU.
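The distinction can be probed at runtime; a minimal sketch of checking the executing CPU rather than the compiler, using the GCC/Clang cpuid builtins (illustrative, not code from this PR):

#include <cstdio>

int main()
{
    // __builtin_cpu_supports() asks the CPU that is actually running,
    // via cpuid, independent of which -m flags the binary was built with.
    __builtin_cpu_init();
    std::printf("avx2:    %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    std::printf("avx512f: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}

A build machine can pass the -mavx512f compiler probe while the deployment CPU returns 0 for the runtime check, which is exactly the gap this comment points out.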

@anilgurses
Author

Thanks for the feedback! You are right. Let me check if I can find time to implement on-demand AVX512. I'll update this PR once it's ready.

Copilot AI review requested due to automatic review settings February 3, 2026 05:44
@anilgurses
Author

Sorry for the delay. I've worked on this and implemented runtime dispatch of the SIMD converter functions. I've also added a converter benchmark tool under tests, since I wanted to see how much performance gain each instruction set achieves; the table is below. The baseline for comparison is the compiler-optimized generic converter. It is not surprising that AVX2 performs better on bigger packets. I might implement AVX512 and test that as well.

I've been using the previous version of this code for a year without encountering issues, but I realized I had made a big mistake in SIMD_PRIORITY, which I have now fixed. I ran the benchmarks on a Xeon Gold 6240. Let me know if I am missing something or if I can improve the PR.

================================================================================
Summary: Fastest Converter for Each Type
================================================================================
                         Conversion  Best Priority      ns/sample Speedup vs Gen
--------------------------------------------------------------------------------
             fc32 -> sc12_item32_le     SSE2/SSSE3          0.683          2.13x
                  fc32 -> sc16_chdr           AVX2          0.489          2.20x
             fc32 -> sc16_item32_be           AVX2          0.491          4.05x
             fc32 -> sc16_item32_le           AVX2          0.490          2.05x
              fc32 -> sc8_item32_le           AVX2          0.420          3.20x
                  fc64 -> sc16_chdr           AVX2          0.748          2.23x
             fc64 -> sc16_item32_le           AVX2          0.747          2.44x
             sc12_item32_le -> fc32     SSE2/SSSE3          0.530          2.51x
             sc12_item32_le -> sc16     SSE2/SSSE3          0.331          2.24x
             sc16 -> sc12_item32_le     SSE2/SSSE3          0.390          2.15x
             sc16 -> sc16_item32_be           AVX2          0.288          1.80x
             sc16 -> sc16_item32_le           AVX2          0.287          1.39x
                  sc16_chdr -> fc32     SSE2/SSSE3          0.440          1.00x
                  sc16_chdr -> fc64     SSE2/SSSE3          0.736          1.41x
             sc16_item32_be -> fc32     SSE2/SSSE3          0.442          2.17x
             sc16_item32_be -> sc16     SSE2/SSSE3          0.291          2.36x
             sc16_item32_le -> fc32           AVX2          0.487          1.31x
             sc16_item32_le -> fc64     SSE2/SSSE3          0.737          1.55x
             sc16_item32_le -> sc16     SSE2/SSSE3          0.288          1.37x
              sc8_item32_le -> fc32     SSE2/SSSE3          0.369          2.09x
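
For context on the runtime dispatch described above: the priority constants in convert_common.hpp suggest converters are registered per conversion ID with a priority, and the highest-priority registered converter wins. A hedged sketch of the gating idea, with simplified stand-in types and names (the real registration API in this PR differs):

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Stand-in registry: (conversion id, priority) -> converter function.
using convert_fn = std::function<void(const void*, void*, std::size_t)>;
static std::map<std::pair<std::string, int>, convert_fn> converter_table;

void register_converter(const std::string& id, int priority, convert_fn fn)
{
    converter_table[{id, priority}] = std::move(fn);
}

void register_avx2_converters()
{
    // Skip registration entirely on CPUs without AVX2 (GCC/Clang builtin),
    // so lookup falls back to the SSE2 or generic entry with the
    // next-highest priority.
    if (!__builtin_cpu_supports("avx2")) {
        return;
    }
    register_converter("fc32->sc16_item32_le", /*PRIORITY_SIMD_AVX2=*/4,
        [](const void*, void*, std::size_t) { /* AVX2 kernel here */ });
}

With this shape, a fixed SIMD_PRIORITY ordering only has to be correct relative to what actually got registered on the running machine.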

Copilot AI left a comment

Pull request overview

Adds AVX2-backed converter implementations and introduces runtime SIMD feature detection so UHD can select the best available uhd::convert implementation on a given CPU.

Changes:

  • Added runtime CPU SIMD feature detection (SSE2/SSSE3/AVX2/AVX512F) and new SIMD priority levels.
  • Implemented multiple AVX2 converters and updated existing SSE2/SSSE3 converters to register conditionally.
  • Updated build system and tests/examples to account for new priorities and benchmarking.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 14 comments.

File Description
host/tests/convert_test.cpp Updates priority list and changes benchmark test decorator behavior.
host/lib/convert/ssse3_unpack_sc12.cpp Adds runtime SSSE3 check to avoid registering on unsupported CPUs.
host/lib/convert/ssse3_pack_sc12.cpp Adds runtime SSSE3 check to avoid registering on unsupported CPUs.
host/lib/convert/sse2_sc8_to_fc64.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc8_to_fc32.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc16_to_sc16.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc16_to_fc64.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_sc16_to_fc32.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc64_to_sc8.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc64_to_sc16.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc32_to_sc8.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/sse2_fc32_to_sc16.cpp Switches to SSE2 runtime-gated converter declaration macro.
host/lib/convert/simd_features.hpp New header for runtime SIMD detection + logging helpers.
host/lib/convert/convert_impl.cpp Logs SIMD capabilities during converter/item-size registration.
host/lib/convert/convert_common.hpp Adds SIMD converter macros with runtime gating and AVX2/AVX512 priorities.
host/lib/convert/avx2_sc8_to_fc32.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_sc16_to_sc16.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_sc16_to_fc64.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_sc16_to_fc32.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc64_to_sc8.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc64_to_sc16.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc32_to_sc8.cpp New AVX2 converter implementation(s).
host/lib/convert/avx2_fc32_to_sc16.cpp New AVX2 converter implementation(s).
host/lib/convert/CMakeLists.txt Adds compiler-flag-based SIMD build detection and includes AVX2 sources.
host/examples/convert_benchmark.cpp Adds a standalone converter benchmarking example.
host/examples/CMakeLists.txt Builds the new convert_benchmark example.


Comment on lines +114 to +119
// dispatch according to alignment
if ((size_t(input) & 0xf) == 0) {
convert_fc32_1_to_sc8_item32_1_nswap_guts(_)
} else {
convert_fc32_1_to_sc8_item32_1_nswap_guts(u_)
}
Copilot AI Feb 3, 2026

Same issue as above in the LE converter: _mm256_load_ps requires 32-byte alignment, but the guard only checks 16-byte alignment (& 0xf). This can cause crashes on 16B-but-not-32B aligned buffers. Please switch the aligned path to require 32-byte alignment (& 0x1f) or use unaligned loads unconditionally.
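A sketch of the two fixes being suggested, as a standalone helper (assumes AVX compile support; the helper names are illustrative, not the PR's macros):

#include <cstdint>
#include <immintrin.h>

// Option (a): test the low 5 bits, since _mm256_load_ps faults on
// addresses that are not 32-byte aligned; 0xf only guarantees 16 bytes.
static inline bool is_aligned_32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0x1f) == 0;
}

static inline __m256 load8_floats(const float* p)
{
    // Option (b): _mm256_loadu_ps is always safe, and on modern x86 cores
    // costs the same as an aligned load when p happens to be aligned,
    // which is why dropping the aligned/unaligned dispatch entirely also works.
    return is_aligned_32(p) ? _mm256_load_ps(p) : _mm256_loadu_ps(p);
}

The same change applies to each of the & 0xf dispatch sites flagged below.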

Comment on lines +112 to +130
// need to dispatch according to alignment for fastest conversion
switch (size_t(input) & 0xf) {
case 0x0:
// the data is 16-byte aligned, so do the fast processing of the bulk of the
// samples
convert_fc32_1_to_item32_1_bswap_guts(_) break;
case 0x8:
// the first value is 8-byte aligned - process it and prepare the bulk of the
// data for fast conversion
xx_to_item32_sc16<uhd::htonx>(input, output, 1, scale_factor);
i++;
// do faster processing of the remaining samples now that we are 16-byte
// aligned
convert_fc32_1_to_item32_1_bswap_guts(_) break;
default:
// we are not 8 or 16-byte aligned, so do fast processing with the unaligned
// load
convert_fc32_1_to_item32_1_bswap_guts(u_)
}
Copilot AI Feb 3, 2026

Same 16B-vs-32B alignment bug in this converter: the dispatch checks size_t(input) & 0xf, but _mm256_load_ps requires 32-byte alignment when using the aligned-load form. Please update the aligned-path condition to 32-byte alignment (& 0x1f) or always use unaligned loads.

Comment on lines +171 to +189
// need to dispatch according to alignment for fastest conversion
switch (size_t(input) & 0xf) {
case 0x0:
// the data is 16-byte aligned, so do the fast processing of the bulk of the
// samples
convert_fc32_1_to_item32_1_guts(_) break;
case 0x8:
// the first sample is 8-byte aligned - process it to align the remainder of
// the samples to 16-bytes
xx_to_chdr_sc16(input, output, 1, scale_factor);
i++;
// do faster processing of the bulk of the samples now that we are 16-byte
// aligned
convert_fc32_1_to_item32_1_guts(_) break;
default:
// we are not 8 or 16-byte aligned, so do fast processing with the unaligned
// load
convert_fc32_1_to_item32_1_guts(u_)
}
Copilot AI Feb 3, 2026

Same alignment issue here: the aligned-load branch uses _mm256_load_ps but only checks for 16-byte alignment (& 0xf). This should be 32-byte alignment (& 0x1f) to be safe, or the code should use _mm256_loadu_ps for all loads.

Comment on lines +116 to +121
// dispatch according to alignment
if ((size_t(input) & 0xf) == 0) {
convert_fc64_1_to_item32_1_bswap_guts(_)
} else {
convert_fc64_1_to_item32_1_bswap_guts(u_)
}
Copilot AI Feb 3, 2026

Same alignment issue in this converter: _mm256_load_pd requires 32-byte alignment but the guard only checks 16-byte alignment (& 0xf). Please require 32-byte alignment (& 0x1f) for the aligned-load path or switch to _mm256_loadu_pd.

Comment on lines +164 to +170
* Priority hierarchy:
* PRIORITY_EMPTY = -1 (empty/null converter)
* PRIORITY_GENERAL = 0 (generic C++ implementation)
* PRIORITY_TABLE = 1 (table lookup)
* PRIORITY_SIMD = 3 (SSE2/NEON - baseline SIMD)
* PRIORITY_SIMD_AVX2 = 4 (AVX2 - 256-bit SIMD)
* PRIORITY_SIMD_AVX512 = 5 (AVX512 - 512-bit SIMD)
Copilot AI Feb 3, 2026

The priority hierarchy comment hard-codes PRIORITY_SIMD = 3, but on __ARM_NEON__ the code sets PRIORITY_SIMD = 2. This makes the comment misleading and can cause people/tools (like benchmarks) to report incorrect mappings. Please update the comment to reflect the platform-dependent values (or avoid listing the numeric constants).

Suggested change:
- * Priority hierarchy:
- * PRIORITY_EMPTY = -1 (empty/null converter)
- * PRIORITY_GENERAL = 0 (generic C++ implementation)
- * PRIORITY_TABLE = 1 (table lookup)
- * PRIORITY_SIMD = 3 (SSE2/NEON - baseline SIMD)
- * PRIORITY_SIMD_AVX2 = 4 (AVX2 - 256-bit SIMD)
- * PRIORITY_SIMD_AVX512 = 5 (AVX512 - 512-bit SIMD)
+ * Priority hierarchy (relative ordering; numeric values are
+ * platform-dependent and defined below):
+ * PRIORITY_EMPTY (empty/null converter)
+ * PRIORITY_GENERAL (generic C++ implementation)
+ * PRIORITY_TABLE (table lookup)
+ * PRIORITY_SIMD (SSE2/NEON - baseline SIMD)
+ * PRIORITY_SIMD_AVX2 (AVX2 - 256-bit SIMD)
+ * PRIORITY_SIMD_AVX512 (AVX512 - 512-bit SIMD)

Comment on lines +60 to +65
// dispatch according to alignment
if ((size_t(input) & 0xf) == 0) {
convert_fc64_1_to_item32_1_nswap_guts(_)
} else {
convert_fc64_1_to_item32_1_nswap_guts(u_)
}
Copilot AI Feb 3, 2026

The aligned-load path uses _mm256_load_pd (32-byte alignment required), but the dispatch only checks size_t(input) & 0xf (16-byte). On 16B-aligned but not 32B-aligned inputs, this can crash. Please update the aligned-path condition to & 0x1f or always use _mm256_loadu_pd.

Comment on lines +170 to +175
// dispatch according to alignment
if ((size_t(input) & 0xf) == 0) {
convert_fc64_1_to_chdr_1_guts(_)
} else {
convert_fc64_1_to_chdr_1_guts(u_)
}
Copilot AI Feb 3, 2026

Same 16B-vs-32B alignment bug here: the aligned-load path uses _mm256_load_pd but the condition only checks & 0xf. Please change this to a 32-byte alignment check (& 0x1f) or always use unaligned loads.

Comment on lines +52 to +64
return "Auto (Best)";
case 0:
return "Generic";
case 1:
return "Unrolled";
case 2:
return "NEON";
case 3:
return "SSE2/SSSE3";
case 4:
return "AVX2";
case 5:
return "AVX512";
Copilot AI Feb 3, 2026

get_priority_name() hard-codes priority meanings that don’t match host/lib/convert/convert_common.hpp (e.g., prio 1 is PRIORITY_TABLE/table lookup there, not "Unrolled"; prio 2 is PRIORITY_SIMD only on ARM, while x86 uses SIMD at prio 3). This makes the benchmark output misleading. Please align the labels (and default priority list) with the actual priority constants/semantics used in convert_common.hpp.

Suggested change:
- return "Auto (Best)";
- case 0:
- return "Generic";
- case 1:
- return "Unrolled";
- case 2:
- return "NEON";
- case 3:
- return "SSE2/SSSE3";
- case 4:
- return "AVX2";
- case 5:
- return "AVX512";
+ // Automatic selection of the best available implementation
+ return "Auto (Best)";
+ case 0:
+ // Reference/generic implementation
+ return "Generic";
+ case 1:
+ // Table-lookup based implementation (PRIORITY_TABLE)
+ return "Table lookup";
+ case 2:
+ // SIMD implementation (architecture-dependent priority)
+ return "SIMD (priority 2)";
+ case 3:
+ // SIMD implementation (architecture-dependent priority)
+ return "SIMD (priority 3)";

Comment on lines +1023 to 1025
// BOOST_TEST_DECORATOR(*boost::unit_test::disabled())
BOOST_AUTO_TEST_CASE(benchmark_convert_types_fc32_and_sc8)
{
Copilot AI Feb 3, 2026

The benchmark test benchmark_convert_types_fc32_and_sc8 has effectively been re-enabled by commenting out the BOOST_TEST_DECORATOR(*boost::unit_test::disabled()). With BENCHMARK_NSAMPS = 8*1024*1024 and BENCHMARK_NITERS = 4, this will substantially increase unit test runtime and can cause CI timeouts. Please re-disable this benchmark (or gate it behind an opt-in flag/env var) so benchmarks don’t run as part of the default test suite.
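A hedged sketch of the opt-in gating suggested here (the UHD_RUN_CONVERT_BENCHMARKS variable name is hypothetical):

#include <boost/test/unit_test.hpp>
#include <cstdlib>

BOOST_AUTO_TEST_CASE(benchmark_convert_types_fc32_and_sc8)
{
    // Skip the expensive benchmark body unless explicitly requested,
    // so default CI test runs stay fast.
    if (std::getenv("UHD_RUN_CONVERT_BENCHMARKS") == nullptr) {
        BOOST_TEST_MESSAGE(
            "Skipping benchmark; set UHD_RUN_CONVERT_BENCHMARKS to run it");
        return;
    }
    // ... benchmark body unchanged ...
}

Alternatively, restoring the *boost::unit_test::disabled() decorator and relying on the new standalone convert_benchmark example for measurements would achieve the same.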

Comment on lines +74 to +79
// dispatch according to alignment
if ((size_t(input) & 0xf) == 0) {
convert_fc32_1_to_sc8_item32_1_bswap_guts(_)
} else {
convert_fc32_1_to_sc8_item32_1_bswap_guts(u_)
}
Copilot AI Feb 3, 2026

The aligned-load path uses _mm256_load_ps via _mm256_load##_al_##ps, which requires 32-byte alignment, but the alignment check only tests & 0xf (16-byte). If input is 16B-aligned but not 32B-aligned, this will execute an aligned AVX load on an unaligned address and can crash. Please either (a) check & 0x1f for 32-byte alignment, or (b) always use _mm256_loadu_ps and drop the aligned/unaligned dispatch.
