Use instruction-level parallelism with AES-NI to speed up AES-CTR by mfil · Pull Request #573 · Mbed-TLS/TF-PSA-Crypto

mfil · 2025-11-13T12:39:33Z

Description

One feature of the AES-NI instructions is instruction-level parallelism, which means that, while one round of AES takes multiple clock cycles to complete, another instruction can already start executing on the next clock cycle. Only when the result of the AES round is used do we need to wait for it to finish.

In particular, this makes it possible to encrypt or decrypt multiple AES blocks in parallel, which can be used to speed up AES in some modes of operation. In this commit, it is used to speed up AES-CTR. I have chosen to process up to four blocks in parallel. More would be possible, but while testing on my machine, I didn't see meaningful speed-ups when going beyond four, and I don't want to clutter the source files with even more assembly.
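To illustrate why CTR mode lends itself to this, here is a minimal sketch of a CTR loop that hands up to four counter blocks at a time to a block function. It is not the code from this PR: `toy_encrypt_blocks`, `ctr_increment` and `ctr_crypt` are hypothetical names, and the block function is a placeholder byte mix rather than AES. The point is only that the counter blocks are known in advance and independent of each other, so a real AES-NI backend could overlap their rounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define MAX_BATCH  4

/* Placeholder block function: NOT AES, just a keyed byte mix so the sketch
 * runs anywhere. Each block is processed independently of the others, which
 * is the property that would let an AES-NI version overlap their rounds. */
static void toy_encrypt_blocks(const uint8_t key[BLOCK_SIZE],
                               const uint8_t *in, uint8_t *out,
                               size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++) {
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            uint8_t x = (uint8_t) (in[b * BLOCK_SIZE + i] ^ key[i]);
            out[b * BLOCK_SIZE + i] = (uint8_t) (x * 167u + 13u);
        }
    }
}

/* Increment a 128-bit big-endian counter by one. */
static void ctr_increment(uint8_t counter[BLOCK_SIZE])
{
    for (int i = BLOCK_SIZE - 1; i >= 0; i--) {
        if (++counter[i] != 0) {
            break;
        }
    }
}

/* CTR mode: XOR the input with encrypted counter blocks. `batch` is the
 * number of counter blocks passed to the block function per call (1..4).
 * Simplification: leftover keystream of a partial block is not kept. */
static void ctr_crypt(const uint8_t key[BLOCK_SIZE],
                      uint8_t counter[BLOCK_SIZE],
                      const uint8_t *in, uint8_t *out, size_t len,
                      size_t batch)
{
    uint8_t ctr_blocks[MAX_BATCH * BLOCK_SIZE];
    uint8_t keystream[MAX_BATCH * BLOCK_SIZE];

    assert(batch >= 1 && batch <= MAX_BATCH);

    while (len > 0) {
        size_t nblocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
        if (nblocks > batch) {
            nblocks = batch;
        }
        /* All counter values of the batch are known before encrypting. */
        for (size_t n = 0; n < nblocks; n++) {
            memcpy(ctr_blocks + n * BLOCK_SIZE, counter, BLOCK_SIZE);
            ctr_increment(counter);
        }
        toy_encrypt_blocks(key, ctr_blocks, keystream, nblocks);

        size_t chunk = nblocks * BLOCK_SIZE;
        if (chunk > len) {
            chunk = len;
        }
        for (size_t i = 0; i < chunk; i++) {
            out[i] = (uint8_t) (in[i] ^ keystream[i]);
        }
        in += chunk;
        out += chunk;
        len -= chunk;
    }
}
```

Calling `ctr_crypt` with `batch` set to 1 or to 4 should give the same output and leave the same counter behind; only the number of blocks available to the block function per call changes.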

I have removed the defines with the aesenc etc. opcodes assuming that we don't need to support these older versions of binutils anymore. If that's wrong, I'll put them back (and add my own macros for the registers I need to use).

I would like to go on and also use it in AES-GCM, but there I ran into a problem. GCM runs the underlying block cipher in ECB mode. Due to an inconsistency in the mbedtls_cipher API, ECB mode always encrypts exactly one block. As far as I understand, mbedtls_cipher is on its way out, so I don't know if you would want to spend time reviewing a fix for this. Please let me know what you think would be the best approach!

An alternative I can see would be to use CTR mode (making sure that the 32-bit counter in GCM doesn't overflow) inside of GCM, if available.
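The overflow concern comes from GCM incrementing only the last 32 bits of the counter block, whereas a general CTR implementation may carry into the remaining bytes. A minimal sketch of the kind of guard this would need is below; the function names are hypothetical and nothing here is taken from the Mbed TLS sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits of the counter block: bytes 12..15, big-endian. */
static uint32_t gcm_counter_low32(const uint8_t counter[16])
{
    return ((uint32_t) counter[12] << 24) | ((uint32_t) counter[13] << 16) |
           ((uint32_t) counter[14] << 8)  |  (uint32_t) counter[15];
}

/* GCM-style increment: only bytes 12..15 change, wrapping modulo 2^32
 * without carrying into byte 11. */
static void gcm_inc32(uint8_t counter[16])
{
    for (int i = 15; i >= 12; i--) {
        if (++counter[i] != 0) {
            break;
        }
    }
}

/* Conservative check (an assumption of this sketch): true if incrementing
 * the counter nblocks times never wraps the low 32 bits, so a CTR routine
 * that carries across all 128 bits would agree with gcm_inc32, including
 * for the counter value left behind afterwards. */
static bool gcm_wide_ctr_is_safe(const uint8_t counter[16], uint64_t nblocks)
{
    return nblocks <= (uint64_t) UINT32_MAX - gcm_counter_low32(counter);
}
```

When the check fails, a caller could fall back to processing one block at a time with `gcm_inc32`, or split the input at the wrap point.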

PR checklist

changelog provided
framework PR not required
mbedtls development PR not required because: No changes in mbedtls are needed to make use of this. The speed-up will take effect as soon as the submodule is updated. (Should I make a pull request just to update the submodule?)
mbedtls 3.6 PR not required because: It's a new feature, not a bugfix. It doesn't have to be backported. (As far as I'm concerned.)
tests not required because: No changes in the output are intended, only performance.

One feature of the AES-NI instructions is instruction-level parallelism, which means that, while one round of AES takes multiple clock cycles to complete, another instruction can already be executed on the next clock cycle. Only when the result of the AES round is used do we need to wait for it to finish. In particular, this makes it possible to encrypt or decrypt multiple AES blocks in parallel which can be used to speed up AES in some modes of operation. In this commit, it is used to speed up AES-CTR. Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

mfil · 2025-11-18T12:00:09Z

It says

TF OpenCI: Interface stability tests — Pre-test checks failed.

but when I click the link to view details, I see a green check on archive-timestamps and everything else is blank.

bjwtaylor · 2025-11-18T13:43:44Z

Hi @mfil, thanks for submitting this PR. It looks like that job has failed due to infrastructure issues, so I would ignore it for the time being. I'm trying to get the issues resolved and will let you know when it's working again.

mfil · 2025-11-18T19:24:23Z

Hi @bjwtaylor! Thanks, I'll wait.

In the meantime, I found that the functions with the intrinsics may not actually improve performance depending on the compiler. What I hoped would happen is that with optimizations turned on, the compiler would unroll the loops and then pipeline the aesenc instructions. Clang 18.1.3 does so, GCC 13.3.0 does not.

I tried hand-unrolling the loops, and with that I get the speed-ups with GCC as well. Should I commit that?

bjwtaylor · 2025-11-19T08:36:07Z

@mfil, the PR will need to go through a full review process. In the meantime, though, I would add anything you think is useful. I've now resolved some of the CI issues. Please ignore the OpenCI issues; any failures in the NewCI should be related to your changes.

On some compilers, the intrinsics version of mbedtls_aesni_ecb_crypt_4blocks does not improve performance because those compilers do not unroll the for-loops. Therefore, this commit adds a version where the loops have been unrolled by hand. Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
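The difference between the two forms can be sketched as follows. This is not the code from the commit: `toy_round` is a placeholder standing in for an AES round instruction, the states are 64-bit integers instead of 128-bit registers, and all names are hypothetical. In the looped form, pipelining depends on the compiler unrolling the inner loop; in the hand-unrolled form, four independent operations sit next to each other in every round regardless of what the optimizer does.

```c
#include <stdint.h>

#define NUM_BLOCKS 4

/* Placeholder round function: NOT an AES round. It only models an
 * operation whose result depends on the previous state of the same block. */
static inline uint64_t toy_round(uint64_t state, uint64_t round_key)
{
    state ^= round_key;
    state *= 0x9E3779B97F4A7C15ull;
    return state ^ (state >> 29);
}

/* Looped form: relies on the compiler to unroll the inner loop. */
static void rounds_looped(uint64_t s[NUM_BLOCKS], const uint64_t *rk, int nr)
{
    for (int r = 0; r < nr; r++) {
        for (int b = 0; b < NUM_BLOCKS; b++) {
            s[b] = toy_round(s[b], rk[r]);
        }
    }
}

/* Hand-unrolled form: the four blocks are written out explicitly, so each
 * round issues four operations that do not depend on one another. */
static void rounds_unrolled(uint64_t s[NUM_BLOCKS], const uint64_t *rk, int nr)
{
    uint64_t s0 = s[0], s1 = s[1], s2 = s[2], s3 = s[3];

    for (int r = 0; r < nr; r++) {
        const uint64_t k = rk[r];
        s0 = toy_round(s0, k);
        s1 = toy_round(s1, k);
        s2 = toy_round(s2, k);
        s3 = toy_round(s3, k);
    }

    s[0] = s0;
    s[1] = s1;
    s[2] = s2;
    s[3] = s3;
}
```

Both functions compute the same result; whether the second one is actually faster depends on the compiler and CPU and would need to be measured.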

mfil · 2025-11-19T14:56:02Z

Ok, I added another commit that unrolls the "4blocks" function.

mfil · 2025-11-20T18:18:27Z

I fixed the error in tf_psa_crypto_check_changelog.

I figured out what's going wrong in test_aesni, but not what would be a good way to fix it. It checks that the binary aes.o contains the string mbedtls_internal_aes_encrypt. That function is declared static, and my changes in aes.c seem to cause the compiler in your test container to inline it.

For the failure on aarch64, it's not obvious to me what's going wrong. I guess I have to go play with QEMU.

The function mbedtls_aes_crypt_ecb_multiblock() didn't return after it was done encrypting the input with mbedtls_aesce_crypt_ecb() but instead went on to encrypt the input buffer again with the AES software implementation. This not only wasted time but also produced incorrect output when encrypting a buffer in place. Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
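The shape of this bug can be reproduced in a few lines. The sketch below uses hypothetical stand-ins (`toy_hw_crypt`, `toy_sw_crypt`, and a `hw_available` flag) instead of the real AES backends; both stand-ins apply the same simple transformation so the effect of running twice is easy to see.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag standing in for a runtime CPU feature check. */
static int hw_available = 1;

/* Stand-ins for the hardware and software backends: NOT AES. Both apply
 * the same transformation (add 1 to every byte). */
static void toy_hw_crypt(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = (uint8_t) (in[i] + 1u);
    }
}

static void toy_sw_crypt(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = (uint8_t) (in[i] + 1u);
    }
}

/* Buggy dispatcher: falls through to the software path after the
 * hardware path has already produced the output. */
static int crypt_multiblock_buggy(const uint8_t *in, uint8_t *out, size_t len)
{
    if (hw_available) {
        toy_hw_crypt(in, out, len);
        /* missing return */
    }
    toy_sw_crypt(in, out, len);
    return 0;
}

/* Fixed dispatcher: returns as soon as one backend has done the work. */
static int crypt_multiblock_fixed(const uint8_t *in, uint8_t *out, size_t len)
{
    if (hw_available) {
        toy_hw_crypt(in, out, len);
        return 0;
    }
    toy_sw_crypt(in, out, len);
    return 0;
}
```

With separate input and output buffers, the buggy version merely does the work twice and still writes the right result. With `in == out`, the second pass reads the already transformed data, so the buffer ends up transformed twice, which would explain why only in-place use produced wrong output.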

mfil · 2025-11-20T18:42:01Z

After looking at it for the 10th time, I think I spotted the bug with aarch64.

The tests in "mbedtls/tests/scripts/all.sh test_aesni" check that the software implementation of AES is compiled by checking whether this function name appears in aes.o. This function is static, and my changes in the previous commits cause it to be inlined by GCC 5.4.0, which is the compiler used in the test image in mbedtls-test. This does not seem to be the case with more recent versions. To make the tests pass, I've added the noinline attribute to the function. Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
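For readers unfamiliar with the attribute, here is a minimal sketch of how a static function can be kept as a separate function in the object file. The macro and function names are hypothetical and the body is a placeholder, not the Mbed TLS code. The attribute syntax shown is the GCC/Clang one; on other compilers the macro expands to nothing in this sketch.

```c
#include <stdint.h>

/* Hypothetical helper macro: request that the compiler does not inline the
 * function it is applied to. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

/* Static function that should stay a separate function in the object file
 * so that a check looking for its name can find it. Placeholder body. */
SKETCH_NOINLINE static int internal_encrypt_sketch(const uint8_t in[16],
                                                   uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t) (in[i] ^ 0xa5u);
    }
    return 0;
}

/* Public entry point with a single call site for the static function,
 * which is the situation where an optimizer is most likely to inline. */
int encrypt_entry_sketch(const uint8_t in[16], uint8_t out[16])
{
    return internal_encrypt_sketch(in, out);
}
```

After compiling with optimizations, the symbol table of the object file can be inspected with a tool such as `nm` to see whether `internal_encrypt_sketch` is still present as a local symbol.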

Max Fillinger added 3 commits November 13, 2025 13:13

Unbreak AES-CTR without AES-NI

217aaa2

A bug in the previous commit broke AES-CTR when AES-NI is not available. Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

Zeroize keystream blocks in AES-CTR

59dafb5

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

bjwtaylor self-assigned this Nov 17, 2025

bjwtaylor added the needs-work label Nov 19, 2025

bjwtaylor removed their assignment Nov 19, 2025

bjwtaylor added the enhancement (New feature or request), size-m (Estimated task size: medium (~1w)), priority-medium (Medium priority - this can be reviewed as time permits), and needs-ci (Needs to pass CI tests) labels and removed the needs-work label Nov 20, 2025

Fix syntax error in ChangeLog.d file

b850688

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

Max Fillinger added 2 commits November 21, 2025 12:12

Fix coding style

55eac4b

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

mfil wants to merge 8 commits into Mbed-TLS:development from mfil:feature/aes_ni_parallel
