Use instruction-level parallelism with AES-NI to speed up AES-CTR #573

Open
mfil wants to merge 8 commits into Mbed-TLS:development from mfil:feature/aes_ni_parallel

Conversation

@mfil
Contributor

@mfil mfil commented Nov 13, 2025

Description

One feature of the AES-NI instructions is instruction-level parallelism, which means that, while one round of AES takes multiple clock cycles to complete, another instruction can already start executing on the next clock cycle. Only when the result of the AES round is used do we need to wait for it to finish.

In particular, this makes it possible to encrypt or decrypt multiple AES blocks in parallel, which can be used to speed up AES in some modes of operation. In this commit, it is used to speed up AES-CTR. I have chosen to process up to four blocks in parallel. More would be possible, but while testing on my machine, I didn't see meaningful speed-ups when going beyond four, and I don't want to clutter the source files with even more assembly.
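To illustrate the interleaving described above, here is a minimal sketch using the AES-NI intrinsics from `<immintrin.h>`. It is not the code from this PR: the names `sketch_has_aesni`, `sketch_encrypt_1block` and `sketch_encrypt_4blocks` are hypothetical, the round keys are assumed to be already expanded into `nr + 1` consecutive 16-byte values, and the `target` attribute assumes GCC or Clang on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <cpuid.h>      /* __get_cpuid (GCC/Clang on x86) */
#include <immintrin.h>  /* _mm_aesenc_si128 and friends */

/* Assumption: GCC/Clang attribute so the file builds without -maes. */
#define SKETCH_AESNI __attribute__((target("aes,sse2")))

/* Returns nonzero if CPUID leaf 1 reports AES-NI (ECX bit 25). */
static int sketch_has_aesni(void)
{
    unsigned int a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && ((c >> 25) & 1u);
}

/* Baseline: one block. Each aesenc depends on the previous result. */
SKETCH_AESNI
static void sketch_encrypt_1block(const uint8_t *rk, int nr,
                                  const uint8_t in[16], uint8_t out[16])
{
    const __m128i *k = (const __m128i *) rk;
    assert(nr >= 1);
    __m128i s = _mm_xor_si128(_mm_loadu_si128((const __m128i *) in),
                              _mm_loadu_si128(k));
    for (int r = 1; r < nr; r++) {
        s = _mm_aesenc_si128(s, _mm_loadu_si128(k + r));
    }
    s = _mm_aesenclast_si128(s, _mm_loadu_si128(k + nr));
    _mm_storeu_si128((__m128i *) out, s);
}

/* Four independent blocks, processed round by round. */
SKETCH_AESNI
static void sketch_encrypt_4blocks(const uint8_t *rk, int nr,
                                   const uint8_t in[64], uint8_t out[64])
{
    const __m128i *k = (const __m128i *) rk;
    const __m128i *src = (const __m128i *) in;
    __m128i *dst = (__m128i *) out;
    assert(nr >= 1);

    __m128i key = _mm_loadu_si128(k);
    __m128i b0 = _mm_xor_si128(_mm_loadu_si128(src + 0), key);
    __m128i b1 = _mm_xor_si128(_mm_loadu_si128(src + 1), key);
    __m128i b2 = _mm_xor_si128(_mm_loadu_si128(src + 2), key);
    __m128i b3 = _mm_xor_si128(_mm_loadu_si128(src + 3), key);

    for (int r = 1; r < nr; r++) {
        key = _mm_loadu_si128(k + r);
        /* No data dependency between these four instructions. */
        b0 = _mm_aesenc_si128(b0, key);
        b1 = _mm_aesenc_si128(b1, key);
        b2 = _mm_aesenc_si128(b2, key);
        b3 = _mm_aesenc_si128(b3, key);
    }

    key = _mm_loadu_si128(k + nr);
    _mm_storeu_si128(dst + 0, _mm_aesenclast_si128(b0, key));
    _mm_storeu_si128(dst + 1, _mm_aesenclast_si128(b1, key));
    _mm_storeu_si128(dst + 2, _mm_aesenclast_si128(b2, key));
    _mm_storeu_si128(dst + 3, _mm_aesenclast_si128(b3, key));
}
```

Because the four `aesenc` instructions in each round do not depend on one another, the CPU can issue them back to back instead of waiting for each result. The output must be identical to calling the one-block routine four times, and callers should check `sketch_has_aesni()` before using either function.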

I have removed the macros that define the raw opcodes for aesenc and the related instructions, assuming that we no longer need to support the older versions of binutils that lack these mnemonics. If that's wrong, I'll put them back (and add my own macros for the registers I need to use).

I would like to go on and also use it in AES-GCM, but there I ran into a problem. GCM runs the underlying block cipher in ECB mode. Due to an inconsistency in the mbedtls_cipher API, ECB mode always encrypts exactly one block. As far as I understand, mbedtls_cipher is on its way out, so I don't know if you would want to spend time reviewing a fix for this. Please let me know what you think would be the best approach!

An alternative I can see would be to use CTR mode (making sure that the 32-bit counter in GCM doesn't overflow) inside of GCM, if available.
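The overflow concern can be made concrete with a small check. GCM increments only the last 32 bits of the counter block (big-endian, wrapping modulo 2^32), whereas a generic CTR routine may carry into the upper bytes. A minimal sketch, assuming the hypothetical helper name `sketch_gcm_ctr_blocks_fit`, tests whether a run of blocks can be handed to such a CTR routine without the low 32 bits wrapping:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value. */
static uint32_t sketch_load_be32(const uint8_t *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Hypothetical helper: returns 1 if `nblocks` consecutive counter
 * values, starting from the one stored in counter_block[12..15],
 * can be generated without the 32-bit counter wrapping around.
 * The last value used is ctr + nblocks - 1.
 */
static int sketch_gcm_ctr_blocks_fit(const uint8_t counter_block[16],
                                     uint64_t nblocks)
{
    assert(counter_block != NULL);
    uint32_t ctr = sketch_load_be32(counter_block + 12);
    return nblocks == 0 ||
           nblocks - 1 <= (uint64_t) (UINT32_MAX - ctr);
}
```

With a 96-bit IV the first data block uses counter value 2, so up to 2^32 - 2 blocks fit, which matches GCM's maximum message length. For other IV lengths the initial counter block is derived with GHASH and can start close to the wrap point, so a run would have to be split there.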

PR checklist

  • changelog provided
  • framework PR not required
  • mbedtls development PR not required because: No changes in mbedtls are needed to make use of this. The speed-up will take effect as soon as the submodule is updated. (Should I make a pull request just to update the submodule?)
  • mbedtls 3.6 PR not required because: It's a new feature, not a bugfix. It doesn't have to be backported. (As far as I'm concerned.)
  • tests not required because: No changes in the output are intended, only performance.

Max Fillinger added 3 commits November 13, 2025 13:13
One feature of the AES-NI instructions is instruction-level parallelism,
which means that, while one round of AES takes multiple clock cycles to
complete, another instruction can already be executed on the next clock
cycle. Only when the result of the AES round is used do we need to wait
for it to finish.

In particular, this makes it possible to encrypt or decrypt multiple AES
blocks in parallel which can be used to speed up AES in some modes of
operation. In this commit, it is used to speed up AES-CTR.

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

A bug in the previous commit broke AES-CTR when AES-NI is not
available.

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
@bjwtaylor bjwtaylor self-assigned this Nov 17, 2025
@mfil
Contributor Author

mfil commented Nov 18, 2025

It says

TF OpenCI: Interface stability tests — Pre-test checks failed.

but when I click the link to view details, I see a green check on archive-timestamps and everything else is blank.

@bjwtaylor
Contributor

Hi @mfil, thanks for submitting this PR. It looks like that job has failed due to infrastructure issues, so I would ignore it for the time being. I'm trying to get the issues resolved and will let you know when it's working again.

@mfil
Contributor Author

mfil commented Nov 18, 2025

Hi @bjwtaylor ! Thanks, I'll wait.

In the meantime, I found that the functions with the intrinsics may not actually improve performance, depending on the compiler. What I hoped would happen is that, with optimizations turned on, the compiler would unroll the loops and then pipeline the aesenc instructions. Clang 18.1.3 does so; GCC 13.3.0 does not.

I tried hand-unrolling the loops, and with that I get the speed-ups with GCC as well. Should I commit that?

@bjwtaylor
Contributor

@mfil, the PR will need to go through a full review process. In the meantime, though, I would add anything you think is useful. I've now resolved some of the CI issues. Please ignore the OpenCI failures; any failures in the NewCI should be related to your changes.

@bjwtaylor bjwtaylor removed their assignment Nov 19, 2025
With some compilers, the intrinsics version of
mbedtls_aesni_ecb_crypt_4blocks does not improve performance because
those compilers do not unroll the for-loops. Therefore, this commit
adds a version where the loops have been unrolled by hand.

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
@mfil
Contributor Author

mfil commented Nov 19, 2025

Ok, I added another commit that unrolls the "4blocks" function.

@bjwtaylor bjwtaylor added the enhancement, size-m, priority-medium and needs-ci labels and removed the needs-work label Nov 20, 2025
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
@mfil
Contributor Author

mfil commented Nov 20, 2025

I fixed the error in tf_psa_crypto_check_changelog.

I figured out what's going wrong in test_aesni, but not what would be a good way to fix it. It checks that the binary aes.o contains the string mbedtls_internal_aes_encrypt. That function is declared static, and my changes in aes.c seem to cause the compiler in your test container to inline it.

For the failure on aarch64, it's not obvious to me what's going wrong. I guess I have to go play with QEMU.

The function mbedtls_aes_crypt_ecb_multiblock() didn't exit after
it was done encrypting the input with mbedtls_aesce_crypt_ecb() but
instead went on to encrypt the input buffer again with the AES software
implementation. This not only wasted time but also produced incorrect
output when encrypting a buffer in place.

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
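The control-flow bug described in this commit message is easy to reproduce in miniature. The sketch below is not the Mbed TLS code: `toy_block`, `toy_hw_path`, `toy_sw_path` and the two dispatchers are hypothetical stand-ins, and the "cipher" is a trivial byte map. It only shows why falling through to the second path corrupts in-place output:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for encrypting one "block" (here: a single byte). */
static uint8_t toy_block(uint8_t x)
{
    return (uint8_t) (x * 5u + 1u);
}

/* Stand-in for the accelerated path. */
static void toy_hw_path(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) out[i] = toy_block(in[i]);
}

/* Stand-in for the software path; computes the same function. */
static void toy_sw_path(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) out[i] = toy_block(in[i]);
}

static int toy_ecb_multiblock_buggy(int hw_available, const uint8_t *in,
                                    uint8_t *out, size_t len)
{
    assert(in != NULL && out != NULL);
    if (hw_available) {
        toy_hw_path(in, out, len);
        /* BUG: no return here, so execution falls through. */
    }
    /* If in == out, this reads the already-encrypted bytes. */
    toy_sw_path(in, out, len);
    return 0;
}

static int toy_ecb_multiblock_fixed(int hw_available, const uint8_t *in,
                                    uint8_t *out, size_t len)
{
    assert(in != NULL && out != NULL);
    if (hw_available) {
        toy_hw_path(in, out, len);
        return 0; /* done: do not run the software path as well */
    }
    toy_sw_path(in, out, len);
    return 0;
}
```

With separate input and output buffers the buggy version still returns the right bytes and merely does the work twice, which is why a test that never encrypts in place would not catch it.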
@mfil
Contributor Author

mfil commented Nov 20, 2025

After looking at it for the 10th time, I think I spotted the bug with aarch64.

Max Fillinger added 2 commits November 21, 2025 12:12
The tests in "mbedtls/tests/scripts/all.sh test_aesni" check that the
software implementation of AES is compiled by checking whether the
function name mbedtls_internal_aes_encrypt appears in aes.o. This
function is static, and my changes in the previous commits cause it to
be inlined by GCC 5.4.0, which is the compiler used in the test image in
mbedtls-test. This does not seem to be the case with more recent
versions.

To make the tests pass, I've added the noinline attribute to the
function.

Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
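The noinline approach can be sketched as follows. This is an illustration under assumptions, not the change made in aes.c: the macro `SKETCH_NOINLINE` and both function names are hypothetical, and the attribute syntax shown is the GCC/Clang one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang attribute syntax; a no-op elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

/*
 * A static helper with a single caller is a prime candidate for
 * inlining. The attribute asks the compiler to keep an out-of-line
 * copy, so the function keeps its own symbol in the object file.
 */
SKETCH_NOINLINE
static uint32_t sketch_internal_round(uint32_t x, uint32_t k)
{
    return (x ^ k) + 1u;
}

/* Public entry point that calls the helper once per round key. */
uint32_t sketch_encrypt(uint32_t x, const uint32_t *rk, size_t n)
{
    assert(rk != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        x = sketch_internal_round(x, rk[i]);
    }
    return x;
}
```

On a toolchain with binutils, running `nm` on the resulting object file should list `sketch_internal_round` as a local text symbol. Without the attribute, an optimizing compiler is free to inline the helper and drop the symbol, which is what breaks a check that searches the object file for the function name.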
Labels

  • enhancement: New feature or request
  • needs-ci: Needs to pass CI tests
  • priority-medium: Medium priority - this can be reviewed as time permits
  • size-m: Estimated task size: medium (~1w)
