Use instruction-level parallelism with AES-NI to speed up AES-CTR #573
mfil wants to merge 8 commits into Mbed-TLS:development
Conversation
One feature of the AES-NI instructions is instruction-level parallelism, which means that, while one round of AES takes multiple clock cycles to complete, another instruction can already be executed on the next clock cycle. Only when the result of the AES round is used do we need to wait for it to finish. In particular, this makes it possible to encrypt or decrypt multiple AES blocks in parallel, which can be used to speed up AES in some modes of operation. In this commit, it is used to speed up AES-CTR.
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
A bug in the previous commit broke AES-CTR when AES-NI is not available.
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
It says
but when I click the link to view details, I see a green check on archive-timestamps and everything else is blank.
Hi @mfil, thanks for submitting this PR. It looks like that job has failed due to infrastructure issues, so I would ignore the failure for the time being. I'm trying to get the issues resolved and will let you know when it's working again.
Hi @bjwtaylor! Thanks, I'll wait. In the meantime, I found that the functions with the intrinsics may not actually improve performance depending on the compiler. What I hoped would happen is that with optimizations turned on, the compiler would unroll the loops and then pipeline the
I tried hand-unrolling the loops, and then I also get the speed-ups with GCC. Should I commit that?
@mfil, the PR will need to go through a full review process. In the meantime, though, I would add anything you think is useful. I've now resolved some of the CI issues. Please ignore the OpenCI issues; any failures in the NewCI should be related to your changes.
With some compilers, the intrinsics version of mbedtls_aesni_ecb_crypt_4blocks does not yield any performance improvement because the compilers do not unroll the for-loops. Therefore, this commit adds a version where the loops have been unrolled by hand.
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
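To illustrate the difference between the looped and the hand-unrolled form, here is a minimal sketch. It is not the code from the commit: toy_round is a hypothetical stand-in for one cipher round (with AES-NI intrinsics, that would presumably be a call such as _mm_aesenc_si128), and both function names are made up for this example. Only the inner loop over the four blocks is unrolled here.

```c
#include <stdint.h>

/* Hypothetical stand-in for one cipher round; NOT an AES round. */
static inline uint64_t toy_round(uint64_t state, uint64_t round_key)
{
    state ^= round_key;
    state *= 0x9E3779B97F4A7C15ULL;
    return (state << 13) | (state >> 51);
}

/* Looped form: relies on the compiler to unroll the inner loop and to
 * keep the four block states in registers. */
static void crypt_4blocks_loop(uint64_t blk[4], const uint64_t *rk, int nr)
{
    for (int r = 0; r < nr; r++) {
        for (int i = 0; i < 4; i++) {
            blk[i] = toy_round(blk[i], rk[r]);
        }
    }
}

/* Hand-unrolled form: four independent local variables make it explicit
 * that the four round operations in each iteration do not depend on each
 * other, so a pipelined round instruction can overlap them. */
static void crypt_4blocks_unrolled(uint64_t blk[4], const uint64_t *rk, int nr)
{
    uint64_t b0 = blk[0], b1 = blk[1], b2 = blk[2], b3 = blk[3];

    for (int r = 0; r < nr; r++) {
        const uint64_t k = rk[r];
        b0 = toy_round(b0, k);
        b1 = toy_round(b1, k);
        b2 = toy_round(b2, k);
        b3 = toy_round(b3, k);
    }

    blk[0] = b0;
    blk[1] = b1;
    blk[2] = b2;
    blk[3] = b3;
}
```

Both functions compute the same result; whether the second one is faster depends on the compiler and the target, as described in the commit message above.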
Ok, I added another commit that unrolls the "4blocks" function.
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
I fixed the error in
I figured out what's going wrong in
For the failure on aarch64, it's not obvious to me what's going wrong. I guess I have to go play with QEMU.
The function mbedtls_aes_crypt_ecb_multiblock() didn't exit after it was done encrypting the input with mbedtls_aesce_crypt_ecb() but instead went on to encrypt the input buffer again with the AES software implementation. This not only wasted time but also produced incorrect output when encrypting a buffer in place.
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
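The fall-through described in this commit message can be reproduced with a small sketch. All names below are hypothetical and the "cipher" is a toy byte transformation, not AES; the point is only the control flow of a dispatcher that forgets to return after the hardware path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical runtime flag: 1 if the accelerated path is usable. */
static int hw_available = 1;

/* Toy transformation standing in for the cipher; NOT AES. The hardware
 * and software paths compute the same function, as the real ones would. */
static void toy_hw_crypt(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = (uint8_t) (in[i] + 0x3b);
    }
}

static void toy_sw_crypt(const uint8_t *in, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = (uint8_t) (in[i] + 0x3b);
    }
}

/* Buggy dispatcher: after the hardware path, execution falls through
 * to the software path and processes the buffers a second time. */
static int crypt_multiblock_buggy(const uint8_t *in, uint8_t *out, size_t len)
{
    if (hw_available) {
        toy_hw_crypt(in, out, len);
        /* missing return */
    }
    toy_sw_crypt(in, out, len);
    return 0;
}

/* Fixed dispatcher: return as soon as one implementation has run. */
static int crypt_multiblock_fixed(const uint8_t *in, uint8_t *out, size_t len)
{
    if (hw_available) {
        toy_hw_crypt(in, out, len);
        return 0;
    }
    toy_sw_crypt(in, out, len);
    return 0;
}
```

With separate input and output buffers, the buggy version only wastes time, because the second pass recomputes the same output from the unchanged input. When in and out point to the same buffer, the second pass runs on data that has already been encrypted, so the result is wrong.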
After looking at it for the 10th time, I think I spotted the bug with aarch64.
The tests in "mbedtls/tests/scripts/all.sh test_aesni" check that the software implementation of AES is compiled by checking whether this function name appears in aes.o. This function is static, and my changes in the previous commits cause it to be inlined by GCC 5.4.0, which is the compiler used in the test image in mbedtls-test. This does not seem to be the case with more recent versions. To make the tests pass, I've added the noinline attribute to the function.
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
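As a rough illustration of the workaround, the following sketch marks a static function as noinline. The macro and function names are hypothetical, the function body is a toy, and the attribute syntax assumes GCC or Clang; the actual function and macro used in the commit may differ.

```c
#include <stdint.h>

/* Hypothetical helper macro; assumes GCC/Clang attribute syntax and
 * expands to nothing on other compilers. */
#if defined(__GNUC__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

/* Without the attribute, an optimizing compiler is free to inline this
 * static function into its only caller and emit no separate function
 * (and hence no symbol name) for it in the object file. */
static SKETCH_NOINLINE int sw_encrypt_block(const uint8_t in[4], uint8_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t) (in[i] ^ 0xff);   /* toy operation, NOT AES */
    }
    return 0;
}

/* Public entry point that calls the static helper. */
int encrypt_block(const uint8_t in[4], uint8_t out[4])
{
    return sw_encrypt_block(in, out);
}
```

The attribute does not change what the code computes; it only keeps the function as a separate entity in the object file, which is what a name-based check on aes.o relies on.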
Signed-off-by: Max Fillinger <maximilian.fillinger@sentyron.com>
Description
One feature of the AES-NI instructions is instruction-level parallelism, which means that, while one round of AES takes multiple clock cycles to complete, another instruction can already start executing on the next clock cycle. Only when the result of the AES round is used do we need to wait for it to finish.
In particular, this makes it possible to encrypt or decrypt multiple AES blocks in parallel, which can be used to speed up AES in some modes of operation. In this commit, it is used to speed up AES-CTR. I have chosen to do up to four blocks in parallel. More would be possible, but while testing on my machine, I didn't see meaningful speed-ups when going beyond 4, and I don't want to clutter the source files with even more assembly.
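CTR mode lends itself to this because the counter blocks do not depend on each other or on the data. The following is a minimal, portable sketch of that structure, not the code from this PR: toy_encrypt_block and toy_encrypt_4blocks are hypothetical stand-ins for the real single-block and four-block AES routines, and the sketch ignores the partial-block state that a streaming CTR API has to carry between calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one block encryption; NOT real AES. */
static void toy_encrypt_block(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t) ((in[i] ^ 0x5a) + i);
    }
}

/* Hypothetical stand-in for a four-block routine. With AES-NI, the four
 * blocks would pass through each round back to back so that the round
 * latencies overlap. */
static void toy_encrypt_4blocks(const uint8_t in[64], uint8_t out[64])
{
    for (int b = 0; b < 4; b++) {
        toy_encrypt_block(in + 16 * b, out + 16 * b);
    }
}

/* Increment a 128-bit big-endian counter by one. */
static void ctr_increment(uint8_t ctr[16])
{
    for (int i = 15; i >= 0; i--) {
        if (++ctr[i] != 0) {
            break;
        }
    }
}

/* Reference: one counter block at a time. */
static void ctr_crypt_single(uint8_t ctr[16], const uint8_t *in,
                             uint8_t *out, size_t len)
{
    uint8_t stream[16];
    while (len > 0) {
        size_t chunk = len < 16 ? len : 16;
        toy_encrypt_block(ctr, stream);
        ctr_increment(ctr);
        for (size_t i = 0; i < chunk; i++) {
            out[i] = in[i] ^ stream[i];
        }
        in += chunk; out += chunk; len -= chunk;
    }
}

/* Batched: up to four counter blocks are prepared before any of them
 * is encrypted, then all are handed to the four-block routine. */
static void ctr_crypt_batched(uint8_t ctr[16], const uint8_t *in,
                              uint8_t *out, size_t len)
{
    uint8_t counters[64], stream[64];
    while (len > 0) {
        size_t chunk = len < 64 ? len : 64;
        size_t nblocks = (chunk + 15) / 16;
        memset(counters, 0, sizeof(counters));
        for (size_t b = 0; b < nblocks; b++) {
            memcpy(counters + 16 * b, ctr, 16);
            ctr_increment(ctr);
        }
        toy_encrypt_4blocks(counters, stream);
        for (size_t i = 0; i < chunk; i++) {
            out[i] = in[i] ^ stream[i];
        }
        in += chunk; out += chunk; len -= chunk;
    }
}
```

Both functions produce the same output and leave the counter in the same state; the batched one merely exposes four independent block encryptions at once.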
I have removed the defines with the aesenc etc. opcodes, assuming that we don't need to support these older versions of binutils anymore. If that's wrong, I'll put them back (and add my own macros for the registers I need to use).
I would like to go on and also use it in AES-GCM, but there I ran into a problem. GCM runs the underlying block cipher in ECB mode. Due to an inconsistency in the mbedtls_cipher API, ECB mode always encrypts exactly one block. As far as I understand, mbedtls_cipher is on its way out, so I don't know if you would want to spend time reviewing a fix for this. Please let me know what you think would be the best approach!
An alternative I can see would be to use CTR mode (making sure that the 32-bit counter in GCM doesn't overflow) inside of GCM, if available.
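The counter concern comes from the fact that GCM increments only the rightmost 32 bits of the counter block, whereas a generic CTR routine may carry into the higher bytes. A minimal sketch of such a guard, with hypothetical function names and under the assumption that the CTR routine increments the whole 128-bit counter once per block, could look like this:

```c
#include <stdint.h>

/* Number of blocks that can be processed from this counter block such
 * that no increment wraps the big-endian 32-bit counter stored in
 * bytes 12..15. Hypothetical helper, not part of any existing API. */
static uint32_t ctr32_safe_blocks(const uint8_t ctr[16])
{
    uint32_t c = ((uint32_t) ctr[12] << 24) |
                 ((uint32_t) ctr[13] << 16) |
                 ((uint32_t) ctr[14] << 8) |
                 (uint32_t) ctr[15];
    return 0xFFFFFFFFu - c;
}

/* Returns 1 if nblocks can be handed to a 128-bit CTR routine without
 * the carry leaving the 32-bit counter field, 0 otherwise. */
static int ctr32_can_process(const uint8_t ctr[16], uint64_t nblocks)
{
    return nblocks <= (uint64_t) ctr32_safe_blocks(ctr);
}
```

A caller inside GCM could use the CTR path only when ctr32_can_process returns 1 and fall back to block-by-block processing otherwise; whether that fits the actual GCM code is a design question for the review.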
PR checklist