Optimizing x86 AES intrinsics on ARMv8-A
Fixing performance issues from emulated x86 intrinsics
In a prior post, I wrote about emulating x86 intrinsics on ARMv8-A by implementing replacement inline functions with ARM intrinsics. The ARM AES instructions have slightly different semantics than the x86 instructions, so it took some tricks to get them to match. The benefit of this approach is that algorithms written with the x86 intrinsics can be cross-compiled to ARM without requiring the algorithm to be rewritten. The compiler does most of the work!
However, a drawback that has been bothering me for the past few weeks is that the tricks used to match the x86 semantics carry a performance penalty. When implementing multiple rounds of AES with the emulated x86 intrinsics, the compiler generates suboptimal code; it would be more efficient to use the native ARM intrinsics directly. Still, I really didn’t want to rewrite the algorithm by hand, so I spent more time to see what else could be done.
The Semantic Performance Penalty
Let’s review the x86 AESENC and AESENCLAST replacement functions from last time:
#include <stdint.h>
#include <arm_neon.h>
typedef uint8x16_t __m128i;
__m128i _mm_aesenc_si128(__m128i a, __m128i RoundKey)
{
    // AESE with a zero key, then AESMC, then XOR in the real round key
    return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
}
__m128i _mm_aesenclast_si128(__m128i a, __m128i RoundKey)
{
    // Same trick, minus the MixColumns step
    return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
}
This compiles down to the following assembly:
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3
000000000000003c <_mm_aesenc_si128>:
3c: 6f00e402 movi v2.2d, #0x0
40: 4e284840 aese v0.16b, v2.16b
44: 4e286800 aesmc v0.16b, v0.16b
48: 6e211c00 eor v0.16b, v0.16b, v1.16b
4c: d65f03c0 ret
0000000000000050 <_mm_aesenclast_si128>:
50: 6f00e402 movi v2.2d, #0x0
54: 4e284840 aese v0.16b, v2.16b
58: 6e211c00 eor v0.16b, v0.16b, v1.16b
5c: d65f03c0 ret
The main difference between the x86 and ARM semantics is that x86 does an XOR at the end of the AESENC and AESENCLAST instructions, while ARM does an XOR at the beginning of its AESE instruction. To get around this, the inline functions pass a key value of 0 to AESE and then manually XOR the real key at the end of the sequence. The problem with this is that it wastes the built-in XOR of the AESE instruction and requires an extra EOR (the ARM equivalent of XOR) instruction.
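For reference, here is a rough sketch of how the two instruction sets define their round primitives (my paraphrase, not the manuals' exact wording; SubBytes and ShiftRows commute, so their relative order doesn't matter):
/*
 * x86:  AESENC(state, key)     = MixColumns(ShiftRows(SubBytes(state))) ^ key
 * x86:  AESENCLAST(state, key) = ShiftRows(SubBytes(state)) ^ key
 * ARM:  AESE(state, key)       = ShiftRows(SubBytes(state ^ key))
 * ARM:  AESMC(state)           = MixColumns(state)
 *
 * With a zero key, AESE reduces to ShiftRows(SubBytes(state)), so:
 *   AESENC(state, key)     == AESMC(AESE(state, 0)) ^ key
 *   AESENCLAST(state, key) == AESE(state, 0) ^ key
 * which is exactly what the replacement functions above compute.
 */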
To demonstrate the performance penalty, consider the following two functions which implement three rounds (for simplicity) of AES encryption. The first one uses the x86 intrinsic functions, and the other implements the algorithm natively with ARM AES intrinsics.
__m128i aes_x86(__m128i data, __m128i k0, __m128i k1, __m128i k2, __m128i k3)
{
    data = data ^ k0;
    data = _mm_aesenc_si128(data, k1);
    data = _mm_aesenc_si128(data, k2);
    data = _mm_aesenclast_si128(data, k3);
    return data;
}
uint8x16_t aes_arm(uint8x16_t data, uint8x16_t k0, uint8x16_t k1, uint8x16_t k2, uint8x16_t k3)
{
    data = vaeseq_u8(data, k0);
    data = vaesmcq_u8(data);
    data = vaeseq_u8(data, k1);
    data = vaesmcq_u8(data);
    data = vaeseq_u8(data, k2);
    data = data ^ k3;
    return data;
}
The compiled code looks like this:
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3
0000000000000024 <aes_x86>:
24: 6e201c20 eor v0.16b, v1.16b, v0.16b
28: 6f00e401 movi v1.2d, #0x0
2c: 4e284820 aese v0.16b, v1.16b
30: 4e286800 aesmc v0.16b, v0.16b
34: 6e221c00 eor v0.16b, v0.16b, v2.16b
38: 4e284820 aese v0.16b, v1.16b
3c: 4e286800 aesmc v0.16b, v0.16b
40: 6e231c00 eor v0.16b, v0.16b, v3.16b
44: 4e284820 aese v0.16b, v1.16b
48: 6e241c00 eor v0.16b, v0.16b, v4.16b
4c: d65f03c0 ret
0000000000000050 <aes_arm>:
50: 4e284820 aese v0.16b, v1.16b
54: 4e286800 aesmc v0.16b, v0.16b
58: 4e284840 aese v0.16b, v2.16b
5c: 4e286800 aesmc v0.16b, v0.16b
60: 4e284860 aese v0.16b, v3.16b
64: 6e241c00 eor v0.16b, v0.16b, v4.16b
68: d65f03c0 ret
As you can see, the function built from the x86 intrinsics is noticeably longer than the optimized ARM implementation: 10 instructions versus 6, not counting the return. The compiler doesn’t seem to realize that an EOR feeding a zero-key AESE can be folded into the AESE itself, so it keeps them separate. Maybe this is something that can be fixed in the compiler?
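In intrinsic terms, the rewrite we would like the compiler to perform is roughly the following identity (a sketch, not actual compiler code), which holds because AESE XORs its two inputs together before doing the byte substitutions:
/* An XOR feeding a zero-key AESE can be folded into the AESE itself: */
/*   vaeseq_u8(data ^ key, (uint8x16_t){})  ==>  vaeseq_u8(data, key)  */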
LLVM Optimization Passes
If you haven’t noticed yet, I’m a big fan of the LLVM compiler project. It has been around for about 15 years, is well designed, and has fairly comparable performance to GCC (with GCC having a slight advantage). One of the big differences from GCC is that LLVM has a permissive open source license, which encourages large tech companies like Apple, Google and Qualcomm (my current employer) to work on it.
I don’t have much of a background in compilers (sadly, I skipped that course in college), but I have developed a bit of an interest over the past couple of years. I frequently compile small snippets of C code to see what instructions the compiler decides to use in that situation (much like I did in the previous section). Every now and then, I find an excuse to dig around in the LLVM source code to see why the compiler is doing something specific. This is another one of those cases.
If you don’t know much about LLVM, here is a little bit of background information. Compilation in LLVM is split into three phases. The first phase is the "frontend", which is responsible for transforming a high-level language into the low-level intermediate representation (IR) format. The IR is kind of like a generic assembly language, with types, infinite registers, and a nice text representation. There are many frontends which work with LLVM, but the one I care about for this blog post is clang, which handles the C, C++ and Objective-C languages. The second phase is the "optimizer", which operates on IR and attempts to simplify it so that it runs faster and/or takes up less space. The optimizer doesn’t need to know anything about the original source language or the target architecture, because it only has to work with IR. The final phase is the "backend", which is responsible for turning IR into machine instructions for the target architecture. It decides how to allocate registers and how to schedule instructions so they run well on that particular processor. In this case, I am targeting the ARMv8-A architecture, but there are backends for many other architectures.
Typically, the best place to add a new optimization is the mid-level optimizer, since that can potentially benefit all frontends and all backends. To gain some insight into what is going on internally, clang can be instructed to stop between the mid-level optimizer and the backend and display the IR. The options -S -emit-llvm will cause clang to output a .ll file with the text representation of the IR.
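As an aside, the pipeline can also be driven by hand from the command line; the file names below are placeholders of my own, and the llc flags are my approximation of the backend half of the commands used elsewhere in this post:
# Frontend + mid-level optimizer: C source -> optimized textual IR (.ll)
clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3 -S -emit-llvm aes.c -o aes.ll
# Backend only: hand the IR to llc to get AArch64 assembly
llc -mtriple=aarch64-none-linux -mattr=+crypto aes.ll -o aes.s
The IR produced for the two test functions is shown below.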
; clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3 -S -emit-llvm
; Function Attrs: nounwind readnone
define <16 x i8> @aes_x86(<16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>) local_unnamed_addr #0 {
%6 = xor <16 x i8> %1, %0
%7 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %6, <16 x i8> zeroinitializer) #3
%8 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %7) #3
%9 = xor <16 x i8> %8, %2
%10 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %9, <16 x i8> zeroinitializer) #3
%11 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %10) #3
%12 = xor <16 x i8> %11, %3
%13 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %12, <16 x i8> zeroinitializer) #3
%14 = xor <16 x i8> %13, %4
ret <16 x i8> %14
}
; Function Attrs: nounwind readnone
define <16 x i8> @aes_arm(<16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>) local_unnamed_addr #0 {
%6 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %0, <16 x i8> %1) #3
%7 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %6) #3
%8 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %7, <16 x i8> %2) #3
%9 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %8) #3
%10 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %9, <16 x i8> %3) #3
%11 = xor <16 x i8> %10, %4
ret <16 x i8> %11
}
; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8>) #2
; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8>, <16 x i8>) #2
The IR in this case looks very similar to the final ARM assembly, which is promising. The first instruction in the @aes_x86 function XORs registers %1 and %0 and stores the result in register %6. The next instruction takes register %6 and a zero constant, performs the AESE intrinsic (represented as a call to the @llvm.aarch64.crypto.aese function), and stores the result in %7. This seems ripe for an optimization, since these two instructions can be combined into something like the first line of the @aes_arm function.
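Concretely, the combine I want would take the first two instructions of @aes_x86 and fold the XOR operands straight into the AESE call, leaving the xor dead to be cleaned up afterwards. Something like this (a sketch, not actual optimizer output):
; before
%6 = xor <16 x i8> %1, %0
%7 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %6, <16 x i8> zeroinitializer)

; after (sketch)
%7 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %1, <16 x i8> %0)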
Teaching an Old Compiler New Tricks
At this point, I needed a little bit of help. I knew that the mid-level optimizer is split into several passes which handle different types of optimizations, and that the InstCombine pass sounded applicable in this situation. However, this is the largest LLVM optimization pass and I wasn’t sure where to start. Fortunately, Chad Rosier was kind enough to point me to the InstCombiner::visitCallInst() function in the InstCombineCalls.cpp file. Chad leads one of the LLVM teams at Qualcomm and has been extremely helpful when I had LLVM questions in the past.

The visitCallInst() function contains a large switch statement for handling optimizations related to intrinsics. All I had to do was add some case statements for the AESE and AESD intrinsics, write some code to match my pattern, and then rewrite the intrinsic call. This is what the final code looked like:
switch (II->getIntrinsicID()) {
// ....
case Intrinsic::arm_neon_aesd:
case Intrinsic::arm_neon_aese:
case Intrinsic::aarch64_crypto_aesd:
case Intrinsic::aarch64_crypto_aese: {
  Value *DataArg = II->getArgOperand(0);
  Value *KeyArg = II->getArgOperand(1);

  // Try to use the builtin XOR in AESE and AESD to eliminate a prior XOR
  Value *Data, *Key;
  if (match(KeyArg, m_ZeroInt()) &&
      match(DataArg, m_Xor(m_Value(Data), m_Value(Key)))) {
    II->setArgOperand(0, Data);
    II->setArgOperand(1, Key);
    return II;
  }
  break;
}
In this code, II is the pointer to the instruction being optimized, and the switch statement matches if the intrinsic type is one of the AES intrinsics for 32-bit or 64-bit ARM. I first extract the two arguments of the intrinsic as the DataArg and KeyArg values. I then use LLVM’s pattern matching library to check whether KeyArg is a zero constant and whether DataArg comes from the result of an XOR operation. If so, the XOR operands are extracted into the Data and Key pointers, and those replace the arguments to the intrinsic.
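As a side note, a convenient way to watch the combine fire in isolation is to run just the InstCombine pass over the IR from earlier with opt (using the legacy pass syntax that applied at the time); the file names are again placeholders:
# Run only InstCombine over the IR and print the transformed IR
opt -S -instcombine aes.ll -o aes_combined.ll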
That’s all that was needed to implement the optimization. With Chad’s help, I also posted the patch to the community for review and it was accepted, so it will arrive in LLVM 7.
Performance
Let’s take a moment to discuss the performance gain from this optimization, since that is what really matters at the end of the day. There were two things I wanted to look at: first, how the latency changes when doing back-to-back AES rounds on a single block, and second, how the throughput changes when doing AES on multiple blocks in parallel (on a single core). Here is the test code that I wrote, which uses a C++ template so the number of parallel blocks can be easily changed:
template <int num_aes>
void test_aes_rounds(uint8x16_t * data, uint8x16_t key, uint64_t num_iterations)
{
    uint8x16_t temp_data[num_aes];

    // XOR the key with each data block
    for (int aes_inst = 0; aes_inst < num_aes; aes_inst++)
    {
        temp_data[aes_inst] = data[aes_inst] ^ key;
    }

    for (uint64_t i = 0; i <= num_iterations; i++)
    {
        // Perform a normal round of AES encryption on each data block
        for (int aes_inst = 0; aes_inst < num_aes; aes_inst++)
        {
            temp_data[aes_inst] = _mm_aesenc_si128(temp_data[aes_inst], key);
        }
    }

    // Perform the last round of AES encryption on each data block
    for (int aes_inst = 0; aes_inst < num_aes; aes_inst++)
    {
        data[aes_inst] = _mm_aesenclast_si128(temp_data[aes_inst], key);
    }
}
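For reference, a driver for this test might look like the following; the block count, key bytes, and iteration count here are placeholders, not the values behind the measurements later in this post:
#include <arm_neon.h>
#include <stdint.h>

int main()
{
    // Hypothetical driver: two blocks in parallel, a single round key reused
    // for every round, matching what test_aes_rounds above expects.
    uint8x16_t blocks[2] = {};
    uint8x16_t key = vdupq_n_u8(0x42);
    test_aes_rounds<2>(blocks, key, 100000000);
    return 0;
}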
To my pleasant surprise, LLVM was able to run my optimization on all the AESE instructions in this function, even when the preceding XORs were outside of the main loop. Here is the compiled output for test_aes_rounds<2>():
# clang-7.0 -target aarch64-none-linux -march=armv8+crypto -O3
0000000000000088 <_Z5test2P12__Uint8x16_tS_m>:
88: ad400801 ldp q1, q2, [x0]
8c: aa1f03e8 mov x8, xzr
90: 91000508 add x8, x8, #0x1
94: 4e284801 aese v1.16b, v0.16b
98: 4e286821 aesmc v1.16b, v1.16b
9c: eb01011f cmp x8, x1
a0: 4e284802 aese v2.16b, v0.16b
a4: 4e286842 aesmc v2.16b, v2.16b
a8: 54ffff49 b.ls 90 <_Z5test2P12__Uint8x16_tS_m+0x8> // b.plast
ac: 4e284801 aese v1.16b, v0.16b
b0: 4e284802 aese v2.16b, v0.16b
b4: 6e201c21 eor v1.16b, v1.16b, v0.16b
b8: 6e201c40 eor v0.16b, v2.16b, v0.16b
bc: ad000001 stp q1, q0, [x0]
c0: d65f03c0 ret
(Check out the full compiled result on Godbolt)
I compiled this code with both clang-6.0 and a nightly version of clang-7.0, with num_aes set to various sizes. I then ran the code on an unspecified ARMv8-A processor for an unspecified number of iterations and measured how long it took in unspecified time units. Here are the results:
num_aes | clang-6.0 | clang-7.0
--------|-----------|----------
      1 |        30 |        25
      2 |        30 |        30
      4 |        50 |        40
      8 |       100 |        60
The results show a modest performance benefit from the optimization at low values of num_aes, and more significant gains at higher values of num_aes.
Final Thoughts
The optimization I wrote was for a very specific case, though it seems to be effective for my application. Still, I can think of a few more general optimizations that could be done, if the benefit is worth the cost. For example, I only match the case where the second argument to AESE or AESD is a constant zero and the first comes from an XOR expression, but it would be equally valid if those arguments were swapped. I could also write a pattern matching rule for the case where one of the arguments is a non-zero constant and one of the XOR operands is a constant, which would let me combine the constants.
Finally, I could imagine some more advanced optimizations where, if both arguments to AESE or AESD came from XOR operations, the operands could be moved from one side to the other to simplify the calculation. This one might be tricky, though, since there would be situations where it makes things better and others where it makes things worse. LLVM handles these scenarios for regular arithmetic operations with its "reassociate" pass, so in theory it could be extended to handle the AESE and AESD instructions as well. However, I suspect that this would be an expensive change in a hot spot of the code, and it would be a difficult sell to get that change accepted in the LLVM community.