Optimizing x86 AES intrinsics on ARMv8-A

Fixing performance issues from emulated x86 intrinsics

In a prior post, I wrote about emulating x86 intrinsics on ARMv8-A by implementing replacement inline functions with ARM intrinsics. The ARM AES instructions have slightly different semantics from the x86 instructions, so it took some tricks to get them to match. The benefit of this approach is that algorithms written with the x86 intrinsics can be cross-compiled to ARM without having to be rewritten. The compiler does most of the work!

However, a drawback that has been bothering me for the past few weeks is that the tricks used to match the x86 semantics carry a performance penalty. What I mean is that when implementing multiple rounds of AES with the emulated x86 intrinsics, the compiler generates suboptimal code; it would be more efficient to use the native ARM intrinsics directly. Still, I really didn’t want to rewrite the algorithm by hand, so I spent some more time to see what else could be done.

The Semantic Performance Penalty

Let’s review the x86 AESENC and AESENCLAST replacement functions from last time:

#include <stdint.h>
#include <arm_neon.h>

typedef uint8x16_t __m128i;

__m128i _mm_aesenc_si128(__m128i a, __m128i RoundKey)
{
    return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
}

__m128i _mm_aesenclast_si128(__m128i a, __m128i RoundKey)
{
    return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
}

Which compiles down to the following assembly:

# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3 

000000000000003c <_mm_aesenc_si128>:
  3c:   6f00e402        movi    v2.2d, #0x0
  40:   4e284840        aese    v0.16b, v2.16b
  44:   4e286800        aesmc   v0.16b, v0.16b
  48:   6e211c00        eor     v0.16b, v0.16b, v1.16b
  4c:   d65f03c0        ret

0000000000000050 <_mm_aesenclast_si128>:
  50:   6f00e402        movi    v2.2d, #0x0
  54:   4e284840        aese    v0.16b, v2.16b
  58:   6e211c00        eor     v0.16b, v0.16b, v1.16b
  5c:   d65f03c0        ret

The main difference between the x86 and ARM semantics is that x86 does an XOR at the end of the AESENC and AESENCLAST instructions, while ARM does an XOR at the beginning of its AESE instruction. To get around this, the inline functions pass a key value of 0 to AESE and then manually XOR the real key at the end of the sequence. The problem is that this wastes the built-in XOR of the AESE instruction and requires an extra EOR (the ARM equivalent of XOR) instruction.
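
Written out as rough pseudocode (SubBytes, ShiftRows and MixColumns are the textbook AES steps, not real functions; SubBytes and ShiftRows commute, so their relative order doesn’t matter here), the two instruction sets look like this:

// x86: the round key is XORed in at the end of the instruction
AESENC(state, key)     = MixColumns(SubBytes(ShiftRows(state))) ^ key
AESENCLAST(state, key) = SubBytes(ShiftRows(state)) ^ key

// ARM: the round key is XORed in at the beginning, and MixColumns
// is a separate instruction
AESE(state, key) = SubBytes(ShiftRows(state ^ key))
AESMC(state)     = MixColumns(state)

Passing a zero key to AESE leaves only the SubBytes and ShiftRows steps, which is why the emulation above works, at the cost of one extra XOR per round.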

To demonstrate the performance penalty, consider the following two functions which implement three rounds (for simplicity) of AES encryption. The first one uses the x86 intrinsic functions, and the other implements the algorithm natively with ARM AES intrinsics.

__m128i aes_x86(__m128i data, __m128i k0, __m128i k1, __m128i k2, __m128i k3)
{
    data = data ^ k0;
    data = _mm_aesenc_si128(data, k1);
    data = _mm_aesenc_si128(data, k2);
    data = _mm_aesenclast_si128(data, k3);
    return data;
}

uint8x16_t aes_arm(uint8x16_t data, uint8x16_t k0, uint8x16_t k1, uint8x16_t k2, uint8x16_t k3)
{
    data = vaeseq_u8(data, k0);
    data = vaesmcq_u8(data);
    data = vaeseq_u8(data, k1);
    data = vaesmcq_u8(data);
    data = vaeseq_u8(data, k2);
    data = data ^ k3;
    return data;
}

The compiled code looks like this:

# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3

0000000000000024 <aes_x86>:
  24:   6e201c20        eor     v0.16b, v1.16b, v0.16b
  28:   6f00e401        movi    v1.2d, #0x0
  2c:   4e284820        aese    v0.16b, v1.16b
  30:   4e286800        aesmc   v0.16b, v0.16b
  34:   6e221c00        eor     v0.16b, v0.16b, v2.16b
  38:   4e284820        aese    v0.16b, v1.16b
  3c:   4e286800        aesmc   v0.16b, v0.16b
  40:   6e231c00        eor     v0.16b, v0.16b, v3.16b
  44:   4e284820        aese    v0.16b, v1.16b
  48:   6e241c00        eor     v0.16b, v0.16b, v4.16b
  4c:   d65f03c0        ret

0000000000000050 <aes_arm>:
  50:   4e284820        aese    v0.16b, v1.16b
  54:   4e286800        aesmc   v0.16b, v0.16b
  58:   4e284840        aese    v0.16b, v2.16b
  5c:   4e286800        aesmc   v0.16b, v0.16b
  60:   4e284860        aese    v0.16b, v3.16b
  64:   6e241c00        eor     v0.16b, v0.16b, v4.16b
  68:   d65f03c0        ret

As you can see, the function built from the x86 intrinsics needs 10 instructions (excluding the return), compared to 6 for the native ARM implementation. The compiler doesn’t seem to realize that the EOR instruction can be folded into the AESE instruction, so it keeps them separate. Maybe this is something that can be fixed in the compiler?

LLVM Optimization Passes

If you haven’t noticed yet, I’m a big fan of the LLVM compiler project. It has been around for about 15 years, is well designed, and delivers fairly comparable performance to GCC (with GCC having a slight advantage). One of the big differences from GCC is that LLVM has a permissive open source license, which encourages large tech companies like Apple, Google and Qualcomm (my current employer) to work on it.

I don’t have much of a background in compilers (I sadly skipped that course in college), but I have developed a bit of an interest over the past couple of years. I frequently compile small snippets of C code to see which instructions the compiler chooses in a given situation (much like I did in the previous section). Every now and then, I find an excuse to dig around in the LLVM source code to see why the compiler is doing something specific. This is another one of those cases.

If you don’t know much about LLVM, here is a little bit of background information. Compilation in LLVM is split into three phases. The first phase is the "frontend", which is responsible for transforming a high-level language into the low-level intermediate representation (IR) format. The IR is kind of like a generic assembly language, with types, an unlimited number of registers, and a nice text representation. There are many frontends that work with LLVM, but the one I care about for this blog post is clang, which handles the C, C++ and Objective-C languages. The second phase is the "optimizer", which operates on IR and attempts to simplify it so that it runs faster and/or takes up less space. The optimizer doesn’t need to know anything about the original source language or the target architecture, because it works entirely on IR. The final phase is the "backend", which is responsible for turning IR into machine instructions for the target architecture. It decides how to allocate registers and how to schedule instructions so they run well on that particular processor. In this case, I am targeting the ARMv8-A architecture, but there are backends for many other architectures.

Typically, the best place to add a new optimization is the mid-level optimizer, since that can potentially benefit all frontends and all backends. To gain some insight into what is going on internally, clang can be instructed to stop after the mid-level optimizer, before the backend runs, and display the IR. The options -S -emit-llvm cause clang to output a .ll file containing the text representation of the IR.

; clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3 -S -emit-llvm

; Function Attrs: nounwind readnone
define <16 x i8> @aes_x86(<16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>) local_unnamed_addr #0 {
  %6 = xor <16 x i8> %1, %0
  %7 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %6, <16 x i8> zeroinitializer) #3
  %8 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %7) #3
  %9 = xor <16 x i8> %8, %2
  %10 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %9, <16 x i8> zeroinitializer) #3
  %11 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %10) #3
  %12 = xor <16 x i8> %11, %3
  %13 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %12, <16 x i8> zeroinitializer) #3
  %14 = xor <16 x i8> %13, %4
  ret <16 x i8> %14
}

; Function Attrs: nounwind readnone
define <16 x i8> @aes_arm(<16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>) local_unnamed_addr #0 {
  %6 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %0, <16 x i8> %1) #3
  %7 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %6) #3
  %8 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %7, <16 x i8> %2) #3
  %9 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %8) #3
  %10 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %9, <16 x i8> %3) #3
  %11 = xor <16 x i8> %10, %4
  ret <16 x i8> %11
}

; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8>) #2

; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8>, <16 x i8>) #2

The IR in this case looks very similar to the final ARM assembly, which is promising. The first instruction in the @aes_x86 function XORs registers %1 and %0 and stores the result in register %6. The next instruction takes register %6 and a zero constant, performs the AESE intrinsic (represented as the @llvm.aarch64.crypto.aese function), and stores the result in %7. This seems ripe for an optimization, since these two instructions can be combined into something like the first line of the @aes_arm function.

Teaching an Old Compiler New Tricks

At this point, I needed a little bit of help. I knew that the mid-level optimizer is split into several passes which handle different types of optimizations, and that the InstCombine pass sounded applicable in this situation. However, this is the largest LLVM optimization pass and I wasn’t sure where to start. Fortunately, Chad Rosier was kind enough to point me to the InstCombiner::visitCallInst() function in the InstCombineCalls.cpp file. Chad leads one of the LLVM teams at Qualcomm and has been extremely helpful when I have had LLVM questions in the past.

The visitCallInst() function contains a large switch statement for handling optimizations related to intrinsics. All I had to do was add some case statements for the AESE and AESD intrinsics, write some code to match my pattern, and then rewrite the intrinsic call. This is what the final code looked like:

  switch (II->getIntrinsicID()) {
  // ....
  case Intrinsic::arm_neon_aesd:
  case Intrinsic::arm_neon_aese:
  case Intrinsic::aarch64_crypto_aesd:
  case Intrinsic::aarch64_crypto_aese: {
    Value *DataArg = II->getArgOperand(0);
    Value *KeyArg  = II->getArgOperand(1);

    // Try to use the builtin XOR in AESE and AESD to eliminate a prior XOR
    Value *Data, *Key;
    if (match(KeyArg, m_ZeroInt()) &&
        match(DataArg, m_Xor(m_Value(Data), m_Value(Key)))) {
      II->setArgOperand(0, Data);
      II->setArgOperand(1, Key);
      return II;
    }
    break;
  }

In this code, II is the pointer to the intrinsic call that is being optimized, and the switch statement matches if the intrinsic ID corresponds to one of the AES intrinsics for 32-bit or 64-bit ARM. I first extract the two arguments of the intrinsic as the DataArg and KeyArg values. I then use LLVM’s pattern matching library to check whether KeyArg is a zero constant and whether DataArg is the result of an XOR operation. If so, the XOR operands are extracted into the Data and Key pointers, and those replace the arguments to the intrinsic.
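
To make that concrete, here is roughly what the transform does to the first round of @aes_x86 (a hand-written sketch of the IR, not actual pass output):

  ; before
  %6 = xor <16 x i8> %1, %0
  %7 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %6, <16 x i8> zeroinitializer)

  ; after
  %7 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %1, <16 x i8> %0)

The aese call was the only user of the xor, so the now-dead xor instruction gets cleaned up by a later cleanup step.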

That’s all that was needed to handle the optimization. With Chad’s help, I also posted the patch to the community for review and it was accepted, so it will arrive in LLVM 7.

Performance

Let’s take a moment to discuss the performance gain from this optimization, since that is what really matters at the end of the day. There were two things that I wanted to look at. The first is how the latency changes when doing back-to-back AES rounds on a single block, and the second is how the throughput changes when doing AES on multiple blocks in parallel (on a single core). Here is the test code that I wrote; it uses a C++ template so that the number of parallel blocks can be easily changed:

template <int num_aes>
void test_aes_rounds(uint8x16_t * data, uint8x16_t key, uint64_t num_iterations)
{
    uint8x16_t temp_data[num_aes];

    // XOR the key with each data block
    for (int aes_inst = 0; aes_inst < num_aes; aes_inst++)
    {
        temp_data[aes_inst] = data[aes_inst] ^ key;
    }

    for (uint64_t i = 0; i <= num_iterations; i++)
    {
        // Perform a normal round of AES encryption on each data block
        for (int aes_inst = 0; aes_inst < num_aes; aes_inst++)
        {
            temp_data[aes_inst] = _mm_aesenc_si128(temp_data[aes_inst], key);
        }
    }

    // Perform the last round of AES encryption on each data block
    for (int aes_inst = 0; aes_inst < num_aes; aes_inst++)
    {
        data[aes_inst] = _mm_aesenclast_si128(temp_data[aes_inst], key);
    }
}
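
The timing harness isn’t shown here, but each test is just an instantiation of the template with the desired block count, along the lines of the following (hypothetical values, timing code omitted):

uint8x16_t blocks[4] = {};   // data blocks to encrypt (all zeros, just for illustration)
uint8x16_t key = {};         // round key (also zero, just for illustration)
test_aes_rounds<4>(blocks, key, 1000000);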

To my pleasant surprise, LLVM was able to apply my optimization to all the AESE instructions in this function, even when the preceding XORs were outside of the main loop. Here is the compiled output for test_aes_rounds<2>():

# clang-7.0 -target aarch64-none-linux -march=armv8+crypto -O3

0000000000000088 <_Z5test2P12__Uint8x16_tS_m>:
  88:   ad400801        ldp     q1, q2, [x0]
  8c:   aa1f03e8        mov     x8, xzr
  90:   91000508        add     x8, x8, #0x1
  94:   4e284801        aese    v1.16b, v0.16b
  98:   4e286821        aesmc   v1.16b, v1.16b
  9c:   eb01011f        cmp     x8, x1
  a0:   4e284802        aese    v2.16b, v0.16b
  a4:   4e286842        aesmc   v2.16b, v2.16b
  a8:   54ffff49        b.ls    90 <_Z5test2P12__Uint8x16_tS_m+0x8>  // b.plast
  ac:   4e284801        aese    v1.16b, v0.16b
  b0:   4e284802        aese    v2.16b, v0.16b
  b4:   6e201c21        eor     v1.16b, v1.16b, v0.16b
  b8:   6e201c40        eor     v0.16b, v2.16b, v0.16b
  bc:   ad000001        stp     q1, q0, [x0]
  c0:   d65f03c0        ret

(Check out the full compiled result on Godbolt)

I compiled this code with both clang-6.0 and a nightly version of clang-7.0 with num_aes set to various sizes. I then ran the code on an unspecified ARMv8-A processor for an unspecified number of iterations and measured how long it took in unspecified time units. Here are the results:

num_aes   clang-6.0   clang-7.0
   1          30          25
   2          30          30
   4          50          40
   8         100          60

The results show a modest performance benefit from the optimization at low values of num_aes and more significant gains for higher values of num_aes.

Final Thoughts

The optimization I wrote was for a very specific case, though it seems to be effective for my application. Still, I can think of a few more general optimizations that could be done, if the benefit is worth the cost. For example, I only match the case when the second argument to AESE or AESD is a constant zero, and the first comes from an XOR expression, but it would be equally valid if the order of those arguments were switched. I could also write a pattern matching rule which matches the case where one of the arguments is a non-zero constant and one of the XOR arguments is a constant, which would let me combine the constants.
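
For example, the swapped-argument case could probably be handled by one more pattern in the same case block, something along these lines (a rough, untested sketch in the style of the patch above):

    // Hypothetical extension: also match when the zero constant is the data
    // operand and the XOR feeds the key operand.
    if (match(DataArg, m_ZeroInt()) &&
        match(KeyArg, m_Xor(m_Value(Data), m_Value(Key)))) {
      II->setArgOperand(0, Data);
      II->setArgOperand(1, Key);
      return II;
    }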

Finally, I could imagine some more advanced optimizations where, if both arguments to AESE or AESD came from XOR operations, the operands could be moved from one side to the other to simplify the calculations. This one might be tricky though, since there would be situations where this makes things better and others where it makes things worse. LLVM handles these scenarios for regular arithmetic operations with its "reassociate" pass, so in theory it could be extended to handle the AESE and AESD instructions as well. However, I suspect that this would be an expensive change in a hot spot of the code, and it would be a difficult sell to get it accepted by the LLVM community.