Emulating x86 AES Intrinsics on ARMv8-A

Recently I needed to port some C encryption code to run on an ARMv8-A (aarch64) processor. The problem is that the code uses some x86 AES intrinsics, which the compiler doesn’t recognize when targeting the ARM architecture. ARMv8-A does have an optional Cryptographic Extension, which includes several AES instructions, but they have slightly different semantics from the x86 instructions. I don’t have much experience with AES, and initially found this very confusing, since I had assumed that all AES implementations would need to work the same way (it is a standard, after all!). It turns out that both instruction sets are sufficient for implementing AES, but they choose to break up the problem in different ways.

Background information on AES

The Advanced Encryption Standard (AES) is a common symmetric encryption algorithm that uses a secret key to encrypt and decrypt data. AES encrypts 16 bytes at a time and uses key sizes of 128, 192, or 256 bits. The 16-byte data block is transformed by repeating a sequence of steps called rounds. The order of the steps is fixed, but the number of rounds varies with the key size. For example, the standard states that for a 128-bit key size, 10 rounds are used.

The AES steps are defined as operations on a 4x4 matrix holding the 16 data bytes, which looks like this:

|b0  b4  b8  b12|
|b1  b5  b9  b13|
|b2  b6  b10 b14|
|b3  b7  b11 b15|

...where bN is the Nth byte in the 16-byte data block.

A round is defined as the following steps:

  1. SubBytes – Substitutes each byte with another byte using an invertible lookup table (the S-box)
  2. ShiftRows – Rotates the bytes in each row of the matrix by a different amount
  3. MixColumns – Mixes the data by combining the four bytes of each column
  4. AddRoundKey – XORs each byte of the matrix with a value derived from the key
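
To make the round structure concrete, here is a minimal sketch of one full AES-128 encryption in C, assuming hypothetical helper functions for each step (definitions not shown) and a precomputed key schedule. Note the initial AddRoundKey before the first round and the missing MixColumns in the last round; both show up again in the instruction comparison below.

#include <stdint.h>

/* Hypothetical helpers, one per AES step (definitions not shown) */
void sub_bytes(uint8_t state[16]);
void shift_rows(uint8_t state[16]);
void mix_columns(uint8_t state[16]);
void add_round_key(uint8_t state[16], const uint8_t key[16]);

void aes128_encrypt_block(uint8_t state[16], const uint8_t round_keys[11][16])
{
    add_round_key(state, round_keys[0]);      // initial key whitening
    for (int round = 1; round <= 9; round++) {
        sub_bytes(state);
        shift_rows(state);
        mix_columns(state);
        add_round_key(state, round_keys[round]);
    }
    sub_bytes(state);                         // the final round omits MixColumns
    shift_rows(state);
    add_round_key(state, round_keys[10]);
}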

Comparison of AES encryption in Intel vs ARM

Intel provides two AES instructions in x86 for encryption, which match up closely to the AES rounds:

  1. AESENC – AES Encrypt (Normal Round)
    a. ShiftRows
    b. SubBytes
    c. MixColumns
    d. AddRoundKey
  2. AESENCLAST – AES Encrypt (Last Round, No MixColumns)
    a. ShiftRows
    b. SubBytes
    c. AddRoundKey

(You may notice that the ShiftRows and SubBytes steps are swapped from the AES formal definition. This is fine: SubBytes substitutes each byte independently of its position, so performing the permutation before or after the substitution produces the same result.)

ARM also provides two AES instructions for encryption, but blurs the lines between different rounds a little bit:

  1. AESE – AES Encrypt (AddRoundKey is first, No MixColumns)
    a. AddRoundKey
    b. ShiftRows
    c. SubBytes
  2. AESMC – AES MixColumns
    a. MixColumns

(See the ARM Architecture Manual for instruction details)

Here is how three rounds of AES encryption would be implemented with Intel and ARM:

Round     Intel        AES Steps     ARM
Round 1   XOR          AddRoundKey   AESE
          AESENC       SubBytes
                       ShiftRows
                       MixColumns    AESMC
                       AddRoundKey   AESE
Round 2   AESENC       SubBytes
                       ShiftRows
                       MixColumns    AESMC
                       AddRoundKey   AESE
Round 3   AESENCLAST   SubBytes
                       ShiftRows
                       AddRoundKey   XOR
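
The table translates directly into code. Purely as an illustration (a sketch, not the code being ported), here is how a full AES-128 encryption could look with each architecture’s native intrinsics, assuming an 11-entry round-key array rk:

// x86 version (needs <wmmintrin.h> and -maes)
__m128i encrypt_block_x86(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);        // initial AddRoundKey
    for (int i = 1; i <= 9; i++)
        block = _mm_aesenc_si128(block, rk[i]); // rounds 1-9
    return _mm_aesenclast_si128(block, rk[10]); // round 10, no MixColumns
}

// ARM version (needs <arm_neon.h> and +crypto)
uint8x16_t encrypt_block_arm(uint8x16_t block, const uint8x16_t rk[11])
{
    for (int i = 0; i <= 8; i++)                // AESE folds in the next AddRoundKey
        block = vaesmcq_u8(vaeseq_u8(block, rk[i]));
    block = vaeseq_u8(block, rk[9]);            // last round, no MixColumns
    return veorq_u8(block, rk[10]);             // final AddRoundKey
}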

Implementing AESENC with ARM instructions

I wanted to avoid rewriting the algorithm that I was porting, so I decided to stick with the Intel semantics, and reimplement the x86 intrinsics with ARM NEON intrinsics and GCC vector extensions. The intrinsics for the x86 AESENC and AESENCLAST instructions have these prototypes:

__m128i _mm_aesenc_si128 (__m128i a, __m128i RoundKey);
__m128i _mm_aesenclast_si128 (__m128i a, __m128i RoundKey);

The first step is to define an equivalent type for __m128i. I mapped this to the NEON type uint8x16_t:

#include <stdint.h>
#include <arm_neon.h>

typedef uint8x16_t __m128i;

Next, I had to come up with a sequence of ARM instructions that can be used to emulate the x86 semantics. Using AESE+AESMC+XOR gets us close, except that the ARM AESE has an extra AddRoundKey at the beginning that is not present in the x86 AESENC. However, since AddRoundKey simply XORs the key with the data, a key value of zero turns the step into a no-op. Here is the final implementation:

__m128i _mm_aesenc_si128 (__m128i a, __m128i RoundKey)
{
    return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
}

__m128i _mm_aesenclast_si128 (__m128i a, __m128i RoundKey)
{
    return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
}
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3

000000000000003c <_mm_aesenc_si128>:
  3c:   6f00e402        movi    v2.2d, #0x0
  40:   4e284840        aese    v0.16b, v2.16b
  44:   4e286800        aesmc   v0.16b, v0.16b
  48:   6e211c00        eor     v0.16b, v0.16b, v1.16b
  4c:   d65f03c0        ret

0000000000000050 <_mm_aesenclast_si128>:
  50:   6f00e402        movi    v2.2d, #0x0
  54:   4e284840        aese    v0.16b, v2.16b
  58:   6e211c00        eor     v0.16b, v0.16b, v1.16b
  5c:   d65f03c0        ret

Implementing AESDEC with ARM instructions

There are two ways to decrypt AES. The first is called the “Inverse Cipher”, and it simply reverses the order of the steps used for encryption. Instead of ShiftRows, SubBytes, and MixColumns, the inverse functions InvShiftRows, InvSubBytes, and InvMixColumns are used. The second method is a technique called the “Equivalent Inverse Cipher”, which derives the decryption round keys differently, but in exchange allows the decryption steps to keep the same ordering as encryption, which is faster to implement in hardware. The x86 and ARMv8-A AES instructions are designed to be used with the second decryption algorithm. You can read more about it in Intel’s whitepaper.
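
Concretely, the decryption round keys for the Equivalent Inverse Cipher are derived from the encryption key schedule by running the middle keys through InvMixColumns. Here is a sketch of the standard x86 pattern for AES-128, assuming an 11-entry encryption schedule (on ARM, vaesimcq_u8 performs the same transform):

void derive_decrypt_keys(const __m128i enc_rk[11], __m128i dec_rk[11])
{
    dec_rk[0] = enc_rk[10];                           // first and last keys swap places
    for (int i = 1; i <= 9; i++)
        dec_rk[i] = _mm_aesimc_si128(enc_rk[10 - i]); // InvMixColumns on the middle keys
    dec_rk[10] = enc_rk[0];
}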

Intel provides the AESDEC and AESDECLAST instructions in x86 to aid with AES decryption, while ARM provides the AESD and AESIMC instructions. Just as with encryption, these instructions have slightly different semantics than their counterparts on the other architecture. Fortunately, it is still possible to replace the Intel intrinsics with a sequence of ARM intrinsics.

__m128i _mm_aesdec_si128 (__m128i a, __m128i RoundKey)
{
    return vaesimcq_u8(vaesdq_u8(a, (__m128i){})) ^ RoundKey;
}

__m128i _mm_aesdeclast_si128 (__m128i a, __m128i RoundKey)
{
    return vaesdq_u8(a, (__m128i){}) ^ RoundKey;
}
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3

0000000000000060 <_mm_aesdec_si128>:
  60:   6f00e402        movi    v2.2d, #0x0
  64:   4e285840        aesd    v0.16b, v2.16b
  68:   4e287800        aesimc  v0.16b, v0.16b
  6c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
  70:   d65f03c0        ret

0000000000000074 <_mm_aesdeclast_si128>:
  74:   6f00e402        movi    v2.2d, #0x0
  78:   4e285840        aesd    v0.16b, v2.16b
  7c:   6e211c00        eor     v0.16b, v0.16b, v1.16b
  80:   d65f03c0        ret
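
With both intrinsics emulated, a decryption loop keeps the usual x86 shape. This is a sketch, assuming dec_rk[] holds Equivalent Inverse Cipher keys derived as shown earlier:

__m128i decrypt_block(__m128i block, const __m128i dec_rk[11])
{
    block ^= dec_rk[0];                             // initial AddRoundKey
    for (int i = 1; i <= 9; i++)
        block = _mm_aesdec_si128(block, dec_rk[i]); // rounds 1-9
    return _mm_aesdeclast_si128(block, dec_rk[10]); // round 10, no InvMixColumns
}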

Implementing AESKEYGENASSIST with ARM instructions

One part of AES that I glossed over is how the round key is generated for the AddRoundKey step. The standard defines an algorithm, and Intel provides an implementation via the AESKEYGENASSIST instruction. Unfortunately, ARM does not provide an equivalent instruction for ARMv8-A, so we must get our hands a little dirty.

The Intel documentation provides a fairly precise definition of what the AESKEYGENASSIST instruction does:

X3[31:0] := a[127:96]
X2[31:0] := a[95:64]
X1[31:0] := a[63:32]
X0[31:0] := a[31:0]
RCON[31:0] := ZeroExtend(imm8[7:0])
dst[31:0] := SubWord(X1)
dst[63:32] := RotWord(SubWord(X1)) XOR RCON
dst[95:64] := SubWord(X3)
dst[127:96] := RotWord(SubWord(X3)) XOR RCON
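
For context, here is the common x86 pattern for using this intrinsic during AES-128 key expansion, where only word 3 of the result is actually consumed. This is a sketch for illustration: expand_key_step is a hypothetical helper, and _mm_shuffle_epi32 and _mm_slli_si128 are SSE2 intrinsics that would need their own NEON equivalents in a full port.

static __m128i expand_key_step(__m128i key, __m128i keygened)
{
    // Broadcast word 3, i.e. RotWord(SubWord(X3)) XOR RCON, to all four words
    keygened = _mm_shuffle_epi32(keygened, _MM_SHUFFLE(3, 3, 3, 3));
    // XOR each 32-bit word of the key with all of the words below it
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, keygened);
}

// usage: rk[1] = expand_key_step(rk[0], _mm_aeskeygenassist_si128(rk[0], 0x01));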

The only part of this that is really tricky is the SubWord() function, which uses the same substitution algorithm that the AES SubBytes step does. Implementing a custom lookup table is not very efficient, so it would be convenient to use the AESE instruction for this.

Just as with AES encryption, I used a zeroed-out round key to skip the AddRoundKey step. This is what remains after the SubBytes and ShiftRows steps transform the input:

|b0  b4  b8  b12|        |sub(b0)  sub(b4)  sub(b8)  sub(b12)|
|b1  b5  b9  b13|  AESE  |sub(b5)  sub(b9)  sub(b13) sub(b1) |
|b2  b6  b10 b14|  ===>  |sub(b10) sub(b14) sub(b2)  sub(b6) |
|b3  b7  b11 b15|        |sub(b15) sub(b3)  sub(b7)  sub(b11)|

Using the NEON TBL instruction, I can extract the desired bytes to build a new vector. On the left side, X1 is b4, b5, b6, b7, and on the right, those bytes have been shifted to positions 4, 1, 14, and 11. Similarly, X3 is b12, b13, b14, b15, and those bytes have been shifted to positions 12, 9, 6, and 3.

__m128i dest = {
    // Undo ShiftRows step from AESE and extract X1 and X3
    a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
    a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
    a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
    a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
};

The next step is to rotate the bytes of words 1 and 3. There isn’t really a good instruction to do this, but since I am already shuffling the output from AESE, I can shuffle a little more to perform the rotation:

__m128i dest = {
    // Undo ShiftRows step from AESE and extract X1 and X3
    a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
    a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
    a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
    a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
};

Finally, the RCON value needs to be XORed with words 1 and 3. Here is the final implementation:

__m128i _mm_aeskeygenassist_si128 (__m128i a, const int imm8)
{
    a = vaeseq_u8(a, (__m128i){}); // AESE does ShiftRows and SubBytes on A
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    return dest ^ (__m128i)((uint32x4_t){0, (uint32_t)imm8, 0, (uint32_t)imm8});
}
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3

00000000000000d0 <_mm_aeskeygenassist_si128>:
  d0:   90000008        adrp    x8, 0 <load_8>
  d4:   3dc00102        ldr     q2, [x8]
  d8:   6f00e401        movi    v1.2d, #0x0
  dc:   4e040fe3        dup     v3.4s, wzr
  e0:   4e284820        aese    v0.16b, v1.16b
  e4:   4e0c1c03        mov     v3.s[1], w0
  e8:   4e020000        tbl     v0.16b, {v0.16b}, v2.16b
  ec:   4e1c1c03        mov     v3.s[3], w0
  f0:   6e231c00        eor     v0.16b, v0.16b, v3.16b
  f4:   d65f03c0        ret

0000000000000000 <.rodata.cst16>:
        ...
  40:   0b0e0104        .word   0x0b0e0104
  44:   040b0e01        .word   0x040b0e01
  48:   0306090c        .word   0x0306090c
  4c:   0c030609        .word   0x0c030609