Emulating x86 AES Intrinsics on ARMv8-A
Recently I needed to port some C encryption code to run on an ARMv8-A (aarch64) processor. The problem is that the code uses some x86 AES intrinsics, which the compiler doesn’t recognize when targeting the ARM architecture. ARMv8-A does have an optional crypto extension, which includes several AES instructions, but they have slightly different semantics from the x86 instructions. I don’t have much experience with AES, and initially found this very confusing, since I had assumed that all AES implementations would need to work the same way (it is a standard, after all!). It turns out that both approaches are sufficient for implementing AES, but they break up the problem in different ways.
Background information on AES
The Advanced Encryption Standard (AES) is a common symmetric encryption algorithm that uses a secret key to encrypt and decrypt data. AES encrypts 16 bytes at a time and uses key sizes of 128, 192, or 256 bits. The 16-byte data block is transformed by repeating a sequence of steps called a round. The order of the steps is fixed, but the number of rounds varies depending on the key size. For example, the standard states that for a 128-bit key size, 10 rounds are used.
The AES steps are defined as operations on a 4x4 matrix of bytes (16 bytes in total), which looks like this:
|b0 b4 b8  b12|
|b1 b5 b9  b13|
|b2 b6 b10 b14|
|b3 b7 b11 b15|
...where bN is the Nth byte in the 16-byte data block.
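For concreteness, here is a small helper (illustrative only; nothing in the ported code needs it) showing how the 16-byte block maps onto that column-major matrix:

#include <stdint.h>

// Illustration only: copy a 16-byte block into the column-major state
// matrix shown above. Byte bN lands at row N % 4, column N / 4.
static void load_state(uint8_t state[4][4], const uint8_t block[16])
{
    for (int n = 0; n < 16; n++)
        state[n % 4][n / 4] = block[n];
}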
A round is defined as the following steps:
SubBytes – Uses a table to map each byte value to a unique byte value
ShiftRows – Rotates the bytes in each row by a different amount
MixColumns – Mixes by combining the four bytes of each column
AddRoundKey – XORs each byte of the matrix with a value derived from the key
Comparison of AES encrypt in Intel vs ARM
Intel provides two AES instructions in x86 for encryption, which match up closely to the AES rounds:
AESENC – AES Encrypt (Normal Round)
a. ShiftRows
b. SubBytes
c. MixColumns
d. AddRoundKey
AESENCLAST – AES Encrypt (Last Round, No MixColumns)
a. ShiftRows
b. SubBytes
c. AddRoundKey
(You may notice that the ShiftRows and SubBytes steps are swapped from the formal AES definition. This is OK, since it doesn’t change the final result.)
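To make the Intel semantics concrete, here is a minimal sketch of how one block of AES-128 is typically encrypted with these intrinsics, assuming the 11 round keys have already been expanded (the function name and layout are illustrative, not taken from the code I was porting):

#include <immintrin.h>

// Sketch: AES-128 encryption of one block with the x86 AES intrinsics.
// round_keys[0..10] are the expanded round keys.
__m128i aes128_encrypt_block(__m128i block, const __m128i round_keys[11])
{
    block = _mm_xor_si128(block, round_keys[0]);        // initial AddRoundKey
    for (int i = 1; i <= 9; i++)
        block = _mm_aesenc_si128(block, round_keys[i]); // rounds 1-9
    return _mm_aesenclast_si128(block, round_keys[10]); // round 10, no MixColumns
}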
ARM also provides two AES instructions for encryption, but blurs the lines between different rounds a little bit:
AESE – AES Encrypt (AddRoundKey is first, No MixColumns)
a. AddRoundKey
b. ShiftRows
c. SubBytes
AESMC – AES MixColumns
a. MixColumns
(See the ARM Architecture Manual for instruction details)
Here is how three rounds of AES encryption would be implemented with Intel and ARM:
| Round | Intel | AES Steps | ARM |
|---|---|---|---|
| Round 1 | XOR | AddRoundKey | AESE |
| | AESENC | SubBytes | |
| | | ShiftRows | |
| | | MixColumns | AESMC |
| | | AddRoundKey | AESE |
| Round 2 | AESENC | SubBytes | |
| | | ShiftRows | |
| | | MixColumns | AESMC |
| | | AddRoundKey | AESE |
| Round 3 | AESENCLAST | SubBytes | |
| | | ShiftRows | |
| | | AddRoundKey | XOR |
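The ARM column of the table translates directly into NEON intrinsics. Here is a minimal sketch of native ARM AES-128 encryption under the same assumptions as before (11 pre-expanded round keys, illustrative names). Note how each AESE consumes the previous round’s key, and the final AddRoundKey becomes a plain XOR:

#include <arm_neon.h>

// Sketch: AES-128 encryption of one block with the native ARM intrinsics.
uint8x16_t aes128_encrypt_block_arm(uint8x16_t block, const uint8x16_t round_keys[11])
{
    for (int i = 0; i < 9; i++) {
        block = vaeseq_u8(block, round_keys[i]); // AddRoundKey(key i) + SubBytes + ShiftRows
        block = vaesmcq_u8(block);               // MixColumns
    }
    block = vaeseq_u8(block, round_keys[9]);     // round 9 AddRoundKey + round 10 SubBytes/ShiftRows
    return veorq_u8(block, round_keys[10]);      // round 10 AddRoundKey
}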
Implementing AESENC with ARM instructions
I wanted to avoid rewriting the algorithm that I was porting, so I decided to stick with the Intel semantics and reimplement the x86 intrinsics with ARM NEON intrinsics and GCC vector extensions. The intrinsics for the x86 AESENC and AESENCLAST instructions have these prototypes:
__m128i _mm_aesenc_si128 (__m128i a, __m128i RoundKey);
__m128i _mm_aesenclast_si128 (__m128i a, __m128i RoundKey);
The first step is to define an equivalent type for __m128i. I mapped this to the NEON type uint8x16_t:
#include <stdint.h>
#include <arm_neon.h>
typedef uint8x16_t __m128i;
Next, I had to come up with a sequence of ARM instructions that can be used to emulate the x86 semantics. Using AESE+AESMC+XOR will get us close, except the ARM AESE has an extra AddRoundKey at the beginning that is not present in the x86 AESENC. However, since AddRoundKey simply XORs the key with the data, a key value of zero turns the step into a NOP. Here is the final implementation:
__m128i _mm_aesenc_si128 (__m128i a, __m128i RoundKey)
{
    // AESE with an all-zero key: the implicit AddRoundKey becomes a NOP,
    // leaving ShiftRows + SubBytes; AESMC then performs MixColumns
    return vaesmcq_u8(vaeseq_u8(a, (__m128i){})) ^ RoundKey;
}

__m128i _mm_aesenclast_si128 (__m128i a, __m128i RoundKey)
{
    return vaeseq_u8(a, (__m128i){}) ^ RoundKey;
}
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3
000000000000003c <_mm_aesenc_si128>:
3c: 6f00e402 movi v2.2d, #0x0
40: 4e284840 aese v0.16b, v2.16b
44: 4e286800 aesmc v0.16b, v0.16b
48: 6e211c00 eor v0.16b, v0.16b, v1.16b
4c: d65f03c0 ret
0000000000000050 <_mm_aesenclast_si128>:
50: 6f00e402 movi v2.2d, #0x0
54: 4e284840 aese v0.16b, v2.16b
58: 6e211c00 eor v0.16b, v0.16b, v1.16b
5c: d65f03c0 ret
Implementing AESDEC with ARM instructions
There are two ways to decrypt AES. The first is called the “Inverse Cipher”, and it simply reverses the order of the steps used for encryption. Instead of using ShiftRows, SubBytes, and MixColumns, the inverse functions InvShiftRows, InvSubBytes, and InvMixColumns are used. The second method is a technique called the “Equivalent Inverse Cipher”, which generates the decryption keys in a different way, but allows the decryption steps to be reordered so that they are faster to implement in hardware. The x86 and ARMv8-A AES instructions are designed to be used with the second decryption algorithm. You can read more about it in Intel’s whitepaper.
Intel provides the AESDEC and AESDECLAST instructions in x86 to aid with AES decryption, while ARM provides the AESD and AESIMC instructions. Just as with encryption, these instructions have slightly different semantics across the two architectures. Fortunately, it is still possible to replace the Intel intrinsics with a sequence of ARM intrinsics.
__m128i _mm_aesdec_si128 (__m128i a, __m128i RoundKey)
{
    return vaesimcq_u8(vaesdq_u8(a, (__m128i){})) ^ RoundKey;
}

__m128i _mm_aesdeclast_si128 (__m128i a, __m128i RoundKey)
{
    return vaesdq_u8(a, (__m128i){}) ^ RoundKey;
}
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3
0000000000000060 <_mm_aesdec_si128>:
60: 6f00e402 movi v2.2d, #0x0
64: 4e285840 aesd v0.16b, v2.16b
68: 4e287800 aesimc v0.16b, v0.16b
6c: 6e211c00 eor v0.16b, v0.16b, v1.16b
70: d65f03c0 ret
0000000000000074 <_mm_aesdeclast_si128>:
74: 6f00e402 movi v2.2d, #0x0
78: 4e285840 aesd v0.16b, v2.16b
7c: 6e211c00 eor v0.16b, v0.16b, v1.16b
80: d65f03c0 ret
Implementing AESKEYGENASSIST with ARM instructions
One part of AES that I glossed over is how the round key is generated for the AddRoundKey step. The standard defines an algorithm, and Intel provides an implementation via the AESKEYGENASSIST instruction. Unfortunately, ARM does not provide an equivalent instruction for ARMv8-A, so we must get our hands a little dirty.
The Intel documentation provides a fairly precise definition of what the AESKEYGENASSIST instruction does:
X3[31:0] := a[127:96]
X2[31:0] := a[95:64]
X1[31:0] := a[63:32]
X0[31:0] := a[31:0]
RCON[31:0] := ZeroExtend(imm8[7:0]);
dst[31:0] := SubWord(X1)
dst[63:32] := RotWord(SubWord(X1)) XOR RCON;
dst[95:64] := SubWord(X3)
dst[127:96] := RotWord(SubWord(X3)) XOR RCON;
The only part of this that is really tricky is the SubWord() function, which uses the same substitution algorithm that the AES SubBytes step does. Implementing a custom lookup table is not very efficient, so it would be convenient to use the AESE instruction for this.
Just as with AES encryption, I used a zeroed-out round key to skip the AddRoundKey step. This is what remains after the SubBytes and ShiftRows steps transform the input:
|b0  b4  b8  b12|      |sub(b0)  sub(b4)  sub(b8)  sub(b12)|
|b1  b5  b9  b13| AESE |sub(b5)  sub(b9)  sub(b13) sub(b1) |
|b2  b6  b10 b14| ===> |sub(b10) sub(b14) sub(b2)  sub(b6) |
|b3  b7  b11 b15|      |sub(b15) sub(b3)  sub(b7)  sub(b11)|
Using the NEON TBL instruction, I can extract the desired bytes to build a new vector. On the left side, X1 is b4, b5, b6, b7, and on the right, those bytes have been shifted to positions 4, 1, 14, and 11. Similarly, X3 is b12, b13, b14, b15, and those bytes have been shifted to positions 12, 9, 6, and 3.
__m128i dest = {
    // Undo ShiftRows step from AESE and extract X1 and X3
    a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
    a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
    a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
    a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
};
The next step is to rotate the bytes of words 1 and 3. There isn’t really a good instruction to do this, but since I am already shuffling the output from AESE, I can shuffle a little more to perform the rotation:
__m128i dest = {
    // Undo ShiftRows step from AESE and extract X1 and X3
    a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
    a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
    a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
    a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
};
Finally, the RCON value needs to be XORed with words 1 and 3. Here is the final implementation:
__m128i _mm_aeskeygenassist_si128 (__m128i a, const int imm8)
{
    a = vaeseq_u8(a, (__m128i){}); // AESE does ShiftRows and SubBytes on a
    __m128i dest = {
        // Undo ShiftRows step from AESE and extract X1 and X3
        a[0x4], a[0x1], a[0xE], a[0xB], // SubBytes(X1)
        a[0x1], a[0xE], a[0xB], a[0x4], // ROT(SubBytes(X1))
        a[0xC], a[0x9], a[0x6], a[0x3], // SubBytes(X3)
        a[0x9], a[0x6], a[0x3], a[0xC], // ROT(SubBytes(X3))
    };
    // XOR RCON (the zero-extended imm8) into words 1 and 3
    return dest ^ (__m128i)((uint32x4_t){0, (uint32_t)imm8, 0, (uint32_t)imm8});
}
# clang-6.0 -target aarch64-none-linux -march=armv8+crypto -O3
00000000000000d0 <_mm_aeskeygenassist_si128>:
d0: 90000008 adrp x8, 0 <load_8>
d4: 3dc00102 ldr q2, [x8]
d8: 6f00e401 movi v1.2d, #0x0
dc: 4e040fe3 dup v3.4s, wzr
e0: 4e284820 aese v0.16b, v1.16b
e4: 4e0c1c03 mov v3.s[1], w0
e8: 4e020000 tbl v0.16b, {v0.16b}, v2.16b
ec: 4e1c1c03 mov v3.s[3], w0
f0: 6e231c00 eor v0.16b, v0.16b, v3.16b
f4: d65f03c0 ret
0000000000000000 <.rodata.cst16>:
...
40: 0b0e0104 .word 0x0b0e0104
44: 040b0e01 .word 0x040b0e01
48: 0306090c .word 0x0306090c
4c: 0c030609 .word 0x0c030609
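To show how the emulated intrinsic fits into the surrounding code, here is a sketch of one step of the AES-128 key schedule built on top of it. The helper name and structure are illustrative (they are not part of the code I was porting), and it assumes a little-endian target, like the x86 code it replaces:

// Hypothetical helper: derive round key N+1 from round key N using the
// emulated _mm_aeskeygenassist_si128. rcon is the round constant for
// round N+1 (0x01, 0x02, 0x04, ...).
__m128i aes128_key_expand_step(__m128i prev, const int rcon)
{
    uint32x4_t assist = (uint32x4_t)_mm_aeskeygenassist_si128(prev, rcon);
    uint32x4_t k = (uint32x4_t)prev;
    uint32x4_t next = k;

    // Word 3 of the assist value is RotWord(SubWord(prev word 3)) XOR rcon
    next[0] = k[0] ^ assist[3];
    next[1] = k[1] ^ next[0];
    next[2] = k[2] ^ next[1];
    next[3] = k[3] ^ next[2];
    return (__m128i)next;
}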