
ARMv8 Crypto Extensions

Basics

SoC vendors who license ARMv8 cores (usually 64-bit capable) can choose among certain optional features, for example the cryptographic acceleration called ‘ARMv8 Cryptography Extensions’.

Most SoC vendors do license them. The only known exceptions are early Cortex-A53 SoCs like Qualcomm’s Snapdragon 410, Amlogic’s very first 64-bit SoC S905 (used only on ODROID-C2 and NanoPi K2), Phytium’s FTC662 core and Broadcom’s SoCs powering all 64-bit capable Raspberry Pis. All of these lack any crypto acceleration and perform far worse than all other 64-bit ARM SoCs in this area.

If the kernel has been built correctly, availability of accelerated cryptography functions can be checked by querying /proc/cpuinfo: The ‘Features’ entry will additionally show aes pmull sha1 sha2.
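The check described above can be scripted. The following is a minimal sketch, not part of sbc-bench itself: the helper name `crypto_flags_in` is made up for illustration, and it assumes the usual arm64 `/proc/cpuinfo` layout with a `Features` line per core.

```python
from pathlib import Path

# The four flags the 'Features' entry shows when the Crypto Extensions are exposed.
CRYPTO_FLAGS = {"aes", "pmull", "sha1", "sha2"}


def crypto_flags_in(cpuinfo_text):
    """Return the subset of CRYPTO_FLAGS found in any 'Features' line.

    Hypothetical helper: takes the raw text of /proc/cpuinfo and returns a set.
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            found |= CRYPTO_FLAGS & set(value.split())
    return found


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = crypto_flags_in(cpuinfo.read_text())
        missing = CRYPTO_FLAGS - flags
        print("found:", sorted(flags), "missing:", sorted(missing))
```

On a system without a `Features` line (for example x86, which uses `flags` instead) the sketch simply reports all four flags as missing.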

sbc-bench’s use of OpenSSL

sbc-bench uses OpenSSL’s internal AES benchmark to detect crypto acceleration, testing single-threaded with AES-128, AES-192 and AES-256. For AES-256, a benchmark run looks like this:

openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 63579690 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 34729604 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 11848770 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 3221240 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 419117 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 209578 aes-256-cbc's in 3.00s
...
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc     339091.68k   740898.22k  1011095.04k  1099516.59k  1144468.82k  1144575.32k

The results are ‘1000s of bytes per second processed’. From now on we’ll focus only on the rightmost column, since it is not affected by initialization overhead (16K chunk size, with a 1144575.32k score in the example above).
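To see where a score such as 1144575.32k comes from, the summary row can be recomputed from the ‘Doing ...’ lines of the run above. This is only a sketch: `throughput_k` is a hypothetical helper, and it assumes the elapsed time is exactly the 3.00 s that openssl prints.

```python
def throughput_k(operations, block_size, seconds):
    """Bytes processed per second, expressed in 1000s of bytes (openssl's 'k')."""
    return operations * block_size / seconds / 1000


# Operation counts taken from the aes-256-cbc run shown above.
print(f"{throughput_k(209578, 16384, 3.00):.2f}k")   # 16384-byte blocks
print(f"{throughput_k(63579690, 16, 3.00):.2f}k")    # 16-byte blocks
```

The first value reproduces the rightmost column of the summary row, the second the leftmost one. The large gap between them illustrates the per-call overhead at small block sizes.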

ARMv8 Crypto Extensions are not a classic ‘crypto engine’ running at a fixed clock (like Marvell’s CESA for example) but scale linearly with clockspeed. With the openssl benchmark, DRAM configuration and performance don’t matter either, since the whole benchmark runs inside CPU caches. And because OpenSSL uses userspace crypto, the scores are identical regardless of whether userland is armhf or arm64 (see Samsung/Nexell S5P6818 numbers below). Distro used as well as OpenSSL version also don’t seem to matter.

Scores predictable based on CPU core and clockspeed

It all boils down to type of ARM core and CPU clockspeed since the ratio between openssl score and CPU clockspeed is fixed in the following way (using sbc-bench result collection):

  • Cortex-A35: ~217, an A35 running at 1000 MHz will produce an ~217000k aes-256-cbc score (or ~434000k at 2000 MHz)
  • Cortex-A57: ~359, an A57 running at 1000 MHz will produce an ~359000k aes-256-cbc score (or ~718000k at 2000 MHz)
  • Cortex-A53/A55: ~467, an A53/A55 running at 1000 MHz will produce an ~467000k aes-256-cbc score (or ~935000k at 2000 MHz)
  • Cortex-A72/A73/A76/A77/A78/X1: ~570, an A72/A73/A76 running at 1000 MHz will produce an ~570000k aes-256-cbc score (or ~1140000k at 2000 MHz)
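The list above is simply ratio times clockspeed. A small sketch of that arithmetic, using the approximate ratios from the list (the dictionary and the function name `expected_score_k` are illustrative only and not part of sbc-bench):

```python
# Approximate aes-256-cbc score (in k) per MHz, as listed above.
RATIO_PER_MHZ = {
    "Cortex-A35": 217,
    "Cortex-A57": 359,
    "Cortex-A53/A55": 467,
    "Cortex-A72/A73/A76/A77/A78/X1": 570,
}


def expected_score_k(core_family, mhz):
    """Rough expected 16 KB aes-256-cbc score in 1000s of bytes per second."""
    return RATIO_PER_MHZ[core_family] * mhz


for family in RATIO_PER_MHZ:
    print(f"{family}: ~{expected_score_k(family, 1000)}k at 1000 MHz, "
          f"~{expected_score_k(family, 2000)}k at 2000 MHz")
```

Since the ratios are rounded, the results can differ slightly from the figures quoted in the list (for example ~934000k instead of ~935000k for an A53/A55 at 2000 MHz).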

Amazon’s Graviton/Graviton2 ARM CPUs and Neoverse-N1 cores score identically to A72/A73/A76/A77/A78, and the custom FTC663 core inside the Feiteng D2000 CPU performs identically to an A57 (another hint regarding their similarity). NVIDIA’s Carmel core performs marginally better than Cortex-A57 (~374, see the Jetson Xavier NX numbers below). Qualcomm’s Kryo Silver cores are based on A55 and perform exactly the same here, while Qualcomm’s Falkor V1 behaves like Cortex-A72 and onwards.

Implications

Encryption/decryption performance with real-world tasks is an entirely different thing than the results of a synthetic benchmark that runs completely inside the CPU cores/caches. Real performance with real use cases might look very different (e.g. full disk encryption or performance as a VPN gateway).

The openssl speed -elapsed -evp aes-256-cbc test is still more of a check whether crypto acceleration is available than a benchmark for real-world crypto performance. But if, and only if, an ARM SoC vendor has licensed ARMv8 Crypto Extensions, simple conclusions can be drawn, since there is a fixed correlation between core type, clockspeed and aes-256-cbc score. So if we know that a new SoC features e.g. A55 cores but cheats with reported clockspeeds, and we’re not able to measure clockspeeds ourselves, then we can use the openssl benchmark to estimate the real CPU clockspeed. The reverse (inferring the core type from score and clockspeed) should work too, but it’s better to look up the CPU ID instead.
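The clockspeed estimate mentioned above is just the score divided by the per-MHz ratio of the core type. A minimal sketch, assuming the ~467 ratio for A53/A55 given earlier; `estimate_mhz` is a hypothetical helper and the score used here is invented purely for illustration:

```python
def estimate_mhz(score_k, ratio_per_mhz):
    """Estimate the real clockspeed in MHz from a 16 KB aes-256-cbc score (in k)."""
    return score_k / ratio_per_mhz


# Made-up example: an A55 based SoC that scores 700500k.
A55_RATIO = 467  # approximate ratio for Cortex-A53/A55
print(round(estimate_mhz(700500, A55_RATIO)), "MHz")
```

Because the ratios are approximate, the result is an estimate good to within a few percent rather than an exact measurement, and it is meaningless on SoCs without Crypto Extensions.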

All of this only applies to ARM SoCs with ARMv8 Crypto Extensions licensed, since otherwise the scores reported by openssl depend heavily on compiler version/settings and even on different code paths. Check out the ODROID-C2 and RPi 4 ‘AES-256 (16 KB)’ scores in the official results list: with the C2, a ‘modern OS’ outperforms a higher CPU clock, and with the RPi 4, comparing armhf userland (32-bit) and arm64 (64-bit) is even more telling. There, openssl reports less than 50% of the ‘AES performance’ when running 64-bit compared to 32-bit because of different code paths: generic C with 64-bit vs. optimized assembler routines with 32-bit.

Numbers the aforementioned conclusions are based on

Crawling through the sbc-bench results collection and comparing more than 30 different SoCs/CPUs from various vendors at various clockspeeds, using OpenSSL versions 1.1.0f (25 May 2017) through 3.0.2 (15 Mar 2022), always shows the same relation between openssl score and clockspeed for those four core families (right column is OpenSSL’s aes-256-cbc score divided by clockspeed in MHz):