Intrinsics
What Intrinsics Are
Intrinsics are compiler-provided functions that expose CPU operations without writing inline assembly. SIMD intrinsics are the common case: one operation works on multiple lanes inside a vector register.
SIMD means Single Instruction, Multiple Data:
scalar: c0 = a0 + b0
c1 = a1 + b1
c2 = a2 + b2
c3 = a3 + b3
SIMD: c[0..3] = a[0..3] + b[0..3]
Intrinsics are useful when reversing optimized binaries because compiler output often maps back to recognizable operations: loads, stores, packed arithmetic, comparisons, shuffles, masks, and reductions.
Core Terms
| Term | Meaning |
|---|---|
| Lane | One element inside a vector register. |
| Vector width | Total register width, such as 128, 256, or 512 bits. |
| Element width | Per-lane type width, such as 8, 16, 32, or 64 bits. |
| Packed integer | Multiple integer lanes in one vector. |
| Packed float | Multiple floating-point lanes in one vector. |
| Intrinsic | C/C++ function-like wrapper that maps to one or more SIMD instructions. |
| Auto-vectorization | Compiler turns scalar loops into SIMD automatically. |
| Horizontal operation | Reduces across lanes, such as sum of all lanes. |
| Saturating arithmetic | Clamps on overflow instead of wrapping. Common in media code. |
| Shuffle/permute | Reorders lanes within or across vectors. |
| Mask/predicate | Per-lane enable/disable control for an operation. |
Main SIMD Families
| Platform | Families | Register model | Where to look |
|---|---|---|---|
| x86/x86-64 | MMX, SSE, SSE2, SSSE3, SSE4, AVX, AVX2, AVX-512, AVX10 | Fixed width: XMM 128-bit, YMM 256-bit, ZMM 512-bit | Intel Intrinsics Guide, Intel/AMD manuals, uops.info |
| Arm A/R-profile | Neon / Advanced SIMD | Fixed width: 64-bit and 128-bit vectors | Arm Neon Intrinsics Reference, ACLE |
| Arm A-profile | SVE / SVE2 | Scalable vector length, predicate-driven | Arm ACLE SVE docs, Arm Learning Paths |
| Arm M-profile | MVE / Helium | Embedded/Microcontroller vector extension | Arm MVE Intrinsics Reference |
| RISC-V | RVV | Scalable vector length | RISC-V Vector spec and compiler docs |
| Power | VSX/AltiVec | Fixed-width vectors | IBM/OpenPOWER docs |
Where To Find Intrinsics
How To Read Intrinsic Names
x86
Typical form:
_mm256_add_epi32
| Part | Meaning |
|---|---|
_mm | Intel intrinsic namespace. |
256 | Vector width: 256-bit YMM. Empty usually means 128-bit XMM. |
add | Operation. |
epi32 | Packed signed/unsigned 32-bit integer lanes. |
Common x86 suffixes:
| Suffix | Meaning |
|---|---|
ps | Packed single-precision floats, usually float32. |
pd | Packed double-precision floats, usually float64. |
ss | Scalar single-precision float. |
sd | Scalar double-precision float. |
epi8, epi16, epi32, epi64 | Packed integer lanes. |
si128, si256, si512 | Raw integer vector register. |
u8, u16, u32, u64 | Unsigned integer variants in some intrinsic names. |
mask, maskz | AVX-512 masked operation; maskz zeroes inactive lanes. |
Common headers:
| Header | Use |
|---|---|
<immintrin.h> | Most modern Intel/AMD intrinsics. |
<xmmintrin.h> | SSE. |
<emmintrin.h> | SSE2. |
<tmmintrin.h> | SSSE3. |
<smmintrin.h> | SSE4.1. |
<nmmintrin.h> | SSE4.2. |
Minimal x86 example:
#include <immintrin.h>
void add8_i32(const int *a, const int *b, int *out) {
__m256i va = _mm256_loadu_si256((const __m256i *)a);
__m256i vb = _mm256_loadu_si256((const __m256i *)b);
__m256i vc = _mm256_add_epi32(va, vb);
_mm256_storeu_si256((__m256i *)out, vc);
}
Compile:
gcc -O3 -mavx2 simd.c
clang -O3 -mavx2 simd.c
Arm Neon
Typical form:
vaddq_s32
| Part | Meaning |
|---|---|
v | Vector intrinsic prefix. |
add | Operation. |
q | 128-bit vector form. Without q, many Neon intrinsics use 64-bit vectors. |
s32 | Signed 32-bit integer lanes. |
Common Neon suffixes:
| Suffix | Meaning |
|---|---|
s8, s16, s32, s64 | Signed integer lanes. |
u8, u16, u32, u64 | Unsigned integer lanes. |
f16, f32, f64 | Floating-point lanes. |
p8, p16, p64 | Polynomial lanes. |
q | 128-bit vector form. |
n | One operand is a scalar immediate/value. |
lane | Uses a lane from another vector. |
Use:
#include <arm_neon.h>
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
int32x4_t va = vld1q_s32(a);
int32x4_t vb = vld1q_s32(b);
int32x4_t vc = vaddq_s32(va, vb);
vst1q_s32(out, vc);
}
Compile:
aarch64-linux-gnu-gcc -O3 simd.c
clang -O3 --target=aarch64-linux-gnu simd.c
On AArch64, Neon is part of the base architecture profile used by normal user-space targets. For AArch32, pass the correct FPU/architecture flags for the target CPU.
Arm SVE / SVE2
SVE is scalable SIMD. Code is written for an unknown runtime vector length, using predicates to control active lanes.
Typical form:
svadd_s32_m(pg, a, b)
| Part | Meaning |
|---|---|
sv | SVE intrinsic prefix. |
add | Operation. |
s32 | Signed 32-bit lanes. |
_m | Merge inactive lanes from the first input. |
_z | Zero inactive lanes. |
_x | Inactive lanes are don’t-care. |
pg | Predicate controlling active lanes. |
Minimal SVE loop shape:
#include <arm_sve.h>
void add_i32_sve(const int32_t *a, const int32_t *b, int32_t *out, int n) {
for (int i = 0; i < n; i += svcntw()) {
svbool_t pg = svwhilelt_b32(i, n);
svint32_t va = svld1_s32(pg, &a[i]);
svint32_t vb = svld1_s32(pg, &b[i]);
svint32_t vc = svadd_s32_m(pg, va, vb);
svst1_s32(pg, &out[i], vc);
}
}
Compile for an SVE-capable target:
aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+sve simd.c
clang -O3 --target=aarch64-linux-gnu -march=armv8.2-a+sve simd.c
Use SVE/SVE2 when you want code that scales across different vector lengths. Avoid hardcoding a fixed number of lanes in SVE loops.
Working With SIMD
Start Scalar
Write a clear scalar version first:
for (int i = 0; i < n; i++) {
out[i] = a[i] + b[i];
}
Then decide whether to use compiler auto-vectorization, explicit intrinsics, or a portable abstraction.
Check Auto-Vectorization
Compilers can vectorize simple loops when aliasing and control flow are clear.
GCC:
gcc -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed code.c
Clang:
clang -O3 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize code.c
Hints that help compilers:
void add(float *restrict out, const float *restrict a, const float *restrict b, int n) {
for (int i = 0; i < n; i++) {
out[i] = a[i] + b[i];
}
}
Use restrict only when the pointers really do not alias.
Use Intrinsics For Control
Use intrinsics when you need:
- exact instruction choice
- shuffles, masks, saturating arithmetic, or special instructions
- stable performance in a hot path
- behavior the compiler will not auto-vectorize
- reverse-engineering parity with compiler output
Avoid intrinsics when scalar code auto-vectorizes cleanly and performance is already good.
Handle Alignment And Tails
Most modern SIMD code can use unaligned loads:
| Platform | Unaligned load pattern |
|---|---|
| x86 | _mm_loadu_*, _mm256_loadu_*, _mm512_loadu_* |
| Neon | vld1* loads; alignment still matters for performance on some targets |
| SVE | svld1* with predicates for tails |
Tail handling options:
| Method | Use |
|---|---|
| Scalar cleanup loop | Simple and portable. |
| Masked load/store | AVX-512 and SVE are strong here. |
| Over-read within safe padding | Fast, but requires strict buffer guarantees. |
| Process fixed blocks only | Useful for codecs and cryptographic blocks. |
Verify Correctness
Check scalar and SIMD outputs on edge cases:
- empty input
- length smaller than vector width
- exact vector-width length
- one element past vector width
- unaligned input/output pointers
- max/min integer values
- NaN, infinity, signed zero for floating point
Measure
Use representative inputs and measure the whole hot loop, not just one instruction.
Useful tools:
| Tool | Use |
|---|---|
perf stat | Cycles, instructions, branches, cache misses. |
perf record/report | Hot path profiling. |
| Compiler Explorer | Inspect generated assembly. |
| uops.info | x86 instruction throughput/latency. |
| Agner Fog tables | x86 throughput/latency and microarchitecture notes. |
llvm-mca | Static throughput analysis for LLVM-supported targets. |
Python Prototyping
NumPy arrays are a convenient way to prototype SIMD-like logic in Python while reversing vectorized code. They model lanes, masks, shifts, reinterpret casts, and fixed-width overflow behavior closely enough for quick experiments and sanity checks.
Python references:
- Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- NumPy SIMD optimizations: https://numpy.org/devdocs/reference/simd/index.html
Basic lane model:
import numpy as np
u8 = np.array([0xff, 0x01, 0x7f, 0x80], dtype=np.uint8)
print((u8 + np.uint8(1)).astype(np.uint8))
Intrinsic-like helpers:
import numpy as np
# Load / Store
def mm_set1_ps(val: float):
return np.full(4, val, dtype=np.float32)
def mm_set1_pd(val: float):
return np.full(2, val, dtype=np.float64)
def mm_loadu_ps(arr):
return np.array(arr, dtype=np.float32)
def mm_storeu_ps(dst, src):
dst[:] = src.astype(np.float32)
# Arithmetic
def mm_add_ps(a, b): return a + b
def mm_sub_ps(a, b): return a - b
def mm_mul_ps(a, b): return a * b
def mm_div_ps(a, b): return a / b
def mm_add_pd(a, b): return a + b
def mm_sub_pd(a, b): return a - b
def mm_mul_pd(a, b): return a * b
def mm_div_pd(a, b): return a / b
# Comparisons
def mm_cmpgt_ps(a, b): return (a > b).astype(np.float32)
def mm_cmplt_ps(a, b): return (a < b).astype(np.float32)
def mm_cmpeq_ps(a, b): return (a == b).astype(np.float32)
# Bitwise / Logical
def mm_and_ps(a, b):
return np.bitwise_and(a.view(np.int32), b.view(np.int32)).view(np.float32)
def mm_or_ps(a, b):
return np.bitwise_or(a.view(np.int32), b.view(np.int32)).view(np.float32)
def mm_xor_ps(a, b):
return np.bitwise_xor(a.view(np.int32), b.view(np.int32)).view(np.float32)
# Shifts / Casts
def mm_castsi128_ps(a): return a.view(np.float32)
def mm_castps_si128(a): return a.view(np.int32)
def mm_srli_epi32(a, imm):
return (a.astype(np.uint32) >> imm).astype(np.int32)
def mm_slli_epi32(a, imm):
return (a.astype(np.uint32) << imm).astype(np.int32)
# Math helpers: not exact intrinsics, but common when modeling algorithms.
def mm_abs_ps(a): return np.abs(a)
def mm_max_ps(a, b): return np.maximum(a, b)
def mm_min_ps(a, b): return np.minimum(a, b)
def mm_sqrt_ps(a): return np.sqrt(a)
def mm_rcp_ps(a): return 1.0 / a
def mm_rsqrt_ps(a): return 1.0 / np.sqrt(a)
Example use:
a = mm_loadu_ps([1, 2, 3, 4])
b = mm_set1_ps(10)
out = np.zeros(4, dtype=np.float32)
mm_storeu_ps(out, mm_add_ps(a, b))
print(out) # [11. 12. 13. 14.]
mask = mm_cmpgt_ps(out, mm_set1_ps(12))
print(mask) # [0. 0. 1. 1.]
Python model limits:
- NumPy does not model instruction latency or port pressure.
- NumPy may use SIMD internally, but the Python code is not a direct instruction trace.
- Integer overflow must be modeled with explicit dtypes.
- Floating-point results can differ if the real code uses approximations like reciprocal or reciprocal-square-root instructions.
Reversing SIMD Code
x86 Registers
| Register | Width | Common ISA |
|---|---|---|
xmm0..xmm31 | 128-bit | SSE, AVX, AVX-512VL |
ymm0..ymm31 | 256-bit | AVX, AVX2 |
zmm0..zmm31 | 512-bit | AVX-512 |
k0..k7 | mask registers | AVX-512 |
Useful GDB:
info all-registers
info registers xmm0 ymm0 zmm0
p $xmm0
x/16xb $rsp
x/8i $pc
Common x86 instruction prefixes:
| Prefix / mnemonic shape | Meaning |
|---|---|
p* | Packed integer, older SSE style. |
v* | VEX/EVEX encoded AVX/AVX-512 style. |
addps, mulps | Packed single-float operations. |
paddd, v paddd | Packed integer add dword lanes. |
pshuf*, vperm* | Shuffle or permute. |
pcmpeq*, pcmpgt* | Packed comparisons. |
pmovmskb, movmsk* | Extract sign bits to scalar mask. |
vpblend*, vpternlog* | Blend/boolean selection. |
Arm Registers
| Register | Use |
|---|---|
v0..v31 | AArch64 SIMD/floating-point registers. |
q0..q31 | 128-bit view of v registers. |
d0..d31 | 64-bit view. |
s0..s31 | 32-bit view. |
h0..h31 | 16-bit view. |
b0..b31 | 8-bit view. |
z0..z31 | SVE scalable vector registers. |
p0..p15 | SVE predicate registers. |
Common Arm patterns:
| Pattern | Meaning |
|---|---|
ld1 / st1 | Load/store vector lanes. |
add, sub, mul | Lane-wise arithmetic. |
mla, fmla | Multiply-add. |
tbl, tbx | Table lookup/shuffle. |
uzp, zip, trn | Unzip/zip/transpose lane interleaving. |
sq*, uq* | Signed/unsigned saturating operations. |
cnt | Count bits per byte. |
whilelt, ptrue | SVE predicate setup. |
Porting Between x86 And Arm
Direct one-to-one mappings do not always exist. Start from behavior:
- Identify lane type and vector width.
- Identify operation: arithmetic, compare, shuffle, load/store, mask, reduction.
- Write a scalar or NumPy model.
- Search the target architecture intrinsic reference.
- Verify edge cases and generated assembly.
Useful options:
| Option | Use |
|---|---|
| Rewrite with native Neon/SVE intrinsics | Best for performance and clarity on Arm. |
| Use SIMDe | Good for porting existing x86 intrinsic code to Arm or other platforms. |
| Use compiler vector extensions | Good for simple portable operations. |
| Keep scalar + auto-vectorization | Good when compiler produces clean SIMD. |
SIMDe example:
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/avx2.h>
__m256i add8(__m256i a, __m256i b) {
return _mm256_add_epi32(a, b);
}
On x86, SIMDe can map to native x86 intrinsics. On Arm, it can map many x86 APIs to Neon or other available implementations.
Quick Build Flags
| Target | GCC/Clang flags |
|---|---|
| Native current CPU | -O3 -march=native |
| x86 SSE2 | -O3 -msse2 |
| x86 SSSE3 | -O3 -mssse3 |
| x86 SSE4.2 | -O3 -msse4.2 |
| x86 AVX | -O3 -mavx |
| x86 AVX2 | -O3 -mavx2 -mfma |
| x86 AVX-512 baseline | -O3 -mavx512f |
| AArch64 generic | -O3 --target=aarch64-linux-gnu |
| AArch64 SVE | -O3 -march=armv8.2-a+sve |
| AArch64 SVE2 | -O3 -march=armv9-a+sve2 |
| Arm MVE | Target-specific Cortex-M flags, usually with MVE/Helium-enabled -mcpu= |
Use runtime dispatch when shipping binaries to unknown CPUs. Do not compile a public binary with -march=native unless the deployment CPU is fixed.
References
x86
- Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Intel Intrinsics Guide release notes: https://www.intel.com/content/www/us/en/developer/articles/release-notes/intrinsics-guide-release-notes.html
- Microsoft x86 intrinsics list: https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
- uops.info instruction tables: https://uops.info
- Agner Fog optimization manuals: https://www.agner.org/optimize/
Arm
- Arm Neon Intrinsics Reference: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html
- Arm C Language Extensions: https://arm-software.github.io/acle/main/acle.html
- Arm MVE Intrinsics Reference: https://arm-software.github.io/acle/mve_intrinsics/
- Arm Neon overview: https://www.arm.com/technologies/neon
- Arm SIMD hub and Intrinsics Search Engine: https://developer.arm.com/servers-and-cloud-computing/arm-simd
- Arm Learn the Architecture: https://www.arm.com/architecture/learn-the-architecture/a-profile
- Arm SIMD hub: https://www.arm.com/developer-hub/servers-and-cloud-computing/arm-simd
- Arm SIMD learning path: https://learn.arm.com/learning-paths/cross-platform/simd-loops/1-about/
- Arm intrinsics porting path: https://learn.arm.com/learning-paths/cross-platform/intrinsics/
- Arm SIMDe porting guide: https://learn.arm.com/learning-paths/cross-platform/intrinsics/simde/
Compilers And Libraries
- GCC vector extensions: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
- Clang language extensions: https://clang.llvm.org/docs/LanguageExtensions.html
- NumPy CPU/SIMD optimizations: https://numpy.org/devdocs/reference/simd/index.html
- NumPy NEP 38: https://numpy.org/neps/nep-0038-SIMD-optimizations.html
- SIMDe: https://github.com/simd-everywhere/simde
- Compiler Explorer: https://godbolt.org