Intrinsics

What Intrinsics Are

Intrinsics are compiler-provided functions that expose CPU operations without writing inline assembly. SIMD intrinsics are the common case: one operation works on multiple lanes inside a vector register.

SIMD means Single Instruction, Multiple Data:

scalar:  c0 = a0 + b0
         c1 = a1 + b1
         c2 = a2 + b2
         c3 = a3 + b3

SIMD:    c[0..3] = a[0..3] + b[0..3]

Intrinsics are useful when reversing optimized binaries because compiler output often maps back to recognizable operations: loads, stores, packed arithmetic, comparisons, shuffles, masks, and reductions.

Core Terms

Term	Meaning
Lane	One element inside a vector register.
Vector width	Total register width, such as 128, 256, or 512 bits.
Element width	Per-lane type width, such as 8, 16, 32, or 64 bits.
Packed integer	Multiple integer lanes in one vector.
Packed float	Multiple floating-point lanes in one vector.
Intrinsic	C/C++ function-like wrapper that maps to one or more SIMD instructions.
Auto-vectorization	Compiler turns scalar loops into SIMD automatically.
Horizontal operation	Reduces across lanes, such as sum of all lanes.
Saturating arithmetic	Clamps on overflow instead of wrapping. Common in media code.
Shuffle/permute	Reorders lanes within or across vectors.
Mask/predicate	Per-lane enable/disable control for an operation.

Main SIMD Families

Platform	Families	Register model	Where to look
x86/x86-64	MMX, SSE, SSE2, SSSE3, SSE4, AVX, AVX2, AVX-512, AVX10	Fixed width: XMM 128-bit, YMM 256-bit, ZMM 512-bit	Intel Intrinsics Guide, Intel/AMD manuals, uops.info
Arm A/R-profile	Neon / Advanced SIMD	Fixed width: 64-bit and 128-bit vectors	Arm Neon Intrinsics Reference, ACLE
Arm A-profile	SVE / SVE2	Scalable vector length, predicate-driven	Arm ACLE SVE docs, Arm Learning Paths
Arm M-profile	MVE / Helium	Embedded/Microcontroller vector extension	Arm MVE Intrinsics Reference
RISC-V	RVV	Scalable vector length	RISC-V Vector spec and compiler docs
Power	VSX/AltiVec	Fixed-width vectors	IBM/OpenPOWER docs

Where To Find Intrinsics

Need	Site
x86 SSE/AVX/AVX-512 intrinsics	https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Microsoft x86 intrinsic names	https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
Arm Neon intrinsics	https://arm-software.github.io/acle/neon_intrinsics/advsimd.html
Arm ACLE, Neon, SVE/SVE2 language extensions	https://arm-software.github.io/acle/main/acle.html
Arm MVE / Helium intrinsics	https://arm-software.github.io/acle/mve_intrinsics/
Arm SIMD and Intrinsics Search Engine hub	https://developer.arm.com/servers-and-cloud-computing/arm-simd
Arm Neon overview and guides	https://www.arm.com/technologies/neon
Arm learning material for Neon/SVE/SME	https://www.arm.com/architecture/learn-the-architecture/a-profile
Arm SIMD learning paths	https://learn.arm.com/learning-paths/cross-platform/simd-loops/1-about/
Arm intrinsics porting paths	https://learn.arm.com/learning-paths/cross-platform/intrinsics/
GCC vector extensions	https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
Clang vector extensions and builtins	https://clang.llvm.org/docs/LanguageExtensions.html
NumPy CPU/SIMD optimization docs	https://numpy.org/devdocs/reference/simd/index.html
Portable intrinsics with SIMDe	https://github.com/simd-everywhere/simde
Compile and inspect generated assembly	https://godbolt.org
x86 latency/throughput data	https://uops.info
x86 optimization manuals and instruction tables	https://www.agner.org/optimize/

How To Read Intrinsic Names

x86

Typical form:

_mm256_add_epi32

Part	Meaning
`_mm`	Intel intrinsic namespace.
`256`	Vector width: 256-bit YMM. Empty usually means 128-bit XMM.
`add`	Operation.
`epi32`	Packed signed/unsigned 32-bit integer lanes.

Common x86 suffixes:

Suffix	Meaning
`ps`	Packed single-precision floats, usually `float32`.
`pd`	Packed double-precision floats, usually `float64`.
`ss`	Scalar single-precision float.
`sd`	Scalar double-precision float.
`epi8`, `epi16`, `epi32`, `epi64`	Packed integer lanes.
`si128`, `si256`, `si512`	Raw integer vector register.
`u8`, `u16`, `u32`, `u64`	Unsigned integer variants in some intrinsic names.
`mask`, `maskz`	AVX-512 masked operation; `maskz` zeroes inactive lanes.

Common headers:

Header	Use
`<immintrin.h>`	Most modern Intel/AMD intrinsics.
`<xmmintrin.h>`	SSE.
`<emmintrin.h>`	SSE2.
`<tmmintrin.h>`	SSSE3.
`<smmintrin.h>`	SSE4.1.
`<nmmintrin.h>`	SSE4.2.

Minimal x86 example:

#include <immintrin.h>

void add8_i32(const int *a, const int *b, int *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi32(va, vb);
    _mm256_storeu_si256((__m256i *)out, vc);
}

Compile:

gcc -O3 -mavx2 simd.c
clang -O3 -mavx2 simd.c

Arm Neon

Typical form:

vaddq_s32

Part	Meaning
`v`	Vector intrinsic prefix.
`add`	Operation.
`q`	128-bit vector form. Without `q`, many Neon intrinsics use 64-bit vectors.
`s32`	Signed 32-bit integer lanes.

Common Neon suffixes:

Suffix	Meaning
`s8`, `s16`, `s32`, `s64`	Signed integer lanes.
`u8`, `u16`, `u32`, `u64`	Unsigned integer lanes.
`f16`, `f32`, `f64`	Floating-point lanes.
`p8`, `p16`, `p64`	Polynomial lanes.
`q`	128-bit vector form.
`n`	One operand is a scalar immediate/value.
`lane`	Uses a lane from another vector.

Use:

#include <arm_neon.h>

void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    int32x4_t vc = vaddq_s32(va, vb);
    vst1q_s32(out, vc);
}

Compile:

aarch64-linux-gnu-gcc -O3 simd.c
clang -O3 --target=aarch64-linux-gnu simd.c

On AArch64, Neon is part of the base architecture profile used by normal user-space targets. For AArch32, pass the correct FPU/architecture flags for the target CPU.

Arm SVE / SVE2

SVE is scalable SIMD. Code is written for an unknown runtime vector length, using predicates to control active lanes.

Typical form:

svadd_s32_m(pg, a, b)

Part	Meaning
`sv`	SVE intrinsic prefix.
`add`	Operation.
`s32`	Signed 32-bit lanes.
`_m`	Merge inactive lanes from the first input.
`_z`	Zero inactive lanes.
`_x`	Inactive lanes are don’t-care.
`pg`	Predicate controlling active lanes.

Minimal SVE loop shape:

#include <arm_sve.h>

void add_i32_sve(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svint32_t va = svld1_s32(pg, &a[i]);
        svint32_t vb = svld1_s32(pg, &b[i]);
        svint32_t vc = svadd_s32_m(pg, va, vb);
        svst1_s32(pg, &out[i], vc);
    }
}

Compile for an SVE-capable target:

aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+sve simd.c
clang -O3 --target=aarch64-linux-gnu -march=armv8.2-a+sve simd.c

Use SVE/SVE2 when you want code that scales across different vector lengths. Avoid hardcoding a fixed number of lanes in SVE loops.

Working With SIMD

Start Scalar

Write a clear scalar version first:

for (int i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
}

Then decide whether to use compiler auto-vectorization, explicit intrinsics, or a portable abstraction.

Check Auto-Vectorization

Compilers can vectorize simple loops when aliasing and control flow are clear.

GCC:

gcc -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed code.c

Clang:

clang -O3 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize code.c

Hints that help compilers:

void add(float *restrict out, const float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

Use restrict only when the pointers really do not alias.

Use Intrinsics For Control

Use intrinsics when you need:

exact instruction choice
shuffles, masks, saturating arithmetic, or special instructions
stable performance in a hot path
behavior the compiler will not auto-vectorize
reverse-engineering parity with compiler output

Avoid intrinsics when scalar code auto-vectorizes cleanly and performance is already good.

Handle Alignment And Tails

Most modern SIMD code can use unaligned loads:

Platform	Unaligned load pattern
x86	`_mm_loadu_`, `_mm256_loadu_`, `_mm512_loadu_*`
Neon	`vld1*` loads; alignment still matters for performance on some targets
SVE	`svld1*` with predicates for tails

Tail handling options:

Method	Use
Scalar cleanup loop	Simple and portable.
Masked load/store	AVX-512 and SVE are strong here.
Over-read within safe padding	Fast, but requires strict buffer guarantees.
Process fixed blocks only	Useful for codecs and cryptographic blocks.

Verify Correctness

Check scalar and SIMD outputs on edge cases:

empty input
length smaller than vector width
exact vector-width length
one element past vector width
unaligned input/output pointers
max/min integer values
NaN, infinity, signed zero for floating point

Measure

Use representative inputs and measure the whole hot loop, not just one instruction.

Useful tools:

Tool	Use
`perf stat`	Cycles, instructions, branches, cache misses.
`perf record/report`	Hot path profiling.
Compiler Explorer	Inspect generated assembly.
uops.info	x86 instruction throughput/latency.
Agner Fog tables	x86 throughput/latency and microarchitecture notes.
`llvm-mca`	Static throughput analysis for LLVM-supported targets.

Python Prototyping

NumPy arrays are a convenient way to prototype SIMD-like logic in Python while reversing vectorized code. They model lanes, masks, shifts, reinterpret casts, and fixed-width overflow behavior closely enough for quick experiments and sanity checks.

Python references:

Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
NumPy SIMD optimizations: https://numpy.org/devdocs/reference/simd/index.html

Basic lane model:

import numpy as np

u8 = np.array([0xff, 0x01, 0x7f, 0x80], dtype=np.uint8)
print((u8 + np.uint8(1)).astype(np.uint8))

Intrinsic-like helpers:

import numpy as np

# Load / Store
def mm_set1_ps(val: float):
    return np.full(4, val, dtype=np.float32)

def mm_set1_pd(val: float):
    return np.full(2, val, dtype=np.float64)

def mm_loadu_ps(arr):
    return np.array(arr, dtype=np.float32)

def mm_storeu_ps(dst, src):
    dst[:] = src.astype(np.float32)

# Arithmetic
def mm_add_ps(a, b): return a + b
def mm_sub_ps(a, b): return a - b
def mm_mul_ps(a, b): return a * b
def mm_div_ps(a, b): return a / b

def mm_add_pd(a, b): return a + b
def mm_sub_pd(a, b): return a - b
def mm_mul_pd(a, b): return a * b
def mm_div_pd(a, b): return a / b

# Comparisons
def mm_cmpgt_ps(a, b): return (a > b).astype(np.float32)
def mm_cmplt_ps(a, b): return (a < b).astype(np.float32)
def mm_cmpeq_ps(a, b): return (a == b).astype(np.float32)

# Bitwise / Logical
def mm_and_ps(a, b):
    return np.bitwise_and(a.view(np.int32), b.view(np.int32)).view(np.float32)

def mm_or_ps(a, b):
    return np.bitwise_or(a.view(np.int32), b.view(np.int32)).view(np.float32)

def mm_xor_ps(a, b):
    return np.bitwise_xor(a.view(np.int32), b.view(np.int32)).view(np.float32)

# Shifts / Casts
def mm_castsi128_ps(a): return a.view(np.float32)
def mm_castps_si128(a): return a.view(np.int32)

def mm_srli_epi32(a, imm):
    return (a.astype(np.uint32) >> imm).astype(np.int32)

def mm_slli_epi32(a, imm):
    return (a.astype(np.uint32) << imm).astype(np.int32)

# Math helpers: not exact intrinsics, but common when modeling algorithms.
def mm_abs_ps(a): return np.abs(a)
def mm_max_ps(a, b): return np.maximum(a, b)
def mm_min_ps(a, b): return np.minimum(a, b)

def mm_sqrt_ps(a): return np.sqrt(a)
def mm_rcp_ps(a): return 1.0 / a
def mm_rsqrt_ps(a): return 1.0 / np.sqrt(a)

Example use:

a = mm_loadu_ps([1, 2, 3, 4])
b = mm_set1_ps(10)
out = np.zeros(4, dtype=np.float32)

mm_storeu_ps(out, mm_add_ps(a, b))
print(out)  # [11. 12. 13. 14.]

mask = mm_cmpgt_ps(out, mm_set1_ps(12))
print(mask) # [0. 0. 1. 1.]

Python model limits:

NumPy does not model instruction latency or port pressure.
NumPy may use SIMD internally, but the Python code is not a direct instruction trace.
Integer overflow must be modeled with explicit dtypes.
Floating-point results can differ if the real code uses approximations like reciprocal or reciprocal-square-root instructions.

Reversing SIMD Code

x86 Registers

Register	Width	Common ISA
`xmm0`..`xmm31`	128-bit	SSE, AVX, AVX-512VL
`ymm0`..`ymm31`	256-bit	AVX, AVX2
`zmm0`..`zmm31`	512-bit	AVX-512
`k0`..`k7`	mask registers	AVX-512

Useful GDB:

info all-registers
info registers xmm0 ymm0 zmm0
p $xmm0
x/16xb $rsp
x/8i $pc

Common x86 instruction prefixes:

Prefix / mnemonic shape	Meaning
`p*`	Packed integer, older SSE style.
`v*`	VEX/EVEX encoded AVX/AVX-512 style.
`addps`, `mulps`	Packed single-float operations.
`paddd`, `v paddd`	Packed integer add dword lanes.
`pshuf`, `vperm`	Shuffle or permute.
`pcmpeq`, `pcmpgt`	Packed comparisons.
`pmovmskb`, `movmsk*`	Extract sign bits to scalar mask.
`vpblend`, `vpternlog`	Blend/boolean selection.

Arm Registers

Register	Use
`v0`..`v31`	AArch64 SIMD/floating-point registers.
`q0`..`q31`	128-bit view of `v` registers.
`d0`..`d31`	64-bit view.
`s0`..`s31`	32-bit view.
`h0`..`h31`	16-bit view.
`b0`..`b31`	8-bit view.
`z0`..`z31`	SVE scalable vector registers.
`p0`..`p15`	SVE predicate registers.

Common Arm patterns:

Pattern	Meaning
`ld1` / `st1`	Load/store vector lanes.
`add`, `sub`, `mul`	Lane-wise arithmetic.
`mla`, `fmla`	Multiply-add.
`tbl`, `tbx`	Table lookup/shuffle.
`uzp`, `zip`, `trn`	Unzip/zip/transpose lane interleaving.
`sq`, `uq`	Signed/unsigned saturating operations.
`cnt`	Count bits per byte.
`whilelt`, `ptrue`	SVE predicate setup.

Porting Between x86 And Arm

Direct one-to-one mappings do not always exist. Start from behavior:

Identify lane type and vector width.
Identify operation: arithmetic, compare, shuffle, load/store, mask, reduction.
Write a scalar or NumPy model.
Search the target architecture intrinsic reference.
Verify edge cases and generated assembly.

Useful options:

Option	Use
Rewrite with native Neon/SVE intrinsics	Best for performance and clarity on Arm.
Use SIMDe	Good for porting existing x86 intrinsic code to Arm or other platforms.
Use compiler vector extensions	Good for simple portable operations.
Keep scalar + auto-vectorization	Good when compiler produces clean SIMD.

SIMDe example:

#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/avx2.h>

__m256i add8(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}

On x86, SIMDe can map to native x86 intrinsics. On Arm, it can map many x86 APIs to Neon or other available implementations.

Quick Build Flags

Target	GCC/Clang flags
Native current CPU	`-O3 -march=native`
x86 SSE2	`-O3 -msse2`
x86 SSSE3	`-O3 -mssse3`
x86 SSE4.2	`-O3 -msse4.2`
x86 AVX	`-O3 -mavx`
x86 AVX2	`-O3 -mavx2 -mfma`
x86 AVX-512 baseline	`-O3 -mavx512f`
AArch64 generic	`-O3 --target=aarch64-linux-gnu`
AArch64 SVE	`-O3 -march=armv8.2-a+sve`
AArch64 SVE2	`-O3 -march=armv9-a+sve2`
Arm MVE	Target-specific Cortex-M flags, usually with MVE/Helium-enabled `-mcpu=`

Use runtime dispatch when shipping binaries to unknown CPUs. Do not compile a public binary with -march=native unless the deployment CPU is fixed.

References

Intrinsics

What Intrinsics Are

Core Terms

Main SIMD Families

Where To Find Intrinsics

How To Read Intrinsic Names

x86

Arm Neon

Arm SVE / SVE2

Working With SIMD

Start Scalar

Check Auto-Vectorization

Use Intrinsics For Control

Handle Alignment And Tails

Verify Correctness

Measure

Python Prototyping

Reversing SIMD Code

x86 Registers

Arm Registers

Porting Between x86 And Arm

Quick Build Flags

References

x86

Arm

Compilers And Libraries