What Intrinsics Are

Intrinsics are compiler-provided functions that expose CPU operations without writing inline assembly. SIMD intrinsics are the common case: one operation works on multiple lanes inside a vector register.

SIMD means Single Instruction, Multiple Data:

scalar:  c0 = a0 + b0
         c1 = a1 + b1
         c2 = a2 + b2
         c3 = a3 + b3

SIMD:    c[0..3] = a[0..3] + b[0..3]

Intrinsics are useful when reversing optimized binaries because compiler output often maps back to recognizable operations: loads, stores, packed arithmetic, comparisons, shuffles, masks, and reductions.

Core Terms

TermMeaning
LaneOne element inside a vector register.
Vector widthTotal register width, such as 128, 256, or 512 bits.
Element widthPer-lane type width, such as 8, 16, 32, or 64 bits.
Packed integerMultiple integer lanes in one vector.
Packed floatMultiple floating-point lanes in one vector.
IntrinsicC/C++ function-like wrapper that maps to one or more SIMD instructions.
Auto-vectorizationCompiler turns scalar loops into SIMD automatically.
Horizontal operationReduces across lanes, such as sum of all lanes.
Saturating arithmeticClamps on overflow instead of wrapping. Common in media code.
Shuffle/permuteReorders lanes within or across vectors.
Mask/predicatePer-lane enable/disable control for an operation.

Main SIMD Families

PlatformFamiliesRegister modelWhere to look
x86/x86-64MMX, SSE, SSE2, SSSE3, SSE4, AVX, AVX2, AVX-512, AVX10Fixed width: XMM 128-bit, YMM 256-bit, ZMM 512-bitIntel Intrinsics Guide, Intel/AMD manuals, uops.info
Arm A/R-profileNeon / Advanced SIMDFixed width: 64-bit and 128-bit vectorsArm Neon Intrinsics Reference, ACLE
Arm A-profileSVE / SVE2Scalable vector length, predicate-drivenArm ACLE SVE docs, Arm Learning Paths
Arm M-profileMVE / HeliumEmbedded/Microcontroller vector extensionArm MVE Intrinsics Reference
RISC-VRVVScalable vector lengthRISC-V Vector spec and compiler docs
PowerVSX/AltiVecFixed-width vectorsIBM/OpenPOWER docs

Where To Find Intrinsics

NeedSite
x86 SSE/AVX/AVX-512 intrinsicshttps://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Microsoft x86 intrinsic nameshttps://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
Arm Neon intrinsicshttps://arm-software.github.io/acle/neon_intrinsics/advsimd.html
Arm ACLE, Neon, SVE/SVE2 language extensionshttps://arm-software.github.io/acle/main/acle.html
Arm MVE / Helium intrinsicshttps://arm-software.github.io/acle/mve_intrinsics/
Arm SIMD and Intrinsics Search Engine hubhttps://developer.arm.com/servers-and-cloud-computing/arm-simd
Arm Neon overview and guideshttps://www.arm.com/technologies/neon
Arm learning material for Neon/SVE/SMEhttps://www.arm.com/architecture/learn-the-architecture/a-profile
Arm SIMD learning pathshttps://learn.arm.com/learning-paths/cross-platform/simd-loops/1-about/
Arm intrinsics porting pathshttps://learn.arm.com/learning-paths/cross-platform/intrinsics/
GCC vector extensionshttps://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
Clang vector extensions and builtinshttps://clang.llvm.org/docs/LanguageExtensions.html
NumPy CPU/SIMD optimization docshttps://numpy.org/devdocs/reference/simd/index.html
Portable intrinsics with SIMDehttps://github.com/simd-everywhere/simde
Compile and inspect generated assemblyhttps://godbolt.org
x86 latency/throughput datahttps://uops.info
x86 optimization manuals and instruction tableshttps://www.agner.org/optimize/

How To Read Intrinsic Names

x86

Typical form:

_mm256_add_epi32
PartMeaning
_mmIntel intrinsic namespace.
256Vector width: 256-bit YMM. Empty usually means 128-bit XMM.
addOperation.
epi32Packed signed/unsigned 32-bit integer lanes.

Common x86 suffixes:

SuffixMeaning
psPacked single-precision floats, usually float32.
pdPacked double-precision floats, usually float64.
ssScalar single-precision float.
sdScalar double-precision float.
epi8, epi16, epi32, epi64Packed integer lanes.
si128, si256, si512Raw integer vector register.
u8, u16, u32, u64Unsigned integer variants in some intrinsic names.
mask, maskzAVX-512 masked operation; maskz zeroes inactive lanes.

Common headers:

HeaderUse
<immintrin.h>Most modern Intel/AMD intrinsics.
<xmmintrin.h>SSE.
<emmintrin.h>SSE2.
<tmmintrin.h>SSSE3.
<smmintrin.h>SSE4.1.
<nmmintrin.h>SSE4.2.

Minimal x86 example:

#include <immintrin.h>

void add8_i32(const int *a, const int *b, int *out) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi32(va, vb);
    _mm256_storeu_si256((__m256i *)out, vc);
}

Compile:

gcc -O3 -mavx2 simd.c
clang -O3 -mavx2 simd.c

Arm Neon

Typical form:

vaddq_s32
PartMeaning
vVector intrinsic prefix.
addOperation.
q128-bit vector form. Without q, many Neon intrinsics use 64-bit vectors.
s32Signed 32-bit integer lanes.

Common Neon suffixes:

SuffixMeaning
s8, s16, s32, s64Signed integer lanes.
u8, u16, u32, u64Unsigned integer lanes.
f16, f32, f64Floating-point lanes.
p8, p16, p64Polynomial lanes.
q128-bit vector form.
nOne operand is a scalar immediate/value.
laneUses a lane from another vector.

Use:

#include <arm_neon.h>

void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    int32x4_t vc = vaddq_s32(va, vb);
    vst1q_s32(out, vc);
}

Compile:

aarch64-linux-gnu-gcc -O3 simd.c
clang -O3 --target=aarch64-linux-gnu simd.c

On AArch64, Neon is part of the base architecture profile used by normal user-space targets. For AArch32, pass the correct FPU/architecture flags for the target CPU.

Arm SVE / SVE2

SVE is scalable SIMD. Code is written for an unknown runtime vector length, using predicates to control active lanes.

Typical form:

svadd_s32_m(pg, a, b)
PartMeaning
svSVE intrinsic prefix.
addOperation.
s32Signed 32-bit lanes.
_mMerge inactive lanes from the first input.
_zZero inactive lanes.
_xInactive lanes are don’t-care.
pgPredicate controlling active lanes.

Minimal SVE loop shape:

#include <arm_sve.h>

void add_i32_sve(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svint32_t va = svld1_s32(pg, &a[i]);
        svint32_t vb = svld1_s32(pg, &b[i]);
        svint32_t vc = svadd_s32_m(pg, va, vb);
        svst1_s32(pg, &out[i], vc);
    }
}

Compile for an SVE-capable target:

aarch64-linux-gnu-gcc -O3 -march=armv8.2-a+sve simd.c
clang -O3 --target=aarch64-linux-gnu -march=armv8.2-a+sve simd.c

Use SVE/SVE2 when you want code that scales across different vector lengths. Avoid hardcoding a fixed number of lanes in SVE loops.

Working With SIMD

Start Scalar

Write a clear scalar version first:

for (int i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
}

Then decide whether to use compiler auto-vectorization, explicit intrinsics, or a portable abstraction.

Check Auto-Vectorization

Compilers can vectorize simple loops when aliasing and control flow are clear.

GCC:

gcc -O3 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed code.c

Clang:

clang -O3 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize code.c

Hints that help compilers:

void add(float *restrict out, const float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

Use restrict only when the pointers really do not alias.

Use Intrinsics For Control

Use intrinsics when you need:

  • exact instruction choice
  • shuffles, masks, saturating arithmetic, or special instructions
  • stable performance in a hot path
  • behavior the compiler will not auto-vectorize
  • reverse-engineering parity with compiler output

Avoid intrinsics when scalar code auto-vectorizes cleanly and performance is already good.

Handle Alignment And Tails

Most modern SIMD code can use unaligned loads:

PlatformUnaligned load pattern
x86_mm_loadu_*, _mm256_loadu_*, _mm512_loadu_*
Neonvld1* loads; alignment still matters for performance on some targets
SVEsvld1* with predicates for tails

Tail handling options:

MethodUse
Scalar cleanup loopSimple and portable.
Masked load/storeAVX-512 and SVE are strong here.
Over-read within safe paddingFast, but requires strict buffer guarantees.
Process fixed blocks onlyUseful for codecs and cryptographic blocks.

Verify Correctness

Check scalar and SIMD outputs on edge cases:

  • empty input
  • length smaller than vector width
  • exact vector-width length
  • one element past vector width
  • unaligned input/output pointers
  • max/min integer values
  • NaN, infinity, signed zero for floating point

Measure

Use representative inputs and measure the whole hot loop, not just one instruction.

Useful tools:

ToolUse
perf statCycles, instructions, branches, cache misses.
perf record/reportHot path profiling.
Compiler ExplorerInspect generated assembly.
uops.infox86 instruction throughput/latency.
Agner Fog tablesx86 throughput/latency and microarchitecture notes.
llvm-mcaStatic throughput analysis for LLVM-supported targets.

Python Prototyping

NumPy arrays are a convenient way to prototype SIMD-like logic in Python while reversing vectorized code. They model lanes, masks, shifts, reinterpret casts, and fixed-width overflow behavior closely enough for quick experiments and sanity checks.

Python references:

Basic lane model:

import numpy as np

u8 = np.array([0xff, 0x01, 0x7f, 0x80], dtype=np.uint8)
print((u8 + np.uint8(1)).astype(np.uint8))

Intrinsic-like helpers:

import numpy as np

# Load / Store
def mm_set1_ps(val: float):
    return np.full(4, val, dtype=np.float32)

def mm_set1_pd(val: float):
    return np.full(2, val, dtype=np.float64)

def mm_loadu_ps(arr):
    return np.array(arr, dtype=np.float32)

def mm_storeu_ps(dst, src):
    dst[:] = src.astype(np.float32)

# Arithmetic
def mm_add_ps(a, b): return a + b
def mm_sub_ps(a, b): return a - b
def mm_mul_ps(a, b): return a * b
def mm_div_ps(a, b): return a / b

def mm_add_pd(a, b): return a + b
def mm_sub_pd(a, b): return a - b
def mm_mul_pd(a, b): return a * b
def mm_div_pd(a, b): return a / b

# Comparisons
def mm_cmpgt_ps(a, b): return (a > b).astype(np.float32)
def mm_cmplt_ps(a, b): return (a < b).astype(np.float32)
def mm_cmpeq_ps(a, b): return (a == b).astype(np.float32)

# Bitwise / Logical
def mm_and_ps(a, b):
    return np.bitwise_and(a.view(np.int32), b.view(np.int32)).view(np.float32)

def mm_or_ps(a, b):
    return np.bitwise_or(a.view(np.int32), b.view(np.int32)).view(np.float32)

def mm_xor_ps(a, b):
    return np.bitwise_xor(a.view(np.int32), b.view(np.int32)).view(np.float32)

# Shifts / Casts
def mm_castsi128_ps(a): return a.view(np.float32)
def mm_castps_si128(a): return a.view(np.int32)

def mm_srli_epi32(a, imm):
    return (a.astype(np.uint32) >> imm).astype(np.int32)

def mm_slli_epi32(a, imm):
    return (a.astype(np.uint32) << imm).astype(np.int32)

# Math helpers: not exact intrinsics, but common when modeling algorithms.
def mm_abs_ps(a): return np.abs(a)
def mm_max_ps(a, b): return np.maximum(a, b)
def mm_min_ps(a, b): return np.minimum(a, b)

def mm_sqrt_ps(a): return np.sqrt(a)
def mm_rcp_ps(a): return 1.0 / a
def mm_rsqrt_ps(a): return 1.0 / np.sqrt(a)

Example use:

a = mm_loadu_ps([1, 2, 3, 4])
b = mm_set1_ps(10)
out = np.zeros(4, dtype=np.float32)

mm_storeu_ps(out, mm_add_ps(a, b))
print(out)  # [11. 12. 13. 14.]

mask = mm_cmpgt_ps(out, mm_set1_ps(12))
print(mask) # [0. 0. 1. 1.]

Python model limits:

  • NumPy does not model instruction latency or port pressure.
  • NumPy may use SIMD internally, but the Python code is not a direct instruction trace.
  • Integer overflow must be modeled with explicit dtypes.
  • Floating-point results can differ if the real code uses approximations like reciprocal or reciprocal-square-root instructions.

Reversing SIMD Code

x86 Registers

RegisterWidthCommon ISA
xmm0..xmm31128-bitSSE, AVX, AVX-512VL
ymm0..ymm31256-bitAVX, AVX2
zmm0..zmm31512-bitAVX-512
k0..k7mask registersAVX-512

Useful GDB:

info all-registers
info registers xmm0 ymm0 zmm0
p $xmm0
x/16xb $rsp
x/8i $pc

Common x86 instruction prefixes:

Prefix / mnemonic shapeMeaning
p*Packed integer, older SSE style.
v*VEX/EVEX encoded AVX/AVX-512 style.
addps, mulpsPacked single-float operations.
paddd, v padddPacked integer add dword lanes.
pshuf*, vperm*Shuffle or permute.
pcmpeq*, pcmpgt*Packed comparisons.
pmovmskb, movmsk*Extract sign bits to scalar mask.
vpblend*, vpternlog*Blend/boolean selection.

Arm Registers

RegisterUse
v0..v31AArch64 SIMD/floating-point registers.
q0..q31128-bit view of v registers.
d0..d3164-bit view.
s0..s3132-bit view.
h0..h3116-bit view.
b0..b318-bit view.
z0..z31SVE scalable vector registers.
p0..p15SVE predicate registers.

Common Arm patterns:

PatternMeaning
ld1 / st1Load/store vector lanes.
add, sub, mulLane-wise arithmetic.
mla, fmlaMultiply-add.
tbl, tbxTable lookup/shuffle.
uzp, zip, trnUnzip/zip/transpose lane interleaving.
sq*, uq*Signed/unsigned saturating operations.
cntCount bits per byte.
whilelt, ptrueSVE predicate setup.

Porting Between x86 And Arm

Direct one-to-one mappings do not always exist. Start from behavior:

  1. Identify lane type and vector width.
  2. Identify operation: arithmetic, compare, shuffle, load/store, mask, reduction.
  3. Write a scalar or NumPy model.
  4. Search the target architecture intrinsic reference.
  5. Verify edge cases and generated assembly.

Useful options:

OptionUse
Rewrite with native Neon/SVE intrinsicsBest for performance and clarity on Arm.
Use SIMDeGood for porting existing x86 intrinsic code to Arm or other platforms.
Use compiler vector extensionsGood for simple portable operations.
Keep scalar + auto-vectorizationGood when compiler produces clean SIMD.

SIMDe example:

#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/avx2.h>

__m256i add8(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}

On x86, SIMDe can map to native x86 intrinsics. On Arm, it can map many x86 APIs to Neon or other available implementations.

Quick Build Flags

TargetGCC/Clang flags
Native current CPU-O3 -march=native
x86 SSE2-O3 -msse2
x86 SSSE3-O3 -mssse3
x86 SSE4.2-O3 -msse4.2
x86 AVX-O3 -mavx
x86 AVX2-O3 -mavx2 -mfma
x86 AVX-512 baseline-O3 -mavx512f
AArch64 generic-O3 --target=aarch64-linux-gnu
AArch64 SVE-O3 -march=armv8.2-a+sve
AArch64 SVE2-O3 -march=armv9-a+sve2
Arm MVETarget-specific Cortex-M flags, usually with MVE/Helium-enabled -mcpu=

Use runtime dispatch when shipping binaries to unknown CPUs. Do not compile a public binary with -march=native unless the deployment CPU is fixed.

References

x86

Arm

Compilers And Libraries