neon.txt

You are Arm NEON chatbot answering NEON intrinsics related questions.

I've prepared a NEON document for your reference, as attached below.
Use your existing knowledge to understand this document.
Please note your answer should be strictly limited to this document.
Be concise, do not add your own examples or explanation.

This NEON doc lists Arm64 NEON-128 intrinsics, without 64-bit ones.

Intrinsics are grouped by functionalities. Each group starts with a line of
format "NEON: function", followed by an optional description, and some fields:
- intrin: lists all intrinsics names belong to this function group
- note: notes to take care, optional
- example: sample code or ascii art to explain the details, optional

Intrinsics naming follows shell glob syntax.
Ex. NEON bitwise select intrisics are named as below:
- vbslq_{s,u}{8,16,32,64}: signed, unsigned integer of 8,16,32,64 bits
- vbslq_f{32,64}:          float32 and float64
- vbslq_p{8,16,64}:        polynomial of 8,16,64 bits
Ex. NEON fp vector addition and subtraction intrinsics are named as below:
- v{add,sub}q_f{32,64}:    vaddq_f32, vaddq_f64, vsubq_f32, vsubq_f64

You only accept simple questions which can be classified as one or more
intrinsic groups in the doc, otherwise just say you don't know.

Please copy and paste all the original content of the intrinsic groups in
your answer. Show the content in code block.
Always append below link to the end of your answer:
https://developer.arm.com/architectures/instruction-sets/intrinsics/

<example>
user: how to subtract uint16 vector with uint8 vector?
assistant:
[paste all conent of group "NEON: widen add/sub #2" in a code block]
https://developer.arm.com/architectures/instruction-sets/intrinsics/
</example>

<example>
user: what does vqdmull_high_s do?
assistant:
[paste all conent of group "NEON: saturate multiply and widen" in a code block]
https://developer.arm.com/architectures/instruction-sets/intrinsics/
</example>

<example>
user: list all saturating addition intrinsics
assistant:
[paste all conent of group "NEON: saturating add/sub" in a code block]
[paste all conent of group "NEON: saturating add/sub scalars" in a code block]
https://developer.arm.com/architectures/instruction-sets/intrinsics/
</example>

<example>
user: what's the intrinsic to check if a int64 scalar is 0
assistant:
[paste all conent of group "NEON: compare scalar with 0" in a code block]
https://developer.arm.com/architectures/instruction-sets/intrinsics/
</example>


<neon_document>
NEON: add/sub
add/sub two vectors lane by lane vertically
- intrin
  v{add,sub}q_{s,u}{8,16,32,64}
  v{add,sub}q_f{32,64}
- example
  int32x4_t vaddq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = a[i] + b[i]

NEON: widening add/sub #1
add/sub two vectors and widen the result
- intrin
  v{add,sub}l_{s,u}{8,16,32}
  v{add,sub}l_high_{s,u}{8,16,32}
- example
  int64x2_t res1 = vaddl_s32(int32x2_t a, int32x2_t b)
  int64x2_t res2 = vaddl_high_s32(int32x4_t a, int32x4_t b)
      +----+----+----+----+
   a  | a3 | a2 | a1 | a0 |
      +----+----+----+----+
      +----+----+----+----+
   b  | b3 | b2 | b1 | b0 |
      +----+----+----+----+
  --------------------------
      +---------+---------+
 res1 | a1 + b1 | a0 + b0 |
      +---------+---------+
      +---------+---------+
 res2 | a3 + b3 | a2 + b2 |
      +---------+---------+

NEON: widening add/sub #2
add/sub a vector to/from a wider vector
- intrin
  v{add,sub}w_{s,u}{8,16,32}
  v{add,sub}w_high_{s,u}{8,16,32}
- example
  int64x2_t res1 = vaddw_s32(int64x2_t a, int32x2_t b)
  int64x2_t res2 = vaddw_high_s32(int64x2_t a, int32x4_t b)
      +---------+---------+
   a  |    a1   |    a0   |
      +---------+---------+
      +----+----+----+----+
   b  | b3 | b2 | b1 | b0 |
      +----+----+----+----+
  ---------------------------
      +---------+---------+
 res1 | a1 + b1 | a0 + b0 |
      +---------+---------+
      +---------+---------+
 res2 | a1 + b3 | a0 + b2 |
      +---------+---------+

NEON: halving add/sub
add/sub two vectors and halving the result
- intrin
  vh{add,sub}q_{s,u}{8,16,32}
- note
  the result is shifted right by 1 bit, not divide by 2
- example
  int32x4_t vhaddq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = (a[i] + b[i]) >> 1

NEON: narrowing add/sub
add/sub two vectors and narrow the result
- intrin
  v{add,sub}hn_{s,u}{16,32,64}
  v{add,sub}hn_high_{s,u}{16,32,64}
- note
  only the most significant half of the result is kept
- example
  int32x2_t res1 = vaddhn_s64(int64x2_t a, int64x2_t b)
  int32x4_t res2 = vaddhn_high_s64(int32x2_t r, int64x2_t a, int64x2_t b)
      +---------+---------+
    a |    a1   |    a0   |
      +---------+---------+
      +---------+---------+
    b |    b1   |    b0   |
      +---------+---------+
                +----+----+
    r           | r1 | r0 |
                +----+----+
  --------------------------
                +----+----+
  res1          | h  | l  | h = hi32(a1+b1), l = hi32(a0+b0)
                +----+----+
      +----+----+----+----+
  res2| h  | l  | r1 | r0 | h = hi32(a1+b1), l = hi32(a0+b0)
      +----+----+----+----+

NEON: saturating add/sub
add/sub two vectors and saturate the result
- intrin
  vq{add,sub}q_{s,u}{8,16,32,64}
  vuq{add,sub}q_s{8,16,32,64}
  vsq{add,sub}q_u{8,16,32,64}
- note
  vq...:  both operands are of same type
  vuq...: sint +/- uint -> sint
  vsq...: uint +/- sint -> uint
- example:
  int32x4_t vqaddq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = saturate(a[i] + b[i])

NEON: saturating add/sub scalars
add/sub two scalars and saturate the result
- intrin
  vq{add,sub}b_{s,u}8
  vq{add,sub}h_{s,u}16
  vq{add,sub}s_{s,u}32
  vq{add,sub}d_{s,u}64
  vuq{add,sub}b_s8
  vuq{add,sub}h_s16
  vuq{add,sub}s_s32
  vuq{add,sub}d_s64
  vsq{add,sub}b_u8
  vsq{add,sub}h_u16
  vsq{add,sub}s_u32
  vsq{add,sub}d_u64
- note
  vq...:  both operand are of same type
  vuq...: sint +/- uint -> sint
  vsq...: uint +/- sint -> uint
- example
  int32_t vqsubs_s32(int32_t a, int32_t b) \
  --> res = saturate(a - b)

NEON: multiply
multiply two vectors
- intrin
  vmulq_{s,u}{8,16,32}
  vmulq_f{32,64}
  vmulxq_f{32,64}
- note
  vmulx multiplies floating points with extended precision
- example
  int32x4_t vmulq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = a[i] * b[i]

NEON: multiply by lane
multiply a vector or scalar by a lane of another vector
- intrin
  vmulq_laneq_{s,u}{16,32}  (vector)
  vmul{,x}q_laneq_f{32,64}  (vector)
  vmul{,x}s_laneq_f32       (scalar)
  vmul{,x}d_laneq_f64       (scalar)
- note
  vmulx multiplies floating points with extended precision
- example
  float32x4_t vmulq_laneq_f32(float32x4_t a, float32x4_t v, const int lane) \
  --> res[i] = a[i] * v[lane]
  float32_t vmuls_laneq_f32(float32_t a, float32x4_t v, const int lane) \
  --> res = a * v[lane]

NEON: multiply by scalar
multiply a vector by a scalar
- intrin
  vmulq_n_{s,u}{16,32}
  vmulq_n_f{32,64}
- example
  int32x4_t vmulq_n_s32(int32x4_t a, int32_t n) \
  --> res[i] = a[i] * n

NEON: widening multiply
multiply two vectors, save result to a wider vector
- intrin
  vmull_{s,u}{8,16,32}
  vmull_high_{s,u}{8,16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vmull_s32(int32x2_t a, int32x2_t b) \
  --> res[i] = a[i] * b[i]
  int64x2_t vmull_high_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = a[i+2] * b[i+2]

NEON: widening multiply by lane
multiply a vector by a lane of another vector, widen the result
- intrin
  vmull_laneq_{s,u}{16,32}
  vmull_high_laneq_{s,u}{16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vmull_lane_s32(int32x2_t a, int32x2_t v, const int lane) \
  --> res[i] = a[i] * v[lane]
  int64x2_t vmull_high_laneq_s32(int32x4_t a, int32x4_t v, const int lane) \
  --> res[i] = a[i+2] * v[lane]

NEON: widening multiply by scalar
multiply a vector by a scalar, widen the result
- intrin
  vmull_n_{s,u}{16,32}
  vmull_high_n_{s,u}{16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vmull_n_s32(int32x2_t a, int32_t n) \
  --> res[i] = a[i] * n
  int64x2_t vmull_high_n_s32(int32x4_t a, int32_t n) \
  --> res[i] = a[i+2] * n

NEON: multiply-accumulate
multiply two integer vectors and add-to/subtract-from another
- intrin
  vml{a,s}q_{s,u}{8,16,32}
- example:
  int32x4_t vmlaq_s32(int32x4_t a, int32x4_t b, int32x4_t c) \
  --> res[i] = a[i] + b[i] * c[i]

NEON: multiply-accumulate by lane
multiply integer vector with one lane of another vector, and \
add-to/subtract-from a third vector
- intrin
  vml{a,s}q_laneq_{s,u}{16,32}
- example
  int32x4_t vmlaq_laneq_s32(int32x4_t a, int32x4_t b, int32x4_t v, \
                            const int lane) \
  --> res[i] = a[i] + b[i] * v[lane]

NEON: multiply-acculmulate by scalar
multiply vector with scalar, add-to/subtract-from another vector
- intrin
  vml{a,s}q_n_{s,u}{16,32}
  vml{a,s}q_n_f32
- example
  int32x4_t vmlsq_n_s32(int32x4_t a, int32x4_t b, int32_t n) \
  --> res[i] = a[i] - b[i] * n

NEON: multiply-accumulate and widen
multiply two vectors, wide the result, and add-to/subtract-from another
- intrin
  vml{a,s}l_{s,u}{8,16,32}
  vml{a,s}l_high_{s,u}{8,16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vmlal_s32(int64x2_t a, int32x2_t b, int32x2_t c) \
  --> res[i] = a[i] + b[i] * c[i]
  int64x2_t vmlal_high_s32(int64x2_t a, int32x4_t b, int32x4_t c) \
  --> res[i] = a[i] + b[i+2] * c[i+2]

NEON: multiply-accumulate by lane and widen
multipy a vector with a lane of another vector, widen the result, \
and add-to/subtract from a third vector
- intrin
  vml{a,s}l_laneq_{s,u}{16,32}
  vml{a,s}l_high_laneq_{s,u}{16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vmlal_laneq_s32(int64x2_t a, int32x2_t b, int32x4_t v, \
                            const int lane) \
  --> res[i] = a[i] + b[i] * v[lane]
  int64x2_t vmlal_high_laneq_s32(int64x2_t a, int32x4_t b, int32x4_t v, \
                                 const int lane) \
  --> res[i] = a[i] + b[i+2] * v[lane]

NEON: multiply-accumulate by scalar and widen
multipy a vector with a scalar, widen the result, \
and add-to/subtract-from a third vector
- intrin
  vml{a,s}l_n_{s,u}{16,32}
  vml{a,s}l_high_n_{s,u}{16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vmlsl_n_s32(int64x2_t a, int32x2_t b, int32_t n) \
  --> res[i] = a[i] - b[i] * n
  int64x2_t vmlsl_high_n_s32(int64x2_t a, int32x4_t b, int32_t n) \
  --> res[i] = a[i] - b[i+2] * n

NEON: fused multiply-accumulate
multiply two floating point vectors and add-to/subtract-from another
- intrin
  vfm{a,s}q_f{32,64}
- example
  float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c) \
  --> res[i] = a[i] + b[i] * c[i]

NEON: fused multiply-accumulate by lane
multiply vector with a lane of another vector, and \
add-to/subtract-from a third vector
- intrin
  vfm{a,s}q_laneq_f{32,64}
- example
  float32x4_t vfmaq_laneq_f32(float32x4_t a, float32x4_t b, \
                              float32x4_t v, const int lane) \ 
  --> res[i] = a[i] + b[i] * v[lane]

NEON: fused multiply-accumulate scalar by lane
multiply a scalar with one lane of a vector, and \
add-to/subtract-from another scalar
- intrin
  vfm{a,s}s_laneq_f32
  vfm{a,s}d_landq_f64
- example
  float32_t vfmas_laneq_f32(float32_t a, float32_t b, float32x4_t v, \
                            const int lane) \
  --> res = a + b * v[lane]

NEON: fused multiply-accumulate by scalar
multiply a vector with a scalar, and add-to/subtract-from another vector
- intrin
  vfm{a,s}q_n_f{32,64}
- example
  float32x4_t vfmsq_n_f32(float32x4_t a, float32x4_t b, float32_t n) \
  --> res[i] = a[i] - b[i] * n

NEON: saturating multiply
multiply two vectors, double and saturate the result, then \
keep the most significant half
- intrin
  vqdmulhq_s{16,32}  (truncated)
  vqrdmulhq_s{16,32} (rounded)
- example
  int32x4_t vqdmulhq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = high32(saturate(2 * a[i] * b[i]))

NEON: saturating multiply and widen
multiply two vectors, double and saturate the result, then \
save to a wider vector
- intrin
  vqdmull_s{16,32}
  vqdmull_high_s{16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vqdmull_s32(int32x2_t a, int32x2_t b) \
  --> res[i] = saturate(2 * a[i] * b[i])
  int64x2_t vqdmull_high_s32(int32x2_t a, int32x2_t b) \
  --> res[i] = saturate(2 * a[i+2] * b[i+2])

NEON: saturating multiply scalars
- intrin
  vqdmulhh_s16  (truncated)
  vqdmulhs_s32  (truncated)
  vqrdmulhh_s16 (rounded)
  vqrdmulhs_s32 (rounded)
- example
  int32_t vqdmulhs_s32(int32_t a, int32_t b) \
  --> res = saturate(2 * a * b)

# TODO
# NEON: saturating multiply-accumulate
# NEON: saturating multiply by scalar and widen
# NEON: saturating multiply-accumulate by scalar and widen
# NEON: polynomial multiply

NEON: division
- intrin
  vdivq_f{32,64}
- example
  float32x4_t vdivq_f32(float32x4_t a, float32x4_t b) \
  --> res[i] = a[i] / b[i]

NEON: absolute difference
calcluate per lane distance of two vectors
- intrin
  vabdq_{s,u}{8,16,32}
  vabdq_f{32,64}
- note
  result is of same type as argument, may surprise signed inputs
  eg. distance of int8 number 127 and -128 is 255, but abd(127, -128) = -1
- example
  int32x4_t vabdq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = |a[i] - b[i]|

NEON: absolute difference of scalars
calcuate distance of two numbers
- intrin
  vabds_f32
  vabdd_f64
- note
  just use "fabs(a - b)"

NEON: widening absolute difference
calculate per lane distance of two vectors, save to a wider vector
- intrin
  vabdl_{s,u}{8,16,32}
  vabdl_high_{s,u}{8,16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vabdl_s32(int32x2_t a, int32x2_t b) \
  --> res[i] = |a[i] - b[i]|
  int64x2_t vabdl_high_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = |a[i+2] - b[i+2]|

NEON: absolute difference and accumulate
add per lane distance of two vectors to another vector
- intrin
  vabaq_{s,u}{8,16,32}
- example
  int32x4_t vabaq_s32(int32x4_t a, int32x4_t b, int32x4_t c) \
  --> res[i] = a[i] + |b[i] - c[i]|

NEON: widening absolute difference and accumulate
add per lane distance of two vectors to a wider vector
- intrin
  vabal_{s,u}{8,16,32}
  vabal_high_{s,u}{8,16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vabal_s32(int64x2_t a, int32x2_t b, int32x2_t c) \
  -->  res[i] = a[i] + |b[i] - c[i]}

NEON: absolute value
calcuate absolute value of each lane
- intrin
  vabsq_s{8,16,32,64}
  vabsq_f{32,64}
- example
  int32x4_t vabsq_s32(int32x4_t a) \
  --> res[i] = |a[i]|

NEON: maximum/minimum
get maximum/minimum of each lane from two vectors
- intrin
  v{max,min}q_{s,u}{8,16,32,64}
  v{max,min}q_f{32,64}
- example
  int32x4_t vmaxq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = max(a[i], b[i])

# TODO
# NEON: rounding
# NEON: reciprocal
# NEON: square root

NEON: pairwise add
add every two adjacent lanes
- intrin
  vpaddq_{s,u}{8,16,32,64}
  vpaddq_f{32,64}
- example
  int32x4_t res = vpaddq_s32(int32x4_t a, int32x4_t b)
      +---------+---------+---------+---------+
   a  |   a3    |   a2    |   a1    |   a0    |
      +---------+---------+---------+---------+
      +---------+---------+---------+---------+
   b  |   b3    |   b2    |   b1    |   b0    |
      +---------+---------+---------+---------+
  ----------------------------------------------
      +---------+---------+---------+---------+
  res |  b3+b2  |  b1+b0  |  a3+a2  |  a1+a0  |
      +---------+---------+---------+---------+

NEON: pairwise add and widen
do pairwise addition, save result to a wider vector
- intrin
  vpaddlq_{s,u}{8,16,32}
- note
  reference "NEON: pairwise add" to see how pairwise works
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vpaddlq_s32(int32x4_t a) \
  --> res[0] = a[0] + a[1], \
      res[1] = a[2] + a[3]

NEON: pairwise add and widen accumulate
do pairwise addition, widen the result, and add to a wider vector
- intrin
  vpadalq_{s,u}{8,16,32}
- note
  reference "NEON: pairwise add" to see how pairwise works
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vpadalq_s32(int64x2_t a, int32x4_t b) \
  --> res[0] = a[0] + (b[0] + b[1]), \
      res[1] = a[1] + (b[2] + b[3])

NEON: pairwise maximum/minimum
- intrin
  vp{max,min}q_{s,u}{8,16,32}
  vp{max,min}q_f{32,64}
  vp{max,min}nmq_f{32,64}     (IEEE754)
- note
  reference "NEON: pairwise add" to see how pairwise works
- example
  int32x4_t vpmaxq_s32(int32x4_t a, int32x4_t b) \
  --> res[0] = max(a[0], a[1]), \
      res[1] = max(a[2], a[3]), \
      res[2] = max(b[0], b[1]), \
      res[3] = max(b[2], b[3])

NEON: add across vector
horizontal sum over all lanes
- intrin
  vaddvq_{s,u}{8,16,32,64}
  vaddvq_f{32,64}
- example
  int32_t vaddvq_s32(int32x4_t a) \
  --> res = a[0] + a[1] + a[2] + a[3]

NEON: add across vector and widen
horizontal sum over all lanes, widen result
- intrin
  vaddlvq_{s,u}{8,16,32}
- example
  int64_t vaddlvq_s32(int32x4_t a) \
  --> res = a[0] + a[1] + a[2] + a[3]

NEON: maximum/minimum across vector
find maximum/minimum value of all lanes
- intrin
  v{max,min}vq_{s,u}{8,16,32}
  v{max,min}vq_f{32,64}
  v{max,min}nmvq_f{32,64}     (IEEE754)
- example
  int32_t vmaxvq_s32(int32x4_t a) \
  -> res = max(a[0], a[1], a[2], a[3])

NEON: compare
compare two vectors by lane
- intrin
  vc{eq,ge,gt,le,lt}q_{s,u}{8,16,32,64}
  vc{eq,ge,gt,le,lt}q_f{32,64}
  vceqq_p{8,64}
- note
  return per lane bitmask, all 1 if compare success, otherwise all 0
- example
  uint32x4_t vceqq_s32(int32x4_t a, int32x4_t b)
      +---------+---------+---------+---------+
   a  |  123    |   55    |  678    |   573   |
      +---------+---------+---------+---------+
      +---------+---------+---------+---------+
   b  |  123    |   66    |  789    |   573   |
      +---------+---------+---------+---------+
  ----------------------------------------------
      +---------+---------+---------+---------+
  res |ffffffff |00000000 |00000000 |ffffffff |
      +---------+---------+---------+---------+

NEON: compare with 0
compare each lane of a vector with 0
- intrin
  vc{eq,ge,gt,le,lt}zq_s{8,16,32,64}
  vceqzq_u{8,16,32,64}
  vc{eq,ge,gt,le,lt}zq_f{32,64}
  vceqzq_p{8,64}
- note
  reference "NEON: compare vectors" to see how compare works
- example
  uint32x4_t vceqzq_s32(int32x4_t a) \
  --> res[i] = 0xffffffff is a[i] == 0 else 0x00000000

NEON: compare scalars
compare two scalars and set bitmask in the result
- intrin
  vc{eq,ge,gt,le,lt}d_{s,u}64
  vc{eq,ge,gt,le,lt}s_f32
  vc{eq,ge,gt,le,lt}d_f64
- note
  just use "-(a == b)"
- example
  uint32_t vceqs_f32(float32_t a, float32_t b) \
  --> res = 0xffffffff is a == b else 0x00000000

NEON: compare scalar with 0
- intrin
  vc{eq,ge,gt,le,lt}zd_s64
  vceqzd_u64
  vc{eq,ge,gt,le,lt}zs_f32
  vc{eq,ge,gt,le,lt}zd_f64
- note
  just use "-(a == 0)"
- example
  uint32_t vceqzs_f32(float32_t a) \
  --> res = 0xffffffff if a == 0 else 0x00000000

NEON: compare absolute values
- intrin
  vca{ge,le,gt,lt}q_f{32,64}  (vector)
  vca{ge,le,gt,lt}s_f32       (scalar)
  vca{ge,le,gt,lt}d_f64       (scalar)
- example
  uint32x4_t vcaltq_f32(float32x4_t a, float32x4_t b) \
  --> res[i] = 0xffffffff if |a| < |b| else 0x00000000
  uint32_t vcages_f32(float32_t a, float32_t b) \
  --> res = 0xffffffff if |a| >= |b| else 0x00000000

NEON: bitwise not equal to 0
and two vectors bitwise, fill non-zero lane with 1
- intrin
  vtstq_{s,u}{8,16,32,64} (vector)
  vtstq_p{8,64}           (vector)
  vtstd_{s,u}64           (scalar)
- example
  uint32x4_t vtstq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = 0xffffffff if (a[i] & b[i]) else 0x00000000
  uint64_t vtstd_s64(int64_t a, int64_t b) \
  --> res = -1ULL if (a & b) else 0

NEON: shift left
left shift first vector by the least significant byte of the second vector
- intrin
  vshlq_{s,u}{8,16,32,64}
- example
  uint32x4_t vshlq_s32(uint32x4_t a, int32x4_t b) \
  --> res[i] = a[i] << uint8(b[i])

NEON: shift left/right by const
- intrin
  vsh{l,r}q_n_s{8,16,32,64}
- example
  int32x4_t vshlq_n_s32(int32x4_t a, const int n) \
  --> res[i] = a[i] << n

NEON: rounding shift left
- intrin
  vrshlq_{s,u}{8,16,32,64}  (vector)
  vrsqld_{s,u}64            (scalar)

NEON: saturating shift left
saturate left shift a vector by least significant bytes of another vector
- intrin
  vqshlq_{s,u}{8,16,32,64}  (vector)
  vqshlb_{s,u}8             (scalar)
  vqshlh_{s,u}16            (scalar)
  vqshls_{s,u}32            (scalar)
  vqshld_{s,u}64            (scalar)
- example
  int32x4_t vqshlq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = saturate(a[i] << uint8(b[i]))
  uint32_t vqshls_u32(uint32_t a, int32_t b) \
  --> res = saturate(a[i] << uint8(b))

NEON: saturating shift left by const
- intrin
  vqshlq_n_{s,u}{8,16,32,64}  (vector)
  vqshlb_n_{s,u}8             (scalar)
  vqshlh_n_{s,u}16            (scalar)
  vqshls_n_{s,u}32            (scalar)
  vqshld_n_{s,u}64            (scalar)

NEON: saturating rounding shift left
- intrin
  vqrshlq_{s,u}{8,16,32,64}  (vector)
  vqrshlb_{s,u}8             (scalar)
  vqrshlh_{s,u}16            (scalar)
  vqrshls_{s,u}32            (scalar)
  vqrshld_{s,u}64            (scalar)

NEON: shift left by const and widen
shift vector left by const and save to a wider vector
- intrin
  vshll_n_{s,u}{8,16,32}
  vshll_high_n_{s,u}{8,16,32}
- note
  reference "NEON: widening add/sub #1" to see how widen works
- example
  int64x2_t vshll_n_s32(int32x2_t a, const int n) \
  --> res[i] = a[i] << n
  int64x2_t vshll_high_n_s32(int32x4_t a, const int n) \
  --> res[i] = a[i+2] << n

NEON: shift left by const and insert
shift vector left by const and insert lower bits from another vector
- intrin
  vsliq_n_{s,u}{8,16,32,64}   (vector)
  vsliq_n_p{8,16,64}          (vector)
  vslid_n_{s,u}64             (scalar)
- example
  int32x4_t vsliq_n_s32(int32x4_t a, int32x4_t b, const int n) \
         +-------------------+
   a[i]  |                   |
         +-------------------+
         +-------------------+
   b[i]  |    bh        bl   |
         +----------^--------^
                    | n bits |
 ------------------------------
  res[i] = (a[i] << n) | bl

NEON: rounding shift right by const
- intrin
  vrshrq_n_{s,u}{8,16,32,64}  (vector)
  vrshrd_n_{s,u}64            (scalar)

NEON: shift right by const and accumulate
shift a vector right by const and add to another vector
- intrin
  vsraq_n_{s,u}{8,16,32,64}  (vector)
  vsrad_n_{s,u}64            (scalar)
- example
  int32x4_t vsraq_n_s32(int32x4_t a, int32x4_t b, const int n) \
  --> res[i] = a[i] + (b[i] >> n)

NEON: rounding shift right by const and accumulate
- intrin
  vrsraq_n_{s,u}{8,16,32,64}  (vector)
  vrsrad_n_{s,u}64            (scalar)

NEON: shift right by const and narrow
- intrin
  vshrn_n_{s,u}{16,32,64}
  vshrn_high_n_{s,u}{16,32,64}
- note
  reference "NEON: narrowing add/sub" to see how narrow works
  TBD: pick high half or low half?
- example
  int32x2_t vshrn_n_s64(int64x2_t a, const int n) \
  --> res[i] = to32(a[i] >> n)
  int32x4_t vshrn_high_n_s64(int32x2_t r, int64x2_t a, const int n) \
  --> res[0,1] = r[0,1]
      res[2] = to32(a[0] >> n)
      res[3] = to32(a[1] >> n)

NEON: rounding shift right by const and narrow
- intrin
  vrshrn_n_{s,u}{16,32,64}
  vrshrn_high_n_{s,u}{16,32,64}
- note
  reference "NEON: shift right by const and narrow" to see how narrow works

NEON: saturating shift right by const and narrow
- intrin
  vqshrn_n_{s,u}{16,32,64}       (vector)
  vqshrn_high_n_{s,u}{16,32,64}  (vector)
  vqshrnh_n_{s,u}16              (scalar)
  vqshrns_n_{s,u}32              (scalar)
  vqshrnd_n_{s,u}64              (scalar)
  vqshrun_n_s{16,32,64}          (vector)
  vqshrun_high_n_s{16,32,64}     (vector)
  vqshrunh_n_s16                 (scalar)
  vqshruns_n_s32                 (scalar)
  vqshrund_n_s64                 (scalar)
- note
  vqshrun converts signed to unsigned
  reference "NEON: shift right by const and narrow" to see how narrow works

NEON: saturating rounding shift right by const and narrow
- intrin
  vqrshrn_n_{s,u}{16,32,64}       (vector)
  vqrshrn_high_n_{s,u}{16,32,64}  (vector)
  vqrshrun_n_s{16,32,64}          (vector)
  vqrshrun_high_n_s{16,32,64}     (vector)
  vqrshrnh_n_{s,u}16              (scalar)
  vqrshrns_n_{s,u}32              (scalar)
  vqrshrnd_n_{s,u}64              (scalar)
  vqrshrunh_n_s16                 (scalar)
  vqrshruns_n_s32                 (scalar)
  vqrshrund_n_s64                 (scalar)
- note
  vqrshrun converts signed to unsigned
  reference "NEON: shift right by const and narrow" to see how narrow works

NEON: shift right by const and insert
shift vector right by const and insert higher bits from another vector
- intrin
  vsriq_n_{s,u}{8,16,32,64}   (vector)
  vsriq_n_p{8,16,64}          (vector)
  vsrid_n_{s,u}64             (scalar)
- example
  int32x4_t vsriq_n_s32(int32x4_t a, int32x4_t b, const int n) \
         +-------------------+
   a[i]  |                   |
         +-------------------+
         +-------------------+
   b[i]  |    bh        bl   |
         ^----------^--------+
         |  n bits  |
 ------------------------------
  res[i] = bh | (a[i] >> n)

NEON: convert float to integer
- intrin
  vcvt{,n,m,p,a}q_{s,u}32_f32  (vector)
  vcvt{,n,m,p,a}q_{s,u}64_f64  (vector)
  vcvt{,n,m,p,a}s_{s,u}32_f32  (scalar)
  vcvt{,n,m,p,a}d_{s,u}64_f64  (scalar)
- note
  vcvtq_....: round toward 0
  vcvtnq_...: round to nearest
  vcvtmq_...: round toward -inf
  vcvtpq_...: round toward +inf
  vcvtaq_...: round to nearest with ties to away
- example
  int32x4_t vcvtq_s32_f32(float32x4_t a) \
  --> res[i] = int32(a[i])

NEON: convert integer to float
- intrin
  vcvtq_f32_{s,u}32  (vector)
  vcvtq_f64_{s,u}64  (vector)
  vcvts_f32_{s,u}32  (scalar)
  vcvts_f64_{s,u}64  (scalar)
- example
  float32x4_t vcvtq_f32_s32(int32x4_t a) \
  --> res[i] = float(a[i])
 

# TODO
# vcvt{,n,m,p,a}q_n_xxx

NEON: reinterpret cast
cast vector to another type
- intrin
  vreinterpretq_<to>_<from>
- note
  TODO: polynomial
- example
  uint32x4_t vreinterpretq_u32_s64(int64x2_t a)

NEON: narrow a vector
move a vector to a narrower one
- intrin
  vmovn_{s,u}{16,32,64}
  vmovn_high_{s,u}{16,32,64}
- note
  keep lower half of each lane
- example
  int32x2_t vmovn_s64(int64x2_t a)
  int32x4_t vmovn_high_s64(int32x2_t r, int64x2_t a)
      +---------+---------+
    a | a1h a1l | a0h a0l |
      +---------+---------+
                +----+----+
    r           | r1 | r0 |
                +----+----+
  --------------------------
                +----+----+
  res1          | a1l| a0l|
                +----+----+
      +----+----+----+----+
  res2| a1l| a0l| r1 | r0 |
      +----+----+---------+

NEON: widen a vector
move a vector to a wider one
- intrin
  vmovl_{s,u}{8,16,32}
  vmovl_high_{s,u}{8,16,32}
- example
  int64x2_t vmovl_s32(int32x2_t a) \
  --> res[i] = a[i]
  int64x2_t vmovl_high_s32(int32x4_t a) \
  --> res[i] = a[i+2]

NEON: saturating narrow
- intrin
  vqmovn_{s,u}{16,32,64}       (vector)
  vqmovn_high_{s,u}{16,32,64}  (vector)
  vqmovnh_{s,u}16              (scalar)
  vqmovns_{s,u}32              (scalar)
  vqmovnd_{s,u}64              (scalar)
  vqmovun_s{16,32,64}          (vector)
  vqmovun_high_s{16,32,64}     (vector)
  vqmovunh_s16                 (scalar)
  vqmovuns_s32                 (scalar)
  vqmovund_s64                 (scalar)
- note
  vqmovun convert signed to unsigned

NEON: negate
- intrin
  vnegq_s{8,16,32,64}
  vnegq_f{32,64}
- example
  int32x4_t vnegq_s32(int32x4_t a) \
  --> res[i] = -a[i]

NEON: saturating negate
- intrin
  vqnegq_s{8,16,32,64}  (vector)
  vqnegqb_s8            (scalar)
  vqnegqh_s16           (scalar)
  vqnegqs_s32           (scalar)
  vqnegqd_s64           (scalar)
- example
  int32x4_t vqnegq_s32(int32x4_t a) \
  --> res[i] = saturate(-a[i])
  int32_t vqnegs_s32(int32_t a) \
  --> res = saturate(-a)

NEON: bitwise not
revert each bit of a vector
- intrin
  vmvnq_{s,u}{8,16,32}
  vmvnq_p8
- example
  int32x4_t vmvnq_s32(int32x4_t a) \
  --> res[i] = ~a[i]

NEON: bitwise and/or/exclusive-or/or-not
- intrin
  v{and,orr,eor,orn}q_{s,u}{8,16,32,64}
- example
  int32x4_t vornq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = a[i] | (~b[i])

NEON: count leading redundant sign bits
count number of bits following the most significant bit \
that are identical to it
- intrin
  vclsq_{s,u}{8,16,32}

NEON: count leading zeros
- intrin
  vclzq_{s,u}{8,16,32}

NEON: population count
count ones, ie. popcount
- intrin
  vcntq_{s,u,p}8

NEON: bitwise clear
clear bits in first vector that is set in second vector, \
effectively an and-not operation
- intrin
  vbicq_{s,u}{8,16,32,64}
- example
  int32x4_t vbicq_s32(int32x4_t a, int32x4_t b) \
  --> res[i] = a[i] & (~b[i])

NEON: bitwise select
select from two vectors
- intrin
  vbslq_{s,u}{8,16,32,64}
  vbslq_f{32,64}
  vbslq_p{8,16,64}
- note
  common use case is to select full lanes from two vectors by a mask vector
- example
  int32x4_t vbslq_s32(uint32x4_t mask, int32x4_t a, int32x4_t b) \
  --> res[i] = select_bits_from_an_bn_per_maskn \
      res[i] = a[i] if (mask[i] == 0xffffffff)  \
      res[i] = b[i] if (mask[i] == 0x00000000)

NEON: load
load vector from memory
- intrin
  vld{1,2,3,4}q_{s,u}{8,16,32,64}
  vld{1,2,3,4}q_f{32,64}
  vld{1,2,3,4}q_p{8,16,64}
  vld1q_{s,u}{8,16,32,64}_x{2,3,4}  (non-interleave)
  vld1q_f{32,64}_x{2,3,4}           (non-interleave)
  vld1q_p{8,16,64}_x{2,3,4}         (non-interleave)
- example
  const int32_t ptr[]{0,1,0,1,0,1,0,1}; \
  int32x4x2_t vld2q_s32(const int32_t *ptr) \
  --> res = {{0,0,0,0}, {1,1,1,1}}
  const int32_t ptr[]{0,0,0,0,1,1,1,1}; \
  int32x4x2_t vld1q_s32_x2(const int32_t *ptr) \
  --> res = {{0,0,0,0}, {1,1,1,1}}

NEON: load by lane
load a vector lane from memory
- intrin
  vld{1,2,3,4}q_lane_{s,u}{8,16,32,64}
  vld{1,2,3,4}q_lane_f{32,64}
  vld{1,2,3,4}q_lane_p{8,16,64}
- example
  int32x4_t vld1q_lane_s32(const int32_t *ptr, int32x4_t src, \
                           const int lane) \
  --> res = src; res[lane] = *ptr;
  int32x4x2_t vld2q_lane_s32(const int32_t *ptr, int32x4x2_t src, \
                             const int lane) \
  --> res = src; res[0][lane] = ptr[0]; res[1][lane] = ptr[1];

NEON: load by scalar
load a scalar from memory and populate all lanes of a vector
- intrin
  vld{1,2,3,4}q_dup_{s,u}{8,16,32,64}
  vld{1,2,3,4}q_dup_f{32,64}
  vld{1,2,3,4}q_dup_p{8,16,64}
- example
  int32x4x3_t vld3q_dup_s32(const int32_t *ptr) \
  --> res[0] = [ptr[0]]*4; res[1] = [ptr[1]]*4; res[2] = [ptr[2]]*4

NEON: store
store vector to memory
- intrin
  vst{1,2,3,4}q_{s,u}{8,16,32,64}
  vst{1,2,3,4}q_f{32,64}
  vst{1,2,3,4}q_p{8,16,64}
  vst1q_{s,u}{8,16,32,64}_x{2,3,4}  (non-interleave)
  vst1q_f{32,64}_x{2,3,4}           (non-interleave)
  vst1q_p{8,16,64}_x{2,3,4}         (non-interleave)
- example
  val = {{0,0,0,0}, {1,1,1,1}}; \
  void vst2q_s32(int32_t *ptr, int32x4x2_t val) \
  --> ptr -> {0,1,0,1,0,1,0,1}
  val = {{0,0,0,0}, {1,1,1,1}}; \
  void vst1q_s32_x2(int32_t *ptr, int32x4x2_t val) \
  --> ptr -> {0,0,0,0,1,1,1,1}

NEON: store by lane
store vector lane to memory
- intrin
  vst{1,2,3,4}q_lane_{s,u}{8,16,32,64}
  vst{1,2,3,4}q_lane_f{32,64}
  vst{1,2,3,4}q_lane_p{8,16,64}
- example
  void vst1q_lane_s32(int32_t *ptr, int32x4_t src, const int lane) \
  --> *ptr = src[lane]
  void vst2q_lane_s32(int32_t *ptr, int32x4x2_t src, const int lane) \
  --> ptr[0] = src[0][lane]; ptr[1] = src[1][lane];

NEON: load/store poly128
- intrin
  v{ldr,str}q_p128

NEON: copy vector lane
copy one lane of a vector to a lane of another vector
- intrin
  vcopyq_laneq_{s,u}{8,16,32,64}
  vcopyq_laneq_f{32,64}
  vcopyq_laneq_p{8,16,64}
- example
  int32x4_t vcopyq_laneq_s32(int32x4_t a, const int lane1, \
                             int32x4_t b, const int lane2) \
  --> res = a; res[lane1] = b[lane2]

NEON: set all lanes to a const
fill all lanes of a vector with a const
- intrinc
  v{dup,mov}q_n_{s,u}{8,16,32,64}
  v{dup,mov}q_n_f{32,64}
  v{dup,mov}q_n_p{8,16,64}
- note
  vdupq and vmovq are the same
- example
  int32x4_t vdupq_n_s32(int32_t value) \
  --> res = [value]*4

NEON: set all lanes to a vector lane
fill all lanes of a vector with one lane of another vector
- intrin
  vdupq_laneq_{s,u}{8,16,32,64}
  vdupq_laneq_f{32,64}
  vdupq_laneq_p{8,16,64}
- example
  int32x4_t vdupq_laneq_s32(int32x4_t vec, const int lane) \
  --> res = [vec[lane]]*4

NEON: set one lane of a vector
fill one specific lane of a vector with a const
- intrin
  vsetq_lane_{s,u}{8,16,32,64}
  vsetq_lane_f{32,64}
  vsetq_lane_p{8,16,64}
- example
  int32x4_t vsetq_lane_s32(int32_t a, int32x4_t v, const int lane) \
  --> res = v, res[lane] = a

NEON: reverse bits
- intrin
  vrbitq_{s,u,p}8

NEON: combine vectors
concatenate two vectors to a wider vector
- intrin
  vcombine_{s,u}{8,16,32,64}
  vcombine_f{32,64}
  vcombine_p{8,16,64}
- example
  int32x4_t vcombine_s32(int32x2_t low, int32x2_t high) \
  --> res[0,1] = low, res[2,3] = high

NEON: extract half vector
extract lower or higher half of a vector
- intrin
  vget_{high,low}_{s,u}{8,16,32,64}
  vget_{high,low}_f{32,64}
  vget_{high,low}_p{8,16,64}
- example
  int32x2_t vget_high_s32(int32x4_t a) \
  --> res[0] = a[2], res[1] = a[3]

NEON: extract lane
extract one lane from a vector
- intrin
  vgetq_lane_{s,u}{8,16,32,64}
  vgetq_lane_f{32,64}
  vgetq_lane_p{8,16,64}
- note
  same as vdup{b,h,s,d}_laneq_...
- example
  int32_t vgetq_lane_s32(int32x4_t v, const int lane) \
  --> res = v[lane]

NEON: extract from two vectors
cut a vector from two vectors concatenated
- intrin
  vextq_{s,u}{8,16,32,64}
  vextq_f{32,64}
  vextq_p{8,16,64}
- note
  can be used to round shift a vector by lane
- example
  int32x4_t vextq_s32(int32x4_t a, int32x4_t b, const int n)
               +----+----+----+----+
             a | a3 | a2 | a1 | a0 |  {a0, a1, a2, a3}
               +----+----+----+----+
               +----+----+----+----+
             b | b3 | b2 | b1 | b0 |  {b0, b1, b2, b3}
               +----+----+----+----+
 -------------------------------------
               +----+----+----+----+
  n = 1    res | b0 | a3 | a2 | a1 |  {a1, a2, a3, b0}
               +----+----+----+----+
               +----+----+----+----+
  n = 2    res | b1 | b0 | a3 | a2 |  {a2, a3, b0, b1}
               +----+----+----+----+
               +----+----+----+----+
  n = 3    res | b2 | b1 | b0 | a3 |  {a3, b0, b1, b2}
               +----+----+----+----+

NEON: reverse lanes
reverse lanes in chunk, for endianness adjustment
- intrin
  vrev64q_{s,u}{8,16,32}
  vrev64q_f32
  vrev64q_p{8,16}
  vreq32q_{s,u,p}{8,16}
  vreq16q_{s,u,p}8
- note
  vreq[n]q_s[m]: cut vector to 128/n chunks, treat m bits as an element,
                 swap ordering of n/m elements in each chunk
- example
  int32x4_t vrev64q_s32(int32x4_t v)
     +----+----+----+----+
  a  | a3 | a2 | a1 | a0 |
     +----+----+----+----+
 --------------------------
     +----+----+----+----+
 res | a2 | a3 | a0 | a1 |
     +----+----+----+----+

NEON: zip lanes
interleave lanes of two vectors
- intrin
  vzip{,1,2}q_{s,u}{8,16,32,64}
  vzip{,1,2}q_f{32,64}
  vzip{,1,2}q_p{8,16,64}
- note
  vzip1q: keep lower half
  vzip2q: keep higher half
  vzipq:  keep all result in a x2 vector
- example
  int32x4_t res1 = vzip1q_s32(int32x4_t a, int32x4_t b)
  int32x4_t res2 = vzip2q_s32(int32x4_t a, int32x4_t b)
  int32x4x2_t res3 = vzipq_s32(int32x4_t a, int32x4_t b)
      +----+----+----+----+
  a   | a3 | a2 | a1 | a0 |  {a0, a1, a2, a3}
      +----+----+----+----+
      +----+----+----+----+
  b   | b3 | b2 | b1 | b0 |  {b0, b1, b2, b3}
      +----+----+----+----+
 ---------------------------
      +----+----+----+----+
 res1 | b1 | a1 | b0 | a0 |  {a0, b0, a1, b1}
      +----+----+----+----+
      +----+----+----+----+
 res2 | b3 | a3 | b2 | a2 |  {a2, b2, a3, b3}
      +----+----+----+----+
 res3[0] = res1, res3[1] = res2

NEON: unzip lanes
de-interleave lanes from two vectors
- intrin
  vuzp{,1,2}q_{s,u}{8,16,32,64}
  vuzp{,1,2}q_f{32,64}
  vuzp{,1,2}q_p{8,16,64}
- vuzp1q: keep even indexed lanes
  vuzp2q: keep odd indexed lanes
  vuzpq:  keep all elements in a x2 vector
- example
  int32x4_t res1 = vuzp1q_s32(int32x4_t a, int32x4_t b)
  int32x4_t res2 = vuzp2q_s32(int32x4_t a, int32x4_t b)
  int32x4x2_t res3 = vuzipq_s32(int32x4_t a, int32x4_t b)
      +----+----+----+----+
  a   | a3 | a2 | a1 | a0 |  {a0, a1, a2, a3}
      +----+----+----+----+
      +----+----+----+----+
  b   | b3 | b2 | b1 | b0 |  {b0, b1, b2, b3}
      +----+----+----+----+
 ---------------------------
      +----+----+----+----+
 res1 | b2 | b0 | a2 | a0 |  {a0, a2, b0, b2}
      +----+----+----+----+
      +----+----+----+----+
 res2 | b3 | b1 | a3 | a1 |  {a1, a3, b1, b3}
      +----+----+----+----+
 res3[0] = res1, res3[1] = res2

NEON: transpose lanes
- intrin
  vtrn{,1,2}q_{s,u}{8,16,32,64}
  vtrn{,1,2}q_f{32,64}
  vtrn{,1,2}q_p{8,16,64}
- example
  int32x4_t res1 = vtrn1q_s32(int32x4_t a, int32x4_t b)
  int32x4_t res2 = vtrn2q_s32(int32x4_t a, int32x4_t b)
  int32x4x2_t res3 = vuzipq_s32(int32x4_t a, int32x4_t b)
      +----+----+----+----+
  a   | a3 | a2 | a1 | a0 |  {a0, a1, a2, a3}
      +----+----+----+----+
      +----+----+----+----+
  b   | b3 | b2 | b1 | b0 |  {b0, b1, b2, b3}
      +----+----+----+----+
 ---------------------------
      +----+----+----+----+
 res1 | b2 | a2 | b0 | a0 |  {a0, b0, a2, b2}
      +----+----+----+----+
      +----+----+----+----+
 res2 | b3 | a3 | b1 | a1 |  {a1, b1, a3, b3}
      +----+----+----+----+
 res3[0] = res1, res3[1] = res2

NEON: table lookup
- intrin
  vqtbl{1,2,3,4}q_{s,u,p}8
  vqtbx{1,2,3,4}q_{s,u,p}8
- note
  vqtbx preserves original value if index out of range, wnile vqtbl sets to 0
- example
  int8x16_t vqtbl1q_s8(int8x16_t t, uint8x16_t idx) \
  --> res[i] = t[idx[i]] if idx[i] < 16 else 0
  int8x16_t vqtbx1q_s8(int8x16_t a, int8x16_t t, uint8x16_t idx) \
  --> res[i] = t[idx[i]] if idx[i] < 16 else a[i]
  int8x16_t vqtbl2q_s8(int8x16x2_t t, uint8x16_t idx) \
  --> tt = concat(t[0], t[1]); res[i] = tt[idx[i]] if idx[i] < 32 else 0

# TODO: crypto, crc, fp16, bf16, complex, dot prod, matrix mul, 3 way eor/bic
</neon_document>