NEON equivalent of SSE intrinsics

I am trying to convert the C code below into an optimized version using NEON intrinsics.

Here is the C code, which operates on two operands rather than on vectors of operands.

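#include <stdint.h>

uint16_t mult_z216(uint16_t a, uint16_t b) {
    /* Widen before multiplying: a * b evaluated in plain int can overflow for large inputs. */
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int c1h = c1 >> 16;      /* high 16 bits of the product */
        int c1l = c1 & 0xffff;   /* low 16 bits of the product */
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}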
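Reading the routine, it appears to compute a product modulo 2^16 + 1 = 65537 in which the value 0 stands for 65536 (the convention used by the IDEA cipher). The post does not say so, so treat that as an interpretation. The sketch below checks it over a sample of inputs against a reference function, mult_mod_65537, which is a hypothetical name introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The scalar routine from the question, with the product widened explicitly. */
static uint16_t mult_z216(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int c1h = (int)(c1 >> 16);
        int c1l = (int)(c1 & 0xffff);
        return (uint16_t)((c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

/* Hypothetical reference (not from the post): multiply modulo 65537,
   reading an input of 0 as 65536 and writing a result of 65536 as 0. */
static uint16_t mult_mod_65537(uint16_t a, uint16_t b)
{
    uint64_t x = a ? a : 65536u;
    uint64_t y = b ? b : 65536u;
    return (uint16_t)((x * y) % 65537u);   /* the cast maps 65536 to 0 */
}

/* Counts sampled input pairs on which the two functions disagree. */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t a = 0; a < 65536u; a += 251u)
        for (uint32_t b = 0; b < 65536u; b += 257u)
            if (mult_z216((uint16_t)a, (uint16_t)b) !=
                mult_mod_65537((uint16_t)a, (uint16_t)b))
                bad++;
    return bad;
}

/* Self-check covering the sample plus the zero-operand corner cases. */
static void demo_reference_check(void)
{
    assert(count_mismatches() == 0);
    assert(mult_z216(0, 0) == mult_mod_65537(0, 0));
    assert(mult_z216(0, 1) == mult_mod_65537(0, 1));
}
```

Calling demo_reference_check() from a test harness is enough to exercise it. The sample strides (251 and 257) are arbitrary; an exhaustive loop over all 2^32 pairs would be the stronger check.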

An SSE-optimized version of this operation is already implemented as follows:

#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128((a), (b));           /* t0 = a | b (bitwise OR of the 128-bit values) */ \
    (c) = _mm_mullo_epi16((a), (b));        /* low 16 bits of each 16x16 product */ \
    (a) = _mm_mulhi_epu16((a), (b));        /* high 16 bits of each unsigned 16x16 product */ \
    (b) = _mm_subs_epu16((c), (a));         /* unsigned saturating lo - hi: 0 where lo <= hi */ \
    (b) = _mm_cmpeq_epi16((b), C_0x0_XMM);  /* compare with zero: 0xFFFF where lo <= hi, else 0x0 */ \
    (b) = _mm_srli_epi16((b), 15);          /* logical shift right by 15 bits: mask becomes 0 or 1 */ \
    (c) = _mm_sub_epi16((c), (a));          /* c = lo - hi (wrapping) */ \
    (a) = _mm_cmpeq_epi16((c), C_0x0_XMM);  /* compare with zero: 0xFFFF where lo == hi, else 0x0 */ \
    (c) = _mm_add_epi16((c), (b));          /* add the 0/1 correction */ \
    t0  = _mm_and_si128(t0, (a));           /* keep a | b only where lo == hi */ \
    (c) = _mm_sub_epi16((c), t0);           /* subtract it from the result */

I have converted most of this using NEON intrinsics:

#define MULT_Z216_NEON(a, b, out) \
    temp = vorrq_u16(*a, *b); \
    /* ?? equivalent of _mm_mullo_epi16: *out = low 16 bits of each product */ \
    /* ?? equivalent of _mm_mulhi_epu16: *a = high 16 bits of each product */ \
    *b = vqsubq_u16(*out, *a); \
    *b = vceqq_u16(*b, vdupq_n_u16(0x0000)); \
    *b = vshrq_n_u16(*b, 15); \
    *out = vsubq_u16(*out, *a); \
    *a = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
    *out = vaddq_u16(*out, *b); \
    temp = vandq_u16(temp, *a); \
    *out = vsubq_u16(*out, temp);

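The two intrinsics for which I am missing a NEON equivalent are _mm_mullo_epi16((a), (b)) and _mm_mulhi_epu16((a), (b)). Either I am misunderstanding something or NEON has no such intrinsics. If there is no equivalent, how can I achieve these steps using NEON intrinsics?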

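UPDATE:

I forgot to emphasize the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). One answer suggested using the intrinsic vqdmulhq_s16(). That does not match the given implementation, because the multiply intrinsic interprets the vectors as signed values and produces wrong output.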

Source

2012-07-02 Kami

If the values can exceed 32767, you need to use the widening multiply below (vmull_u16). If you know that all values are < 32768, you can use vqdmulhq_s16. – BitBank

You can use

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t)

which returns a vector of 32-bit products. If you want to split the result into high and low parts, you can use NEON intrinsics for that as well.
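A minimal sketch of the vmull_u16 suggestion, assuming the inputs live in plain uint16_t arrays. The helper name mul_lo_hi_u16x8 and the SKETCH_HAVE_NEON macro are hypothetical, and narrowing with vmovn_u32 and vshrn_n_u32 is only one possible way to split the products. The scalar branch exists so the sketch also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Hypothetical helper: multiplies 8 unsigned 16-bit lanes and stores the low
   and high 16 bits of every 32-bit product in separate arrays. */
static void mul_lo_hi_u16x8(const uint16_t a[8], const uint16_t b[8],
                            uint16_t lo[8], uint16_t hi[8])
{
#if SKETCH_HAVE_NEON
    uint16x8_t va = vld1q_u16(a);
    uint16x8_t vb = vld1q_u16(b);
    /* vmull_u16 widens 4 lanes at a time, so the 8 lanes take two calls. */
    uint32x4_t p0 = vmull_u16(vget_low_u16(va), vget_low_u16(vb));
    uint32x4_t p1 = vmull_u16(vget_high_u16(va), vget_high_u16(vb));
    /* vmovn_u32 keeps the low 16 bits of each 32-bit lane. */
    vst1q_u16(lo, vcombine_u16(vmovn_u32(p0), vmovn_u32(p1)));
    /* vshrn_n_u32(p, 16) shifts right by 16 and narrows: the high 16 bits. */
    vst1q_u16(hi, vcombine_u16(vshrn_n_u32(p0, 16), vshrn_n_u32(p1, 16)));
#else
    /* Portable fallback with the same per-lane result, for non-ARM builds. */
    for (int i = 0; i < 8; i++) {
        uint32_t p = (uint32_t)a[i] * b[i];
        lo[i] = (uint16_t)(p & 0xffff);
        hi[i] = (uint16_t)(p >> 16);
    }
#endif
}

/* Compares every lane with the widened scalar product. */
static void demo_split(void)
{
    const uint16_t a[8] = {0, 1, 2, 255, 32768, 40000, 65535, 12345};
    const uint16_t b[8] = {0, 65535, 3, 255, 2, 40000, 65535, 54321};
    uint16_t lo[8], hi[8];
    mul_lo_hi_u16x8(a, b, lo, hi);
    for (int i = 0; i < 8; i++) {
        uint32_t p = (uint32_t)a[i] * b[i];
        assert(lo[i] == (uint16_t)p);
        assert(hi[i] == (uint16_t)(p >> 16));
    }
}
```

The high-half array corresponds to what _mm_mulhi_epu16 produces and the low-half array to _mm_mullo_epi16, which are the two steps missing from the question's macro.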

Source

2012-07-02 18:30:50

That instruction is a 16x16 = 32 multiply (it widens the output). There are closer instructions (see my answer). – BitBank

@BitBank: The OP needs both the upper and the lower 16 bits, so a 32-bit result is required. The doubling/saturating multiply is not a substitute because it loses precision. –

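vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16; the closest instruction is vqdmulhq_s16(), which is a "saturating, doubling, multiply, return high part". It operates only on signed 16-bit values, and to undo the doubling you need to divide either the input or the output by 2.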
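To see when the vqdmulhq_s16() route agrees with _mm_mulhi_epu16, here is a scalar sketch of a single lane. The function qdmulh_s16_model is a hand-written model of the instruction's documented behaviour, not a library call, and all names are hypothetical. It assumes two's-complement conversion and arithmetic right shift of negative values, as GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/* High 16 bits of the unsigned product: what _mm_mulhi_epu16 yields per lane. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Scalar model (an assumption of this sketch) of one lane of vqdmulhq_s16:
   signed multiply, double, saturate to 32 bits, keep the high 16 bits. */
static int16_t qdmulh_s16_model(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * b;
    if (doubled > INT32_MAX)
        doubled = INT32_MAX;            /* only -32768 * -32768 saturates */
    return (int16_t)(doubled >> 16);
}

/* Reinterpret the unsigned lanes as signed, multiply, then halve the output
   to undo the doubling. */
static uint16_t mulhi_via_qdmulh(uint16_t a, uint16_t b)
{
    return (uint16_t)(qdmulh_s16_model((int16_t)a, (int16_t)b) >> 1);
}

/* Below 32768 the two agree on the sample; above it the sign bit changes the product. */
static void demo_signed_vs_unsigned(void)
{
    for (uint32_t a = 0; a < 32768u; a += 97u)
        for (uint32_t b = 0; b < 32768u; b += 101u)
            assert(mulhi_via_qdmulh((uint16_t)a, (uint16_t)b) ==
                   mulhi_u16((uint16_t)a, (uint16_t)b));
    assert(mulhi_via_qdmulh(40000, 40000) != mulhi_u16(40000, 40000));
}
```

This matches the caveat in the comments: the signed route is only usable when every lane is known to be below 32768.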

Source

2012-07-02 22:02:13 BitBank

vqdmulhq_s16() takes signed inputs, so GCC complains about wrongly typed arguments... How can I convert uint16x8_t to int16x8_t efficiently? – Kami

There are casting macros. Use vreinterpretq_s16_u16() – BitBank

See my edit about signed multiplication! – Kami
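Putting the suggestions in this thread together, one possible NEON port of the SSE macro could look like the sketch below. It is not code from any of the posters. It assumes vmulq_u16 for the low half (the low 16 bits are the same for signed and unsigned multiplies, so this avoids the reinterpret casts) and vmull_u16 plus a narrowing shift for the high half. The names mult_z216_neon, mult_z216_x8 and SKETCH_HAVE_NEON are hypothetical. The non-ARM branch repeats the same steps one lane at a time so the logic can be checked on any host.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Scalar routine from the question, kept as the reference. */
static uint16_t mult_z216(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int c1h = (int)(c1 >> 16);
        int c1l = (int)(c1 & 0xffff);
        return (uint16_t)((c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

#if SKETCH_HAVE_NEON
/* Hypothetical NEON port: one intrinsic (or a short group) per SSE step. */
static uint16x8_t mult_z216_neon(uint16x8_t a, uint16x8_t b)
{
    const uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0 = vorrq_u16(a, b);                     /* _mm_or_si128 */
    uint16x8_t lo = vmulq_u16(a, b);                     /* _mm_mullo_epi16 */
    uint32x4_t p0 = vmull_u16(vget_low_u16(a), vget_low_u16(b));
    uint32x4_t p1 = vmull_u16(vget_high_u16(a), vget_high_u16(b));
    uint16x8_t hi = vcombine_u16(vshrn_n_u32(p0, 16),    /* _mm_mulhi_epu16 */
                                 vshrn_n_u32(p1, 16));
    uint16x8_t carry = vqsubq_u16(lo, hi);               /* _mm_subs_epu16 */
    carry = vceqq_u16(carry, zero);                      /* 0xFFFF where lo <= hi */
    carry = vshrq_n_u16(carry, 15);                      /* mask becomes 0 or 1 */
    uint16x8_t c = vsubq_u16(lo, hi);                    /* wrapping lo - hi */
    uint16x8_t is_zero = vceqq_u16(c, zero);             /* 0xFFFF where lo == hi */
    c = vaddq_u16(c, carry);
    return vsubq_u16(c, vandq_u16(t0, is_zero));
}
#endif

/* Array wrapper so the same call works with or without NEON. */
static void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if SKETCH_HAVE_NEON
    vst1q_u16(out, mult_z216_neon(vld1q_u16(a), vld1q_u16(b)));
#else
    /* Non-ARM fallback: the same steps, one lane at a time. */
    for (int i = 0; i < 8; i++) {
        uint32_t p = (uint32_t)a[i] * b[i];
        uint16_t lo = (uint16_t)p, hi = (uint16_t)(p >> 16);
        uint16_t carry = (uint16_t)(lo <= hi);           /* saturating lo - hi is 0 */
        uint16_t c = (uint16_t)(lo - hi);
        uint16_t is_zero = (c == 0) ? 0xffffu : 0u;
        out[i] = (uint16_t)(c + carry - ((a[i] | b[i]) & is_zero));
    }
#endif
}

/* Checks every lane against the scalar reference, including zero operands. */
static void demo_port(void)
{
    const uint16_t a[8] = {0, 0, 1, 2, 32768, 40000, 65535, 12345};
    const uint16_t b[8] = {0, 7, 0, 3, 2, 40000, 65535, 54321};
    uint16_t out[8];
    mult_z216_x8(a, b, out);
    for (int i = 0; i < 8; i++)
        assert(out[i] == mult_z216(a[i], b[i]));
}
```

Unlike the macros, this version takes its operands by value and returns the result instead of overwriting a and b, which keeps the intermediate values named. Whether it beats the scalar loop on a given core would need measuring.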
