Auto-vectorizing a comparison

I am having trouble getting g++ 5.4 to vectorize a comparison. Basically, I want to compare four unsigned integers using vectorization. I compile with g++ -std=c++11 -Wall -O3 -funroll-loops -march=native -mtune=native -ftree-vectorize -msse -msse2 -ffast-math -fopt-info-vec-missed. My first approach was straightforward:

bool compare(unsigned int const pX[4]) {
    // Compare each element against its own limit.
    bool c1 = (pX[0] < 1);
    bool c2 = (pX[1] < 2);
    bool c3 = (pX[2] < 3);
    bool c4 = (pX[3] < 4);
    return c1 && c2 && c3 && c4;
}

The compiler reported that it could not vectorize the comparison because of misaligned data:

main.cpp:5:17: note: not vectorized: failed to find SLP opportunities in basic block. 
main.cpp:5:17: note: misalign = 0 bytes of ref MEM[(const unsigned int *)&x] 
main.cpp:5:17: note: misalign = 4 bytes of ref MEM[(const unsigned int *)&x + 4B] 
main.cpp:5:17: note: misalign = 8 bytes of ref MEM[(const unsigned int *)&x + 8B] 
main.cpp:5:17: note: misalign = 12 bytes of ref MEM[(const unsigned int *)&x + 12B]

So my second attempt was to tell g++ to align the data by using a temporary array:

bool compare(unsigned int const pX[4]) { 
    unsigned int temp[4] __attribute__ ((aligned(16))); 
    temp[0] = pX[0]; 
    temp[1] = pX[1]; 
    temp[2] = pX[2]; 
    temp[3] = pX[3]; 

    bool c1 = (temp[0] < 1); 
    bool c2 = (temp[1] < 2); 
    bool c3 = (temp[2] < 3); 
    bool c4 = (temp[3] < 4); 
    return c1 && c2 && c3 && c4; 
}

However, the output was the same. My CPU supports AVX2, and according to the Intel Intrinsics Guide there are intrinsics such as _mm256_cmpgt_epi8/16/32/64 for comparison. Any idea how to tell g++ to use them?

Source

2016-12-06 user1228633

I'm not sure whether there is a portable way, but if you want to check whether all of the bools are set, there are intrinsics (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) that tell you whether all of them are false, by bit count and so on. Intel also has an example (https://software.intel.com/en-us/blogs/2013/05/17/processing-arrays-of-bits-with-intel-advanced-vector-extensions-2-intel-avx2). – Mgetz

32-bit integer comparisons in SSE/AVX are signed; try it with signed values. –
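For the original unsigned case, one common workaround is to flip the sign bit of both operands before a signed compare, which preserves unsigned ordering. The sketch below is not from the thread: the helper name all_less_u32 and the limit parameter are made up for illustration, it uses SSE2 rather than AVX2, and it falls back to a scalar loop when SSE2 is not available. It also shows one way to collapse the four lane results into a single bool with a move-mask.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper (not from the thread): returns true when
// pX[i] < limit[i] holds for all four unsigned 32-bit lanes.
bool all_less_u32(const std::uint32_t pX[4], const std::uint32_t limit[4]) {
#if defined(__SSE2__)
    // Unaligned loads, so the caller's arrays need no special alignment.
    __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pX));
    __m128i l = _mm_loadu_si128(reinterpret_cast<const __m128i*>(limit));
    // Flip the sign bit of every lane; unsigned order then matches signed order.
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    x = _mm_xor_si128(x, bias);
    l = _mm_xor_si128(l, bias);
    // Signed compare: a lane becomes all ones where x < l.
    const __m128i lt = _mm_cmplt_epi32(x, l);
    // One mask bit per byte; 0xFFFF means all four lanes passed.
    return _mm_movemask_epi8(lt) == 0xFFFF;
#else
    // Scalar fallback for targets without SSE2.
    bool ok = true;
    for (int i = 0; i < 4; ++i) {
        ok = ok && (pX[i] < limit[i]);
    }
    return ok;
#endif
}
```

Calling all_less_u32(pX, limit) with limit = {1, 2, 3, 4} should match the compare() from the question. A value such as 0x80000000 in pX is the case where a plain signed compare would give the wrong answer.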

AVX2 requires 32-byte alignment. –

OK, apparently the compiler does not like "unrolled loops". This works for me:

bool compare(signed int const pX[8]) {
    signed int const w[] __attribute__((aligned(32))) = {1,2,3,4,5,6,7,8};
    signed int out[8] __attribute__((aligned(32)));

    for (unsigned int i = 0; i < 8; ++i) {
        out[i] = (pX[i] <= w[i]);
    }

    bool temp = true;
    for (unsigned int i = 0; i < 8; ++i) {
        temp = temp && out[i];
        if (!temp) {
            return false;
        }
    }
    return true;
}

Note that out is also a signed int. Now I only need a fast way to combine the stored results.
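One possible way to combine the results is to drop the early return and AND the per-element results into a single accumulator, which leaves the loop branch-free. This is only a sketch: the name compare_all is made up, and whether a given g++ version actually vectorizes the loop is an assumption that should be checked with -fopt-info-vec.

```cpp
// Hypothetical variant of the compare() above (name is illustrative).
// Returns true when pX[i] <= w[i] for all eight elements.
bool compare_all(const int pX[8]) {
    static const int w[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int all = 1;
    for (int i = 0; i < 8; ++i) {
        // Each comparison yields 0 or 1; AND-ing keeps 1 only if every test passed.
        all &= (pX[i] <= w[i]);
    }
    return all != 0;
}
```

The loop always touches all eight elements, so it trades the early exit for a body with no data-dependent branch.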

Source

2016-12-06 21:02:18 user1228633

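I have also found that unrolled loops are a problem for the compiler. An OpenMP pragma on the fast (innermost) index should get the loop vectorized, accumulating into a sum with a wider bit depth. Another approach is to express a 2D array [n, m] as a 1D array [n * m], which is naturally an easy sum for the compiler. – Holmz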
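One reading of this comment, sketched under assumptions: store the 2D data row-major in one contiguous buffer, loop over the flat index, and mark the loop with an OpenMP simd reduction into a 64-bit counter. The function name count_at_most and its parameters are made up for illustration. The pragma takes effect only when compiling with -fopenmp or -fopenmp-simd; without those flags it is ignored and the code still runs as an ordinary loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts how many elements of a rows x cols matrix,
// stored row-major in one flat buffer, are <= limit.
// Assumes data.size() == rows * cols.
std::int64_t count_at_most(const std::vector<int>& data,
                           std::size_t rows, std::size_t cols, int limit) {
    const int* p = data.data();
    const std::size_t n = rows * cols;  // 2D [rows, cols] viewed as 1D [rows * cols]
    std::int64_t count = 0;             // wider accumulator than the element type
    #pragma omp simd reduction(+:count)
    for (std::size_t i = 0; i < n; ++i) {
        count += (p[i] <= limit);       // comparison yields 0 or 1
    }
    return count;
}
```

An "all elements pass" test then becomes count_at_most(data, rows, cols, limit) == rows * cols.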
