OpenBLAS slower than the intrinsic function dot_product

I need to compute a dot product in Fortran. I can either use the Fortran intrinsic function dot_product or use ddot from OpenBLAS. The problem is that ddot is slower. This is my code:

With dot_product:

program VectorModule 
! time VectorModule.e = 0.19s 
implicit none 
double precision, dimension (3) :: b 
double precision     :: result 
integer, parameter    :: LargeInt_K = selected_int_kind (18) 
integer (kind=LargeInt_K)  :: I 

DO I = 1, 10000000 
    b(:) = 3 
    result = dot_product(b, b) 
END DO 
end program VectorModule

With BLAS:

program VectorBLAS 
! time VectorBlas.e = 0.30s 
implicit none 
double precision, dimension(3) :: b 
double precision    :: result 
double precision, external  :: ddot 
integer, parameter    :: LargeInt_K = selected_int_kind (18) 
integer (kind=LargeInt_K)  :: I 

DO I = 1, 10000000 
    b(:) = 3 
    result = ddot(3, b, 1, b, 1) 
END DO 
end program VectorBLAS

Both codes are compiled using:

gfortran file_name.f90 -lblas -o file_name.e

What am I doing wrong? Shouldn't BLAS be faster?

Source

2016-03-25 F.N.B

http://stackoverflow.com/questions/35926940//36035152#36035152

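BLAS, and especially its optimized versions, is generally faster for larger arrays, while the intrinsic functions are faster for smaller sizes.

This can be seen in particular in the linked source code of ddot, where additional work is spent on further functionality (e.g., different increments). For small array lengths, the work done there outweighs the performance gain from the optimizations.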
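To make that per-call overhead concrete, here is a simplified sketch modelled on the structure of the reference (Netlib) ddot. The names ddot_sketch_mod and ddot_sketch are hypothetical and exist only for illustration. This is not OpenBLAS's actual kernel, which is architecture-specific; it only shows the kind of argument checks and increment handling that run on every call, even for a three-element vector.

```fortran
module ddot_sketch_mod
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Hypothetical, simplified sketch in the style of the reference BLAS ddot.
  function ddot_sketch(n, x, incx, y, incy) result(dot)
    integer, intent(in)      :: n, incx, incy
    real(real64), intent(in) :: x(*), y(*)
    real(real64)             :: dot
    integer :: i, ix, iy, m

    dot = 0.0_real64
    if (n <= 0) return                    ! argument check on every call

    if (incx == 1 .and. incy == 1) then   ! branch on the increments
      m = mod(n, 5)
      do i = 1, m                         ! clean-up loop for the remainder
        dot = dot + x(i)*y(i)
      end do
      do i = m + 1, n, 5                  ! main loop, unrolled by 5
        dot = dot + x(i)*y(i) + x(i+1)*y(i+1) + x(i+2)*y(i+2) &
                  + x(i+3)*y(i+3) + x(i+4)*y(i+4)
      end do
    else                                  ! general strided case
      ix = 1
      iy = 1
      if (incx < 0) ix = (1 - n)*incx + 1 ! start at the far end for negative strides
      if (incy < 0) iy = (1 - n)*incy + 1
      do i = 1, n
        dot = dot + x(ix)*y(iy)
        ix = ix + incx
        iy = iy + incy
      end do
    end if
  end function ddot_sketch
end module ddot_sketch_mod

program demo
  use ddot_sketch_mod
  implicit none
  real(real64) :: b(3), a(7)
  integer :: i

  b = 3.0_real64
  a = [(real(i, real64), i = 1, 7)]
  ! Each line prints the sketch result next to the intrinsic; the two should agree.
  print *, 'n = 3, unit stride:', ddot_sketch(3, b, 1, b, 1), dot_product(b, b)
  print *, 'n = 4, stride 2:   ', ddot_sketch(4, a, 2, a, 2), dot_product(a(1:7:2), a(1:7:2))
end program demo
```

For n = 3 the unrolled main loop never executes, so the call pays for the checks, the branch, and the function call itself without gaining anything from the unrolling. The intrinsic, by contrast, can typically be expanded inline by the compiler when the array size is known.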

If you make your vectors much larger, the optimized version should be faster.

Here is an example to illustrate this:

program test 
    use, intrinsic :: ISO_Fortran_env, only: REAL64 
    implicit none 
    integer     :: t1, t2, rate, ttot1, ttot2, i 
    real(REAL64), allocatable :: a(:),b(:),c(:) 
    real(REAL64), external :: ddot 

    allocate(a(100000), b(100000), c(100000)) 
    call system_clock(count_rate=rate) 

    ttot1 = 0 ; ttot2 = 0 
    do i=1,1000 
        call random_number(a) 
        call random_number(b) 

        ! time the intrinsic dot_product 
        call system_clock(t1) 
        c = dot_product(a,b) 
        call system_clock(t2) 
        ttot1 = ttot1 + t2 - t1 

        ! time the BLAS routine ddot 
        call system_clock(t1) 
        c = ddot(100000,a,1,b,1) 
        call system_clock(t2) 
        ttot2 = ttot2 + t2 - t1 
    enddo 
    print *,'dot_product: ', real(ttot1)/real(rate) 
    print *,'BLAS, ddot: ', real(ttot2)/real(rate) 
end program

The BLAS routine is noticeably faster here:

OMP_NUM_THREADS=1 ./a.out 
dot_product: 0.145999998  
BLAS, ddot: 0.100000001

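Source

2016-03-25 14:33:05

@FNB Note: this also depends on which BLAS implementation you use and how it was compiled. MKL is very efficient on Intel CPUs. Also, if you simply installed OpenBLAS from your distribution's package repository, it may not be ideally tuned for your architecture.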
