批改娘 20018. Square Root of Vector Elements

1. Background
2. Problem
3. Sample Input (stdin)
4. Sample Output (stdout)
5. Solution

Background

Intel 公司在 X86 指令集上提供了 AVX (Advanced Vector Extensions)、SSE (Streaming SIMD Extensions) 的特殊指令，這些在 SIMD 指令上提供強力的加速。

Problem

現在給你長度為 $N$ 的 32-bit floating point，分別將其開根號回傳。

你可能會需要函數庫 #include <x86intrin.h>
你可能會需要 AVX/SSE 開根指令 __m256 _mm256_sqrt_ps(__m256 a);

main.c

#include <stdio.h>
#include <memory.h>
#include <math.h>
#include <time.h>
#include "VSQRT.h"
static float a[1048576], b[1048576];
int main() {
    int N = 1048576;
    for (int i = 0; i < N; ++i)
        a[i] = i;
    for (int it = 0; it < 20; it++)
    {
        memcpy(b, a, sizeof(a[0])*N);
        clock_t t = clock();
        for (int i = 0; i < 10; i++)
            sqrt2(b, b+N);
        t = clock() - t;
        fprintf(stderr, "It took me %f seconds.\n", ((float) t)/CLOCKS_PER_SEC);
        float sum = 0;
        for (int i = 0; i < N; i++)
            sum += b[i];
        printf("%f\n", sum);
    }
    return 0;
}

VSQRT.h

#ifndef __VSQRT_H
#define __VSQRT_H
void sqrt2(float *begin, float *end);
#endif

VSQRT.c

請嘗試加速以下程式碼。

#include "VSQRT.h"
#include <math.h>
void sqrt2(float *begin, float *end) {
     for (; begin != end; begin++)
         *begin = sqrt(*begin);
}

Sample Input (stdin)

no input

Sample Output (stdout)

1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000
1051776.125000

Solution

這裡我們使用 Intel CPU 撰寫 SIMD 指令時，透過編譯參數 -msse -mavx ... 指定相關的指令集進行調適。這些單一指令多資料處理的操作，使得在指定層級得到較高的平行度。

為了使用它們，我們需要較多的前置處理，例如要把資料載入和儲存時，必然要符合對齊的標準，若記憶體位址不是 32 的倍數 (AVX 的 256 bits) 或者 16 的倍數 (SSE 的 128 bits)，則系統會產生 trap，進而導致程式運行中斷而發生錯誤。這方面必須特別小心，有時編譯器恰好幫你對齊而正常運行，若增加其他變數時，對齊就有可能跑掉，這時候再去找錯誤會變得相當瑣碎。

通常些 SSE/AVX 加速指令，將會在 gcc -Ofast 中使用，而在 -O3 下仍有可能尚未開啟。而在 LLVM 中的 clang -O2 編譯下就會自動開啟 SSE。無論如何，要明白開啟這些的并行指令的代價，數據小造成運行效能低落，隨著資料量遽增才會明顯加速，若牽涉到浮點數計算，務必注意精準度的需求。

#include "VSQRT.h"
 
#include <stdint.h>
#include <math.h>
#include <x86intrin.h>
 
//#undef __AVX__
//#undef __SSE__
 
#if defined(__AVX__)
void sqrt2(float *a, float *end) {
    uint32_t offset = ((void *) a - (void *) 0);
    for (; a < end && (offset&31); a++, offset += sizeof(float))
        *a = sqrt(*a);
    __m256* ptr = (__m256*) a;
    for (; a + 8 <= end; a += 8, ptr++)
        _mm256_store_ps(a, _mm256_sqrt_ps(*ptr));
    for (; a < end; a++)
        *a = sqrt(*a);
}
#elif defined(__SSE__)
void sqrt2(float *a, float *end) {
    uint32_t offset = ((void *) a - (void *) 0);
    for (; a < end && (offset&15); a++, offset += sizeof(float))
        *a = sqrt(*a);
    __m128* ptr = (__m128*) a;
    for (; a + 4 <= end; a += 4, ptr++)
        _mm_store_ps(a, _mm_sqrt_ps(*ptr));
    for (; a < end; a++)
        *a = sqrt(*a);
}
#else
void sqrt2(float *begin, float *end) {
    for (; begin != end; begin++)
        *begin = sqrtf(*begin);
}
#endif

Morris' Blog