c++ - Is this function a good candidate for SIMD on Intel?
I'm trying to optimize the following function (simplified a bit; it's the loop where my program spends a lot of its time):
int f(int len, unsigned char *p) {
    int i = 0;
    while (i < len && p[i] >= 32 && p[i] <= 127) {
        i++;
    }
    return i;
}
I think it could be optimized with vector instructions, but from a bit of research it looks like SSE is not meant for working at the byte level. The program targets 64-bit Intel CPUs on OS X. Is there some clever bit-twiddling trick I'm not seeing that would let me work on 64 bits at a time? LLVM with -O3 doesn't do any clever optimizations here.
update:
The SIMD code is the fastest in a benchmark (depending on the size of the input), but for some reason the app is slower overall with SIMD than with the naive code or the bit-twiddling trick. For context, the application is finding the length of subsequences of ASCII strings in a stream of input to a terminal emulator; ASCII strings get special "fast path" treatment. I can only mark one answer correct, but both are great. I did make one small improvement to the bit twiddling, removing an if statement by doing this:
while (i < len - 8) {
    uint64_t bytes = *(uint64_t *)(p + i);
    uint64_t middlebits = bytes & 0x6060606060606060;
    uint64_t highbits = bytes & 0x8080808080808080;
    middlebits |= (middlebits >> 1);
    middlebits &= ~(highbits >> 2);
    if ((middlebits & 0x2020202020202020) != 0x2020202020202020) {
        break;
    }
    i += 8;
}
You can vectorize the compares using _mm_cmplt_epi8 and _mm_cmpgt_epi8 (MSVC intrinsics).
You can then use movemask on the result of ANDing the two compare results together. If the movemask result is 0xFFFF, all the comparisons passed; otherwise you need to run a tail loop to find the correct position of the failed test. You could figure it out from the mask, but depending upon the value of 'len' it may not be worth the effort.
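The "figure it out from the mask" step is cheap on anything with a count-trailing-zeros instruction. A sketch, assuming GCC/Clang's __builtin_ctz (this helper is not part of the original answer):

```c
/* mask: a 16-bit _mm_movemask_epi8 result, one bit per byte,
 * where a set bit means that byte passed the range test.
 * Returns the index of the first failing byte, or 16 if all passed. */
static int first_fail(unsigned mask) {
    unsigned failed = ~mask & 0xFFFFu;       /* set bits = failing bytes */
    return failed ? (int)__builtin_ctz(failed) : 16;
}
```

With this, i * 16 + first_fail(mask) gives the final answer for the block that failed, without re-scanning its bytes in a scalar tail loop.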
The original unvectorized loop is still required as a tail if 'len' is not a multiple of 16. This may or may not be faster than your version -- you'd need to profile to be sure.
Scrap that - those compares operate on signed values, so it doesn't work as written. A working version is below.
union UMMU8 {
    __m128i mm_;
    struct {
        unsigned char u8_[16];
    };
};

int f(int len, unsigned char *p) {
    int i = 0;
    __m128i a;
    __m128i b;
    __m128i c;
    UMMU8 *pu = (UMMU8 *)p;  /* note: the pu[i].mm_ load assumes p is 16-byte aligned */
    int const len16 = len / 16;
    while (i < len16) {
        a = pu[i].mm_;
        /* bit 7 of each result byte becomes (bit5 | bit6) & ~bit7, i.e. "printable" */
        b = _mm_slli_epi32(a, 1);
        c = _mm_slli_epi32(a, 2);
        b = _mm_or_si128(b, c);
        a = _mm_andnot_si128(a, b);
        int mask = _mm_movemask_epi8(a);  /* one bit per byte that passed */
        if (mask == 0xffff) {
            ++i;
        } else {
            if (mask == 0) {
                return i * 16;
            }
            break;
        }
    }
    i *= 16;
    while (i < len && p[i] >= 32 && p[i] <= 127) {
        i++;
    }
    return i;
}
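For completeness, the signed-compare problem can also be sidestepped directly: XORing every byte with 0x80 maps unsigned order onto signed order, so the signed compare intrinsics then implement an unsigned range test. A sketch of that standard idiom (my addition, not the answer's code):

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns a 16-bit mask with one bit set per byte of v in [32, 127].
 * XOR with 0x80 turns unsigned comparisons into signed ones:
 *   x >= 32  (unsigned)  <=>  (x ^ 0x80) > -97  (signed)
 *   x <= 127 (unsigned)  <=>  (x ^ 0x80) <  0   (signed)  */
static int printable_mask(__m128i v) {
    __m128i x  = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));
    __m128i ge = _mm_cmpgt_epi8(x, _mm_set1_epi8((char)-97));
    __m128i le = _mm_cmplt_epi8(x, _mm_setzero_si128());
    return _mm_movemask_epi8(_mm_and_si128(ge, le));
}
```

The result has the same shape as the movemask in the code above (0xFFFF means all 16 bytes are printable), so it can slot into the same loop structure.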
Since I don't have a 64-bit OS on my PC, I can't do a proper perf test. However, a profiling run gave:
- Naive loop: 30.44
- 64-bit integer: 15.22 (on a 32-bit OS)
- SSE impl: 5.21
So the SSE version is a lot faster than the naive loop version. I'd expect the 64-bit integer version to perform better on a 64-bit system; there may then be little difference between the SSE and 64-bit versions.