Intrinsically Unpredictable++

Most of this is now moot.

The problem was with Visual Studio (2008). Even though I had the Optimization options set correctly, those settings weren't actually being used to build. I looked in the .vcproj file, and sure enough, there was no 'Optimization="2"' line in the Release configuration for the VCCLCompilerTool, even though the UI said optimization was set to 'O2'. Closing VS, manually adding that line to the configuration in the .vcproj, and then reloading and rebuilding generated code that looked like it should. This is a Visual Studio bug, not a compiler bug.

The intrinsics version runs just as fast as my hand-written assembly.

Much thanks to all who questioned my earlier results.

Update: the compiler now produces code that crashes really fast, in x64 release mode. And MS knows about it. And they aren't going to fix it. There's a workaround for this particular compiler bug, but not for the one I now see with _mm_extract_epi8. Sigh.

Experiment: over.

So there are a lot of trade-offs and complications to be expected when using SIMD instructions. A lot of the promised performance gains get sacrificed to the quirks and restrictions of the architecture. And the only way I know to determine whether a non-trivial operation will really be better using MMX, SSE, or whatever is to code it up and try. So, that's what I did.

The article that started this, from CodeProject, used what are called "compiler intrinsics" for its SSE instructions. These are like training wheels bolted onto the actual CPU instructions. They take care of some of the more dismal aspects of assembly programming while still giving you access to the very low-level instructions themselves. Higher-level languages like C++ don't have any concept of parallel operations, so there's no way for you to code such operations directly. These SIMD intrinsics are extensions to the language that bridge the gap between what the language allows and what the CPU can do. You can use these intrinsics directly from your C/C++ code and the compiler takes care of whatever interfacing needs to happen. They make it far easier to write code that uses the instructions, and in many cases they are perfect equivalents of the actual instructions. It's slick. And for the CodeProject 'alpha blend' article, the intrinsics behaved exactly as you'd want them to. Performance was awesome.
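To give a flavor of what that looks like, here's a minimal sketch - not code from the article, and the function name is made up - that adds eight pairs of 16-bit values with a single SSE2 intrinsic:

#include <emmintrin.h>   // SSE2 intrinsics

// add eight 16-bit values to eight others, all at once
void AddEightShorts(const short* a, const short* b, short* out)
{
   __m128i va  = _mm_loadu_si128((const __m128i*)a);  // load 8 shorts from a
   __m128i vb  = _mm_loadu_si128((const __m128i*)b);  // load 8 shorts from b
   __m128i sum = _mm_add_epi16(va, vb);               // 8 additions in one instruction
   _mm_storeu_si128((__m128i*)out, sum);              // store 8 results
}

Each intrinsic maps more or less directly onto one SSE2 instruction; the compiler decides which registers to use and how to get the data in and out of them.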

Emboldened, off I went.

The very first thing the RGBA/RGB alpha blending process requires is that you read those four RGBA pixels from the image and then expand them from 8-bit data to 16-bit data - that's yesterday's "Problem 3". This isn't hard, thankfully. It's basically three SSE instructions and a little bookkeeping: read, expand the first two pixels to 16 bits, expand the second two to 16 bits.

With C++ intrinsics, the start of the whole per-pixel alpha-blend process looks like this:

      // we'll need 128 bits worth of zeros, later
      __m128i Zeros = _mm_setzero_si128();

      // a loop. count from 0 to the number of pixels, by four
      // (because we're doing four pixels at a time in here)
      for (int curPix = 0; curPix < pixels; curPix += 4)
      {
         // read 128 bits from pRGBA (offset by
         // curPix * 4) into RGBALow
         __m128i RGBALow = _mm_loadu_si128((__m128i*)&pRGBA[curPix * 4]);

         // copy those 128 bits into RGBAHigh
         __m128i RGBAHigh = RGBALow;
      
         // expand (aka unpack) half of the data in 
         // RGBALow from 8 bits to 16 bits. 
         RGBALow  = _mm_unpacklo_epi8(RGBALow, Zeros);

         // expand (aka unpack) the other half 
         // of the data in RGBAHigh from 8 bits to 16 bits. 
         RGBAHigh = _mm_unpackhi_epi8(RGBAHigh, Zeros);

         ... all the rest

The process is:

Read   : RGBALow  = [R G B A]1 [R G B A]2 [R G B A]3 [R G B A]4
Copy   : RGBAHigh = [R G B A]1 [R G B A]2 [R G B A]3 [R G B A]4
Unpack : RGBALow  = [R 0 G 0 B 0 A 0]1 [R 0 G 0 B 0 A 0]2
Unpack : RGBAHigh = [R 0 G 0 B 0 A 0]3 [R 0 G 0 B 0 A 0]4

The second parameter of the _mm_unpack* intrinsics supplies the bytes that get interleaved with the first parameter's bytes. Passing all zeros (via the Zeros variable) basically says "take 8-bit data and make it 16-bit": it inserts a 0 after every byte of the original data. That obviously doubles the size, and since we're limited to 128 bits (16 bytes) we lose either the first half (_mm_unpackhi_epi8) or the second half (_mm_unpacklo_epi8) of the data. But that's why we have a copy - one loses the first half, the other loses the second half. Together, they have all the data, expanded. Just what we want.
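If you want to see that unpacking in action, here's a tiny standalone demo - separate from the blending code - that pushes the bytes 1 through 16 through both unpacks and prints the resulting 16-bit values:

#include <emmintrin.h>
#include <stdio.h>

int main()
{
   unsigned char src[16];
   for (int i = 0; i < 16; ++i)
      src[i] = (unsigned char)(i + 1);                // bytes 1..16

   __m128i Zeros = _mm_setzero_si128();
   __m128i v     = _mm_loadu_si128((const __m128i*)src);

   __m128i Low  = _mm_unpacklo_epi8(v, Zeros);        // bytes 1..8  -> 16-bit values
   __m128i High = _mm_unpackhi_epi8(v, Zeros);        // bytes 9..16 -> 16-bit values

   short out[16];
   _mm_storeu_si128((__m128i*)&out[0], Low);
   _mm_storeu_si128((__m128i*)&out[8], High);

   for (int i = 0; i < 16; ++i)
      printf("%d ", out[i]);                          // prints 1 2 3 ... 16
   printf("\n");
   return 0;
}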

Back to the blend: that's the start of the intrinsics version. It goes on for another fifty lines or so, which I will spare you. And while it's an ugly mess compared to my lovely C++ version, it's a damn sight easier to write and read than pure assembly. And after a bit of debugging, I got it to generate the same results as my C++ code.

And what about speed? Well, from the last post, it's clear that there were going to be a lot of compromises and workarounds. So, I wasn't expecting a 16x performance increase.

The intrinsics code got a performance decrease of 2x, compared to my C++ version. It ran half as fast.
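For what it's worth, the measurement doesn't need anything fancy. Here's a sketch of the kind of harness that's enough for the comparison - it uses QueryPerformanceCounter, and the BlendFn signature is just a stand-in for however your blend functions are declared, not the exact code behind the numbers above:

#include <windows.h>

// stand-in signature: blend an RGBA image over an RGB image, 'pixels' pixels
typedef void (*BlendFn)(const unsigned char* pRGBA, const unsigned char* pRGB,
                        unsigned char* pOut, int pixels);

// time 'iterations' calls of one implementation; returns seconds
double TimeBlend(BlendFn fn, const unsigned char* pRGBA, const unsigned char* pRGB,
                 unsigned char* pOut, int pixels, int iterations)
{
   LARGE_INTEGER freq, start, stop;
   QueryPerformanceFrequency(&freq);
   QueryPerformanceCounter(&start);

   for (int i = 0; i < iterations; ++i)
      fn(pRGBA, pRGB, pOut, pixels);

   QueryPerformanceCounter(&stop);
   return (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}

Run each version through that with the same inputs and iteration count, and the ratios fall right out.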

So what went wrong?

Well, there's always the possibility that the way I approached the problem was fundamentally inefficient. But I couldn't think of a different way to do it. So, I chose to do what programmers from pre-historic times have done: blame the compiler (Visual Studio 2008).

And I was wise to do so.

Take the first line. We're creating a constant - 16 bytes worth of zeros.

__m128i Zeros = _mm_setzero_si128();

Here's what the compiler did:

pxor	xmm0, xmm0
movdqa	XMMWORD PTR $T209723[ebp], xmm0
movdqa	xmm0, XMMWORD PTR $T209723[ebp]
movdqa	XMMWORD PTR _Zeros$[ebp], xmm0

Huh?

Well, that first line is the assembly idiom for setting a register to zero - the xmm0 register, in this case. "pxor" is "packed exclusive OR". If you exclusive-or something with itself, you get zero. And this code xors xmm0 with itself, setting it to zero.

(A register is a small storage space directly on the CPU itself. Registers are far faster than reading and writing from memory so you want to do everything you can in registers. But their number is limited, so sometimes you have to use memory because there just aren't enough registers.)

So, it sets xmm0 to zero. Then it writes those zeros to a bit of temporary memory. Then it reads them back into xmm0 from where it wrote them. Then it writes them to a place it calls "_Zeros". Why? Why not do the pxor and then write that to Zeros? Why write to the temporary location at all? I have no fucking idea why.

Skipping over the code that handles the 'for', we get to the reading of the RGBA pixels:

// read 128 bits from pRGBA (offset by curPix * 4) into RGBALow
__m128i RGBALow  = _mm_loadu_si128((__m128i*)&pRGBA[curPix * 4]);
__m128i RGBAHigh = RGBALow;

Read four RGBA pixels into RGBALow, copy them to RGBAHigh.

mov	edx, DWORD PTR _curPix$209729[ebp]
mov	eax, DWORD PTR _pRGBA$[ebx]
movdqu	xmm0, XMMWORD PTR [eax+edx*4]
movdqa	XMMWORD PTR $T209735[ebp], xmm0
movdqa	xmm0, XMMWORD PTR $T209735[ebp]
movdqa	XMMWORD PTR _RGBALow$209733[ebp], xmm0

movdqa	xmm0, XMMWORD PTR _RGBALow$209733[ebp]
movdqa	XMMWORD PTR _RGBAHigh$209736[ebp], xmm0

The first two lines take care of the bookkeeping to find the exact location of the pixels we need. The third line reads the four pixels into xmm0. Then it writes xmm0 to some temporary memory. Then it reads that memory back into xmm0. Then it writes xmm0 to a bit of memory called 'RGBALow'. Why? Again, not a clue. Then it reads RGBALow back into xmm0 (even though xmm0 already contains RGBALow) and writes it to RGBAHigh.

Why is it writing the data to temporary locations, reading it back, then writing it somewhere else? Honestly, I'm stumped. Why does it read into xmm0 what is already in xmm0? Mysteries.

And, the unpacks:

RGBALow  = _mm_unpacklo_epi8(RGBALow, Zeros);
RGBAHigh = _mm_unpackhi_epi8(RGBAHigh, Zeros);

Here's what the compiler coughed up:

movdqa	xmm0, XMMWORD PTR _Zeros$[ebp]
movdqa	xmm1, XMMWORD PTR _RGBALow$209733[ebp]
punpcklbw xmm1, xmm0
movdqa	XMMWORD PTR $T209737[ebp], xmm1
movdqa	xmm0, XMMWORD PTR $T209737[ebp]
movdqa	XMMWORD PTR _RGBALow$209733[ebp], xmm0

movdqa	xmm0, XMMWORD PTR _Zeros$[ebp]
movdqa	xmm1, XMMWORD PTR _RGBAHigh$209736[ebp]
punpckhbw xmm1, xmm0
movdqa	XMMWORD PTR $T209738[ebp], xmm1
movdqa	xmm0, XMMWORD PTR $T209738[ebp]
movdqa	XMMWORD PTR _RGBAHigh$209736[ebp], xmm0

The first line reads our Zeros from memory into xmm0. The second reads RGBALow into xmm1. The third does the unpacking of RGBALow into xmm1. Then it writes xmm1 to memory, at a place called "T209737". Then, it reads that "T209737" back into xmm0. Then it writes xmm0 to RGBALow. Why not just write xmm1 to RGBALow directly? Why write a value to memory, read that value back from memory, and then write the same value somewhere else? It's like putting something into the refrigerator with your left hand, taking it out with your right hand, then putting it back into the refrigerator, but on a different shelf, with your right hand.

About Zeros... It first generated them with a simple pxor xmm0, xmm0. But in the intrinsics code, it goes to memory and reads them back each time it needs them. No assembly programmer would ever read a register full of zeros from memory. Just xor the damn thing with itself and move on. It's faster, takes less code, less typing, etc.

And note, all of the code in this chunk immediately follows the previous chunk. So, when it starts this chunk, xmm0 contains the "RGBALow" value - remember, the compiler put it there twice in a row, just to be sure! It could've recognized that. But no, it reads Zeros into xmm0, overwriting RGBALow, and then reads RGBALow into xmm1.

It reached into the refrigerator and took out a bottle of juice with its left hand then put the juice back; reached back in and took out a bottle of juice with its left hand; dropped the juice on the floor; grabbed an apple with its left hand; reached in and took out a bottle of juice with its right hand.

So, the code goes on like this for a while. All kinds of pointless write/read/write cycles, no attention paid to which register has which data, etc. If I assume my logic is sound, then I really can blame the compiler for the performance. The code it's generating is absurdly inefficient. I must be able to do better. But the only way to know is to... code it up and time it. That's what I did. I looked at the code the intrinsics generated and stripped out all of those pointless write/read/write things. Then I worked on keeping as much as I could in registers, I got rid of the Zeros thing, and then I simplified the loop bookkeeping.
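To make one of those changes concrete: the loop-bookkeeping simplification is basically walking a pointer instead of recomputing &pRGBA[curPix * 4] on every pass. A rough C++ equivalent of the idea (not the assembly itself):

// advance a pointer by 16 bytes (4 RGBA pixels) per iteration,
// instead of re-indexing pRGBA[curPix * 4] each time around
const unsigned char* pSrc = pRGBA;
for (int curPix = 0; curPix < pixels; curPix += 4, pSrc += 16)
{
   __m128i RGBALow  = _mm_loadu_si128((const __m128i*)pSrc);
   __m128i RGBAHigh = RGBALow;
   // ... unpack and blend as before ...
}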

So here are the bits I've discussed: the reading of the 4 RGBA pixels and their unpacking into 16-bit data:

Intrinsics:

__m128i RGBALow  = _mm_loadu_si128((__m128i*)&pRGBA[curPix * 4]);
__m128i RGBAHigh = RGBALow;
RGBALow  = _mm_unpacklo_epi8(RGBALow, Zeros);
RGBAHigh = _mm_unpackhi_epi8(RGBAHigh, Zeros);

And the mess created by the compiler from the intrinsics version:

pxor	xmm0, xmm0
movdqa	XMMWORD PTR $T209723[ebp], xmm0
movdqa	xmm0, XMMWORD PTR $T209723[ebp]
movdqa	XMMWORD PTR _Zeros$[ebp], xmm0

... (loop bookkeeping) ...

mov	edx, DWORD PTR _curPix$209729[ebp]
mov	eax, DWORD PTR _pRGBA$[ebx]
movdqu	xmm0, XMMWORD PTR [eax+edx*4]
movdqa	XMMWORD PTR $T209735[ebp], xmm0
movdqa	xmm0, XMMWORD PTR $T209735[ebp]
movdqa	XMMWORD PTR _RGBALow$209733[ebp], xmm0

movdqa	xmm0, XMMWORD PTR _RGBALow$209733[ebp]
movdqa	XMMWORD PTR _RGBAHigh$209736[ebp], xmm0

movdqa	xmm0, XMMWORD PTR _Zeros$[ebp]
movdqa	xmm1, XMMWORD PTR _RGBALow$209733[ebp]
punpcklbw xmm1, xmm0
movdqa	XMMWORD PTR $T209737[ebp], xmm1
movdqa	xmm0, XMMWORD PTR $T209737[ebp]
movdqa	XMMWORD PTR _RGBALow$209733[ebp], xmm0

movdqa	xmm0, XMMWORD PTR _Zeros$[ebp]
movdqa	xmm1, XMMWORD PTR _RGBAHigh$209736[ebp]
punpckhbw xmm1, xmm0
movdqa	XMMWORD PTR $T209738[ebp], xmm1
movdqa	xmm0, XMMWORD PTR $T209738[ebp]
movdqa	XMMWORD PTR _RGBAHigh$209736[ebp], xmm0

And finally, here's my hand-written assembly:

movdqu		xmm0, XMMWORD PTR [rax]	; read 4 RGBA pixels into xmm0
movdqa 		xmm1, xmm0      ; copy RGBAx4 into xmm1

; RGBA RGBA RGBA RGBA -> RxGxBxAx RxGxBxAx , RxGxBxAx RxGxBxAx
pxor		xmm5, xmm5      ; zeros
punpcklbw	xmm0, xmm5      ; unpack xmm0 - RGBAL
punpckhbw	xmm1, xmm5      ; unpack xmm1 - RGBAH

They all do the same thing.

xmm5? Yeah, there are 8 128-bit xmm registers the intrinsics code could've used. It could've used the other six for temporary storage instead of reading and writing memory. It could've done all kinds of things with them. But it only ever used xmm0 and xmm1, and it used them terribly. For example, in my code I put the first two RGBA pixels into xmm0 and left them there. I wasn't reading or writing memory every time I needed those pixels; when I needed them for something, they were in xmm0, waiting. And when I didn't need them anymore, I got to use xmm0 for other things.

But my brevity and clever register usage count for nothing on their own. What matters is how it performed. Well, my hand-written assembly version is about three times faster than my C++ version, and between five and six times faster than the intrinsics version. I didn't have to change my logic at all; I just needed to get the compiler out of the way.

Why did intrinsics fail so miserably with my code while working fine for the code from CodeProject? I can only guess. And my guess is that the algorithm for the global alpha blending is so much simpler, and the code so much smaller, that the compiler was able to keep things in registers and avoid storing things in memory. With everything in registers, it never had to resort to the ridiculous and inexplicable write/read/write pattern. Looking at what the compiler generated for that code confirms it - there's none of that write/read/write nonsense, and the generated code is small and efficient.
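To show what I mean by "so much simpler": a constant-alpha blend of one whole image over another is roughly this much code. This is my own sketch of the idea, not the CodeProject article's code - the function name and the out = (src*alpha + dst*(256-alpha))/256 convention are mine:

#include <emmintrin.h>

// blend every byte of 'src' over 'dst' with one constant alpha (0..255)
// assumes 'bytes' is a multiple of 16
void GlobalBlend(const unsigned char* src, const unsigned char* dst,
                 unsigned char* out, int bytes, int alpha)
{
   __m128i Zeros = _mm_setzero_si128();
   __m128i A     = _mm_set1_epi16((short)alpha);          // alpha in all 8 lanes
   __m128i InvA  = _mm_set1_epi16((short)(256 - alpha));  // 256-alpha in all 8 lanes

   for (int i = 0; i < bytes; i += 16)
   {
      __m128i s = _mm_loadu_si128((const __m128i*)&src[i]);
      __m128i d = _mm_loadu_si128((const __m128i*)&dst[i]);

      // widen the 8-bit channels to 16 bits
      __m128i sLo = _mm_unpacklo_epi8(s, Zeros);
      __m128i sHi = _mm_unpackhi_epi8(s, Zeros);
      __m128i dLo = _mm_unpacklo_epi8(d, Zeros);
      __m128i dHi = _mm_unpackhi_epi8(d, Zeros);

      // src*alpha + dst*(256-alpha), then divide by 256
      __m128i lo = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(sLo, A),
                                                _mm_mullo_epi16(dLo, InvA)), 8);
      __m128i hi = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(sHi, A),
                                                _mm_mullo_epi16(dHi, InvA)), 8);

      // pack back down to 8 bits and store
      _mm_storeu_si128((__m128i*)&out[i], _mm_packus_epi16(lo, hi));
   }
}

There's no per-pixel alpha to juggle, so the whole inner loop fits comfortably in the available registers - which is presumably why the compiler did fine with it.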

I'm sure there are further improvements to be had. But, I'll take my 3x speedup and lesson learned.

And, I just got Beck tickets.

7 thoughts on “Intrinsically Unpredictable++”

  1. Pseudonym

    It looks really likely that you’re compiling without optimization, based on the pattern of just using a couple of registers but lots of temporaries.

    1. cleek Post author

      that’s what i thought, too. but i’ve tried every optimization/inline/intrinsic setting i can find, and i still get that crappy assembly.

      and even if it is… why would that code even be acceptable in an unoptimized state? it’s horrid. it doesn’t even do the debugger any good, since you can debug straight assembly without issue.

      1. Pseudonym

        Code like that is an artifact of the way compilers work. Every node in the parse tree basically gets assigned to a unique temporary variable, which is assigned to a memory location in the stack frame. Since SSE2 operations are register-register or register-memory only rather than memory-memory, using an intrinsic involves loading the operands from memory to registers, performing the operation on those registers, then storing the result in the temporary variable allocated for the parse node corresponding to the right side of the assignment statement. The assignment statement itself results in code to copy that temporary variable to the named local variable.

        Compiling some CodeProject SSE2 example code with /O2 /Zi /arch:SSE2 seemed to generate decent code for me in VS2012, but I’m really more familiar with GCC/clang. If you feel like sharing your project I can try taking a look.

  2. ChrisR

    My guess was also that optimizations were off and I half-expected to see you enable optimizations later in the article to show the difference. But really, it generated that with optimizations on? Wow!

  3. ChrisR

    Hmm, I just ran some tests here. With VS2013 I see similar code being generated with optimizations off. But with optimizations on it is much better. What compiler are you using?

    1. cleek Post author

      VS2008

      good to know everyone else is getting good results. maybe there’s some compiler setting i haven’t discovered yet that’s giving me the crappy code.

      i will keep looking.

      1. ChrisR

        Very interesting! Thanks for posting the follow up and additional information.
