I found this article on CodeProject that shows a way to do “alpha blending” of one image onto another using Intel’s sexy SIMD instructions. Whassat? SIMD = Single Instruction, Multiple Data. Instead of adding sixteen numbers to sixteen other numbers one at a time, you add all sixteen pairs at once. One instruction, “add”, performed on multiple values in parallel. Instant sixteen-fold speedup!
The drawback is that these SIMD instructions (Intel’s MMX, SSE, SSE2, etc.) are fairly limited in utility, kind of arcane, and sometimes come with some pretty harsh restrictions. So, you end up having to rework your process to fit what the SIMD instructions allow. The potential benefits are great (16x! for every bit of arithmetic!), but you generally burn a ton of time getting all your data lined up just right so you can do those parallel adds. And as always, things that are simple in C/C++ become challenging in Assembly, and reworking your process to fit the SIMD restrictions on top of that can become an excruciating nightmare.
But excruciating is my first name. Well, not really, but it’s actually kind of close, in a theological and etymological way.
So, this CodeProject article shows how to blend one image onto another with variable opacity. It’s not a really complicated process: for every pixel, Pixelnew = (Pixel1 * opacity) + (Pixel2 * (1.0 – opacity)). The trick is to make it fast.
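In plain scalar C++, the per-channel version of that equation looks something like this. This is a minimal sketch of the idea, not the article’s code; it uses a 0–256 integer opacity (my own convention) so everything stays in integer math:

```cpp
#include <cassert>
#include <cstdint>

// Blend one 8-bit channel (R, G or B) of two pixels.
// opacity256 runs 0..256 instead of 0.0..1.0, so the whole thing stays
// in integer math: result = (top*op + bottom*(256-op)) / 256.
uint8_t blend_channel(uint8_t top, uint8_t bottom, unsigned opacity256)
{
    return static_cast<uint8_t>(
        (top * opacity256 + bottom * (256 - opacity256)) >> 8);
}
```

Doing that three times per pixel, for millions of pixels, is exactly the kind of repetitive arithmetic SIMD was built for.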
I have some C++ code that burns through it pretty quickly and I was satisfied with it. But, I tried this SIMD code from the article, and had to admit defeat. It was nearly four times as fast as my beautiful C++. So, I reworked it a bit and replaced my C++ code with the result.
That got me to thinking.
What the article calls “alpha blending” is not what I call alpha blending. The code just does a blend of two images with a global opacity value – every pixel uses the same opacity. In my experience, “alpha blending” is what happens when each pixel has its own opacity value, aka “per-pixel alpha.” Where a standard pixel has a Red, Green, and Blue value, a pixel with per-pixel alpha has a Red, Green, Blue and Alpha value. That’s much more flexible (at the cost of 33% more memory per image).
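Concretely, packed tightly in memory the two pixel formats look like this (a sketch; the struct names are mine), which is where that 33% figure comes from:

```cpp
#include <cassert>
#include <cstdint>

// Tightly packed pixel layouts: 3 bytes per pixel without alpha, 4 with.
#pragma pack(push, 1)
struct RGB  { uint8_t r, g, b; };     // 3 bytes
struct RGBA { uint8_t r, g, b, a; };  // 4 bytes: one extra alpha byte per pixel
#pragma pack(pop)
```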
What I wanted was a way to blend an RGBA (Red, Green, Blue and Alpha) image onto an RGB (no alpha) image, using the SIMD instructions. Every pixel in the ‘top’ image has its own opacity value, but the bottom image doesn’t.
RGBA and RGB pixels are different sizes. RGBA pixels are four bytes wide and RGB pixels are three. So I have to shuffle things around to get the Rs, Gs and Bs all lined up. That’s going to eat up time.
The SSE instructions work on chunks of data 128 bits wide. You read 128 bits, do your stuff on those 128 bits, then write 128 bits out. That means we can read 4 RGBA pixels at a time (8 bits per byte, 1 byte each for R, G, B and A: 8 x 4 x 4 = 128). But 128 bits would hold 5.3 RGB pixels. Since we need an RGBA for every RGB, we’re going to have to discard those last 1.3 RGB pixels on each pass. Not a problem, but the inefficiency gnaws at me.
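Reading four RGBA pixels in one 128-bit load is the easy part. Here’s a sketch (mine, not the article’s) using the SSE2 intrinsics: it loads four pixels and isolates their alpha bytes, which is the piece we need for per-pixel blending. Since memory order is R,G,B,A and x86 is little-endian, alpha is the top byte of each 32-bit lane, so a 24-bit logical right shift exposes it:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Load four RGBA pixels (16 bytes) and return the alpha of pixel i (0..3).
uint32_t alpha_of_pixel(const uint8_t* rgba, int i)
{
    __m128i px = _mm_loadu_si128(
        reinterpret_cast<const __m128i*>(rgba));  // 4 RGBA pixels at once
    __m128i a = _mm_srli_epi32(px, 24);  // each 32-bit lane now holds its alpha
    alignas(16) uint32_t out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), a);
    return out[i];
}
```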
Due to the nature of computer math, you can’t do that alpha blending equation using just 8 bits per R, G or B. The intermediate values will exceed what 8 bits can hold. So, you have to expand to 16-bit data, and that means we can process just 2 pixels at a time, not 4. All’s not lost, since there’s a lot of initial work that can be done before that expansion. So, we’ll split the four initial 8-bit-per-channel pixels into two pairs of 16-bit-per-channel pixels, process one pair after the other, then recombine at the end, saving all the initial work at the cost of a little more work at the end.
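The overflow problem is easy to see: 255 × 255 = 65025, which doesn’t fit in 8 bits. The standard SSE2 trick for widening is to interleave the bytes with zeros (punpcklbw). Here’s one channel’s worth as a sketch, assuming nothing from the article’s code:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Widen an 8-bit channel and alpha to 16 bits, then multiply.
// Interleaving with a zero register turns bytes into 16-bit lanes,
// so the products (up to 255*255 = 65025) fit without overflow.
uint16_t widen_then_scale(uint8_t channel, uint8_t alpha)
{
    __m128i zero = _mm_setzero_si128();
    __m128i c = _mm_unpacklo_epi8(
        _mm_set1_epi8(static_cast<char>(channel)), zero);  // 8 x 16-bit lanes
    __m128i a = _mm_unpacklo_epi8(
        _mm_set1_epi8(static_cast<char>(alpha)), zero);
    __m128i prod = _mm_mullo_epi16(c, a);  // 16-bit products, no overflow
    return static_cast<uint16_t>(_mm_extract_epi16(prod, 0));
}
```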
We’re still ahead of C++, which has to do each pixel as three separate operations (while SIMD does two pixels at a time), but you can see we’re running out of potential, quick.
After we process our two pixels, we have to store the result. But again, 128 bits = 5.3 RGB pixels, and we’re only working on 4. So we have to keep those extra 1.3 pixels around, stick them on the end of the four pixels we actually processed and then write the whole thing out – otherwise, we’d end up writing garbage over those 1.3 pixels on the end (which are the first 1.3 pixels of the next round!)
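The store, then, is a read-modify-write: write the 12 bytes we computed, and carry the destination’s next 4 bytes through untouched. A scalar sketch of the idea (the function name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// We computed 4 RGB pixels (12 bytes) but must write 16 bytes, so the
// last 4 bytes are carried over from the destination before the write,
// instead of being clobbered with garbage.
void store_12_of_16(uint8_t* dst, const uint8_t* newPixels /* 12 bytes */)
{
    uint8_t chunk[16];
    std::memcpy(chunk, newPixels, 12);     // the 4 RGB pixels we processed
    std::memcpy(chunk + 12, dst + 12, 4);  // preserve the next round's 1.3 pixels
    std::memcpy(dst, chunk, 16);           // one 16-byte write
}
```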
Because we always process 4 RGBA/RGB pixels at a time, and we can’t read or write past the end of the image, we can’t do anything with an image that has fewer than four pixels. And if we end up with 1, 2 or 3 leftover pixels at the end of a row, this code can’t do anything with them, either. So, we have to limit ourselves to images that are multiples of 4 wide.
Did I solve these problems? Tune in tomorrow-ish!