Category Archives: Programming

Dynamic Camouflage

There's a specific technique for finding the optimal solution to complex problems (typically mathematical or computer programming problems) by breaking them into simpler sub-problems - each of which only needs to be solved once, optimally - and then combining the results of those sub-problems to arrive at the final solution. It's called "Dynamic Programming". Sounds sexy! But when you get into it, you notice that there's really nothing dynamic about it. When you start down the DP path for a problem, you don't find yourself in an exciting world of fast and forceful - dynamic! - things to think about; rather, it's more about carefully and deliberately rethinking the problem at hand until you find a way to flip it on its head, or turn it inside out, or maybe even solve it backwards.
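
If you want the flavor of it in code, here's about the smallest example I can think of - a memoized Fibonacci (my own toy illustration, nothing to do with Bellman's original problems): each sub-problem is solved exactly once, stored, and reused.

#include <vector>

long long fib(int n, std::vector<long long> &memo)
{
   if (n < 2) return n;
   if (memo[n] != -1) return memo[n];                // this sub-problem was already solved
   return memo[n] = fib(n - 1, memo) + fib(n - 2, memo);
}

long long fib(int n)
{
   std::vector<long long> memo(n + 1, -1);           // one slot per sub-problem
   return fib(n, memo);
}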

And the name doesn't actually refer to computer programming. It's about mathematical programming: finding the best possible solution from the set of all possible solutions, given some definition of 'best'. That DP is often used in computer programming today is coincidence.

So where does the name come from?

The man who came up with the name, Richard Bellman, explains:

"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

Rain girl

This is one of my favorite pics - a girl walking in the rain outside our first house. I've posted it previously, a couple of times.

I came up with an implementation of Aaron Hertzmann's "painterly" non-photorealistic rendering algorithm. This recreates the source image by using simulated brush strokes of varying size, color and shape. It's not a blurring or filtering of the source; it's a recreation made in much the same fashion that a human painter would copy a picture: look at the source, choose a color, make a stroke which follows that color. Repeat until you're done. As with a human painter, it starts with large brushes to get the background and large areas filled, then it refines the image with progressively smaller brushes.
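
Boiled way down, and with hypothetical Image, blur(), errorAt(), colorAt() and drawStroke() helpers standing in for the real plumbing, the structure looks something like this (a sketch, not my actual implementation):

#include <vector>

void paint(const Image &source, Image &canvas, const std::vector<int> &brushSizes, float threshold)
{
   for (int radius : brushSizes)                     // biggest brush first
   {
      Image reference = blur(source, radius);        // paint from a blurred copy of the source
      for (int y = 0; y < canvas.height(); y += radius)
         for (int x = 0; x < canvas.width(); x += radius)
            if (errorAt(canvas, reference, x, y, radius) > threshold)   // canvas still too far off here?
               drawStroke(canvas, reference, x, y, radius,              // lay a stroke that follows the
                          colorAt(reference, x, y));                    // color found in the reference
   }
}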

This video is the algorithm in action, drawing the strokes (20 per frame, to save time). I told it to use short strokes.

The music is called "One Bird". That's me.

Here's one made from a pic of Adrian Belew:

Longer, thinner brush strokes.

And, Tricksey:

Tricksey, Painterly

Drat Of The Day

Which is the minimum of the two:

A: -3.4028235e+038
B: 1.1754944e-038

That depends on how you define minimum. If you want the smallest magnitude, it's B. B is so close to 0 it might as well be 0 for most practical uses (0.000000000000000000000000000000000000011754944). But A is smaller if you think 'minimum' means 'leftmost on the number line'. A is so far to the left of B there aren't enough pixels on all the monitors in the world to draw that number line.

In C++ the result of std::numeric_limits<float>::min() is B: the smallest value a 'float' can store is B.

But the most-negative number a float can store is A. And A is -std::numeric_limits<float>::max().

So, if you are trying to limit the results of a function to the range of values that a float can hold, and you do this:

double result = doSomething();
double t = max(std::numeric_limits<float>::min(), result);
t = min(std::numeric_limits<float>::max(), t);
return (float)t;

You're going to be very surprised when you discover that you never see a value smaller than 1.1754944e-038. You'll never even get 0.

What you really needed to write is this:

double result = doSomething();
double t = max(-std::numeric_limits<float>::max(), result);
t = min(std::numeric_limits<float>::max(), t);
return (float)t;

But, that trickery only applies to floating point types. The smallest magnitude you can store in an int is 0. But std::numeric_limits<int>::min() is -2147483648. That's the most-negative value, not the value with least magnitude.
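
To see it all in one place, here's a tiny stand-alone snippet (mine, not from any project code) that prints the values in question:

#include <iostream>
#include <limits>

int main()
{
   std::cout << std::numeric_limits<float>::min()  << "\n";  // 1.17549e-38: smallest positive float (B)
   std::cout << -std::numeric_limits<float>::max() << "\n";  // -3.40282e+38: most-negative float (A)
   std::cout << std::numeric_limits<int>::min()    << "\n";  // -2147483648: most-negative int
   return 0;
}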

And now you know.

So, Filling...

Continuing with this... (busy week at work! whew!)

Instead of filling with boring old rectangles, what about filling with ellipses?

First problem: it's just too much work to do the dividing and conquering with actual ellipses. There's no good way to recursively chop a rectangular image into successively smaller ellipses. So, I decided to just stick with the rectangles right up until the point where I go to fill them in. Then, I'll just draw an ellipse that covers what the rectangle covers. This is just an experiment. I can do what I want.

So, given that we need each ellipse to cover, at least, the same area as the rectangle would have, is there a way to determine the optimal ellipse? That is: given a rectangle, how big is its proportional enclosing ellipse? Beats me. Sounds like a tough problem. But, StackExchange knew:

A (semi-axis) = RectWidth / √2
B (semi-axis) = RectHeight / √2

So much simpler than I'd expected.
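
In code it's one line per axis. Here's a sketch with illustrative types (Rect and Ellipse are stand-ins, not the actual project code):

#include <cmath>

struct Rect    { float x, y, w, h; };
struct Ellipse { float cx, cy, a, b; };              // center plus semi-axes

Ellipse enclosingEllipse(const Rect &r)
{
   Ellipse e;
   e.cx = r.x + r.w / 2.0f;                          // centered on the rectangle
   e.cy = r.y + r.h / 2.0f;
   e.a  = r.w / std::sqrt(2.0f);                     // A = RectWidth / √2
   e.b  = r.h / std::sqrt(2.0f);                     // B = RectHeight / √2
   return e;
}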

So, here we go. Everything else is the same, except that I'm drawing ellipses instead of rectangles.

ml_sz2_t100_ell

Hmm... Interesting. Looks like a lot of eggs. I don't like that the tops and lefts of the ellipses are all visible, but the bottoms and rights aren't. That's a consequence of the fact that the overall drawing and processing basically goes left to right, top to bottom.

Can I change the order in which they're drawn? Yes.

Instead of drawing ellipses as soon as they're calculated, I put them in a list. At the end I have a list of all the ellipses that would've been drawn. Then, sort that list by the area of each ellipse. And then, draw the sorted ellipses - biggest first. That way, the smaller ellipses will overlay the bigger ones. And that gives:

ml_sz2_t100_ellSt

Better. There's variety in which parts of the ellipses are obscured. And, the ends of the really big ellipses along the left and right sides are covered up.
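
The change itself is tiny. A sketch, reusing the Ellipse struct from the snippet above plus a hypothetical drawEllipse():

#include <algorithm>
#include <vector>

void drawAllSorted(std::vector<Ellipse> &ellipses)
{
   // sort biggest-area first, so smaller ellipses get drawn over larger ones
   std::sort(ellipses.begin(), ellipses.end(),
             [](const Ellipse &l, const Ellipse &r)
             { return l.a * l.b > r.a * r.b; });     // area is π·a·b, and π cancels
   for (const Ellipse &e : ellipses)
      drawEllipse(e);
}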

With borders? Sure!

ml_sz2_t100_ellStO

No fill, borders only?

ml_sz2_t100_ellStNo

Sweeeeeeet.

Divide And Conquer

There's an old-school technique for rendering the classic fractal, the Mandelbrot set, called Divide and Conquer. This technique takes advantage of the fact that all of the points in the Mandelbrot set are 'connected' - there are no islands of Mandelbrot set points which aren't connected to the main set, and there are no islands of non-Mandelbrot-set points enclosed by points within the Mandelbrot set. Think of it as a single blob of water. Given this, you can speed up rendering of any subset of the Mandelbrot set by:

  1. Divide the image into four rectangles.
  2. For each of the four rectangles:
  3. If the rectangle has a non-zero width and height:
  4. Calculate all the points along the edges.
  5. If all the points along the edges have the same calculated value, all points within the rectangle must be the same, too. No 'islands', remember. So, fill the rectangle with the value you calculated for the edges.
  6. If the points along the edge are not the same, divide that rectangle into four rectangles and, for each sub-rectangle, go to step 3. (Keep in mind that since you've already calculated the points on the edges, you can reduce the size of the rectangle by 1 on each edge).

So, it's a recursive space-filling technique. It divides the work area into smaller and smaller rectangles until it finds something it can fill or it can no longer sub-divide. And in this way it gobbles up large identically-valued sections of the Set very quickly, and so it spends most of its time working on details. In general, that's a performance gain. And, even better, this technique is far more fun to watch than just seeing pixels calculated one at a time, one row after another. It's kind of mesmerizing. Or, it was back in the day. These days, computers are fast enough that you can't see it in action.
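
For the curious, the recursion looks roughly like this - a sketch only, with hypothetical mandel(x, y) (the per-pixel iteration count) and fillRect() helpers standing in for a real renderer:

void conquer(int x0, int y0, int x1, int y1)         // inclusive pixel bounds
{
   if (x1 < x0 || y1 < y0)
      return;
   if (x1 - x0 < 2 || y1 - y0 < 2)                   // too small to be worth subdividing:
   {                                                 // just calculate the remaining pixels
      for (int y = y0; y <= y1; ++y)
         for (int x = x0; x <= x1; ++x)
            fillRect(x, y, x, y, mandel(x, y));
      return;
   }

   // calculate every point along the four edges and check whether they all match
   int v = mandel(x0, y0);
   bool uniform = true;
   for (int x = x0; x <= x1 && uniform; ++x)
      uniform = (mandel(x, y0) == v) && (mandel(x, y1) == v);
   for (int y = y0; y <= y1 && uniform; ++y)
      uniform = (mandel(x0, y) == v) && (mandel(x1, y) == v);

   if (uniform)
   {
      fillRect(x0, y0, x1, y1, v);                   // no islands, so the interior matches the edges
      return;
   }

   // otherwise split into four quadrants and recurse (the real thing also shrinks
   // each quadrant so the already-calculated edges aren't recalculated)
   int mx = (x0 + x1) / 2, my = (y0 + y1) / 2;
   conquer(x0, y0, mx, my);
   conquer(mx + 1, y0, x1, my);
   conquer(x0, my + 1, mx, y1);
   conquer(mx + 1, my + 1, x1, y1);
}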

Anyway, I thought it would be fun to put a variation of this technique to use on non-fractal images. So, I used color differences in the source image to determine where the edges of the rectangles go, with a selectable threshold to control color matching. I added a way to limit the minimum rectangle size, and taught it to optionally draw tastefully-colored borders around each rectangle.

And away we go!

Start with this:

monalisa_sm

Set the minimum rectangle dimension to 25 pixels, set the threshold to "100" and...

ml_sz25_t100

Not amazing.

The rectangles are too big. So, reduce the min rect size to 10:

ml_sz10_t100

Now you can start to see what's going on. Variety in rectangle sizes is what we're after - otherwise it's just pixelation. And this would be a most inefficient way to do pixelation...

Set min rect size to 4, and then to 2:

ml_sz4_t100

ml_sz2_t100

There we go.

Now, that's fun.

But there's one thing I don't love about this: the rectangles follow a perfectly even grid, regardless of min rect size. Look at the four rectangles in the top-left of this one, then look at the previous images. They are the same in all of the images except for minsize=25 - and in that case, those four rects fit perfectly into its top-left rectangle. This is completely expected, since we're just dividing the image into 1/2 then 1/4 then 1/8 then 1/16, etc., and so we're going to keep hitting those same edges over and over.

What we need is to vary the sizes of the sub-rects, to get off of that power-of-two thing. To do that, I added something called 'jitter'. Instead of dividing the rectangle evenly into four equal sub-rects, jitter divides the rectangle by 2, 3 or 4 vertically, then 2, 3 or 4 horizontally. The divisor is chosen at random.
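
As a sketch (processRect() here is a hypothetical stand-in for the rest of the divide-and-conquer machinery):

#include <cstdlib>

void splitJittered(int x0, int y0, int w, int h)
{
   int divX = 2 + std::rand() % 3;                   // split 2, 3 or 4 ways horizontally
   int divY = 2 + std::rand() % 3;                   // and 2, 3 or 4 ways vertically
   for (int j = 0; j < divY; ++j)
      for (int i = 0; i < divX; ++i)
         processRect(x0 + (i * w) / divX,                           // each sub-rect is roughly
                     y0 + (j * h) / divY,                           // 1/divX by 1/divY of the
                     ((i + 1) * w) / divX - (i * w) / divX,         // parent, instead of always
                     ((j + 1) * h) / divY - (j * h) / divY);        // exactly one quarter
}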

That gives:

ml_sz2_t100_j

And because the jitter is random, the output image will be different every time.

Other variations:

No outlines:
ml_sz2_t100_no
Looks like hyper-aggressive JPEG compression.

Or, don't fill the rectangles, just draw the outlines:

ml_sz2_t100_je
I really like this effect. And this one shows another variation: instead of randomizing the sub-rect proportions, it just randomly nudges the edges of the sub-rects by a pixel or two. This maintains the even grid, but adds just enough noise to break the monotony if you look closely.

Or, render the same image twenty times with jitter on, and put the results into a GIF!

bart

Far out, man!

Faster Dot Product, Kill, Kill!

Trigger warning: Computer junk ahead!

"Dot product" is the name of a common mathematical pattern. If you have two lists of numbers of equal length (called 'vectors') A and B, and you multiply each pair of numbers Ai * Bi, then add up the results of all the multiplications, you get the dot product:

A: 1, 2, 3, 4
B: 10, 20, 30, 40
A⋅B = 1x10 + 2x20 + 3x30 + 4x40 = 10 + 40 + 90 + 160 = 300

It shows up everywhere. And I see variations of it all the time in digital image processing, especially in 4-element situations, as with that little A & B example.

For example, deep in the heart of my bi-cubic interpolation code (a way to do image scaling), there is this bit of code, which takes 16 pixel values and two vectors of coefficients (af and bf), four values each.

float dot_flt_native(float *pix, float *af, float *bf)
{
   float t1 = af[0] * pix[ 0] + af[1] * pix[ 1] + af[2] * pix[ 2] + af[3] * pix[ 3];
   float t2 = af[0] * pix[ 4] + af[1] * pix[ 5] + af[2] * pix[ 6] + af[3] * pix[ 7];
   float t3 = af[0] * pix[ 8] + af[1] * pix[ 9] + af[2] * pix[10] + af[3] * pix[11];
   float t4 = af[0] * pix[12] + af[1] * pix[13] + af[2] * pix[14] + af[3] * pix[15];
   return bf[0] * t1 + bf[1] * t2 + bf[2] * t3 + bf[3] * t4;
}

It does five dot products. First, it does a dot product of the first four pixels and the 'af' vector, then another for the second four pixels, then two more for the third and fourth sets. Those four dp's give four intermediate results (t1, t2, t3, t4), which are then used in a fifth dot product with the 'bf' vector. And that's the result. Twenty multiplications and fifteen additions.

Note the nice pattern in the way I've laid out the code. Note the repetition: grab 4 pixels, multiply each by the corresponding item from af; Repeat; Repeat; Repeat. Different data pushed through the same simple process, repeated. Single Instruction, Multiple Data: SIMD.

Modern CPUs have the ability to do these parallel calculations in a single instruction: line your vectors up, let the CPU chomp away four (or eight or 16 or 32!) values at a time. (For numerical accuracy reasons I can't use 8- or 16-bit integers here; I have to use 32-bit floating point values in this calculation.) As always, the trick with this stuff is to get the data lined up with as little fuss as possible - you don't want the time spent on setup to overwhelm any speedup you gain from being able to do parallel calculations. But, this data is trivial to line up. So, it's all about doing the calculation.

First method: as-is. Since I already have the code above, I can use that as the benchmark for accuracy and for speed. We'll call that the native floating point version.

Second method: what I really have. The version above uses floating point data as the input, but my actual data is integers and they have to be converted to floating point data, first. That's easy enough. And so this is my real benchmark.

float dot_int_native(unsigned char *pixb, float *af, float *bf)
{
   float pix[16];
   for (int i=0;i<16;i++) pix[i] = pixb[i];
   float t1 = af[0] * pix[ 0] + af[1] * pix[ 1] + af[2] * pix[ 2] + af[3] * pix[ 3];
   float t2 = af[0] * pix[ 4] + af[1] * pix[ 5] + af[2] * pix[ 6] + af[3] * pix[ 7];
   float t3 = af[0] * pix[ 8] + af[1] * pix[ 9] + af[2] * pix[10] + af[3] * pix[11];
   float t4 = af[0] * pix[12] + af[1] * pix[13] + af[2] * pix[14] + af[3] * pix[15];
   return bf[0] * t1 + bf[1] * t2 + bf[2] * t3 + bf[3] * t4;
}

Same thing as before, but with a little loop to convert the input to floats. 20 multiplies, 15 adds, and now 16 conversions from integer to float (but how expensive can they be!?)

Now for an attempt at SIMD. It just so happens that there is a 'dot product' instruction in CPUs that support Intel's 'SSE4' tech. It does a dot product in one instruction. Line em up, pull the trigger, done. The only ugly part about this is that we have to pick out the results from the first four dot product instructions (_mm_dp_ps), then pack them into a new vector for the final dot product.

float dot_flt_sse_dpps(unsigned char *pixb, float *af, float *bf)
{
   // same as above, convert integers to floats. 
   // use that ugly __declspec thing to align the data to 
   // 16 bytes, for SSE reasons
   __declspec(align(16)) float pix[16];
   for (int i=0;i<16;i++) pix[i] = pixb[i];

   // load the coefficient vectors into SSE vectors
   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   __m128 t1 = _mm_load_ps(&pix[0]); // load four pixels
   t1 = _mm_dp_ps(t1, afx, 0xff);    // dot product against afx vector

   __m128 t2 = _mm_load_ps(&pix[4]); // load four more pixels
   t2 = _mm_dp_ps(t2, afx, 0xff);    // dot product against afx vector

   __m128 t3 = _mm_load_ps(&pix[8]); // etc
   t3 = _mm_dp_ps(t3, afx, 0xff);

   __m128 t4 = _mm_load_ps(&pix[12]);// etc
   t4 = _mm_dp_ps(t4, afx, 0xff);

   // grab the results from t2,t3,t4 and pack them in 
   // with the result from t1.
   t1.m128_f32[1] = t2.m128_f32[0];
   t1.m128_f32[2] = t3.m128_f32[0];
   t1.m128_f32[3] = t4.m128_f32[0];

   // dot product against bfx vector
   t1 = _mm_dp_ps(t1, bfx, 0xff);

   // grab the result out of t1
   return t1.m128_f32[0];
}

The _mm_dp_ps instruction takes two vectors of four values each, calculates the dot product, then puts the result into an output vector of four values - putting the dot product result into each of the four output spots. So, this:

__m128 t1 = _mm_load_ps(&pix[0]);        // load four pixels
       t1 = _mm_dp_ps(t1, afx, 0xff);    // dot product against afx vector

does this:
x = pix0⋅afx0 + pix1⋅afx1 + pix2⋅afx2 + pix3⋅afx3
t1 = x,x,x,x

Weird, but that's just what it does. So, when we go to make that last vector for the final dot product, we pull the first value from each temporary vector (t1,t2,t3,t4) and stuff it into the corresponding place in t1.

   t1.m128_f32[1] = t2.m128_f32[0]; // result from t2 into t1[1]
   t1.m128_f32[2] = t3.m128_f32[0]; // result from t3 into t1[2]
   t1.m128_f32[3] = t4.m128_f32[0]; // result from t4 into t1[3]

Then we do the final d.p. using t1 and bfx and get our result!

All the additions and multiplications? Gone! Vanished into the mm_dp_ps instructions.

And how does it perform?

So far we have three variations: float input native, integer input native and integer input SIMD. Since this is not a complex calculation, we need to run it a zillion times to see how it performs. So, for 10,000,000 iterations, here are the results so far:

Float native 1.34s
Integer native 2.57s
Integer SSE 4.93s

That second row shows that the integer to float conversion takes a long time. It accounts for nearly half the time of the calculation. Ouch. Let's work on that, first.

So, here's a version using SIMD instructions to handle the integer to float conversion:

float dot_byte_sse_unpack_dot(unsigned char *p16, float *af, float *bf)
{
   __m128i pix = _mm_load_si128((__m128i*)p16);  // read all 16 8-bit pixels

   __m128i Z = _mm_setzero_si128();              // 128 bits worth of zeros. 
                                             
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);     // unpack the 8-bit pixels to
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);     // two sets of 16-bit pixels

   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);  // unpack the two sets of 16-bit
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);  // pixels to four sets of 
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);  // four 32-bit pixels.
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);

   __m128 t1 = _mm_cvtepi32_ps(p32LL);           // convert the 16 32-bit integer
   __m128 t2 = _mm_cvtepi32_ps(p32LH);           // pixels to 16 32-bit floating
   __m128 t3 = _mm_cvtepi32_ps(p32HL);           // point pixels
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);                 // do the dot product, as before

   t1 = _mm_dp_ps(t1, afx, 0xff);
   t2 = _mm_dp_ps(t2, afx, 0xff);
   t3 = _mm_dp_ps(t3, afx, 0xff);
   t4 = _mm_dp_ps(t4, afx, 0xff);

   t1.m128_f32[1] = t2.m128_f32[0];
   t1.m128_f32[2] = t3.m128_f32[0];
   t1.m128_f32[3] = t4.m128_f32[0];

   __m128 bfx = _mm_load_ps(bf);
   t1 = _mm_dp_ps(t1, bfx, 0xff);

   return t1.m128_f32[0];
}

Seems like a lot of work, but when you look at what the compiler generates for " for (int i=0;i<16;i++) pix[i] = pixb[i];", this way is significantly faster. And that does, in fact, eliminate the integer to float penalty:

Float native 1.34s
Integer SSE unpack 2.28s Use SSE to unpack int to float
Integer native 2.57s Convert int to float
Integer SSE 4.93s Convert int to float, using mm_dp_ps

This one runs faster than the native integer-to-float version. But it's an uninspiring 11% speedup. Something in there is holding me down. But what?

Looking at the assembly output, the integer unpacking SSE version (above) comes in at a svelte 50 instructions, while the integer native version is over 100 instructions long. So, it's certainly not about raw instruction count. It has to be something about the particular instructions that we're using: some instructions are faster than others, so are there any slow ones in there?

Here's a handy reference: Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. It lists the timings for all of the instructions available on all of the various modern Intel and compatible CPUs.

I happen to have an Intel Core i7-3770 in the PC I'm typing this on, which Intel has classified under the "Ivy Bridge" code name. Scrolling down to the Ivy Bridge section in that PDF and looking at some of the instructions used by my SSE code... I see that the load and unpack and int-to-float conversion instructions are all reasonably quick. But, the centerpiece of the function, _mm_dp_ps (aka 'DPPS') - the dot product instruction itself! - is very slow. And I have five of them. But that's what does all the work! Wah!!

Well, OK. That sucks. But, a dot product is just multiplication and addition. And there's certainly a way to do a parallel multiply:

pix0,pix1,pix2,pix3 x afx0,afx1,afx2,afx3 =
pix0⋅afx0, pix1⋅afx1, pix2⋅afx2, pix3⋅afx3

We can do parallel addition, too, but we don't need that. All the things we want to sum are in the same vector, not in separate vectors which is what parallel addition works on. We have to add all those items from the result of the multiplications together, for the full dot product.

We can pull each item out of the vector and add them one at a time, kind of like we're doing down at the bottom of the functions above.

float dot_byte_sse_unpack_dot_extr(unsigned char *p16, float *af, float *bf)
{
   // unpack and convert, etc..
   __m128i Z = _mm_setzero_si128();
   __m128i pix = _mm_load_si128((__m128i*)p16);
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);
   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);
   __m128 t1 = _mm_cvtepi32_ps(p32LL);
   __m128 t2 = _mm_cvtepi32_ps(p32LH);
   __m128 t3 = _mm_cvtepi32_ps(p32HL);
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   // mul input chunks times afx
   t1 = _mm_mul_ps(t1, afx); 
   t2 = _mm_mul_ps(t2, afx);
   t3 = _mm_mul_ps(t3, afx);
   t4 = _mm_mul_ps(t4, afx);

   // t1[0] gets the sum of t1[0],t1[1],t1[2],t1[3]
   // dot product one, done!
   t1.m128_f32[0] = 
       t1.m128_f32[0] + 
       t1.m128_f32[1] + 
       t1.m128_f32[2] + 
       t1.m128_f32[3];

   // t1[1] gets the sum of t2[0],t2[1],t2[2],t2[3]
   t1.m128_f32[1] = t2.m128_f32[0] + t2.m128_f32[1] + t2.m128_f32[2] + t2.m128_f32[3];
   t1.m128_f32[2] = t3.m128_f32[0] + t3.m128_f32[1] + t3.m128_f32[2] + t3.m128_f32[3];
   t1.m128_f32[3] = t4.m128_f32[0] + t4.m128_f32[1] + t4.m128_f32[2] + t4.m128_f32[3];

   t1 = _mm_mul_ps(t1, bfx);
   t1.m128_f32[0] = t1.m128_f32[0] + t1.m128_f32[1] + t1.m128_f32[2] + t1.m128_f32[3];

   return t1.m128_f32[0];
}

That's not great. We've reduced our multiplications to five parallel _mm_mul_ps instructions, but we've pushed the additions back up to 15 scalar adds.

Float native 1.34s
Integer SSE unpack 2.28s Use SSE to unpack int to float, using mm_dp_ps
Integer native 2.57s Convert int to float
Integer SSE unpack, mul 2.75s Unpack with SSE, using mul and 15 additions
Integer SSE 4.93s Convert int to float, using mm_dp_ps

And we suffer for it.

There's an instruction that can help: it's called _mm_hadd_ps (HADDPS - horizontal add). It adds adjacent pairs in a vector.

A0,A1,A2,A3 HADD B0,B1,B2,B3 =
A0+A1, A2+A3, B0+B1, B2+B3

It's not exactly what we want - we want everything summed in one shot - but it can be made to work.

float dot_byte_sse_unpack_dot_hadd(unsigned char *p16, float *af, float *bf)
{
   // unpack, convert, etc.
   __m128i Z = _mm_setzero_si128();
   __m128i pix = _mm_load_si128((__m128i*)p16);
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);
   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);
   __m128 t1 = _mm_cvtepi32_ps(p32LL);
   __m128 t2 = _mm_cvtepi32_ps(p32LH);
   __m128 t3 = _mm_cvtepi32_ps(p32HL);
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   // multiply
   t1 = _mm_mul_ps(t1, afx); 
   t2 = _mm_mul_ps(t2, afx);
   t3 = _mm_mul_ps(t3, afx);
   t4 = _mm_mul_ps(t4, afx);

   // these will fill t1 with the results
   // from the four dot products. it's an ugly
   // series of additions, but it works.
   t1 = _mm_hadd_ps(t1, t2);
   t3 = _mm_hadd_ps(t3, t4);
   t1 = _mm_hadd_ps(t1, t3);

   // final dot product
   t1 = _mm_mul_ps(t1, bfx);
   t1 = _mm_hadd_ps(t1, t1);
   t1 = _mm_hadd_ps(t1, t1);

   return t1.m128_f32[0];
}

Integer SSE unpack, mul, HADD 1.08s Unpack with SSE, using mul and HADD
Float native 1.34s
Integer SSE unpack 2.28s Use SSE to unpack int to float, using mm_dp_ps
Integer native 2.57s Convert int to float
Integer SSE unpack, mul 2.75s Unpack with SSE, using mul and 15 additions
Integer SSE 4.93s Convert int to float, using mm_dp_ps

Now that's what I'm talking about! More than twice as fast as my real benchmark, and even faster than the native float input version! Ultimate victory!

Maybe they'll speed up the _mm_dp_ps instruction in future CPUs. But until then, this does the job pretty well.

Intrinsically Unpredictable++

Most of this is now moot.

The problem was with Visual Studio (2008). Even though I had the Optimization options set correctly, those settings weren't actually being used to build. I looked in the .vcproj file, and sure enough, there was no 'Optimization="2"' line in the Release configuration for the VCCLCompilerTool, even though the UI said optimization was set to 'O2'. Closing VS, manually adding that line to the configuration in the .vcproj, and then reloading and rebuilding generated code that looked like it should. This is a Visual Studio bug, not a compiler bug.

The intrinsics version runs just as fast as my hand-written assembly.

Much thanks to all who questioned my earlier results.

Update: the compiler now produces code that crashes really fast, in x64 release mode. And MS knows about it. And they aren't going to fix it. There's a workaround for this particular compiler bug, but not for the one I now see with _mm_extract_epi8. Sigh.

Experiment: over.


Intrinsically Unpredictable

I found this article on CodeProject that shows a way to do "alpha blending" of one image onto another using Intel's sexy SIMD instructions. Whassat? SIMD = Single Instruction Multiple Data. Instead of adding sixteen numbers to sixteen other numbers, one at a time, you can add sixteen numbers to sixteen other numbers all at once. One instruction, "add", performed on multiple values in parallel. Instant sixteen-fold speedup!

The drawback is that these SIMD instructions (Intel's MMX, SSE, SSE2, etc.) are fairly limited in utility, kind of arcane, and sometimes come with some pretty harsh restrictions. So, you end up having to rework your process to fit what the SIMD instructions allow. The potential benefits are great (16x! for every bit of arithmetic!), but you generally sacrifice a ton of speed in order to get all your data lined up just right so you can do those parallel adds. And as always, things that are simple in C/C++ become challenging in Assembly, and reworking your process to fit what the SIMD instructions demand on top of that can become an excruciating nightmare.

But excruciating is my first name. Well, not really, but it is actually kind of close, in a theological and etymological way.

So, this CodeProject article shows how to blend one image onto another with variable opacity. It's not a really complicated process: for every pixel, Pixelnew = (Pixel1 * opacity) + (Pixel2 * (1.0 - opacity)). The trick is to make it fast.
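
In scalar form, with a hypothetical Pixel struct (the article's SIMD version does the same arithmetic, just several pixels per instruction), that's:

struct Pixel { unsigned char r, g, b; };

Pixel blend(Pixel p1, Pixel p2, float opacity)
{
   Pixel out;
   out.r = (unsigned char)(p1.r * opacity + p2.r * (1.0f - opacity) + 0.5f);   // +0.5 to round
   out.g = (unsigned char)(p1.g * opacity + p2.g * (1.0f - opacity) + 0.5f);
   out.b = (unsigned char)(p1.b * opacity + p2.b * (1.0f - opacity) + 0.5f);
   return out;
}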

I have some C++ code that burns through it pretty quickly and I was satisfied with it. But, I tried this SIMD code from the article, and had to admit defeat. It was nearly four times as fast as my beautiful C++. So, I reworked it a bit and replaced my C++ code with the result.

That got me to thinking.

What the article calls "alpha blending" is not what I call alpha blending. The code just does a blend of two images with a global opacity value - every pixel uses the same opacity. In my experience, "alpha blending" is what happens when each pixel has its own opacity value, aka "per-pixel alpha." Where a standard pixel has a Red, Green, and Blue value, a pixel with per-pixel alpha has a Red, Green, Blue and Alpha value. That's much more flexible (at the cost of 33% more memory per image).

What I wanted was a way to blend an RGBA (Red, Green, Blue and Alpha) image onto an RGB (no alpha) image, using the SIMD instructions. Every pixel in the 'top' image has its own opacity value, but the bottom image doesn't.

Problem 1.
RGBA and RGB pixels are different sizes. RGBA pixels are four bytes wide and RGB pixels are three. So I have to shuffle things around to get the Rs, Gs and Bs all lined up. That's going to eat up time.

Problem 2.
The SSE instructions work on chunks of data 128 bits wide. You read 128 bits, do your stuff on those 128 bits, then write 128 bits out. That means we can read 4 RGBA pixels at a time (8 bits per byte, 1 byte each for R,G,B,A; 8 x 4 x 4 = 128). But 128 bits would hold 5.3 RGB pixels. Since we need to have an RGBA for every RGB, we're going to have to discard that last 1.3 RGB pixels at each pass. Not a problem, but the inefficiency gnaws on me.

Problem 3.
Due to the nature of computer math, you can't do that alpha blending equation using just 8 bits per R,G or B. The intermediate values will exceed what 8 bits can hold. So, you have to expand to 16-bit data. That means we can process just 2 pixels at a time, not 4. All's not lost, since there's a lot of initial work that can be done before we do that expansion into 16-bit data. So, we'll just split those four initial pixels of 3x8 into two 3x16 and do one after the other, then recombine at the end, saving all the initial work at the cost of a little more work at the end.
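
A scalar illustration of why (this is illustrative only - the real work happens in SSE registers, where every lane has a fixed width):

unsigned char blendChannel(unsigned char top, unsigned char bottom, unsigned char alpha)
{
   // top * alpha alone can reach 255 * 255 = 65025, which doesn't fit in 8 bits,
   // so the intermediate math has to be done with at least 16 bits per channel
   unsigned int t = top * alpha + bottom * (255 - alpha);
   return (unsigned char)(t / 255);
}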

We're still ahead of C++, which has to do each pixel as three separate operations (and SIMD will do them 2 at a time), but you can see we're running out of potential, quick.

Problem 4.
After we process our two pixels, we have to store the result. But again, 128 bits = 5.3 RGB pixels, and we're only working on 4. So we have to keep those extra 1.3 pixels around, stick them on the end of the four pixels we actually processed and then write the whole thing out - otherwise, we'd end up writing garbage over those 1.3 pixels on the end (which are the first 1.3 pixels of the next round!)

Problem 5.
Given that we can't read or write past the end of the image, and we are always processing 4 RGBA/RGB pixels at a time, we can't do anything if we have fewer than four pixels in the image. And if we end up with 1, 2 or 3 pixels at the end of a row, this code can't do anything with them, either. So, we have to limit ourselves to images whose widths are a multiple of 4.

Did I solve these problems? Tune in tomorrow-ish!

(pt 2)

ZAP [eax], [mm1]

Assembly language...

(scared yet?)

Writing it is like riding a bike without training wheels. No, more like riding a bike without training wheels or brakes, and it's a fixie. And you're riding down Mt Doom with a dwarf standing on the back pegs.

So, a few years ago I did a major revision of one of my software packages, and added support for 64-bit processors as part of the work. Along with a lot of nit-picky work, there were a few architectural changes, and one of them was reworking some in-line assembly I was using for performance reasons. The code is used to blend two images: pixel_top + pixel_bottom = pixel_new. It's the kind of thing that absolutely flies when using MMX because you can do 8 pixels at a time. Anyway, in x64, the MS C++ compiler doesn't allow in-line ASM as it does in Win32. Not exactly sure why. Doesn't matter. No big deal. I just moved each chunk of ASM to an external .ASM file, put a "PROC" and "END" around each, added parameter lists, etc.. Only slightly more complex than copy/paste.

And that worked fine for years. Until yesterday.

Yesterday I learned that one of those functions had started crashing in certain situations. And I stared at the code for hours and didn't see anything wrong with the logic - the code looked like it should do exactly what it was supposed to do. And it was! But it wasn't. For some reason, in release mode, with full optimizations turned on and with a certain data size, a pointer was going crazy and running off into the woods, where it encountered an angry wizard who killed the process dead.

Then I looked under a rock I found on the internet and learned something!

When you write in-line assembly with Microsoft's C/C++ compiler, you don't have to worry much about the contents of most of the common CPU registers when entering or leaving your _asm{...} blocks. The registers are almost like local variables. You can put anything you want in them and the compiler doesn't care and it will clean up after you. But when you move your ASM to an external (non-inline) function and use the MASM compiler (aka the 'assembler'), you are expected to behave as a responsible adult. You are supposed to give the assembler a list of all the registers you are going to use. You are supposed to do that so that the compiler can then coordinate its own register usage with yours - so that you don't interfere with each other. You're supposed to tell it; the ASM compiler doesn't actually care one way or another - want to ride your bike without a seat? ASM doesn't give a crap. Go ahead. Checking that you've done this would be easy, but the assembler doesn't bother. And if you don't tell the compiler that you're using a given register - EAX for example - it will assume EAX is available for other uses and that it doesn't have to worry about you meddling with EAX. But then if you do meddle with EAX in your function, and the compiler was using EAX to hold a pointer, and your program comes out of your function but now that pointer has gone crazy and run off into the woods where the angry wizard lives... well, your program is going to get killed.

Or maybe it won't! Maybe three years of new revisions will go by and never once will the compiler decide it cares about EAX when calling your function. It will be using EBX or EDX or whatever to hold that one pointer. But then one day (yesterday, in my case) it will decide to use EAX... and you won't know it until you hear the ZAP! of the angry wizard's staff flicking your program out of existence.

And that's why one should avoid assembly whenever possible.

The Loop

How to piss off a customer.

int i;
for (i=0; i < theList.size(); i++);
{
   theList[i].something = 0;
}

= crash.

Since I couldn't run the code (couldn't come up with data that would ever cause that code to run), it took me a day to figure out what the problem was. Any of you C/C++/etc peeps see the problem?