Category Archives: Programming

Throw Them All Away

Schneier on Spectre and Meltdown:

The security of pretty much every computer on the planet has just gotten a lot worse, and the only real solution — which of course is not a solution — is to throw them all away and buy new ones.

“Throw it away and buy a new one” is ridiculous security advice, but it’s what US-CERT recommends. It is also unworkable. The problem is that there isn’t anything to buy that isn’t vulnerable. Pretty much every major processor made in the past 20 years is vulnerable to some flavor of these vulnerabilities. Patching against Meltdown can degrade performance by almost a third. And there’s no patch for Spectre; the microprocessors have to be redesigned to prevent the attack, and that will take years. (Here’s a running list of who’s patched what.)

This is bad, but expect it more and more. Several trends are converging in a way that makes our current system of patching security vulnerabilities harder to implement.

Spectre and Meltdown are pretty catastrophic vulnerabilities, but they only affect the confidentiality of data. Now that they — and the research into the Intel ME vulnerability — have shown researchers where to look, more is coming — and what they’ll find will be worse than either Spectre or Meltdown. There will be vulnerabilities that will allow attackers to manipulate or delete data across processes, potentially fatal in the computers controlling our cars or implanted medical devices. These will be similarly impossible to fix, and the only strategy will be to throw our devices away and buy new ones.

Schneier is rarely optimistic when it comes to security issues. But I’d never bet against him being right, either.

Generating fantasy maps

Here is an interesting article about the algorithms used to generate ‘fantasy’ maps like this one:

This is far beyond the programs I wrote to generate terrain on my Amiga 500, way back in ’87. I was content with square grids, no ‘erosion’, no labels, etc. But I did draw them in color, with nice isometric 3D projection!

VBScript!

I’ve been fighting with a little script I wrote to massage some text for me. It’s a very simple task, so I thought VBS would be adequate (read each line of text; if there’s a special bit of text in the line, capitalize everything to its right – no big deal). I know I can use C# and JavaScript if I use Windows PowerShell, but I didn’t think VBS would have any trouble with it. And it didn’t. But I learned something ridiculous about VBS in the process.

Here is a very simple VBScript program.

Function timesTwo(ByRef inParam)
    inParam = inParam * 2
    timesTwo = 1
End Function

i = 10

WScript.Echo "i = " & i
timesTwo(i)
WScript.Echo "i = " & i
z = timesTwo(i)
WScript.Echo "i = " & i

Here I have a simple function called ‘timesTwo’ that multiplies its input parameter by two and returns a value of 1. Since I declared the input of the function “ByRef”, the input is passed “by reference”, which basically means the function can change the value of the parameter. If I pass it a variable with a value of 5, the variable should have a value of 10 when the function ends. Magic.

If the parameter was declared “ByVal” (“by value”), the function would receive a copy of the input parameter, which it could modify as much as it wanted, but changing the copy wouldn’t change the value of the original parameter. But I didn’t declare it that way…

The code below that function tests the function. It…

  1. initializes a variable called ‘i’ to a value of 10 (i = 10)
  2. prints the value of i (WScript.Echo "i = " & i )
  3. calls the function, ignoring its return value. (timesTwo(i))
  4. prints the value of i
  5. calls the function again, putting its return value in a variable ‘z’. (z = timesTwo(i))
  6. prints the value of i

So, what do you think we should get?

i = 10
i = 20
i = 40

That seems reasonable. Start with i = 10, print i, call the function to double i, print i, call the function to double i again, print i.

But if that worked, this would be a truly boring blog post!

No, what we actually get is:

i = 10
i = 10
i = 20

Why? It turns out to be a VBScript parsing quirk: in a standalone call like timesTwo(i), the parentheses aren’t an argument list – they’re an expression wrapper around i – and VBScript always passes a parenthesized argument by value. Use the return value, and the parentheses become a real argument list, so i goes by reference again. The effect: if you ignore the return value of a function (and call it with parentheses), the parameter you pass is actually passed by value (ByVal), not by reference.

Pass by value:

timesTwo(i)

Pass by reference:

z = timesTwo(i)

What. The. Fuck.

Why would that difference in behavior ever be useful, let alone expected?

I should’ve used JavaScript.
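For comparison, here’s a quick sketch in C++ (my example, not from the post), where the by-reference/by-value distinction lives in the function signature and can’t silently change with how the call site looks:

```cpp
#include <cassert>

// C++ encodes the ByRef/ByVal distinction in the signature itself:
// a reference parameter lets the function change the caller's variable;
// a value parameter gets a copy.
int timesTwoByRef(int &n) { n *= 2; return 1; }  // like VBScript ByRef
int timesTwoByVal(int n)  { n *= 2; return 1; }  // like VBScript ByVal
```

Here, `timesTwoByRef(i)` doubles `i` whether or not you use the return value.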

Dynamic Camouflage

There’s a specific technique for finding the optimal solution to complex problems (typically mathematical or computer programming problems) by breaking them into simpler sub-problems – each of which only needs to be solved once, optimally – and then combining the results of those sub-problems to arrive at the final solution. It’s called “Dynamic Programming”. Sounds sexy! But when you get into it, you notice that there’s really nothing dynamic about it. When you start down the DP path for a problem, you don’t find yourself in an exciting world of fast and forceful – dynamic! – things to think about; rather, it’s more about carefully and deliberately rethinking the problem at hand until you find a way to flip it on its head, or turn it inside out, or maybe even solve it backwards.

And the name doesn’t actually refer to computer programming. It’s about mathematical programming: finding the best possible solution from the set of all possible solutions, given some definition of ‘best’. That DP is often used in computer programming today is a coincidence.
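A standard toy example (mine, not from the post) of that “solve each sub-problem once, combine the results” shape – the fewest coins that make a given amount, where each sub-amount is solved optimally exactly once:

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

// Fewest coins to make `amount`: every sub-amount 0..amount is a
// sub-problem, solved once and stored, and larger amounts combine
// those stored answers.
int minCoins(const std::vector<int> &coins, int amount)
{
    std::vector<int> best(amount + 1, INT_MAX);
    best[0] = 0;                                  // zero coins make zero
    for (int a = 1; a <= amount; ++a)
        for (std::size_t c = 0; c < coins.size(); ++c)
            if (coins[c] <= a && best[a - coins[c]] != INT_MAX)
                best[a] = std::min(best[a], best[a - coins[c]] + 1);
    return best[amount];
}
```

With coins {1, 5, 12} and an amount of 15, a greedy approach grabs the 12 first and needs four coins; the table of once-solved sub-amounts finds the three-coin answer (5+5+5).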

So where does the name come from?

The man who came up with the name, Richard Bellman, explains:

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that it’s impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”

Rain girl

This is one of my favorite pics – a girl walking in the rain outside our first house. I’ve posted it previously, a couple of times.

I came up with an implementation of Aaron Hertzmann’s “painterly” non-photorealistic rendering algorithm. This recreates the source image by using simulated brush strokes of varying size, color and shape. It’s not a blurring or filtering of the source; it’s a recreation made in much the same fashion that a human painter would copy a picture: look at the source, choose a color, make a stroke which follows that color. Repeat until you’re done. As with a human painter, it starts with large brushes to get the background and large areas filled, then it refines the image with progressively smaller brushes.
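The coarse-to-fine loop can be sketched like this (a toy grayscale version of mine, not Hertzmann’s actual algorithm): wherever the canvas still differs from the source, paint a disc of the source’s local color, with a smaller brush on each pass.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy painterly sketch: grayscale image, circular "brush strokes".
struct Image {
    int w, h;
    std::vector<float> p;
    Image(int w_, int h_, float fill) : w(w_), h(h_), p(w_ * h_, fill) {}
    float &at(int x, int y) { return p[y * w + x]; }
};

// One "brush stroke": fill a disc with a single value.
void paintDisc(Image &img, int cx, int cy, int r, float v)
{
    for (int y = std::max(0, cy - r); y <= std::min(img.h - 1, cy + r); ++y)
        for (int x = std::max(0, cx - r); x <= std::min(img.w - 1, cx + r); ++x)
            if ((x - cx) * (x - cx) + (y - cy) * (y - cy) <= r * r)
                img.at(x, y) = v;
}

// One pass with a given brush radius: look at the source, and wherever
// the canvas is still too far off, make a stroke of the source's color.
void paintPass(Image &canvas, Image &src, int r, float threshold)
{
    for (int y = 0; y < src.h; y += r)          // sample on a brush-sized grid
        for (int x = 0; x < src.w; x += r)
            if (std::fabs(canvas.at(x, y) - src.at(x, y)) > threshold)
                paintDisc(canvas, x, y, r, src.at(x, y));
}

float totalError(Image &a, Image &b)
{
    float e = 0;
    for (std::size_t i = 0; i < a.p.size(); ++i)
        e += std::fabs(a.p[i] - b.p[i]);
    return e;
}
```

Run the passes big-brush-first (say radii 8, 4, 2) and the canvas converges toward the source, with the small brushes only touching the spots the big ones got wrong.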

This video is the algorithm in action, drawing the strokes (20 per frame, to save time). I told it to use short strokes.

The music is called “One Bird”. That’s me.

Here’s one of a pic of Adrian Belew:

Longer, thinner brush strokes.

And, Tricksey:

Tricksey, Painterly

Drat Of The Day

Which is the minimum of the two:

A: -3.4028235e+038
B: 1.1754944e-038

That depends on how you define minimum. If you want the smallest magnitude, it’s B. B is so close to 0 it might as well be 0 for most practical uses (0.000000000000000000000000000000000000011754944). But A is smaller if you think ‘minimum’ means ‘leftmost on the number line’. A is so far to the left of B there aren’t enough pixels on all the monitors in the world to draw that number line.

In C++ the result of std::numeric_limits<float>::min() is B: the smallest value a ‘float’ can store is B.

But the most-negative number a float can store is A. And A is std::numeric_limits<float>::max().

So, if you are trying to limit the results of a function to the range of values that a float can hold, and you do this:

double result = doSomething();
double t = max(std::numeric_limits<float>::min(), result);
t = min(std::numeric_limits<float>::max(), t);
return (float)t;

You’re going to be very surprised when you discover that you never see a value smaller than 1.1754944e-038. You’ll never even get 0.

What you really needed to write is this:

double result = doSomething();
double t = max(-std::numeric_limits<float>::max(), result);
t = min(std::numeric_limits<float>::max(), t);
return (float)t;

But, that trickery only applies to floating point types. The smallest magnitude you can store in an int is 0. But std::numeric_limits<int>::min() is -2147483648. That’s the most-negative value, not the value with least magnitude.
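For what it’s worth, C++11 added std::numeric_limits<float>::lowest(), which returns the most-negative finite value directly – and for integer types it simply equals min(). A clamp using it (my sketch) avoids the negate-max trick entirely:

```cpp
#include <limits>

// Clamp a double into the range representable by a float, using
// C++11's lowest() instead of the -max() trick.
float clampToFloatRange(double v)
{
    const double lo = std::numeric_limits<float>::lowest(); // most-negative float
    const double hi = std::numeric_limits<float>::max();    // most-positive float
    if (v < lo) return (float)lo;
    if (v > hi) return (float)hi;
    return (float)v;
}
```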

And now you know.

So, Filling…

Continuing with this… (busy week at work! whew!)

Instead of filling with boring old rectangles, what about filling with ellipses?

First problem: it’s just too much work to do the dividing and conquering with actual ellipses. There’s no good way to recursively chop a rectangular image into successively smaller ellipses. So, I decided to just stick with the rectangles right up until the point where I go to fill them in. Then, I’ll just draw an ellipse that covers what the rectangle covers. This is just an experiment. I can do what I want.

So, given that we need each ellipse to cover, at least, the same area as the rectangle would have, is there a way to determine the optimal ellipse? That is: given a rectangle, how big is its proportional enclosing ellipse? Beats me. Sounds like a tough problem. But, StackExchange knew:

A = RectWidth / √2
B = RectHeight / √2

(where A and B are the ellipse’s semi-axes)

So much simpler than I’d expected.
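A quick sanity check of that formula (my code): with semi-axes A = w/√2 and B = h/√2, the rectangle’s corner (w/2, h/2) lands exactly on the ellipse (x/A)² + (y/B)² = 1, so the ellipse covers everything the rectangle covered.

```cpp
#include <cassert>
#include <cmath>

// Verify that the rectangle's corner lies on the proportional
// enclosing ellipse with semi-axes w/sqrt(2) and h/sqrt(2).
bool cornerOnEllipse(double w, double h)
{
    double A = w / std::sqrt(2.0);
    double B = h / std::sqrt(2.0);
    double x = w / 2.0, y = h / 2.0;
    return std::fabs((x * x) / (A * A) + (y * y) / (B * B) - 1.0) < 1e-9;
}
```

The algebra telescopes nicely: (w/2)² / (w/√2)² = 1/2, and likewise for the height term, so the two halves always sum to exactly 1 regardless of the rectangle’s proportions.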

So, here we go. Everything else is the same, except that I’m drawing ellipses instead of rectangles.

ml_sz2_t100_ell

Hmm… Interesting. Looks like a lot of eggs. I don’t like that the tops and lefts of the ellipses are all visible, but the bottoms and rights aren’t. That’s a consequence of the fact that the overall drawing and processing basically goes left to right, top to bottom.

Can I change the order in which they’re drawn? Yes.

Instead of drawing ellipses as soon as they’re calculated, I put them in a list. At the end I have a list of all the ellipses that would’ve been drawn. Then, sort that list by the area of each ellipse. And then, draw the sorted ellipses – biggest first. That way, the smaller ellipses will overlay the bigger ones. And that gives:

ml_sz2_t100_ellSt

Better. There’s variety in which parts of the ellipses are obscured. And, the ends of the really big ellipses along the left and right sides are covered up.

With borders? Sure!

ml_sz2_t100_ellStO

No fill, borders only?

ml_sz2_t100_ellStNo

Sweeeeeeet.

Divide And Conquer

There’s an old-school technique for rendering the classic fractal, the Mandelbrot set, called Divide and Conquer. This technique takes advantage of the fact that all of the points in the Mandelbrot set are ‘connected’ – there are no islands of Mandelbrot-set points which aren’t connected to the main set, and there are no islands of non-Mandelbrot-set points enclosed by points within the Mandelbrot set. Think of it as a single blob of water. Given this, you can speed up rendering of any subset of the Mandelbrot set by:

  1. Divide the image into four rectangles.
  2. For each of the four rectangles:
  3. If the rectangle has a non-zero width and height:
  4. Calculate all the points along the edges.
  5. If all the points along the edges have the same calculated value, all points within the rectangle must be the same, too. No ‘islands’, remember. So, fill the rectangle with the value you calculated for the edges.
  6. If the points along the edge are not the same, divide that rectangle into four rectangles and, for each sub-rectangle, go to step 3. (Keep in mind that since you’ve already calculated the points on the edges, you can reduce the size of the rectangle by 1 on each edge).

So, it’s a recursive space-filling technique. It divides the work area into smaller and smaller rectangles until it finds something it can fill or it can no longer sub-divide. In this way it gobbles up large identically-valued sections of the Set very quickly, so it spends most of its time working on details. In general, that’s a performance gain. And, even better, this technique is far more fun to watch than just seeing pixels calculated one at a time, one row after another. It’s kind of mesmerizing. Or, it was back in the day. These days, computers are fast enough that you can’t see it in action.

Anyway, I thought it would be fun to put a variation of this technique to use on non-fractal images. So, I used color differences in the source image to determine where the edges of the rectangles go, with a selectable threshold to control color matching. I added a way to limit the minimum rectangle size, and taught it to optionally draw tastefully-colored borders around each rectangle.

And away we go!

Start with this:

monalisa_sm

Set the minimum rectangle dimension to 25 pixels, set the threshold to “100” and…

ml_sz25_t100

Not amazing.

The rectangles are too big. So, reduce the min rect size to 10:

ml_sz10_t100

Now you can start to see what’s going on. Variety in rectangle sizes is what we’re after – otherwise it’s just pixelation. And this would be a most inefficient way to do pixelation…

Set min rect size to 4, and then to 2:

ml_sz4_t100

ml_sz2_t100

There we go.

Now, that’s fun.

But there’s one thing I don’t love about this: the rectangles follow a perfectly even grid, regardless of min rect size. Look at the four rectangles in the top-left of this one, then look at the previous images. They are the same in all of the images except for minsize=25 – and in that case, those four rects fit perfectly into its top-left rectangle. This is completely expected, since we’re just dividing the image into 1/2 then 1/4 then 1/8 then 1/16, etc., and so we’re going to keep hitting those same edges over and over.

What we need is to vary the sizes of the sub-rects, to get off of that power-of-two thing. To do that, I added something called ‘jitter’. Instead of dividing the rectangle evenly into four equal sub-rects, jitter divides the rectangle by 2, 3 or 4 vertically, then 2, 3 or 4 horizontally. The divisor is chosen at random.
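The split itself might look something like this (a sketch; the function name is mine):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Jittered split: instead of always halving, divide a span into
// 2, 3 or 4 nearly-even pieces, with the divisor chosen at random.
std::vector<int> jitterSplit(int size)
{
    int n = 2 + std::rand() % 3;          // random divisor: 2, 3 or 4
    std::vector<int> parts;
    int done = 0;
    for (int i = 0; i < n; ++i) {
        int end = (size * (i + 1)) / n;   // proportional cut points
        parts.push_back(end - done);
        done = end;
    }
    return parts;
}
```

Apply it once vertically and once horizontally per rectangle and the recursion stops landing on the same power-of-two grid lines every time.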

That gives:

ml_sz2_t100_j

And because the jitter is random, the output image will be different every time.

Other variations:

No outlines:
ml_sz2_t100_no
Looks like hyper-aggressive JPEG compression.

Or, don’t fill the rectangles, just draw the outlines:

ml_sz2_t100_je
I really like this effect. And this one shows another variation: instead of randomizing the sub-rect proportions, it just randomly nudges the edges of the sub-rects by a pixel or two. This maintains the even grid, but adds just enough noise to break the monotony if you look closely.

Or, render the same image twenty times with jitter on, and put the results into a GIF!

bart

Far out, man!

Faster Dot Product, Kill, Kill!

Trigger warning: Computer junk ahead!

“Dot product” is the name of a common mathematical pattern. If you have two lists of numbers of equal length (called ‘vectors’) A and B, and you multiply each pair of numbers Ai * Bi, then add up the results of all the multiplications, you get the dot product:

A: 1, 2, 3, 4
B: 10, 20, 30, 40
A⋅B = 1×10 + 2×20 + 3×30 + 4×40 = 10 + 40 + 90 + 160 = 300

It shows up everywhere. And I see variations of it all the time in digital image processing, especially in 4-element situations, as with that little A & B example.
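Written as plain code, the pattern is just a loop; the snippets below are the unrolled, length-4 special case of this:

```cpp
#include <cstddef>

// Dot product of two n-element vectors: multiply pairwise, sum the results.
float dot(const float *a, const float *b, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```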

For example, deep in the heart of my bi-cubic interpolation code (a way to do image scaling), there is this bit of code, which takes 16 pixel values and two vectors of coefficients (af and bf), four values each.

float dot_flt_native(float *pix, float *af, float *bf)
{
   float t1 = af[0] * pix[ 0] + af[1] * pix[ 1] + af[2] * pix[ 2] + af[3] * pix[ 3];
   float t2 = af[0] * pix[ 4] + af[1] * pix[ 5] + af[2] * pix[ 6] + af[3] * pix[ 7];
   float t3 = af[0] * pix[ 8] + af[1] * pix[ 9] + af[2] * pix[10] + af[3] * pix[11];
   float t4 = af[0] * pix[12] + af[1] * pix[13] + af[2] * pix[14] + af[3] * pix[15];
   return bf[0] * t1 + bf[1] * t2 + bf[2] * t3 + bf[3] * t4;
}

It does five dot products. First, it does a dot product of the first four pixels and the ‘af’ vector, then another for the second four, and two more for the third and fourth sets. Those four dot products give four intermediate results (t1, t2, t3, t4) which are then used in a final dot product with the ‘bf’ vector. And that’s the result. Twenty multiplications and fifteen additions.

Note the nice pattern in the way I’ve laid out the code. Note the repetition: grab 4 pixels, multiply each by the corresponding item from af; repeat; repeat; repeat. Different data pushed through the same simple process, repeated. Single Instruction, Multiple Data: SIMD.

Modern CPUs have the ability to do these parallel calculations in a single instruction: line your vectors up and let the CPU chomp away four (or eight, or 16, or 32!) values at a time. (For numerical accuracy reasons I can’t use 8 or 16-bit integers; I have to use 32-bit floating point values in this calculation.) As always, the trick with this stuff is to get the data lined up with as little fuss as possible – you don’t want the time spent on setup to overwhelm any speedup you gain from being able to do parallel calculations. But, this data is trivial to line up. So, it’s all about doing the calculation.

First method: as-is. Since I already have the code above, I can use that as the benchmark for accuracy and for speed. We’ll call that the native floating point version.

Second method: what I really have. The version above uses floating point data as the input, but my actual data is integers and they have to be converted to floating point data, first. That’s easy enough. And so this is my real benchmark.

float dot_int_native(unsigned char *pixb, float *af, float *bf)
{
   float pix[16];
   for (int i = 0; i < 16; i++) pix[i] = pixb[i];
   float t1 = af[0] * pix[ 0] + af[1] * pix[ 1] + af[2] * pix[ 2] + af[3] * pix[ 3];
   float t2 = af[0] * pix[ 4] + af[1] * pix[ 5] + af[2] * pix[ 6] + af[3] * pix[ 7];
   float t3 = af[0] * pix[ 8] + af[1] * pix[ 9] + af[2] * pix[10] + af[3] * pix[11];
   float t4 = af[0] * pix[12] + af[1] * pix[13] + af[2] * pix[14] + af[3] * pix[15];
   return bf[0] * t1 + bf[1] * t2 + bf[2] * t3 + bf[3] * t4;
}

Same thing as before, but with a little loop to convert the input to floats. 20 multiplies, 15 adds, and now 16 conversions from integer to float (but how expensive can they be!?)

Now for an attempt at SIMD. It just so happens that there is a 'dot product' instruction in CPUs that support Intel's 'SSE4.1' tech. It does a dot product in one instruction. Line 'em up, pull the trigger, done. The only ugly part about this is that we have to pick out the results from the first four dot product instructions (_mm_dp_ps), then pack them into a new vector for the final dot product.

float dot_flt_sse_dpps(unsigned char *pixb, float *af, float *bf)
{
   // same as above, convert integers to floats. 
   // use that ugly __declspec thing to align the data to 
   // 16 bytes, for SSE reasons
   __declspec(align(16)) float pix[16];
   for (int i=0;i<16;i++) pix[i] = pixb[i];

   // load the coefficient vectors into SSE vectors
   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   __m128 t1 = _mm_load_ps(&pix[0]); // load four pixels
   t1 = _mm_dp_ps(t1, afx, 0xff);    // dot product against afx vector

   __m128 t2 = _mm_load_ps(&pix[4]); // load four more pixels
   t2 = _mm_dp_ps(t2, afx, 0xff);    // dot product against afx vector

   __m128 t3 = _mm_load_ps(&pix[8]); // etc
   t3 = _mm_dp_ps(t3, afx, 0xff);

   __m128 t4 = _mm_load_ps(&pix[12]);// etc
   t4 = _mm_dp_ps(t4, afx, 0xff);

   // grab the results from t2,t3,t4 and pack them in 
   // with the result from t1.
   t1.m128_f32[1] = t2.m128_f32[0];
   t1.m128_f32[2] = t3.m128_f32[0];
   t1.m128_f32[3] = t4.m128_f32[0];

   // dot product against bfx vector
   t1 = _mm_dp_ps(t1, bfx, 0xff);

   // grab the result out of t1
   return t1.m128_f32[0];
}

The _mm_dp_ps intrinsic takes two vectors of four values each, calculates the dot product, then puts the result into an output vector of four values - putting the dot product result into each of the four output spots. So, this:

__m128 t1 = _mm_load_ps(&pix[0]);        // load four pixels
       t1 = _mm_dp_ps(t1, afx, 0xff);    // dot product against afx vector

does this:
x = pix0⋅afx0 + pix1⋅afx1 + pix2⋅afx2 + pix3⋅afx3
t1 = x,x,x,x

Weird, but that's just what it does. So, when we go to make that last vector for the final dot product, we pull the first value from each temporary vector (t1,t2,t3,t4) and stuff it into the corresponding place in t1.

   t1.m128_f32[1] = t2.m128_f32[0]; // result from t2 into t1[1]
   t1.m128_f32[2] = t3.m128_f32[0]; // result from t3 into t1[2]
   t1.m128_f32[3] = t4.m128_f32[0]; // result from t4 into t1[3]

Then we do the final d.p. using t1 and bfx and get our result!

All the additions and multiplications? Gone! Vanished into the _mm_dp_ps instructions.

And how does it perform?

So far we have three variations: float input native, integer input native and integer input SIMD. Since this is not a complex calculation, we need to run it a zillion times to see how it performs. So, for 10,000,000 iterations, here are the results so far:

Float native 1.34s
Integer native 2.57s
Integer SSE 4.93s

That second row shows that the integer to float conversion takes a long time. It accounts for nearly half the time of the calculation. Ouch. Let's work on that first.

So, here's a version using SIMD instructions to handle the integer to float conversion:

float dot_byte_sse_unpack_dot(unsigned char *p16, float *af, float *bf)
{
   __m128i pix = _mm_load_si128((__m128i*)p16);  // read all 16 8-bit pixels

   __m128i Z = _mm_setzero_si128();              // 128 bits worth of zeros. 
                                             
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);     // unpack the 8-bit pixels to
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);     // two sets of 16-bit pixels

   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);  // unpack the two sets of 16-bit
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);  // pixels to four sets of 
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);  // four 32-bit pixels.
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);

   __m128 t1 = _mm_cvtepi32_ps(p32LL);           // convert the 16 32-bit integer
   __m128 t2 = _mm_cvtepi32_ps(p32LH);           // pixels to 16 32-bit floating
   __m128 t3 = _mm_cvtepi32_ps(p32HL);           // point pixels
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);                 // do the dot product, as before

   t1 = _mm_dp_ps(t1, afx, 0xff);
   t2 = _mm_dp_ps(t2, afx, 0xff);
   t3 = _mm_dp_ps(t3, afx, 0xff);
   t4 = _mm_dp_ps(t4, afx, 0xff);

   t1.m128_f32[1] = t2.m128_f32[0];
   t1.m128_f32[2] = t3.m128_f32[0];
   t1.m128_f32[3] = t4.m128_f32[0];

   __m128 bfx = _mm_load_ps(bf);
   t1 = _mm_dp_ps(t1, bfx, 0xff);

   return t1.m128_f32[0];
}

Seems like a lot of work, but when you look at what the compiler generates for " for (int i=0;i<16;i++) pix[i] = pixb[i];", this way is significantly faster. And that does, in fact, eliminate the integer to float penalty:

Float native         1.34s
Integer SSE unpack   2.28s   Use SSE to unpack int to float
Integer native       2.57s   Convert int to float
Integer SSE          4.93s   Convert int to float, using _mm_dp_ps

This one runs faster than the native integer-to-float version. But it's an uninspiring 11% speedup. Something in there is holding me down. But what?

Looking at the assembly output, the integer unpacking SSE version (above) comes in at a svelte 50 instructions while the integer native version is over 100 instructions long. So, it's certainly not about raw instruction count. It has to be something about the particular instructions that we're using: some instructions are faster than others – are there any slow ones in there?

Here's a handy reference: Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. It lists the timings for all of the instructions available on all of the various modern Intel and compatible CPUs.

I happen to have an Intel Core i7-3770 in the PC I'm typing this on, which Intel has classified under the "Ivy Bridge" code name. Scrolling down to the Ivy Bridge section in that PDF and looking at some of the instructions used by my SSE code... I see that the load and unpack and int-to-float conversion instructions are all reasonably quick. But, the centerpiece of the function, _mm_dp_ps (aka 'DPPS') - the dot product instruction itself! - is very slow. And I have five of them. But that's what does all the work! Wah!!

Well, OK. That sucks. But, a dot product is just multiplication and addition. And there's certainly a way to do a parallel multiply:

pix0,pix1,pix2,pix3 x afx0,afx1,afx2,afx3 =
pix0⋅afx0, pix1⋅afx1, pix2⋅afx2, pix3⋅afx3

We can do parallel addition, too, but that's not what we need. All the things we want to sum are in the same vector, not in separate vectors, which is what parallel addition works on. We have to add all the items in the result of the multiplications together to get the full dot product.

We can pull each item out of the vector and add them one at a time, kind of like we're doing down at the bottom of the functions above.

float dot_byte_sse_unpack_dot_extr(unsigned char *p16, float *af, float *bf)
{
   // unpack and convert, etc..
   __m128i Z = _mm_setzero_si128();
   __m128i pix = _mm_load_si128((__m128i*)p16);
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);
   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);
   __m128 t1 = _mm_cvtepi32_ps(p32LL);
   __m128 t2 = _mm_cvtepi32_ps(p32LH);
   __m128 t3 = _mm_cvtepi32_ps(p32HL);
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   // mul input chunks times afx
   t1 = _mm_mul_ps(t1, afx); 
   t2 = _mm_mul_ps(t2, afx);
   t3 = _mm_mul_ps(t3, afx);
   t4 = _mm_mul_ps(t4, afx);

   // t1[0] gets the sum of t1[0],t1[1],t1[2],t1[3]
   // dot product one, done!
   t1.m128_f32[0] = 
       t1.m128_f32[0] + 
       t1.m128_f32[1] + 
       t1.m128_f32[2] + 
       t1.m128_f32[3];

   // t1[1] gets the sum of t2[0],t2[1],t2[2],t2[3]
   t1.m128_f32[1] = t2.m128_f32[0] + t2.m128_f32[1] + t2.m128_f32[2] + t2.m128_f32[3];
   t1.m128_f32[2] = t3.m128_f32[0] + t3.m128_f32[1] + t3.m128_f32[2] + t3.m128_f32[3];
   t1.m128_f32[3] = t4.m128_f32[0] + t4.m128_f32[1] + t4.m128_f32[2] + t4.m128_f32[3];

   t1 = _mm_mul_ps(t1, bfx);
   t1.m128_f32[0] = t1.m128_f32[0] + t1.m128_f32[1] + t1.m128_f32[2] + t1.m128_f32[3];

   return t1.m128_f32[0];
}

That's not great. We've reduced our multiplications to 4 (_mm_mul_ps), but we've increased the additions back up to 15.

Float native              1.34s
Integer SSE unpack        2.28s   Unpack with SSE, using _mm_dp_ps
Integer native            2.57s   Convert int to float
Integer SSE unpack, mul   2.75s   Unpack with SSE, using mul and 15 additions
Integer SSE               4.93s   Convert int to float, using _mm_dp_ps

And we suffer for it.

There's an instruction that can help: it's called _mm_hadd_ps (HADDPS - horizontal add). It adds adjacent pairs across two vectors.

A0,A1,A2,A3 HADD B0,B1,B2,B3 =
A0+A1, A2+A3, B0+B1, B2+B3

It's not exactly what we want - we want everything summed in one shot - but it can be made to work.
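To see how, here's a scalar model of HADDPS (mine, just to show the data movement): each call sums adjacent pairs from two vectors, so three calls reduce four 4-wide product vectors to the four dot-product sums, in lane order.

```cpp
#include <array>

using V4 = std::array<float, 4>;

// Scalar model of HADDPS: adjacent-pair sums from a fill the low half
// of the result, adjacent-pair sums from b fill the high half.
V4 hadd(V4 a, V4 b)
{
    V4 r = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    return r;
}
```

hadd(hadd(t1, t2), hadd(t3, t4)) leaves the full sum of t1 in lane 0, t2 in lane 1, t3 in lane 2 and t4 in lane 3 - exactly the vector the final dot product needs, with no element-by-element extraction.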

float dot_byte_sse_unpack_dot_hadd(unsigned char *p16, float *af, float *bf)
{
   // unpack, convert, etc.
   __m128i Z = _mm_setzero_si128();
   __m128i pix = _mm_load_si128((__m128i*)p16);
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);
   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);
   __m128 t1 = _mm_cvtepi32_ps(p32LL);
   __m128 t2 = _mm_cvtepi32_ps(p32LH);
   __m128 t3 = _mm_cvtepi32_ps(p32HL);
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   // multiply
   t1 = _mm_mul_ps(t1, afx); 
   t2 = _mm_mul_ps(t2, afx);
   t3 = _mm_mul_ps(t3, afx);
   t4 = _mm_mul_ps(t4, afx);

   // these will fill t1 with the results
   // from the four dot products. it's an ugly
   // series of additions, but it works.
   t1 = _mm_hadd_ps(t1, t2);
   t3 = _mm_hadd_ps(t3, t4);
   t1 = _mm_hadd_ps(t1, t3);

   // final dot product
   t1 = _mm_mul_ps(t1, bfx);
   t1 = _mm_hadd_ps(t1, t1);
   t1 = _mm_hadd_ps(t1, t1);

   return t1.m128_f32[0];
}

Integer SSE unpack, mul, HADD   1.08s   Unpack with SSE, using mul and HADDPS
Float native                    1.34s
Integer SSE unpack              2.28s   Unpack with SSE, using _mm_dp_ps
Integer native                  2.57s   Convert int to float
Integer SSE unpack, mul         2.75s   Unpack with SSE, using mul and 15 additions
Integer SSE                     4.93s   Convert int to float, using _mm_dp_ps

Now that's what I'm talking about! More than twice as fast as my real benchmark, and even faster than the native float input version! Ultimate victory!

Maybe they'll speed up the _mm_dp_ps instruction in future CPUs. But until then, this does the job pretty well.

Intrinsically Unpredictable++

Most of this is now moot.

The problem was with Visual Studio (2008). Even though I had the Optimization options set correctly, those settings weren’t actually being used to build. I looked in the .vcproj file, and sure enough, there was no Optimization="2" line in the Release configuration for the VCCLCompilerTool, even though the UI said optimization was set to ‘O2’. Closing VS, manually adding that line to the configuration in the .vcproj, and then reloading and rebuilding generated code that looked like it should. This is a Visual Studio bug, not a compiler bug.

The intrinsics version runs just as fast as my hand-written assembly.

Much thanks to all who questioned my earlier results.

Update: the compiler now produces code that crashes really fast, in x64 release mode. And MS knows about it. And they aren’t going to fix it. There’s a workaround for this particular compiler bug, but not for the one I now see with _mm_extract_epi8. Sigh.

Experiment: over.
