Category Archives: Programming

So, Filling…

Continuing with this… (busy week at work! whew!)

Instead of filling with boring old rectangles, what about filling with ellipses?

First problem: it’s just too much work to do the dividing and conquering with actual ellipses. There’s no good way to recursively chop a rectangular image into successively smaller ellipses. So, I decided to just stick with the rectangles right up until the point where I go to fill them in. Then, I’ll just draw an ellipse that covers what the rectangle covers. This is just an experiment. I can do what I want.

So, given that we need each ellipse to cover, at least, the same area as the rectangle would have, is there a way to determine the optimal ellipse? That is: given a rectangle, how big is its proportional enclosing ellipse? Beats me. Sounds like a tough problem. But, StackExchange knew:

A = RectWidth / √2
B = RectHeight / √2

…where A and B are the ellipse’s semi-axes.

So much simpler than I’d expected.
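
In code, that works out to almost nothing. A minimal sketch (the EllipseF struct and the names are mine):

#include <math.h>

// sketch: the proportional enclosing ellipse for a rectangle.
// A and B are semi-axes, so this is the rectangle's inscribed
// ellipse scaled up by sqrt(2).
struct EllipseF { float cx, cy, a, b; };

EllipseF enclosing_ellipse(float left, float top, float w, float h)
{
   EllipseF e;
   e.cx = left + w / 2.0f;   // same center as the rectangle
   e.cy = top  + h / 2.0f;
   e.a  = w / sqrtf(2.0f);   // A = RectWidth  / √2
   e.b  = h / sqrtf(2.0f);   // B = RectHeight / √2
   return e;
}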

So, here we go. Everything else is the same, except that I’m drawing ellipses instead of rectangles.

ml_sz2_t100_ell

Hmm… Interesting. Looks like a lot of eggs. I don’t like that the tops and lefts of the ellipses are all visible, but the bottoms and rights aren’t. That’s a consequence of the fact that the overall drawing and processing basically goes left to right, top to bottom.

Can I change the order in which they’re drawn? Yes.

Instead of drawing ellipses as soon as they’re calculated, I put them in a list. At the end I have a list of all the ellipses that would’ve been drawn. Then, sort that list by the area of each ellipse. And then, draw the sorted ellipses – biggest first. That way, the smaller ellipses will overlay the bigger ones. And that gives:

ml_sz2_t100_ellSt

Better. There’s variety in which parts of the ellipses are obscured. And, the ends of the really big ellipses along the left and right sides are covered up.
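
For reference, the collect/sort/draw pass is tiny. A sketch, reusing the EllipseF struct from above (draw_ellipse() stands in for whatever actually renders):

#include <algorithm>
#include <vector>

void draw_ellipse(const EllipseF &e);  // assumed rendering primitive

// sort by area, biggest first, so smaller ellipses overlay bigger ones.
// area is pi*a*b, and pi is a common factor, so comparing a*b is enough.
void draw_sorted(std::vector<EllipseF> &ellipses)
{
   std::sort(ellipses.begin(), ellipses.end(),
             [](const EllipseF &p, const EllipseF &q)
             { return p.a * p.b > q.a * q.b; });

   for (const EllipseF &e : ellipses)
      draw_ellipse(e);
}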

With borders? Sure!

ml_sz2_t100_ellStO

No fill, borders only?

ml_sz2_t100_ellStNo

Sweeeeeeet.

Divide And Conquer

There’s an old-school technique for rendering the classic fractal, the Mandelbrot set, called Divide and Conquer. This technique takes advantage of the fact that all of the points in the Mandelbrot set are ‘connected’ – there are no islands of Mandelbrot-set points which aren’t connected to the main set, and there are no islands of non-Mandelbrot-set points enclosed by points within the Mandelbrot set. Think of it as a single blob of water. Given this, you can speed up rendering of any subset of the Mandelbrot set by:

  1. Divide the image into four rectangles.
  2. For each of the four rectangles:
  3. If the rectangle has a non-zero width and height:
  4. Calculate all the points along the edges.
  5. If all the points along the edges have the same calculated value, all points within the rectangle must be the same, too. No ‘islands’, remember. So, fill the rectangle with the value you calculated for the edges.
  6. If the points along the edge are not the same, divide that rectangle into four rectangles and, for each sub-rectangle, go to step 3. (Keep in mind that since you’ve already calculated the points on the edges, you can reduce the size of the rectangle by 1 on each edge).

So, it’s a recursive space-filling technique. It divides the work area into smaller and smaller rectangles until it finds something it can fill or it can no longer sub-divide. And in this way it gobbles up large identically-valued sections of the Set very quickly, and so it spends most of its time working on details. In general, that’s a performance gain. And, even better, this technique is far more fun to watch than just seeing pixels calculated one at a time, one row after another. It’s kind of mesmerizing. Or, it was back in the day. These days, computers are fast enough that you can’t see it in action.
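
In rough code, the recursion looks something like this – a sketch, not anyone’s actual renderer. value() stands in for the per-point Mandelbrot calculation, fill_rect() for the output, and a real version would cache edge values instead of recalculating them:

int value(int x, int y);                               // per-point calculation (assumed)
void fill_rect(int x0, int y0, int x1, int y1, int v); // rendering primitive (assumed)

void conquer(int x0, int y0, int x1, int y1)           // inclusive bounds
{
   if (x1 < x0 || y1 < y0) return;                     // nothing left to sub-divide

   // calculate all the points along the four edges
   int v = value(x0, y0);
   bool uniform = true;
   for (int x = x0; x <= x1 && uniform; x++)
      uniform = (value(x, y0) == v) && (value(x, y1) == v);
   for (int y = y0; y <= y1 && uniform; y++)
      uniform = (value(x0, y) == v) && (value(x1, y) == v);

   // all edges agree: no islands means the interior must agree, too
   if (uniform) { fill_rect(x0, y0, x1, y1, v); return; }

   // otherwise, recurse into four quadrants, shrunk by 1 on each
   // edge since those points are already done
   int mx = (x0 + x1) / 2, my = (y0 + y1) / 2;
   conquer(x0 + 1, y0 + 1, mx, my);
   conquer(mx + 1, y0 + 1, x1 - 1, my);
   conquer(x0 + 1, my + 1, mx, y1 - 1);
   conquer(mx + 1, my + 1, x1 - 1, y1 - 1);
}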

Anyway, I thought it would be fun to put a variation of this technique to use on non-fractal images. So, I used color differences in the source image to determine where the edges of the rectangles go, with a selectable threshold to control color matching. I added a way to limit the minimum rectangle size, and taught it to optionally draw tastefully-colored borders around each rectangle.
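
The edge test that replaces ‘same Mandelbrot value’ is a color comparison against the threshold. Roughly like this (a sketch; the pixel layout and the distance metric here are stand-ins, not my actual code):

#include <stdlib.h>

struct RGB { unsigned char r, g, b; };

static int color_dist(RGB a, RGB b)  // crude per-channel distance
{
   return abs(a.r - b.r) + abs(a.g - b.g) + abs(a.b - b.b);
}

// true if every pixel along the rectangle's edges is within `threshold`
// of the corner pixel; if so, the whole rect gets one fill color
bool edges_uniform(const RGB *img, int stride,
                   int x0, int y0, int x1, int y1, int threshold)
{
   RGB ref = img[y0 * stride + x0];
   for (int x = x0; x <= x1; x++)
      if (color_dist(img[y0 * stride + x], ref) > threshold ||
          color_dist(img[y1 * stride + x], ref) > threshold)
         return false;
   for (int y = y0; y <= y1; y++)
      if (color_dist(img[y * stride + x0], ref) > threshold ||
          color_dist(img[y * stride + x1], ref) > threshold)
         return false;
   return true;
}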

And away we go!

Start with this:

monalisa_sm

Set the minimum rectangle dimension to 25 pixels, set the threshold to “100” and…

ml_sz25_t100

Not amazing.

The rectangles are too big. So, reduce the min rect size to 10:

ml_sz10_t100

Now you can start to see what’s going on. Variety in rectangle sizes is what we’re after – otherwise it’s just pixelation. And this would be a most inefficient way to do pixelation…

Set min rect size to 4, and then to 2:

ml_sz4_t100

ml_sz2_t100

There we go.

Now, that’s fun.

But there’s one thing I don’t love about this: the rectangles follow a perfectly even grid, regardless of min rect size. Look at the four rectangles in the top-left of this one, then look at the previous images. They are the same in all of the images except for minsize=25 – and in that case, those four rects fit perfectly into that image’s top-left rectangle. This is completely expected, since we’re just dividing the image into 1/2 then 1/4 then 1/8 then 1/16, etc., and so we’re going to keep hitting those same edges over and over.

What we need is to vary the sizes of the sub-rects, to get off of that power-of-two thing. To do that, I added something called ‘jitter’. Instead of dividing the rectangle evenly into four equal sub-rects, jitter divides the rectangle by 2, 3 or 4 vertically, then 2, 3 or 4 horizontally. The divisor is chosen at random.
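
The change is small. Something like this (a sketch; subdivide() is the recursive entry point):

#include <stdlib.h>

void subdivide(int left, int top, int width, int height);  // recursive entry point (assumed)

// jitter: split into 2, 3 or 4 columns and 2, 3 or 4 rows, chosen
// at random, instead of always splitting evenly into 2x2
void jitter_split(int left, int top, int width, int height)
{
   int xdiv = 2 + rand() % 3;
   int ydiv = 2 + rand() % 3;
   for (int j = 0; j < ydiv; j++)
      for (int i = 0; i < xdiv; i++)
      {
         int x0 = left + (i * width) / xdiv;        // integer division spreads
         int x1 = left + ((i + 1) * width) / xdiv;  // the remainder across cells
         int y0 = top + (j * height) / ydiv;
         int y1 = top + ((j + 1) * height) / ydiv;
         subdivide(x0, y0, x1 - x0, y1 - y0);
      }
}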

That gives:

ml_sz2_t100_j

And because the jitter is random, the output image will be different every time.

Other variations:

No outlines:
ml_sz2_t100_no
Looks like hyper-aggressive JPEG compression.

Or, don’t fill the rectangles, just draw the outlines:

ml_sz2_t100_je
I really like this effect. And this one shows another variation: instead of randomizing the sub-rect proportions, it just randomly nudges the edges of the sub-rects by a pixel or two. This maintains the even grid, but adds just enough noise to break the monotony if you look closely.

Or, render the same image twenty times with jitter on, and put the results into a GIF!

bart

Far out, man!

Faster Dot Product, Kill, Kill!

Trigger warning: Computer junk ahead!

“Dot product” is the name of a common mathematical pattern. If you have two lists of numbers of equal length (called ‘vectors’), A and B, and you multiply each pair of numbers Ai × Bi, then add up the results of all the multiplications, you get the dot product:

A: 1, 2, 3, 4
B: 10, 20, 30, 40
A⋅B = 1×10 + 2×20 + 3×30 + 4×40 = 10 + 40 + 90 + 160 = 300
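
In code, the general form is nothing more than a loop. A throwaway scalar sketch:

// plain scalar dot product of two length-n vectors
float dot(const float *a, const float *b, int n)
{
   float sum = 0.0f;
   for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
   return sum;
}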

It shows up everywhere. And I see variations of it all the time in digital image processing, especially in 4-element situations, as with that little A & B example.

For example, deep in the heart of my bi-cubic interpolation code (a way to do image scaling), there is this bit of code, which takes 16 pixel values and two vectors of coefficients (af and bf), four values each.

float dot_flt_native(float *pix, float *af, float *bf)
{
   float t1 = af[0] * pix[ 0] + af[1] * pix[ 1] + af[2] * pix[ 2] + af[3] * pix[ 3];
   float t2 = af[0] * pix[ 4] + af[1] * pix[ 5] + af[2] * pix[ 6] + af[3] * pix[ 7];
   float t3 = af[0] * pix[ 8] + af[1] * pix[ 9] + af[2] * pix[10] + af[3] * pix[11];
   float t4 = af[0] * pix[12] + af[1] * pix[13] + af[2] * pix[14] + af[3] * pix[15];
   return bf[0] * t1 + bf[1] * t2 + bf[2] * t3 + bf[3] * t4;
}

It does five dot products. First, it does a dot product of the first four pixels and the ‘af’ vector, then another for the second 4 pix, and two more for the third and fourth sets of pix. Those four dp’s give four intermediate results (t1,t2,t3,t4) which are then used in another dot product with the ‘bf’ vector. And that’s the result. Twenty multiplications and fifteen additions.

Note the nice pattern in the way I’ve laid out the code. Note the repetition: grab 4 pixels, multiply each by the corresponding item from af; Repeat; Repeat; Repeat. Different data pushed through the same simple process, repeated. Single Instruction, Multiple Data: SIMD.

Modern CPUs have the ability to do these parallel calculations in a single instruction: line your vectors up and let the CPU chomp away four (or eight, or 16, or 32!) values at a time. (For numerical accuracy reasons I can’t use 8 or 16-bit integers here; I have to use 32-bit floating point values in this calculation.) As always, the trick with this stuff is to get the data lined up with as little fuss as possible – you don’t want the time spent on setup to overwhelm any speedup you gain from being able to do parallel calculations. But, this data is trivial to line up. So, it’s all about doing the calculation.

First method: as-is. Since I already have the code above, I can use that as the benchmark for accuracy and for speed. We’ll call that the native floating point version.

Second method: what I really have. The version above uses floating point data as the input, but my actual data is integers and they have to be converted to floating point data, first. That’s easy enough. And so this is my real benchmark.

float dot_int_native(unsigned char *pixb, float *af, float *bf)
{
   float pix[16];
   for (int i=0;i<16;i++) pix[i] = pixb[i];
   float t1 = af[0] * pix[ 0] + af[1] * pix[ 1] + af[2] * pix[ 2] + af[3] * pix[ 3];
   float t2 = af[0] * pix[ 4] + af[1] * pix[ 5] + af[2] * pix[ 6] + af[3] * pix[ 7];
   float t3 = af[0] * pix[ 8] + af[1] * pix[ 9] + af[2] * pix[10] + af[3] * pix[11];
   float t4 = af[0] * pix[12] + af[1] * pix[13] + af[2] * pix[14] + af[3] * pix[15];
   return bf[0] * t1 + bf[1] * t2 + bf[2] * t3 + bf[3] * t4;
}

Same thing as before, but with a little loop to convert the input to floats. 20 multiplies, 15 adds, and now 16 conversions from integer to float (but how expensive can they be!?)

Now for an attempt at SIMD. It just so happens that there is a 'dot product' instruction in CPUs that support Intel's 'SSE4' tech. It does a dot product in one instruction. Line em up, pull the trigger, done. The only ugly part about this is that we have to pick out the results from the first four dot product instructions (_mm_dp_ps), then pack them into a new vector for the final dot product.

float dot_flt_sse_dpps(unsigned char *pixb, float *af, float *bf)
{
   // same as above, convert integers to floats. 
   // use that ugly __declspec thing to align the data to 
   // 16 bytes, for SSE reasons
   __declspec(align(16)) float pix[16];
   for (int i=0;i<16;i++) pix[i] = pixb[i];

   // load the coefficient vectors into SSE vectors
   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   __m128 t1 = _mm_load_ps(&pix[0]); // load four pixels
   t1 = _mm_dp_ps(t1, afx, 0xff);    // dot product against afx vector

   __m128 t2 = _mm_load_ps(&pix[4]); // load four more pixels
   t2 = _mm_dp_ps(t2, afx, 0xff);    // dot product against afx vector

   __m128 t3 = _mm_load_ps(&pix[8]); // etc
   t3 = _mm_dp_ps(t3, afx, 0xff);

   __m128 t4 = _mm_load_ps(&pix[12]);// etc
   t4 = _mm_dp_ps(t4, afx, 0xff);

   // grab the results from t2,t3,t4 and pack them in 
   // with the result from t1.
   t1.m128_f32[1] = t2.m128_f32[0];
   t1.m128_f32[2] = t3.m128_f32[0];
   t1.m128_f32[3] = t4.m128_f32[0];

   // dot product against bfx vector
   t1 = _mm_dp_ps(t1, bfx, 0xff);

   // grab the result out of t1
   return t1.m128_f32[0];
}

The _mm_dp_ps instruction takes two vectors of four values each, calculates the dot product, then puts the result into an output vector of four values – the same dot product lands in each of the four output spots. So, this:

__m128 t1 = _mm_load_ps(&pix[0]);        // load four pixels
       t1 = _mm_dp_ps(t1, afx, 0xff);    // dot product against afx vector

does this:
x = pix0⋅afx0 + pix1⋅afx1 + pix2⋅afx2 + pix3⋅afx3
t1 = x,x,x,x

Weird, but that's just what it does. So, when we go to make that last vector for the final dot product, we pull the first value from each temporary vector (t1,t2,t3,t4) and stuff it into the corresponding place in t1.

   t1.m128_f32[1] = t2.m128_f32[0]; // result from t2 into t1[1]
   t1.m128_f32[2] = t3.m128_f32[0]; // result from t3 into t1[2]
   t1.m128_f32[3] = t4.m128_f32[0]; // result from t4 into t1[3]

Then we do the final d.p. using t1 and bfx and get our result!

All the additions and multiplications? Gone! Vanished into the _mm_dp_ps instructions.

And how does it perform?

So far we have three variations: float input native, integer input native and integer input SIMD. Since this is not a complex calculation, we need to run it a zillion times to see how it performs. So, for 10,000,000 iterations, here are the results so far:

Float native 1.34s
Integer native 2.57s
Integer SSE 4.93s

That second row shows that the integer to float conversion takes a long time. It accounts for nearly half the time of the calculation. Ouch. Let's work on that, first.

So, here's a version using SIMD instructions to handle the integer to float conversion:

float dot_byte_sse_unpack_dot(unsigned char *p16, float *af, float *bf)
{
   __m128i pix = _mm_load_si128((__m128i*)p16);  // read all 16 8-bit pixels

   __m128i Z = _mm_setzero_si128();              // 128 bits worth of zeros. 
                                             
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);     // unpack the 8-bit pixels to
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);     // two sets of 16-bit pixels

   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);  // unpack the two sets of 16-bit
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);  // pixels to four sets of 
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);  // four 32-bit pixels.
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);

   __m128 t1 = _mm_cvtepi32_ps(p32LL);           // convert the 16 32-bit integer
   __m128 t2 = _mm_cvtepi32_ps(p32LH);           // pixels to 16 32-bit floating
   __m128 t3 = _mm_cvtepi32_ps(p32HL);           // point pixels
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);                 // do the dot product, as before

   t1 = _mm_dp_ps(t1, afx, 0xff);
   t2 = _mm_dp_ps(t2, afx, 0xff);
   t3 = _mm_dp_ps(t3, afx, 0xff);
   t4 = _mm_dp_ps(t4, afx, 0xff);

   t1.m128_f32[1] = t2.m128_f32[0];
   t1.m128_f32[2] = t3.m128_f32[0];
   t1.m128_f32[3] = t4.m128_f32[0];

   __m128 bfx = _mm_load_ps(bf);
   t1 = _mm_dp_ps(t1, bfx, 0xff);

   return t1.m128_f32[0];
}

Seems like a lot of work, but when you look at what the compiler generates for "for (int i=0;i<16;i++) pix[i] = pixb[i];", this way is significantly faster. And that does, in fact, eliminate the integer to float penalty:

Float native 1.34s
Integer SSE unpack 2.28s Use SSE to unpack int to float
Integer native 2.57s Convert int to float
Integer SSE 4.93s Convert int to float, using _mm_dp_ps

This one runs faster than the native integer-to-float version. But it's an uninspiring 11% speedup. Something in there is holding me down. But what?

Looking at the assembly output, the integer unpacking SSE version (above) comes in at a svelte 50 instructions, while the integer native version is over 100 instructions long. So, it's certainly not about raw instruction count. It has to be something about the particular instructions we're using: some instructions are faster than others – are there any slow ones in there?

Here's a handy reference: Agner Fog's Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. It lists the timings for all of the instructions available on all of the various modern Intel and compatible CPUs.

I happen to have an Intel Core i7-3770 in the PC I'm typing this on, which Intel has classified under the "Ivy Bridge" code name. Scrolling down to the Ivy Bridge section in that PDF and looking at some of the instructions used by my SSE code... I see that the load and unpack and int-to-float conversion instructions are all reasonably quick. But, the centerpiece of the function, _mm_dp_ps (aka 'DPPS') - the dot product instruction itself! - is very slow. And I have five of them. But that's what does all the work! Wah!!

Well, OK. That sucks. But, a dot product is just multiplication and addition. And there's certainly a way to do a parallel multiply:

pix0,pix1,pix2,pix3 x afx0,afx1,afx2,afx3 =
pix0⋅afx0, pix1⋅afx1, pix2⋅afx2, pix3⋅afx3

We can do parallel addition, too, but we don't need that. All the things we want to sum are in the same vector, not in separate vectors, which is what parallel addition works on. We have to add all of the items in each multiplication result together to get the full dot product.

We can pull each item out of the vector, add them one at a time, kindof like we're doing down at the bottom of the functions above.

float dot_byte_sse_unpack_dot_extr(unsigned char *p16, float *af, float *bf)
{
   // unpack and convert, etc..
   __m128i Z = _mm_setzero_si128();
   __m128i pix = _mm_load_si128((__m128i*)p16);
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);
   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);
   __m128 t1 = _mm_cvtepi32_ps(p32LL);
   __m128 t2 = _mm_cvtepi32_ps(p32LH);
   __m128 t3 = _mm_cvtepi32_ps(p32HL);
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   // mul input chunks times afx
   t1 = _mm_mul_ps(t1, afx); 
   t2 = _mm_mul_ps(t2, afx);
   t3 = _mm_mul_ps(t3, afx);
   t4 = _mm_mul_ps(t4, afx);

   // t1[0] gets the sum of t1[0],t1[1],t1[2],t1[3]
   // dot product one, done!
   t1.m128_f32[0] = 
       t1.m128_f32[0] + 
       t1.m128_f32[1] + 
       t1.m128_f32[2] + 
       t1.m128_f32[3];

   // t1[1] gets the sum of t2[0],t2[1],t2[2],t2[3]
   t1.m128_f32[1] = t2.m128_f32[0] + t2.m128_f32[1] + t2.m128_f32[2] + t2.m128_f32[3];
   t1.m128_f32[2] = t3.m128_f32[0] + t3.m128_f32[1] + t3.m128_f32[2] + t3.m128_f32[3];
   t1.m128_f32[3] = t4.m128_f32[0] + t4.m128_f32[1] + t4.m128_f32[2] + t4.m128_f32[3];

   t1 = _mm_mul_ps(t1, bfx);
   t1.m128_f32[0] = t1.m128_f32[0] + t1.m128_f32[1] + t1.m128_f32[2] + t1.m128_f32[3];

   return t1.m128_f32[0];
}

That's not great. We've reduced our twenty multiplications to five _mm_mul_ps instructions, but we've pushed the additions back up to 15 – and they're all scalar, one element at a time.

Float native 1.34s
Integer SSE unpack 2.28s Use SSE to unpack int to float, using _mm_dp_ps
Integer native 2.57s Convert int to float
Integer SSE unpack, mul 2.75s Unpack with SSE, use mul and 15 additions
Integer SSE 4.93s Convert int to float, using _mm_dp_ps

And we suffer for it.

There's an instruction that can help: it's called _mm_hadd_ps (HADDPS – horizontal add). It adds adjacent pairs across its two input vectors.

A0,A1,A2,A3 HADD B0,B1,B2,B3 =
A0+A1, A2+A3, B0+B1, B2+B3

It's not exactly what we want – we want everything summed in one shot – but it can be made to work.

float dot_byte_sse_unpack_dot_hadd(unsigned char *p16, float *af, float *bf)
{
   // unpack, convert, etc.
   __m128i Z = _mm_setzero_si128();
   __m128i pix = _mm_load_si128((__m128i*)p16);
   __m128i p16L = _mm_unpacklo_epi8(pix, Z);
   __m128i p16H = _mm_unpackhi_epi8(pix, Z);
   __m128i p32LL = _mm_unpacklo_epi16(p16L, Z);
   __m128i p32LH = _mm_unpackhi_epi16(p16L, Z);
   __m128i p32HL = _mm_unpacklo_epi16(p16H, Z);
   __m128i p32HH = _mm_unpackhi_epi16(p16H, Z);
   __m128 t1 = _mm_cvtepi32_ps(p32LL);
   __m128 t2 = _mm_cvtepi32_ps(p32LH);
   __m128 t3 = _mm_cvtepi32_ps(p32HL);
   __m128 t4 = _mm_cvtepi32_ps(p32HH);

   __m128 afx = _mm_load_ps(af);
   __m128 bfx = _mm_load_ps(bf); 

   // multiply
   t1 = _mm_mul_ps(t1, afx); 
   t2 = _mm_mul_ps(t2, afx);
   t3 = _mm_mul_ps(t3, afx);
   t4 = _mm_mul_ps(t4, afx);

   // these will fill t1 with the results
   // from the four dot products. it's an ugly
   // series of additions, but it works.
   t1 = _mm_hadd_ps(t1, t2);
   t3 = _mm_hadd_ps(t3, t4);
   t1 = _mm_hadd_ps(t1, t3);

   // final dot product
   t1 = _mm_mul_ps(t1, bfx);
   t1 = _mm_hadd_ps(t1, t1);
   t1 = _mm_hadd_ps(t1, t1);

   return t1.m128_f32[0];
}

Integer SSE unpack, mul, HADD 1.08s Unpack with SSE, using mul and HADD
Float native 1.34s
Integer SSE unpack 2.28s Use SSE to unpack int to float, using _mm_dp_ps
Integer native 2.57s Convert int to float
Integer SSE unpack, mul 2.75s Unpack with SSE, using mul and 15 additions
Integer SSE 4.93s Convert int to float, using _mm_dp_ps

Now that's what I'm talking about! More than twice as fast as my real benchmark, and even faster than the native float input version! Ultimate victory!

Maybe they'll speed up the _mm_dp_ps instruction in future CPUs. But until then, this does the job pretty well.

Intrinsically Unpredictable++

Most of this is now moot.

The problem was with Visual Studio (2008). Even though I had the Optimization options set correctly, those settings weren’t actually being used to build. I looked in the .vcproj file, and sure enough, there was no Optimization="2" line in the Release configuration for the VCCLCompilerTool, even though the UI said optimization was set to ‘O2’. Closing VS, manually adding that line to the configuration in the .vcproj, and then reloading and rebuilding generated code that looked like it should. This is a Visual Studio bug, not a compiler bug.

The intrinsics version runs just as fast as my hand-written assembly.

Much thanks to all who questioned my earlier results.

Update: the compiler now produces code that crashes really fast, in x64 release mode. And MS knows about it. And they aren’t going to fix it. There’s a workaround for this particular compiler bug, but not for the one I now see with _mm_extract_epi8. Sigh.

Experiment: over.


Intrinsically Unpredictable

I found an article on CodeProject that shows a way to do “alpha blending” of one image onto another using Intel’s sexy SIMD instructions. Whassat? SIMD = Single Instruction, Multiple Data. Instead of adding sixteen numbers to sixteen other numbers one at a time, you can add sixteen numbers to sixteen other numbers all at once. One instruction, “add”, performed on multiple values in parallel. Instant sixteen-fold speedup!

The drawback is that these SIMD instructions (Intel’s MMX, SSE, SSE2, etc.) are fairly limited in utility, kindof arcane, and sometimes come with some pretty harsh restrictions. So, you end up having to rework your process to fit what the SIMD instructions allow. The potential benefits are great (16x! for every bit of arithmetic!), but you generally sacrifice a ton of speed in order to get all your data lined up just right so you can do those parallel adds. And as always, things that are simple in C/C++ become challenging in Assembly, and reworking your process to fit what the SIMD instructions demand on top of that can become an excruciating nightmare.

But excruciating is my first name. Well, not really, but it’s actually kindof close, in a theological and etymological way.

So, this CodeProject article shows how to blend one image onto another with variable opacity. It’s not a really complicated process: for every pixel, Pixelnew = (Pixel1 * opacity) + (Pixel2 * (1.0 – opacity)). The trick is to make it fast.
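
In scalar terms the per-channel arithmetic is just this (a sketch of the formula, not the article’s code; the +0.5 is optional rounding):

// blend two 8-bit channel values with a global opacity (0.0 .. 1.0)
unsigned char blend(unsigned char p1, unsigned char p2, float opacity)
{
   return (unsigned char)(p1 * opacity + p2 * (1.0f - opacity) + 0.5f);
}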

I have some C++ code that burns through it pretty quickly and I was satisfied with it. But, I tried this SIMD code from the article, and had to admit defeat. It was nearly four times as fast as my beautiful C++. So, I reworked it a bit and replaced my C++ code with the result.

That got me to thinking.

What the article calls “alpha blending” is not what I call alpha blending. The code just does a blend of two images with a global opacity value – every pixel uses the same opacity. In my experience, “alpha blending” is what happens when each pixel has its own opacity value, aka “per-pixel alpha.” Where a standard pixel has a Red, Green, and Blue value, a pixel with per-pixel alpha has a Red, Green, Blue and Alpha value. That’s much more flexible (at the cost of 33% more memory per image).

What I wanted was a way to blend an RGBA (Red, Green, Blue and Alpha) image onto an RGB (no alpha) image, using the SIMD instructions. Every pixel in the ‘top’ image has its own opacity value, but the bottom image doesn’t.

Problem 1.
RGBA and RGB pixels are different sizes. RGBA pixels are four bytes wide and RGB pixels are three. So I have to shuffle things around to get the Rs, Gs and Bs all lined up. That’s going to eat up time.

Problem 2.
The SSE instructions work on chunks of data 128 bits wide. You read 128 bits, do your stuff on those 128 bits, then write 128 bits out. That means we can read 4 RGBA pixels at a time (8 bits per byte, 1 byte each for R, G, B, A; 8 x 4 x 4 = 128). But 128 bits would hold 5.3 RGB pixels. Since we need to have an RGBA for every RGB, we’re going to have to discard those last 1.3 RGB pixels at each pass. Not a problem, but the inefficiency gnaws on me.

Problem 3.
Due to the nature of computer math, you can’t do that alpha blending equation using just 8 bits per R, G or B. The intermediate values will exceed what 8 bits can hold. So, you have to expand to 16-bit data. That means we can process just 2 pixels at a time, not 4. All’s not lost, since there’s a lot of initial work that can be done before we do that expansion into 16-bit data. So, we’ll just split those four initial pixels of 3×8 into two sets of 3×16, do one after the other, then recombine at the end, saving all the initial work at the cost of a little more work at the end.
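
To see why 8 bits isn’t enough, here’s the same blend in integer math with a 0–255 alpha (a sketch; the names are mine). The inputs and output are 8 bits, but the products need 16:

unsigned char blend8(unsigned char top, unsigned char bottom, unsigned char alpha)
{
   unsigned short t = (unsigned short)top    * alpha;         // up to 255*255 = 65025
   unsigned short b = (unsigned short)bottom * (255 - alpha); // needs 16 bits
   return (unsigned char)((t + b) / 255);                     // sum still fits in 16 bits
}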

We’re still ahead of C++, which has to do each pixel as three separate operations (and SIMD will do them 2 at a time), but you can see we’re running out of potential, quick.

Problem 4.
After we process our two pixels, we have to store the result. But again, 128 bits = 5.3 RGB pixels, and we’re only working on 4. So we have to keep those extra 1.3 pixels around, stick them on the end of the four pixels we actually processed and then write the whole thing out – otherwise, we’d end up writing garbage over those 1.3 pixels on the end (which are the first 1.3 pixels of the next round!)

Problem 5.
Since we always process 4 RGBA/RGB pixels at a time, and we can’t read or write past the end of the image, we can’t do anything if the image has fewer than four pixels. And if we end up with 1, 2 or 3 pixels at the end of a row, this code can’t do anything with them, either. So, we have to limit ourselves to images that are multiples of 4 wide.

Did I solve these problems? Tune in tomorrow-ish!

(pt 2)

ZAP [eax], [mm1]

Assembly language…

(scared yet?)

Writing it is like riding a bike without training wheels. No, more like riding a bike without training wheels or brakes, and it’s a fixie. And you’re riding down Mt Doom with a dwarf standing on the back pegs.

So, a few years ago I did a major revision of one of my software packages, and added support for 64-bit processors as part of the work. Along with a lot of nit-picky work, there were a few architectural changes, and one of them was reworking some in-line assembly I was using for performance reasons. The code is used to blend two images: pixel_top + pixel_bottom = pixel_new. It’s the kind of thing that absolutely flies when using MMX because you can do 8 pixels at a time. Anyway, in x64, the MS C++ compiler doesn’t allow in-line ASM as it does in Win32. Not exactly sure why. Doesn’t matter. No big deal. I just moved each chunk of ASM to an external .ASM file, put a “PROC” and “ENDP” around each, added parameter lists, etc. Only slightly more complex than copy/paste.

And that worked fine for years. Until yesterday.

Yesterday I learned that one of those functions had started crashing in certain situations. And I stared at the code for hours and didn’t see anything wrong with the logic – the code looked like it should do exactly what it was supposed to do. And it was! But it wasn’t. For some reason, in release mode, with full optimizations turned on and with a certain data size, a pointer was going crazy and running off into the woods, where it encountered an angry wizard who killed the process dead.

Then I looked under a rock I found on the internet and learned something!

When you write in-line assembly with Microsoft’s C/C++ compiler, you don’t have to worry much about the contents of most of the common CPU registers when entering or leaving your _asm{…} blocks. The registers are almost like local variables. You can put anything you want in them; the compiler doesn’t care, and it will clean up after you.

But when you move your ASM to an external (non-inline) function and use the MASM compiler (aka the ‘assembler’), you are expected to behave as a responsible adult. You are supposed to give the assembler a list of all the registers you are going to use, so that the compiler can coordinate its own register usage with yours – so that you don’t interfere with each other. You’re supposed to tell it; the assembler doesn’t actually care one way or another – want to ride your bike without a seat? ASM doesn’t give a crap. Go ahead. Checking that you’ve done this would be easy, but the assembler doesn’t bother.

And if you don’t tell the compiler that you’re using a given register – EAX, for example – it will assume EAX is available for other uses and that it doesn’t have to worry about you meddling with EAX. But then if you do meddle with EAX in your function, and the compiler was using EAX to hold a pointer, and your program comes out of your function but now that pointer has gone crazy and run off into the woods where the angry wizard lives… well, your program is going to get killed.

Or maybe it won’t! Maybe three years of new revisions will go by and never once will the compiler decide it cares about EAX when calling your function. It will be using EBX or EDX or whatever to hold that one pointer. But then one day (yesterday, in my case) it will decide to use EAX… and you won’t know it until you hear the ZAP! of the angry wizard’s staff flicking your program out of existence.

And that’s why one should avoid assembly whenever possible.

The Loop

How to piss off a customer.

int i;
for (i=0; i < theList.size(); i++);
{
   theList[i].something = 0;
}

= crash.

Since I couldn't run the code (couldn't come up with data that would ever cause that code to run), it took me a day to figure out what the problem was. Any of you C/C++/etc peeps see the problem?

Mochi! I Hibaried On Your Joomla! Xajax!

I’ve had this job:

We programmers know a few tools but not the ones the architect knows. That’s ok – we’ve learned lots of tools over the years so we keep quiet and think to ourselves "I don’t know what this guy is talking about but I can learn these tools". So when the architect says "exclude slf4j from the library’s build sequence or modify the pom file dependency list" we don’t say "what the hell are you talking about". We don’t say anything. We go to google and spend the next two weeks learning slf4j and Ivy and Maven, and RESTful WebServices and Grails, and the proper syntax for the BuildConfig file. Then we reboot the computer three or four times for good measure; download security patches for the IDE; get the latest version of the JDK; clone a few repositories from GitHub; study the Artifactory website; look for new docs on the wiki; and hope to god someone has figured out why the WAR file doesn’t deploy to the 3.2 version of the app server. In all this time no code has been written. No problems have been solved. No user interfaces have been created. But everyone has been terribly busy. And the architect has been studying newer, better versions of the three or four tools we have now almost learned, but which have become obsolete. Eventually, the project is cancelled. Management decides to continue using the prototype version written in Objective-C, even tho the source code has been lost and it doesn’t use any of the company-standard tools, because it does 80% of what the one customer wants, and no one else really needs the application anymore.

Luckily for me, these days, I’m down in low-level C/C++ land, mostly pushing string data around. Because we’re cross platform, we don’t have a lot of external libraries to fuck with; even parts of the C++ STL are off-limits.

But I’ve been in the job where the answer to every requirement is a toolkit someone read about in last month’s IT Mid-level-manager Journal. It’s a great way to do nothing but add acronyms to your resume.

Urgentz! CodePlz?!

In a postscript to her article “Plagiarism and the mechanics of privilege”, Teresa @ Making Light has some fun with a student who is trolling the Yahoo! Answers boards, looking for someone to write her essay on “Wuthering Heights”. Teresa replies:

Here’s a thesis statement: Wuthering Heights is an early work of science fiction that takes place in a dimensional bubble separate from our own universe. We know this because any time someone gets too far from the Wuthering Heights/Thrushcross Grange/Gimmerton Kirk triangle (think Bermuda Triangle), they cease to exist as far as the story is concerned. Nothing outside the bubble is a solution to anything that happens inside it, no matter how logical that should be. Characters living inside it only leave if they’re desperate, and for some strange reason, they keep coming back. …

Which is awesome.

Unfortunately, begging strangers to do your homework isn’t limited to English Lit. essays; it happens all the time on my favorite programming website, Code Project, too. People constantly post Please Do My Homework questions. And they usually get mocked pretty roundly by the rest of us.

Sometimes their questions are simply the text of the assignment (which they’ve copy/pasted), with no commentary or even a hint that they’ve done any thinking about the problem.

For example:

This is my question: please do my homework
Implement a standalone procedure to read in a file containing words and white space and produce a compressed version of the file in an output hie. The compressed version should contain all of the words in the input file and none.

Which generated the following replies:

should contain all of the words in the input file and none.

Well that’s quite a challenge, I wonder how you will solve it?

and

You should assume your teacher is not as dumb as you, and knows how to use google. They probably already know you are a cheat, you should talk to them before they talk to you, to increase your chances of not being kicked off your course.

Sometimes they put in the effort of retyping the assignment:

write a cprogram using nested for statements
write a cprogram usung for nested statements to dicplay the following output:
1
2 2
3 3 3
4 4 4 4
5 5 5 5 5

Which received, in reply, two complete programs. Some people are kind. I usually tell them to do their own homework.

For whatever reason, most of these pleas are from people for whom English is definitely not a first language, though maybe textspeak is:

I need a code
Pls help me to sort numbers in both ascending and descending orders in javascript with HTML.I need a code for this.Pls send it as soon as possible.

… and …

will u help me plz
dear Mr. basu
this is KAUSHAL VARSHNEY from aligarh. i m NIITian and persuing GNIIT from ther. we are in some coding trouble, actually we aer governing a project on USB security system, so will u plz help me in coding of my job. specially in identifying default antiviruse on a button click event by which we can provide more security to our project.
i will very thankful of urs. there is not much time we had. so respond me as soon as possible.

Or, the ever-popular chunk-of-code post:

How to convert c++ to c#…pls..
[…1000+ lines of C++ omitted…]

No question, no hint that the poster has tried anything, just a giant chunk of code. Even better, this was the kind of code that really couldn’t be translated; it was the kind of code that’s intimately tied to a C++ framework for which there is no C# equivalent. It’d be like translating e.e. cummings into Chinese. This person had no idea what he/she was doing. Or it was a joke.

There are plenty of wanna-be game developers:

can anyone give me the source code of pacman in java?
tnx!
i really need it to be able to do my assignment!
i hope you guys can help me on dis! tnx!

As the first reply notes, instead of typing that long, detailed question, the person could have just Googled “source code pacman java” and found the answer. “Have you tried Google?” or links to Let Me Google That For You are popular ways to answer these questions.

And…

How to develop a game from screenshots
hai i am dng project

i m dng blackberry gaming application prjct.
here i want 2 develop the code from screenshots.
plz suggest 2 me how to do it?

Replies included:

Look at the screen shots, and then write code to duplicate them.

..and..

here i want 2 develop the code from screenshots.

Usually the flow goes the opposite way…

But this one might be my favorite of recent questions:

I demand a lexical analyzer
prgram that convert from lex.l to lex.yy.a
plz
in lexical analizer

I realize there’s a language issue here, but I just love the phrase “I demand a lexical analyzer”. I think it would make a great T-shirt.

The Story of O

In addition to my day job, I run a small business writing image processing software. Sales have been pretty shitty for the past few months, so I’ve kindof lost interest in writing new stuff for it. It’s no fun writing new features when you don’t expect anyone to care – I work much better when people ask for things. But over the Christmas break, I found a research paper that got me interested in programming again; it describes a magic beast: an O(1) Median Filter.

What?

Come with me. I will tell you.