 | Level: Introductory Peter Seebach (crankyuser@seebs.plethora.net), Author, Freelance
01 Apr 2005 This series has looked at how the AltiVec instruction set can improve performance on G4 and G5 PowerPC chips. With the theoretical discussion covered in Parts 1 and 2, Part 3 tries to actually get some code optimized. This installment of the Unrolling AltiVec series looks at some real-world code that processors might spend a serious amount of time running, and shows how to tweak it to get extra performance.  |
About this series
The Unrolling AltiVec series has looked at the
history and usage of the SIMD components of the PowerPC architecture;
these are called by various names, including Velocity Engine and VMX, but
Freescale's name for it is AltiVec, which is how the previous installments refer to it. For more on the history and many names of
AltiVec, see the earlier installments of this series. Links are in the Resources section below.
|
|
Motorola® AltiVec™ can dramatically improve the performance of many tasks, even
tasks that you might initially think are too linear to get much advantage
from parallelizing. As always, the most important part of effective
optimization is determining where the system is spending most of its time,
and focusing your efforts there. For instance, if you were working on an
e-mail client, the most useful thing to optimize would be the code for
deleting spam. Pay careful attention to the code examples in this article
and you should be able to get truly incredible performance.
Since Apple makes the source code for Darwin available, it's possible
to even look into optimizing the kernel. One routine comes immediately to
mind: as it turns out, it takes as much as 85% of the CPU time
on my dual G4 tower. This routine simply updates a counter, then continues
operation. It's in the file src/xnu/osfmk/kern/sched_prim.c.
Listing 1 shows the complete source:
Listing 1. Where 85% of CPU time goes
void
idle_thread(void)
{
counter(c_idle_thread_block++);
thread_block(idle_thread_continue);
/*NOTREACHED*/
}
|
The counter() function is in fact simply a
macro that expands to its argument. So, if you focus on the work done in
this function, you'll see that all that's really happening is an increment
of a value of type mach_counter_t, which is an
unsigned int.
Analyzing the code
The most important thing in optimization is understanding what the code
actually does. This function can essentially be reduced to the simpler
form in Listing 2, once you've removed the inefficient
thread-blocking:
Listing 2. A slight simplification
static int x;
void
function(void)
{
++x;
}
|
On the PowerPC®, this expands into a fairly large chunk of assembly; you
can see for yourself using gcc -S -c file.c.
The output file, file.s, looks like Listing 3:
Listing 3. Assembly code for a simple function
_function:
mflr r0
bcl 20,31,"L00000000001$pb"
"L00000000001$pb":
mflr r10
mtlr r0
addis r9,r10,ha16(_x-"L00000000001$pb")
la r9,lo16(_x-"L00000000001$pb")(r9)
lwz r2,0(r9)
addi r2,r2,1
stw r2,0(r9)
blr
.lcomm _x,4,2
|
Most of this code is setup. (For more on PowerPC assembly, see the Resources section.) The actual work is just
done in the three lines right before the return, illustrated in Listing
4:
Listing 4. Incrementing a variable in PowerPC assembly
lwz r2,0(r9)
addi r2,r2,1
stw r2,0(r9)
|
Load the value, add one to it, and store it back. That's fine as far as
it goes -- but remember, the computer is spending something like 85% of its time calling this function over and over. The first thing
to do is figure out how long the loop is really taking. I used gettimeofday() to get time stamps before and after
running the loop a few times, as shown in Listing 5:
Listing 5. A minimalist test harness
int
main(void) {
struct timezone dontcare = { 0, 0 };
struct timeval before, after;
long long microsec;
gettimeofday(&before, &dontcare);
for (x = 0; x < 100000000; function())
;
gettimeofday(&after, &dontcare);
microsec = (after.tv_usec - before.tv_usec) +
1000000 * (after.tv_sec - before.tv_sec);
printf("%lld microseconds\n", microsec);
return 0;
}
|
The output from a test run is 3197848
microseconds -- a tad over 3 million microseconds. Three seconds is
a pretty long time in computing...
Unrolling the loop
The first thing to do is just unroll the loop. This is, for a simple
loop like this, quite easy, as you can see in Listing 6:
Listing 6. A first pass at unrolling the loop
void
function(void)
{
++x;
++x;
++x;
++x;
}
|
With this simple tweak, processing time is down to 1,332,169
microseconds, a substantial improvement. The function is now better than
twice as fast as it was before. It could be unrolled further, but when
you're performing the same operation on a 32-bit value four times in a
row, that suggests that maybe it'd be more efficient to use AltiVec. (Of
course, this first unrolling would improve performance even on a
non-AltiVec CPU.)
Converting to vector operations
Just as the main processor works by loading, modifying, and then
storing values, AltiVec also works by loading, modifying, and then storing
values. The difference is that the AltiVec unit works on 16-byte chunks of
memory at a time; for instance, it can operate on four 32-bit values at
once, instead of a single value at once. The load and store overhead will
be a bit high, but getting the values into a vector register should
produce a measurable speedup.
Consider Listing 7:
Listing 7. Naive vectorization of addition
typedef vector unsigned int v_u_int;
void
function(void)
{
unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
register v_u_int v = vec_ld(0, xs);
register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
v = vec_add(v, inc4);
vec_st(v, 0, xs);
x = xs[3];
}
|
The code in Listing 7 initializes an array with the next four values
after x, then loads the array into a vector
register. Next, the value 4 is loaded into all four 32-bit values of a
second register, and the registers are added. Finally, the incremented
values are stored back into the array, and the last member of the array is
read out. The time to run this version of the loop is only about a million
microseconds -- not exactly the huge speedup AltiVec should offer when
well-tuned, but it's nice to note that, even without a serious effort at
further unrolling, performance has improved.
Unrolling further and further
The next step is to unroll a little further. The simplest thing to do
is to just repeat the line adding inc4 to v. This gets time down to about 430,000 microseconds.
That's pretty good. At this point, however, the dependency of each
operation on previous operations is starting to show.
To get past this limit, the code needs to think ahead, and perform more
operations at once. That means having more than one set of source and
destination registers active at a time, and adding more than four at a
time to each value, as Listing 8 shows:
Listing 8. Unrolling further
unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
register v_u_int v1 = vec_ld(0, xs);
register v_u_int v2, v3, v4, v5, v6, v7, v8;
register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
register v_u_int inc8 = (vector unsigned int) { 8, 8, 8, 8 };
register v_u_int inc16 = (vector unsigned int) { 16, 16, 16, 16
};
v2 = vec_add(v1, inc4);
v3 = vec_add(v1, inc8);
v4 = vec_add(v2, inc8);
v5 = vec_add(v1, inc16);
v6 = vec_add(v2, inc16);
v7 = vec_add(v3, inc16);
v8 = vec_add(v4, inc16);
vec_st(v8, 0, xs);
x = xs[3];
|
The intent here is to reduce the number of times that the processor
needs to stall waiting for the results of a previous operation. So, for
instance, before adding anything to v2, the
code adds something to v1, storing it in a
different register. At the end of these operations, it stores the highest
value it has back into x. This brings the time
for the main loop down to just over 310,000 microseconds. At this point,
the vectorized version is faster than the original, unoptimized, version
by a factor of ten.
In fact, in this particular case, it's actually marginally faster just
to keep looping on adding the inc4 vector in
place; the overhead of setting up all the additional registers ends up
costing too much, and running eight adds in a row can trim the time to
250,000 microseconds.
However, once the set-up cost has been paid, it's possible to get a
dramatic improvement by making more use of your array of vectors. The four
lines filling vectors v5 through v8 can be expanded as Listing 9 shows:
Listing 9. Unrolling the unrolled code even further
v5 = vec_add(v1, inc16);
v6 = vec_add(v2, inc16);
v7 = vec_add(v3, inc16);
v8 = vec_add(v4, inc16);
v1 = vec_add(v5, inc16);
v2 = vec_add(v6, inc16);
v3 = vec_add(v7, inc16);
v4 = vec_add(v8, inc16);
v5 = vec_add(v1, inc16);
v6 = vec_add(v2, inc16);
v7 = vec_add(v3, inc16);
v8 = vec_add(v4, inc16);
|
This allows you to take full advantage of interleaving, making sure that
there are three operations between writing a value in a given vector and
using that value in another operation. This gets the execution time down
to a fairly zippy 200,000 microseconds.
Analyzing performance: Color wheels and color models
 |
Unrolling makes pretty code ugly
One of the reasons not to be too quick to optimize code is that
optimized code is often harder to read than unoptimized code. From a code
quality standpoint, unrolling is absolutely wrong. It creates multiple
copies of the same code, which introduces more opportunities for bit rot.
However, unrolled code also executes more quickly.
The same issues often apply to other comparable techniques. Converting
a recursive procedure to an iterative one is often difficult and ugly, but
may produce a substantial performance increase. In some cases, it may even
make it possible to begin vectorizing code that was previously
unvectorizable. This comes, in general, at a high cost in maintainability,
and a significant risk of introducing errors. For instance, about four
bugs were found in the non-vectorized version of the color transformer
example you'll see later in the article, and about thirty in the vectorized
version.
|
|
Unfortunately, while the previous example was an excellent theoretical
illustration, it does not work out so well in practice. Because its code
runs in the system idle loop, it ends up running the most (and thus
getting the largest speedup) when the system is fairly quiet. As the
system becomes more heavily loaded, and the system spends less time idle,
other functions step in to sabotage the efficiency of the system and make
the user aware that performance is lagging. I am talking, as any user of a
modern Mac OS X system will well know, of the times that the spinning ball
animation, sometimes called the "beach ball of death," appears.
This suggests an avenue for performance optimization: if the generation
of that spinning color wheel could be sped up substantially, the
performance enhancement would be immediately apparent. The easiest thing
to do would be to convert each pixel's RGB values to HSV, increment the
hue by some small amount, then convert back to RGB. The AltiVec pixel type
is, unfortunately, not particularly useful for this task; what's really
needed is a big array of RGB values.
The code for this is a lot more complicated, but it's still reasonably
approachable. The starting function, called inc_hsv(), takes a set of input values, converts them
to HSV, rotates them, and converts them back. Running this conversion on
65,536 data points takes only a fraction of a second, so the test case is
to run that same loop 100 times. That takes a little under 5 seconds.
Unrolling this loop will be a little more challenging. In fact, trying to
write out the unrolled version would be very painful, especially because
the natural size for AltiVec operations on unsigned characters is batches
of 16. In the interest of brevity, I have not provided an inline
listing of the non-vectorized version of this loop unrolled 16 times.
To follow along with this example, see the colors.c sidefile and the color_vec.c sidefile. The original version is colors.c; the vectorized version is color_vec.c.
The sample code provided does make a few assumptions. Most noticeably,
it simply assumes that all of the scratch spaces it allocates will be
16-byte aligned. Production code would want to ensure this, possibly by
putting objects (such as an array of 16 bytes) in a union with some sort
of vector type.
Unfortunately, the algorithm used in this example is a painfully bad
one for AltiVec, as it involves division. AltiVec simply does not have a
division operator. It has multiplication, and a reciprocal "estimate," but
no division. This means that some significant chunks of the process can't
be done in vector code. Still, enough of it is vectorized that, even on a
first attempt, there's a noticeable improvement: a hundred times through
the loop goes from just under 5 seconds, to a bit under 4. That's about a
20% speedup.
Going through the code line-by-line would would soon grow tedious. But
a few things are worth pointing out specifically. The AltiVec code gets a
huge advantage from having the vec_min()
operator. In Listing 10, you can compare the code to select minimum and
maximum values in the original program with the AltiVec version:
Listing 10. Getting the maximum of three values
/* original version */
if (r > g) {
if (r > b) {
max = r;
if (g > b) {
min = b;
} else {
min = g;
}
} else {
max = b;
min = g;
}
} /* second half of this code omitted */
/* AltiVec version */
max16 = vec_max(vec_max(r16, g16), b16);
min16 = vec_min(vec_min(r16, g16), b16);
|
Experienced programmers might well use a macro to calculate the minimum
and maximum values, making the code substantially shorter and easier to
read -- but not actually any faster. AltiVec wins dramatically on
this.
The AltiVec processor has unpack instructions, which sign-extend
values. For instance, unpacking a vector signed
char gives you a vector signed short.
That's a nice feature; unfortunately, there are no corresponding unsigned
operations. One way to deal with this is to use the vec_mul function, and multiply by all ones, as Listing 11 shows:
Listing 11. Unpacking unsigned vectors
scratch[0] = vec_mule(hue16, u8_1);
scratch[1] = vec_mulo(hue16, u8_1);
|
This looks wonderful, but it introduces a bug: the 16 elements of
the vector are no longer in order. The first vector holds even-numbered
elements, and the second odd-numbered elements. The solution I found for
this is the inorder vector, which is used to
permute a 16-byte vector such that expanding it out with vec_mule() and vec_mulo()
produces the first four values in one vector, the next four values in
another, and so on.
One thing that might seem a bit confusing is trying to perform
conditional operations. In general, you don't perform conditional
operations on an AltiVec unit. You perform operations unconditionally,
then select results from vectors based on the output of comparison
functions. For instance, in the original code, wraparound of the hue value
looks like Listing 12:
Listing 12. Conditional operations without AltiVec
if (h > 251)
h -= 251;
else if (h < 21)
h += 4;
|
In the AltiVec code (where hue has a range from 0 to 6), it's done
quite differently, as you can see in Listing 13:
Listing 13. Conditional operations with AltiVec
vector float six;
vector float sub6;
vector bool int dosub;
six = (vector float) { 6.0, 6.0, 6.0, 6.0 };
dosub = vec_cmpgt(fh[j], six);
sub6 = vec_sub(fh[j], six);
fh[j] = vec_sel(fh[j], sub6, dosub);
|
The dosub vector is filled with 1s for the fields where the vector fh[j] has a value greater than 6. A new vector, sub6 is created by subtracting 6 from every value in
the original vector; then fields are selected either from the original
vector, or from the modified one, depending on the bits in the comparison
vector.
With this in mind, look at the complete vectorized code listing, color_vec.c. There are two pieces of code being handled in loops. The
first is the actual hue calculation, which is made more complicated by the
need to divide by the delta value for each
pixel. It looks like it should be vectorizable; in particular, the
essential calculation made each time is identical, with only the source
arrays changing.
The second is the assignment of output values, drawing from the p, q, and t vectors. This is not a very large loop, but it's a
substantial chunk of code. If it could be vectorized, it would improve
things quite a bit. In the first attempt, these calculations were made
16 times, without vectors. Vectorizing the generation of these
numbers improved the speedup from 20% to 50%.
The updated code as is runs nearly twice as fast as the original,
unvectorized, version. You may be able to improve it even further. Once
you've done that, go ahead and put it into the relevant piece of the
Darwin kernel and see how much less time you spend waiting for the system
to render that endearing beach ball.
Summary
As you can see, an experienced programmer can easily avoid the pitfalls
of premature optimization, carefully selecting appropriate sections of
code to optimize. The AltiVec unit does have occasional shortcomings, such
as the lack of a division operator, but careful usage can nonetheless
improve performance dramatically. Attention to detail is important, and
making sure you understand the code you wish to optimize is absolutely
necessary.
Resources - See the sample code for the second example in this article: colors.c (the original pre-vectorization code) and color_vec.c (the AltiVec-optimized vectorized
code).
- Check out the previous parts of the Unrolling AltiVec series.
- Find the PowerPC Microprocessor Family: Vector/SIMD Multimedia
Extension Technology Programming Environments Manual from the Semiconductor solutions technical library.
- SIMDtech has working groups
involved with various SIMD extensions, including AltiVec, MMX, and
others.
- Apple's page about
the Velocity Engine is a slightly buzzword-heavy description of the
AltiVec variants used in Mac systems.
- A previous two-part developerWorks article, "TCP/IP
checksum vectorization using AltiVec" by Ayal Zaks, Dorit Naishlos,
and Daniel Citron, discussed TCP checksum vectorization using AltiVec;
Part 2 of the series addresses Scalar
optimizations and simple vectorization ( developerWorks, October, November 2004) .
- A
discussion of throughput vs. latency, on Apple's site, is of
particular interest.
- Apple provides detailed
performance information about the G4, G4+, and G5.
- Work is being done on auto-vectorization
in GCC.
- Crescent Bay
Software sells software to automatically vectorize C code.
- Apple's page
on performance tools gives links to a number of useful tools,
including
simg4 and simg5.
- The GCC Wiki serves as a
repository for information about GCC, with
up-to-the minute reports on status, useful tips, and everything else you
might want.
- IBM Senior Processor Architect Peter Sandon discusses vector
processing in the G5 in the Ars
Technica interview, "The real structure of the VMX (a.k.a. Altivec)
unit."
- "Save
your code from meltdown using PowerPC instructions," Jonathan Rentzsch
gets into some of the gritty detail of
PowerPC assembly code (developerWorks, November 2004).
- For more on the joys and dangers of writing code that directly
accesses memory, check out "Data
alignment on PowerPC," Jonathan Rentzsch (developerWorks, February
2005).
- Learn everything you needed to know about converting color
spaces, whether you were afraid to ask or not.
- It appears that rumors
of the CD-ROM cursor's death have been greatly exaggerated.
- Intel plans to release hardware in 2006 to
enhance TCP/IP stacks; but did you know you can do something similar
right now with AltiVec?
- The macstl project
tries to unify the architectures in a simple C++ template library. Read
more about it in the Slashdot posting Grand
Unified Theory of SIMD.
- The SIMD
Cross-Platform Headers project does something similar.
- How does SIMD architecture really work? An ancient but still useful
article by Holger Bettag, "Why
AltiVec is a Good Thing," has some answers.
- This
MacTech article about PowerPC assembly is roughly 10 years old, but
covers everything discussed here except AltiVec. Isn't stable
architecture nice?
- Snopes has
more information on April Fools' Pranks.
- You will also appreciate this list of the Top 100 April Fool's
Day Hoaxes of All Time.
About the author  | 
|  | Peter Seebach uses vector processing a lot, and is personally able to cook up to three eggs at once, making him something of an expert in the field. You can contact Peter at developerworks@seebs.plethora.net. |
Rate this page
|  |