Level: Introductory
Peter Seebach (crankyuser@seebs.plethora.net), Author, Freelance
01 Mar 2005
AltiVec? Velocity Engine? VMX? If you've only been casually following PowerPC development, you might be confused by the various guises of this vector processing SIMD technology. In this first installment of a three-part series, Peter Seebach gives you the basics on what AltiVec is, what it does -- and how it stacks up against its competition.
On August 31, 1999, Apple® announced that it would begin selling
computers that were considered "supercomputers" by the U.S. government,
and could thus not be exported to some countries. This caused a lot
of controversy at the time. Now, five years later, lots of computers are
faster. But what is still interesting is that the technology that Apple
used as the basis for this fairly dramatic claim is still in use, and it's
still of major interest to developers trying to get the best performance
out of certain kinds of tasks.
Motorola® AltiVec™ is one of the names (see the sidebar, "The nine
billion names of AltiVec," for the others) for a specific example of
Single Instruction, Multiple Data, or SIMD, execution. Normally, a
single instruction to a computer does a single thing. A single SIMD
instruction will also generally do a single thing -- but it will do it to
multiple pieces of data at once, thus the fairly unimaginative (but at
least pronounceable!) name. The multiple pieces of data being operated on
at once are often called vectors, hence the name AltiVec. More
traditional supercomputer vector processors might well store fifty or more
pieces of data in a single vector register. AltiVec is a little more
conservative. (See the sidebar, "Why vectors?" below, for more on
vector processing.)
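To make the "one instruction, several pieces of data" idea concrete, here is a minimal sketch in portable C. It is only a model: the vec4f type and both function names are hypothetical, invented for this illustration, and the "vector add" is an ordinary loop standing in for what real SIMD hardware would do in a single instruction.

```c
#include <stddef.h>

/* Hypothetical stand-in for one 128-bit register holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Scalar style: each addition is its own instruction. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD style, modeled in software: one call stands for one instruction
   that adds all four lanes at once. */
vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

Adding sixteen floats takes sixteen scalar additions in the first style, but only four of the modeled vector additions in the second.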
The nine billion names of AltiVec (well, OK, only three)
VMX was the original code name for this extension inside IBM.
That term isn't in widespread use, with generic terms like SIMD or
vector processor preferred in IBM documentation. AltiVec is
Motorola's trade name for this set of extensions, and the company
thoughtfully trademarked the term. That's why Apple uses the name
Velocity Engine, which is nicely generic and refers to either
company's implementation of the technology.
The interaction of these names is confusing, and they are often bandied
about interchangeably. Whenever you hear one of these terms used, it's
probably safe to assume that it is technically interchangeable with the
others. This article uses the term AltiVec because it's the
prettiest.
Is this hardware or software?
If a processor has AltiVec support, that means that the processor has
support for an additional set of instructions, as well as the special
registers used by those instructions. However, the question of whether a
particular chip has AltiVec support can't be answered by just seeing if a
particular feature is present or absent. Different tasks performed using
AltiVec instructions are handled by different components of the chip, and
different chips may have different sets of components. For instance, the
original G4, the first shipping processor with AltiVec support, had only a
single unit to handle permutation instructions, but later models, called
the G4+ by Motorola, had two. Each of these units can process one
operation at a time, but if you have two units, you can start another
operation before the first one is finished.
Software support for AltiVec mostly consists of generating instructions
that the hardware can use. There are extensions to C that you can use to
specify these precise operations, or you can drop straight into assembly.
On the other hand, some compilers can do an acceptable job of transforming
code automatically into vector operations.
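As a rough sketch of what the C extensions look like, the function below adds two float arrays four elements at a time using vec_ld, vec_add, and vec_st from the AltiVec C interface. The function name add_floats is hypothetical, and the sketch assumes a compiler that defines __ALTIVEC__ when AltiVec is enabled (GCC does so with -maltivec); on any other target it falls back to a plain loop so the code still builds.

```c
#include <stddef.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Adds n floats element by element.
   Assumptions on the AltiVec path: n is a multiple of 4, and a, b,
   and out are all 16-byte aligned (vec_ld and vec_st ignore the low
   four address bits). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
#ifdef __ALTIVEC__
    for (size_t i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);   /* load four floats */
        vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, out + i);  /* add and store four */
    }
#else
    /* Scalar fallback for targets without AltiVec. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Keeping the scalar fallback next to the vector path is one common way to leave the same source buildable on machines without a vector unit.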
The existence of noticeably different implementations of AltiVec has
implications for programmers. The best way to get an algorithm to run may
vary from one processor to another. This isn't a problem specific to
AltiVec; the best possible optimization of a piece of code often varies
widely from one generation of any processor to another.
Who supports this?
To get use out of AltiVec, you need a chip that has the hardware, an
operating system that supports it, and a compiler or assembler that can
generate code to use it.
The only processors currently supporting AltiVec are the G4 and G5.
The G4 (including model numbers 7400 and 7410) and G4+ (7450 and 7455)
processors are made by Motorola. (There are more models than just the ones
listed here, but these are the most widely discussed.) The G5 chips
include the IBM 970 and 970FX; these are essentially POWER4™ cores
with an AltiVec unit bolted on. So far, only PowerPC® processors have had
AltiVec support, not the POWER™ line. If you want to buy "a computer with
AltiVec," Apple's Mac line is your most likely option. For evaluation
boards and custom designs, however, you can go with any of the many
vendors who do development kits based on either the G4 or G5.
Our apologies!
The original version of this article contained an erroneous reference to a "970GX." We apologize for any confusion this may have caused.
--Editors
Any operating system that runs on PowerPC and has been updated since
the year 2000 will almost certainly work with AltiVec; if there are
counterexamples, I have been unable to find them. The highest profile
OSes would be Linux™, Mac OS X, and AIX. In theory, an operating system not
specifically written for AltiVec-based hardware might not save information
stored in the AltiVec registers when switching from one task to another.
However, most OSes seem to have been long since patched to handle this
difficulty. Even this theoretical problem will not arise as long as only one user program at a time uses the AltiVec instructions; it's only
when more than one program on a system uses AltiVec that problems can
arise.
As for compiler support, the GNU project's GCC compiler supports AltiVec; so does the Metrowerks
CodeWarrior compiler, and, of course, IBM VisualAge®. All three produce
functional code using the AltiVec extensions. (This, of course, applies to
current compiler versions; no one is promising that a 1995 copy of any
compiler will have support for the vector instructions.)
A few technical details
Now that you know what AltiVec is, you're probably wondering what
exactly it does. It provides thirty-two 128-bit registers to hold
vectors. Each register can be seen as providing sixteen 8-bit values,
eight 16-bit values, or four 32-bit values. The 32-bit values can be
integer or floating point, and all integer values can be signed or
unsigned. AltiVec does not support 64-bit values, which can be a bit of a
crimp; but in AltiVec's defense, only two 64-bit values would fit in a
register, and getting two operations at once might not justify the
overhead of getting vectors arranged. Furthermore, AltiVec
supports an additional type, called pixel, that
holds eight 16-bit pixel values. Past this, there's a fairly large number
of instructions that perform various operations: loading registers from
memory, performing arithmetic on them (in various types), and writing them
back.
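One way to picture the different views of a single 128-bit register is as a C union. This is purely an illustration: the names vreg128 and lanes_per_register are hypothetical, and real AltiVec code would use the vector types from the C extensions rather than a union like this.

```c
#include <stdint.h>

/* Illustrative model of one 128-bit vector register. All members
   overlay the same 16 bytes of storage. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit values */
    uint16_t u16[8];   /* eight 16-bit values (also the pixel layout) */
    uint32_t u32[4];   /* four 32-bit integers */
    float    f32[4];   /* four 32-bit floats */
} vreg128;

/* How many elements of a given width fit in one register. */
int lanes_per_register(int element_bits)
{
    return 128 / element_bits;
}
```

By the same arithmetic, a 64-bit element type would leave only two lanes per register.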
Why vectors?
Early vector processors didn't so much perform multiple operations at
once as queue operations up so they could be performed
in series. You'd start loading values into the vector register, and a
little later start getting outputs at a ferocious pace. Instead of each
operation taking a full load/modify/store cycle, followed by another
load/modify/store, you would set up a series of loads, and then a little
later a series of stores would follow. Even naive use of AltiVec can
approximate this, by letting you load four (or more) values at once,
operate on them, then store them all back at once.
That said, AltiVec works best when you're doing multiple sets of
operations at once, which is one of the reasons it has a large number
of registers: you can load one register while another is being
processed, and so on. This is important to users simply because it's
still many times faster than individual operations. Furthermore,
AltiVec's design allows a little more flexibility than some larger
vector processors and is well-suited to the variety of tasks desktop
computers face.
Of particular interest is the existence of permutation
instructions, which allow a register to be populated with bits or pieces
of another register, possibly shuffled. These instructions allow bytes to
be reshuffled from two source registers into a single destination
register, in any order or pattern whatsoever. This is an incredibly
general tool, useful anywhere from image processing to cryptography.
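The byte-level permute can be modeled in a few lines of plain C. This sketch follows the usual description of the vperm instruction: each byte of a control vector uses its low five bits to pick one of the 32 bytes formed by the two sources laid end to end. The bytes16 type and the perm_model name are hypothetical, and the model makes no attempt to reflect timing or register details.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-byte vector register. */
typedef struct { uint8_t b[16]; } bytes16;

/* Software model of a two-source byte permute: indices 0-15 select
   from a, indices 16-31 select from b. */
bytes16 perm_model(bytes16 a, bytes16 b, bytes16 control)
{
    bytes16 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = control.b[i] & 0x1Fu;  /* only low 5 bits count */
        r.b[i] = (idx < 16) ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}
```

A control vector of 15, 14, ..., 0 reverses the bytes of the first source, while 0, 16, 1, 17, ... interleaves bytes from both sources. The same operation covers both jobs with nothing changed but the control data.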
You might think that the overhead of saving thirty-two 128-bit
registers would be a little steep, and the designers of AltiVec would
agree. To this end, a 32-bit register called VRSAVE has been provided. When the operating system
switches context, it saves only the registers whose corresponding bit in
the VRSAVE register has been set to one. This
is an excellent compromise, allowing a programmer who needs only a few
registers to save and restore those registers, and no others, on context
switches. On the downside, this register must be manually updated if you
are writing in assembly. (A C compiler targeting AltiVec will keep the
register up to date for you.)
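The bookkeeping behind VRSAVE amounts to a 32-bit mask, sketched here in portable C. The helper names are hypothetical, and the sketch assumes the usual PowerPC bit numbering, in which register v0 corresponds to the most significant bit; check the programming environments manual before relying on that ordering.

```c
#include <stdint.h>

/* Mask bit for vector register reg (0-31).
   Assumption: v0 maps to the most significant bit. */
uint32_t vrsave_bit(unsigned reg)
{
    return UINT32_C(0x80000000) >> (reg & 31u);
}

/* Returns the mask with reg marked as in use. */
uint32_t vrsave_mark(uint32_t vrsave, unsigned reg)
{
    return vrsave | vrsave_bit(reg);
}

/* Number of registers a context switch would need to save. */
unsigned vrsave_count(uint32_t vrsave)
{
    unsigned n = 0;
    while (vrsave != 0) {
        n += vrsave & 1u;
        vrsave >>= 1;
    }
    return n;
}
```

A program that marks only v0 and v1 would have two registers, or 32 bytes, saved per context switch under this scheme, rather than all 512 bytes.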
In a couple of cases, AltiVec instructions deviate a little from the
"one operation performed on multiple operands" model that is generally
associated with vector processing. For instance, the vmaddfp instruction multiplies two operands together
and adds a third in a single operation. Furthermore, there are a few
operations that operate on the register as though it were a single 128-bit
value, such as bit shifts or boolean operations, which don't care what
type of data you think the register holds.
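A multiply-add across four float lanes can be modeled like this. The vec4f type and madd_model name are hypothetical, and the operand order follows the vec_madd convention of a times b plus c. The model ignores rounding details: plain C may round the product before the addition, which the hardware instruction need not do.

```c
/* Hypothetical stand-in for a register holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Software model of a vector multiply-add: r = a * b + c, lane by lane.
   Rounding behavior of the real instruction is not modeled. */
vec4f madd_model(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}
```

Dot products, polynomial evaluation, and filters are all built from this multiply-then-add pattern, which is a plausible reason for giving it a single instruction.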
Compare and contrast
AltiVec units face competition from three sources. The first, and
sometimes the most effective, is the rest of the CPU. While AltiVec is
incredible for some kinds of tasks, there are others on which it simply
doesn't perform very well. Some algorithms that don't lend themselves well
to vectorization will be hard or impossible to convert to AltiVec. Any
process in which every stage of computation depends on the results of the
previous stage will probably see very little improvement from using
AltiVec.
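The difference is easy to see in two small loops. Both functions here are hypothetical examples written for this comparison. In the first, every element can be computed without looking at any other, so four of them could be handled at once. In the second, each element needs the freshly computed value of the one before it.

```c
#include <stddef.h>

/* Independent work: element i never looks at any other element,
   so several elements can be processed together. */
void scale_all(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* Dependent work: element i needs the updated element i - 1, so the
   steps must run one after another. */
void running_filter(float *x, size_t n, float k)
{
    for (size_t i = 1; i < n; i++)
        x[i] += k * x[i - 1];
}
```

Some dependent loops can be restructured into a vector-friendly form, but that is a change of algorithm rather than a mechanical translation.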
AltiVec units also face competition from other AltiVec units, since, as
this article has noted, not all AltiVec units are created equal. The G4, G4+, and G5
each have different performance characteristics. The original G4 needs the
fewest cycles per instruction, but can run fewer instructions at a time;
the G5 needs the most cycles per instruction, but can run more at a time.
There are a number of additional complexities involved, but in general,
for a given clock speed, the G5 will get the most work done on
well-pipelined code. Ironically, however, badly pipelined code may run in
fewer cycles on an older processor. This doesn't mean that you actually
get better performance on the older processor, though. The newer
processors have uniformly higher clock speeds.
AltiVec could also be compared to other SIMD architectures. The x86
world has provided us with a great number of these, including MMX, MMX2,
3DNow!, SSE, SSE2, and SSE3. The MMX instruction set, dating back to the
days of 166 MHz Pentium processors, was carefully designed to require no
changes to operating system software. To this end, it shared registers
with the floating-point unit. Unfortunately, this made it impossible for a
program to use MMX acceleration and floating point at the same time, which
limited the usefulness of the original MMX instruction set.
More recent efforts, such as SSE2, are somewhat better. SSE2 provides
eight registers, which are not shared with the floating point unit. SSE2
does have 64-bit floating point types, which is a plus. However,
AltiVec's selection of instructions is more complete, and most of them
work from two registers into a third, letting the processor perform
moderately complicated vector operations entirely in registers, without
touching memory until the final data is ready to come out. This, and the
larger pool of registers, favors deeply pipelined operations that can come
close to saturating the processor's multiple execution units. AltiVec
still wins.
The competing 3DNow! instruction set, developed by AMD, has some
features of MMX and some of SSE, and has gone through a few revisions of
its own. However, since only AMD's chips use it, it's not as widely
supported. To add insult to injury, some programs that test for the
availability of 3DNow! support may generate false positives and try to use
these instructions on processors that don't support them, such as the
Pentium 4M. This has made support for these extensions less common than it
might otherwise be.
This highlights one of the real advantages that AltiVec has over the
various SIMD instruction sets available for x86 processors: its
comparative stability. Every AltiVec processor since the original G4 has
had the same essential functionality, the same large register pool that
isn't shared with anything, and a reasonably complete set of likely
operations. This has made it easier for support to become widespread: a
program designed to take advantage of the original G4 will still get a
noticeable performance improvement on today's G5.
Making use of AltiVec
Apple has been very aggressive about getting AltiVec optimizations into
core components of its operating system, to make sure that users feel
they're getting some benefit from it. Graphics applications on the Mac are
very likely to be AltiVec enhanced. The consistent architecture has made
the return on investment of AltiVec optimization quite good. Now that
every Mac shipped comes with an AltiVec processor, every user will benefit
if a program is able to make effective use of AltiVec for processing.
Any operating system can use AltiVec. As an example, consider the TCP
checksum algorithm (see Resources for an article
on vectorizing this), which chews up a substantial number of cycles on a
heavily loaded server, and which can usually be sped up dramatically using
AltiVec. This operation happens in the operating system, not in specific
network applications -- but all applications relying on the network
benefit from the performance boost.
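For reference, here is a plain scalar version of the checksum, following the ones' complement sum described in RFC 1071. It is not the vectorized code from the article in Resources, only a sketch of the work being sped up. The function name inet_checksum is hypothetical, and the 32-bit accumulator assumes inputs no longer than about 128 KB.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar Internet checksum in the style of RFC 1071.
   Assumption: len is small enough (at most 128 KB) that the 32-bit
   accumulator cannot overflow before the final fold. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    /* Add the data as big-endian 16-bit words. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is padded with a zero byte. */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Because the additions can be regrouped freely before the final fold, several partial sums can be accumulated side by side, and that property makes the loop a natural fit for a vector unit.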
Even if you don't want to specifically target AltiVec, you can still
get some benefit from it. Automatic vectorization of code, while not up to
the best human optimizations, can produce a noticeable performance
improvement. There are both commercial and free products in this arena.
There is support for autovectorization in a branch of GCC, appropriately enough called
autovec-branch. Automatic vectorization provides a substantial
benefit: because the compiler is doing the vectorizing, the original code
is not dependent on a specific processor's vector execution model, so your
code can remain portable.
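The kind of loop an autovectorizer typically handles well looks something like the sketch below. The function name saxpy is conventional rather than taken from any of the tools mentioned, and the C99 restrict qualifiers reflect an assumption that the two arrays never overlap, which is the sort of promise a compiler generally needs before it can reorder the work.

```c
#include <stddef.h>

/* Portable C with no vector-specific code. The restrict qualifiers
   promise the compiler that x and y do not overlap, which leaves it
   free to process several elements per instruction. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether a given compiler actually vectorizes this loop depends on its version and options; the source itself stays the same either way.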
The next article in this series will look in more detail at getting
good performance out of AltiVec when programming it directly, using either
C or assembly. That gives you enough time to go out and get a machine with
AltiVec support!
Resources
- SIMD has working groups involved with various SIMD extensions, including AltiVec, MMX, and others.
- Check out the PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual.
- Apple's page about the Velocity Engine is a slightly buzzword-heavy description of the AltiVec variants used in Mac systems.
- Motorola recently spun off its chipmaking division into a separate company called Freescale. The Freescale site also has a page about AltiVec (pdf).
- A previous two-part developerWorks article, "TCP/IP checksum vectorization using AltiVec," by Ayal Zaks, Dorit Naishlos, and Daniel Citron, discussed TCP checksum vectorization using AltiVec; start with Part 1 (developerWorks, October 2004).
- A discussion of throughput vs. latency, on Apple's site, is of particular interest.
- Apple provides detailed performance information about the G4, G4+, and G5.
- Work is being done on auto-vectorization in GCC.
- Crescent Bay Software sells software to automatically vectorize C code.
- Apple's page on performance tools gives links to a number of useful tools, including simg4 and simg5.
- The GCC Wiki serves as a repository for information about GCC, with up-to-the-minute reports on status, useful tips, and everything else you might want.
- IBM Senior Processor Architect Peter Sandon discusses vector processing in the G5 in this interview (Ars Technica).
- "Save your code from meltdown using PowerPC atomic instructions," by Jonathan Rentzsch, gets into the gritty detail of PowerPC assembly code (developerWorks, November 2004).
- For more on the joys and dangers of writing code that directly accesses memory, check out "Data alignment on PowerPC," by Jonathan Rentzsch (developerWorks, February 2005).
About the author
Peter Seebach uses vector processing a lot, and is personally able to cook up to three eggs at once, making him something of an expert in the field. You can contact Peter at developerworks@seebs.plethora.net.