 | Level: Introductory Ayal Zaks, Researcher, IBM Dorit Naishlos, Researcher, IBM Daniel Citron, Researcher, IBM
02 Nov 2004 This article, the second in a two-part series, focuses on unrolling loops in ways that allow the AltiVec unit to execute code more efficiently and give the GCC compiler hints for automatically generating vectorized code from plain C. Part 1 in this series considered a simple vectorized version of
the TCP/IP checksum algorithm and proposed a possible enhancement. This article shows how the enhancements can be better unrolled and expanded to produce dramatically increased performance and
how compilers can produce this code with a little friendly hinting.
Equalizing chunk sizes
The comparison between the MOT and IBM schemes is somewhat unfair. MOT reads 32 bytes per iteration, whereas IBM reads only 16. Loop unrolling
doesn't help much because the accumulators force serialization between
iterations. An enhancement, named IBM2, reads and computes 32 bytes per
iteration:
Listing 1: IBM algorithm, revised (IBM2)
while (vector_count-- >0) {
VD1 = *((vector ushort*)data);
VD2 = *((vector ushort*)(data+16));
data += 32;
V_Sum1 = vec_msum(VD1,One,V_Sum1);
V_Sum2 = vec_msum(VD2,One,V_Sum2);
}
|
Figure 1: MOT2 vs. IBM2 vector schemes on a G5: 32 bytes are processed each iteration into two distinct accumulator vectors
This scheme outperforms MOT for every buffer size. However, now IBM2
has an unequal advantage, the data is summed into two distinct
accumulators.
This leads to the following improvement in MOT (MOT2):
Listing 2: Motorola's algorithm, revised (MOT2)
while (vector_count-- >0) {
VD1 = *((vector uint*)data);
VD2 = *((vector uint*)(data+16));
data += 32;
V_CC = vec_addc(VD1,V_Sum);
V_Sum = vec_add(VD1,V_Sum);
VCAR = vec_add(V_CC,VCAR);
V_CC2 = vec_addc(VD2,V_Sum2);
V_Sum2 = vec_add(VD2,V_Sum2);
VCAR2 = vec_add(V_CC2,VCAR2);
}
|
Every 16-byte chunk is accumulated into a separate sum and carry pair.
Even so, IBM2 outperforms it on both a G4 and G5 by an average of 25%, as Figure 1 shows. This trend continues even beyond buffer sizes of 32KB
(not shown in graph), so the distinction isn't in memory processing but
rather in the execution pipeline. The IBM2 scheme can pipe two vec_msum instructions without causing any bubbles in
the execution pipe. The MOT scheme tries to pipe six instructions, some of
them dependent, through the execution pipeline. The pipeline "chokes" and
stalls.
Vectorization limits
The study in the previous section indicates that it might be possible
to process even more data per iteration and accelerate performance by
using additional independent accumulators. The following two schemes named
MOT4 and IBM4 process 64 bytes per iteration:
Listing 3: More revisions (IBM4 and MOT4)
/* MOT4 */
while (vector_count-- >0) {
VD1 = *((vector uint*)data);
VD2 = *((vector uint*)(data+16));
VD3 = *((vector uint*)(data+32));
VD4 = *((vector uint*)(data+48));
data += 64;
V_CC = vec_addc(VD1,V_Sum);
V_Sum = vec_add(VD1,V_Sum);
VCAR = vec_add(V_CC,VCAR);
/* perform 3 more accumulations */
}
/* IBM4 */
while (vector_count-- >0) {
VD1 = *((vector ushort*)data);
VD2 = *((vector ushort*)(data+16));
VD3 = *((vector ushort*)(data+32));
VD4 = *((vector ushort*)(data+48));
data += 64;
V_Sum1 = vec_msum(VD1,One,V_Sum1);
/* perform 3 more accumulations */
}
|
Figure 2: MOT4 vs. IBM2 and IBM4 vector schemes on a G5: 64 bytes are processed each iteration into four distinct accumulator vectors. Results are normalized to MOT2
Figure 2 normalizes the performance of MOT4 and IBM4 to MOT2. Unrolling
the Motorola scheme isn't worthwhile and doesn't improve performance at
all. On the other hand, unrolling the IBM scheme provides over a 100%
performance boost until the L1 cache is saturated, and then the speedups
slowly degenerate until 1.5. The four, independent, vec_msum operations fit neatly into the five stage
Vector Complex Integer Unit (VCIU).
If so, it is possible that more unrolling can unearth even more
performance. We manually unrolled the algorithm until we had 256 bytes
processed per iteration into 16 distinct accumulators. Figure 3 shows the
speedups relative to the basic IBM scheme. The upper graph is in buffer
increments of 16 bytes and the lower of 256 bytes. Obviously, for small
buffer sizes adding more than four accumulators is detrimental to
performance. The IBM4 scheme yields the best performance in ranges until
256 and then matches the higher unrolled schemes until 1KB buffer sizes.
At buffer size of over 1KB, IBM4 is comparable to the schemes that use
6, 8, and even 12 accumulators. However, the IBM16 scheme shows that the
potential of unrolling the accumulators is as high as the number of vector
registers available (32).
Figure 3: Unrolling of the IBM scheme into 2, 4, 6, 8, 12, and 16 accumulators per iteration (results are normalized to the basic IBM scheme)
Checksum autovectorization
Another advantage of the IBM vectorization scheme is that it can be
applied automatically by a vectorizing compiler. Automatic vectorization
of loops consists of analysis and transformation stages. The analysis
stage checks that the loop is amenable for vectorization. This primarily
requires proving that operations from different iterations can run in
parallel without violating the semantics of the program. In the following
transformation phase, scalar operations are replaced with their equivalent
vector form, and the loop bound is appropriately modified.
For the scalar checksum loops, most of the vectorization properties
required are relatively easy to establish:
- The control-flow structure of the loop is simple enough: an
inner-most, single-basic-block loop, whose iteration count can be
evaluated before the loop starts to execute.
- It is simple to determine the access pattern of the memory references
in the loop. Furthermore, these memory accesses are consecutive, and
require no special data manipulations.
- Proving that there are no data dependences between memory accesses in
the loop is trivial, as there is only a single memory reference in the
loop -- the read from memory through the data pointer.
The properties indeed simplify some of the aspects of
autovectorization; however there are other features in the loop that
present some difficulties. The accumulation into the variable sum creates a dependence between the write into sum in one iteration, and the read from sum in the subsequent iteration. Such dependence
cycles inhibit vectorization in general, but can be dealt with in the case
of accumulation using reduction pattern recognition.
Reduction operations compute a scalar result from a set of data
elements; they are generally vectorized by computing partial results
(partial sums in this case) in parallel, and combining them at the end into
a single result. To do this, the compiler needs to recognize the reduction
pattern, create the proper prologue to initialize the partial sums and
epilogue code to reduce them into a single result. Vectorizing the
particular reduction pattern that appears in the checksum loop is further
complicated by the presence of multiple data types. The data elements
operated upon are short, which means that eight elements will be loaded
per iteration (using AltiVec). However, the accumulation is performed on
int data types, only four of which can fit into
an AltiVec vector. Promotion or demotion of data types in vector code
requires packing or unpacking of data elements between different vectors,
which are relatively complex and generally not supported by vectorizing
compilers. However, as demonstrated before, on AltiVec this entire pattern
can be supported by the vec_msum operation --
the intermediate int sums of four pairs of
short data elements are added to four int
values and stored as a second vector of four int. To be able to use this instruction, a compiler
needs to recognize the entire pattern and map it to the vec_msum operation. The MOT scheme, on the other
hand, requires explicit handling of multiple types and 64-bit
accumulators, and is therefore less amenable to automatic
vectorization.
Using the techniques above, a compiler can autovectorize the checksum
loop. It remains to be examined whether the compiler can also apply the
more optimized IBM2 vectorization scheme, which involves unrolling the
loop while introducing multiple vector accumulators. This transformation
is in fact a general technique called modulo accumulator expansion or
simply variable expansion, and can be applied to loops with an
accumulation or induction pattern, whether in scalar or vector form, by
the instruction scheduler. The fact that this optimization can be applied
by a later compilation pass independent of the autovectorizer makes the
IBM2 scheme feasible to incorporate into a compiler, and allows it to
generate the most efficient vector code.
We have been developing the vectorization optimization in GCC. The
basic infrastructure is in place to support initial vectorization
capabilities. These capabilities include some of the features described
above -- pattern recognition and pointer handling. Work is under way to
extend these capabilities and to introduce more advanced vectorization
features, including the handling of reduction and multiple-data types.
Summary
This article analyzes the performance potential of vectorizing the
TCP/IP checksum calculation using the IBM AltiVec instruction set.
Straightforward vectorization as exemplified by Motorola's scheme of
accumulating 32-bit values into sum and carry accumulators (due to the
lack of 64-bit computing) does not fully utilize the AltiVec unit. The
large number of vec_add instructions that are
required quickly "stuffs" the Vector Simple Instruction Unit (VSIU)
pipeline. Unrolling the loop to use independent accumulators does nothing
to alleviate the problem.
On the other hand, the scheme of using the vec_msum instruction, neatly fills the VCIU pipeline
and can scale even up to 16 independent accumulators (which uses all 32
vector registers available) for buffer sizes larger than 1KB. The IBM2
scheme (32 bytes per iteration) beats the comparable MOT2 scheme by 25%,
and the IBM4 scheme (64 bytes per iteration) outperforms the MOT4 scheme
by over 100%.
From the compiler's perspective, the IBM scheme is more amenable to
autovectorization due to its simpler structure. The MOT scheme requires
mapping of each scalar operation into two vector operations (for sum and
carry), which complicates the vectorization process.
This study has uncovered several limitations and requirements of the
AltiVec unit:
- Support for 64-bit arithmetic (or even just the storing of 64-bit
results) would simplify checksum type computations.
- An unsigned and modulo version of the
vsum4shs (Vector Sum Across Partial (1/4) Signed Half
Word Saturate) instruction is missing and could possibly enhance the
checksum algorithm. However, implementing checksum with the vec_sum4s intrinsic instead of the vec_msum intrinsic, yielded wrong results but no
performance improvement. Both instructions are executed by the VCIU and
suffer from the same latency.
- The current pipeline structure that enables issuing only one
arithmetic instruction per cycle is an obstacle to performance gains.
Adding a VSIU to the permute pipeline would be beneficial to the MOT
family of algorithms.
The bottom line is that successful vectorization must utilize the
richness of the AltiVec instruction set and use non-obvious solutions.
Resources - Read Part 1 of this series on scalar optimizations and simple vectorization (developerWorks, October 2004).
- Apple maintains an AltiVec
Instruction Cross-Reference.
- Learn more about automatic
vectorization in GCC.
- While the MPC7441
is end-of-lifed, it was used in drafting this article. It's the earlier
revision of the "G4".
- Further discussion of loop unrolling and vector accumulators may be
found in Advanced
Compiler Design and Implementation, by Seven S. Muchnick. Page
562.
- Autovectorization was the topic of a talk by D. Naishlos at the 2004
GCC Developers' Summit.
- You can learn more about RFC 791, the original IP
protocol specification.
- The PowerPC 970 section of the IBM Microelectronics Technical Library offers background information, documentation, and application notes for the 9xx family of chips.
- This article refers to information taken from TCP/IP Illustrated,
Volume 1: The Protocols. The author's page has additional
information.
- This article refers to information taken from TCP/IP Illustrated,
Volume 2: The Implementation. The author's page has additional
information.
- Learn more about AltiVec from the PowerPC
Microprocessor Family: AltiVec Technology Programming Environments
Manual.
- This article refers to Jacob Pan's Enhanced TCP/IP Performance
technical report, published by Motorola in 2003. It doesn't seem to be
available online (any longer?), but the author gave a presentation
on Accelerating Networking Data Movement Using AltiVec Technology at
the Smart Networks Developer Forum 2003, which is available.
- Have experience you'd be willing to share with Power Architecture zone
readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside
IBM are welcomed. Check out the Power Architecture author
FAQ to learn more.
- Have a question or comment on this story, or
on Power Architecture technology in general?
Post it in the Power Architecture technical forum
or send in a letter to the editors.
-
Get a subscription to the Power Architecture Community Newsletter when
you Join the Power Architecture community.
- All things Power are chronicled in the developerWorks Power
Architecture editors' blog, which is just one of many developerWorks
blogs.
- Find more articles and resources on Power Architecture
technology and all things
related in the developerWorks Power
Architecture technology content area.
- Download a IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated
environment, or just to explore the fully licensed version of
Power Architecture technology.
About the authors  | |  | Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations. |
 | |  | Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations. |
 | |  | Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations. |
Rate this page
|  |