Skip to main content

skip to main content

developerWorks  >  Power Architecture technology  >

From the stacks: TCP/IP checksum vectorization using AltiVec, Part 2

Scalar optimizations and simple vectorization

developerWorks
Document options

Document options requiring JavaScript are not displayed


Rate this page

Help us improve this content


Level: Introductory

Ayal Zaks, Researcher, IBM
Dorit Naishlos, Researcher, IBM
Daniel Citron, Researcher, IBM

02 Nov 2004

This article, the second in a two-part series, focuses on unrolling loops in ways that allow the AltiVec unit to execute code more efficiently and give the GCC compiler hints for automatically generating vectorized code from plain C.

Part 1 in this series considered a simple vectorized version of the TCP/IP checksum algorithm and proposed a possible enhancement. This article shows how the enhancements can be better unrolled and expanded to produce dramatically increased performance and how compilers can produce this code with a little friendly hinting.

Equalizing chunk sizes

The comparison between the MOT and IBM schemes is somewhat unfair. MOT reads 32 bytes per iteration, whereas IBM reads only 16. Loop unrolling doesn't help much because the accumulators force serialization between iterations. An enhancement, named IBM2, reads and computes 32 bytes per iteration:


Listing 1: IBM algorithm, revised (IBM2)

while (vector_count-- >0) {
    VD1 = *((vector ushort*)data);
    VD2 = *((vector ushort*)(data+16));
    data += 32;
    V_Sum1 = vec_msum(VD1,One,V_Sum1);
    V_Sum2 = vec_msum(VD2,One,V_Sum2);
}


Figure 1: MOT2 vs. IBM2 vector schemes on a G5: 32 bytes are processed each iteration into two distinct accumulator vectors
Figure 1: MOT2 vs. IBM2 vector schemes on a G5: 32 bytes are processed each iteration into two distinct accumulator vectors

This scheme outperforms MOT for every buffer size. However, now IBM2 has an unequal advantage, the data is summed into two distinct accumulators.

This leads to the following improvement in MOT (MOT2):


Listing 2: Motorola's algorithm, revised (MOT2)

while (vector_count-- >0) {
    VD1 = *((vector uint*)data);
    VD2 = *((vector uint*)(data+16));
    data += 32;
    V_CC = vec_addc(VD1,V_Sum);
    V_Sum = vec_add(VD1,V_Sum);
    VCAR = vec_add(V_CC,VCAR);
    V_CC2 = vec_addc(VD2,V_Sum2);
    V_Sum2 = vec_add(VD2,V_Sum2);
    VCAR2 = vec_add(V_CC2,VCAR2);
}

Every 16-byte chunk is accumulated into a separate sum and carry pair. Even so, IBM2 outperforms it on both a G4 and G5 by an average of 25%, as Figure 1 shows. This trend continues even beyond buffer sizes of 32KB (not shown in graph), so the distinction isn't in memory processing but rather in the execution pipeline. The IBM2 scheme can pipe two vec_msum instructions without causing any bubbles in the execution pipe. The MOT scheme tries to pipe six instructions, some of them dependent, through the execution pipeline. The pipeline "chokes" and stalls.

Vectorization limits

The study in the previous section indicates that it might be possible to process even more data per iteration and accelerate performance by using additional independent accumulators. The following two schemes named MOT4 and IBM4 process 64 bytes per iteration:


Listing 3: More revisions (IBM4 and MOT4)

/* MOT4 */
while (vector_count-- >0) {
    VD1 = *((vector uint*)data);
    VD2 = *((vector uint*)(data+16));
    VD3 = *((vector uint*)(data+32));
    VD4 = *((vector uint*)(data+48));
    data += 64;
    V_CC = vec_addc(VD1,V_Sum);
    V_Sum = vec_add(VD1,V_Sum);
    VCAR = vec_add(V_CC,VCAR);
    /* perform 3 more accumulations */
}

/* IBM4 */
while (vector_count-- >0) {
    VD1 = *((vector ushort*)data);
    VD2 = *((vector ushort*)(data+16));
    VD3 = *((vector ushort*)(data+32));
    VD4 = *((vector ushort*)(data+48));
    data += 64;
    V_Sum1 = vec_msum(VD1,One,V_Sum1);
    /* perform 3 more accumulations */
}


Figure 2: MOT4 vs. IBM2 and IBM4 vector schemes on a G5: 64 bytes are processed each iteration into four distinct accumulator vectors. Results are normalized to MOT2
Figure 2: MOT4 vs. IBM2 and IBM4 vector schemes on a G5: 64 bytes are processed each iteration into four distinct accumulator vectors. Results are normalized to MOT2

Figure 2 normalizes the performance of MOT4 and IBM4 to MOT2. Unrolling the Motorola scheme isn't worthwhile and doesn't improve performance at all. On the other hand, unrolling the IBM scheme provides over a 100% performance boost until the L1 cache is saturated, and then the speedups slowly degenerate until 1.5. The four, independent, vec_msum operations fit neatly into the five stage Vector Complex Integer Unit (VCIU).

If so, it is possible that more unrolling can unearth even more performance. We manually unrolled the algorithm until we had 256 bytes processed per iteration into 16 distinct accumulators. Figure 3 shows the speedups relative to the basic IBM scheme. The upper graph is in buffer increments of 16 bytes and the lower of 256 bytes. Obviously, for small buffer sizes adding more than four accumulators is detrimental to performance. The IBM4 scheme yields the best performance in ranges until 256 and then matches the higher unrolled schemes until 1KB buffer sizes. At buffer size of over 1KB, IBM4 is comparable to the schemes that use 6, 8, and even 12 accumulators. However, the IBM16 scheme shows that the potential of unrolling the accumulators is as high as the number of vector registers available (32).


Figure 3: Unrolling of the IBM scheme into 2, 4, 6, 8, 12, and 16 accumulators per iteration (results are normalized to the basic IBM scheme)
Figure 3: Unrolling of the IBM scheme into 2, 4, 6, 8, 12, and 16 accumulators per iteration (results are normalized to the basic IBM scheme)

Checksum autovectorization

Another advantage of the IBM vectorization scheme is that it can be applied automatically by a vectorizing compiler. Automatic vectorization of loops consists of analysis and transformation stages. The analysis stage checks that the loop is amenable for vectorization. This primarily requires proving that operations from different iterations can run in parallel without violating the semantics of the program. In the following transformation phase, scalar operations are replaced with their equivalent vector form, and the loop bound is appropriately modified.

For the scalar checksum loops, most of the vectorization properties required are relatively easy to establish:

  • The control-flow structure of the loop is simple enough: an inner-most, single-basic-block loop, whose iteration count can be evaluated before the loop starts to execute.
  • It is simple to determine the access pattern of the memory references in the loop. Furthermore, these memory accesses are consecutive, and require no special data manipulations.
  • Proving that there are no data dependences between memory accesses in the loop is trivial, as there is only a single memory reference in the loop -- the read from memory through the data pointer.

The properties indeed simplify some of the aspects of autovectorization; however there are other features in the loop that present some difficulties. The accumulation into the variable sum creates a dependence between the write into sum in one iteration, and the read from sum in the subsequent iteration. Such dependence cycles inhibit vectorization in general, but can be dealt with in the case of accumulation using reduction pattern recognition.

Reduction operations compute a scalar result from a set of data elements; they are generally vectorized by computing partial results (partial sums in this case) in parallel, and combining them at the end into a single result. To do this, the compiler needs to recognize the reduction pattern, create the proper prologue to initialize the partial sums and epilogue code to reduce them into a single result. Vectorizing the particular reduction pattern that appears in the checksum loop is further complicated by the presence of multiple data types. The data elements operated upon are short, which means that eight elements will be loaded per iteration (using AltiVec). However, the accumulation is performed on int data types, only four of which can fit into an AltiVec vector. Promotion or demotion of data types in vector code requires packing or unpacking of data elements between different vectors, which are relatively complex and generally not supported by vectorizing compilers. However, as demonstrated before, on AltiVec this entire pattern can be supported by the vec_msum operation -- the intermediate int sums of four pairs of short data elements are added to four int values and stored as a second vector of four int. To be able to use this instruction, a compiler needs to recognize the entire pattern and map it to the vec_msum operation. The MOT scheme, on the other hand, requires explicit handling of multiple types and 64-bit accumulators, and is therefore less amenable to automatic vectorization.

Using the techniques above, a compiler can autovectorize the checksum loop. It remains to be examined whether the compiler can also apply the more optimized IBM2 vectorization scheme, which involves unrolling the loop while introducing multiple vector accumulators. This transformation is in fact a general technique called modulo accumulator expansion or simply variable expansion, and can be applied to loops with an accumulation or induction pattern, whether in scalar or vector form, by the instruction scheduler. The fact that this optimization can be applied by a later compilation pass independent of the autovectorizer makes the IBM2 scheme feasible to incorporate into a compiler, and allows it to generate the most efficient vector code.

We have been developing the vectorization optimization in GCC. The basic infrastructure is in place to support initial vectorization capabilities. These capabilities include some of the features described above -- pattern recognition and pointer handling. Work is under way to extend these capabilities and to introduce more advanced vectorization features, including the handling of reduction and multiple-data types.



Back to top


Summary

This article analyzes the performance potential of vectorizing the TCP/IP checksum calculation using the IBM AltiVec instruction set. Straightforward vectorization as exemplified by Motorola's scheme of accumulating 32-bit values into sum and carry accumulators (due to the lack of 64-bit computing) does not fully utilize the AltiVec unit. The large number of vec_add instructions that are required quickly "stuffs" the Vector Simple Instruction Unit (VSIU) pipeline. Unrolling the loop to use independent accumulators does nothing to alleviate the problem.

On the other hand, the scheme of using the vec_msum instruction, neatly fills the VCIU pipeline and can scale even up to 16 independent accumulators (which uses all 32 vector registers available) for buffer sizes larger than 1KB. The IBM2 scheme (32 bytes per iteration) beats the comparable MOT2 scheme by 25%, and the IBM4 scheme (64 bytes per iteration) outperforms the MOT4 scheme by over 100%.

From the compiler's perspective, the IBM scheme is more amenable to autovectorization due to its simpler structure. The MOT scheme requires mapping of each scalar operation into two vector operations (for sum and carry), which complicates the vectorization process.

This study has uncovered several limitations and requirements of the AltiVec unit:

  • Support for 64-bit arithmetic (or even just the storing of 64-bit results) would simplify checksum type computations.
  • An unsigned and modulo version of the vsum4shs (Vector Sum Across Partial (1/4) Signed Half Word Saturate) instruction is missing and could possibly enhance the checksum algorithm. However, implementing checksum with the vec_sum4s intrinsic instead of the vec_msum intrinsic, yielded wrong results but no performance improvement. Both instructions are executed by the VCIU and suffer from the same latency.
  • The current pipeline structure that enables issuing only one arithmetic instruction per cycle is an obstacle to performance gains. Adding a VSIU to the permute pipeline would be beneficial to the MOT family of algorithms.

The bottom line is that successful vectorization must utilize the richness of the AltiVec instruction set and use non-obvious solutions.



Resources



About the authors

Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.


Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.


Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.




Rate this page


Please take a moment to complete this form to help us better serve you.



YesNoDon't know
 


 


12345
Not
useful
Extremely
useful
 


Back to top