 | Level: Introductory Ayal Zaks, Researcher, IBM Dorit Naishlos, Researcher, IBM Daniel Citron, Researcher, IBM
26 Oct 2004 This two-part article demonstrates the kinds of performance gains AltiVec can produce on the TCP/IP checksum, or on code similar to it. It gives special attention both to instructions that help improve performance, and to general unrolling and scheduling techniques. The net result? Performance increased by a factor of four. One of the key components in the TCP/IP protocol stack is the checksum
computation, which ensures the integrity of the transferred data. This
computation can be greatly accelerated with the use of single instruction,
multiple data (SIMD) units prevalent in state-of-the-art processors. In
particular, the AltiVec unit of the IBM® PowerPC® architecture is
well-suited for this type of computation. This article analyzes a former
vectorization effort, shows how it can be improved upon, and then enhances
it further. For packets of up to 64KB, an average speedup of around 4.0
is obtained. The second part of this series also examines the performance
gains obtained by hand coding the algorithm and discusses the issues that
must be resolved for it to be autovectorized by the compiler.
The checksum field is part of the TCP header structure, and when a packet
arrives at any final, or intermediate, destination a new checksum is
calculated and compared to the checksum field. Any discrepancies indicate
that the transferred data has been violated in some way. The checksum
field is the 16-bit one's complement of the one's complement sum of all
16-bit words in the header and text. This lends itself to data level
parallelization (DLP) that can be supported by the AltiVec unit in the IBM
PowerPC processor line.
This work isn't the first attempt at vectorizing the checksum algorithm
-- Motorola published a white paper that does exactly this on the AltiVec
unit of its MPC7440 processor (see Resources).
However, we propose to vectorize the algorithm to take full potential of
the AltiVec unit on the PowerPC 970 processor. Based on the knowledge
obtained during this study, we will suggest a technique for having the
compiler autovectorize the checksum algorithm. The main contributions of
the article are:
- A superior scheme for vectorizing the checksum algorithm
- A study of the limits of checksum vectorization
- Autovectorization based on the proposed scheme
Scalar checksum
This section describes the basic scalar TCP/IP checksum
algorithm and looks at techniques to enhance it. The final enhancement will
be compared to the vectorized versions presented in the following
sections.
Much has been said about TCP/IP and the role of the checksum integrity
check. Whether the check is sound, and whether it is a computational
bottleneck, are issues this article does not resolve.
The focus is on delivering the highest performance possible for it.
The basic checksum algorithm as described in TCP/IP Illustrated, Volume
2: The Implementation receives a buffer of unsigned chars, accumulates
16-bit half-words into a 32-bit accumulator, and finally takes the one's
complement of the one's complement sum. This implies that the test isn't
sound beyond 64KB, which is the theoretical limit of an IP packet. The
protocol specifies a practical upper limit of 576 bytes. Nevertheless, in
other systems (Microsoft Windows, for instance) there is no such limit, so
this article explores all buffer sizes until 64KB. Given that data transfer
optimization is one of the key components of the protocol, assume that
previous stages have aligned the buffer on 16-byte boundaries and padded
it with zeros where necessary.
The basic 16-bit version is:
Listing 1: The TCP/IP
checksum algorithm
ushort checksum16(uchar* data, int len)
{
uint sum = 0;
if ((len & 1) == 0)
len = len >> 1;
else
len = (len >> 1) + 1;
while (len > 0) {
sum += *((ushort*)data);
data += sizeof(ushort);
len--;
}
sum = (sum >> 16) + (sum & 0xffff);
sum += (sum >> 16);
return(~sum);
}
|
This example assumes that the buffer is padded with a zero
if its length isn't even (in all future code snippets, the length
calculation will be omitted). This code can be potentially improved upon
by accumulating 32-bit words into a 64-bit double word -- reducing the
number of memory reads and additions by half (this scheme only works
correctly for big-endian machines).
Listing 2: Accumulating 32 bits at a
time
ushort checksum32 (uchar *data, int len) {
unsigned long long sum = 0;
len = len >> 2;
while (len > 0) {
sum += *((uint *)data);
data += sizeof(uint);
--len;
}
while (sum > 0xffffffffULL) {
sum = (sum & 0xffffffffULL) + (sum >> 32);
}
sum = (sum & 0xffff) + (sum >> 16);
sum += (sum >> 16);
return (~sum);
}
|
The sum variable is defined as a long long data type in order to make sure the GCC
compiler recognizes it as a 64-bit value. Figure 1 shows the speedups of the 32-bit version on two different Macintosh computers, a G4 and a G5 (Table 1 summarizes their main characteristics.). On both machines,
the -O3 optimization ags were used (-mpowerpc64 was set on the G5), buffer sizes of up to
8KB (256 byte increments) were measured, and the checksums were run
1,000,000 times for each buffer size (which implies warm caches). The
performance measured is the user CPU time.
Figure 1: Comparison of 32- to 64-bit checksum computation on Macintosh G4 and G5 computers
The graphs show that no advantage is gained by using a 64-bit
accumulator on the G4 because the implementation of the long
long data type uses two registers and two adds per accumulation. On
the G5, which possesses true 64-bit computing, the speedup is immediately
apparent and constant at around 1.5.
Table 1. G4 and G5 main characteristics
| Attribute | G4 | G5 | | Processor | MPC 7441 | PowerPC 970 | | MHz | 700 | 1800 | | Word Size | 32 | 64 | | L1 DCache size | 32KB | 32KB | | L1 DCache line | 32B | 128B | | L2 Cache | 256KB | 512KB | | Compiler | GCC 3.1 | GCC 3.3 | | OS | Darwin 6.8 | Darwin 7.3 |
The resulting assembly code shows that loop unrolling isn't performed
by the compiler. Adding the -funroll-loops that
unrolls four iterations results in enhanced performance on both machines.
However, on the G5, both scalar versions are improved by 100%, where on
the G4 the 16-bit version is improved by 1.5, and the 32-bit version by
1.1. This means that on a G4 the 16-bit unrolled version is the scalar
baseline to consider.
Figure 2: The AltiVec unit on the PowerPC 970
Vector processing on the PowerPC architecture
This article uses the name AltiVec to denote both the
instruction set and implementing functional unit. The AltiVec unit
contains thirty-two 128-bit registers. Operations can be applied to 8-,
16-, and 32-bit signed and unsigned integer values, single precision
(32-bit) floating point (FP) values, and 16- and 32-bit pixel values.
Thus, each instruction performs 16, 8, or 4 operations. The instructions
are divided into five groups that are executed by different functional
units. Figure 2 shows a block-level diagram of the AltiVec unit on the PowerPC 970 processor. Although there are four distinct functional units, only two instructions can be issued per cycle, and one is to the permute
unit. All units are fully pipelined. Table 2 shows the units, instruction groups, and latencies. A complete description of the instruction set is available in PowerPC Microprocessor Family: AltiVec Technology Programming Environments Manual (see Resources).
Table 2. Latencies of AltiVec instructions on the G4 and G5.
| Unit | Instructions | G4 lat. | G5 lat. | | VSIU | ALU | 1 | 2 | | VCIU | mul, div, sum, max | 4 | 5 | | VFPU | FP | 5 | 8 | | VPERM | permute | 2 | 2 | | LSU | load, store | 3 | 4 |
Load/Store instructions are handled by the processor's Load/Store Unit
(LSU).
Hand coding in high-level languages is simple with the use of
intrinsics, which are macros that enable direct insertion of AltiVec
instructions into C code. All the following code examples will be written
in C using intrinsics. For example, the following code snippet adds two
vectors containing four signed integers each:
Listing 3: Vector add in C
vector int a,b,c;
c = vec_add(a,b);
|
Translates into:
Listing 4: Vector add in assembly
If necessary, a,b will be loaded from
memory:
Listing 5: Vector add with load
lvx v0,0,r3 #r3 contains &a
lvx v1,16,r3
|
Checksum vectorization
A previously published vectorized checksum algorithm was by Motorola on
a MPC7455 processor. Eight int objects (32
bytes total) are read each cycle, added in pairs, and then accumulated
into sum and carry accumulators. The main loop of the MOT algorithm is
(all variables are defined as vector unit):
Listing 6: Motorola's algorithm (MOT)
while (vector_count-- >0) {
VD1 = *((vector uint*)data);
VD2 = *((vector uint*)(data+16));
data += 32;
V_CC = vec_addc(VD1,VD2);
V_Temp = vec_add(VD1,VD2);
V_CS = vec_addc(V_Temp,V_Sum);
V_Sum = vec_add(V_Temp,V_Sum);
VCAR = vec_add(V_CC,VCAR);
VCAR = vec_add(V_CS,VCAR);
}
|
The absence of 64-bit operations implies that 32-bit addition results
must be stored in two registers, resulting in an additional four vector
add instructions used to manipulate these carries. Thus, two vector loads
and six vector adds are run for every 32 bytes of data. This technique is
naive on two counts: it doesn't try to use more powerful AltiVec
instructions; and it doesn't use the fact that the TCP/IP buffer limit is
64KB.
Figure 3: Diagrams of the vsum4shs and vsumsuhm vector instructions
The AltiVec instruction set offers two other options that can reduce
the number of instructions (Figure 3 displays diagrams of them.):
vsum4shs vD,vA,vB Vector Sum Across Partial (1/4) Signed Half Word Saturate
The four pairs of half-words in vA are added together and then added to
the four words in vB. The result is stored in vD. This instruction is
exactly what you need, eight 16-bit values are accumulated into four 32-bit
values. However, there are two catches: the results saturate when reaching
2^31-1, and the additions are signed. The first obstacle can be overcome
if you assume that the number of data items is smaller than 32KB, the
second obstacle results in an incorrect checksum.
vmsumuhm vD,vA,vB,vC Vector Multiply Sum Unsigned Half Word Modulo
Each of the eight half-words in vA and vB are multiplied together, the
corresponding eight 32-bit results are added in pairs and then added to
the four words in vC. The result is stored in vD. At first glance, this
instruction is totally irrelevant to the problem: there is no need in
multiplication and the 32-bit adds will overflow. However, if vB is a
vector set to contain only 1s, the ensuing products won't overflow when
added together, nor will the accumulator register vC if the number of
additions is smaller than 64KB (which is the case in TCP/IP checksum
computation).
By using the vec_msum intrinsic (that
implements the vmsumuhm instruction), every 16 bytes are processed by one
vector load and one complex vector instruction. This algorithm will be
denoted as the IBM scheme (VD1 and One are of type vector ushort, V Sum is
a vector uint):
Listing 7: IBM's algorithm (IBM)
while (vector_count-- >0) {
VD1 = *((vector ushort*)data);
data += 16;
V_Sum = vec_msum(VD1,One,V_Sum);
}
|
We compared the algorithms above to the best scalar version on a G4.
Three levels of granularity are displayed: 16-byte, 256-byte and 1-KB
increments. The two vector versions were unrolled as well in order to
provide a fair comparison (although the unrolled versions provide a
maximal speedup of 10% and a slight slowdown when the buffer sizes are
small). You can conclude the following from Figure 4:
- The IBM scheme matches the scalar scheme for a 32-byte buffer size,
the MOT scheme matches only after 64 bytes.
- The simpler IBM scheme outperforms MOT due to its shorter and simpler
loop (can start computation after 16 bytes are read) and epilogue (fewer
long long adds). At around 1KB, MOT finally
outperforms it due to fewer iterations.
- Until 32KB, MOT slightly outperforms IBM, but then the trend is
reversed. The knee in the graph is probably an artifact of the data cache
size, which is 32KB; the buffer is flushed from the cache since the last
run. The cache line size is 32 bytes, thus the MOT scheme encounters a
cache miss every iteration. The IBM version encounters a cache miss only
on the even iterations.
Figure 4: MOT and IBM vector schemes vs.16-bit scalar on a G4 (all versions are unrolled)
A similar comparison was performed on a G5, where the 32-bit scalar
scheme was compared to the IBM and MOT schemes (all unrolled). The
speedups are displayed in Figure 5 (one graph, 256B increments until
64KB). The main difference is that MOT outperforms IBM faster and even
after the 32KB buffer size. This can be attributed to the larger line
sizes (128 bytes) of the G5 cache and the hardware prefetching
mechanism.
Figure 5: MOT and IBM vector schemes vs. 16-bit scalar on a G5 (all versions are unrolled)
Summary, and what lies
ahead
While the Motorola algorithm is perhaps naive, it works reasonably well
on sample data. In the next article in this series, you'll see how this
code can be better unrolled and vectorized, and how recent versions of GCC
can be persuaded to generate the right vectorized code automatically.
Resources - Read Part 2 of this series on scalar optimizations and simple vectorization (developerWorks, November 2004).
- Apple maintains an AltiVec
Instruction Cross-Reference.
- Learn more about automatic
vectorization in GCC.
- While the MPC7441
is end-of-lifed, it was used in drafting this article. It's the earlier
revision of the "G4".
- Further discussion of loop unrolling and vector accumulators may be
found in Advanced
Compiler Design and Implementation, by Seven S. Muchnick. Page
562.
- Autovectorization was the topic of a talk by D. Naishlos at the 2004
GCC Developers' Summit.
- You can learn more about RFC 791, the original IP
protocol specification.
- The PowerPC 970 section of the IBM Microelectronics Technical Library offers background information, documentation, and application notes for the 9xx family of chips.
- This article refers to information taken from TCP/IP Illustrated,
Volume 1: The Protocols. The author's page has additional
information.
- This article refers to information taken from TCP/IP Illustrated,
Volume 2: The Implementation. The author's page has additional
information.
- Learn more about AltiVec from the PowerPC
Microprocessor Family: AltiVec Technology Programming Environments
Manual.
- This article refers to Jacob Pan's Enhanced TCP/IP Performance
technical report, published by Motorola in 2003. It doesn't seem to be
available online (any longer?), but the author gave a presentation
on Accelerating Networking Data Movement Using AltiVec Technology at
the Smart Networks Developer Forum 2003, which is available.
- Have experience you'd be willing to share with Power Architecture zone
readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside
IBM are welcomed. Check out the Power Architecture author
FAQ to learn more.
- Have a question or comment on this story, or
on Power Architecture technology in general?
Post it in the Power Architecture technical forum
or send in a letter to the editors.
-
Get a subscription to the Power Architecture Community Newsletter when
you Join the Power Architecture community.
- All things Power are chronicled in the developerWorks Power
Architecture editors' blog, which is just one of many developerWorks
blogs.
- Find more articles and resources on Power Architecture
technology and all things
related in the developerWorks Power
Architecture technology content area.
- Download a IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated
environment, or just to explore the fully licensed version of
Power Architecture technology.
About the authors  | |  | Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations. |
 | |  | Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations. |
 | |  | Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations. |
Rate this page
|  |