Level: Introductory
PowerPC Applications Engineering team, Engineers, IBM
30 Sep 2005
The IBM PowerPC® 970FX processor is a superscalar design with multiple pipelined execution units. A well-optimized compiler and a tuned operating system are important for getting the most performance from the processor. Beyond those, here are six additional things you can do to improve performance.
Welcome to the Power Architecture Community PowerPC processor tip. The latest tips will also be highlighted in the e-mail version of the Power Architecture Community Newsletter, so if you haven't already, you might want to take time to subscribe.
Use data cache streams
The PowerPC 970FX contains a data prefetch engine capable of prefetching data along eight different streams. With hardware-controlled prefetching, the processor automatically looks for patterns of cache misses and allocates a data prefetch stream when certain criteria are met. Prefetching continues on a demand basis until a 4KB page boundary is reached or until the stream is reallocated. Refer to the 970FX User's Manual for details on the hardware prefetch engine.
A certain amount of overhead is associated with the hardware prefetch mechanism. For applications where the programmer knows the data access pattern in advance (a large table walk, for example), prefetching can be initiated by using one of the two dcbt variants. When using one of these variants, it is often more efficient to initiate a stream starting with an address two cache lines beyond the current data address.
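As a rough illustration of prefetching two cache lines ahead during a table walk, here is a minimal sketch in C. It assumes GCC, whose `__builtin_prefetch` is typically lowered to a dcbt touch on PowerPC targets, and it assumes 128-byte cache lines. It uses the plain touch form rather than the stream-hint variant, and the function name `sum_table` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 128-byte cache lines, prefetch distance of two lines. */
#define CACHE_LINE_BYTES 128u
#define PREFETCH_AHEAD   (2u * CACHE_LINE_BYTES)

/* Hypothetical helper: sums a large table while hinting upcoming lines. */
uint64_t sum_table(const uint32_t *table, size_t count)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    uint64_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        /* Once per cache line, hint the address two lines ahead.
           Integer arithmetic avoids forming an out-of-bounds pointer;
           the prefetch itself does not fault on an invalid address. */
        if (i % per_line == 0) {
            uintptr_t ahead = (uintptr_t)&table[i] + PREFETCH_AHEAD;
            __builtin_prefetch((const void *)ahead, 0, 3);
        }
        sum += table[i];
    }
    return sum;
}
```

Whether this beats the hardware prefetcher depends on the access pattern and the work done per element, so measure before keeping it.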
Use aligned code
Aligned code lets you take full advantage of the 970FX fetch and instruction dispatch mechanisms. In general, loop starting addresses should be aligned on 32-byte boundaries. You should also align the targets of unconditional branches and function calls (aligning the function return address has no benefit). Most compilers offer options to align code (gcc's -falign-loops=32, for example), or you can use .align if you are programming in assembly language. Aligning code increases the size of the compiled image, so use alignment judiciously.
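To align a single call target rather than the whole image, GCC also accepts an alignment attribute on individual functions. The sketch below assumes GCC; the routine `checksum` is a hypothetical stand-in for a hot function.

```c
#include <stdint.h>

/* Hypothetical hot routine. aligned(32) asks GCC to place the function
   entry on a 32-byte boundary, in line with the guidance for call targets. */
__attribute__((aligned(32)))
uint32_t checksum(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}
```

Marking only the routines that profiling shows to be hot keeps the growth in image size small.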
Use branch prediction
Even the best compilers can't always predict data-driven conditional branches. For circumstances where a branch can be reliably predicted (loop control, checking for exception conditions, and so on), use static branch prediction. To take full advantage of branch prediction using the link stack, use the link register exclusively for function calls and returns.
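In C, one common way to pass this kind of knowledge to the compiler is GCC's `__builtin_expect`. The sketch below assumes GCC; the `LIKELY`/`UNLIKELY` macros and the function `sum_checked` are illustrative names, and whether the compiler turns the hint into static prediction bits or only into a different code layout depends on the target and options.

```c
#include <stddef.h>

/* Illustrative wrappers around GCC's branch-probability hint. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: returns the sum of buf, or -1 on the rare error path. */
long sum_checked(const int *buf, size_t n)
{
    /* Exception check: expected to be false almost always. */
    if (UNLIKELY(buf == NULL))
        return -1;

    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    return total;
}
```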
On a related note, C language conditional expressions (for example, z = (a < b) ? a : b;) are difficult for compilers to predict. It is typically better to use more inline code than risk a mispredicted branch.
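One way to trade a data-dependent branch for a few extra inline instructions is a mask-based select. This is a minimal sketch, not a recommendation for every case; the name `min_branchless` is hypothetical, and the compiler may already generate branch-free code for the conditional expression, so check the generated assembly.

```c
#include <stdint.h>

/* Returns the smaller of a and b without a conditional branch in the source. */
static inline int32_t min_branchless(int32_t a, int32_t b)
{
    /* mask is all ones when a < b, all zeros otherwise. */
    int32_t mask = -(int32_t)(a < b);

    /* With mask all ones: b ^ (a ^ b) == a. With mask zero: b ^ 0 == b. */
    return b ^ ((a ^ b) & mask);
}
```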
Use large pages
The 970FX supports both 4KB and 16MB page sizes. Using a large page size will reduce the number of TLB (translation look-aside buffer) misses and can make a big difference in performance. Each entry in the Effective to Real Address Translation tables (ERATs) can only translate 4KB of address space, so there will be ERAT misses (a 16MB page would require 4096 ERAT entries), but a miss in the ERAT with a TLB hit will only take four to five cycles to replace. A TLB miss would at best require a load from the cache, and might require a fetch of the TLB entry from the in-memory page table.
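The 4096 figure follows directly from the two sizes quoted above. A small sketch of the arithmetic, using hypothetical names:

```c
#include <stdint.h>

#define ERAT_ENTRY_BYTES (4u * 1024u)          /* each ERAT entry maps 4KB */
#define LARGE_PAGE_BYTES (16u * 1024u * 1024u) /* one 16MB large page */

/* Number of 4KB ERAT translations needed to cover `bytes` of address space. */
static inline uint32_t erat_entries_for(uint32_t bytes)
{
    return (bytes + ERAT_ENTRY_BYTES - 1u) / ERAT_ENTRY_BYTES;
}
```

Calling `erat_entries_for(LARGE_PAGE_BYTES)` gives 4096, while the same 16MB region needs only a single large-page TLB entry.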
Avoid microcoded instructions
Microcoded instructions, such as load/store multiple, were designed to save space in compiled code and offer no performance advantage over using multiple instructions. Because of the way these instructions are handled inside the processor, they might have a greater latency and take longer to execute than a sequence of individual instructions that produces the same results. Some compilers have options that prevent generation of these instructions (gcc's -mno-multiple and -mno-string, for example).
Understand dcbz
The Data Cache Block Clear to Zero (dcbz) instruction sets to zero all bytes in the block that contains the byte addressed by the effective address. In the 970FX, the block size can be either 32 or 128 bytes and is selected by bits 56 and 57 of the HID5 register. Regardless of the HID5 bit settings, when a dcbz is encountered in the code, an entire 128-byte line is fetched from memory. This will probably cause a line to be cast out (and written to memory if dirty). If operating in 32-byte mode, the 32 bytes in the block containing the target address are set to zero. If another dcbz is encountered with an effective address in the same cache line (such as in a loop stepping through memory in 32-byte increments), the line must be fetched again. In this memory-walk example, the same line would be fetched four times, even though the data would already be in the cache 75% of the time.
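The cost of the memory-walk example can be put in numbers with a simple model. The sketch below is not processor code; it only counts, under the behavior described above (one 128-byte line fetch per dcbz), how many fetches a walk performs and how many of them hit a line the previous dcbz had already fetched. The names `dcbz_stats` and `model_dcbz_walk` are hypothetical.

```c
#include <stddef.h>

#define LINE_BYTES 128u /* cache line size fetched by every dcbz */

typedef struct {
    size_t fetches;   /* total line fetches, one per dcbz */
    size_t redundant; /* fetches of the line the previous dcbz touched */
} dcbz_stats;

/* Models a walk over region_bytes with one dcbz every block_bytes. */
dcbz_stats model_dcbz_walk(size_t region_bytes, size_t block_bytes)
{
    dcbz_stats s = {0, 0};
    size_t last_line = (size_t)-1; /* no line fetched yet */

    for (size_t addr = 0; addr + block_bytes <= region_bytes; addr += block_bytes) {
        size_t line = addr / LINE_BYTES;
        s.fetches++;
        if (line == last_line)
            s.redundant++;
        last_line = line;
    }
    return s;
}
```

For a 4KB region, 32-byte mode gives 128 fetches of which 96 (75%) are redundant, while 128-byte mode gives 32 fetches and none redundant.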
Looking for a tip?
We don't call the races, but if you have requests for future PowerPC processor tips or requests for additional information, Application Briefs, or Application Notes, log on to the PowerPC processor tips forum and let us know what would be most useful! Not promising anything, but if we can help, we will.
Summary
This tip is really six tips in one, all designed to improve PowerPC 970FX performance beyond what a well-optimized compiler and a well-tuned operating system already provide.