PowerPC processor tips: Improve PowerPC 970FX performance

Six tips for improving performance on the 970FX

Level: Introductory

PowerPC Applications Engineering team, Engineers, IBM

30 Sep 2005

The IBM PowerPC® 970FX processor is a superscalar design with multiple, pipelined execution units. A well-optimized compiler and a tuned operating system are important for getting the most performance from the processor. Beyond those two, here are six additional things you can do to improve performance.

Welcome to the Power Architecture Community PowerPC processor tip. The latest tips will also be highlighted in the e-mail version of the Power Architecture Community Newsletter, so if you haven't already, you might want to take time to subscribe.

Use data cache streams

The PowerPC 970FX contains a data prefetch engine capable of prefetching data along eight different streams. With hardware-controlled prefetching, the processor automatically looks for patterns of cache misses and will allocate a data prefetch stream when certain criteria are met. Prefetching will continue on a demand basis until a 4KB page boundary is reached, or until the stream is reallocated. Refer to the 970FX User's Manual for details on the hardware prefetch engine (see Resources).

A certain amount of overhead is associated with the hardware prefetch mechanism. For applications where the programmer knows the data access pattern in advance (a large table walk, for example), prefetching can be initiated by using one of the two dcbt variants. When using one of these variants, it is often more efficient to initiate a stream starting with an address two cache lines beyond the current data address.
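As a rough illustration of software-initiated prefetching in a table walk, the sketch below uses the GCC/Clang builtin __builtin_prefetch, which GCC typically lowers to a plain dcbt on PowerPC. It does not use the stream-starting dcbt variants mentioned above (those would need inline assembly or compiler-specific support). The function name sum_table, the 128-byte line size constant, and the table layout are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 128u  /* assumed data cache line size */
#define PREFETCH_LINES   2u    /* start two lines ahead, as the tip suggests */

/* Hypothetical table walk: sums the table while hinting upcoming lines. */
uint64_t sum_table(const uint32_t *table, size_t count)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(table[0]);
    const size_t ahead = PREFETCH_LINES * per_line;
    uint64_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        /* Once per line's worth of elements, touch the data two lines ahead.
           The prefetch is only a hint, so it never changes the result. */
        if (i % per_line == 0 && i + ahead < count)
            __builtin_prefetch(&table[i + ahead], 0 /* read */, 3 /* keep cached */);
        sum += table[i];
    }
    return sum;
}
```

Whether this beats the hardware prefetcher depends on the access pattern and should be measured on the target system.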



Use aligned code

Aligned code will let you take full advantage of the 970FX fetch and instruction dispatch mechanisms. In general, loop starting addresses should be aligned on 32-byte boundaries. You should also align the targets of unconditional branches and function calls (aligning the function return address has no benefit). Most compilers offer options to align code (gcc's -falign-loops=32, for example); if you are programming in assembly language, use .align. Aligning code increases the size of the compiled image, so use alignment judiciously.
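For a single hot function, alignment can also be requested in the source instead of globally. The sketch below assumes a GCC-compatible compiler and uses its aligned function attribute to place the function entry on a 32-byte boundary. The function checksum is a made-up example, and the attribute aligns only the entry point: loops inside the function still rely on an option such as -falign-loops=32.

```c
#include <stddef.h>
#include <stdint.h>

/* Request a 32-byte aligned entry point for this (hypothetical) hot function. */
__attribute__((aligned(32)))
uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc = acc * 31u + buf[i];  /* simple rolling hash, illustration only */
    return acc;
}
```

Applying the attribute per function keeps the padding cost limited to the code that benefits from it.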



Use branch prediction

Even the best compilers can't always predict data-driven conditional branches. For circumstances where a branch can be reliably predicted (loop control, checking for exception conditions, and so on), use static branch prediction. To take full advantage of branch prediction using the link stack, use the link register exclusively for function calls and returns.
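In C, one common way to pass this kind of knowledge to the compiler is the GCC/Clang builtin __builtin_expect. The sketch below marks an exception check as unlikely. How the compiler uses the hint (code layout, PowerPC branch hint bits, or both) varies by compiler and version, so treat this as an assumption to verify in the generated assembly. The function accumulate and the likely/unlikely macros are names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums the samples. Returns 0 on success, or -1 if any sample exceeds
   the limit, which is treated as the rare, exceptional case. */
int accumulate(const int32_t *samples, size_t count, int32_t limit,
               int64_t *total)
{
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        if (unlikely(samples[i] > limit))
            return -1;              /* rarely taken error path */
        sum += samples[i];
    }
    *total = sum;
    return 0;
}
```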

On a related note, C language conditional expressions (z = (a < b) ? a : b;, for example) are difficult for compilers to predict. It is typically better to use more inline code than risk a mispredicted branch.
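One reading of "more inline code" is straight-line arithmetic that avoids the branch altogether. The sketch below computes the same minimum as the conditional expression above using a mask instead of a branch. It is an illustration under the assumption of 32-bit two's-complement integers, not a claim about what any particular compiler emits. The name min_branchless is made up for this example.

```c
#include <stdint.h>

/* Branch-free equivalent of: z = (a < b) ? a : b; */
int32_t min_branchless(int32_t a, int32_t b)
{
    /* mask is all ones when a < b, otherwise zero */
    uint32_t mask = (uint32_t)-(int32_t)(a < b);
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;

    /* selects ua when mask is all ones, ub when mask is zero */
    return (int32_t)(ub ^ ((ua ^ ub) & mask));
}
```

A modern compiler may already generate branch-free code for the conditional expression, so compare the generated assembly and timings before adopting a hand-written form.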



Use large pages

The 970FX supports both 4KB and 16MB page sizes. Using a large page size will reduce the number of TLB (translation look-aside buffer) misses and can make a big difference in performance. Each entry in the Effective to Real Address Translation tables (ERATs) can only translate 4KB of address space, so there will be ERAT misses (a 16MB page would require 4096 ERAT entries), but a miss in the ERAT with a TLB hit will only take four to five cycles to replace. A TLB miss would at best require a load from the cache, and might require a fetch of the TLB entry from the in-memory page table.
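The arithmetic behind these numbers can be written down in a few lines. The helper below is only a counting sketch: entries_needed and the constant names are invented for the example, and the page and ERAT granularities are the ones stated above.

```c
#include <stdint.h>

#define KB (1024ull)
#define MB (1024ull * KB)

#define SMALL_PAGE_BYTES (4 * KB)   /* 4KB page */
#define LARGE_PAGE_BYTES (16 * MB)  /* 16MB page */
#define ERAT_ENTRY_BYTES (4 * KB)   /* each ERAT entry translates 4KB */

/* Number of translation entries needed to map `bytes`, rounded up. */
uint64_t entries_needed(uint64_t bytes, uint64_t bytes_per_entry)
{
    return (bytes + bytes_per_entry - 1) / bytes_per_entry;
}
```

For instance, entries_needed(LARGE_PAGE_BYTES, ERAT_ENTRY_BYTES) gives the 4096 ERAT entries quoted above, while a hypothetical 64MB working set needs 16384 TLB entries with 4KB pages but only 4 with 16MB pages.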



Avoid microcoded instructions

Microcoded instructions, such as load/store multiple, were designed to save space in compiled code and offer no performance advantage over using multiple instructions. Because of the way these instructions are handled inside the processor, they might have a greater latency and take longer to execute than a sequence of individual instructions that produces the same results. Some compilers have options that prevent generation of these instructions (gcc's -mno-multiple, for example, suppresses the load/store multiple instructions).



Understand dcbz

The Data Cache Block Clear to Zero (dcbz) instruction sets to zero the bytes in the block that contains the byte addressed by the effective address. In the 970FX, the block size can be either 32 or 128 bytes and is selected by bits 56 and 57 of the HID5 register. Regardless of the HID5 bit settings, when a dcbz is encountered in the code, an entire 128-byte line is fetched from memory. This will probably cause a line to be cast out (and written to memory if dirty). If operating in 32-byte mode, the 32 bytes in the block containing the target address are set to zero. If another dcbz is encountered with an effective address in the same cache line (such as in a loop stepping through memory in 32-byte increments), the line must be fetched again. In this memory-walk example, the same line would be fetched four times, even though for three of those four fetches (75% of the time) the data would already be in the cache.
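The 75% figure follows from simple counting, which the sketch below reproduces. It is a model of the behavior described above, not a measurement: it assumes the region starts on a 128-byte boundary and that every dcbz triggers one full line fetch. All function names are made up for the example.

```c
#include <stddef.h>

#define LINE_BYTES 128u  /* cache line size fetched by every dcbz */

/* Line fetches issued when zeroing `bytes` with one dcbz per block. */
size_t dcbz_line_fetches(size_t bytes, size_t block_bytes)
{
    return (bytes + block_bytes - 1) / block_bytes;
}

/* Fetches that would have sufficed: one per distinct 128-byte line. */
size_t distinct_lines(size_t bytes)
{
    return (bytes + LINE_BYTES - 1) / LINE_BYTES;
}

/* Percentage of fetches that re-fetch a line already in the cache. */
unsigned redundant_fetch_percent(size_t bytes, size_t block_bytes)
{
    size_t fetches = dcbz_line_fetches(bytes, block_bytes);
    if (fetches == 0)
        return 0;
    return (unsigned)(100 * (fetches - distinct_lines(bytes)) / fetches);
}
```

Zeroing a 4KB region in 32-byte mode issues 128 fetches for 32 distinct lines, so redundant_fetch_percent(4096, 32) evaluates to 75, while the same walk in 128-byte mode has no redundant fetches.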



Looking for a tip?

We don't call the races, but if you have requests for future PowerPC processor tips or requests for additional information, Application Briefs, or Application Notes, log on to the PowerPC processor tips forum and let us know what would be most useful! Not promising anything, but if we can help, we will.



Summary

This tip is really six tips in one, all designed to improve PowerPC 970FX performance beyond what a well-optimized compiler and a well-tuned operating system already provide.



Resources

Learn

Get products and technologies

Discuss


About the author

The PowerPC processor tip appears in this space on the first and fifteenth of every month. To receive a reminder each time a new tip is posted, subscribe to the Power Architecture Community Newsletter.



