 | Level: Introductory Neil Leeder (neileedr@us.ibm.com), Software Engineer, IBM
10 Feb 2005 The PowerPC 405 and 440 processors each have optimization rules to be used when creating assembler code. This document describes a concise set of rules which cover the most common cases.
Introduction
The code optimization rules for the PowerPC 405 and 440 processors are different, but it is often useful to write one set of code which will perform well on both CPU types. One problem when trying to write optimized code is that the rules can be complex
and span many page of documentation.
The rules given here apply to code designed to run well on both 405 and 440. Clearly some of the rules will only produce a benefit on one of the processors, but by following the rules code
will run optimally on either processor. Additional details can be found in the User Manuals for the respective processors, for example the 405GP and 440GP.
Four main rules
The four main optimizing rules are covered in the
following sections, and are:
- Instruction pairing
- Load/Use dependencies
- Operand dependencies
- Compare and branch
Instruction Pairing
On the 440, instructions fall into one of three
categories. The most commonly occurring
instructions are highlighted:
Special categories:
- Loads and Stores
- Any of these:
- Instructions which set a bit in the CR; compares and dot forms of instructions
- Branches
- Multiply and Divide
- SPR accesses, system instructions (sc, rfi),
XER-updating "o forms"
General categories:
- Anything else, primarily non-CR, non-XER
updating arithmetic and logical instructions
The 440 processor can dispatch two instructions per cycle, provided that the two instructions are not from the same special category. If there are two adjacent instructions from the same Special Category, the second one will wait until at least the cycle after the first one has dispatched. 405 and 440 processors only dispatch instructions in-order, so having to wait results in a wasted cycle in the CPU. It’s therefore useful to mix instructions from different categories to allow
pairs to be dispatched together.
Load/Use dependencies
When data is loaded into a register from cache, it takes several cycles before that data is available to be used by another instruction. An instruction which uses data that’s been loaded cannot be dispatched until the third cycle after
the load. Therefore it’s most efficient to place other instructions between the data load and its use. How many instructions are needed is dependent on the type of instructions, and whether they can be paired for dispatching together. A maximum of five instructions are needed: this consists of one instruction which will be paired with the load, then two the next cycle, and two the cycle after that. The "use" instruction can then dispatch on the next instruction, see Table 1.
Table 1. Load/use - 5 separating instructions needed
|
Instructions
|
Cycle
|
Comment
|
| lwz
r4,0(r5) | n | Load dispatches | | addi
r6,r6,0x20 | n | addi dispatches in same cycle as load | | or r7,r8,r9 | n+1 | | | stw r6,32(r1) | n+1 | | | subf r10,r11,r12 | n+2 | | | srawi r8,r8,4 | n+2 | | | add r9,r4,r12 | n+3 | Load data in r4 is available for use 3 cycles after the load instruction |
Depending on the instruction mix, as few as two instructions may be inserted between the load and the use. This occurs when the following two instructions cannot be paired with any other instructions. This leads to a performance
degradation due to the non-pairing of instructions, but by including the two instructions between the load and use there is no additional penalty. See Table 2. Note here that the stw and lbz instructions in cycles n+1 and N+2 cannot pair with the other loads and stores because they are all part of the same "loads and stores" special category.
Table 2. Load/use - 2 separating instructions needed
|
Instructions
|
Cycle
|
Comment
|
| lwz r4,0(r5) | n | Load dispatches | | stw r6,32(r1) | n+1 | Store must wait until cycle n+1 | | lbz r11,20(r1) | n+2 | Load must wait until cycle n+2 | | lwz r9,0(r4) | n+3 | Load data in r4 is available. Load must wait until cycle n+3 to be dispatched - no additional
delay is incurred for waiting for r4 data. |
Although most load operations only have one target register, be aware that the "update" instructions, such as lwzu, also update a source register. This updated register will not be available to be used until the load instruction completes,
so be careful to apply the same use rules to it as to the target register.
Operand dependencies
Two instructions cannot be dispatched in the same cycle if the first one updates a register which is used by the second instruction. So it’s useful to separate instructions like this by at least one other instruction. In Table 3 the srawi instruction uses r4 which is updated by the preceding addi instruction. The srawi has to wait until the next cycle before it
dispatches, leaving a dead cycle where nothing dispatches in parallel with the addi. Eventually the subf instruction executes 2 cycles after the addi.
Table 3. Operand dependency: wasting a cycle
|
Instructions
|
Cycle
|
Comment
|
| addi
r4,r5,r6 | n | | | n | Dead cycle - nothing executes here | | srawi r7,r4,4 | n+1 | Shift must wait for r4 data - cannot dispatch in same cycle as addi | | lwz r8,0(r1) | n+1 | load can pair with shift in same cycle | | subf r9,r10,r11 | n+2 | subf executes 2 cycles after addi |
By moving the srawi instruction after the lwz, the lwz can dispatch in the same cycle as the addi, and there is no delay in the srawi instruction which dispatches in the following cycle, as shown in Table 4. The subf executes one
instruction after the addi, showing that a cycle has been saved.
Table 4. Operand dependency: saving a cycle
|
Instructions
|
Cycle
|
Comment
|
| addi r4,r5,r6 | n | | | lwz r8,0(r1) | n | load can pair with addi in same cycle | | srawi r7,r4,4 | n+1 | Shift uses r4 data - does not have to wait | | subf r9,r10,r11 | n+1 | subf executes 1 cycle after addi |
Compare and Branch
CR (Condition Register) -updating instructions should be separated from branch instructions by one instruction. CR-updating instructions are usually compares or the "dot form" of arithmetic and logical instructions (but also includes the rarer mcrf, mcrxr, mtcrf). (It’s possible to have best performance with no instructions separating them, but this is only in the situation where you know that the CR-setting instruction will be paired with the preceding instruction (for 440), or the branch is correctly predicted (for 405). Often this is difficult to guarantee, so the best rule is to separate by one instruction.)
Techniques for analyzing optimized code
It can often be useful to trace a segment of code as it runs, then observe the number of cycles taken by each instruction. This can be a useful tool in spotting trouble spots which are slowing down execution. One method of observing code performance is by using the RISCTrace tool, part of the IBM RISCWatch debugger. This shows all the
instructions executed, and keeps track of the number of cycles used. The usual method of using RISCTrace effectively is as follows. Create a source file containing the code you wish to analyze. It is often helpful to create a scaffold
routine which can set up the environment (registers, data areas, stack etc.) that the test code is expecting. The scaffold code typically sets up its environment, branches to the test code and ends with an instruction which branches to itself
(a spin loop). This spin loop prevents execution of the code from "running off the end" of the loaded file. Of course, if you have a complete run-time environment available you may run your code without the need for scaffolding.
Compile and link the code you wish to analyze along with the scaffold code, and load it onto the target system by using the RISCWatch "load file" command. Make a note of the entry point of the code, you will need it later. Make sure that
you are set up to use hardware breakpoints (use RISCWatch menus Source -> Breakpoints then select "Hardware BPs", or use the command "bpmode hardware"). Set a breakpoint on the spin loop instruction at the end of the scaffold
code. Usually you can open the assembly debug window (Hardware -> Asm debug) and scroll through the scaffold code until you see the final branch instruction, then click on it to set the breakpoint.
The next step is to ensure the code and data you use are loaded into cache. Check that caching is enabled for the areas your code is loaded into: in 405 real mode check the ICCR and DCCR registers, in 405 virtual mode or on the 440 check the relevant TLB entries to ensure that the I (caching inhibited) bit is not set. Click "run" on the assembly debug window. Your code should run and stop at the hardware breakpoint you set. At this point all code and data should be in cache. Now we will run it a second time, this time tracing it.
Set the instruction pointer back to the original entry point you noted earlier, by entering it into the "Set IAR" field in the assembly debug window. Open the trace window (Hardware -> Trigger/ trace). Set the "Cycles Before Trigger" field
to the maximum number of cycles you wish to trace - usually a value of 1000 to 10000 is good. Set the "Cycles After Trigger" to 0. Click "Run Trace". Your code will execute, stop at the hardware breakpoint again, and produce a trace listing.
When looking at the trace, the "Cycles/Instruction" column provides an indication of any pipeline stalls. For 405, ideally all the entries in this column should be 1, indicating that the instruction executed in 1 cycle (clearly some instructions, such as divide, are designed to take more than one cycle to execute, and this does not indicate a performance problem). Instructions taking more than one cycle should be investigated to see if reordering them using the rules above can improve their execution speed.
Sample 405 RISCTrace output
# Instr Total Cycle/
# Count Cycle Instr Address Opcode Disassembly
# ------ -------- ------- -------- -------- -----------
00000086 00000085 0000001 00040158 813E0000 lwz R9,0x00000000(R30)
00000087 00000086 0000001 0004015C 57AB103A rlwinm R11,R29,2,0,29
00000088 00000088 0000002 00040160 7CAB482E lwzx R5,R11,R9
00000089 00000089 0000001 00040164 813E0004 lwz R9,0x00000004(R30)
00000090 00000092 0000003 00040168 7CCB482E lwzx R6,R11,R9
00000091 00000093 0000001 0004016C 813E0008 lwz R9,0x00000008(R30)
00000092 00000096 0000003 00040170 7CEB482E lwzx R7,R11,R9
|
Figure 5 shows the trace output from RISCTrace on a 405 (some columns omitted for space reasons). In the 3rd column, "Cycle/Instr", you can see that some instructions take more than one cycle to execute. For example, the third instruction, lwzx R5,R11,R9 takes an extra cycle because it has to wait for the load of R9 (in line 1) to complete
before it can dispatch. Similarly, the two other lwzx instructions each take 3 cycles, because they use R9 and must wait for the previous instruction to complete its load of that register. In the first case, there is an instruction, rlwinm, between the load of R9 and its subsequent use, so the using instruction only waits one extra cycle, thereby taking a total of two cycles to complete. There are no instructions between the load and use in the last two cases, so the maximum
penalty of 3 cycles is incurred. In this case, it would be advantageous to investigate surrounding code to see if other instructions could be placed between the load and use instructions to do useful work while the use instructions are waiting for the load to complete.
For 440, multiple instructions can complete in a
cycle, and things are further complicated by the
fact that the 440 reports the instructions which
have completed in a 4-cycle block. Up to 8
instructions can complete in a single 4-cycle
period. The RISCTrace output shows the start of
a 4-cycle block by placing a 4 in the
"Cycles/Instruction" column, followed by 0’s
for all following instructions which completed
in the same 4-cycle block. Ideally, a total of 8
instructions should complete in a 4-cycle block.
Fewer instructions indicate that the area of code
might be worth investigating for optimization.
Useful Hints
On small code fragments, an empty trace file
might be produced. Trace requires a minimum
number of cycles to execute before it can produce
results. To ensure enough cycles are executed,
start your code with a small loop, for
example loading a small value into the CTR register
and decrement it, test it and loop using the
bdnz instruction. The value that you load into
CTR is dependent on how long your test code is.
Notice that all commands you issue in RISCWatch
are saved in its main command line window,
and may be re-executed by double-clicking
on them. This can be useful when running
the same (or updated) code multiple times.
The sequence of operations:
- load file
- set breakpoint
- run
- reset instruction pointer
- trace
can be quickly re-executed by double-clicking
the commands in the main window.
No-ops considered harmful
When testing or measuring the performance of a
fragment of code, it is often useful to have the
processor start in a steady-state, so that preceding
code does not influence the execution of the
code being tested.
If a preceding instruction takes an unpredictable
number of instructions to execute (for example,
a load which resulted in a cache miss, a mis-predicted
branch, or a multiply where the values of
the operands determine the time taken), it is difficult
to determine if an observed pipeline delay
is caused by the preceding instructions or the
code being measured. In order to make sure that
the number of cycles used by the test code is
reported accurately, it is typical to execute a
known sequence of instructions so the actual
time used by the test instructions can be determined. The usual instructions used are a series
of no-op (no operation) instructions, or nops.
The nop instruction is an extended mnemonic
for the instruction "ori r0,r0,0". This is a simple
single-cycle operation which does not alter
any of the machine state - register r0 is written
with its previous contents. In the 405, a
sequence of nops executes at the maximum processor
speed and guarantees no pipeline stalls,
or "bubbles". However, in the 440 this is not the
case. The nop instruction is both a producer and
consumer of r0. When two adjacent nop instructions
are executed, the second one cannot be
paired to execute the same cycle as the first,
because of the dependency on the shared register
r0, as described in Operand Dependencies.
This leads to a pipeline "bubble". If you need a
series of nop-like instructions which will fill the
pipeline and execute at maximum speed, you
should consider alternating two instructions,
using separate registers. An example of a second
no-op instruction would be "ori r2,r2,0". A
sequence of pipeline-filling no-ops would be:
nop
ori r2,r2,0
nop
ori r2,r2,0
|
Conclusions
Although there are many code optimization
details documented for the 405 and 440 processor
families, by following the basic rules
described here you can gain most of the optimization
benefits available, and have code which
will perform well on both processor families.
By using the RISCWatch trace tool, you can
verify that critical code is performing optimally.
Resources
- Download the source code used in this article.
-
The Linux Documentation Project is a repository of Linux documentation including documents about individual software, HOWTO documents, FAQs, and more.
- "Build a better GUI" (developerWorks, October 2001) discusses the use of Java layout managers for better overall GUI design.
-
Practical UNIX & Internet Security
(O'Reilly & Associates; 1996), by Garfinkel and Spafford, is an excellent reference on all aspects of system security from user management to drafting a security policy.
About the author  | |  | Neil Leeder works in the PowerPC enablement group within IBM Microelectronics. He has worked in PowerPC embedded software development since 1993, and was part of the team which created OS Open, the first real-time operating system for embedded PowerPC processors. He played a key role in the development of IBM's software evaluation kits for the 405GP and 440GP processors. Neil has a degree in Computing and Information Systems from the University of Manchester in England. You can contact Neil at neileedr@us.ibm.com. |
Rate this page
|  |