Level: Intermediate
STG Systems Performance team (dwpower@us.ibm.com), Development, IBM
25 Apr 2006 Make substantial improvements in performance analysis with a CPI analysis model built on the tools introduced in Part 1. Learn ways to analyze the specific performance counter data produced by profiling runs to obtain statistics for events which the CPU cannot directly report on.
As Part 1 explained, the POWER5™ has six PMU counters -- two are dedicated to counting PowerPC® instructions completed and total cycles, and the four other counters are available to count other events.
The POWER5 is capable of completing one group per cycle (or up to five instructions per cycle). Some PowerPC instructions are expanded into multiple internal operations (IOPs) during the decode stage and may span more than one group. The POWER5 PMU can count the total number of groups completed as well as the number of groups that contained at least one PowerPC instruction. The difference is the number of additional groups required due to expansion. A simplistic view of CPI is to break it into a base component, when the processor is completing work (groups completed), and a stall component, when the processor is not completing instructions (total cycles - groups completed). The stall component can be further divided into cycles when the pipeline was empty (GCT empty) and cycles when the pipeline was not empty but completion was stalled (stall - GCT empty).
These divisions, and the multiple counters available, make it possible to analyze performance more precisely, by comparing related statistics to derive some of the performance characteristics the processor cannot directly measure.
Table 1. CPI breakdown model
| Label | Component | How obtained |
|---|---|---|
| | Total cycles (# cycles) | measured |
| A | Completion cycles (group complete cycles) | measured |
| A1 | PowerPC Base completion cycles (one or more PowerPC instructions completed this cycle) | measured |
| A2 | Overhead of cracking/microcoding and grouping restriction | (A) - (A1) |
| B | Completion Table empty (GCT empty) cycles | measured |
| B1 | I-cache miss penalty | measured |
| B2 | Branch redirection (branch misprediction) penalty | measured |
| B4 | Others (flush penalty, etc.) | (B) - (B1) - (B2) |
| C | Completion Stall cycles | total - (A) - (B) |
| C1 | Stall by LSU inst | measured |
| C1A | Stall by reject | measured |
| C1A1 | Stall by translation (rejected by ERAT miss) | measured |
| C1A2 | Other reject | (C1A) - (C1A1) |
| C1B | Stall by D-cache miss | measured |
| C1C | Stall by LSU basic latency, LSU flush penalty | (C1) - (C1A) - (C1B) |
| C2 | Stall by FXU inst | measured |
| C2A | Stall by any form of DIV/MTSPR/MFSPR inst | measured |
| C2C | Stall by FXU basic latency | (C2) - (C2A) |
| C3 | Stall by FPU inst | measured |
| C3A | Stall by any form of FDIV/FSQRT inst | measured |
| C3B | Stall by FPU basic latency | (C3) - (C3A) |
| C4 | Others (stall by BRU/CRU inst, flush penalty (except LSU flush), etc.) | (Completion Stall cycles) - (C1) - (C2) - (C3) |
Table 1 represents a CPI breakdown model in which the total cycles of a workload are divided into three components: Completion cycles, Completion Table empty (GCT empty) cycles, and Completion Stall cycles. The base completion cycles are the number of cycles that would be needed if grouping were perfect. Cycles in which no group completes are stalls, and they are attributed to either Completion Table empty or Completion Stall cycles. A Completion Table empty condition occurs when no groups are completing on a given cycle because of either an I-cache miss or a branch misprediction. Completion Stall cycles are stalls caused by any of the following instruction classes: LSU, FXU, FXU long (all forms of div, mtspr, mfspr), FPU, and FPU long (all forms of fsqrt, fdiv); or by other events such as a D-cache miss, a reject, or a reject caused by translation (ERAT miss).
The events are hierarchical. Load/Store Unit (LSU) stalls will include Dcache miss and Rejects. Reject stalls will include Translation stalls due to Effective to Real Address Translation (ERAT) miss. This method only reports the last condition to clear and will not detect dependency chains. For example, a fixed point add instruction that is dependent on a load in the same group that misses the Dcache will cause the completion stall condition to be reported as Fixed Point Unit (FXU). When multiple conditions are reported at the same time, Load/Store conditions are favored over all others, and Dcache miss is favored over Rejects. Stalls due to GCT empty are primarily caused by instruction cache misses and branch mispredicts. The POWER5 has speculative events for these conditions that are handled in the same way as completion stalls. Counting begins when the GCT goes empty and stops when at least one group is in the GCT.
The performance events used in Table 1 are defined as follows:
- Completion cycles = PM_GRP_CMPL
- Completion cycles = PM_INST_CMPL (SCOMx360 configured with bits 0-10 = 0 00 0000 0000)
- PowerPC Base Completion cycles = PM_PPC_CMPL (or PMC 5)
- Overhead of cracking/microcoding = Completion cycles - PowerPC Base Completion cycles
- Completion Table empty (GCT empty) = PM_GCT_EMPTY
- I-Cache Miss Penalty = PM_GCT_EMPTY_IC_MISS
- Branch Misprediction Penalty = PM_GCT_EMPTY_BR_MPRED
- Other GCT stalls = Completion Table empty (GCT empty) - I-Cache Miss Penalty - Branch Misprediction Penalty
- Completion Stall cycles = Total cycles - Completion cycles - Completion Table empty (GCT empty)
- Stall by LSU Instruction = PM_CMPLU_STALL_LSU
- Stall by LSU Reject = PM_CMPLU_STALL_REJECT
- Stall by LSU Translation Reject = PM_CMPLU_STALL_ERAT_MISS
- Stall by LSU Other = Stall by LSU Reject - Stall by LSU Translation Reject
- Stall by LSU Dcache miss = PM_CMPLU_STALL_DCACHE_MISS
- Stall by LSU basic latency = Stall by LSU Instruction - Stall by LSU Reject - Stall by LSU Dcache miss
- Stall by FXU Instruction = PM_CMPLU_STALL_FXU
- Stall by DIV, MTSPR, or MFSPR = PM_CMPLU_STALL_DIV
- Stall by FXU basic latency = Stall by FXU Instruction - Stall by DIV, MTSPR, or MFSPR
- Stall by FPU Instruction = PM_CMPLU_STALL_FPU
- Stall by FDIV or FSQRT = PM_CMPLU_STALL_FDIV
- Stall by FPU basic latency = Stall by FPU Instruction - Stall by FDIV or FSQRT
- Stall by others = Completion Stall cycles - Stall by LSU Instruction - Stall by FXU Instruction - Stall by FPU Instruction
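The subtractions in the definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tooling: the `MeasuredCycles` class, its field names, and the `derive_components` helper are hypothetical, the sample numbers are synthetic, and the sketch assumes every input has already been scaled to the same total cycle count (in practice the counters come from separate pmcount runs).

```python
from dataclasses import dataclass


@dataclass
class MeasuredCycles:
    """Directly measured cycle counts (hypothetical container).

    Assumption: all values are scaled to the same total cycle count.
    """
    total: float              # total cycles
    grp_cmpl: float           # (A)    Completion cycles
    ppc_base_cmpl: float      # (A1)   PowerPC Base Completion cycles
    gct_empty: float          # (B)    Completion Table empty
    ic_miss: float            # (B1)   I-Cache Miss Penalty
    br_mpred: float           # (B2)   Branch Misprediction Penalty
    stall_lsu: float          # (C1)   Stall by LSU Instruction
    stall_reject: float       # (C1A)  Stall by LSU Reject
    stall_erat_miss: float    # (C1A1) Stall by LSU Translation Reject
    stall_dcache_miss: float  # (C1B)  Stall by LSU Dcache miss
    stall_fxu: float          # (C2)   Stall by FXU Instruction
    stall_div: float          # (C2A)  Stall by DIV, MTSPR, or MFSPR
    stall_fpu: float          # (C3)   Stall by FPU Instruction
    stall_fdiv: float         # (C3A)  Stall by FDIV or FSQRT


def derive_components(m: MeasuredCycles) -> dict:
    """Return the calculated cells of Table 1, keyed by their labels."""
    completion_stall = m.total - m.grp_cmpl - m.gct_empty  # (C)
    return {
        "A2": m.grp_cmpl - m.ppc_base_cmpl,
        "B4": m.gct_empty - m.ic_miss - m.br_mpred,
        "C": completion_stall,
        "C1A2": m.stall_reject - m.stall_erat_miss,
        "C1C": m.stall_lsu - m.stall_reject - m.stall_dcache_miss,
        "C2C": m.stall_fxu - m.stall_div,
        "C3B": m.stall_fpu - m.stall_fdiv,
        "C4": completion_stall - m.stall_lsu - m.stall_fxu - m.stall_fpu,
    }


# Synthetic numbers chosen only to make the arithmetic easy to follow.
sample = MeasuredCycles(
    total=1000, grp_cmpl=200, ppc_base_cmpl=150, gct_empty=100,
    ic_miss=20, br_mpred=50, stall_lsu=400, stall_reject=60,
    stall_erat_miss=10, stall_dcache_miss=250, stall_fxu=120,
    stall_div=50, stall_fpu=80, stall_fdiv=5,
)
for label, cycles in derive_components(sample).items():
    print(label, cycles)
```

Keeping the measured inputs and the derived outputs in separate structures makes it harder to confuse a counter reading with a value that was obtained by subtraction.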
Table 1 shows the breakdown of CPI components and the events that are used to calculate the breakdown. The shaded cells are measured directly. The total cycles represent the aggregate CPI of the workload. The cycle values in each category are either measured by the performance counters or calculated from the observed values. As an example, the CPI cost in Completion Table empty <B> in Table 1 is broken down into three components: I-cache Miss Penalty <B1>, Branch redirection (Branch Misprediction Penalty) <B2>, and others (Flush Penalty, and so on) <B4: (B)-(B1)-(B2)>. <B1> and <B2> are measured values, and <B4> is a value calculated from <B>, <B1>, and <B2>.
How to build a CPI breakdown model
To build a CPI breakdown model for your workload, you need to collect the counter data, calculate the CPI fraction for its components, and populate the model with the CPI values.
Collecting counter data
The data illustrated in Table 1 was collected through the following 16 counters from seven pmcount groups: 0, 1, 5, 28, 29, 30, and 31. Table 2 gives their description.
Table 2. Performance monitor counter data collection
| Counter No. | Event No. | Group No. | Description |
|---|---|---|---|
| Completion cycles: | | | |
| 2 | 71 | 0 | PM_INST_CMPL - Number of PowerPC inst completed (iops) |
| 5 | 0 | Any | PM_INST_CMPL - Number of PowerPC inst completed |
| 6 | 0 | Any | PM_RUN_CYC - Processor cycles gated by the run latch |
| 3 | 49 | 1 | PM_GRP_CMPL - Group completed |
| Completion Table empty (GCT empty) cycles: | | | |
| 1 | 60 | 5 | PM_GCT_NOSLOT_CYC - Cycles no GCT slot allocated |
| 2 | 59 | 5 | PM_GCT_NOSLOT_IC_MISS - No slot in GCT caused by Icache miss |
| 3 | 46 | 5 | PM_GCT_NOSLOT_SRQ_FULL - No slot in GCT caused by SRQ full |
| 4 | 51 | 5 | PM_GCT_NOSLOT_BR_MPRED - No slot in GCT caused by branch mispredict |
| Completion Stall cycles: | | | |
| 2 | 13 | 28 | PM_CMPLU_STALL_LSU - Completion stall caused by LSU instruction |
| 4 | 10 | 28 | PM_CMPLU_STALL_REJECT - Completion stall caused by Reject |
| 2 | 10 | 29 | PM_CMPLU_STALL_DCACHE_MISS - Completion stall caused by Dcache miss |
| 4 | 8 | 29 | PM_CMPLU_STALL_ERAT_MISS - Completion stall caused by ERAT miss |
| 2 | 12 | 30 | PM_CMPLU_STALL_FXU - Completion stall caused by FXU instruction |
| 4 | 7 | 30 | PM_CMPLU_STALL_DIV - Completion stall caused by DIV instruction |
| 2 | 11 | 31 | PM_CMPLU_STALL_FDIV - Completion stall caused by FDIV or FSQRT instruction |
| 4 | 9 | 31 | PM_CMPLU_STALL_FPU - Completion stall caused by FPU instruction |
Example of counter data collection
To collect the pmcount data, you need to use the pmcount command available under AIX. For Linux on POWER, you will need a kernel patch. As an example, you can use the following command to collect the pmcount data for Group 0:
pmcount -kuny -G 0 workload >> pmcount_ku.out
Find more detailed discussion of performance data collection tools in the first article in this series.
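Because the breakdown needs seven groups, the same command has to be repeated once per group. The following sketch only builds the command lines; it reuses the flags and the output file name from the example above and does not invoke pmcount itself. The `pmcount_commands` helper and the `PMCOUNT_GROUPS` name are hypothetical, and the sketch assumes the workload behaves the same way on every run.

```python
# Groups named in the text: 0, 1, 5, 28, 29, 30, and 31.
PMCOUNT_GROUPS = [0, 1, 5, 28, 29, 30, 31]


def pmcount_commands(workload, out_file="pmcount_ku.out"):
    """Build one pmcount command line per group (hypothetical helper).

    The flags (-kuny -G) are copied from the article's example; nothing
    is executed here.
    """
    return [
        f"pmcount -kuny -G {group} {workload} >> {out_file}"
        for group in PMCOUNT_GROUPS
    ]


for command in pmcount_commands("workload"):
    print(command)
```

Printing the commands rather than running them keeps the sketch usable on a machine without pmcount; the lines can be pasted into a shell on the target system.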
Calculating CPI components
The CPI and its components are readily available from the data collected through the pmcount command. Listing 1 depicts a sample of pmcount data collected for Group 0 on an SMP system. The data represents the values of six counters, Counter 1 through Counter 6, which are displayed in columns PMC 1, ..., PMC 6 respectively.
To compute the CPI, take the ratio of total cycles / number of PPC instructions completed in Group 0, which is the total value of PMC 6 / PMC 5, or 302936029042 / 117749670719 which is equal to 2.57.
Listing 1. pmcount Data for Group 0
pmcount -kun -G 0 sleep 5
Processor name: POWER5
*** Configuration for all CPUs :
kernel: runlatch enabled;
Mode = user, kernel; Process tree = off; Thresholding = off
Thresholding multiplier: 1, 32, 64
MMCR0[400c003] MMCR1[0] MMCR1[a02121e] MMCRA[0]
regs mode [0]
sregs: MMCRA[4000000] mode [40000000]
Group 0: pm_utilization Name: CPI and utilization data
Counter 1, event 190: PM_RUN_CYC
Counter 2, event 71: PM_INST_CMPL
Counter 3, event 56: PM_INST_DISP
Counter 4, event 12: PM_CYC [shared]
Counter 5, event 0: PM_INST_CMPL
Counter 6, event 0: PM_RUN_CYC
*** All CPU data:
cpu PMC 1 PMC 2 PMC 3 PMC 4 PMC 5 PMC 6
==== ======== ======== ======== ======== ======== ========
[ 0] 9480159805 3627365099 4399841370 9562151386 3627365196 9480160495
[ 1] 9479204529 3652180319 4422613553 9562559351 3652180416 9479204595
[ 2] 9485236988 3621234799 4390923166 9563162485 3621234936 9485237345
[ 3] 9476339229 3703301514 4483370620 9563799360 3703301651 9476343842
.
.
.
[ALL] 302936012787 117749668145 142731148496 306273117094 117749670719 302936029042
All done
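The CPI calculation described for Group 0 can be reproduced from the [ALL] row of the listing. The sketch below is a minimal illustration: the `parse_all_row` helper is hypothetical, and it assumes the [ALL] row has the whitespace-separated layout shown in Listing 1.

```python
def parse_all_row(pmcount_output):
    """Return the six PMC totals from the [ALL] row as integers."""
    for line in pmcount_output.splitlines():
        line = line.strip()
        if line.startswith("[ALL]"):
            return [int(token) for token in line[len("[ALL]"):].split()]
    raise ValueError("no [ALL] row found")


# Excerpt of Listing 1 (Group 0).
listing_1 = """\
cpu PMC 1 PMC 2 PMC 3 PMC 4 PMC 5 PMC 6
[ 0] 9480159805 3627365099 4399841370 9562151386 3627365196 9480160495
[ALL] 302936012787 117749668145 142731148496 306273117094 117749670719 302936029042
"""

pmc = parse_all_row(listing_1)
run_cycles = pmc[5]       # PMC 6: PM_RUN_CYC
inst_completed = pmc[4]   # PMC 5: PM_INST_CMPL
cpi = run_cycles / inst_completed
print(f"CPI = {cpi:.2f}")  # prints "CPI = 2.57"
```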
To compute the CPI component of the I-cache miss penalty, take the ratio of PMC 2 / PMC 6 in Group 5, or 2543186641 / 302213025090, which is equal to 0.0084. That is, for 0.84% of the total cycles, execution of this workload was stalled because the pipeline was empty (GCT empty) after an I-cache miss.
To compute the CPI component of the branch misprediction penalty, take the ratio of PMC 4 / PMC 6 in Group 5, or 14448342651 / 302213025090, which is equal to 0.0478. That is, the pipeline was empty because of branch misprediction for 4.78% of the total clock cycles. See Listing 2 for pmcount data in Group 5.
Listing 2. pmcount Data for Group 5
pmcount -kun -G 5 sleep 5
Processor name: POWER5
*** Configuration for all CPUs :
kernel: runlatch enabled;
Mode = user, kernel; Process tree = off; Thresholding = off
Thresholding multiplier: 1, 32, 64
MMCR0[400c003] MMCR1[40000000] MMCR1[8380838] MMCRA[0]
regs mode [0]
sregs: MMCRA[2000000] mode [40000000]
Group 5: pm_gct_empty Name: GCT empty reasons
Counter 1, event 60: PM_GCT_NOSLOT_CYC
Counter 2, event 59: PM_GCT_NOSLOT_IC_MISS
Counter 3, event 46: PM_GCT_NOSLOT_SRQ_FULL
Counter 4, event 51: PM_GCT_NOSLOT_BR_MPRED
Counter 5, event 0: PM_INST_CMPL
Counter 6, event 0: PM_RUN_CYC
*** All CPU data:
cpu PMC 1 PMC 2 PMC 3 PMC 4 PMC 5 PMC 6
==== ======== ======== ======== ======== ======== ========
[ 0] 812182509 78693417 83 444722215 3558607561 9448261358
[ 1] 816986288 77820762 53 446641392 3567264412 9446647500
[ 2] 817819375 76560140 38 448521089 3574674746 9453831148
[ 3] 828622714 78542730 51 452914994 3636947915 9466769676
.
.
.
[ALL] 26489520676 2543186641 1764 14448342651 115642620104 302213025090
All done
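The GCT empty fractions quoted for Group 5 follow directly from the [ALL] totals. The sketch below recomputes them and also derives the "others" remainder by subtraction, as in Table 1. The variable names are illustrative only; the numbers are the [ALL] row of Listing 2.

```python
# [ALL] row of Listing 2 (Group 5), PMC 1 through PMC 6.
gct_empty = 26489520676    # PMC 1: PM_GCT_NOSLOT_CYC
ic_miss = 2543186641       # PMC 2: PM_GCT_NOSLOT_IC_MISS
srq_full = 1764            # PMC 3: PM_GCT_NOSLOT_SRQ_FULL
br_mpred = 14448342651     # PMC 4: PM_GCT_NOSLOT_BR_MPRED
inst_cmpl = 115642620104   # PMC 5: PM_INST_CMPL
run_cycles = 302213025090  # PMC 6: PM_RUN_CYC

# "Others" is not measured directly; it is what remains of GCT empty
# after the I-cache miss and branch misprediction cycles are removed.
gct_other = gct_empty - ic_miss - br_mpred

for name, cycles in [
    ("I-cache miss penalty", ic_miss),
    ("Branch misprediction penalty", br_mpred),
    ("Others (flush, etc.)", gct_other),
]:
    print(f"{name}: {100 * cycles / run_cycles:.2f}% of cycles")
```

The three percentages printed here are the ones that appear in the Completion Table empty rows of the workload breakdown.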
To compute the CPI components of completion stall cycles, use the pmcount data in Group 30 shown as an example in Listing 3. The CPI fraction represented by the completion stall caused by FXU instructions is equal to PMC 2 / PMC 6, or 39341080413 / 303713892054, or 0.1295, or 12.9%.
The CPI component of the stall caused by DIV instructions is equal to PMC 4 / PMC 6, or 18279140851 / 303713892054, or 0.0602, or 6.02% of the total CPI. The CPI component of the stall caused by FXU latency is equal to 0.1295 - 0.0602, or 0.0693, or 6.93% of the aggregate CPI.
Listing 3. pmcount Data for Group 30
pmcount -kun -G 30 sleep 5
Processor name: POWER5
*** Configuration for all CPUs :
kernel: runlatch enabled;
Mode = user, kernel; Process tree = off; Thresholding = off
Thresholding multiplier: 1, 32, 64
MMCR0[400c003] MMCR1[40000008] MMCR1[22320232] MMCRA[1]
regs mode [0]
sregs: MMCRA[2000001] mode [40000000]
Group 30: pm_fxu_stall Name: FXU Stalls
Counter 1, event 68: PM_GRP_IC_MISS_BR_REDIR_NONSPEC
Counter 2, event 12: PM_CMPLU_STALL_FXU
Counter 3, event 55: PM_INST_CMPL
Counter 4, event 7: PM_CMPLU_STALL_DIV
Counter 5, event 0: PM_INST_CMPL
Counter 6, event 0: PM_RUN_CYC
*** All CPU data:
cpu PMC 1 PMC 2 PMC 3 PMC 4 PMC 5 PMC 6
==== ======== ======== ======== ======== ======== ========
[ 0] 498611452 1202471828 3611765439 554313577 3611765535 9505036884
[ 1] 500444149 1210730985 3627920945 560516353 3627921041 9503526335
[ 2] 499069721 1215432409 3622917629 561661375 3622917725 9483176391
[ 3] 500842376 1212605068 3621113502 562567343 3621113598 9497993084
.
.
.
[ALL] 16193491341 39341080413 117465600498 18279140851 117465602898 303713892054
All done
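The Group 30 totals can be expressed both as a share of run cycles and as stall cycles per completed instruction. The second form, dividing by PMC 5 from the same run, is an assumption about how a per-instruction CPI contribution might be formed; the article itself only spells out the division by PMC 6. The variable names are illustrative.

```python
# [ALL] row of Listing 3 (Group 30).
stall_fxu = 39341080413    # PMC 2: PM_CMPLU_STALL_FXU
stall_div = 18279140851    # PMC 4: PM_CMPLU_STALL_DIV
inst_cmpl = 117465602898   # PMC 5: PM_INST_CMPL
run_cycles = 303713892054  # PMC 6: PM_RUN_CYC

# FXU basic latency is derived: FXU stalls not attributed to DIV/MTSPR/MFSPR.
stall_fxu_latency = stall_fxu - stall_div

for name, cycles in [
    ("Stall by FXU inst", stall_fxu),
    ("Stall by DIV/MTSPR/MFSPR", stall_div),
    ("Stall by FXU basic latency", stall_fxu_latency),
]:
    share = 100 * cycles / run_cycles   # percentage of all run cycles
    per_inst = cycles / inst_cmpl       # assumed per-instruction form
    print(f"{name}: {share:.2f}% of cycles, {per_inst:.2f} cycles/instruction")
```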
Constructing the workload CPI breakdown table
After the pmcount data is collected, the CPI percentage and individual CPI fractions can be computed and populated as Table 3 illustrates.
Table 3. Workload CPI breakdown
| Category | Subcategory | Component | %CPI | CPI Fraction |
|---|---|---|---|---|
| Completion Cycles | | PowerPC Base completion cycles | 7.77 | 0.20 |
| | | Overhead of Cracking/Grouping Restriction | 6.23 | 0.16 |
| Completion table empty (GCT empty) cycles | | I-cache miss penalty | 0.84 | 0.02 |
| | | Branch redirection (branch misprediction) penalty | 4.78 | 0.12 |
| | | Others (flush, etc.) | 3.14 | 0.08 |
| Completion Stall Cycles | Stall by LSU Inst | Stall by reject: Stall by translation | 1.28 | 0.03 |
| | | Stall by reject: Other reject | 3.75 | 0.10 |
| | | Stall by D-cache Miss | 34.23 | 0.88 |
| | | Stall by LSU basic latency | 9.50 | 0.25 |
| | Stall by FXU Inst | Stall by Div/MTSPR/MFSPR inst | 6.02 | 0.16 |
| | | Stall by FXU basic latency | 6.93 | 0.18 |
| | Stall by FPU Inst | Stall by FDIV/FSQRT inst | 0.03 | 0.00 |
| | | Stall by FPU basic latency | 2.10 | 0.05 |
| | Other Stalls | | 13.39 | 0.35 |
| Total Cycles | | CPI | | 2.57 |
In Table 3, the CPI of the workload under study is 2.57, which is on the medium-high side. Breaking the CPI down shows that approximately 14% of processing cycles were spent completing work, while the rest were spent in various stall components. The CPI percentage for GCT empty is about 8.7%, of which 4.78% is branch misprediction penalty. This is small compared to the 34.23% in stall by D-cache miss, which suggests that the high CPI was mainly caused by data cache misses. In fact, the total stall by LSU instructions is 48.7%, compared to 12.9% for FXU instructions and 2.1% for FPU instructions. The cause of the high LSU stall probably lies in the memory subsystem, including the workload's memory working set, memory latency, the organization of the memory hierarchy, cache sizes, or TLB sizes.
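The group totals quoted in this discussion are sums of the leaf rows of Table 3. The sketch below rolls the %CPI leaves up into their parent categories; the dictionary layout and names are illustrative only, and the values are copied from Table 3.

```python
# %CPI leaf values from Table 3, grouped by parent category.
table_3 = {
    "Completion cycles": [7.77, 6.23],
    "GCT empty": [0.84, 4.78, 3.14],
    "Stall by LSU inst": [1.28, 3.75, 34.23, 9.50],
    "Stall by FXU inst": [6.02, 6.93],
    "Stall by FPU inst": [0.03, 2.10],
    "Other stalls": [13.39],
}

# Sum each group's leaves to get the category totals used in the text.
totals = {name: round(sum(leaves), 2) for name, leaves in table_3.items()}
for name, pct in totals.items():
    print(f"{name}: {pct:.2f}%")
print(f"Sum of all components: {round(sum(totals.values()), 2):.2f}%")
```

The components sum to slightly less than 100% because each leaf value in the table is already rounded to two decimal places.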
CPI analysis -- an example
The following section illustrates an example showing how CPI was used to identify and resolve a performance problem on a POWER5 system. The application under study is the program 179.art in the floating-point test suite of the SPECcpu2000 benchmark. The C program 179.art uses the Adaptive Resonance Theory 2 neural network to recognize objects in a thermal image. The objects are a helicopter and an airplane. First, the neural network is trained on the objects. When training is complete, the learned images are found in the scanfield image. A window corresponding to the size of the learned objects is scanned across the scanfield image and serves as input for the neural network. The ART2 neural network attempts to match the windowed field of view image with the learned image. The field of view with the highest level of matching confidence is output. For a detailed description of the benchmark, see the article "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support" linked in Resources.
Table 4 shows the CPI breakdown of the application, where 0.31, or 13%, of the total CPI was spent by the processor completing instructions, while 2.11, or 87%, of the total CPI was spent in various stall conditions. Of the stalls, 1.63, or 68% of the total CPI, was spent in the Load and Store Unit (LSU). Further examination of the data reveals that 1.43 out of 1.63, or 88% of the LSU stall CPI, was spent in data cache miss penalty.
Table 4. CPI Breakdown Analysis 1
| Category | Subcategory | Component | CPI |
|---|---|---|---|
| Group Complete Cycles (GCC) | Base Completion Cycles | PPC Base Completion Cycles | 0.31 |
| | | Overhead of Cracking | 0.00 |
| | | Overhead of Grouping Restriction | 0.00 |
| | | total GCC | 0.31 |
| GCT Empty Cycles | | Icache Miss Penalty | 0.02 |
| | | Branch Mispredict Penalty | 0.00 |
| | | Store Stall Penalty | 0.00 |
| | | Other (flush, etc.) | 0.01 |
| | | total GCT empty cycles | 0.03 |
| Completion Stall Cycles (CSC) | Stall by LSU Inst | Stall by Reject: Stall by Xlate | 0.00 |
| | | Stall by Reject: Other Reject | 0.09 |
| | | Stall by Dcache miss | 1.43 |
| | | Stall by LSU Latency | 0.10 |
| | | total LSU stall | 1.63 |
| | Stall by FXU Inst | Stall by Div/MTSPR/MFSPR | 0.00 |
| | | Stall by FXU Latency | 0.03 |
| | | total FXU stall | 0.03 |
| | Stall by FPU Inst | Stall by FDIV/FSQRT | 0.00 |
| | | Stall by FPU Latency | 0.40 |
| | | total FPU stall | 0.40 |
| | Other Stalls | | 0.02 |
| | | total CSC | 2.08 |
| Total CPI | | | 2.42 |
Data cache misses normally require flushing the data cache contents to memory. This activity involves flushing each cache line until all of the cache lines are flushed. As discussed in "Performance Monitoring on the PowerPC 604 Microprocessor" (see Resources), POWER5 systems employ a hardware prefetch engine to prefetch data into the L1 data cache.
"When load instructions miss sequential cache lines, either ascending or descending, the prefetch engine initiates accesses to the following cache lines before being referenced by future load instructions. In order to ensure that the data will be in the L1 data cache when needed, the L1 data cache prefetch is initiated when a load instruction references data from a new cache line. At the same time, the transfer of a line into L2 from memory is requested. Since the latency for retrieving a line of data from memory into the L2 is longer than that for moving it from L2 to the L1, the prefetch engine requests data from memory twelve lines ahead of the line being referenced by the load instruction. Eight such streams per processor are supported."
Table 5 shows the flush statistics, where 96% of flushes are due to the LSU. In addition, the number of instructions per LSU flush is quite low, and the unaligned load percentage is 5%, with an unaligned load rate of about 2%. A perfect loading condition would have an unaligned load percentage around 0%.
Table 5. Flush statistics
| Metric | Value |
|---|---|
| Loads_flushed | 5% |
| Unaligned_load_percentage | 5% |
| Unaligned_load_rate | 2% |
| Unaligned_store_percentage | 0% |
| Unaligned_store_rate | 0 |
| LSU_Flush_Rate | 2% |
| PCT_of_flushes_due2_LSU | 96% |
| Instructions_per_LSU_Flush | 47.86 |
| LRQ_Flush_Rate | 0 |
| Instructions_per_LRQ_Flush | 24694.53 |
| SRQ_Flush_Rate | 0 |
| Instructions_per_SRQ_Flush | 1354.2 |
Alignment is important for performance, especially for the allocation of data prefetch streams on POWER5 systems. It has been observed that aligned memory references are always executed in fewer cycles than misaligned references, which might require two sequential accesses to transfer the data (see Resources). The IBM XL C/C++ Enterprise Edition Version 7.0 for AIX compiler provides the -qalign option, which lets you specify the alignment rules the compiler uses for data and code alignment boundaries. The default rule is power alignment, where vector type members have an alignment of 16 bytes and the first element has its natural alignment. The other rule is natural, where all elements have their natural alignment boundary. For example, a float aligns on a four-byte boundary, a double on an eight-byte boundary, a long double on an eight-byte boundary, a pointer on a four-byte boundary, and so on. Table 6 shows the impact of compiling the application under test with -qalign=natural. After the program was recompiled with -qalign=natural, the percentage of flushes due to LSU and the LSU flush rate dropped to nil, and the unaligned load rate and load percentage also dropped to 0%. The number of instructions per LSU flush increased by almost 1,000 times, which shows the importance of load and store alignment to the POWER5 prefetch engine.
Table 6. Impact of align=natural
| Metric | Before | After |
|---|---|---|
| Loads_flushed | 5% | 0% |
| Unaligned_load_percentage | 5% | 0% |
| Unaligned_load_rate | 2% | 0% |
| Unaligned_store_percentage | 0% | . |
| Unaligned_store_rate | 0 | . |
| LSU_Flush_Rate | 2% | 0% |
| PCT_of_flushes_due2_LSU | 96% | 0% |
| Instructions_per_LSU_Flush | 47.86 | 45485.87 |
| LRQ_Flush_Rate | 0 | 0 |
| Instructions_per_LRQ_Flush | 24694.53 | 531303.18 |
| SRQ_Flush_Rate | 0 | 0 |
| Instructions_per_SRQ_Flush | 1354.2 | 49469.69 |
| DL1_reloads_per_DL1_Miss | 0.96 | 1.74 |
| LMQ_Merge_per_miss | 0.7 | 0.59 |
| LMQ_Full_reject_per_miss | 1.95 | 6.24 |
| SRQ_LHS_reject_per_load | 0.01 | 0.04 |
| LSU_REJECT_CDF_per_reference | 0.29 | 0.08 |
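The size of the improvement in the instructions-per-flush rows of Table 6 is easier to see as a ratio of the After value to the Before value. The sketch below computes those ratios; the dictionary and the variable names are illustrative, and the values are copied from Table 6.

```python
# (before, after) pairs copied from Table 6.
instructions_per_flush = {
    "Instructions_per_LSU_Flush": (47.86, 45485.87),
    "Instructions_per_LRQ_Flush": (24694.53, 531303.18),
    "Instructions_per_SRQ_Flush": (1354.2, 49469.69),
}

# A larger ratio means flushes became rarer relative to instructions executed.
improvement = {
    name: after / before
    for name, (before, after) in instructions_per_flush.items()
}
for name, ratio in improvement.items():
    print(f"{name}: {ratio:.1f}x more instructions between flushes")
```

The LSU ratio is the "almost 1,000 times" figure; the LRQ and SRQ ratios are much smaller, which is consistent with the LSU flushes being the ones tied to unaligned loads.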
Table 7 shows the impact of the compiler option -qalign=natural on the CPI breakdown analysis. In the After column, which shows the application recompiled with -qalign=natural, the CPI for total LSU stall has gone down from 1.63 to 0.45 (with the D-cache miss stall going from 1.43 to 0.32), the completion stall cycles CPI went from 2.08 to 0.75, and the total CPI was reduced from 2.42 to 1.04.
Table 7. CPI breakdown analysis 2
| Category | Subcategory | Component | Before | After |
|---|---|---|---|---|
| Group Complete Cycles (GCC) | Base Completion Cycles | PPC Base Completion Cycles | 0.31 | 0.29 |
| | | Overhead of Cracking | 0.00 | 0.00 |
| | | Overhead of Grouping Restriction | 0.00 | 0.00 |
| | | total GCC | 0.31 | 0.29 |
| GCT Empty Cycles | | Icache Miss Penalty | 0.02 | 0.00 |
| | | Branch Mispredict Penalty | 0.00 | 0.00 |
| | | Store Stall Penalty | 0.00 | 0.00 |
| | | Other (flush, etc.) | 0.01 | 0.01 |
| | | total GCT empty cycles | 0.03 | 0.00 |
| Completion Stall Cycles (CSC) | Stall by LSU Inst | Stall by Reject: Stall by Xlate | 0.00 | 0.00 |
| | | Stall by Reject: Other Reject | 0.09 | 0.06 |
| | | Stall by Dcache miss | 1.43 | 0.32 |
| | | Stall by LSU Latency | 0.10 | 0.07 |
| | | total LSU stall | 1.63 | 0.45 |
| | Stall by FXU Inst | Stall by Div/MTSPR/MFSPR | 0.00 | 0.00 |
| | | Stall by FXU Latency | 0.03 | 0.01 |
| | | total FXU stall | 0.03 | 0.01 |
| | Stall by FPU Inst | Stall by FDIV/FSQRT | 0.00 | 0.00 |
| | | Stall by FPU Latency | 0.40 | 0.27 |
| | | total FPU stall | 0.40 | 0.27 |
| | Other Stalls | | 0.02 | 0.02 |
| | | total CSC | 2.08 | 0.75 |
| Total CPI | | | 2.42 | 1.04 |
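The before and after columns of Table 7 can be compared category by category to see where the CPI reduction came from. The sketch below uses the category totals from the table; the dictionary layout and names are illustrative. It compares CPI only and makes no claim about run time, since recompiling can also change the number of instructions executed.

```python
# Category totals copied from Table 7: (before, after).
cpi_totals = {
    "total GCC": (0.31, 0.29),
    "total GCT empty cycles": (0.03, 0.00),
    "total LSU stall": (1.63, 0.45),
    "total FXU stall": (0.03, 0.01),
    "total FPU stall": (0.40, 0.27),
    "Other Stalls": (0.02, 0.02),
}

cpi_before = sum(before for before, _ in cpi_totals.values())
cpi_after = sum(after for _, after in cpi_totals.values())
total_reduction = cpi_before - cpi_after

print(f"Total CPI: {cpi_before:.2f} -> {cpi_after:.2f}")
for name, (before, after) in cpi_totals.items():
    share = (before - after) / total_reduction
    print(f"{name}: {before - after:+.2f} CPI ({share:.0%} of the reduction)")
```

Most of the reduction comes from the LSU stall category, which matches the article's focus on load alignment.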
Conclusion
This article has shown that CPI analysis is a useful methodology for studying microprocessor performance as well as for software tuning. With the performance monitoring facilities provided in POWER5, workload analysis through CPI breakdown has become a valuable tool for analyzing the effects of system architecture on workload performance.
Resources
Learn
- This is Part 2 of a two-part series. See also Part 1.
- "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support" was presented at the Eighth Workshop on Computer Architecture Evaluation using Commercial Workloads in February 2005.
- A detailed description of the POWER5 design is available in "POWER5 System Microarchitecture" from the IBM Journal of Research and Development.
- The book The PowerPC Compiler Writer's Guide (IBM, 1996) describes, mainly by coding examples, the code patterns that perform well on Power Architecture processors.
- "Performance Monitor PowerPC Perspective" was originally presented by Alex Mericas in February 2005 (in PDF format).
- Frank Levine's "A Programmer's View of Performance Monitoring in the PowerPC Microprocessor" is a detailed discussion of performance monitor support on the PowerPC 604 and 604e.
- "Performance Monitoring on the PowerPC 604 Microprocessor" by Charles Roth, Frank Levine, and Ed Welbon provides an in-depth discussion of performance monitoring on the PowerPC 604 (in PDF format).
- Sam Siewert's "Big iron lessons, Part 3: Performance monitoring and tuning" contains a general discussion of performance tuning considerations for the system architect (developerWorks 2005).
- For a typical workload CPI analysis, see the article "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support" (in PDF format).
- You will find OProfile at SourceForge, and a guide to using OProfile at developerWorks.
- Find benchmarks to fulfil your every need (or at least many of them) at spec.org.
- The book Performance Tuning for Linux Servers doesn't just cover kernel tuning: it shows how to maximize the end-to-end performance of real-world applications and databases running on Linux.
- Keep abreast of all Power Architecture-related news and publications: subscribe to the Power Architecture Community Newsletter.
About the author
Duc Vianney, Alex Mericas, Bill Maron, Thomas Chen, Steve Kunkel, and Bret Olszewski of the IBM Systems & Technology Group Systems Performance team contributed to this article.