CPI analysis on POWER5, Part 2: Introducing the CPI breakdown model

Understanding the numbers


Level: Intermediate

STG Systems Performance team (dwpower@us.ibm.com), Development, IBM

25 Apr 2006

Make substantial improvements in performance analysis with a CPI analysis model built on the tools introduced in Part 1. Learn ways to analyze the specific performance counter data produced by profiling runs to obtain statistics for events which the CPU cannot directly report on.

As Part 1 explained, the POWER5™ has six PMU counters -- two are dedicated to counting PowerPC® instructions completed and total cycles, and the four other counters are available to count other events.

The POWER5 is capable of completing one group per cycle (up to five instructions per cycle). Some PowerPC instructions are expanded into multiple internal operations (IOPs) during the decode stage and may span more than one group. The POWER5 PMU can count the total number of groups completed as well as the number of groups that contained at least one PowerPC instruction. The difference is the number of additional groups required due to expansion. A simplistic view of CPI is to break it into a base component, when the processor is completing work (groups completed), and a stall component, when the processor is not completing instructions (total cycles - groups completed). The stall component can be further divided into cycles when the pipeline was empty (GCT empty) and cycles when the pipeline was not empty but completion was stalled (stall - GCT empty).

These divisions, and the multiple counters available, make it possible to analyze performance more precisely, by comparing related statistics to derive some of the performance characteristics the processor cannot directly measure.
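The simplistic view described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the cycle counts are made up for the example and do not come from a real profiling run.

```python
def split_cycles(total_cycles, groups_completed, gct_empty_cycles):
    """Split total cycles into base, GCT-empty, and completion-stall cycles."""
    # Stall component: cycles in which no group completed.
    stall = total_cycles - groups_completed
    # Completion stall: pipeline not empty, but completion is stalled.
    completion_stall = stall - gct_empty_cycles
    return {
        "base": groups_completed,
        "gct_empty": gct_empty_cycles,
        "completion_stall": completion_stall,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
parts = split_cycles(total_cycles=1000, groups_completed=140, gct_empty_cycles=90)
print(parts)
```

The three parts always add up to the total cycle count, which is what lets each one be expressed as a share of the CPI.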


Table 1. CPI breakdown model

Total cycles  <# cycles>
  Completion cycles  <A: group complete cycles>
    PowerPC Base completion cycles  <A1: One or more PowerPC instructions completed this cycle>
    Overhead of cracking/microcoding and grouping restriction  <A2: (A)-(A1)>
  Completion Table empty (GCT empty) cycles  <B>
    I-cache miss penalty  <B1>
    Branch redirection (branch misprediction) penalty  <B2>
    Others (Flush penalty, etc.)  <B4: (B)-(B1)-(B2)>
  Completion Stall cycles  <C: total-(A)-(B)>
    Stall by LSU inst  <C1>
      Stall by reject  <C1A>
        Stall by translation (rejected by ERAT miss)  <C1A1>
        Other reject  <C1A2: (C1A)-(C1A1)>
      Stall by D-cache miss  <C1B>
      Stall by LSU basic latency, LSU Flush penalty  <C1C: (C1)-(C1A)-(C1B)>
    Stall by FXU inst  <C2>
      Stall by any form of DIV/MTSPR/MFSPR inst  <C2A>
      Stall by FXU basic latency  <C2C: (C2)-(C2A)>
    Stall by FPU inst  <C3>
      Stall by any form of FDIV/FSQRT inst  <C3A>
      Stall by FPU basic latency  <C3B: (C3)-(C3A)>
    Others (Stall by BRU/CRU inst, flush penalty (except LSU flush), etc.)  <C4: (Completion Stall cycles)-(C1)-(C2)-(C3)>

Table 1 represents a CPI breakdown model in which the total cycles of a workload are divided into three components: Completion cycles, Completion Table empty (GCT empty) cycles, and Completion Stall cycles. The base completion cycles are the number of cycles that would be needed if grouping were perfect. Otherwise, stalls happen, and they can be attributed to either Completion Table empty or Completion Stall cycles. A Completion Table empty condition occurs when no groups are completing on a given cycle because of either an I-cache miss or a branch misprediction. Meanwhile, the Completion Stall cycles are those stalls caused by any of the following instructions: LSU, FXU, FXU long (all forms of div, mtspr, mfspr), FPU, and FPU long (all forms of fsqrt, fdiv); or by other events such as D-cache miss, Reject, and Reject by translation (ERAT miss).

The events are hierarchical. Load/Store Unit (LSU) stalls will include Dcache miss and Rejects. Reject stalls will include Translation stalls due to Effective to Real Address Translation (ERAT) miss. This method only reports the last condition to clear and will not detect dependency chains. For example, a fixed point add instruction that is dependent on a load in the same group that misses the Dcache will cause the completion stall condition to be reported as Fixed Point Unit (FXU). When multiple conditions are reported at the same time, Load/Store conditions are favored over all others, and Dcache miss is favored over Rejects. Stalls due to GCT empty are primarily caused by instruction cache misses and branch mispredicts. The POWER5 has speculative events for these conditions that are handled in the same way as completion stalls. Counting begins when the GCT goes empty and stops when at least one group is in the GCT.

The performance events used in Table 1 are defined as follows:

  • Total cycles = PM_CYCLES
  • Completion cycles = PM_GRP_CMPL
  • Completion cycles = PM_INST_CMPL (SCOMx360 configured with bits 0-10 = 0 00 0000 0000)
  • PowerPC Base Completion cycles = PM_PPC_CMPL (or PMC 5)
  • Overhead of cracking/microcoding = Completion cycles - PowerPC Base Completion cycles
  • Completion Table empty (GCT empty) = PM_GCT_EMPTY
  • I-Cache Miss Penalty = PM_GCT_EMPTY_IC_MISS
  • Branch Misprediction Penalty = PM_GCT_EMPTY_BR_MPRED
  • Others GCT stalls = GCT_Empty - I-Cache Miss Penalty - Branch Mispredict Penalty
  • Completion Stall cycles = Total cycles - Completion cycles - GCT empty
  • Stall by LSU Instruction = PM_CMPLU_STALL_LSU
  • Stall by LSU Reject = PM_CMPLU_STALL_REJECT
  • Stall by LSU Translation Reject = PM_CMPLU_STALL_ERAT_MISS
  • Stall by LSU Other = Stall by LSU Reject - Stall by LSU Translation Reject
  • Stall by LSU Dcache miss = PM_CMPLU_STALL_DCACHE_MISS
  • Stall by LSU basic latency = Stall by LSU Instruction - Stall by LSU Reject - Stall by LSU Dcache miss
  • Stall by FXU Instruction = PM_CMPLU_STALL_FXU
  • Stall by DIV, MTSPR, or MFSPR = PM_CMPLU_STALL_DIV
  • Stall by FXU basic latency = Stall by FXU Instruction - Stall by DIV, MTSPR, or MFSPR
  • Stall by FPU Instruction = PM_CMPLU_STALL_FPU
  • Stall by FDIV or FSQRT = PM_CMPLU_STALL_FDIV
  • Stall by FPU basic latency = Stall by FPU Instruction - Stall by FDIV,FSQRT
  • Stall by others = Completion Stall cycles - Stall by LSU Instruction - Stall by FXU Instruction - Stall by FPU Instruction
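The calculated entries in the list above are all simple differences between measured events. The following Python sketch collects them in one place. It is a minimal illustration, not a tool shipped with pmcount: the function name, the dictionary keys for the results, and the sample counts are assumptions made for this example.

```python
def derive_components(m):
    """Compute the calculated CPI components from measured event counts.

    m maps the event names used in the list above to cycle counts.
    """
    completion_stall = m["PM_CYCLES"] - m["PM_GRP_CMPL"] - m["PM_GCT_EMPTY"]
    return {
        "cracking_overhead": m["PM_GRP_CMPL"] - m["PM_PPC_CMPL"],
        "gct_other": (m["PM_GCT_EMPTY"] - m["PM_GCT_EMPTY_IC_MISS"]
                      - m["PM_GCT_EMPTY_BR_MPRED"]),
        "completion_stall": completion_stall,
        "lsu_other_reject": (m["PM_CMPLU_STALL_REJECT"]
                             - m["PM_CMPLU_STALL_ERAT_MISS"]),
        "lsu_basic_latency": (m["PM_CMPLU_STALL_LSU"]
                              - m["PM_CMPLU_STALL_REJECT"]
                              - m["PM_CMPLU_STALL_DCACHE_MISS"]),
        "fxu_basic_latency": m["PM_CMPLU_STALL_FXU"] - m["PM_CMPLU_STALL_DIV"],
        "fpu_basic_latency": m["PM_CMPLU_STALL_FPU"] - m["PM_CMPLU_STALL_FDIV"],
        "other_stall": (completion_stall - m["PM_CMPLU_STALL_LSU"]
                        - m["PM_CMPLU_STALL_FXU"] - m["PM_CMPLU_STALL_FPU"]),
    }

# Hypothetical counts for illustration only.
measured = {
    "PM_CYCLES": 1000, "PM_GRP_CMPL": 140, "PM_PPC_CMPL": 80,
    "PM_GCT_EMPTY": 90, "PM_GCT_EMPTY_IC_MISS": 10, "PM_GCT_EMPTY_BR_MPRED": 50,
    "PM_CMPLU_STALL_LSU": 490, "PM_CMPLU_STALL_REJECT": 50,
    "PM_CMPLU_STALL_ERAT_MISS": 13, "PM_CMPLU_STALL_DCACHE_MISS": 340,
    "PM_CMPLU_STALL_FXU": 130, "PM_CMPLU_STALL_DIV": 60,
    "PM_CMPLU_STALL_FPU": 20, "PM_CMPLU_STALL_FDIV": 1,
}
derived = derive_components(measured)
for name, cycles in derived.items():
    print(f"{name:20s} {cycles}")
```

In practice the measured events come from different pmcount groups and therefore from different runs, so each count would first be normalized by the cycle count of its own run before the differences are taken.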

Table 1 shows the breakdown of CPI components and the events that are used to calculate the breakdown. Entries that show only a label, such as <B1> (shaded in the original table), are measured directly; entries that show a formula are calculated. The total cycles represent the aggregate CPI of the workload. The cycle values in each category are either measured by the performance counters or calculated from the observed values. As an example, the CPI cost in Completion Table empty <B> in Table 1 is broken down into three components: I-cache Miss Penalty <B1>, Branch redirection (Branch Misprediction Penalty) <B2>, and others (Flush Penalty, and so on) <B4: (B)-(B1)-(B2)>. <B1> and <B2> are measured values and <B4> is a calculated value based on <B>, <B1>, and <B2>.





How to build a CPI breakdown model

To build a CPI breakdown model for your workload, you need to collect the counter data, calculate the CPI fraction for its components, and populate the model with the CPI values.

Collecting counter data

The data illustrated in Table 1 was collected through the following 16 counters from seven pmcount groups: 0, 1, 5, 28, 29, 30, and 31. Table 2 gives their description.


Table 2. Performance monitor counter data collection

Counter No.  Event No.  Group No.  Description

Completion cycles:
2            71         0          PM_INST_CMPL - Number of PowerPC inst completed (iops)
5            0          Any        PM_INST_CMPL - Number of PowerPC inst completed
6            0          Any        PM_RUN_CYC - Processor cycles gated by the run latch
3            49         1          PM_GRP_CMPL - Group completed

Completion Table empty (GCT empty) cycles:
1            60         5          PM_GCT_NOSLOT_CYC - Cycles no GCT slot allocated
2            59         5          PM_GCT_NOSLOT_IC_MISS - No slot in GCT caused by Icache miss
3            46         5          PM_GCT_NOSLOT_SRQ_FULL - No slot in GCT caused by SRQ full
4            51         5          PM_GCT_NOSLOT_BR_MPRED - No slot in GCT caused by branch mispredict

Completion Stall cycles:
2            13         28         PM_CMPLU_STALL_LSU - Completion stall caused by LSU instruction
4            10         28         PM_CMPLU_STALL_REJECT - Completion stall caused by Reject
2            10         29         PM_CMPLU_STALL_DCACHE_MISS - Completion stall caused by Dcache miss
4            8          29         PM_CMPLU_STALL_ERAT_MISS - Completion stall caused by ERAT miss
2            12         30         PM_CMPLU_STALL_FXU - Completion stall caused by FXU instruction
4            7          30         PM_CMPLU_STALL_DIV - Completion stall caused by DIV instruction
2            11         31         PM_CMPLU_STALL_FDIV - Completion stall caused by FDIV or FSQRT instruction
4            9          31         PM_CMPLU_STALL_FPU - Completion stall caused by FPU instruction

Example of counter data collection

To collect the pmcount data, you need to use the pmcount command available under AIX. For Linux on POWER, you will need a kernel patch. As an example, you can use the following command to collect the pmcount data for Group 0:

pmcount -kuny -G 0 workload >> pmcount_ku.out

Find more detailed discussion of performance data collection tools in the first article in this series.
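Because each pmcount run covers one group, collecting the full model means repeating the command for each of the seven groups named earlier. The sketch below only builds the argument lists and prints them; it does not run pmcount. The function name is hypothetical, "workload" is the same placeholder used in the command above, and the flags are copied from that command without further interpretation.

```python
# The seven pmcount groups used to build the CPI breakdown model.
GROUPS = [0, 1, 5, 28, 29, 30, 31]

def pmcount_commands(workload_argv, groups=GROUPS):
    """Return one pmcount argument list per group, mirroring the command above."""
    return [["pmcount", "-kuny", "-G", str(g)] + list(workload_argv)
            for g in groups]

commands = pmcount_commands(["workload"])  # "workload" is a placeholder
for argv in commands:
    print(" ".join(argv))
```

Each argument list could be passed to a process launcher on a system where pmcount is installed, with the output appended to a file as in the command above.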

Calculating CPI components

The CPI and its components can be computed directly from the data collected through the pmcount command. Listing 1 shows a sample of pmcount data collected for Group 0 on an SMP system. The data represents the values of six counters, Counter 1 through Counter 6, which are displayed in columns PMC 1, ..., PMC 6 respectively.

To compute the CPI, take the ratio of total cycles / number of PPC instructions completed in Group 0, which is the total value of PMC 6 / PMC 5, or 302936029042 / 117749670719 which is equal to 2.57.


Listing 1. pmcount Data for Group 0

pmcount -kun -G 0 sleep 5 
Processor name: POWER5
*** Configuration for all CPUs :
kernel: runlatch enabled; 
Mode = user, kernel; Process tree = off; Thresholding = off
Thresholding multiplier: 1, 32, 64
MMCR0[400c003] MMCR1[0] MMCR1[a02121e] MMCRA[0]
regs mode [0]
sregs: MMCRA[4000000] mode [40000000]
Group  0: pm_utilization  Name: CPI and utilization data
Counter  1, event 190: PM_RUN_CYC
Counter  2, event 71: PM_INST_CMPL
Counter  3, event 56: PM_INST_DISP
Counter  4, event 12: PM_CYC [shared]
Counter  5, event  0: PM_INST_CMPL
Counter  6, event  0: PM_RUN_CYC

*** All CPU data:
cpu    PMC 1        PMC 2         PMC 3         PMC 4         PMC 5         PMC 6     
====   ========     ========      ========      ========      ========      ========  
[ 0]  9480159805    3627365099    4399841370    9562151386    3627365196    9480160495  
[ 1]  9479204529    3652180319    4422613553    9562559351    3652180416    9479204595  
[ 2]  9485236988    3621234799    4390923166    9563162485    3621234936    9485237345  
[ 3]  9476339229    3703301514    4483370620    9563799360    3703301651    9476343842  
.
.
.
[ALL] 302936012787  117749668145  142731148496  306273117094  117749670719  302936029042  
All done
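The CPI calculation above can be reproduced from the [ALL] row of Listing 1. The following Python sketch assumes the output layout shown in the listing; the function name is hypothetical and the parsing is deliberately minimal.

```python
def all_cpu_totals(pmcount_output):
    """Return the six PMC totals from the [ALL] row of pmcount output."""
    for line in pmcount_output.splitlines():
        if line.startswith("[ALL]"):
            return [int(token) for token in line.split()[1:]]
    raise ValueError("no [ALL] row found")

# The [ALL] row copied from Listing 1.
sample = ("[ALL] 302936012787  117749668145  142731148496  "
          "306273117094  117749670719  302936029042")

pmc = all_cpu_totals(sample)
# PMC 6 is PM_RUN_CYC and PMC 5 is PM_INST_CMPL.
cpi = pmc[5] / pmc[4]
print(f"CPI = {cpi:.2f}")
```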

To compute the CPI component of I-cache miss penalty, take the ratio of PMC 2 / PMC 6 in Group 5, or 2543186641 / 302213025090, which is equal to 0.0084. That is, the execution of this workload was stalled 0.84% of the time because the pipeline was empty (GCT empty) due to I-cache misses.

To compute the CPI component of branch misprediction penalty, take the ratio of PMC 4 / PMC 6, or 14448342651 / 302213025090, which is equal to 0.0478. That is, the pipeline was empty because of branch mispredictions for 4.78% of the total clock cycles. See Listing 2 for pmcount data in Group 5.


Listing 2. pmcount Data for Group 5

pmcount -kun -G 5 sleep 5 
Processor name: POWER5
*** Configuration for all CPUs :
kernel: runlatch enabled; 
Mode = user, kernel; Process tree = off; Thresholding = off
Thresholding multiplier: 1, 32, 64
MMCR0[400c003] MMCR1[40000000] MMCR1[8380838] MMCRA[0]
regs mode [0]
sregs: MMCRA[2000000] mode [40000000]
Group  5: pm_gct_empty  Name: GCT empty reasons
Counter  1, event 60: PM_GCT_NOSLOT_CYC
Counter  2, event 59: PM_GCT_NOSLOT_IC_MISS
Counter  3, event 46: PM_GCT_NOSLOT_SRQ_FULL
Counter  4, event 51: PM_GCT_NOSLOT_BR_MPRED
Counter  5, event  0: PM_INST_CMPL
Counter  6, event  0: PM_RUN_CYC

*** All CPU data:
cpu    PMC 1       PMC 2       PMC 3      PMC 4        PMC 5         PMC 6     
====   ========    ========    ========   ========     ========      ========  
[ 0]  812182509    78693417    83         444722215    3558607561    9448261358  
[ 1]  816986288    77820762    53         446641392    3567264412    9446647500  
[ 2]  817819375    76560140    38         448521089    3574674746    9453831148  
[ 3]  828622714    78542730    51         452914994    3636947915    9466769676  
.
.
.
[ALL] 26489520676  2543186641  1764       14448342651  115642620104  302213025090  
All done
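The same ratios can be checked against the [ALL] row of Listing 2. This short Python sketch uses the totals exactly as listed; the variable names are chosen for the example. The "other" share is the calculated entry <B4> from Table 1.

```python
# [ALL] totals from Listing 2, in PMC 1 .. PMC 6 order.
gct_empty, ic_miss, srq_full, br_mpred, inst_cmpl, run_cyc = (
    26489520676, 2543186641, 1764, 14448342651, 115642620104, 302213025090)

ic_miss_pct = 100 * ic_miss / run_cyc      # I-cache miss penalty
br_mpred_pct = 100 * br_mpred / run_cyc    # branch misprediction penalty
# Whatever remains of GCT empty is attributed to "others (flush, etc.)".
gct_other_pct = 100 * (gct_empty - ic_miss - br_mpred) / run_cyc

print(f"I-cache miss:        {ic_miss_pct:.2f}%")
print(f"Branch mispredict:   {br_mpred_pct:.2f}%")
print(f"Other GCT empty:     {gct_other_pct:.2f}%")
```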

To compute the CPI components of completion stall cycles, use the pmcount data in Group 30 shown as an example in Listing 3. The CPI fraction represented by the completion stall caused by FXU instructions is equal to PMC 2 / PMC 6, or 39341080413 / 303713892054, or 0.1295, or 12.95%.

The CPI component of stall caused by DIV instructions is equal to PMC 4 / PMC 6, or 18279140851 / 303713892054, or 0.0602, or 6.02% of the total CPI. The CPI component of stall caused by FXU latency is equal to 0.1295 - 0.0602, or 0.0693, or 6.93% of the aggregate CPI.


Listing 3. pmcount Data for Group 30

pmcount -kun -G 30 sleep 5 
Processor name: POWER5
*** Configuration for all CPUs :
kernel: runlatch enabled; 
Mode = user, kernel; Process tree = off; Thresholding = off
Thresholding multiplier: 1, 32, 64
MMCR0[400c003] MMCR1[40000008] MMCR1[22320232] MMCRA[1]
regs mode [0]
sregs: MMCRA[2000001] mode [40000000]
Group 30: pm_fxu_stall  Name: FXU Stalls
Counter  1, event 68: PM_GRP_IC_MISS_BR_REDIR_NONSPEC
Counter  2, event 12: PM_CMPLU_STALL_FXU
Counter  3, event 55: PM_INST_CMPL
Counter  4, event  7: PM_CMPLU_STALL_DIV
Counter  5, event  0: PM_INST_CMPL
Counter  6, event  0: PM_RUN_CYC

*** All CPU data:
cpu    PMC 1       PMC 2        PMC 3         PMC 4        PMC 5         PMC 6     
====   ========    ========     ========      ========     ========      ========  
[ 0]  498611452    1202471828   3611765439    554313577    3611765535    9505036884  
[ 1]  500444149    1210730985   3627920945    560516353    3627921041    9503526335  
[ 2]  499069721    1215432409   3622917629    561661375    3622917725    9483176391  
[ 3]  500842376    1212605068   3621113502    562567343    3621113598    9497993084  
.
.
.
[ALL] 16193491341  39341080413  117465600498  18279140851  117465602898  303713892054  
All done
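The FXU figures follow the general pattern of Table 1: a measured parent event, a measured child event, and a remainder obtained by subtraction. A minimal Python sketch of that pattern, using the [ALL] totals from Listing 3, is shown below; the helper name is hypothetical.

```python
def parent_child_shares(parent, child, run_cycles):
    """Return (parent %, child %, remainder %) of run cycles."""
    parent_pct = 100 * parent / run_cycles
    child_pct = 100 * child / run_cycles
    return parent_pct, child_pct, parent_pct - child_pct

# [ALL] totals from Listing 3: PMC 2, PMC 4, and PMC 6.
fxu_pct, div_pct, fxu_latency_pct = parent_child_shares(
    parent=39341080413,      # PM_CMPLU_STALL_FXU
    child=18279140851,       # PM_CMPLU_STALL_DIV
    run_cycles=303713892054, # PM_RUN_CYC
)
print(f"FXU stall: {fxu_pct:.2f}%  DIV: {div_pct:.2f}%  "
      f"FXU basic latency: {fxu_latency_pct:.2f}%")
```

The same helper would apply to the FPU pair (PM_CMPLU_STALL_FPU and PM_CMPLU_STALL_FDIV) from Group 31.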





Constructing the workload CPI breakdown table

After the pmcount data is collected, the CPI percentage and individual CPI fractions can be computed and populated as Table 3 illustrates.


Table 3. Workload CPI breakdown

CPI Breakdown Model                                        %CPI    CPI Fraction

Total Cycles
  Completion Cycles
    PowerPC Base completion cycles                          7.77    0.20
    Overhead of Cracking/Grouping Restriction               6.23    0.16
  Completion table empty (GCT empty) cycles
    I-cache miss penalty                                    0.84    0.02
    Branch redirection (branch misprediction) penalty       4.78    0.12
    Others (flush, etc.)                                    3.14    0.08
  Completion Stall Cycles
    Stall by LSU Inst
      Stall by reject
        Stall by translation                                1.28    0.03
        Other reject                                        3.75    0.10
      Stall by D-cache Miss                                34.23    0.88
      Stall by LSU basic latency                            9.50    0.25
    Stall by FXU Inst
      Stall by Div/MTSPR/MFSPR inst                         6.02    0.16
      Stall by FXU basic latency                            6.93    0.18
    Stall by FPU Inst
      Stall by FDIV/FSQRT inst                              0.03    0.00
      Stall by FPU basic latency                            2.10    0.05
    Other Stalls                                           13.39    0.35
CPI                                                                 2.57

In Table 3, the CPI of the workload under study is 2.57, which is on the medium-high side. The base component shows that approximately 14% of processing cycles were spent completing work, while the rest were spent in various stall components. The CPI percentage for GCT empty is about 8.8%, of which 4.78% is branch misprediction penalty; both are small compared with the 34.23% in stall by D-cache miss. This observation suggests that the high CPI was mainly caused by data cache misses. In fact, the total stall by LSU instructions is about 48.8%, compared to about 13% for FXU instructions and 2.1% for FPU instructions. The cause of the high LSU stall probably lies in the memory subsystem: the workload's memory working set, memory latency, the organization of the memory hierarchy, cache sizes, or TLB sizes.
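To make the ranking in this analysis concrete, the following Python sketch converts the %CPI column of Table 3 into CPI fractions and lists the largest contributors first. The percentages and the CPI of 2.57 are taken from Table 3; the dictionary layout and variable names are assumptions made for the example.

```python
CPI = 2.57

# %CPI values from Table 3 (leaf entries only).
pct_cpi = {
    "PowerPC base completion cycles": 7.77,
    "Overhead of cracking/grouping restriction": 6.23,
    "I-cache miss penalty": 0.84,
    "Branch misprediction penalty": 4.78,
    "Other GCT empty (flush, etc.)": 3.14,
    "Stall by translation": 1.28,
    "Other reject": 3.75,
    "Stall by D-cache miss": 34.23,
    "Stall by LSU basic latency": 9.50,
    "Stall by DIV/MTSPR/MFSPR inst": 6.02,
    "Stall by FXU basic latency": 6.93,
    "Stall by FDIV/FSQRT inst": 0.03,
    "Stall by FPU basic latency": 2.10,
    "Other stalls": 13.39,
}

# CPI fraction of a component = its share of cycles times the total CPI.
cpi_fraction = {name: pct / 100 * CPI for name, pct in pct_cpi.items()}

ranked = sorted(pct_cpi.items(), key=lambda item: item[1], reverse=True)
for name, pct in ranked[:3]:
    print(f"{name:45s} {pct:6.2f}%  {cpi_fraction[name]:.2f}")
```

Sorting by share puts the D-cache miss stall at the top, which is the same conclusion the analysis above reaches by inspection.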





CPI analysis -- an example

The following section illustrates an example showing how CPI was used to identify and resolve a performance problem on a POWER5 system. The application under study is the program 179.art in the floating-point test suite of the SPECcpu2000 benchmark. The C program 179.art uses the Adaptive Resonance Theory 2 neural network to recognize objects in a thermal image. The objects are a helicopter and an airplane. First, the neural network is trained on the objects. When training is complete, the learned images are found in the scanfield image. A window corresponding to the size of the learned objects is scanned across the scanfield image and serves as input for the neural network. The ART2 neural network attempts to match the windowed field of view image with the learned image. The field of view with the highest level of matching confidence is output. For a detailed description of the benchmark, see the article "Performance Workloads Characterization on POWER5 with Simultaneous Multi Threading Support" linked in Resources.

Table 4 shows the CPI breakdown of the application: 0.31, or 13%, of the total CPI was spent completing instructions, while 2.11, or 87%, was spent in various stall conditions. Stalls in the Load/Store Unit (LSU) account for 1.63, or 67%, of the total CPI. Further examination of the data reveals that 1.43 of that 1.63, or 88%, was spent in data cache miss penalty.


Table 4. CPI Breakdown Analysis 1

                                                   CPI

Group Complete Cycles (GCC)
  Base Completion Cycles
    PPC Base Completion Cycles                     0.31
  Overhead of Cracking                             0.00
  Overhead of Grouping Restriction                 0.00
  total GCC                                        0.31
GCT Empty Cycles
  Icache Miss Penalty                              0.02
  Branch Mispredict Penalty                        0.00
  Store Stall Penalty                              0.00
  Other (flush, etc.)                              0.01
  total GCT empty cycles                           0.03
Completion Stall Cycles (CSC)
  Stall by LSU Inst
    Stall by Reject
      Stall by Xlate                               0.00
      Other Reject                                 0.09
    Stall by Dcache miss                           1.43
    Stall by LSU Latency                           0.10
    total LSU stall                                1.63
  Stall by FXU Inst
    Stall by Div/MTSPR/MFSPR                       0.00
    Stall by FXU Latency                           0.03
    total FXU stall                                0.03
  Stall by FPU Inst
    Stall by FDIV/FSQRT                            0.00
    Stall by FPU Latency                           0.40
    total FPU stall                                0.40
  Other Stalls                                     0.02
  total CSC                                        2.08
Total CPI                                          2.42

Data cache misses normally require flushing the data cache contents to memory. This activity involves flushing each cache line until all of the cache lines are flushed. As discussed in "Performance Monitoring on the PowerPC 604 Microprocessor" (see Resources), POWER5 systems employ a hardware prefetch engine to prefetch data into the L1 data cache.

"When load instructions miss sequential cache lines, either ascending or descending, the prefetch engine initiates accesses to the following cache lines before being referenced by future load instructions. In order to ensure that the data will be in the L1 data cache when needed, the L1 data cache prefetch is initiated when a load instruction references data from a new cache line. At the same time, the transfer of a line into L2 from memory is requested. Since the latency for retrieving a line of data from memory into the L2 is longer than that for moving it from L2 to the L1, the prefetch engine requests data from memory twelve lines ahead of the line being referenced by the load instruction. Eight such streams per processor are supported."

Table 5 shows the flush statistics, where 96% of flushes are due to LSU penalty. In addition, the number of instructions per LSU flush is quite low, and the unaligned load percentage is 5% with an unaligned load rate of about 2%. A perfect loading condition would have an unaligned load percentage of around 0%.


Table 5. Flush statistics

Loads_flushed                  5%
Unaligned_load_percentage      5%
Unaligned_load_rate            2%
Unaligned_store_percentage     0%
Unaligned_store_rate           0
LSU_Flush_Rate                 2%
PCT_of_flushes_due2_LSU        96%
Instructions_per_LSU_Flush     47.86
LRQ_Flush_Rate                 0
Instructions_per_LRQ_Flush     24694.53
SRQ_Flush_Rate                 0
Instructions_per_SRQ_Flush     1354.2

Alignment is important for performance, especially for the allocation of the data prefetch streams employed by POWER5 systems. It has been observed that aligned memory references are always executed in fewer cycles than misaligned references, which might require two sequential accesses to transfer the data (see Resources). The IBM compiler XL C/C++ Enterprise Edition Version 7.0 for AIX provides the -qalign option, which allows you to specify which alignment rules the compiler should use for data and code alignment boundaries. The default rule is power alignment, where vector type members have an alignment of 16 bytes and the first element has its natural alignment. The other rule is natural, where all elements have their natural alignment boundary. For example, data type float aligns on a four-byte boundary, double on an eight-byte boundary, long double on an eight-byte boundary, pointer on a four-byte boundary, and so on.

Table 6 shows the impact of using the compiler option -qalign=natural on the application under test. After the program was recompiled with -qalign=natural, the percentage of flushes due to LSU and the LSU flush rate dropped to nil. The unaligned load rate and unaligned load percentage are also 0%. The number of instructions per LSU flush has increased by almost 1,000 times, which shows the importance of aligning load and store instructions for the POWER5 prefetch engine.


Table 6. Impact of -qalign=natural

                                Before       After

Loads_flushed                   5%           0%
Unaligned_load_percentage       5%           0%
Unaligned_load_rate             2%           0%
Unaligned_store_percentage      0%           .
Unaligned_store_rate            0            .
LSU_Flush_Rate                  2%           0%
PCT_of_flushes_due2_LSU         96%          0%
Instructions_per_LSU_Flush      47.86        45485.87
LRQ_Flush_Rate                  0            0
Instructions_per_LRQ_Flush      24694.53     531303.18
SRQ_Flush_Rate                  0            0
Instructions_per_SRQ_Flush      1354.2       49469.69
DL1_reloads_per_DL1_Miss        0.96         1.74
LMQ_Merge_per_miss              0.7          0.59
LMQ_Full_reject_per_miss        1.95         6.24
SRQ_LHS_reject_per_load         0.01         0.04
LSU_REJECT_CDF_per_reference    0.29         0.08

Table 7 shows the impact of the compiler option -qalign=natural on the CPI breakdown analysis. In the After column (after the application was recompiled with -qalign=natural), the CPI for D-cache miss stalls dropped from 1.43 to 0.32, the total LSU stall CPI from 1.63 to 0.45, the completion stall cycles CPI from 2.08 to 0.75, and the total CPI from 2.42 to 1.04.


Table 7. CPI breakdown analysis 2

                                                   Before    After

Group Complete Cycles (GCC)
  Base Completion Cycles
    PPC Base Completion Cycles                     0.31      0.29
  Overhead of Cracking                             0.00      0.00
  Overhead of Grouping Restriction                 0.00      0.00
  total GCC                                        0.31      0.29
GCT Empty Cycles
  Icache Miss Penalty                              0.02      0.00
  Branch Mispredict Penalty                        0.00      0.00
  Store Stall Penalty                              0.00      0.00
  Other (flush, etc.)                              0.01      0.01
  total GCT empty cycles                           0.03      0.00
Completion Stall Cycles (CSC)
  Stall by LSU Inst
    Stall by Reject
      Stall by Xlate                               0.00      0.00
      Other Reject                                 0.09      0.06
    Stall by Dcache miss                           1.43      0.32
    Stall by LSU Latency                           0.10      0.07
    total LSU stall                                1.63      0.45
  Stall by FXU Inst
    Stall by Div/MTSPR/MFSPR                       0.00      0.00
    Stall by FXU Latency                           0.03      0.01
    total FXU stall                                0.03      0.01
  Stall by FPU Inst
    Stall by FDIV/FSQRT                            0.00      0.00
    Stall by FPU Latency                           0.40      0.27
    total FPU stall                                0.40      0.27
  Other Stalls                                     0.02      0.02
  total CSC                                        2.08      0.75
Total CPI                                          2.42      1.04
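The before and after columns of Table 7 can be compared component by component. The Python sketch below computes the relative reduction for a few rows. The values are copied from Table 7; the helper name and the choice of rows are assumptions made for this example.

```python
# (before, after) CPI values copied from Table 7.
table7 = {
    "Stall by Dcache miss": (1.43, 0.32),
    "total LSU stall": (1.63, 0.45),
    "total FPU stall": (0.40, 0.27),
    "total CSC": (2.08, 0.75),
    "Total CPI": (2.42, 1.04),
}

def reduction_pct(row):
    """Relative reduction of a Table 7 row, in percent of the 'before' value."""
    before, after = table7[row]
    return 100 * (before - after) / before

for row in table7:
    print(f"{row:22s} {reduction_pct(row):5.1f}% lower")
```

Most of the drop in total CPI comes from the D-cache miss row, which is consistent with the flush statistics in Table 6.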




Conclusion

This article has shown that CPI analysis is a methodology for studying microprocessor performance as well as for software tuning. With the performance monitoring facilities provided in POWER5, CPI breakdown has become a valuable tool for analyzing the effects of system architecture on workload performance.



Resources

Learn

Get products and technologies

Discuss


About the author

Duc Vianney, Alex Mericas, Bill Maron, Thomas Chen, Steve Kunkel, and Bret Olszewski of the IBM Systems & Technology Group Systems Performance team contributed to this article.



