Level: Intermediate
Ken Milberg, Future Tech UNIX Consultant, Technology Writer, and Site Expert, Future Tech
24 Apr 2007
Identify which AIX® tools to use to monitor your Central Processing Unit
(CPU) for a given situation and find out why some tools might be better than others.
Part 1 of this series discussed the tuning methodology and the importance of having
procedures for CPU performance tuning. It also briefly introduced some performance
tools to use as a part of your tuning repertoire, gave an overview of the POWER
CPU, and discussed how the architectural improvements of the evolution of the POWER
Chip have contributed to the hardware improvements of the System p™ product line.
About this series
This three-part series focuses on the various aspects of the Central Processing
Unit (CPU) performance and monitoring. The first installment of the series
provides an overview of how to efficiently monitor your CPU, discusses the
methodology for performance tuning, and gives considerations that can impact
performance, either positively or negatively. Though the first part of the series
goes through some commands, the second installment focuses much more on the detail
of actual CPU systems monitoring and analyzing trends and results. The third
installment focuses on proactively controlling thread usage and other ways to tune
your CPU to maximize performance. Throughout this series, I'll also expound on
various best practices of AIX® CPU performance tuning and monitoring.
Introduction
Performance tuning is clearly more than running some commands and observing the
output. A UNIX® administrator needs to know which tools to run for what
purpose and what the best methods are for capturing data. There are times when you
might not have 30 days to systematically analyze your data to determine trends, and
there might be instances where you find that you might not even have 30 minutes to
make an important judgment call on what your bottleneck is. After all, that is the
main purpose behind CPU monitoring -- determining exactly what your bottleneck is.
You do not want to tune your CPU unless the data that you've compiled clearly
shows that CPU is the bottleneck. In fact, more often than not, you'll find your
bottleneck will be memory or I/O related rather than CPU related.
As an AIX administrator, one of your most important roles is to tune your
systems. Tuning cannot be done without first monitoring your system and analyzing
the results. This goes for both long-term trending and short-term (that job must
finish in the next hour) issues. While there are specific tools that you can use
that analyze only the CPU, for given circumstances, you might want to use tools
that look at all possible bottlenecks on your system. As you probably already
know, the CPU is the fastest component of the system. If your CPU is a bottleneck,
it affects performance throughout your system. As I go through the tools, please
note that the following commands have been enhanced in AIX 5.3 to allow the tools
to report back accurate statistics on shared partitions using Advanced POWER
Virtualization: mpstat, sar, topas, and vmstat.
Furthermore, the following trace-based tools have also been updated: curt,
filemon, netpmon, pprof, and splat.
Enough with the chatter, let's start monitoring your systems.
UNIX generic CPU monitoring tools
In this section, you'll examine UNIX generic tools that are available in all UNIX
distributions (Solaris to AIX). While some of the output varies among
distributions, most flags work across all UNIX systems. These can help you gather
information on the fly, but I wouldn't rely on them for historical trending and
analysis.
Let's start with vmstat. vmstat reports back information about processes, memory,
paging, blocked I/O, and overall CPU activity. While it has its roots in virtual
memory (the vm in vmstat), I have found unquestionably that running vmstat on a
host is the quickest way for me to determine exactly what is happening on an AIX
server.
Using vmstat
You just received that dreaded call, "Why is the system so slow?", and you need
to do a quick analysis to determine where the bottleneck might be. vmstat is the
best place to start. See Listing 1 for an example of running
vmstat.
Listing 1. Running vmstat
# vmstat 1
System configuration: lcpu=2 mem=3920MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
0 0 229367 332745 0 0 0 0 0 0 3 198 69 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 3 33 66 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 2 33 68 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 80 306 100 0 1 97 1
0 0 229367 332745 0 0 0 0 0 0 1 20 68 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 2 36 64 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 2 33 66 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 2 21 66 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 1 237 64 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 2 19 66 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 6 37 76 0 0 99 0
The most important fields to look at here are:
- r -- The average number of runnable kernel threads over whatever
  sampling interval you have chosen.
- b -- The average number of kernel threads that are in the virtual
  memory waiting queue over your sampling interval. r should always be
  higher than b; if it is not, it usually means you have a CPU
  bottleneck.
- fre -- The size of your memory free list. Do not worry so much if the
  amount is really small. More importantly, determine if there is any paging
  going on if this amount is small.
- pi -- Pages paged in from paging space.
- po -- Pages paged out to paging space.
- CPU section:
Let's look at the last section, which also comes up in most other CPU monitoring
tools, albeit with different headings:
- us -- user time
- sy -- system time
- id -- idle time
- wa -- waiting on I/O
Clearly, this system has no bottleneck to speak of. How do you determine this?
Let's look at the more important fields to analyze in the vmstat output. Even
though this system is running AIX 5.3, you will not see the number of physical
processors or the percentage of your consumed entitled capacity because it is not
running in a micro-partitioned environment. If it were running in a
micro-partitioned environment, you would see these additional fields, as vmstat
was enhanced to work in a virtualized and micro-partitioned environment.
If your us and sy entries together consistently average over 80 percent,
you more than likely have a CPU bottleneck. If they add up to 100 percent, your
system is really breathing heavy. If the numbers are small, but wa (waiting
on I/O) is high (usually greater than 30), this means there might be I/O problems
on the system, which can cause the CPU not to work as hard as it could. If more
time is spent in sy time rather than us time, this means your system
is spending less time crunching numbers than actually processing kernel data. This
is also not a good thing.
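These rules of thumb are simple enough to script. The following is a minimal sketch, not an official tool: it assumes 17-column data rows like those in Listing 1, takes "small" to mean "not over the 80 percent mark," and skips the sy-versus-us comparison when the CPU is nearly idle. The function names, the 10 percent idle floor, and the message strings are all hypothetical.

```python
# Sketch of the vmstat rules of thumb described above. The thresholds of 80
# and 30 come from the text; idle_floor and all names are assumptions.

SAMPLE = """\
kthr memory page faults cpu
r b avm fre re pi po fr sr cy in sy cs us sy id wa
0 0 229367 332745 0 0 0 0 0 0 3 198 69 0 0 99 0
0 0 229367 332745 0 0 0 0 0 0 80 306 100 0 1 97 1
0 0 229367 332745 0 0 0 0 0 0 2 33 66 0 0 99 0
"""


def parse_cpu_columns(vmstat_text):
    """Return a list of (us, sy, id, wa) tuples, one per vmstat data row."""
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows are the only all-numeric lines; headings are skipped.
        if len(fields) == 17 and all(f.isdigit() for f in fields):
            samples.append(tuple(int(f) for f in fields[-4:]))
    return samples


def assess_cpu(samples, busy_limit=80, wait_limit=30, idle_floor=10):
    """Apply the rules of thumb to the average of the collected samples."""
    if not samples:
        raise ValueError("no vmstat data rows found")
    count = len(samples)
    us = sum(s[0] for s in samples) / count
    sy = sum(s[1] for s in samples) / count
    wa = sum(s[3] for s in samples) / count
    busy = us + sy

    findings = []
    if busy >= 100:
        findings.append("CPU saturated")
    elif busy > busy_limit:
        findings.append("likely CPU bottleneck")
    elif wa > wait_limit:
        findings.append("possible I/O problem")
    # Assumption: ignore the sy > us check when the CPU is nearly idle.
    if sy > us and busy >= idle_floor:
        findings.append("more system time than user time")
    return findings or ["no obvious CPU problem"]


if __name__ == "__main__":
    print(assess_cpu(parse_cpu_columns(SAMPLE)))
```

For the three Listing 1 rows used here, the averages are close to zero, so none of the rules fires. Averaging over the whole capture, rather than judging a single line, follows the "consistently average" wording above.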
While the vmstat command is more commonly associated
with memory, I have found that it is the quickest and most accurate way to
determine what my bottleneck is.
So why did the user complain about the system? Because it really seemed like it
was running slow to him. I was only able to get to the root cause after I
determined there were no systems problems and his buddy in the adjoining cube had
no issues to speak of. So I had him reboot his PC and everything came up clean
afterwards. Apparently, something was running haywire on the PC client.
The next day I got another call and started vmstat again
(see Listing 2).
Listing 2. Running vmstat again
# vmstat 1
System configuration: lcpu=2 mem=3920MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
9 0 4200 2746 0 0 0 0 0 0 3 198 69 70 30 0 0 0
4 7 4200 2746 0 0 0 0 0 0 3 33 66 67 31 2 0 0
2 6 4200 2746 0 0 0 0 0 0 2 33 68 65 34 1 0 0
3 9 4200 2746 0 0 0 0 0 0 80 306 100 80 20 0 1 0
2 7 4200 2746 0 0 0 0 0 0 1 20 68 80 20 0 0 0
So what does this tell you?
Clearly, this system is CPU bound. There is no paging going on, nor any I/O
problems to speak of. There are lots of runnable threads and not enough CPU cycles
to process what needs to be done. How long did it take for me to reach this
conclusion? Exactly five seconds. Try doing that with other utilities.
Using sar
The next command, sar, is the UNIX System Activity
Reporting tool (part of the bos.acct fileset). It has been around for what seems
like forever in the UNIX world. This command essentially writes to standard output
the cumulative activity counters that you select with its flags. For example, the
following command using the -u flag reports CPU statistics. As with vmstat, if you
are using shared partitioning in a virtualized environment, it reports back two
additional columns of information: physc and entc, which define the number of
physical processors consumed by the partition as well as the percentage of entitled
capacity utilized.
I ran this command on the system (see Listing 3) when there
were no users around. Unless there were some batch jobs running, I would not
expect to see a lot of activity.
Listing 3. Running sar with no users around
# sar -u 1 5 (or sar 1 5)
AIX test01 3 5 03/18/07
System configuration: lcpu=2
17:36:53 %usr %sys %wio %idle physc
17:36:54 0 0 0 100 2.00
17:36:55 1 0 0 99 2.00
17:36:56 0 0 0 100 2.00
17:36:57 0 0 0 100 2.00
17:36:58 0 0 0 100 2.00
Average 0 0 0 100 2.00
Clearly, this system also shows no CPU bottleneck to speak of.
The columns used above are similar to vmstat entry outputs. The following table
correlates sar and vmstat descriptives (see Table 1).
Table 1. sar output fields and the corresponding vmstat field
| sar | vmstat |
|---|---|
| %usr | us |
| %sys | sy |
| %wio | wa |
| %idle | id |
One of the reasons I prefer vmstat to sar is that, in addition to the CPU
utilization information, it provides overall monitoring information on memory
and I/O. With sar, you need to run separate commands to pull the information. One
advantage that sar gives you is the ability to capture daily information and to
run reports on this information (without writing your own script to do so). It
does this by using a process called the System Activity Data Collector, which is
essentially a back-end to the sar command. When
enabled, usually through cron (on a default AIX partition, you would usually find
it commented out), it collects data periodically in binary format.
AIX-specific CPU monitoring tools
Let's now discuss commands that are specific to AIX. These commands were written
to enable administrators to monitor systems in a partitioned environment. They are
particularly helpful when you are using Advanced POWER Virtualization features,
such as shared processors and Micro-Partitioning.
Using lparstat
When the user first reported system slowness, a decision was made to kick off
lparstat. The purpose of the lparstat command is to
report logical partition (LPAR) information and related statistics. In AIX 5L
Version 5.3, the lparstat command displays hypervisor
statistical data about many POWER Hypervisor calls. The
lparstat command is a relatively new command that is
typically used to assist in shared processor partitioned environments.
I used the -h flag, as shown in
Listing 4, because I also wanted to see the POWER Hypervisor
statistics.
Listing 4. The -h flag for the lparstat command
# lparstat -h 1 5
System configuration: type=Dedicated mode=Capped smt=On lcpu=4 mem=3920
%user %sys %wait %idle %hypv hcalls
----- ---- ----- ----- ----- ------
0.0 0.7 0.0 99.3 44.4 5933918
0.4 0.3 0.0 99.3 44.9 5898086
0.0 0.1 0.0 99.9 45.1 5930473
0.0 0.1 0.0 99.9 44.6 5931287
0.0 0.1 0.0 99.9 44.6 5931274
As you can see, in some ways, the output generated above is similar to the
sar command. Note that for partitions running AIX 5.2
or AIX 5.3 in either a dedicated environment or shared and capped, the overall CPU
utilization is based on the user, sys, wait, and idle
values. In AIX 5.3 partitions running in uncapped mode, the utilization would be
based on the entitled capacity percentage.
mpstat
Another command I use frequently is the mpstat command
(see Listing 5), which is part of the bos.acct fileset. This
is a tool created specifically for AIX 5.3 (unlike lparstat) that displays the
overall performance number for all logical CPUs on your partitioned system. When
you run the mpstat command, two sections of statistics
are displayed. The first section shows the system configuration, which is
displayed when the command starts and whenever there is a change in the system
configuration. The second section shows utilization statistics, which will be
displayed at user-specified intervals.
Listing 5. Running mpstat
# mpstat 1 1
System configuration: lcpu=2 ent=2.0
cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc %ec lcs
0 0 0 0 164 83 40 0 1 100 17 0 0 0 100 0.17 8.3 113
1 0 0 0 102 1 1 1 0 100 3830453 66 34 0 0 0 100 .83 41.6
I like the mpstat command, because it reports back
collection information for each logical CPU on your partition in a format that is
clearly illustrated. You can even see the simultaneous multithreading (SMT) thread
utilization by using the -s option. The downside to
both the lparstat and mpstat
commands is that they require the writing of scripts and other tools to deal with
the formatting of data and graph output. Essentially, you would need to write your
own shell scripts. Though most administrators love to script, they also don't like
to reinvent the wheel. If there are already tools in place to help you analyze
historical data, it makes little sense to write your own utilities.
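As one small example of the scripting involved, the pc and %ec columns are tied together by the entitlement (ent) shown in the configuration line: %ec is the share of the entitled capacity that the consumed physical processors represent. A minimal sketch of that relationship, with a hypothetical function name:

```python
def entitled_capacity_pct(physc, entitlement):
    """Percentage of entitled capacity represented by physc processors."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return 100.0 * physc / entitlement


if __name__ == "__main__":
    # pc values and ent=2.0 as displayed in Listing 5.
    for pc in (0.17, 0.83):
        pct = entitled_capacity_pct(pc, 2.0)
        print(f"pc={pc:.2f} -> {pct:.1f}% of entitlement")
```

With the displayed pc values this gives roughly 8.5 and 41.5 percent, close to the 8.3 and 41.6 in the %ec column. The small gap is presumably because mpstat works from unrounded figures.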
GUI tools
In this section, take a look at the utilities that enable you to graphically look
at your analysis and also allow you to analyze historical data. Although it takes
some time to fully understand these tools, they are more flexible than the
command-line tools you already looked at.
procmon
Let's start with procmon (see Figure 1). This utility
(released in AIX 5.3) not only provides overall performance statistics, but it
also allows you to take action on the actual running processes. It essentially
allows an administrator to either kill or renice a process on the fly. You can
also export procmon data to a file, which makes it a nice data collection tool.
procmon actually runs as a plug-in to the performance workbench, which is started
by using the perfwb (in /usr/bin) command (part of the
bos.perf.gtools.perfwb fileset).
Figure 1. procmon output
What I like about procmon is that it allows you to take action on a process,
which might increase performance on a system. While it has its limitations, I
strongly recommend that you download and use this tool; I have found that most
administrators tend not to.
topas
Another tool that you should be aware of is topas. Truthfully, I've never been a
huge fan of topas (part of the bos.perf.tools fileset), although it has been
improved substantially in AIX 5.3. Prior to these changes, it did not have the
ability to capture historic data, nor was it enhanced for usage in shared
partitioned environments. The changes, which also allow you to collect
performance data from multiple partitions, have really strengthened topas
as a performance management and capacity planning tool. The look and feel
of topas (see Figure 2) is quite similar to top and monitor
(used in other UNIX variants). topas is a utility that displays all kinds of
information on your screen in a text-based GUI type of format. In its default
mode, it shows you the hostname, the refresh interval, and a potpourri of CPU,
memory, and I/O information.
Figure 2. topas display
Some new features also include the ability to run topas on a Virtual I/O Server
(VIO Server). To do this, you would use the following command:
On an LPAR, you would run:
Regarding the performance monitoring features that were introduced in 5.3 TL 4,
topas uses a daemon named xmwlm, which is automatically started from the inittab.
In TL 5 of AIX 5.3, it keeps seven days of data as a default and records almost
all of the topas data that is displayed interactively, except for process and
Workload Manager (WLM) information. It uses the
topasout command to generate the text-based reports.
While topas has come a long way in addressing its deficiencies, a lot of
administrators might prefer another utility -- nmon, for example.
nmon
Easily my favorite of all performance monitoring tools is nmon (not an
"officially" supported IBM tool). The data that you collect from nmon (see
Figure 3) is available either from your screen or through
reports that usually are run from cron. In the words of its creator, Nigel
Griffiths, "Why use five or six tools when one free tool can give you everything
you need?"
Figure 3. nmon sample output
It's important to note that unlike some of the other tools already discussed,
nmon is also available for Linux®, which really helps the Linux on POWER
user base with performance issues. What attracts most administrators to nmon is
that not only does it have a very efficient front-end monitor, as shown in
Figure 3 (which the admin can call upon on the fly), but it
also provides the ability to capture data to a text file for graphing reports, as
the output is in a .csv (spreadsheet) format (see Figure 4).
In fact, moments after running an nmon session, you can actually see the nicely
rendered charts on an Excel spreadsheet, which can be handed off to senior
management or other technical teams for further analysis. Further, unlike topas,
I've never seen any performance-type overhead associated with this utility.
Let's look at a simple task. First let's tell nmon to create a file, name the
run, and do data collection every 30 seconds for 180 intervals (1.5 hours):
# nmon -f -t -r test2 -s 30 -c 180
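The -s (seconds between snapshots) and -c (number of snapshots) values multiply out to the length of the capture. A small helper, with hypothetical function names, shows the arithmetic in both directions:

```python
import math


def capture_hours(interval_seconds, snapshots):
    """Length of an nmon capture, in hours."""
    return interval_seconds * snapshots / 3600


def snapshots_for(hours, interval_seconds):
    """Snapshot count needed to cover the given number of hours."""
    return math.ceil(hours * 3600 / interval_seconds)


if __name__ == "__main__":
    print(capture_hours(30, 180))   # the run above: 1.5 hours
    print(snapshots_for(24, 30))    # a full day at the same 30-second interval
```

Working backward from the duration you want to cover is usually the more useful direction when you set up a collection from cron.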
When the nmon run is completed, sort the file, as shown in
Listing 6.
Listing 6. Sorting the file
# sort -A testsystem_yymmdd_hhmm.nmon > testsystem_yymmdd.hhmm.csv
When this is completed, FTP the .csv file to your workstation, start the nmon
analyzer spreadsheet (make sure you enable macros), and then click on analyze
nmon data (see Figure 4).
Figure 4. nmon analyzer output
The nmon analyzer is an awesome tool, written by Stephen Atkins, that graphically
presents data (CPU, memory, network, or I/O) from an Excel spreadsheet. Perhaps
the only drawback that prevents it from being an enterprise type of tool is that it lacks the
ability to gather statistics on large numbers of LPARs at once, as it is not a
database (nor was it meant to be). That is where a tool such as Ganglia (see
Resources for a link) helps, which has actually received
the blessing of Nigel Griffiths, as the tool can integrate nmon analysis.
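When a spreadsheet is not at hand, the CPU lines in a capture file can also be summarized with a few lines of script. This is a minimal sketch, not part of nmon or the analyzer: it assumes CPU_ALL data lines of the form CPU_ALL,Tnnnn,User%,Sys%,Wait%,Idle%,... so check the CPU_ALL header line in your own capture before relying on the column positions. The sample values and function names are made up for illustration.

```python
import csv
import io

# Made-up capture excerpt for illustration; real files hold many more sections.
SAMPLE = """\
CPU_ALL,CPU Total testsystem,User%,Sys%,Wait%,Idle%,Busy,CPUs
CPU_ALL,T0001,70.0,30.0,0.0,0.0,,2
CPU_ALL,T0002,65.0,34.0,0.0,1.0,,2
MEM,T0001,1.0,2.0
"""


def cpu_all_samples(nmon_text):
    """Return (snapshot, user, sys, wait, idle) tuples from CPU_ALL lines."""
    rows = []
    for rec in csv.reader(io.StringIO(nmon_text)):
        # Data lines carry a Tnnnn snapshot tag; the header line does not.
        if (len(rec) >= 6 and rec[0] == "CPU_ALL"
                and rec[1].startswith("T") and rec[1][1:].isdigit()):
            user, system, wait, idle = (float(v) for v in rec[2:6])
            rows.append((rec[1], user, system, wait, idle))
    return rows


def busiest(rows):
    """Snapshot tag and user+sys percentage of the busiest interval."""
    tag, user, system, _wait, _idle = max(rows, key=lambda r: r[1] + r[2])
    return tag, user + system


if __name__ == "__main__":
    print(busiest(cpu_all_samples(SAMPLE)))
```

Matching on the section name rather than on line position means the same function works whether or not the file has been sorted first.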
Summary
Part 2 of this series reviewed many tools and utilities that you can use to
capture and analyze performance data from System p servers running AIX. Some of
these commands have been available since the beginning days of UNIX. Many are
specific to AIX, and others are utilities that IBM does not officially support, but most AIX administrators use them
all. Regardless of which tool you like the best, you need to use one to instantly
look at performance activity and another tool to capture data for historical-based
performance tuning and trending and capacity planning analysis. Some tools can do
both (for example, nmon), but most are more geared for one or the other. I
encourage you to play around and find the tools that not only work best for you,
but ones that can also provide value to folks that might not be systems
administrators capable of reading endless vmstat displays.
Resources
Learn
- Optimizing AIX 5L performance: Check out other parts in this series.
- High-Performance Architecture with a History: Read this paper for a brief
  description of PowerPC® architecture.
- "Power to the People" (developerWorks, May 2004): Read this article for a
  history of chip making at IBM.
- "Processor Affinity on AIX" (developerWorks, November 2006): Using process
  affinity settings to bind or unbind threads can help you find the root cause of
  troublesome hang or deadlock problems. Read this article to learn how to use
  processor affinity to restrict a process and run it only on a specified central
  processing unit (CPU).
- "CPU Monitoring and Tuning" (March, 2002): Read this article to learn how
  standard AIX tools can help you determine CPU bottlenecks.
- "AIX 5L Version 5.3: What's in it for you?" (developerWorks, June 2005): Learn
  what features you can benefit from in AIX 5L Version 5.3.
- Operating System and Device Management: This document from IBM provides users
  and system administrators with complete information that can affect your
  selection of options when performing such tasks as backing up and restoring the
  system, managing physical and logical storage, and sizing appropriate paging
  space.
- "nmon performance: A free tool to analyze AIX and Linux performance"
  (developerWorks, February 2006): This free tool gives you a huge amount of
  information all on one screen.
- "nmon analyser -- A free tool to produce AIX performance reports"
  (developerWorks, April 2006): Read this article to learn how to produce a wealth
  of report-ready graphs from nmon output.
- Check out the following IBM Redbooks:
- Check out other articles and tutorials written by Ken Milberg:
- AIX and UNIX: The AIX and UNIX developerWorks zone provides a wealth of
  information relating to all aspects of AIX systems administration and expanding
  your UNIX skills.
- New to AIX and UNIX?: Visit the New to AIX and UNIX page to learn more about AIX
  and UNIX.
- AIX 5L™ Wiki: A collaborative environment for technical information related to
  AIX.
- Search the AIX and UNIX library by topic:
- Safari bookstore: Visit this e-reference library to find specific technical
  resources.
- See the IBM infocenter for information on LPARSTAT.
- Check out the following wikis:
- developerWorks technical events and webcasts: Stay current with developerWorks
  technical events and webcasts.
- Podcasts: Tune in and catch up with IBM technical experts.
- Future Tech: Visit Future Tech's site to learn more about their latest
  offerings.
Get products and technologies
- IBM trial software: Build your next development project with software for
  download directly from developerWorks.
Discuss
- Participate in the developerWorks blogs and get involved in the developerWorks
  community.
- Participate in the AIX and UNIX forums:
About the author
Ken Milberg is a Technology Writer and Site Expert for techtarget.com and provides Linux technical information and support at searchopensource.com. He is also a writer and technical editor for IBM Systems Magazine, Open Edition. Ken holds a bachelor's degree in computer and information science and a master's degree in technology management from the University of Maryland. He is the founder and group leader of the NY Metro POWER-AIX/Linux Users Group. Through the years, he has worked for both large and small organizations and has held diverse positions from CIO to Senior AIX Engineer. Today, he works for Future Tech, a Long Island-based IBM business partner. Ken is a PMI certified Project Management Professional (PMP), an IBM Certified Advanced Technical Expert (CATE, IBM System p5 2006), and a Solaris Certified Network Administrator (SCNA). You can contact him at kmilberg@gmail.com.