Skip to main content

skip to main content

developerWorks  >  Information Management | Open source  >

DB2/Informix and open source: Database defense against the dark political arts

Real-time monitoring of your database for career enhancement

developerWorks
Document options

Document options requiring JavaScript are not displayed

Discuss

Sample code


New site feature

Check out our new article design and features. Tell us what you think.


Rate this page

Help us improve this content


Level: Introductory

Marty Lurie (lurie@us.ibm.com), Information Technology Specialist, IBM 
Aron Y. Lurie (aron.lurie@gmail.com), 9th Grader/Webmaster, Newton South High School/Hebrew College

21 Dec 2006

A down system is one of the most stressful events for system administrators. This article shows you how to create a real-time monitor for your database so you can be alerted when problems arise, and also provide valuable information to others about the status of the database servers. The focus should be on fixing the problem, not the blame. This is also an asset for application and network administrators for rapid diagnosis of what has caused the system outage. Complete source code is provided for the monitor, including the PHP pages to publish the database status.

QuickStart for the impatient

  1. Download and un-tar the sample code, found in the Downloads section.
  2. Setup open database connectivity (ODBC) to point to the database you wish to monitor.
  3. Configure PHP to graph using the JpGraph package.
  4. Modify the "goqc" script to fit your environment.
  5. Use JMeter or Rational® Performance tester to provide a load for the database.
  6. Configure a remote agent to contrast local connection performance to network performance.

Political dark arts: The blame game

DBAs take great pride in providing reliable database services to their organizations. It comes as no surprise they are very stressed when the database is blamed for performance issues or application outages. This article shows you how to provide a real-time graph of database status (see Figure 1) to reduce false accusations, and identify problems for rapid resolution.


Figure 1. Real-time database status
Database Status graph

The X-axis is time, the left Y-axis is response time in seconds, and the right Y-axis is the number of user connections. Figure 1 clearly illustrates that despite a heavy load of almost 300 users, the database is holding up quite well with response times quicker than 1.5 seconds. Since this simulation is running on a single laptop with only one disk, the performance is outstanding. With this type of graph available in your organization, it will show that there no issue with the database. Conversely, if there is something broken, you'll know about it sooner and will be able to find the root cause that much faster.

Database quality control and statistical sampling

To produce the graph in Figure 1, you need data about the status of the server. Gathering the data and making it available through a Web page is a simple step in Statistical Process Control. For more information on Statistical Process Control, refer to the Resources section. This quality improvement process, championed by Dr. Demming, propelled "made in Japan" from inferior quality to the excellence that it represents today.

There are two pitfalls when gathering data:

To get an accurate sample of an event that takes, for example, 20 seconds, samples must be measured every 10 seconds according to the Nyquist theorem. If you want to measure a database outage to a one minute resolution, that means you have to sample every 30 seconds. In the example script provided, the sampling rate is every two seconds, which provides rapid feedback, but generates quite a bit of sample data in a short period of time.

What about the impact on the result from taking a sample? Since I have calculated the exact momentum of my car keys, the Heisenberg Uncertainty Principle indicates they could be anywhere in the universe. Heisenberg was actually talking about elementary particles' position and momentum, but you get the idea. When a database server slows down, the added workload from people logging in to test and troubleshoot can easily make matters worse. The instrumentation in this example offloads the status Web page display to a different server. This architecture keeps the "spectators" away from the database server so they can see the status without adding to an existing problem.

The Quality Control Sampling Agent

The Quality Control Sampling Agent has the simple task of measuring the status of the database and recording this data in a database table. Almost any data about the database server can be obtained from the very rich capabilities of DB2® table functions or the Informix® sysmaster database.

The deployment architecture of the agent is shown in Figure 2. There is both a local and remote agent option. The sample code implements a single agent. To deploy two agents the schema and code require modification.


Figure 2. Sampling agent deployment architecture
Deployment architecture

Why use a local agent and a remote agent that communicates over the network? Since you are trying to do fault isolation, to fix the problem and not the blame, any differences in the local and remote agent response times immediately identifies network slowdowns. Imagine how much the network team will appreciate getting a call from a DBA helping them identify a problem.

There are two implementations of the agent in the download. The Informix directory contains a native esql/c agent. The ODBC directory contains a DB2 implementation written in PHP for maximum portability.

The program executes the following steps:
  1. Start a timer.
  2. Connect to the database.
  3. Measure the time to connect to the database.
  4. Create a new row in the transtimes table with timestamp, unique key, and connect time.
  5. Gather any statistics about the database that are interesting, and simulate a simple transaction.
  6. Update the row created in step 4, and record the statistics gathered in step 5 and the time to complete step 5.

The graph in Figure 1 shows the number of connections, connect time, and insert-update time. The actual schema of the table (Listing 1) used in the Informix port of this agent includes how much virtual memory is allocated and used. This is an illustration of how Informix sysmaster or DB2 table functions can provide an easy interface to instrumenting the server. You don't have to sed, awk, grep, and perl your way through reams of onstat or snapshot data.

Step 5 above includes simulating a simple transaction representative of the typical workload on the system. The simple transaction makes the data sample representative of what a typical person using the system would experience. For an Online Transaction Processing System (OLTP) taking phone orders, a simulated interaction could be a customer lookup. If the database system you are monitoring is for decision support (DSS), and the queries are long and complicated, don't have the quality control (QC) agent run a large query. See the Heisenberg Uncertainty Principle discussion above. A large DSS query by the QC agent will just waste cycles and slow everyone else down.


Listing 1. Schema used for the Informix esqlc version of the QC agent
                
--  (c)2006 copyright Martin Lurie, sample code, not supported
create database oltpqc;
database oltpqc;
drop table transtimes;

-- the DB2 identity datatype is similar to Informix serial
-- in the php version of the code a simple integer was used and
-- php increments the transkey, this gives maximum portability
create table transtimes ( transkey serial primary key,
			timestamp	datetime year to second,
			connect_time	float,
			session_count	integer,
			vblkused		integer,
			vblkfree		integer ,
			query_insert_time	float
);
      

The DB2 schema used in conjunction with a PHP agent is show in Listing 2.


Listing 2. Schema used for the DB2 PHP version of the QC agent
                
--  (c)2006 copyright Martin Lurie, sample code, not supported
create database oltpqc;
connect to  oltpqc;
drop table transtimes;

create table transtimes ( transkey int primary key,
			timestamp	date,
			connect_time	float,
			session_count	integer,
			query_insert_time	float
);
      

You are not in any way limited to this schema for gathering data. Whatever you choose to instrument with the agent can be included in the data sample table. This project originated due to a "blame game" between the application developers and the database administrators. Instrumenting the number of users and the total memory consumed, showed a linear relationship for memory consumption. This ended the debate and the application was changed so as not to keep spawning new connections to the database when a slow response time was detected.

The table function to find the number of users connected to DB2 is shown in Listing 3. See the Resources section for how to use other DB2 table function capabilities, and in the case of Informix, how to use the sysmaster database.


Listing 3. Query from DB2 table function to obtain concurrent user count
                
--  (c)2006 copyright Martin Lurie, sample code, not supported

 select local_cons +rem_cons_in from table (snapshot_dbm (-1))as snapshot_dbm

      

Informix uses the sysmaster database instead of table functions to provide information about the server status. See Listing 4 for the Informix query to get a user count.


Listing 4. Query from Informix sysmaster to obtain concurrent user count
                
/* this sql is from the esqlc program, so the result set is
stored in a host variable.   */
        select count(*) 
           into :sessioncount
           from sysmaster:syssessions;
      

The agent code is straight forward. The "secret sauce" to get the ODBC database connection for PHP platform independent version going is the odbc.ini file. The documentation for getting a PHP ODBC connection is widely available, refer to the Resources section. Listing 5 shows the odbc.ini for DB2 and Listing 6 shows the Informix listing. If you want to invest the time, you can build PHP with direct drivers instead of ODBC. The ODBC performance is quite adequate for this example, and will be used both as a client, and to gather the data for graphing the response time results. connection


Listing 5. DB2 odbc.ini file for ODBC configuration on Linux
                
[ODBC Data Sources]
SAMPLE=IBM DB2 ODBC DRIVER

[SAMPLE]
Driver=/home/db2inst1/sqllib/lib32/libdb2.so
Description=DB2 Sample database

      


Listing 6. Informix odbc.ini file for ODBC configuration on Linux
                
[ODBC Data Sources]
SAMPLE=INFORMIX DB2 ODBC DRIVER

Driver=/home/db2inst1/sqllib/lib32/libdb2.so
Description=Informix smaple odbc.ini


[ODBC Data Sources]
Infdrv1=IBM INFORMIX ODBC DRIVER
Infdrv2=IBM INFORMIX ODBC DRIVER

;
; Define ODBC Database Driver's Below - Driver Configuration Section
;
[Infdrv1]
Driver=/home/informix/lib/cli/iclit09b.so
Description=IBM INFORMIX ODBC DRIVER
Database=stores_demo
LogonID=odbc
pwd=odbc
Servername=ids_server1

[Infdrv2]
Driver=/home/informix/lib/cli/iclit09b.so
Description=IBM INFORMIX ODBC DRIVER
Database=oltpqc
LogonID=informix
pwd=useYourOwnPassword
Servername=k_ids
Trace=0
TraceFile=/tmp/odbctrace.out
InstallDir=/home/informix
      

Early warning system: Generating alerts

When response time is slow, it is a good idea for the DBA to know prior to getting a call from the operations center. This is a matter of taking responsibility and pride in knowing when problems arise. You want to know early that there is a problem. This isn't about the blame game, this is about delivering high quality database services.

The oltpqc code expects a positional parameter on the command line for the alarm threshold. This value is in milli-seconds. Responses slower than this time will trigger a Unix/Linux script. The implementation of this alert is shown in Listing 7 for esqlc and Listing 8 for PHP.


Listing 7. ESQLC code to generate an alarm when response time exceeds the threshold
                
/* code to capture the positional paramter from the command line */

 if (argc == 2 ) {
        thresholdms = atoi ( argv[1] ); 

#ifdef PRTDIAG
        printf( "argv[1] %s  thresholdms %d \n", argv[1], thresholdms );
#endif
        printf( "(c)copyright 2005 - alarm %d milli-seconds \n", thresholdms );
        }
        else
        {
        printf( 
		"(c)copyright 2005 - USAGE: %s threshold_millisec\n", argv[0] );
        printf( "example:  %s 1500\n" , argv[0] );
        printf( "generates an alarm when response time is over 1.5 seconds\n" );
        return -1 ;
        }


/* ... lines deleted .... */

/* logic to run a shell script and pass in the response time data 

to the shell script */
	 asprintf ( &alarmstring , "./slowalarm.sh %f %f " , 
                elapsed_con, elapsed_wrk );


 	if ( ( elapsed_wrk+elapsed_con ) > (float)thresholdms/(float)1000 ) 
                system ( alarmstring); 
#ifdef PRTDIAG
        printf ( "times elapsed_wrk %f , elapsed_con %f , thres  %f \n", 
        elapsed_wrk,elapsed_con, ( (float)thresholdms/(float)1000 ) );

#endif
      


Listing 7. PHP code to generate an alarm when response time exceeds the threshold
                

//check to see if alarm should be sounded
if (($third_time - $first_time)*1000 > $thresholdms)
{
        echo "DATABASE SLOW ALERT - CONNECT AND QUERY TIMES" ;
        echo "GREATER THAN $thresholdms MILISECONDS\n\n";
        system ("alert.sh");
}
      

Publish the the sampling results, the limitations of spreadsheets

This project cries out for publishing the data on a Web page. The sample data is not worth much without reporting and distribution. Spreadsheet technology can only take you so far for real time graphing. Emailing a spreadsheet around with the status is quite tedious. If the spreadsheet is configured to do a dynamic refresh and re-connect to the database, then every refresh from every desktop monitoring the database costs database cycles, which is an expensive proposition. Figure 3 shows a spreadsheet analysis of users and response times, but this is here for completeness and is not recommended.


Figure 3. Use a spreasheet to graph the quality samples - note the Y-axis log scale
Spreadsheet analysis

Publish with a Web site

An excellent technology to publish the status information gathered by the quality control agent is a Web site. The graph shown in Figure 4 is based on the JpGraph package (For more information, refer to the Resources section.). This PHP code is a really great set of graphing tools. You'll need to install some prerequisites including gd and php-gd. Graphing Informix against DB2 data was as simple as changing the connect statement. This is the really elegant aspect of using PHP and an ODBC connection to the database to achieve portability.


Figure 4. Sample database status using JpGraph, a PHP graphing package
Database status graph

Notice the JpGraph package allows you to put a jpg image behind the graph. Figure 4 shows the Informix sponsored race car in the background.

The status graph implements an automatic refresh. A page with real-time status is rather lame if it doesn't refresh. This doesn't take anything fancy like AJAX but is simple HTML. Look at Listings 9 and 10, can you tell why one is "bad" and the other is "good"? Listing 9 assumes a cron job is running periodically that performs the following: php qcgraph.php > foo.jpg


Listing 9. The "good" HTML page for auto refresh
                

<META >-EQUIV=Refresh CONTENT="3; URL=index.html">
<h1>Database Real Time Status</h1>
Click the refresh button if the graph does not appear below.
<hr>
<img src="foo.jpg">
<hr>
<!-- this page assumes a cron job is running periodically that 
performs the following: php qcgraph.php > foo.jpg -->
      


Listing 10. The "bad" listing of an auto refresh page, it works, but don't do it this way
                

<META HTTP-EQUIV=Refresh CONTENT="10; URL=auto.html">
<h1>Database Real Time Status</h1>
<img src="qcgraph.php">
<hr>
      

You've probably figured out by now that the "bad" auto refresh code would create a significant workload on the server if lots of people want to see the status. It is the Heisenberg Uncertainty Principle working with Murphy's law. Measuring the status of the database creates more work on the database. When something "goes wrong" the problem is compounded when lots of inquisitive people fire up their browsers to look at the server.

The "good" implementation, Listing 9, minimizes the workload on the database. The status graph jpg image is shared by all the voyeurs, and the workload to support the status seekers can be isolated at a Web server. The Web server doesn't even have to be resident on the same machine as the database. A periodic refresh of the JpGraph .jpg output is required. This could be scheduled in cron, in the sample code provided the Linux watch command is used.

Server performance characterization: Load testing with Apache JMeter

Now put all the pieces together in a single script. To get an interesting graph, and validate that the graph is reporting what you think, you'll need to generate some load on the database. Apache's JMeter is a good option, and of course IBM's Rational Performance Tester offers some advanced capabilities. All the graphs in this article were generated with JMeter as the load generator, also know as the "test harness."

Create a test harness and load test your database. Even if you don't implement any of the performance monitoring, or graphing, do load testing. If the only thing you take from the article is doing load testing prior to putting an application in production, it will have been well worth writing it.

Load testing to the point of breaking the server(s) lets you know the performance envelope of you systems. The number of users, complexity of queries or transactions, and other aspects of your application should be characterized. Armed with performance envelope graphs, combined with tracking your server workload growth trends, you can win the battle for upgrading a server before there is a major crisis.

The script in Listing 11 fires up all the elements of the performance monitor and the load testing environment. Using a shell script for complex starting sequences is highly recommended to minimize typos and "thinkos."

It may come as a shock to many, but there are comments in the script that explain each step...


Listing 11. Putting it all together with automated load testing.
                
# (c)copyright 2006 Martin Lurie   Sample code, not supported
echo must run as root... so we can restart the httpd and informix 
# read foo just makes the script pause till you hit enter
# todo: automate a user id check instead of this prompt
read foo
#  everyone tails the server log on the screen, right ?
xterm -exec tail -f /opt/informix/online.log &
. /opt/informix/iunset
# pick up the ids v10 env
 . /opt/informix/ifx_env 
# stop the server in case it is active
 su informix -c "onmode -yuk"
# start the server
 su informix -c oninit
# now pick up the csdk 2.8x environment - better yet - get 
# the bug fixed version of the 2.9 csdk
. /opt/informix/iunset
 . /home/informix/ifx_env 
 /etc/init.d/httpd restart
# must install php graph libraries 
cd /var/www/html/
xterm -exec watch -d -n4 "php ./qcgraph.php > foo.jpg" &
mozilla http://karmiel/index.html&
 cd /home/lurie/edrive/src/esql/oltpqc
 xterm &
 xterm -exec watch -d -n 5 ./oltpqc 1500 &
cd /home/lurie/tmp/jakarta-jmeter-2.2/bin/
./gojm &
      

JMeter in action doing JDBC requests is shown in Figure 5. This graph shows an average response time of 573 milliseconds. Some might be happy with this response time. If you look at the standard deviation of 796 milliseconds, the users may not be as happy as the average would predict. The good news is the standard deviation, the red line, is trending downward. The variability in response time is decreasing, so users see less variation when they press the Enter key. Closer examination of the average, the dotted black line, and of the workload, shows this server is in a startup transient state. As the connection pool from JMeter is established, the server provides much more consistent workload response.


Figure 5. JMeter load generation, graph results page
Database status graph

Troubleshooting example

Figure 6 shows periodic long connect times. Most DBAs assume this is a long checkpoint, but surprisingly all the checkpoints in online.log are 0 seconds. So what is the problem?


Figure 6. Periodic long response time problem
Database status graph

The issue here turned out to be unrelated to the database. Whenever a system seems to "hang" for about eight to 10 seconds and then runs fine for a while, a DNS network problem is one of the most likely suspects. The periodic long connect time was eliminated by fixing a temporary DNS problem (in this case the fix was as simple as re-connecting the laptop computer to the LAN so it could access the DNS server).

Where to next?

You've seen a number of technologies to use for doing database monitoring. The techniques included code portability using a standard API (ODBC), and graphing results for Web publication using JpGraph. The performance characterization of your environment with a test harness is critical to managing your servers instead of being managed by them.

There is always a build vs buy option. If creating the real-time monitor described in this article is to daunting, check out the IBM Tivoli® Monitor. It has many predefined agents, and a console that allows drill-down on individual issues. Figure 7 is an example of Tivoli Monitoring.


Figure 7. Tivoli Monitoring, if you'd rather buy than build
Tivoli Monitor

There are a number of enhancements that would be useful if implementing the sample code. If the network or database is down, the oltpqc client continues to wait for a response. Limiting the wait time and sending an alert on an error return code would be a very good thing to add.

When you discover issues with the server or network, there are a wealth of tools available to diagnose the problems and find a root cause. At the operating system level the sar, vmstat, and top commands are a good place to start to verify basic system hygene. Look for excessive CPU utilization and memory paging or swapping -- these are performance killers. The ping, netstat, nmap, and ethereal utilities provide lots of capabilities for understanding network issues.

If the operating system is in good shape, then the Informix onstat and DB2 snapshot utilities can be used to dig into database specific diagnostic information. These utilities are well documented and lots of examples exist on the Internet -- a simple google query provides lots of resources.

I hope you've found this useful and that it helps you do better problem isolation. There is no point wasting time blaming the database when real-time database status reporting helps locate the real issue.




Back to top


Download

DescriptionNameSizeDownload method
Source code for monitoroltpqc.zip10KBHTTP
Information about download methods

More downloads



Resources

Learn

Get products and technologies
  • Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2, Informix, Lotus®, Rational®, Tivoli, and WebSphere®.

  • Build your next development project with IBM trial software, available for download directly from developerWorks.


Discuss


About the authors

Photo: Marty Lurie

Marty Lurie started his computer career generating chads while attempting to write Fortran on an IBM 1130. His day job is in WebSphere Systems Engineering at IBM, but if pressed he will admit he mostly plays with computers. His favorite program is the one he wrote to connect his Nordic Track to his laptop (the laptop lost two pounds, and lowered its cholesterol by 20%). Marty is an IBM-certified Advanced WebSphere Administrator, Certified DB2 DBA, Certified Business Intelligence Solutions Professional, Informix-certified Professional, Linux+ Certified, and has trained his dog to play basketball. You can contact Marty at lurie@us.ibm.com.


Aron Lurie was midway through 10th grade while writing the article. He has had a web development business for 5 years, and ever since 4th grade he has been teaching himself new languages. He is the Webmaster for his school newspaper, and has been Webmaster for his USY chapter and the Hebrew College. In his spare time, Aron is a member of the Newton South ski racing team. Unfortunately, he has not yet found a way to combine computers with skiing.




Rate this page


Please take a moment to complete this form to help us better serve you.



 


 


Not
useful
Extremely
useful
 


Share this....

digg Digg this story del.icio.us del.icio.us Slashdot Slashdot it!



Back to top


This is the first trademark attribution statement. This is the second trademark attribution statement. Other company, product, or service names may be trademarks or service marks of others.