Level: Intermediate
Marcelo Perazolo (mperazol@us.ibm.com), Autonomic Computing Architecture, IBM
Abdi Salahshour (abdis@us.ibm.com), Senior Software Engineer, IBM
10 Apr 2007
So how do you set up "triage" problem determination? This article describes aspects of event visualization for triage problem determination that use concepts of autonomic computing -- such as Log and Trace Analyzer for Java Desktop (LTA-JD) -- and symptoms to represent, detect, evaluate, and resolve incidents and problems related to business mission-critical infrastructure management and operations. This two-part article also covers event and symptom visualization and processing methods of LTA-JD to enable efficient proactive avoidance of these incidents and problems. In this second part, you'll take a more detailed tour of the framework in action.
To recap from the first part of this article, it's a simple equation -- the task of event monitoring increases in complexity as the volume and number of event sources increase. And poor visualization of events leads to poor problem detection and root cause analysis, which equates to lost time, bad business practices, and an increased cost of recovery. There is a need to improve visualization of events and associated symptoms and thereby to improve the human experience as it relates to problem detection, isolation, and prevention. Autonomic computing management applications can support specific management styles that define their capabilities and requirements for the set of manageable resources they monitor.
There are several different approaches to event visualization. Typically, an event-monitoring solution involves a human operator who is responsible for the analysis and reaction to problems associated with events. Operators rely on their experience and perception of event combinations to determine when a problem happens and how to resolve it. Now, combinations of multiple events can reveal more complex problems in the IT environment; it is at this point where human analysis may become difficult and time-consuming.
It is also at this point that monitoring solutions should be implementing automatic correlation of event combinations to make the job less onerous for system engineers; this automatic correlation includes the running of root cause analysis and the grouping of events by their relative contribution to problem trails. These automatically determined root cause events should then be presented to human operators for review and reaction (and hopefully, a successful resolution of the problem).
A supported management style can be hands-on, hands-off, or both. When using a hands-on style, the autonomic manager polls the resources it is managing to determine when it needs to take action. In other words, it is the method of choice for making the combination of human and user interface play the role of a manual manager in the autonomic computing architecture (see Resources). For example, a manual manager may monitor multiple event sources and when it observes one or more events of specific significance (which is a pattern also known as a symptom in autonomic computing terminology), the manager may initiate one or more actions to mitigate or resolve the observed problem.
Commonly, problematic symptoms are detected when all the events that satisfy the criteria, also known as symptom rules, for that symptom are observed. In fact, one of the main goals of this series of articles is to describe a way to facilitate incremental detection and visualization of symptoms and a method to share domain knowledge, or symptom definitions, among human operators by combining events and symptom visual semantics together in a most efficient way. This method is implemented as a simple event visualizer known as the Log and Trace Analyzer for Java Desktop (LTA-JD), a tool capable of collecting, merging, filtering, sorting, displaying, and analyzing contents of standardized event sources (for example, Common Base Event and Web Services Distributed Management [WSDM] Event Format or WEF) for problem isolation and triage to problem analysis.
Together, the triage function along with the superior visualization mechanisms offered by the LTA-JD improve root cause analysis as well as problem prediction and reaction. Domain expertise and semi-structured information resembling symptom rules can be easily mined and captured using industry-standard XPath expressions for quick detection and visualization of symptomatic events.
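As a rough illustration of how an XPath-style expression can pick out symptomatic events, here is a minimal Python sketch. It is not the LTA-JD's own rule engine. The XML is a simplified, namespace-free stand-in for Common Base Event data, the sample events are invented for illustration, and Python's `xml.etree.ElementTree` supports only a subset of XPath, so the expression is written to fit that subset.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for Common Base Event XML (an assumption;
# real events carry more attributes and an XML namespace). Sample data is invented.
EVENTS_XML = """
<CommonBaseEvents>
  <CommonBaseEvent creationTime="2007-04-10T10:01:12Z" severity="10" msg="PDWebApp started">
    <sourceComponentId component="WebSphere Application Server"/>
    <situation categoryName="StartSituation"/>
  </CommonBaseEvent>
  <CommonBaseEvent creationTime="2007-04-10T10:05:02Z" severity="30" msg="Database manager stopped">
    <sourceComponentId component="DB2"/>
    <situation categoryName="StopSituation"/>
  </CommonBaseEvent>
  <CommonBaseEvent creationTime="2007-04-10T10:05:40Z" severity="50" msg="Connection to database failed">
    <sourceComponentId component="WebSphere Application Server"/>
    <situation categoryName="ConnectSituation"/>
  </CommonBaseEvent>
</CommonBaseEvents>
"""

# Hypothetical rule: select every event whose situation is a ConnectSituation.
# The trailing "/.." steps from the matching <situation> back up to its event.
RULE = ".//situation[@categoryName='ConnectSituation']/.."

root = ET.fromstring(EVENTS_XML)
matches = root.findall(RULE)
messages = [event.get("msg") for event in matches]
print(messages)
```

A full XPath 1.0 engine would let the same rule be written as a predicate on the event element itself; the parent step is used here only because of the subset that `ElementTree` implements.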
In this two-part article, we'll show you the fundamentals of event monitoring, symptom detection for problem analysis and reaction, and the enhanced visualization attributes implemented by the LTA-JD. We focused on the fundamentals in the first part; this article focuses on the details.
Gradual detection capabilities
As previously stated, the LTA-JD is a stand-alone, simple-to-use Java™ event viewer that provides the ability to gather, merge, filter, sort, display, and analyze contents of log files and events from a large number of products in a single view for problem isolation and triage to problem analysis. It uses standard event formats to aggregate event data, including the Common Base Event.
The LTA-JD is also capable of associating visual information to sets of events as they are processed, informing the end user of their participation in a "composite event" or a symptom. Furthermore, symptoms can be correlated together to compose incidents and problems that are traditional elements in business processes. The following list summarizes the major capabilities of the LTA-JD:
- Enables end-to-end viewing of event sources across the heterogeneous environment
- Correlation on timestamp, or sorting on any Common Base Event property
- Filtering and multilevel sorting of any event properties
- Custom highlighting of triage events (simple symptomatic event selection rules)
- Ability to analyze events using symptom catalogs
- Ability to compress events view to the triage events only
- Save and share filters, highlighters, and configuration settings (import/export) for collaboration
- Customizable summary event view with any event property
- Ability to select and expand any row from the summary view to display the full Common Base Event attributes
- On-demand conversion of the time zone and format of the collected events
- Support remote event source access through HTTP, FTP, or IBM® Tivoli® Common Event Infrastructure
- Forward events (analysis results) to other LTAs for more advanced correlation and collaboration
The functions of the LTA-JD can best be described using the following scenario. An end user working at his desk is running a WebSphere® Application Server application, PDWebApp, which uses a DB2® database server in a Microsoft Windows® XP environment. The user experiences difficulties accessing the server at about 10 a.m. and wonders why.
To avoid the undesired "blame storming" that commonly occurs when a multi-component solution experiences failures, you can use the LTA-JD to help determine the cause of the problem -- you simply analyze the log events generated by PDWebApp, DB2, and Windows Application/System Events.
You start by investigating the events from the diagnostic sources of the components that are deemed to be the source of the problem. This includes the WebSphere Application Server activity.log file, which contains the operational events generated by the app server and the application running in that environment. In addition, you will need to examine the diagnostic logs from DB2 and the Windows event logs for any possible database manager or network anomaly.
To import the above-mentioned event sources, select File > Add/Remove Event Sources from the menu bar. Then select Import Event Source from the Add/Remove event source window to display the Add event source window. For each log file, you must select an Event source type that represents the type of event source you intend to import. The event source types are associated with the adapters provided by the Generic Log Adapter, which convert proprietary product log formats to a consistent event format, the Common Base Event format.
Figure 1 shows the WebSphere Application Server activity.log file. You can repeat this for the DB2 and Windows event logs.
Figure 1. Import event source (the WebSphere Application Server activity.log file)
In addition, you may define a filter or adjust the timestamp of the events per event source. Figure 2 shows how to provide a filter or a timestamp adjustment. You may enter a filter directly in the filter entry field, select one from the pull-down items, or compose one using the Rule Builder as shown in Figure 2. This filter applies to an individual event source, unlike the filter that is defined in the Result view.
Figure 2. Individual event source filter and delta time
Furthermore, you may add a delta time to adjust the timestamp of the incoming events. This does not modify the original timestamp of the incoming events, but does modify the events displayed in the Result view. You may repeat the filter and delta time for each event source.
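The delta-time idea can be sketched in a few lines of Python. This is only an illustration of the behavior described above (the original timestamp is left alone and an adjusted copy is what gets displayed). The `Event` class, its field names, and the one-hour offset are assumptions made for the example, not part of the LTA-JD.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Event:
    """Hypothetical, minimal event record used only for this sketch."""
    source: str
    creation_time: datetime
    msg: str

def apply_delta(events, delta):
    """Return adjusted copies for display; the incoming events are not modified."""
    return [replace(e, creation_time=e.creation_time + delta) for e in events]

# Assume the DB2 host clock runs one hour behind the other sources.
db2_events = [Event("DB2", datetime(2007, 4, 10, 9, 5, 2), "Database manager stopped")]
displayed = apply_delta(db2_events, timedelta(hours=1))

print(db2_events[0].creation_time)  # original timestamp, unchanged
print(displayed[0].creation_time)   # adjusted timestamp used for display
```

Keeping the event record immutable (`frozen=True`) is one simple way to guarantee that the adjustment never leaks back into the source data.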
When all the intended event sources are specified, they are imported in a Common Base Event format, filtered, merged, and displayed in the LTA-JD main view.
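To make the merge step concrete, the following Python sketch interleaves several event sources by creation time. It assumes each source is already in time order, which is typical for log files but is an assumption here. The tuples and sample messages are invented, and `heapq.merge` is simply one convenient standard-library way to do a time-ordered merge, not necessarily how the LTA-JD does it.

```python
import heapq
from datetime import datetime

# Each source is a list of (creation_time, source, message) tuples,
# assumed to be already sorted by creation time. Sample data is invented.
was_log = [
    (datetime(2007, 4, 10, 10, 1, 12), "WAS", "PDWebApp started"),
    (datetime(2007, 4, 10, 10, 5, 40), "WAS", "Connection to database failed"),
]
db2_log = [
    (datetime(2007, 4, 10, 10, 5, 2), "DB2", "Database manager stopped"),
]
windows_log = [
    (datetime(2007, 4, 10, 10, 0, 30), "Windows", "Service control event"),
    (datetime(2007, 4, 10, 10, 6, 0), "Windows", "Network connection lost"),
]

# Lazily merge the sorted inputs into one time-ordered stream.
merged = list(heapq.merge(was_log, db2_log, windows_log, key=lambda e: e[0]))

for creation_time, source, msg in merged:
    print(creation_time.isoformat(), source, msg)
```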
Next, you need to define simple rules to identify events of interest -- this helps in identifying symptomatic events and in visualizing them better through highlighting. Commonly, these rules are defined by subject matter experts or support engineers performing problem determination. In the previous article in the section "Symptom definition and composite visualization," we described how to define symptomatic rules/highlighters. For this scenario, we used the Rule Builder and defined the following rules to highlight symptomatic events:
- Application normal Start Situation (green)
- WebSphere Application Server fatal connection failures (Red)
- WebSphere Application Server Communication errors (Pink)
- Application with Connection Failure Situations (dark yellow)
- DB2 Database Manager stop situations (yellow)
- Network connection failure (orange)
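One way to picture highlighter rules like the ones listed above is as an ordered list of (description, color, predicate) entries, with each event taking the color of the first rule it satisfies. The Python sketch below is a hypothetical model only: the event fields, the predicates, and the first-match-wins policy are assumptions for illustration and are not taken from the LTA-JD.

```python
# Each rule: (description, color, predicate). The descriptions and colors follow
# the scenario; the predicates and event fields are assumptions for this sketch.
RULES = [
    ("WebSphere Application Server fatal connection failure", "red",
     lambda e: e["component"] == "WebSphere Application Server" and e["severity"] >= 60),
    ("DB2 Database Manager stop situation", "yellow",
     lambda e: e["component"] == "DB2" and e["situation"] == "StopSituation"),
    ("Application normal start situation", "green",
     lambda e: e["situation"] == "StartSituation"),
]

def highlight(event):
    """Return (color, description) for the first matching rule, or None."""
    for description, color, matches in RULES:
        if matches(event):
            return color, description
    return None

events = [
    {"component": "WebSphere Application Server", "severity": 10, "situation": "StartSituation"},
    {"component": "DB2", "severity": 30, "situation": "StopSituation"},
    {"component": "WebSphere Application Server", "severity": 60, "situation": "ConnectSituation"},
    {"component": "Windows", "severity": 10, "situation": "ReportSituation"},
]

colors = [highlight(e) for e in events]

# Keeping only the events that matched some rule compresses the view to the triage events.
highlighted_only = [e for e in events if highlight(e) is not None]
print(colors)
```

Returning the rule description alongside the color makes it easy to show why an event was highlighted, for example in a tooltip.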
As the events are imported from the specified event sources (log files), they are matched against the rules and are highlighted if they meet the rules that identify the symptomatic events. Figure 3 shows the result of importing the event sources specified earlier, sorted in ascending order of occurrence. To sort the events, you may click on any of the columns that appear in the Result view and toggle-click through ascending, descending, and no sort (the order in which the events were merged). You may also sort on two or more columns; to do that, after the first column is sorted, hold the Ctrl key and click on subsequent columns as appropriate. Furthermore, the columns in the Result view can be customized. To do so, select View > Select result columns and choose any of the Common Base Event properties that appear in the Available Properties list.
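Multilevel sorting of this kind can be sketched in Python by relying on a stable sort: sort by the least significant column first, then by the more significant one. The dictionaries and their values below are invented for illustration, and the timestamps are ISO 8601 strings in the same time zone so that plain string comparison orders them correctly.

```python
events = [
    {"component": "DB2", "creationTime": "2007-04-10T10:05:02Z", "severity": 30},
    {"component": "WebSphere Application Server", "creationTime": "2007-04-10T10:05:40Z", "severity": 50},
    {"component": "DB2", "creationTime": "2007-04-10T10:01:00Z", "severity": 10},
    {"component": "WebSphere Application Server", "creationTime": "2007-04-10T10:01:12Z", "severity": 10},
]

# Single-column sort: ascending creation time.
by_time = sorted(events, key=lambda e: e["creationTime"])

# Two-column sort: severity descending, then creation time ascending.
# Python's sort is stable, so events with equal severity keep their time order.
by_severity_then_time = sorted(by_time, key=lambda e: e["severity"], reverse=True)

for e in by_severity_then_time:
    print(e["severity"], e["creationTime"], e["component"])
```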
Figure 3. Combined event/symptomatic events visualization
Figure 3 shows, in ascending order, events from multiple sources (WebSphere Application Server, DB2, and Windows) sorted by creation time. Notice there are 7,205 valid Common Base Event results and two invalid events. Typically when an error occurs, thousands of events are generated -- this can make the time-to-resolution very costly.
The 7,205 events are merged and sorted in ascending order of occurrence (this is also known as time correlation). To sort the events by creation time, you may place the cursor on the creation time column and click the mouse select button. Press the button once to sort in descending order, again for ascending order, and a third time to return to the order in which they were originally merged. To observe individual Common Base Event details, select the desired event from the results summary area (any field). The events are displayed in the events detail area as you move through the events in the results summary area. To navigate through the events detail area, select the tabs along the top of that area.
Knowing the timeframe in which an error occurred (in this example, between 10 a.m. and 2 p.m.) lets you focus on only those events that are meaningful (also known as removing noise). You can do this by employing a filter. In this example, we employed a filter based on creation time. To do this, you may use the Rule Builder to build a filter rule (from the menu bar, select View > Add/Remove filter, and then press the Rule Builder button). If you have already defined a filter, use the filter field at the top of the Result view: select the pull-down arrow and choose your desired filter. In this case, the result is fantastic -- we went from more than 7,200 events to 93 events! See Figure 4.
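A creation-time filter of this kind boils down to a simple range check. The Python sketch below shows the idea with a 10 a.m. to 2 p.m. window; the calendar date, the event dictionaries, and the helper name `in_window` are placeholders chosen for the example rather than anything defined by the LTA-JD.

```python
from datetime import datetime

def in_window(event, start, end):
    """True if the event was created inside the closed interval [start, end]."""
    return start <= event["creationTime"] <= end

# Invented sample events; the date itself is only a placeholder.
events = [
    {"creationTime": datetime(2007, 4, 10, 8, 15), "msg": "Nightly job finished"},
    {"creationTime": datetime(2007, 4, 10, 10, 5), "msg": "Database manager stopped"},
    {"creationTime": datetime(2007, 4, 10, 13, 59), "msg": "Connection to database failed"},
    {"creationTime": datetime(2007, 4, 10, 16, 30), "msg": "User logged off"},
]

start = datetime(2007, 4, 10, 10, 0)
end = datetime(2007, 4, 10, 14, 0)

filtered = [e for e in events if in_window(e, start, end)]
print(len(events), "->", len(filtered))
```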
Figure 4. Highlighted symptomatic events
To further emphasize the events of interest, we use the highlighter function of the LTA-JD, which lets you define rules to identify and highlight the events that may be indicative of a symptom of a problem. The highlighter uses a spectrum of colors that further enhances visualization of symptomatic events and problem patterns (again, Figure 4). To do this, you may use the Rule Builder to build highlighter rules, also known as simple symptom rules (from the menu bar, select View > Add/Remove highlighters and press the Rule Builder button).
Figure 4 also shows many of the potential events that play a role in the symptoms defined by the highlighter rules, along with other events occurring around the same period of time. The events indicated by the numbers (1-5 in circles) are deemed to be related to the same problem. They would be overlaid by the visualization aspects of the root cause (higher-level) symptom, such that all the symptomatic events matched against the simple symptom rules would change their color at once. This provides for gradual colorization and is similar to the deductive process a human administrator experiences when performing diagnosis manually. In this example, this is exactly what happens; it indicates a true incident between instances of an application server (WebSphere Application Server) and a database server (DB2 Universal Database).
You may click on each individual event in the result/summary area and view the details in the Event Detail Area of the display. Placing your cursor on any highlighted event provides you with a tooltip. These tooltips display the description of the rules associated with that color.
At this point, you may start analyzing the triaged, or highlighted, events or further compress and isolate the events to view those highlighted only and reduce the number of events to a more manageable collection as necessary.
Following is the analysis associated with the five numbered buttons in Figure 4:
1. This event indicates the start of the application, PDWebApp (there are other highlighted events in green that indicate the start of other applications).
2. This event indicates the beginning of the application failure. It appears that the application may have tried to recover seven times.
3. This event indicates a communication error between WebSphere Application Server and DB2.
4. This event indicates a "Fatal" communication failure between WebSphere Application Server and DB2.
5. These are other events that were occurring around the same time our application was experiencing the problem. Of specific interest is when the DB2 database was stopped (that may have been an innocent act of nature).
By analyzing the sequence of events, what do you think caused the problem? (The answer is in the conclusion.)
If necessary, you may further isolate the event collection to view only the highlighted events. To do this from the main view where the events are displayed, select View > Show highlighted events (for details on this, refer to the LTA-JD's online help). In this case, the result is remarkable -- we went from more than 7,200 events to 18 events! See Figure 5.
Figure 5. Isolated highlighted events
Furthermore, you may analyze desired events using the Analyze function. This function uses the IBM Symptom Catalog 2.0 and provides additional description of the problem, recommendations, and likely corrective actions to resolve the problem for each selected event separately. To do this you must first import one or more symptom catalogs into the LTA-JD and then from the main view where the events are displayed, select one of the analysis functions (Analyze selected events, Analyze highlighted events, or Analyze all events), either from the File menu bar or by right-clicking on the desired events in the view. See Figure 6.
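Conceptually, analyzing an event against a symptom catalog means looking up the symptoms whose rule matches the event and returning their description and recommendation. The Python sketch below models that lookup with a plain list of dictionaries. It does not use the IBM Symptom Catalog 2.0 format, and the catalog entry, field names, and recommendation text are all hypothetical.

```python
# Hypothetical in-memory catalog; the real symptom catalog format is not modeled here.
SYMPTOM_CATALOG = [
    {
        "rule": lambda e: e["component"] == "DB2" and e["situation"] == "StopSituation",
        "description": "The database manager was stopped.",
        "recommendation": "Check whether the stop was intentional; if not, restart the database manager.",
    },
]

def analyze(event):
    """Return (description, recommendation) pairs for every symptom that matches."""
    return [
        (symptom["description"], symptom["recommendation"])
        for symptom in SYMPTOM_CATALOG
        if symptom["rule"](event)
    ]

selected_event = {"component": "DB2", "situation": "StopSituation"}
unrelated_event = {"component": "Windows", "situation": "ReportSituation"}

results = analyze(selected_event)
no_results = analyze(unrelated_event)
print(results)
```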
Figure 6. Analyzed events
Keep in mind that the LTA-JD provides time-based correlation (time sequence) and simple analysis for each event; at best it provides visual correlation that requires a human operator to visualize and correlate triage events. If further analysis is required, you may select one or more events, or all the highlighted events, and save them in a file to send or e-mail to the appropriate person for action; you can even send the events directly to another LTA tool such as the IBM Log and Trace Analyzer for Eclipse that can consume the events for deeper correlation and analysis. To do this from the main view in which the events are displayed, select File > Save selected events to Analyzer or one of the other three Save functions. For details on how to do this, refer to the LTA-JD online help.
How did the LTA-JD help with the problem determination scenario we described here?
- First, problems are diagnosed and discovered faster due to the restructuring and standardization of incoming event source formats. Facilitated merging and time-based correlation of different logs, together with filtering and highlighting of relevant events, required no programming skills and reduced the number of events to review by more than 99 percent (7,200+ -> 93 -> 18).
- Second, known problems are recognized and resolved quickly using a symptom catalog of known problems, which contains detailed problem description and resolution information.
- Third, it facilitated and improved collaboration through the sharing of filters, highlighters, symptom catalogs, and knowledge. For details on collaboration (the export and import functions), refer to the LTA-JD's online help.
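The event counts quoted in this scenario (7,205 merged events, 93 after the creation-time filter, and 18 highlighted events) can be turned into reduction percentages with a short calculation. The helper name `reduction_percent` is just for this sketch.

```python
def reduction_percent(before, after):
    """Percentage of events removed when going from `before` to `after`."""
    return (1 - after / before) * 100

merged, filtered, highlighted = 7205, 93, 18

filter_step = round(reduction_percent(merged, filtered), 1)        # time filter
highlight_step = round(reduction_percent(filtered, highlighted), 1)  # highlighted only
overall = round(reduction_percent(merged, highlighted), 2)         # end to end

print(filter_step, highlight_step, overall)  # 98.7 80.6 99.75
```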
In conclusion
In this article, we've presented the fundamentals of problem determination and some of the autonomic computing artifacts that are necessary for automating common problem determination tasks that are still performed manually in many circumstances. We demonstrated the common infrastructure necessary to perform collaborative and more complex analysis, which is usually beyond the grasp of single or manual management. We also presented a method and an application capable of harmonizing manual operation.
Our goal is to demonstrate that the LTA-JD has in place the infrastructure necessary for performing triage analysis, enabling problem isolation and diagnosis, and can provide symptom analysis to facilitate triage problem determination in a very easy and uniform way. We've shown that you can add extra avoidance rules and actions as components of existing symptoms to make a high level of predictive analysis and proactive avoidance possible.
After IT system administrators are confident enough to delegate common tasks to autonomic elements, they can dedicate themselves to producing the content necessary for such complex tasks as we've described, all integrated in a common visualization and execution environment. They may also find that the interactivity provided by the LTA-JD is a natural way to create their automation elements -- symptoms, rules, and actions. The infrastructure necessary for doing this is already in place -- we just need to work on better prediction and avoidance content to make this function viable.
The problem from Figure 4 appears to have been caused by the database being stopped inadvertently.
Resources
Learn
- Be sure to catch the first part of this article.
- A good resource on the use of symptoms is the "Symptoms deep dive" series, including an intro to the symptoms format (developerWorks, October 2005), fun things to do with symptoms (developerWorks, December 2005), and a standard taxonomy to help classify symptoms (developerWorks, May 2006).
- For more on the Common Base Event format see:
- For more on the WSDM Event Format, check out the OASIS Committee Draft "WSDM: Management Using Web Services (MUWS) Part 1" and "Part 2."
- For more information on the XPath language, try the W3C version 1.0 spec, developerWorks core XML standards, and the Meet the specs series on WS-RT 1.0 (developerWorks, starting in September 2006).
- The IT Infrastructure Library (ITIL) is a cohesive set of best practices, drawn from the public and private sectors internationally, designed to provide approaches to IT service management problems.
About the authors
Marcelo Perazolo is a member of the IBM Autonomic Computing Architecture team, where he serves as an architect for symptoms and other knowledge formats and defines Management Integration Taxonomies related to autonomic computing. He has worked for IBM since 1990 with various assignments in network and systems management. Marcelo received an M.S. degree in Electrical Engineering in 1994. His interests include problem determination and prediction, process optimization techniques, security, correlation technologies, and knowledge representation.
Abdi Salahshour is a Senior Software Engineer, problem determination architect, and Master Inventor at IBM's Autonomic Computing Technology and Development, who started with IBM in 1982 and served in many roles -- from design and development of database diagnostic tools to system management and self-healing architecture and enablement in heterogeneous and distributed environments. He was a member of IBM Problem Determination Council, is one of the authors of the IBM Common Base Event specification, one of the principal designers and implementers of the Generic Log Adapter, and the architect and designer of the Log and Trace Analyzer for Java Desktop.