Level: Intermediate Leon Gong (leongong@us.ibm.com), Software Engineer, IBM Mike Olivas (molivas@us.ibm.com), Senior IT Specialist, IBM Christine Posluszny (cpos@us.ibm.com), Software Engineer, IBM Donna Venditti (donnav@us.ibm.com), Project Leader, IBM George McMillan (gmcmillan@us.ibm.com), Software Engineer, IBM
04 Aug 2005 Design and implement a warehouse ETL process, and learn about warehouse performance and security issues in this article, the third in a series on planning, designing, and implementing a basic data warehouse solution.
Introduction
Data integration is the key concept in data warehousing.
The design and implementation of an ETL (data Extraction, Transformation, and Loading) process is a very important part
of a data warehousing solution. ETL processes are used
to extract business data from multiple data sources, clean the data, integrate it, and load it into data warehouse databases in preparation for data analysis.
ETL process design
Even though the actual ETL design and implementation greatly depends on the ETL tool you choose for your data warehouse project, a high-level systematic ETL design will help to build efficient and flexible ETL processes.
Before diving into the data warehouse ETL process design, remember the rule of the thumb of ETL: The ETL process should not change the data, it should make the data better. If you find that you need to make some changes to business data, and you are not sure whether those changes will alter the meaning of the data, consult your customer before making the changes.
Modulated ETL process design
Because of their procedural nature and potential for hundreds or thousands of operations, it is extremely important to design ETL processes in an elegant way so they will be efficient, scalable, and maintainable. ETL data transformation operations can be roughly classified into six groups or modules: data extraction, verification, cleaning, integration, aggregation, and loading. Arrange these groups to execute the operations in a logical order so that the procedure is the most simplified, has the best performance, and is easy to change. The execution order is shown in the following figure.
Figure 1. ETL data transformation process functional module design
During the business requirement and data analysis phase of the project, you created the data mapping information. There are many ways to document data mapping; an ETL data mapping table is one of the most effective ways to guide ETL process design. You can also use this table as a way to communicate with business customers about data mapping and ETL process issues. There are different levels of ETL data mapping tables, such as entity and attribute levels. Each level has a different level of detailed data mapping information in it. The following table is a simplified example of an entity-level ETL data mapping table. Every "X" in this table represents a link to a document of operation detail or lower level of data mapping.
Table 1. ETL entity mapping table
| Source | Verification | Cleaning | Transformation | Integration | Aggregation | Target | | Account customer | X | X | | X | X | Customer | | Credit customer | X | X | X | | Loan customer | X | | X | | Checking account | X | X | | X | X | Account | | Savings account | X | | X | | Credit account | X | | X | | Loan account | X | X | |
Implement the ETL process in DB2 Data Warehouse
DB2® Universal Database™ Data Warehouse Editions provide improved performance and usability for data warehouse features. DB2 Data Warehouse Center (DWC) is a visual ETL design and implementation tool, which is a part of DB2 UDB. This section examines how to design and implement warehouse ETL processes using DB2 UDB (Version 8.2.1) Data Warehouse Center.
Create the warehouse control database
The warehouse control database contains the control tables that are required to store the Data Warehouse Center metadata. In Version 8.2 or later of the Data Warehouse Center, the warehouse control database must be a UTF-8 (Unicode Transformation Format, or Unicode) database. This requirement provides expanded language support for the Data Warehouse Center. If you try to log on to the Data Warehouse Center using a database that is not in Unicode format, you will receive an error message that you cannot log on. You can use the Warehouse Control Database Management tool to migrate the metadata from a specified database into a new Unicode database.
Here are the steps for creating and initiating a new warehouse control database:
- Make sure the DB2 Warehouse server and related services are started. From the warehouse control database management window, fill in the control database name, schema name (IWH), user ID, and password, and create the warehouse control database. If you already have a warehouse on a previous version of DB2 DWE, you can also use this process to migrate your warehouse control database to the current version.
- Log in to the DB2 Data Warehouse Center with your newly created or migrated control database, as shown in Figure 2. Make sure you use the same user ID and password as in step 1. If the warehouse control database is a remote database, you must catalog the node and the control database.
Figure 2. Login DB2 DWE warehouse center
Note: The DB2 Data Warehouse Center login window allows you to switch among multiple warehouse control databases. This feature is very useful when there are many projects or developers working on same DB2 Data Warehouse server.
Define an agent site
Warehouse agents manage the flow of data between the data sources and the target warehouses. Warehouse agents are available on the AIX®, Linux, iSeries™, z/OS™, Windows® NT, Windows 2000, and Windows XP operating systems, and for the Solaris™ Operating Environment.
The agents use Open Database Connectivity (ODBC) drivers or DB2 CLI to communicate with different databases. Several agents can handle the transfer of data between sources and target warehouses. The number of agents that you use depends on your existing connectivity configuration, and on the volume of data that you plan to move to your warehouse. If multiple processes that require the same agent are running simultaneously, you can generate additional instances of the agent.
An agent site is a logical name for a workstation where agent software is installed. The agent site name is not the same as the TCP/IP host name. A single workstation can have only one TCP/IP host name. However, you can define multiple agent sites on a single workstation. A logical name identifies each agent site.
When you set up your data warehouse, you must define the agent sites that the warehouse will use to access source and target databases. The Data Warehouse Center uses the local agent as the default agent for all Data Warehouse Center activities. However, you might want to use a warehouse agent on a different site from the workstation that contains the warehouse server. You must define the agent site in the Data Warehouse Center, identifying the workstation where the agent is installed. The Data Warehouse Center uses this definition to identify the workstation on which to start the agent.
Figure 3. DB2 Warehouse agent
The figure above illustrates the relationship between the warehouse agent, data source, target, and warehouse server.
Define a warehouse source
Warehouse sources identify the tables and files that will provide data to your warehouse. The Data Warehouse Center uses the specifications in the warehouse sources to access the data. DB2 Data Warehouse Center supports a wide variety of relational and non-relational data sources on all major platforms as shown in the following diagram.
Figure 4. Warehouse data sources
This makes it very easy to configure the connection to the supported data sources from DB2 Data Warehouse Center.
After you establish connectivity to your source and determine which source tables you want to use, you can define a DB2 warehouse data source in the Data Warehouse Center. If you are using source databases that are remote to the warehouse agent, you must register the databases on the workstation that contains the warehouse agent.
The procedure of defining a warehouse data source is different depending on the type of data sources. Here is an example of defining a relational warehouse data source in the DB2 Data Warehouse Center.
To define a relational data source in the Data Warehouse Center:
- In the Data Warehouse Center, open the Define Warehouse Source notebook.
- Add information about the warehouse source.
- Specify an agent site to access a warehouse source.
- Specify information about the source database as shown in Figure 5 below.
- Import the source tables and views into the warehouse source.
- Authorize warehouse groups to access a warehouse source.
Figure 5. Define a warehouse relational data source
Define a warehouse target
Warehouse targets are database tables or files that contain data that has been transformed. You can use warehouse targets to provide data to other warehouse targets. For example, a central warehouse can provide data to data marts on departmental servers. There are two ways to create a warehouse target. One is to import from an existing table or file, and the other is to generate the target by using the warehouse system.
Figure 6. Define a warehouse target table
As you can see from Figure 6, when you define a DB2 warehouse target table, you can specify whether the table is created by the DB2 Data Warehouse Center, and whether it is a part of an OLAP schema, which means it might eventually be used as a dimension or fact table in a multidimensional data model, such as a star model.
Define warehouse subject areas, processes, and steps
A Warehouse step is the definition of a single operation within the warehouse. The Warehouse steps define how you move and transform data. There are many types of warehouse steps you can use in the DB2 Data Warehouse Center:
- SQL (insert, update, and replace)
- File (FTP, file data import and export)
- DB2 Programs (data export, load, table reorganization and statistics update)
- Warehouse Transformers (data clean, key table and time table generation, and invert and pivot data)
- Statistical transformers
When you run a step, a transfer of data between the warehouse source and the warehouse target, or any transformation of that data, can take place. A step is a logical entity in the Data Warehouse Center that defines:
- A link to its source data.
- The definition of and a link to the output table or file.
- The mechanism (either a SQL statement or a program) and definition for populating the output table or file.
- The processing options and schedule by which the output table or file is populated.
A warehouse process contains a series of steps that perform a transformation and movement of data for a specific warehouse use. A process can produce a single table or a set of summary tables. A process might also perform some specific type of data transformation.
Figure 7. Define a warehouse process
A subject area identifies and groups the processes that relate to a logical area of the business. For example, if you are building a warehouse of marketing and sales data, you define a Sales subject area and a Marketing subject area.
Warehouse process and step modes
The Data Warehouse Center lets you manage the development of your steps by classifying steps in one of three modes: development, test, or production. The mode determines whether you can make changes to the step and whether the Data Warehouse Center will run the step according to its schedule. Promoting a step means moving the step to a higher mode (Development -> Test -> Production).
- Development mode: When you first create a step, it is in development mode. You can change any of the step properties in this mode. A warehouse-generated target table for the step does not exist in the target warehouse under this mode. If you want to run the step for testing, you must promote it to test mode.
- Test mode: In the step properties, you can specify that you want the Data Warehouse Center to create a target table for the step. When you promote the step to test mode, the Data Warehouse Center creates the target table. Therefore, after you promote a step to test mode, you can make only those changes that are not destructive to the target table. For example, you can add columns to a target table when its associated step is in test mode, but you cannot remove columns from the target table. Under test mode, you run each step individually. The Data Warehouse Center will not run the step according to its automated schedule.
- Production mode: To activate the schedule and the task flow links, you must promote steps to production mode. Production mode indicates that steps are in their final format. In production mode, you can change only the settings that will not affect the data that is produced by the step. You can change schedules, processing options (except population type), or descriptive data about the step. You cannot change the parameters of the step.
The warehouse step can be promoted, demoted, and run manually, from Data Warehouse Center or programmatically from outside of DB2 Warehouse Center by using an external trigger program.
 |
DB2 Warehouse external trigger program
An external trigger program is a warehouse program that calls the Data Warehouse Center. The external trigger server script can be used to promote and demote a DB2 Warehouse step, and to start or run the process and step. The external trigger program is especially useful for batch promoting and demoting the warehouse processes and steps in both ETL development and testing. You can also use the script to manage the execution order of ETL processes and steps.
The external trigger program consists of two components: the external trigger server (XTServer) and the external trigger client (XTClient). The XTServer is installed with the warehouse server. The XTClient is installed with the warehouse agent for all agent types. Before you can trigger a step from the external trigger program, you must specify the Run on demand option on the Processing Options page of the Properties notebook for that step.
Here are the example external trigger scripts for promoting ETL steps for the Windows platform:
- Make sure that db2XTrigger.jar and common.jar are in your class path:
CLASSPATH=%CLASSPATH%;%DB2PATH%\tools\db2XTrigger.jar;%DB2PATH%\java\common.jar
- From a command window on the XTserver system, go to the %DB2PATH%\java\jdk\bin\ directory and start the external trigger server as:
java db2_vw_xt.XTServer <port_number> |
Where <port_number> is the external trigger service port number on Warehouse server. The default port number is 11004.
- Create and run a client-side script command file that may contain many lines of code. For example (include all parameters on one line when you execute the file):
"%DB2PATH%\java\jdk\bin\java" db2_vw_xt.XTClient <server>
<port> <DWC_user> <password> <ETL_step> <command>
<wait_for_result> |
Where:
- <server> is the Warehouse server.
- <port> is the external trigger service port number on Warehouse server. The default port number is 11004.
- <ETL_step> is the name of the ETL step or process that you are trying to promote.
- <DWC_user> is the Data Warehouse Center user id.
- <command> can have one of following values:
- populate/run a step
- promote a step to test mode
- promote a step to production mode
- demote a step to test mode
- demote a step to development mode
- populate/run a process
- verify that the Data Warehouse Center server is running
- <wait_for_result> is optional. This parameter indicates whether the external trigger program is to return the results of the step or process processing. Choose one of the following values:
- 1: Wait for the step or process to complete. Return 0 if the step or process completes successfully, or return an error if the step or process fails
- 0 or blank: Do not wait for the step or process to complete.
For example: You have three ETL steps (step_A, step_B, and step_C) defined in a warehouse process. The execution order is step _A -> step_B -> step_C. The XTServer name is server1, the data warehouse user ID is dwusr, and the password is psword. Based on this information, the client-side script command file (promoProcess.bat) for promoting the warehouse process should look like:
rem set the variables
rem ********************
Set server=%1
Set DWC_user=%2
Set password=%3
rem promote all steps to test mode
rem *******************************
java db2_vw_xt.XTClient %server% 11004 %DWC_user% %password% step_A 2 1
java db2_vw_xt.XTClient %server% 11004 %DWC_user% %password% step_B 2 1
java db2_vw_xt.XTClient %server% 11004 %DWC_user% %password% step_C 2 1
rem promote all steps to product mode
rem ***********************************
java db2_vw_xt.XTClient %server% 11004 %DWC_user% %password% step_A 3 1
java db2_vw_xt.XTClient %server% 11004 %DWC_user% %password% step_B 3 1
java db2_vw_xt.XTClient %server% 11004 %DWC_user% %password% step_C 3 1 |
Note: When you demote the steps in the process, make sure to reverse the order of the steps.
- Open a DB2 command window on the XTclient. Go to the %DB2PATH%\java\jdk\bin\ directory, and run the client-side script command file promoProcess.bat. If the XTServer and XTClient are on a single system, make sure you run the client-side script command from a different DB2 command window.
>>promoProcess.bat server1 dwusr psword |
 |
Data extraction
Data extraction is the process to capture the source data. There are two primary approaches to capturing data:
- Full refresh
- Incremental update
A full refresh, as the name implies, is simply a full copy of the data to be moved into the staging database. This copy may replace what is in the data warehouse, add a complete new copy at the new point in time, or be compared to the target data to produce a record of changes in the target. The incremental update focuses on capturing only what has changed in the source data.
How to capture the data changes is tightly associated with the data sources themselves; it is really a case-by-case implementation issue. DB2 Data Warehouse Center supports many data capturing methods, including direct SQL select to capture all the data or a subset of it, FTP to capture data from remote data source, direct importing or loading of a data file, and data replication.
Extract large-volume data
The extraction of large-volume data from warehouse data sources is an important issue in any data warehousing project. There are approaches you can take in DB2 Warehouse Center: Select and Insert SQL step or warehouse load utilities.
-
Select and Insert SQL
Incremental commit is an option that allows you to control the commit scope of the data that is managed by the Data Warehouse Center. Incremental commit is available for Select and Insert SQL steps. You can use it when the volume of data to be moved by the agent is large enough that the DB2 log files might fill up before the entire work of the step is complete, or when you want to save partial data. SQL steps will complete with an error if the amount of data being moved exceeds the DB2 maximum log files that have been allocated. The performance of the database can deteriorate because a significant number of commits can occur when you use incremental commit. Use the incremental commit option to specify the number of rows to be processed before a commit is performed. The agent selects and inserts data, committing incrementally, until it completes the data movement successfully.
Figure 8. Options for DB2 Warehouse SQL Select and Insert step
-
Warehouse load utilities
DB2 Load transformer can load bulk data from a delimited file into a DB2 table, replacing or appending to the existing data in the database. By default, DB2 Load only logs the progress messages, not the actual input data, so there is no concern with log file size here. After data load is finished, the tablespace is in pending status; you need to backup the tablespace to make the target table usable. However, the DB2 Load transformer does give you an option to keep a copy of the input data, and if you use this option, the table will be usable immediately after data load.
You can also specify the maximum number of rows to be processed, and many other performance related options as shown in Figure 9.
Figure 9. DB2 Warehouse load utility performance options
Warehouse data replication
The data replication service of DB2 Data Warehouse Center is a powerful and widely used approach to capture data for both full refreshes and incremental updates in the data warehouse.
For an incremental update, the data changes are captured from log files and time-sampled sources. The typical way to extract logged data is data replication. DB2 data replication service is tightly integrated with DB2 Data Warehouse, so data replication can be configured and scheduled from DB2 Data Warehouse Center.
You can use DB2 Data Warehouse Center to define a replication step, which will replicate changes between all DB2 Data Warehouse supported data sources. The replication environment that you need depends on when you want to update and how you want to handle transactions. The Data Warehouse supports five types of replication:
- User copy: a complete, condensed copy of the replication source table. Condensed means that the target table contains a primary key and uses it for updating.
- Point-in-time: a complete, condensed copy of the replication source table at a certain time. The target table has a primary key and a timestamp column.
- Base aggregation: a history table in which new rows are appended for each subscription cycle, using the result of an SQL column function calculation on the replication source table (the base table).
- Change aggregate: a history table in which new rows are appended for each subscription cycle, using the result of an SQL column function calculation against the replication source change-data table, which contains the recently changed data.
- Staging table: a table that produces target tables that contain data from committed transactions. Also called consistent change-data tables.
To set up replication in DB2 Data Warehouse Center:
- Prepare source databases for replication.
- Create the replication control tables.
- Register a source table in DB2 Replication Center.
- Import the defined replication source table into DB2 Data Warehouse Center.
- Define a replication step in DB2 Data Warehouse Center.
- Create the password file that is required by your warehouse step.
- Start the Capture program on the same system as the source database.
- Promote the replication step to test mode or production mode in the DB2 Data Warehouse Center.
- Run the step. When the step runs, the warehouse agent starts the Apply program to process the replication subscription.
Prepare the DB2 source database for replication
In order to enable the DB2 source database for replication, you must set the database configuration parameter LOG_RETAIN to RECOVERY for the retention of the database log files.
- Set the source database configuration parameter LOG_RETAIN value to RECOVERY. This action puts the database into a BACKUP PENDING state.
- Back up the source database to take the database out of the BACKUP PENDING state.
- Check the source database configuration parameters LOGPRIMARY, LOGFILSIZ, and LOGSECOND. Make sure the values are big enough to handle the data changes.
Create capture and apply control tables
Before you can set up replication in the DB2 Data Warehouse Center, you must create both capture and apply replication control tables. The capture control tables are used by the capture program to capture the replication data from the source database. The apply control tables are used to apply the replication data to the target tables.
DB2 provides some flexibility as to where the control tables reside and where the replication programs run. A good approach is to run as much as possible on the staging database to alleviate strain on the operational database. However, the capture program and capture control tables are usually located on the source (or operational) server. One valid scenario is to create the replication control tables for capture on the source database, and the apply control tables on the target database.
Here are the steps for creating the control tables from the DB2 Replication Control Center:
-
For creating capture control tables: In the DB2 Replication Control Center, right-click Capture Control Servers and select Create Capture Control tables. In the Select a Server window, select the warehouse control database.
For creating apply control tables: In the DB2 Replication Control Center, right click Apply Control Servers and select Create Apply Control tables. In the Select a Server window, select the warehouse target database.
-
For creating capture control tables: Select the option Host sources for replication and capture changes to those sources.
For creating apply control tables: Select the option Apply captured changes to target tables.
- Enter appropriate sizing values. The values in the following table are an example:
| Question | Answer |
|---|
| How many source tables will you have? | Less than 100 | | Typically, for one source table, how many target tables will you have? | 1 | | How often will the data be applied to the target tables? | Daily | | How many transactions do you expect to capture in one day? | 1 |
- Leave the default control table schema of ASN, and default table spaces of TCASNCA and TCASNUOW. Select Run Now to create the replication control tables.
Register the source tables for replication
Registering source tables tells DB2 which table changes need to be captured. Before you can use a table or view as a replication source in the DB2 Data Warehouse Center, you must define it as a replication source using the DB2 Replication Center.
- In DB2 Replication Center, select SQL Replication > Capture Control Servers. Choose the source database, then navigate to Capture Schemas. Choose the schema name (ASN) and navigate to Registered Tables.
- Choose the appropriate source tables, and take the default settings for these tables. Select Run now to register the source tables for replication.
Import source tables into Data Warehouse Center
In the DB2 Data Warehouse Center, check the databases or tables listed under Warehouse Sources. If the replication source databases or tables are not already listed, import or add them as warehouse data sources.
Import target tables into Data Warehouse Center
In the DB2 Data Warehouse Center, check the databases or tables listed under Warehouse Targets. If the replication target databases or tables are not listed, import them as warehouse targets. Even though these tables may have already been listed, they need to be re-imported so that DB2 DWC is aware that the tables are registered for replication.
Define the replication step in the Data Warehouse Center
Defining a replication step is similar to defining other warehouse steps. There are five types of replication steps you can define in a replication warehouse step: user copy, point-in-time, base aggregation, change aggregate, and staging table. (These are the same steps explained above.)
By default, the replication step is set to generate the target table. If you want to link the replication step to an existing target table, double-click the target table to bring up the Properties window. On the first tab, check the option Data Warehouse Center created this table. Then use the data link tool to connect the replication step to the target step. After connecting, go back into the target table’s properties, and uncheck the option Data Warehouse Center created this table.
Take special care when you use existing user-created target tables, as opposed to DWC-generated target tables. As mentioned previously, when the replication steps are demoted to development mode, DWC drops the Apply and subscription set associated with the steps. It will create them again when you promote the step, and since the Applies are new at that point, the Capture program will think it needs to provide a full refresh to the target tables. If the Data Warehouse Center created this table option had been checked on each target table, it would have dropped the tables when demoted and re-created when promoted, (along with the apply and subscription set and the table), and the tables would be empty, ready to each receive a full refresh of data. However, if the tables never get dropped, the data remains, and if a full refresh is performed, there will likely be errors if the target tables are connected by foreign keys. The Apply portion of full refresh will fail on a target table that is a foreign key to another table that is already populated, because it will not be able to delete from the parent table.
Figure 10 shows the properties of the warehouse User Copy replication step includes data mapping between data source and target tables, replication subscript set name, event name, apply qualifier(see next section for detail), and user ID/password.
Figure 10. Define a Warehouse user copy replication step
Create password files using the asnpwd utility
For replication in Version 8 or later of the Data Warehouse Center, you must set up and maintain the replication password files by using the replication program asnpwd.
Replication password files are encrypted. The Data Warehouse Center replication steps assume that the password file for a replication step is:
- Located in the directory specified by the environment variable VWS_LOGGING.
- Named
<applyqual>.pwd, where <applyqual> is the Apply qualifier for the Data Warehouse Center replication step.
Before staring the Apply program, the Data Warehouse Center agent adds the user ID and password from this file to the Apply program command.
To create the pwd file, the first time you use asnpwd, open a DB2 command window. Switch to the LOGGING directory (for example, C:\Program Files\IBM\SQLLIB\LOGGING). Run this command:
asnpwd init encrypt password using applyqual.pwd |
Then add a user ID and password:
asnpwd add alias db id <userID> password <password> using applyqual.pwd |
Instead of repeating this command for all the apply qualifiers, you can do it once to generate pwd files, then make copies of the file for every apply qualifier, and rename them to applyqual.pwd -- assuming the replication steps all use the same database, user ID, and password.
Starting the capture program
You can start the replication capture control server from the DB2 Replication Center. Starting the Capture program will initiate the database logging changes that occur to the replication-enabled source tables, including deletes, inserts, and updates. Make sure that the logging directory you specified has enough space to handle the changing data volume.
You can also start the capture program by saving it to a file and running it from the command line, or by saving it as a task.
Promote the replication step to test or production mode
In DB2 Data Warehouse Center, promote the replication steps to test or production. When you promote the step, DB2 will create the subscription set and the Apply programs, and when you demote it will delete them.
In DB2 Replication Center, you can find the newly generated Subscription Sets and Apply Qualifiers under the database name. You might need to refresh the view. From the properties of the subscription sets, you can see the DB2 Data Warehouse Center has mapped the source to target table columns for you.
You can change the properties of a replication step in test mode, and the changes will be valid as long as the replication step is not demoted to development mode. These changes will be lost the next time the replication step is demoted to development.
DB2 replication error logging
If you have any trouble, the default location for the log files can be found in the logging directory (usually C:\Program Files\IBM\SQLLIB\LOGGING). For capture program errors, check the file DB2.<source db name>.<replication schema name>.CAP.log. (For example: DB2.DSACCT.ASN.CAP.log)
The file named <ApplyName>.trc will usually give good error handling information for Apply-related errors.
The file DB2.<target db name>.<Apply name>.APP.log also contains some information related to Apply. (For example: DB2.STAGEDB.APPLYPROD.APP.log)
Data verification
During the business data analysis phase of your project, you produced a set of data quality assumptions. These assumptions specify the responsibilities of both sides of customer and solution provider in data quality issues. Solution providers usually take care of data cleaning and enhancement issues. The customer should at least take care of any data quality issues that can only be addressed in the data source itself and are related to interpreting the meaning of the data. For example:
- Missing data recovery
- Fuzzy data conversion
- Business operation application-related data issues -- data quality issues which can only be fixed from the application itself
You should include the data quality assumptions in the project contract document, because the quality of the business data might greatly affect the project schedule if it is not solved in time and in the proper way. The data quality assumptions can be a good basis for schedule negotiations with your customer.
Even though it is assumed that customers will take their share of responsibilities to address the data quality issues in their business data source, there is still the possibility that poor-quality data will be generated in business data sources in the future. It is important to implement the data verification ETL filter module to reject those data before they negatively affect the subsequent ETL processes. Validation of data consists of a number of checks, including:
- Valid values for an attribute (domain check)
- Attribute is valid in the context of the rest of the row
- Attribute is valid in the context of related rows in this or other tables
- Relationship is valid between rows in this and other tables (foreign key check)
This is not an exhaustive list. It is only meant to highlight the basic concepts of data validation.
Error handling is a process that determines what to do with less-than-perfect data. Data can be rejected, stored for repair in a holding area, or passed on with its imperfections to the data warehouse. Rejected data should be stored in a place where your customer can access it; make sure your customer is informed every time data rejection happens. You should allow fixed rejected business data to be moved into the data warehouse later.
The Clean Data transformer in DB2 Data Warehouse Center can be used for basic data verification purposes. You can also build your own data verification SQL steps or special data verification steps for complex data verification.
Data cleaning
Data cleaning is the process of cleaning valid data to make it more precise and meaningful. Data cleaning includes tasks such as:
- Data merge from data sources
- Domain conversion and synchronization
- Data type and format conversion
- Data splitting for different target tables
A common example of data merge is name and address information. Name and address information for customers are often stored in multiple locations. Over time, these tend to become unsynchronized. Merging data for the customer is often difficult because the data used to match the different images of the customer no longer matches. Data enhancement resynchronizes this data.
You can use the Clean Data transformer in DB2 Warehouse Center to perform basic cleaning, substitution, and mapping operations on source data. The Clean Data transformer operates on specified data columns of the source table that your step accesses. Then the transformer inserts new rows in the target table to which your step writes. Depending on the processing options that you select, data that cannot be cleaned will be written to the target error table. You can also use the Clean Data transformer to clean and standardize data values after loading or importing data as part of a process.
The Clean Data transformer provides the following clean types that you can specify:
- Find and replace: Locates the selected source column value in the Find column of the rules table, and then replaces the value in the target table with the corresponding replacement value from the rules table. A rules table is required for this clean type. A rules table designates the values that the Clean Data transformer will use during the find and replace process.
- Numeric™ clip: Shortens numeric input values that are not within the specified range. Input values within the range are written to the output without any changes. Input values outside the range are replaced by the lower bound replacement value, or the upper bound replacement value. A rules table is required for this clean type.
- Discretize into ranges: Performs discretization of input values based on the ranges in the rules table. A rules table is required for this clean type. If you allow nulls for this clean type, you must put a null value in the Bound column of the rules table.
- Carry over with null handling: Specifies columns in the input table to copy to the output table. You can select multiple columns from the input table to carry over to the output table. A rules table is not required for this clean type. This clean type allows you to replace null values with a specified value. You can also reject nulls and write the rejected rows to the error table.
- Convert case: Converts the characters in the source column from uppercase to lowercase or from lowercase to uppercase, and inserts them into the target column. The default is to convert the characters in the source column to uppercase. A rules table is not required for this clean type.
- Encode invalid values: Replaces with the specified value any values that are not contained in the valid values column of the rules table that you are using. You specify the replacement value in the Properties notebook for the Clean Data transformer. You must specify a replacement value that is of the same data type as the source column. For example, if the source column is of type numeric, then you must specify a numeric replacement value. Valid values are not changed when they are written to the target table. A rules table is required for this clean type.
Most of the cleaning types have the Matching Options window to let you specify how you want matches to be handled.
Data integration
Data integration is a process of consolidating multiple data sources into a uniform data interface for data analysis. The data integration is the most important step in warehouse data transformation processes, and it is the key concept in data warehouse design.
Data integration can be very complicated. In this module you apply data integration business rules, and data transformation logic and algorithms. Source data of an integration process could come from two or more data sources; it usually involves different join operations. The source data could also come from a single data source; this kind of data integration usually involves domain value merge and conversion. The result of integration usually generates new data entities or attributes, which are easy for end users to access and understand.
Data aggregation
Data aggregation is any process in which information is gathered and expressed in a summary form. Data aggregation is always part of data warehouse requirements, which are usually in the form of business reports.
In a multidimensional model, the data aggregation path is the important part of dimension table design. In the data repository or data warehouse, the level of data aggregation is decided on a case-by-case basis. Since the data warehouse is still mostly in a relational data model style, it is best to advise your customer to build business reports from data marts. However, some customers prefer to build reports from the data warehouse directly. In this case, the data aggregation should be considered in warehouse data models. Make sure the data aggregation tables are relatively isolated from the rest of warehouse data schema, so the business requirement changes of the reporting will not affect the basic data warehouse data structure.
Load data to warehouse target tables
Moving data to target tables in the central data warehouse is usually the final step of an ETL process. The best way to load the data depends on what kinds of operations are performed and how much data needs to be loaded. There are two basic ways that you can insert and modify data in database tables:
- SQL insert/update/delete (IUD)
- Bulk load utility
Most applications use SQL IUD operations because they are logged and recoverable. However, bulk load operations are easy to use and very fast when loading a large quantity of data. Which data loading approach to use really depends on the business circumstances; it should be specified in ETL design documents.
ETL data rejection processing
How to deal with rejected business data is an important issue in ETL design. Business data could be rejected when it goes against:
- Business data quality assumptions
- Data reference integrity
- Business data integration rules implemented in ETL processes
You should store rejected business data in a location which both the data warehouse developer/administrator and the end users have agreed on. Fixing the problems in rejected business data is part of data warehouse maintenance; it is usually your customer's responsibility. Since it requires both domain knowledge and database skills, both database administrators and end users should be involved in this task. The fixed business data will eventually re-enter the ETL cycle to make its way to the data warehouse.
ETL processes and steps execution order
Execution order is another important ETL design issue. Even though more and more parallel processing is expected from the data warehouse server, not all ETL processes can be executed concurrently. There are many factors that affect execution order:
- Entity dependency: Referential integrity enforcement determines table and object dependency. For example, the parent entity table needs to be loaded before a child data or relationship table.
- Attribute dependency: Attribute dependency usually means the value of an attribute is calculated based on one or more values of one or more attributes.
- ETL logic modules: ETL modular design order usually determines the ETL step execution order within an ETL process. It is easy to understand that data needs to be verified and cleaned before the data integration step.
- Data integration dependency: Data integration business rules usually contain object and data dependency.
Within a warehouse process, the execution order is defined during the design stage using the warehouse link tool as shown in Figure 7. You can define shortcuts between the steps among warehouse processes to control the process execution order.
Running Warehouse ETL steps
You can run a step on demand, or you can schedule a step to run:
- At a set time
- Only one time
- Repeatedly, such as every Friday
- In Sequence, so that when one step finishes running, the next step begins running
- Upon completion, either successful or not successful, of another step
If you schedule a process, the first step in the process runs at the scheduled time.
Figure 11. DB2 Warehouse work in progress window
You can combine these methods to run the steps in a process. You can schedule the first step to run at a specified date and time. These schedules and cascades are active when the step is in production mode. After you schedule the first step, you can specify that another step is to start after the first step runs, and specify that a third step is to start after the second step runs, and so on.
You can schedule processes as well as steps, and specify that a process is to start after another process has run. You must carefully group your steps into a meaningful process, so that you can schedule and specify the task flow of your processes properly. Using the Process Task Flow page of the Scheduler notebook, you can start a process based on the completion of another process.
The Work in Progress window lets you monitor the progress of all steps and processes in the Data Warehouse Center that are running or scheduled to run. The Work in Progress window shows an entry for the step or process that is running. While the step is running, it has a status of Populating. If the processing fails, you can use the Show Log action to find the problem.
Manage ETL metadata
Metadata management is essential to effective ETL development and operation. The ETL metadata includes everything involved in ETL process design, ETL process execution history, rejected data process records, schedule information, data growth and storage management records, and user data access records.
You can export the metadata stored in Data Warehouse Center, and also import metadata from another metadata source.
Export metadata
You can use the Data Warehouse Center export capabilities to export subjects, processes, sources, targets and user-defined program definitions. When an object is exported, all of the dependent and subordinate objects will be exported to the tag language file or XML file by default. You can export the following types of metadata:
- Tag language (in XML format)
- Common Warehouse Meta-model metadata
- DB2 OLAP Integration Server metadata (Windows only)
By default, exporting includes the selected objects and all of the objects to which the selected objects refer. For example, if you select a process to export, the sources and targets used by the steps, dependent steps, and dependent processes are included.
When exporting metadata to tag language, you can exclude the source definitions in an export by unchecking the Export dependent source properties option. If you do this, you must define the sources in the target system before you import the tag file to avoid an error.
You can limit the number of exported objects to decrease the size of the tag file. By default, the export operation includes steps that have a data dependency. For example, consider the following scenario: Process P1 contains step S1 that populates T1, and process P2 contains step S2 that contains a source as T1 so that the following dependency is established: S1 –> T1 –> S2 –> T3. If you export only process P2, P1 is exported in the tag file too, because S2 is dependent on S1 for its data. The data dependency goes backwards as well. So, even if you export P2 only, P1 is included in the tag file too. Exporting P1 and P2 separately does not help, so the best approach is to export them together. You can check the option Do not export dependent steps from unselected process to exclude the dependent steps when you export the metadata to a tag file.
In addition to the data dependency, you must also consider cascading. Consider a step in process P5 with a shortcut to a step in process P6. If P5 is exported, P6 is exported too. In this case, the export cascades downward to the next step through a shortcut. By default, the export operation includes cascade steps and processes, however it does give a choice to not include cascade steps and processes when exporting to tag file.
Figure 12. Warehouse meta data export
Import metadata
You can also import object definitions for use in your Data Warehouse Center system.
When you import metadata, all of the objects are assigned to the security group that is specified in the tag language file. If no security group is specified, all of the objects are assigned to the default security group.
You can import the following types of metadata:
- Tag language file
- Common Warehouse Metamodel metadata
- ERwin
- IBM MQSeries
- Trillium
Tips:
- If you move a tag language file from one system to another, you must move all the associated files (for example: source files) with it, and they must reside in the same directory.
- If a process with unlinked shortcuts is exported and then imported into another control database as a .tag file, the unlinked shortcut data will cause error DWC3142: "<dirID> was not found in the Data Warehouse Center control database." This error displays when the unlinked shortcut dirIDs are not translated, and they refer back to the original control database.
Tips for both exporting and importing of metadata:
- Because the warehouse import and export formats are release-dependent, you cannot use exported files from a previous release to migrate from one release of the Data Warehouse Center to another.
- Both exporting and importing processes use a large amount of system resources. You might want to limit the use of other programs while you are exporting object definitions. When you do large export operations, you might want to increase the DB2 application heap number size of the warehouse control database to 8192.
- Both the server name and the user name associated with the Warehouse data source, target, and agent are also exported to the tag file, and that information needs to be updated after importing to the new system. The password, however, is not exported, so you need to provide password information for accessing the warehouse data source, target, and agent.
Set Up Prototype
As soon as the first group of business domain areas of the data warehouse project is implemented, you should set up a warehouse implementation prototype to verify:
- The technologies that were used
- The design and implementation
- The project business requirements
- The warehouse performance
Technical quality assurance
When it comes time to validate the design with the user, hands-on testing is the best approach. Let the user try to answer questions through manipulation of the test target. Document any areas where the test target cannot provide the data requested.
A functional validation of the proposed solution must be performed, together with the end users. This usually results in end users using the constructed solution for a while, giving them the opportunity to work with the information that has been made available to them in a local solution (perhaps in a data mart). In addition, the local solution may then be integrated into more business-wide data warehouse architecture, including the model of the data produced.
Aside from testing, review with the user any additions and changes to the model that have resulted from the design phase to ensure they are understandable. As with the model validation step, pass what works on to the implementation phase. What does not work should return to the requirements phase for clarification and re-entry into modeling.
Data Warehouse Performance Tuning
A data warehouse is a collection of systems, network configurations, applications, databases, reports, and people. Data warehouse performance is affected by all these factors. This section focuses on how to troubleshoot data warehouse performance issues from the end user's point of view, which means query workload and response time.
The workload on a data warehouse seldom remains static. New users bring their own kinds of demands, existing users change their focus and often the depth of their studies, the business cycle presents its own kinds of peaks and valleys, and in most cases the data warehouse expands as it stores data to cover longer periods of time.
Use of indexes should be more liberal in the read-only data warehouse, because the index is defined for efficient data retrieval. The indexes should be optimized based on the access patterns and query needs of the decision support environment of the data warehouse.
As the demand on a data warehouse changes, some indexes become obsolete and others need to be created, some aggregates are no longer referenced and others need to be evaluated, and the limits on parallel processing must be assessed and adjusted to fit the current demand. These and other tuning tasks should be carried out periodically to keep query performance up to business requirements.
Query performance evaluation and tuning is best approached as an ongoing process consisting of a series of successive refinements. Each refinement begins with complaints or observations of poor response time of group queries.
Before performing any query tuning tasks, you need to know the expected response time of the queries in question. To characterize a workload, you must sort through queries, organize them into families, and then determine their resource demands in terms of processing times, I/O requests, memory demand, network data traffic if applicable, and so on. The expected query response time is estimated based on the evaluation of the query workload characters.
Once you know the expected response times, and have measured the current response times of the queries, you can periodically tune the data warehouse as follows:
- Analyze monitored response times to determine whether they meet expected response times.
- Consider tuning when queries fail to meet their response time objectives:
- Set up system and database performance monitoring to collect data and analyze monitored response time information to look for bottlenecks.
- Work down the list of performance determinants from those of greatest to least influence and make adjustments accordingly.
- Determine which queries still fail to meet their response time objectives. There should be few. Develop detailed profiles for these queries using the DB2 Query Performance Monitor and an action plan. Most likely the action plan will contain performance trade-offs which might very well lead to other resource bottlenecks.
- Continue with adjustments to the system and database until all queries meet their performance objectives.
Data Warehouse security
The data warehouse contains confidential and sensitive business data. Very often, data warehouse security is ignored until the later stage of data warehouse design. The complexity of data warehouse security increases as more data sources or business subject areas are involved. Different data sources usually have different security requirements or users, so defining the data access for the integrated warehouse data can be difficult. Fortunately, the DB2 database system and DB2 Data Warehouse Center provides very comprehensive data access security service, which makes data warehouse security much easier.
You need to consider many factors in data warehouse security design:
- Data access: The data warehouse is created for a decision support vehicle; the end users should only dig out information from it. The access to the data in the data warehouse should be read-only.
- End users: Knowing who will use the data warehouse will steer the warehouse design. If the end users are entitled to access all data in the warehouse, you only need to set up one system or database group for accessing the data warehouse. However, in the real business world, not every end user is allowed to access all business data, and different end users are entitled to access different subsets of data in a warehouse.
- Data analysis approaches: There are several ways to generate reports from data marts, including standardized business reports, ad hoc OLAP reports, and data mining. For standardized reports, security is easy to enforce when you allow end users to access a group of predefined reports. For ad hoc OLAP and data mining reports, security is most likely enforced by assigning data marts or subsets of data marts at the database level to groups of users.
- Performance: A restrictive security plan comes at the cost of performance in different ways. It is important to find the balance between the security and performance requirements.
- Data warehouse design: Data security itself is an important issue in data warehouse design. The two-tier data warehouse design in this solution assumes that end users will only access more user-friendly data in data marts, not the complex data structure in a data warehouse. This greatly simplifies the data warehouse end-user security, because the data marts are usually defined for specific departments or groups of users.
- Data warehouse tool: The DB2 Data Warehouse Center security structure is separate from the database and operating system security. This structure consists of warehouse groups and warehouse users. Users gain privileges and access to Data Warehouse Center objects by belonging to a warehouse group. A warehouse group is a named grouping of warehouse users and privileges, which give users the authorization to perform functions. Warehouse users and warehouse groups do not need to match the database users and database groups that are defined for the warehouse control database.
The following figure shows the relationship between DB2 Warehouse users, warehouse groups, and user IDs and passwords for the warehouse databases.
Figure 13. DB2 Data Warehouse user management structure
Data Warehouse solution deliverables
Here is the list of important data warehouse deliverables:
- Business requirements
- Data quality assumptions
- Architecture documents
- Logical and physical data models
- Business data integration specifications
- ETL design documents
- Test plan and results
- Deployment plan and results
- Customer education documents
- Technical support documents
Conclusions
This article series has introduced you to business intelligence and presented a basic approach for delivering a flexible and cost-effective data warehousing solution. This approach uses IBM DB2 Data Warehouse Edition, Version 8.2.1, which offers end-to-end business intelligence software at very affordable price, especially for the mid-market customer.
The information provided on specific areas of business intelligence should give you insight into the tasks, decisions, and issues you will encounter when developing a data warehousing solution for a customer. From the first meeting with a customer to deploying a business intelligence solution into production, this article contains many useful ideas and tips, including ETL processes, data warehouse design and implementation, as well as data warehouse security and performance.
IBM is the world’s top business intelligence solutions provider, and business intelligence is a key initiative for IBM. Business intelligence solutions can range from department-level data marts -- for specialized functions such as sales or financial analysis -- to massive enterprise data warehouses scaling into terabytes. This article provides insight into what it takes for the consultant or solution provider to deliver a solution as part of IBM’s business intelligence strategy.
Resources Learn
Discuss
About the authors  | 
|  | Leon Gong is a solution architect working in the IBM Solution Builder Express team. He has been working to help enable business partners to create solutions in different industries and solution areas including, but not limited to, business intelligence, infrastructure, and e-commerce. He has DB2 certifications in both application development and administration. You can reach him at leongong@us.ibm.com. |
 | 
|  | Mike Olivas is an architect working in the IBM Solutions Builder Express team. Lately, he has been working to help enable business partners to create solutions in different industries and solution areas including but not limited to business intelligence, infrastructure, and workplace services. You can reach him at molivas@us.ibm.com. |
 | 
|  | Christine Posluszny is a solutions developer in IBM Solution Builder Express Portfolio. She has 6 years of experience developing Java and SQL applications leveraging DB2. Recently, she has developed integrated solutions for IBM Mid-Market Business Partners. She is a Sun certified Java Programmer. You can reach her at cpos@us.ibm.com. |
 | 
|  | Donna Venditti is a project lead in Solutions Builder Express. For the last 6 years, she has delivered both industry and cross-industry starting point solutions for IBM Mid-Market Business Partners. Her focus has been on e-commerce and business intelligence solutions for the retail industry. You can reach her at donnav@us.ibm.com. |
 | 
|  | George McMillan is an architect working in the IBM Solutions Builder Express team. Most recently, he has been working with business partners and SBE architects developing consultant's tooling, and as methodologist developing method strategies for the partner channel. His work includes creating solutions for business partners in different industries and solution areas. He acts as technical lead in the areas of collaboration and content management, and participates on teams developing business intelligence and infrastructure solutions. You can reach him at gmcmillan@us.ibm.com. |
Rate this page
|