 | Level: Intermediate Leon Gong (leongong@us.ibm.com), Software Engineer, IBM Mike Olivas (molivas@us.ibm.com), Senior IT Specialist, IBM Christine Posluszny (cpos@us.ibm.com), Software Engineer, IBM Donna Venditti (donnav@us.ibm.com), Project Leader, IBM George McMillan (gmcmillan@us.ibm.com), Software Engineer, IBM
14 Jul 2005 Second in a three-part series introducing a flexible and effective approach to plan, design, and implement a basic data warehouse solution, this article focuses on warehouse data modeling issues.
Introduction
The business climate is changing rapidly, and so are the types of business data. The basis of a successful data warehousing solution is a flexible design which can adapt to changing business data. Data warehouse architecting and warehouse data modeling are the core processes in warehouse design.
Data warehouse architecture
When you use data models to capture business requirements, you are already doing some part of the data warehouse design work. However, the formal data warehouse design should begin with data warehouse architecture.
The warehouse architecture is a key decision that will be based on such factors as the current infrastructure, the business environment, the desired management and control structure, the commitment to and scope of the implementation effort, the capability of the technical environment the organization employs, and the resources available.
Architecture choices
The warehouse architecture will determine -- or be determined by -- the locations of the data warehouses and data marts themselves, and where the control resides. For example, the data can reside in a central location that is managed centrally. Alternatively, the data can reside in distributed local and/or remote locations that are either managed centrally or independently.
The following architecture choices are available:
- Business-wide data warehouse
- Independent Data Marts
- Interconnected Data Marts
These architecture choices can also be used in combinations. For example, the data warehouse architecture could be physically distributed or centrally managed.
Business-wide data warehouse architecture
A business-wide data warehouse is considered one that will support all, or a large part, of a business that requires a more fully integrated data warehouse, with a high degree of data access and usage across departments or lines of business. That is, the warehouse is designed and constructed based on the needs of a business as a whole. It could be considered to be a common repository for decision-support data that is available across the entire organization, or a large subset of that data. The term "business-wide" is used here to reflect the scope of data access and usage, not the physical structure. The business-wide data warehouse can be physically centralized or physically distributed throughout the organization.
Independent data mart architecture
Independent data mart architecture implies stand-alone data marts that are controlled by a particular workgroup, department, or line of business and are built solely to meet their needs. There may, in fact, not even be any connectivity with data marts in other workgroups, departments, or lines of business.
Figure 1. Data warehouse architecture choices
Interconnected data mart architecture
Interconnected data mart architecture is basically a distributed implementation. Although separate data marts are implemented in a particular workgroup, department, or line of business, they can be integrated, or interconnected, to provide a more global business-wide view of the data. In fact, at the highest level of integration, they can become a business-wide data warehouse. This means that end users in one department can access and use the data in a data mart in another department.
Which architecture should you choose?
Business-wide centralized data warehouse architecture is the most elegant choice if your customer’s business and data sources are relatively centralized. This is actually not an uncommon situation for mid-market companies. Otherwise, the interconnected data marts and business-wide distributed data warehouse are more practical choices for businesses with a wide geographic distribution.
The independent data mart architecture is not a good approach because it goes against the key concept of a data warehouse: data integration.
Data warehouse implementation approaches
The choice of an implementation approach is influenced by such factors as the current IT infrastructure, the resources available, the architecture selected, the scope of the implementation, the need for more business-wide data access across the organization, the return-on-investment requirements, and the speed of implementation. The selection of an implementation approach is a very important part of data warehouse design; the decision needs to be made at an early stage of the data warehouse project.
Top-down implementation
The top-down approach is to implement the data warehouse in a single project phase. A top-down implementation requires more planning and design work to be completed at the beginning of the project. This brings with it the need to involve people from each of the workgroups, departments, or lines of business that will be participating in the data warehouse implementation. Decisions about data sources to be used, security, data structure, data quality, data standards, and an overall data model will typically need to be completed before actual implementation begins.
Bottom-up implementation
A bottom-up implementation involves the planning and design of a data warehouse without waiting for a more business-wide data warehouse design to be put in place. This does not mean that a more business-wide data warehouse design will not be developed; it will be built incrementally as initial data warehouse implementations expand. This approach is more widely accepted today than the top-down approach, because immediate results from the data warehouse can be realized and used as justification for expanding to a more business-wide implementation.
Which implementation choice should you make?
Each implementation approach has advantages and disadvantages. In many cases the best approach may be a combination of the two. One of the keys to this approach is to determine the degree of planning and design that is required for the business-wide architecture to support integration as the data warehouse is being built with the bottom-up approach.
When you use a bottom-up or phased data warehouse project model to build a series of data marts in a business-wide architecture, you can integrate the data marts of different business subject areas one by one to form a well designed warehouse of business data. Such an approach can be very appealing to businesses. Each data mart can tackle an identifiable business problem or subject area, making it possible to calculate ROI. The approach also offers a valuable learning curve for the build team, who can test products and processes until they get it right.
For a mid-market company, there are extra reasons to employ the bottom-up approach:
- There is more volatility in mid-market businesses and their business data structures than there is in enterprise business data.
- Smaller companies will usually have limited project budgets.
- Mid-market companies need quick solutions to ease their business pains.
- The people required for this type of project must have both a broad understanding of the business and a detailed knowledge of a specific business area. It is difficult to find such people, but even if you can, it is more difficult to use their time for data modeling rather than performing their standard business duties.
Data warehouse infrastructure
Now that you have made some decisions about high-level data warehouse architecture, you can begin to consider what components the data warehouse should have.
Figure 2. High-level view of Business Intelligence infrastructure components
A data warehouse should have all the components in the business intelligence infrastructure pictured in the figure above. This article will focus on data warehouse construction, which covers all of these components except information analysis.
These business intelligence components can be defined as follows:
- Data source: clear definition of current available business and external data sources and future data sources if possible
- Data acquisition: ETL (Extraction, Transformation, and Loading) processes to acquire, clean, transfer, and integrate data
- Business data warehouse: Warehouse data repository databases
- Data propagation: ETL process to aggregate and enhance data for data marts
- Data mart: subset of a data warehouse in a more user-friendly data structure
- Information analysis (not covered in this solution)
- Metadata management: business requirements, data models, ETL process designs, user manuals, and so on.
- System management: data management, data warehouse security, system and database backup and recovery, and so on.
Modeling the data warehouse
Data is simply a record of all business activities, resources, and results of the organization. The data model is a well organized abstraction of that data, so it is quite natural that the data model has become the best method to understand and manage the business of an organization. The data model plays the role of a guideline, or plan, to implement the data warehouse. Consolidating the data models of each business area before the real implementation begins can help assure that the result will be an effective data warehouse and can help reduce the cost of implementation.
Modeling the target warehouse data is the process of translating requirements into a picture along with the supporting metadata that represents those requirements. For readability purposes, this article separates the discussions about requirements and modeling, but in reality these steps often overlap. As soon as some initial requirements are documented, an initial model starts to take shape. As the requirements become more complete, so too does the model.
It is of the utmost importance to provide your end users with logical models of the data warehouse that are well integrated and easy to interpret. These logical models are one of the centerpieces of the data warehouse metadata. Simplicity for the end user and integration and consolidation of the historical data are key principles that the modeling approach should help provide.
Warehouse data modeling vs. operational database modeling
Keep the following issues in mind during the modeling process:
- A data warehouse should be end-user-oriented. In an operational database, the user does not directly interact with the database. They use applications, which have predefined or fixed queries. A data warehouse database -- especially a data mart -- is very near to the end user, and it usually does not have fixed queries. It must therefore be easier to understand.
- A data warehouse should be designed for data analysis. End users deal almost directly with the data, and there are no fixed workflows (with a few exceptions here and there). End users are not interested in recording data in the warehouse; they want to get information out of it. They raise questions against the warehouse, test and verify hypotheses with information they extract, reconstruct chains of events, analyze those events to detect possible patterns or seasonal trends, and make extrapolations and projections for the future.
- End-user requirements may be fuzzy or incomplete. These incomplete requirements call for a flexible modeling process and for techniques which are appropriate for evolutionary development. The risks of flexible and evolutionary software development are incoherent and inconsistent end results. These issues certainly require attention when developing the data model.
- A data warehouse is an integrated collection of databases rather than a single database. It should be conceived as the single source of information for all decision support processing and all informational applications throughout the organization. A data warehouse is an "organic" thing, and it tends to become big, if it is not already big at the beginning.
- A data warehouse contains data that belongs to different information subject areas. These subject areas can be the basis for logically partitioning the data warehouse in several different (conceptual or even physical) databases. A data warehouse also contains different categories of data.
- A data warehouse usually contains historical data rather than snapshots of daily operational data. Required legacy data may not be available or perhaps cannot be captured at a sufficient level of detail, unless money and effort are spent changing the legacy input environment. As a result, data warehouse enablement projects often get involved with business process and source application reengineering.
 |
Two-tiered data warehouse design
How to model a data warehouse is probably one of the most controversial issues in the business intelligence field, and this article does not enter that discussion. This section introduces the two-tiered warehouse modeling approach, which is the best fit for a bottom-up implementation.
Figure 3. Two-tiered warehouse modeling
The data repository tier, or back-end tier, comprises all of the model artifacts and the full structure of the models, which handle integrated business data from all required data sources. The data repository tier has two data modules: data staging and data repository. The data staging module stores the original data and temporary data from all data sources for ETL processing. The data repository module stores all integrated business data, it is the final target of data acquisition ETL process.
The data mart tier contains all the data marts, which are subsets of the data repository module, made simple enough for specific groups of end users to use in their data analysis activities.
Since data warehouse and data mart are interchangeable concepts in many situations, it is necessary to specify their definition based on the two-tiered warehouse model. In this model, the business data warehouse is the collection of data repository databases (module), and the data marts are the databases from the data mart tier. The business data in the data mart can only come from the business data warehouse.
The benefits of two-tier data warehouse design include:
- Flexible and easy to maintain. The data mart's data structures can be changed at any time according to users' reporting needs. However, it does not affect data structure of the databases in data repository.
- Easy to scale and integrate. The relational style business data repository database is much easier to scale and integrate than denormalized and summarized databases in a data mart.
- User friendly. Isolating data marts from the data repository makes data mart design much more end-user-driven, because the data modeler does not need to consider much about data integration and schema scalability issues.
- Better security design. The two-tier approach isolates data store and data access management. End users can only access the data mart granted to them, not all data warehouse data.
- Better data management design. A data warehouse is designed for storing integrated business-wide historical data. However, many data marts in a data warehouse do not necessarily need that much historical data. A two-tier design is a good place to store the historical data.
Keep in mind, the two-tier warehouse design above is a conceptual warehouse layout. For example, in the data repository tier, staging and repository databases can be on different servers, on the same server, or even in the same database under different schemas.
Warehouse data modeling levels
There are three levels of data modeling: conceptual, logical, and physical. Each level of data modeling has its own purpose in data warehouse design.
Conceptual
The high-level data model is a consistent definition of all of business subject areas and data elements common to the business, from a high-level business view to a generic logical data design. From this, you can derive the general scope and understanding of the business requirements. This conceptual data model is the basis for both current and future phases of data warehouse development.
Logical
The logical data model contains much more detailed information about the business subject areas. It captures the detailed business requirements in the target business subject areas. It is the basis for the physical data modeling for the current project.
Starting from this stage, this solution is adapting the bottom-up approach, which means that only the most important and urgent business subject areas are targeted in this logical data model.
The features of the logical data model include:
- Specifications for all entities and relationships among them
- Specifications for each entity's attributes
- Specifications for all primary keys and foreign keys
- Normalization and aggregation
- Specification for multidimensional data structure
Physical
The physical data modeling applies physical constraints, such as space, performance, and the physical distribution of data. The physical data model is tightly related to the database system and data warehouse tools that you will use. The purpose of this phase is to design the actual physical implementation.
It is particularly important to clearly separate logical modeling from physical modeling. Good logical modeling practice focuses on the essence of the problem domain. Logical modeling addresses the "what" question. Physical modeling addresses the question of "how" for the model, which represents implementation reality in a given computing environment. Since the business computing environment changes from time to time, the separation of logical and physical data modeling will help stabilize the logical models from phase to phase.
Once a data warehouse is implemented and your customers begin using it, they will often generate new requests and requirements. This will start another cycle of development, continuing the iterative and evolutionary process of building the data warehouse. As you can see, the logical data model is a living part of a data warehouse, used and maintained throughout the entire life cycle of the data warehouse. The process of data warehouse modeling can be truly endless.
Figure 4. Data warehouse logical data model life cycle
Repository database modeling
A warehouse data repository database stores all the cleaned and integrated business data from all business data sources, and it is the end point of data integration and transformation in a data warehouse.
While the conceptual data model is designed according to business requirements, the data repository logical data model is designed based on both business requirements and the analysis of available business data sources. It is a natural checkpoint for verifying whether the existing business data supports the business requirements for data warehouse project.
The repository database is still basically a normalized relational database. Because they are source-driven, the source data models can be used as aids in the process of fully developing the warehouse data repository models. You might need to construct the source data models by using reverse engineering techniques, developing Entity and Relation (ER) models from existing source databases. You might need to start by integrating several of these models into a global model representing the sources in a logically integrated way.
The data in a data warehouse is cleansed and scrubbed during the data transformation processes. It should be at least as good as the data in the source operational system. Even though referential integrity and check constraints are very helpful during the ETL process to help detect data problems, implementing them on final data warehouse tables is not efficient.
One important characteristic of a data warehouse is that it contains historical data. There are two kinds of history data according to the update frequency: slow-update and fast-update history data. For the slow-update history data, updates are handled using the child history tables with valid status and time-frame data.
Transaction and Web traversal data are typical examples of fast-update data; they usually have big (old and new) data volumes. The most important design issue for hosting fast-update historical data is the performance. For example, there is a large volume of transaction data that starts from 1999, but only the most recent three years of transactional data are frequently accessed for reporting, as indicated in the business requirements. Figure 5 is a high-level logical and physical partition design for the transaction table.
Figure 5. Transaction table logical and physical partitions
Data modeling for data mart
Since warehouse end users interact directly with data marts, the data mart modeling is one of the most effective tools in capturing end-user business requirements. The data mart modeling process depends on many factors. Three of the most important are described below.
Data mart modeling is end-user-driven. End users must be involved in the data mart modeling process, as they obviously are the ones who will use the data mart. Because you should expect that end users are not at all familiar with complex data models, the modeling techniques and the modeling process as a whole should be organized such that complexity is transparent to end users.
Data mart modeling is driven by business requirements. Data mart models are useful for capturing the business requirements because they are often used directly by end users, and are easy to understand.
Data mart modeling is greatly affected by data analysis technologies. The techniques of data analysis can impact the type of data models selected and their content. There are several techniques for data analysis that are in common use today: query and reporting, multidimensional analysis, and data mining.
If the intent is simply to provide query and reporting capability, an ER model with a normalized or denormalized data structure would be most appropriate. A dimensional data model might also be a good choice because it is user-friendly and has better performance. If the objective is to perform multidimensional data analysis, a dimensional data model would be the only choice here. Data mining, however, usually works best with the lowest level of detail available. Thus, if the data warehouse is used for data mining, a low level of detailed data should be included in the model.
Since ER modeling is not included in this article, this section will discuss dimensional data modeling, which is the most important data modeling methodology in data mart design.
Star and snowflake models
There are two basic data models that can be used in dimensional modeling:
Star schema (or model)
Snowflake model
Star schema has become a common term used to refer to a dimensional model. Database designers have long used the term "star schema" to describe dimensional models because the resulting structure looks like a star. Some business users feel uncomfortable with the term schema, so they have embraced the more simple-sounding term "star model." The terms star model and star schema are interchangeable.
The star model is the basic structure for a dimensional model. It typically has one large central table (called the fact table) and a set of smaller tables (called the dimension tables) arranged in a radial pattern around the fact table.
Whereas the traditional ER model has an even and balanced style of entities and complex relationships among entities, the star model is very asymmetric. The fact table in the dimensional model has a single join with all the other dimension tables.
Dimensional modeling typically begins by identifying facts and dimensions, after the business requirements have been gathered. The initial dimensional model is usually star-like in appearance, with one fact in the center and one level of several dimensions around it. The snowflake model is the result of decomposing one or more of the dimensions, which sometimes have hierarchies themselves. You can define the many-to-one relationships among members within a dimension table as a separate dimension table, forming a hierarchy.
The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well. The snowflake model is easy for data modelers to understand and for database designers to use for the analysis of dimensions. However, the snowflake structure seems more complex and could tend to make the business users feel more uncomfortable working with it than with the simpler star model. Developers can also elect the snowflake because it typically saves data storage. Consider a banking application where there is a very large account table for one of the dimensions. You can expect to save quite a bit of space in a table of that size by not storing the very frequently repeated text fields, but rather putting them once in a subdimension table.
Although the snowflake model does save space, it is generally not significant when compared to the fact table. Most database designers do not consider the savings in space to be a major decision criterion in the selection of a modeling technique. This article will focus on the star model.
Figure 6. Sample star schema structure
Dimensions and measures
A user typically needs to evaluate or analyze some aspect of the organization’s business. The requirements that have been collected must represent the two key elements of this analysis: what is being analyzed, and the evaluation criteria for what is being analyzed. The evaluation criteria are referred to as measures (a numeric attribute of a fact), and what is being analyzed is referred to as dimensions (a description attribute of a fact). Together, one set of dimensions and its associated measures make up what is called a fact.
Dimensions
The base structure of a dimension is the hierarchy. Dimension hierarchies are used to aggregate business measures, such as Total Revenue of Sales, at a lesser level of detail than the base granularity at which the measures are present in the dimensional model. In this case, the operation is known as roll-up processing. Roll-up processing is performed against base facts or measures in a dimensional model.
If measures are rolled up to a lesser level of detail, the end user can obviously also perform the inverse operation (drill-down), which consists of looking at more detailed measures or, to put it differently, exploring the aggregated measures at lower levels of detail along the dimension hierarchies.
One of the essential activities of dimension modeling is to capture the aggregation paths or aggregation hierarchies along which end users perform roll-up and drill-down. This process will produce dimension models that you will extend and change later on, when you perform other modeling activities, such as modeling the variance of slow-varying time dimensions, dealing with constraints within the dimensions, and capturing relationships and constraints across dimensions.
Dimension modeling is tightly associated with end users and business processes. To make a dimensional model longer-lasting and suitable for more user groups, it is of particular importance to model the dimensions from a conceptual point of view, looking for fundamental aggregation paths and aggregation levels in which the end-user community at large may be interested.
You can usually add measures to well defined facts without much impact on the model at all. For the dimensions, however, this is certainly not true because the structure of dimension hierarchies can potentially get complicated.
When you consider the problem domain in a broad context, you should expect to have several different aggregation paths within a single dimension. Splits in the dimension hierarchies can occur at different levels. Hierarchies that were split can later be split again. This process can result in complex schemas, perhaps too complex for end users to deal with. Consult your end users to avoid unnecessary complications.
An important and often difficult decision to make is whether an aggregation level actually is an element of a (structural) hierarchy, or whether it simply is a property of an item in the dimension. Is it, for instance, wise or required to keep the product packaging unit, the brand, or the storage type as explicit elements (that is, as potential entity types in the dimension's hierarchies) of the dimension path? Or can they simply be considered properties of the products?
Looking for fundamental aggregation paths within a dimension such as a Product dimension usually means investigating a number of typical relationships within the dimension.
Construction or structural relationships: These relationships are used by information analysts to explore constructive relationships between products and their components. For instance, they can be used to calculate the cost of a product, using the costs associated with the product's components and the costs associated with the construction of the product.
Variation relationships: Variations are used to differentiate products in terms of product models, versions, implementations, component mixtures, blends, and so on.
Variations may also be used to identify product replacements. Information analysts use variation relationships to group related products and aggregate associated measures, because the lower-level categories of products may only exist for a limited period of time, or because they are frequently used to replace each other in a process (for example, when a version of a product is sold to customers when the "original" item is out of stock).
Classification relationships - Classifications are arrangements of like products into groups. Classifications of relationships are obviously the most frequently occurring relationships between products that information analysts use to roll up detailed measures. Notice that several different kinds of classifications are usually required. For example, products may be classified according to sales-oriented, manufacturing-oriented, stocking-oriented, or supply-oriented characteristics. Information analysts use classifications for aggregating measures in statistical groupings such as totals, averages, minima, and maxima.
Facts
The fact table only contains IDs for referencing dimensions tables, and measures for measuring the changing or performance of all dimension members. The next step is organizing the dimensions and measures into facts. This is the process of grouping dimensions and measures together in a manner that can address the specified requirements.
There are several important issues to address in fact table design:
Granularity (the level of detail at which a fact is recorded): If data is to be analyzed effectively, it must all be at the same level of granularity. As a general rule, you should keep data at the most detailed level of granularity. This is because you cannot change data to a more detailed level than what you have decided to keep. You can, however, always roll up (summarize) the data to create a table with a less detailed level of granularity.
Additivity (the ability of measures to be summarized): Measures fall into three categories: fully additive, nonadditive, and semiadditive. An example of a nonadditive measure is a percentage. You simply cannot add the percentages from two facts together and come up with a meaningful result. An example of a semiadditive measure is a balance. Although you can add the balances from two accounts to get a total balance, you cannot add two balances from the same account at two different points in time. Because the balance is additive only across some dimensions, we call it a semiadditive measure. A value that can be added across all dimensions is considered to be fully additive. Additivity becomes important when you consider the possible summarizations that will occur on a fact table. Generally, it is desirable for measures to be fully additive. When they are not, you should consider breaking them down into their atomic elements.
Key Selection: Key selection in multidimensional data modeling is a difficult issue. It involves a trade-off between performance and ease of management. Key selection applies mainly to dimensions. The keys you choose for the dimensions must be the foreign keys of the fact. There are two choices for dimension keys: you can assign an arbitrary key, or use identifiers from the operational system. An arbitrary key is usually just a sequential number, where the next available number is assigned when a new key is required.
To uniquely represent a dimension using identifiers from an operational system, you sometimes need to use a composite key. A composite key is a key made up of multiple columns. An arbitrary key is one column and is almost always smaller than an operationally derived key. Therefore, arbitrary keys will generally perform joins faster.
The last factor in key selection is its impact on the fact table. When a fact is created, the key from each dimension must be assigned to it. If the dimensions use operationally derived keys with time stamps for history, there is no additional work when a fact is created. The linkage happens automatically. With arbitrary keys, or arbitrary history identifiers, a key must be assigned to a fact at the time the fact is created.
There are two ways to assign keys. One is to maintain a translation table of operational and data warehouse keys. The other is to store the operational keys and, if necessary, time stamps as attribute data on the dimension.
The choice, then, is between better performance of an arbitrary key and easier maintenance of an operational key. The questions of how much better performance and how much more maintenance must be evaluated in your own organization.
Regardless of the choice you make, the keys, and the process that generates them, must be documented in the metadata. This information is necessary for the technical staff that manages and maintains the data warehouse. If the tools you use do not hide join processing, the user may also need to understand this.
Now that you understand the process for both the dimension and fact tables, let's look at a real world example to explore how to identify the dimension and measures from the business requirements. This example is only a primary analysis of a list of business questions. The business questions are defined as sample requirements:
- What is the average balance and transaction count this month for customers by bank branch?
- What is the annual gross profit and interest paid for each customer, summarized by branch, product, and region?
- What percentage of customers is profitable? Classify them by branch, region, and year.
- What is a customer's total transaction amount this year?
- What are the top five most profitable products by region?
- What are the top five most profitable branches for the last five years?
- What are the demographic and geographic characteristics of most profitable customers?
By analyzing these questions, we define the dimensions and measures needed to meet the requirements (see Table 1).
Table 1. Dimensions and measures table
| Dimensions and measures | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | | Dimensions | | | | | | | | | Branch | X | X | X | | | X | X | | Region | | X | X | | X | | X | | Customer | X | X | X | X | | | X | | Product | | X | X | | X | | | | Time | X | X | X | X | | X | | | | | | | | | | | | Measures | | | | | | | | | Balance | X | | | | | | | | Transaction Amount | | | | X | | | | | Transaction Count | X | | | | | | | | Gross Profit | | X | X | | X | X | X | | Interest Paid | | X | | | | | |
At this point you review the dimensions to ensure:
- You have the data you need to answer your questions.
- All the measures are defined at the most detailed level.
Use these simplified analysis points to decide what to include and what to exclude in your final star model:
- The Balance and Transaction Count are based the aggregation of Transaction Amount, so they are derived measures.
- The Interest Paid is calculated as the product of the account interest rate and balance. It is an account-based and month-based calculation. Since the interest rate is an attribute of Account table, the Account dimension table needs to be added. As you can see, the Interest Paid is also a derived measure.
- Assume that the Gross Profit is based on the calculation of (the investment return of the account balance) - (the interest paid). Since the investment return is a bank-level measure and Interest Paid is a derived measure, the Gross Profit is also a derived measure.
The conclusions from the analysis above are:
- The transaction amount is the only measure needed.
- An account dimension is needed to bring in interest rate and investment return information.
Model metadata
In the traditional development cycle, after a model is completed it is only used when changes need to be made, or when other projects require the data. In the warehouse, however, your model is used continuously. The users of the warehouse constantly refer to the model to determine the data they want to use to analyze the organization. The rate of change of the data structure in a warehouse is much greater than that of operational data structures. Therefore, the technical users of the warehouse (administrators, modelers, designers, and so on) will also use your model on a regular basis.
This is where the metadata comes in. Far from just a pretty picture, the model must be a complete representation of the data you are storing, or it will be of little use to anybody.
To properly understand the model, and to be able to confirm that it meets requirements, a user must have access to the metadata that describes the warehouse in business terms that are easily understood. Therefore, you should document non-technical metadata at this point. During the design phase, you will add the technical metadata to it.
At the warehouse level, you should provide a list of what is available in the warehouse. This list should contain the models, dimensions, facts, and measures available, as these will all be used as initial entry points when a user begins analyzing data.
For each model, provide a name, definition, and purpose. The name simply gives the user something to focus on when searching. Usually, it is the same as the fact. The definition identifies what is modeled, and the purpose describes what the model is used for. The metadata for the model should also contain a list of dimensions, facts, and measures associated with it, as well as the name of a contact person so that users can get additional information when they have questions about the model.
Model verification
Before investing a lot of time and effort in implementing your warehouse database, it is a good idea to validate your model with the user, especially data mart models. The purpose of such a review is twofold. First, it confirms that the models can actually meet the user’s requirements. Second, a review should confirm that the user can understand the model. Remember that once the warehouse is implemented, the user will be relying on the model on a regular basis to access data in the warehouse. No matter how well the model meets the user’s requirements, your warehouse will fail if the user cannot understand the model well enough to access the data.
Validation at this point is done at a high level. Review this model with the user to confirm that it is understandable. Together with the user, test the model by resolving how you will answer some of the questions identified in the requirements.
It is fine that the model does not necessarily meet all of the user’s requirements. This does not mean that you should stop and go back to the beginning. Expect your first cut of the model to meet perhaps 50% of the requirements. Take this 50% (or however much is validated) of the model and start working on the physical design. The remainder should be sent back to the requirements-gathering stage. Either the requirements need to be better understood, or as is often the case, they have changed and need to be redefined. Usually, this will lead to additions and changes to the model you have already created. In the meantime, the validated portion of the model will go through the design phase and begin providing benefits to the user.
The iteration of development and the continued creation of partially complete models are the key elements that provide the ability to rapidly develop data warehouses.
Figure 7. Warehouse data model evaluation process
During requirements validation, you will:
Check the coherence and completeness of the initial dimensional models and validation against the given end user requirements. Analyze the initial models with the end users. This allows the requirements analyst to perform more investigations, and to adapt the initial models (in an attempt to fix the requirements as they are expressed in the models) before passing them to the requirements modeling phase.
Identify candidate data sources. Establish an inventory of required and available data sources.
Map the initial dimensional models, possibly completed with informal end-user requirements, to the identified data sources. This is usually a tedious task. The source data mapping must investigate the following mapping issues:
Which source data items are available and which are not? For those that are not available, should the source applications be extended? Can they perhaps be found using external data sources? Or should end users be informed about their unavailability, and as a consequence, should the coverage of the dimensional model be reduced?
Are there other interesting data items that are available in the data sources but have not been requested? Identifying data items that are available but not requested may reveal other interesting facets of the information analysis activities and may therefore have significant impact on the content and structure of the dimensional model being constructed.
Verify that the data mart design does not violate any business security settings. Since the data mart is designed for a specific group of end users, it is good idea to make sure it only contains necessary information for the group.
Perform an initial sizing of the model. If at all possible, the initial sizing should also investigate volume and performance aspects related to populating the data warehouse.
The results of requirements validation will help you to assess the scope and complexity of the data warehouse development project and to (re-)assess the business justification, which includes technical, financial, and resource evaluation. Requirements validation must be performed in collaboration with the end users, to reveal and correct any problems with incomplete or incorrect initial models. Requirements validation may involve building a prototype of the dimensional model.
The requirements validation process should confirm or reestablish end-user requirements and expectations. You may also identify and evaluate source data reengineering recommendations as a result of requirements validation. At the end of requirements validation, you should obtain a (new) "sign-off" for the data warehouse modeling project.
Coming next
Stay tuned for the upcoming articles in this series, which will continue to explore warehouse ETL process design and implementation, warehouse performance, security, and more.
Resources
About the authors  | 
|  | Leon Gong is a solution architect working in the IBM Solution Builder Express team. He has been working to help enable business partners to create solutions in different industries and solution areas including, but not limited to, business intelligence, infrastructure, and e-commerce. He has DB2 certifications in both application development and administration. You can reach him at leongong@us.ibm.com. |
 | 
|  | Mike Olivas is an architect working in the IBM Solutions Builder Express team. Lately, he has been working to help enable business partners to create solutions in different industries and solution areas including but not limited to business intelligence, infrastructure, and workplace services. You can reach him at molivas@us.ibm.com. |
 | 
|  | Christine Posluszny is a solutions developer in IBM Solution Builder Express Portfolio. She has 6 years of experience developing Java and SQL applications leveraging DB2. Recently, she has developed integrated solutions for IBM Mid-Market Business Partners. She is a Sun certified Java Programmer. You can reach her at cpos@us.ibm.com. |
 | 
|  | Donna Venditti is a project lead in Solutions Builder Express. For the last 6 years, she has delivered both industry and cross-industry starting point solutions for IBM Mid-Market Business Partners. Her focus has been on e-commerce and business intelligence solutions for the retail industry. You can reach her at donnav@us.ibm.com. |
 | 
|  | George McMillan is an architect working in the IBM Solutions Builder Express team. Most recently, he has been working with business partners and SBE architects developing consultant's tooling, and as methodologist developing method strategies for the partner channel. His work includes creating solutions for business partners in different industries and solution areas. He acts as technical lead in the areas of collaboration and content management, and participates on teams developing business intelligence and infrastructure solutions. You can reach him at gmcmillan@us.ibm.com. |
Rate this page
|  |