Skip to main content

skip to main content

developerWorks  >  Information Management  >

Breaking Relational Vows

developerWorks
Document options

Document options requiring JavaScript are not displayed


New site feature

Check out our new article design and features. Tell us what you think.


Rate this page

Help us improve this content


Level: Introductory

Blair AdamacheIBM

03 Jul 2001

Ever wanted to stretch the limits? This article shows you basic and fancy techniques in setting up your relational database, as well as how to break the traditional rules.

Introduction

The relational model is a thing of beauty. Compromising it is a travesty, like putting a baseball cap on top of Michelangelo's David. Yet, stuff that doesn’t belong in a pure relational model may end up in your database or application. If you value your relational vows with monastic gravity, read no further. For those of you who might enjoy a walk on the wild side, this article will take you to the red light district of an RDBMS. Read on to discover how to:

  • Store derived values
  • Number each row to create an artificial key
  • Retrieve an identity value generated by DB2®
  • Number rows in an answer set
  • Ask for data and then truncate the answer set
  • Delete all rows in a table without logging
  • Use a view and the rename table command to "delete" a column
  • Widen a varchar column
  • Learn when to use a true view versus a summary table (the infamous materialized view)
  • Learn the difference between declared temporary tables and common table expressions

Some of these techniques can improve performance, like the pragmatic advice you may read in an RDBMS reference manual, advising to take your data to fourth normal form, and then make some compromises before implementing your design to achieve desired performance. Other techniques make life easier for the application programmer, and have unpredictable affects on performance. Finally, some things are ways to get behavior you have taken advantage of in other database products. See DB2 Version 7 SQL Reference, Version 7 Command Reference, and DB2 Version FixPak 2 and FixPak 3 Release Notes for a complete explanation of all SQL and Command syntax. IDENTITY columns are described under CREATE TABLE in the SQL Reference. The IDENTITY_VAL_LOCAL function is in the release notes. See “OLAP Functions in the SQL Reference” for fancy stuff like rownum and rank. All samples in this article were tested with DB2 Version 7, FixPak 2 on Windows NT. They will have the same behavior on UNIX® and OS/2® as well.



Back to top


Deriving values

Somewhere in a college course on relational databases you might have learned not to store derived values. After all, these can be calculated when the answer set is assembled, avoiding redundant data in the database, and get the correct answer. There are a few reasons DB2 introduced generated columns in Version 7.

  • The user requesting the data may be using an application you have no control over. If they're using a point and click application to do a SELECT *, they may not realize that they really want to see COMPENSATION, and have to add the values in the COMMISSION and SALARY columns. A generated column allows you to store this value, and keep it accurate:
    create table employee
    (name char(10), salary dec(10,2), commission dec(10,2), 
    compensation dec(11,2)
    generated always as (commission + salary))
    

    This will keep the COMPENSATION column accurate through the following SQL:

    insert into employee (name, salary, commission) values ('Blair',5,10)

    update employee set salary=0
    

  • You need to uniquely identify each row. This is described later.
  • Case insensitive indexing, which is described next.


Back to top


Case sensitivity

Case sensitivity is powerful, and an RDBMS can search more quickly if it knows that Greenland is not a match for greenland. However, a user probably wants your application to return “MacInnis” if they submit a search on “Macinnis." For name searches, you’ll likely consider creating an index on the NAME column. However, the values in a DB2 index are also case sensitive. Making MacInnis=Macinnis is easy, just use the UPPER or UCASE function:

SELECT NAME FROM EMPLOYEE WHERE UPPER(NAME) = 'MACINNIS'

But, this will force a table scan, and you won't get the benefit of an index. Here's where generated columns come in: if the standard access method is a search on name, use a generated column to store the name in upper case:

CREATE TABLE EMPLOYEE (NAME VARCHAR(10),
             NAME_UP
        GENERATED ALWAYS AS (UPPER(name)))

Now create an index over the upper case version of the column:

CREATE INDEX NAME_IND ON EMPLOYEE ( NAME_UP )

This query can get the benefit of the index, and avoid a table scan:

SELECT NAME FROM EMPLOYEE WHERE UPPER(NAME) = 'MACINNIS'

Let's see how generated columns can be used to enumerate rows. Why do we like to number rows? Relational theory teaches us that rows and columns have no inherent order: you specify the order when you request the data. But people like to number things, from page numbers in books to numbers on sports jerseys. You may have learned computer science viewing data in a tabular relational model. How many of your users learned to view tabular data in Lotus and Excel spreadsheets, with a line numbers down the left of the screen? Most relational database management systems have an internal RID (row id) or TID (tuple id). DB2 on OS/390® and Oracle have externalized this, making it easy for a program to identify a row without knowing what it contains. We have not externalized the row id on DB2 on Windows/UNIX/OS/2 because we allow it to change: a dangerous property in a potential primary key. DB2 does have other methods of using a column to be an artificial primary key.

Before you resort to any of these primordial methods of uniquely identifying each row, try hard to find the true primary key: ask yourself the question, "how would we uniquely identify each row if we were writing it down on paper; what if a customer or supplier phoned in and asked for status -– how would we find the data they're asking about?” If you stamped the date and time on each piece of paper, this was your primary key.



Back to top


Generating row identifiers

Let's start with applications migrating to DB2 from other databases. If you're moving from SQL Server, you probably are tired of the preaching about relational purity and want to know about IDENTITY. Here's how you create an IDENTITY column with DB2 v7:

CREATE TABLE T1
    (C1 DECIMAL(15,0) GENERATED BY DEFAULT AS IDENTITY
        ;(START WITH 10),
          ;C2 INTEGER)

There are also options for caching identity values in memory, which makes inserts faster, but will leave gaps in the identity sequence if your system goes through a hardware or software crash when generating an IDENTITY value. The default is to increment by one integer at a time, but you can increment by other values (2, 10, etc.) as well. After the insert, there's a natural curiosity about the value that was generated. You might need to know this for the next piece of logic in your application. We have a function documented in the release notes (x:\sqllib\release.txt on Windows) called IDENTITY_VAL_LOCAL() to retrieve this value for you.

IDENTITY is unique per table. Oracle junkies will be glad to hear that Version 7, FixPak 3 of DB2 will bring SEQUENCE columns to DB2. A sequence is unique throughout the database - useful for values that go into multiple tables. You can also cycle through sequences to reuse values. SEQUENCE and IDENTITY are not data types: they use existing data types like SMALLINT, INTEGER, or DEC with a scale of zero. INT and BIGINT are the best choices, giving you good performance and a decent range. Negative values are also permitted.

There are other ways to generate artificial primary keys. A trigger is good if only one user accesses the table at one time (and only one row gets inserted each time). Define your primary key column as not null with default, so it gets a dummy value when it's not specified in an INSERT (the trigger will overwrite the dummy value):

CREATE TRIGGER AutoIncrement NO CASCADE BEFORE
   INSERT ON Foobar
     REFERENCING NEW AS n
          FOR EACH ROW MODE DB2SQL  SET (n.col1) =
     (SELECT COALESCE(MAX(col1),0) + 1 FROM Foobar )

DB2 also has a function called GENERATE_UNIQUE. This combines a timestamp with the node number (useful for multi-partition databases), so it works with Enterprise - Extended Edition (EEE). IDENTITY and SEQUENCE won't work with EEE until the next major version of DB2. GENERATE_UNIQUE has two drawbacks: the data type (CHAR(13) FOR BIT DATA) is not sequentially incremented, and is not as easy to work with as a numeric data type.

An even simpler solution is a scalar subquery expression:

INSERT INTO Foobar (key_col, ...)
      VALUES (COALESCE((SELECT MAX(key_col) FROM Foobar) +1, 0) ...)



Back to top


Get me one screen of data

These approaches are well and good for schemas and applications where you have a chance to do some design work before the database and application go into production. But remember those two 4.5 letter words that start with A (ARIES, the airlines reservation environment simultion, and ACID, atomicity, consistency, isolation, durability)? If you make an airline reservation, you want them to remember it when you arrive at the airport. This is durability: useful data is durable. This means that even if you defined a good primary key, someone may ask for “the first twenty rows” of a result set, regardless of how many rows are in the result set. Worse yet, someone may ask you to display rows 21 through 40. But wait, you protest, rows in a relational table have no order! You might as well be speaking Icelandic to the user who wants to see twenty rows at a time in their Netscape browser. DB2 allows you to order a result set on the fly, and fetch an arbitrary number of rows from the beginning or end of this result set:

SELECT NAME FROM ADDRESS
    ORDER BY NAME
         FETCH FIRST 10 ROWS ONLY

SELECT NAME FROM ADDRESS
    ORDER BY NAME DESC
        FETCH FIRST 10 ROWS ONLY

The ORDER BY will force a sort of the entire result set in memory, so we're not doing this so improve DB2 server performance (although sending only 10 rows to the client may improve network performance). If you don't care about the order, and just want to know that at least 10 rows qualify for the result set, eliminate the ORDER BY to save the sort on the DB2 server:

	
SELECT NAME FROM ADDRESS
         FETCH FIRST 10 ROWS ONLY

So now we've seen you number the rows and select an arbitrary subset. Presumably we numbered rows for some performance benefit that necessitated breaking the relational model. We're almost utterly compromised, and have committed six of the seven relational sins. There's one relational vow you haven't broken: let's number the rows on the fly, sacrificing performance and relational purity. How do we justify this? Blame it on the Internet.

Displaying corporate data to a customer using a browser apparently justifies breaking our pledges to both relational purity and the idol of performance. You assign row numbers to a result set on the fly using either the rownum or the rank functions. Below we order the rows from the table I use to keep track of addresses, and select rows 11 through 20. The result set consists of the name, and a column created on the fly called rn, which numbers the rows:

SELECT * FROM (SELECT NAME, rownumber() OVER
    (ORDER BY NAME)
      ;AS rn FROM ADDRESS)
    AS tr WHERE rn  BETWEEN 11 and 20

Rank is more sophisticated, and allows you to identity ties in the sequence you order by, ideal for the football pool:

	
create table football (team char(10), points int)
 insert into football values ('United', 20)
 insert into football values ('Arsenal', 20)
 insert into football values ('Liverpool', 10)

 select rank() over
      (order by points desc) as place,
          team, points
            from football

PLACETEAMPOINTS
1United20
1Arsenal20
3Liverpool10


Back to top


Emptying a table – without telling the logger (TRUNCATE TABLE)

Now that you've got all these neat ways to fool around with your data, it's time for your boss to announce that it's time to start over. Other products have something called “TRUNCATE TABLE” – delete all rows in the table, don't log the operation (so it's fast) but leave the table intact (otherwise you'd just issue a DROP TABLE). You do this in DB2 by running LOAD with the REPLACE option, and using a zero byte file as input. DB2 LOAD is not logged by default, so this is basically a little shell game.



Back to top


Shell games and the house of mirrors

Your table is perfect – why access it from a view? There are lots of reasons for this:

  • Column level security: omit the columns you don't want users so see in the SELECT that defines the view
  • Row level security: DB2 v7 on Windows/UNIX/OS/2 does not allow you restrict access to certain rows in a table unless you define a view (remember the check option if you want to limit updates to what the user is allowed to see):
    create view london_football as
        select * from football
          where team in ('Arsenal','Aston Villa')
            ;with check option
    

    Consider the power of this for a Human Resources application: a user can view employees with salaries under $nn,nnn, and give anyone a raise that does not increase their salary above $nn,nnn.

  • DROP COLUMN: DB2 does not allow you to drop a column. I can think of three reasons you might want to drop a column:
    • Reclaim space: if you want to do this, export the data you want to keep, drop the table, recreate the table with the columns you care about, and load the table. Is this expensive? Sure, but reclaiming space requires this or a REORG TABLE. These are inherently expensive operations.
    • The column is no longer a logical part of the row: for example, you realize that your employee might have two addresses, and stop tracking the address in the employee table (you now have an n:m relationship between the employee table and the employee_address table). Create a view on the employee table that does not include the address column.

      If you really want to get fancy, use the RENAME TABLE command to give the base table a new name, and use the original table name as the name of the view. Your view can also join the useful columns in the employee table with the addresses retrieved from employee_address. Now we're back the relational straight and narrow.

    • The column got wider. If it's a VARCHAR, you're in luck. DB2 allows you to widen VARCHAR columns up to the width of the page size defined in the tablespace (4005 with the default 4K page size, but up to 32,672 on 32K pages):
      create table t2 (col1  varchar(10))
      alter table t2 alter column col1 set data type varchar(12)
      



Back to top


I liked the view so much, I materialized it

If derived columns are not corrupt enough for you, how about an entire derived table? And the power to have it match, or not match the data in the base table (and make every SELECT a potential dirty read)? Oracle calls these materialized views. DB2 calls them automatic summary tables, with a special case called a replicated summary table. If a question is asked frequently (SELECT MAX(ORDERS) FROM LEADS), or an aggregation is often assembled (SELECT COUNT(FRANCHISES) FROM STORES WHERE STATE='TEXAS'), maybe it's worth storing the answer set on disk so DB2 doesn't have to recalculate it twenty times per day: especially if data several days old is accurate enough to support the decision based on the query.

Let's start with the ravenous sales managers who want to know which customer orders the most. They track this in a table called LEADS, reasoning that the number of orders a customer placed in the past might help in deciding which sales leads have the best probability of turning into real sales. This question gets asked several times a day (you can use a DB2 tool called Query Patroller to see the queries coming from your users if you have a hunch that this is happening and need to verify it). SELECT MAX() typically requires a table scan, forcing DB2 to look at each row in the table. If you have lot of leads, that's a lot of rows to scan to find one value. Defining a summary table allows DB2 to store this value on disk, leaving DB2 the option to read only one row to get the answer:

create summary table leads_max
    (MAX_ORDERS) as  (SELECT MAX(ORDERS) FROM LEADS )
      DATA INITIALLY DEFERRED
        REFRESH DEFERRED

After you create the summary table, populate it with this command:

REFRESH TABLE LEADS_MAX

Users don't have to know about the summary table. It's up to the DB2 optimizer to decide when to use the base table, and when to use the summary table. Note the REFRESH DEFERRED clause: you're telling DB2 that old data is acceptable in the summary table. This is appropriate when you don't need an exact or current answer. It's great for building a business plan, but not how you want to store your bank balance. Read about the special register CURRENT REFRESH AGE in the SQL Reference, and see the section “Creating a Summary Table” in the Administration Guide to learn how to set tolerances for old data in summary tables when an answer can be “close enough”, and need not be exact.

REFRESH DEFERRED is ideal for summarizing data on read-only tables. A special summary table for multi-partition databases is called a replicated summary table. You use this in DB2 EEE to have a copy of small tables (or read-only tables) on every partition. In EEE, you typically spread the largest table (called a fact table) over all partitions. Join keys that get used a lot (such as customer number) should be used as partition keys. DB2 hashes data to partition it. This means that join keys with less use (such as country or department) may be distributed in a sub-optimal manner. When you're joining data in a multi-partition database, joins with collocation are faster (for example, all the rows with Argentina as COUNTRY in the CUSTOMER and COUNTRY tables are on the same partition). This isn't possible if COUNTRY is not your partition key. To get collocation, you can restrict the smaller table to one partition, and create a replicated summary table that copies it to other partitions. The strategy works if the copied table is small or rarely changes (avoid this if you do business in countries that regularly rename themselves). If the table is really small, such as a list of all continents, don't bother to replicate it: DB2 will ship it to all partitions and keep it in memory during a join. Don't worry about joining to the replica by name: it's DB2's job to figure out when replica tables can improve performance.

Summary tables can be used for dynamic data by making them REFRESH IMMEDIATE. This has stricter rules than REFRESH DEFERRED, so read the SQL Reference carefully. You still have to use the REFRESH TABLE statement after the summary table is first created:

CREATE SUMMARY TABLE LEADS_BY_STATE
        ;(NUM_LEADS, GRP_STATE)
    AS (SELECT COUNT(ORDERS), STATE FROM LEADS GROUP BY STATE)
      DATA INITIALLY DEFERRED
        REFRESH IMMEDIATE
REFRESH TABLE LEADS_BY_STATE



Back to top


This too will pass

We've now had a look at two kinds of views. As a standard view, the view definition is stored in the database (in SYSCAT.VIEWS.TEXT) and the data is only stored in the base table. We can make SELECTs on this data faster by creating summary tables, which come at the expense of redundant data, chewing up more disk space, and making INSERTs, UPDATEs and DELETEs slower (or letting the base table and the summary table get out of sync, at least until the next refresh). There is another extreme: create an aggregate that lasts only for the life of the database connection, or even as brief as the life of the SQL statement. The first is known as a DECLARED TEMPORARY TABLE, and the second is called a COMMON TABLE EXPRESSION, also known as a TEMPORARY RESULT TABLE. A declared temp table requires a USER TEMPORARY TABLESPACE, something you create with the CREATE TABLESPACE command (see the SQL Reference). You can declare the temp table as a holding place for data while you application chugs along:

      
DECLARE GLOBAL TEMPORARY TABLE table1
    (column1 INT, column2 INT)
    NOT LOGGED

You qualify the table with the schema SESSION, as it belongs to the session created when you connect to the database:

INSERT INTO SESSION.TABLE1 VALUES (4,5)
SELECT * FROM SESSION.TABLE1

You may want the temp table to match an existing table, so you can populate it with a SELECT from the existing table. If so, use LIKE to create it:

DECLARE GLOBAL TEMPORARY TABLE TEMP_EMP
    LIKE EMPLOYEE
    NOT LOGGED

 INSERT INTO SESSION.TEMP_EMP
    SELECT * FROM EMPLOYEE

The temp table will be dropped by DB2 when you disconnect. For something even more transient, DB2 supports common table expressions, which allow you to define a table that exists only for a single statement. A common table expression is also the answer to a trivia question: name an SQL statement that does not begin with a verb:

	
WITH COMPENSATION AS
     (SELECT SUM(SALARY+COMMISSION)
     AS TOTAL FROM EMPLOYEE)
      SELECT TOTAL FROM COMPENSATION

You've now been guided out of the red light district. A common table expression violates no relational principles: it does not require DB2 to store derived data, nor does it add artificial columns. If I had to invent a moral for this story, let's say that there are n ways to solve a problem with a computer. One way is cheapest, one is fastest for you, one is fastest for the user, and another is fastest for the developer who inherits your design to do maintenance and add function. Which choice is best is left as an exercise for the reader.



About the author

Blair is a seventeen-year veteran of the IBM Toronto Lab. He uses names for primary keys in his address book, and consequently will never have two friends with the same name.




Rate this page


Please take a moment to complete this form to help us better serve you.



 


 


Not
useful
Extremely
useful
 


Share this....

digg Digg this story del.icio.us del.icio.us Slashdot Slashdot it!



Back to top


IBM, Informix, Informix Dynamic Server and DataBlade are trademarks or registered trademarks of IBM Corporation in the United States, other countries, or both. Windows and Windows NT are registered trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others. Other company, product, or service names may be trademarks or service marks of others.