Lumea lui Kudzu

"Programarea este o formă de artă care ripostează"

Kudzu World  »  Articles  »  DBDesign
English - Română - Русский - عربي

Ştiaţi că?

Planta kudzu poate creşte 7 picioare (peste 2 metri) pe săptămână? Descoperiţi mai multe despre uluitoarea plantă kudzu.




Using PayPal?
Read this


Database Design and Modeling for Programmers

Introduction

Most programmers aren't qualified database administrators, but were often thrust into the role of "Acting DBA". This article will cover some basic techniques of database design and modeling. As usual I will use real world examples and expose certain pitfalls of not following these guidelines. This article is designed for the programmer who is familiar with client server databases, and SQL (however much of this applies to ISAM databases such as Paradox also)

This article will allow you to design databases that are flexible, scalable and easy to maintain. They should also be of the caliber that when an actual DBA or another programmer inherits your system they will not want to go find another job. This is not intended to train you to be a full fledged DBA, or cover advanced design or performance issues. This is a crash course in "etiquette and procedure" for those of you that already know how to translate real world concepts into tables and relationships.

There are also some useful tips in here for DBAs. Too often DBAs do not receive input from developers, and have no idea what makes thing hard or easy for us. And in some cases there are DBAs who are so stuck in "acadamia" that they don't have a taste of reality and the great pains that must be endured to work with their data models. If you have to work with a "Tenth Normal Form" DBA or one who thinks the 3 in 3NF is "cubed", I advise you plaster this article on his cubicle wall.

Terminal Sins

Here are a few things you should avoid at all costs.

Possible Reserved Words

When you are creating column names and table names, NEVER give them names that can conflict with reserved words in your database, or any other database. Just because it is not a reserved word in the database that you are using, does not mean that it won't be in a future version, or in another database. After all, this system may be ported some day. Reserved words are legal as column and table names, but often require special changes to SQL to access them. Many databases let you access them by surrounding the name in quotes, but this can have other side effects, (Such as case sensitivity). Surrounding column names in quotes is tedious, unnecessary, and not standard practice.

I cannot stress this one enough. I cannot believe the number of data models that I run into that violate this cardinal rule.

Some common ones to avoid are: Name, Order, By, Sort, Number, Currency, Text, Desc, Date, Time, Username, Password, Group, Data, Min and Max.

I'm sure you can think of others, just use common sense. "Desc" is the one I most commonly run across. It is easy to slip by, and I've even used it a few times myself. "Desc" is often used to represent "Description", however "Desc" is also a modifier in SQL for the order by clause. Use "Descr" instead. The usage of obvious ones such as Name and Text is completely unacceptable, but is commonly used.

If you have a column which you would like to name a reserved word, modify the word with a prefix or suffix. Unless you have a valid modifier, a good practice is to prefix the column name with the table name and a underscore. For instance in a person table imagine that you have designed has two columns: Person_ID and Name. The second field violates this rule, so we need to change it. No modifiers make sense here (such as First, or Last since this is probably the whole name) so use the table name and make it Person_Name

Ground Rules and Helpful Tips

Let's cover a few ground rules first. These are rules that will require very little explanation.

Make Table Names Singular

To simplify things, make all table names singular instead of plural. Just think of how many times when constructing SQL you ask yourself "Is it Customer or Customers?". Most times a table like this would be named Customer. But what about a table named Order_Line? The tendency is to name it Order_Lines because it is on the many side of a relationship. To simplify things and eliminate any doubt for the users of the database, keep all table names singular.

Avoid Abbreviations

Unless it is blatantly obvious, or it is absolutely necessary, do not use abbreviations in your table or column names. It is just not worth the pain and frustration you are causing your users. If a user has to look at the design comments to merely figure out what a column or table name "expands" to, it's to abbreviated.

Here are some common examples of abbreviations that should not be used: (By no means a complete list)

Name

Possible Meanings
Inv_ID Inventory? Invoice? Investment?
Disc_Code Discontinued? Discount? Discovery?
Item_Ct Count? Connector?
Prod_Count Product? Produced? Production?
Proj_Amt Project? Projected?
Ext_Rate Extension? External? Extra?
Org Organization? Originator? Origin? Orangutang? (Just seeing if you are paying attention)
Req_ID Requisition? Request?
Rec_ID Received? Recipient? Receipt? Recovered?
Cat Catalog? Category?
Grp_Code Group? Grouping?
Comp_ID Company? Compensation? Compreshensive?

Here are some well recognized acceptable abbreviations. (These would be used as part of a column name, not necessarily by themselves, ie Pkg_ID, Pkg_Weight, etc..)

Name

Meaning
Pkg Package
No Number
Qty Quantity
Hdr Header
Attn Attention
Addr Address
Alt Alternate
Dept Department
Exp Expiration
Seq Sequence
Mgr Manager
Amt Amount
Min Minimum
Max Maximum
Doc Document

It is a good practice to make your own table of "acceptable abbreviations" and only use ones that are on it. This way all abbreviations are defined, limited, and available to your users for reference. This also makes you think twice about using an abbreviation. You should also be consistent, if you abbreviate a certain word in one place, do it everywhere.

In many databases abbreviations are well known for the industry that it serves. Just make sure they follow the rule of not having more than one common possibility, and document them in your abbreviation reference.

For example, we all know what CPU and MIPS stand for, but how many people outside of the computer industry know their meaning? It is the same in the industry you are serving, whether it be telecommunications, energy, or farming.

Use Underscores

Since many databases and tools either completely ignore case, or force things to upper case, use underscores. AltProductID is relatively easy to read since it's case varies, but ALTPRODUCTID is much tougher. So be kind to your users and make it ALT_PRODUCT_ID

Do Not Use Column Names That Match Parent Table Names

Do not create a column in a table which matches the table name it is contained in. Ex: Location Table which contains a field named Location. Doing so will cause alias problems and user confusion during table joins.

Do Not Use Spaces

Many databases will not accept spaces, but many will, and others allow you force them. Do not however use spaces in your table and column names. There are two main reasons for this. The first is that it will require additional steps in SQL statements and program code. The second is that it's not portable among most databases.

Happy Users

This tip is mainly for DBAs, but everyone should keep it in mind. Your users are your customers. Make them happy, and they will make less work for you. If you listen to their needs and find out what causes them problems, they will rely on you less for support and questions, and you can go about the rest of your business.

Pause and Review

After you design each table, take a second to review it and check it for each of these common ground rules. Some of them are easy to slip by if your not paying attention.

Exercise

Take a quick walk through a data model you are working against or have designed. How many of these rules does it violate? How badly?

Concepts

Let's go over a few concepts before we get to the design. If you are familiar with these, you may wish to skip over them.

Primary Keys

A primary key is a field, or collection of fields, whose values uniquely identify each record in a table.

Foreign Keys

A foreign key is a primary key from another table included in the second table to form a relationship between the two tables.

Surrogate Keys

A surrogate key is a key which has no inherent meaning. An example of this is a social security number (A United States government tax ID number which is given to each US citizen) when applied to a person. Given your social security number, information can be retrieved from government records (assuming access is available), but from the number alone, information cannot be extracted. Also referred to as a non-decomposable key.

(I have since been informed Social Security numbers are decomposable, but I have not been able to find an example to replace it).

Intelligent Keys

Intelligent keys are the opposite of Surrogate Keys. Intelligent keys have some built in meaning. An example of this is a phone number. Given the prefix of a phone number, it can be determined in which country and even region in which the phone is located. Also referred to as a decomposable key.

Referential Integrity

Referential integrity (RI, some times also referred to as constraints, although constraints encompass more than just referential integrity.) is the requirement that data values designed to reference row keys in other tables must indeed reference rows that exist. There are two kinds of referential integrity, procedural and declarative. Procedural is referential integrity that is implemented by your application or middle tier business objects. Declarative referential integrity is enforced by the database (Usually via Foreign Keys), or by triggers in the database.

Database Design

Further Naming Considerations

When creating table names and column names there are further considerations in addition to the ground rules. I have separated these from the ground rules, as the ground rules are exactly that: rules. These are design considerations and recommendations. You may already have system in place which better suits your needs, or is merely an alternative.

Use _ID and _Code

Use the suffixes _ID and Code on your column names. _ID should be used on any keys and their references that are numeric. Use _Code for keys and their references that are alphanumeric.

_ID Prefix Example

Person_ID

Person_Name
1 Chad Z. Hower
2 John Smith
3 Jason White

_Code Prefix Example

Person_Code

Person_Name
CZH Chad Z. Hower
JSS John Smith
JBW Jason White

Name Followed By Modifier

This concept at first seems foreign to those of use who speak English and many other languages. However, from a logic stand point it makes a lot more since, and will cause the columns to be grouped when alphabetized. Basically when you have a field like First_Name you should reverse it to Name_First. It makes more sense logically, because first and foremost it is a name, but it's sub type (ie it's modifier) is First. This is more apparent when you have the same column name duplicated, but with different modifiers such as: Name_First, Name_Middle, and Name_Last

Liven Your Booleans

Give your booleans some meaning. After all they only have two choices in life. <G>. What I mean by this is introduce some meaningful modifiers into them. Introduce words like Is, Does, Want, Has, Needs. Furhter more, these modifiers can even alter the meaning if needed. (See Contact Example)

Here are some practical examples of how they can help:

[Image]

Define Your Booleans

Many databases do not support a boolean type. Typically what occurs is that a Char(1) data type and values of Y/N or T/F are used.

However more times than not, no constraints or validation are placed on such columns so realistically any data can be put in these fields and two common problems occur.

Problem 1: In some tables Y/N is used, and other T/F. This is inconsistent and will not only confuse the users, but sometimes one developer may use T/F on a table that already has Y/N causing severe data problems and unpredictable application behavior.

Problem 2: Someone misunderstands the column, and puts other data into it. This will cause unpredictable behavior in your applications.

You have the option to use many values Y/N, T/F or even custom values for each case like A/D for Active/Disabled, etc& The most common is probably Y/N, however I would recommend T/F because it is more universally applied to circumstances and better matches a true boolean definition. It is often the fact that Y/N makes for an odd partner in some situations that many databases use both T/F and Y/N. T/F works everywhere, and allows you to be consistent.

The solution to this is to add a constraint to these columns. Many databases will allow you to merely enter a rule (often called a check condition) and tell it to only allow T/F. Another alternative is to use a Insert and Update Trigger to validate it. A last resort it so set up a table with one column and two rows, each with T and F and use referential integrity. I would advise against the last method, but it is a viable option when others are not available. Be sure to validate or convert it to a upper case so that F and f cannot be distinct.

Many databases allow you to set up a user defined data type (often referred to as a domain). If so, you should create your own boolean type with a constraint of T/F and use that. It is advisable not to call it boolean or even bool, as your database vendor may add it in the future, and it will not only conflict but be incompatible with your type. Try something like BooleanChar.

Use Referential Integrity

What ever referential integrity can be implemented in the database should be implemented there. Otherwise you will need to duplicate code to every system which accesses your database.

The exception to this is three tier systems in which all clients access the middle tier (business objects) and they are the only ones to access the database. This is usually used to implement complex rules (ie Business Rules) that are difficult or impossible to do inside the database itself. Even in three tier systems however, the simple rules should be placed in the database because it can perform them much more efficiently and reliably. In addition, declarative referential integrity still functions during batch SQL updates.

An example of a referential integrity is a company. Imagine that you have a company table and a person table (Resist people, remember singular). Person has a field called Company_ID which refers back to the company table and determines which company a person works for. Referential integrity would ensure that the Company_ID field in the person table refers to a record in the Company table which actually exists. The database would not allow you enter or change this field to a non existent company, and it would also not allow you to delete a company which had related person records. In this case Company_ID would be referred to as a foreign key.

In a non production system which is evolving quickly and is apt to change, you may find it advisable to not use referential integrity until near completion. If so, you should plan for it and implement it in the final stages of development. Be warned though, this can also cause some problems by not detecting invalid data. But it can in many cases make it much easier to modify the database structure. 

A Database With a View

A view is a virtual table that is does not really exist. It is a virtual table that contains data from one or more tables or other views. Views are often used to store commonly used queries, but is not simply stored SQL (Although a few databases implement them this way). Most SQL databases support views.

Denormalization / Normalization

In the process of normalization during database design, data is split into many tables and the retrieval of rows and all of their associated data can involve large and complex SQL statements. You can minimize the impact on your users by creating views that contain the joins and essentially provide a denormalized access method. This will also provide a performance benefit in that it is not dynamic SQL which must be parsed and bound each time.

Security

It is a good practice to never let users access tables, but only access views. You should create your tables as normal, but remove permissions for the public and create a view with a direct mapping to the table for the users to access. The easiest way to do this is to name your tables with the TBL_ prefix, and their view counterparts without the prefix. Your users will not notice any difference, but this will allow you or a DBA later down the road to more easily modify the system, and implement robust security schemes.

Primary Keys

Compound Primary Keys

Many databases contain compound primary and foreign keys. These are keys which are constructed of more than one column. Often they consist of two or three columns. These are a way to properly construct a database, however they can also make SQL writers and developers life much more complex than needed. It's not uncommon to need to join 6 or more tables in a single SQL statement, and when each table to table join requires you to reference three columns, this severely complicates the issue.

My recommendation is to make all primary and foreign keys simple, a single 31 bit integer. Furthermore make them surrogate keys. The database integrity can still be enforced by placing constraints and/or secondary indexes on the other fields.

A single 31 bit integer will suffice in 99% of tables. I understand that there are systems which have tables that have more than 2 billion rows now or in the future, but this is definitely not even close to being a majority. Use this scheme for the majority, and use other schemes for tables that exceed this limitation.

Unique Primary Key Names

As mentioned above use <tablename>_ID for numeric keys. By doing this, every primary key has a unique name. This has several advantages, but the main one is in foreign key referencing. When you create foreign keys, simply name them the same as their reference. Many tools will detect this and match joins to the proper tables. In addition, this convention is easily read by your users, and eliminates the need to constantly dig through hard to read data models (Large data models are very hard to read, especially when joins intersect and change directions underneath other tables).

Auto Generation

Many users are accustomed to "AutoInc" fields that populate themselves during insert. There are some problems however with these types of fields in that the value of this field is not available until after the record has posted. Most times a developer needs the value of the inserted record to place references to it in foreign keys, or for a user interface element and autoinc fields do not allow for this. There is however a relatively simple solution to this dilemma.

For each surrogate key, create a sequence (Oracle) or generator (Interbase). Then create a before insert trigger. In this trigger, check to see if the primary key is null. If its null retrieve the next value from the sequence, and insert it into the primary key.
This approach allows the developer to retrieve the sequence ahead of time, and insert it manually if the value is needed in code as well.