Beginning Database Design - Spot the Flaws

  • Great idea for a series of articles, and a good teaser to set them up.

  • Hi Steve,

    I think you just invented the "community article" 🙂

    My hope is that this will be one of those articles with a lot of discussion and ideas.

    Here are a few comments:

1. The ISBN column should be char instead of varchar, since it will always be 10 characters. This saves the two extra bytes varchar uses to store the length.

    2. The author could be outsourced to a different table

    3. The image could be outsourced to a different table (or even stored within the table instead of just storing the URL?)

4. Referential integrity could be set up between the bookstatus table and the BookStatus column. Depending on the list of possible values (fixed or varying, big or small), a check constraint might also be sufficient.

5. The Review column should nowadays be varchar(max) or nvarchar(max) instead of text. It might also make sense to split the review into summary and body. A full-text index could be created on the Review column.

6. The Owned column could be removed. Instead, create a separate table with books that I own, storing the ISBN. This would make sense if the number of books in the books table is significantly higher than the count of books that I own.

That's my "quickie" list so far, but there's probably a lot more out there. But I guess someone else will be happy to translate this to 3NF.
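A minimal DDL sketch of how some of these points might fit together (SQL Server syntax; the table and column names here are my own assumptions, not Steve's originals):

```sql
CREATE TABLE BookStatus (
    BookStatusID tinyint NOT NULL PRIMARY KEY,
    StatusName   varchar(50) NOT NULL
);

CREATE TABLE Books (
    ISBN       char(10) NOT NULL PRIMARY KEY,   -- fixed length, no varchar overhead
    Title      nvarchar(200) NOT NULL,
    Review     nvarchar(max) NULL,              -- nvarchar(max) instead of text
    BookStatus tinyint NOT NULL
        REFERENCES BookStatus(BookStatusID)     -- referential integrity to bookstatus
);

-- Books I own, instead of an Owned flag on every row:
CREATE TABLE OwnedBooks (
    ISBN char(10) NOT NULL PRIMARY KEY REFERENCES Books(ISBN)
);
```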

    Best Regards,

    Chris Büttner

• Very nice article. I agree with Old Hand on most recommendations.

I might consider using NVARCHAR instead of VARCHAR, in case you want to expand your book reviews to books in languages where Unicode characters might be used.

I might also split the Reviewer out into a separate table, and link reviews to books using relationship tables. The reason is that you can then start building a history of what was reviewed by whom and when, and what they said.

Also split the Author(s) of the book into a separate table, so you can search on authors by their relationships to books, instead of by a name found somewhere in a varchar column. You can then expand their relationship to the book: an author of TypeID = 1 might mean the main author, TypeID = 2 might mean the publisher, etc.

And then some other interesting things might be added to the review table, such as the last known price of the book, how many were printed, etc...
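A sketch of the author split described above (the table and column names, including AuthorTypeID values, are invented for illustration):

```sql
CREATE TABLE Author (
    AuthorID   int NOT NULL PRIMARY KEY,
    AuthorName nvarchar(200) NOT NULL
);

-- Relationship table; TypeID distinguishes the role, as suggested above.
CREATE TABLE BookAuthor (
    ISBN     char(10) NOT NULL,
    AuthorID int NOT NULL REFERENCES Author(AuthorID),
    TypeID   tinyint NOT NULL,   -- e.g. 1 = main author, 2 = publisher
    PRIMARY KEY (ISBN, AuthorID, TypeID)
);
```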


  • A somewhat off-topic remark: You write "As we're publishing more now there are 13 digit ISBNs ...". Just as background information ... reality is somewhat complicated here: Actually, the first 3 digits of a 13-digit ISBN are always either 978 or 979; the resulting number is then a valid EAN ("European Article Number" - a subrange of these are the JAN = "Japanese Article Number"s), which is now printed as a bar code on almost all things (or packages) sold anywhere.

All "old" 10-digit ISBNs get the prefix 978, and the checksum (last digit) is computed anew according to the EAN rules (hence the checksum is always a digit, in contrast to the 10-digit ISBN, where a remainder of 10 is encoded as the letter 'X').

    The new 979-numbers actually increase the range of possible publishers by a factor of two, but not more ...
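As a worked illustration of that checksum rule, here is a T-SQL sketch (SQL Server 2005-era syntax) using the well-known example ISBN-10 0-306-40615-2, whose 13-digit form is 9780306406157:

```sql
-- Recompute the EAN-13 check digit from the first 12 digits 978030640615;
-- weights alternate 1,3,1,3,... from the left.
DECLARE @d char(12), @sum int, @i int;
SET @d = '978030640615';
SET @sum = 0;
SET @i = 1;
WHILE @i <= 12
BEGIN
    SET @sum = @sum + CAST(SUBSTRING(@d, @i, 1) AS int)
                      * CASE WHEN @i % 2 = 1 THEN 1 ELSE 3 END;
    SET @i = @i + 1;
END;
SELECT (10 - @sum % 10) % 10 AS CheckDigit;  -- 7, matching 9780306406157
```

Here the weighted sum is 93, so the check digit is (10 - 93 % 10) % 10 = 7, and it is always a digit, never 'X'.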


    Harald M.

  • Dear Steve, A very good and effectively simple article! Thanks.

Hi Christian, I have two questions for you.

1. Suppose changes have been made up to the level you suggested in your reply; what level of normalization will that be?

2. What would it need to be counted as 3NF?


  • Sorry if I come in "high-level", but what irritates me most is that there is no high-level description of what is needed. Some of the previous posters' remarks show this: Old Hand wrote

"1. The ISBN column should be char instead of varchar, since it will always be 10 characters. This saves the two extra bytes varchar uses to store the length.

    2. The author could be outsourced to a different table"

These two items are on totally different levels of abstraction: the first is a technical detail (which is based on knowledge about ISBNs), whereas the second is a modelling issue, with implications like

    * do we want to unify identical authors?

    * do we want to support an m:n relationship between authors and books in the future?


    or his last remark "But I guess someone else is happy to translate this to 3NF.": Of course this is only possible if we identify the "entities" in contrast to the "values" in the model - and this is a modelling issue. Is an author "just a string" (a value) or "a real being" (an entity)? - only when we have decided on this, can we apply normalization.

    E/R modelling has long honored the disciplines of logical and physical modelling, exactly because of such issues; and with UML and other modelling concepts, it is even clearer that "logic"/"requirements" (what do we need conceptually to solve the problem) and "physics"/"design" (what do we need technically to solve the problem well for the chosen technology) must be separated - although, of course, they are very heavily dependent on each other (and of course, I do not want to promote "waterfall", i.e., someone must "know beforehand exactly what is needed" - this is nonsense, and certainly so in small and very small projects).

Is it worth pursuing such a clear separation of issues in such a minimalistic project? I'd argue yes - maybe even more than in a large project: because otherwise, discussions become muddled:

    - A: "I want to source out the author"

    - B: "But why, it's so small, there is no performance gain!"

- A: "But we need to have consistent author data"

    - B: "But then you will lose performance because of all these JOINs!"


    Such discussions are often quite fruitless, unless there is a common understanding that these two levels of issues must be separated.

One issue which was not yet mentioned: nullability. Is it already (on the requirements level!) decided that all data are nullable? - I'm quite sure that this is not ok; this must be clarified on the requirements/use-case level. And technically, your scripts don't even work: a primary key can only be declared on columns that are not null (so maybe the scripts are a clumsy way of expressing requirements? - then, Steve, please quickly switch to E/R diagrams or UML or whatever ... but not DDL/SQL).

    Otherwise, I support many of the items of the previous posters - only I'd like to see them "sorted into the right buckets" (of "requirements" and "design", respectively).


    Harald M.

  • Some of my thoughts on this:

    The table name should reflect a row in the table so should be book not books.

    The primary key would be better as a number - tableID, an identity. The reason for this is that the index will be much smaller and more efficient. Any foreign keys that relate to the book will also have smaller, more efficient indexes.

I agree that it is probably better to use nvarchar instead of varchar, and nvarchar(max) for the review. This avoids collation issues and allows all characters if, for instance, we want to store Chinese characters etc. It is also more efficient if we wish to use a .NET front end, as it converts everything to Unicode anyway.

A separate Author table and a link table AuthorBook would allow an author to write multiple books and a book to be written by multiple authors.

A separate Review table could contain a foreign key to the Book table. This would have the Rating, the ReviewDate, the BookReview nvarchar(max) and the BookStatus columns, as well as a ReviewerId (a foreign key to a Reviewer table). This would allow multiple reviewers.
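A sketch of that split (assuming a Book table keyed by a BookID identity, per the earlier suggestion; all names here are invented):

```sql
CREATE TABLE Reviewer (
    ReviewerID   int NOT NULL PRIMARY KEY,
    ReviewerName nvarchar(100) NOT NULL
);

CREATE TABLE Review (
    ReviewID   int NOT NULL PRIMARY KEY,
    BookID     int NOT NULL REFERENCES Book(BookID),        -- FK to the Book table
    ReviewerID int NOT NULL REFERENCES Reviewer(ReviewerID),-- allows multiple reviewers
    Rating     tinyint NOT NULL,
    ReviewDate datetime NOT NULL,
    BookReview nvarchar(max) NULL,
    BookStatus tinyint NOT NULL
);
```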

  • ["you" means "anyone" in my posting; "Steve" is "Steve" :-)]

    ... let me ramble on a bit more about basic issues of requirements and design (if you only want to read about performance or SQL syntax, please ignore my posting - I'd say ...; also, my posting may assume that you have thought and maybe learned about entity or class design somewhat).

    Here is another design for the problem:

    create table items (
        id integer not null primary key, -- artificial key for each "item"
        sort varchar(20) not null, -- what sort of item is this? a 'BOOK', a 'STATUS', ...
        readable_name varchar(200) not null
    );

    create table properties (
        id integer not null references items(id),
        property_name varchar(20) not null,
        number_value numeric(16,4), -- nullable
        char_value varchar(200), -- nullable
        primary key (id, property_name)
    );

A book would then be stored as a record in the first table with e.g. the values (12345, 'BOOK', '0123456789'). Its attributes are stored in the properties table, e.g. as records with the values (12345, 'AUTHOR', null, 'James Mitchener') and (12345, 'IMAGE_URL', null, 'http://...').
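Written out as INSERT statements against the two tables (nothing new here, just the values from the paragraph above):

```sql
-- The book itself:
INSERT INTO items (id, sort, readable_name)
VALUES (12345, 'BOOK', '0123456789');

-- Its attributes as key-value pairs:
INSERT INTO properties (id, property_name, number_value, char_value)
VALUES (12345, 'AUTHOR', null, 'James Mitchener');

INSERT INTO properties (id, property_name, number_value, char_value)
VALUES (12345, 'IMAGE_URL', null, 'http://...');
```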

    Horrendous, isn't it? Yet, there are well-known systems that are implemented using this "key-value-pair" approach (Outlook comes to mind, for example; and other Microsoft tools). There are obvious disadvantages here:

    - defining constraints is almost impossible (for referential integrity, nullability)

    - even type-safety is hard (your AUTHOR could be a number_value!)

    - etc.

    But there is one big plus:

    - flexibility on the user side: It is easy to write the program in such a way that the user can add new "slots" to each item.

So how does one decide between such designs (and there are many more intermediate ways which try to combine the advantages of the "classical" E/R design and the "flexible" key-value-pair design)? - based on the requirements! If end-user flexibility is a high-up item on your product manager's list, you will tend towards the key-value-pair design (at least for parts of your system); if perfect data (and performance?!?!!) are higher up, you'll use the classical approach.

Neither of the two is inherently wrong - and so discussions about this are usually pointless, unless you have a clear vision of what your system is to accomplish - today and also in the future.

Because I'm not sure where Steve's nice small tool should grow, I'm hard pressed to decide such issues - and, I'd say, so is everyone else - unless one has implicit assumptions about this product. But, in my humble experience, differing implicit assumptions are one reason for the worst mud-wrestling in teams and hence should be attacked immediately - by making the assumptions explicit!


    Harald M.

• I'm fairly new to the db design world myself, but here are my opinions...

    My first thoughts are purely datatype related.

Rating is 1-5. Why is it an int? If you're planning on a huge table and no value will ever be below zero or above 255, go for tinyint. This will save you three bytes per row.

    ReviewDate. Do you really need milliseconds? Do you expect your site to live past 2079 as is? Certainly you're never going to have a review before the 1990s, so use a smallDateTime here. You'll save 4 bytes per row and you can always increase it if your site does look like it's going to survive past 2079.

Owned - again, using an int when a bit flag would be more appropriate. After all, the value is true or false: you own it or you don't. Why waste the space for the extra values?

What is the BookStatus column? A column and a table with the same name is, IMHO, not a good thing. If you get an error on BookStatus, you'll have to look closely to decide whether the error is in the table or the column. I recommend not using duplicate names between columns and tables. BookStatusID would be a better name if you're truly using it as an identity column. Not to mention, if you've only got a limited number of statuses, you could make that column a smallint or tinyint to save space.

    Given the requirements you listed, your imageURL column is already wrong. It's a character column that's only pointing to a URL. You don't have any place to store the actual image if there is no URL. This should be broken up into a separate table, with an ImageID column in BookStatus. Then you have the BookImage table with three columns, ImageID, ImagePic, and ImageURL. You could add a fourth bit column, ImageOnAmazon, to denote if your code should check the URL first or go straight for the stored image (I've seen developers do that), but truthfully, it'd be just as simple to check ImageURL to see if it's NULL or not.
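A sketch of that BookImage table (the column names follow the description above; storing ImagePic as varbinary(max) is my assumption):

```sql
CREATE TABLE BookImage (
    ImageID  int NOT NULL PRIMARY KEY,
    ImagePic varbinary(max) NULL,   -- the stored image, if any
    ImageURL varchar(400) NULL      -- the Amazon URL, if any
);
-- Checking whether ImageURL IS NULL then tells the code
-- whether to follow the URL or use the stored image.
```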

And a clustered index on a text column??? If you want to search on the review, make it a full-text catalog/index. Use the clustered index on the ISBN or a surrogate PK ID column. Not to mention, MS is getting rid of the text datatype, so we'll want to make this varchar(max) or nvarchar(max), depending on whether you care about foreign languages or not.

    Also, where's the foreign key constraint on Books.BookStatus? You don't want to leave this column orphaned if something happens on the BookStatus table.

    And nowhere in the article is the mention of security & access to this db. Are you the only one who's going to be accessing the data? If not, how do you plan to design the site? You'll want to be sure your indexes and any constraints realistically take into account how the data will be pulled. I.E., what queries are going to be entered from the site. But this might be a little beyond the scope of the article.

Lastly, with all the new XML stuff available in SQL Server 2k5, if this is going to be displayed on a website, why not stuff some of this into an XML string? Like the author list and the review itself? It might make editing a little bit harder later on, but it doesn't look like you're going to be doing a lot of editing. From your description this is a Write Once-Read Many type of situation, so you might as well compress as much as possible into an XML string that'll be easier to deal with on a website than have to format everything on the website.

    Anyway, that's my .02 cents worth. @=)

Brandie Tarvin, MCITP Database Administrator

  • A Design issue:

    The indices seem totally ad-hoc. As a rule of thumb, clustered indices should be on the (logical or technical) primary key - i.e., on the isbn column for books, and on the bookstatus column. The reason is that in real systems, the vast majority of index uses is JOINs, not searches - and usually (and preferably), joins should be to primary keys.

* In a system that will not grow and hence not get new associations, ad-hoc clustered indices on non-PK columns may be ok.

    * Also, if performance testing shows (after implementation! and with large and typical amounts of data) that performance can be improved considerably by having a (the) clustered index on a non-PK column, then it may be so ...

    - but in both cases, some 20 lines of comments should explain such a non-standard decision, IMHO.

    A Language issue:

Adding primary key, foreign key and uniqueness constraints afterwards through ALTER TABLE statements is clumsy. DDL allows adding such constraints in the column or table definitions, where they logically belong.

(However, it is sometimes necessary to give names to constraints so that they can be dropped - e.g. during large imports; but again, this should be explicitly considered; and documented.)
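For illustration, a sketch of inline, named constraints (the constraint and table names are invented):

```sql
-- Constraints declared where the columns are defined, with explicit names
-- so they can still be dropped during large imports.
CREATE TABLE Books (
    ISBN       char(10) NOT NULL
        CONSTRAINT PK_Books PRIMARY KEY,
    Title      nvarchar(200) NOT NULL,
    BookStatus tinyint NOT NULL
        CONSTRAINT FK_Books_BookStatus REFERENCES BookStatus(BookStatusID)
);
```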

  • In addition to the above suggestions (most of which I agree with). Here's a few quick thoughts.

    It would be useful to know what database (version) we will be targeting. This would let us know which data types and features are available.

On that subject: if `Rating` can only contain the numbers 1 through 5, a tinyint would be more appropriate than int.

    The same would apply for `BookStatus` as I really hope there will never be more than 255 statuses!

    A boolean data type will be more appropriate for the `Owner` field, if available.

    If not available, use a bit, which cannot be null, and with a default.

    A constraint to enforce the valid value of `Rating` may also be useful.

    SmallDateTime would be better than DateTime for `ReviewDate` as the additional level of accuracy and range of possible values is not required. (Not for 70+ years at least.)

There should also be a definition of the nullability of fields. I would assume that it would be possible/desirable to enter the basic details of a book (ISBN, Title, Author, Owner, ImageURL(?)) before it is reviewed. As such, `ReviewDate` and `Review` should be nullable. `Rating` would not be nullable, but set to zero to indicate that a value has not been set.
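Pulling the datatype and nullability suggestions above together into one sketch (SQL Server syntax; the names are assumptions):

```sql
CREATE TABLE Books (
    ISBN       char(10) NOT NULL PRIMARY KEY,
    Title      nvarchar(200) NOT NULL,
    Rating     tinyint NOT NULL DEFAULT 0
        CHECK (Rating BETWEEN 0 AND 5),   -- 0 = not yet rated
    ReviewDate smalldatetime NULL,        -- nullable until reviewed
    Review     nvarchar(max) NULL,
    Owned      bit NOT NULL DEFAULT 0     -- bit with a default, as suggested
);
```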

    `ImageURL` is currently (as I understand it) used for two things: A URL for the affiliate link to Amazon; or the path to an image file.

    This could get confusing or lead to problems when trying to determine what the value in this field represents. Either have separate fields for the different types of information or have a separate field to serve as an indicator of the content of `ImageURL`. I would opt for the first option (data in separate fields). This would also allow for the situation where Amazon do not have an image of the cover. The own image could be used, but a link to Amazon could still be created.

    Add the foreign key enforcement for `BookStatus`.

    Also add an index on this field in the `Books` table. This will speed up searching by status.

    An index on `Rating` will also make it faster to "get the top books, worst books, etc.".

  • Matt Lacey (11/20/2007)


    An index on `Rating` will also make it faster to "get the top books, worst books, etc.".

    In this case, most probably not (see e.g. here, but there are many more pages with this info out there).

    The reason is the following: An index is stored sorted according to its defining fields. So if you need all records for a specific value in the index, you must grab a more or less random page on your table cluster for each primary key found in the index (with small records, you might hit a table page a few times, but still, the random table access order remains ...).

    With a table scan, the DB engine can "stream" the pages through in some arbitrary order - reading in file order is a good idea here.

Therefore, if the percentage of pages that will be found is larger than some threshold, a table scan is more efficient. The rule of thumb we used on some projects (although not with SQL Server) was 1/7 - i.e., if a query hits more than about 15% of a table, then using an index is useless.


    One more issue on Steve's project: What I need (or at least like) in database design, are some numbers - numbers of records, sizes of properties, numbers (or, better, distributions) of associated objects; and also, which searches will (probably) be performed often, which are maybe a bit less important etc.

For this project, we have to assume that all tables are fully loaded - i.e. have at least some 10,000s of records. Steve alone would probably be hard pressed to write that many reviews (or even read that many books), so one could argue that indices (for higher performance), type fine-tuning (to save storage) and the like are wasted on this small project - but let's assume that this is a "full-fledged" one, ok? ... which then should imply that we also argue about use-cases this way, shouldn't we? - and have arbitrarily more of them than Steve assumed (e.g. list all authors of all books with a rating better than 3 ...).



  • Let's not forget that too many indexes, especially on a small table, will probably cause more performance problems than they resolve.

Best case is to do one clustered and maybe one non-clustered index (based on assumed usage). Then, when the site is up and running, run a tuning trace (Profiler) against it during business hours to see if the design is living up to functionality or if it needs to be changed. Only in the case of the indices, though.

But I agree that there isn't enough information on how the information is to be used. And there's also the question of scalability.

If the site would ever be expanded, what would be the deciding factors? This will help us decide about table structure & datatypes. Right now, we're reasoning in a vacuum, which has more than once gotten a database designer in hot water when the design fails to be scalable for future endeavors.

Brandie Tarvin, MCITP Database Administrator

  • At this point, it doesn't make a whole lot of sense to me to be set on data types, or indexes. You don't know for sure how it is going to be searched, and you really don't know for sure how things will be stored. At this point, you are looking at design. Here are my ideas:

Use a surrogate key for the primary key of the books table. It will give you a layer of abstraction in case the ISBN of a book ever changes (I have no idea if it will or not), and you will save space if you are storing a huge number of books in your db, by storing a number in other tables rather than a long string. Strings are harder to compare for joins anyway.
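A sketch of the surrogate-key approach (the names are assumptions):

```sql
CREATE TABLE Book (
    BookID int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key for joins
    ISBN   char(10) NOT NULL UNIQUE,                -- still enforced, but free to change
    Title  nvarchar(200) NOT NULL
);
-- Other tables reference the small integer BookID,
-- not the 10-character ISBN string.
```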

A book could presumably have many different authors, so you'll want a new table for that, with an identity key for the author. That author will likely have written more than one book.

I have no idea what ImageURL is for - the book cover? You may want to store more than one picture for the book, so you'll want another table for that, along with an xref table to join books and the images.

    Can a book have more than one owner?

You'll want a review table to store the reviews for the book.... I would guess a book would have many more than one review. I would store those separately, along with the ReviewDate.

    That is all I have for the time being...only knowing what I currently know about the project. As more details are unveiled, the design will likely change, indexes will change, etc.

    Hope this helps,

    Tim Chapman, MCITP

Viewing 15 posts - 1 through 15 (of 96 total)
