Our Very Own Data Compression
One of the more interesting things that I've been looking to do for some time is write more about database design and why we've made some of the decisions we've made over the years here at SQLServerCentral.com. I'm not as theoretical or experienced as Dr. Codd, Dr. Date, or even Mr. Celko, but I think I've come up with some effective designs that have solved problems and worked well in past jobs. So with that in mind, I'd like to delve into an interesting problem we had.
When you run a web site such as this one, one of the things that you run into is how to deliver emails in response to a variety of actions. In our case, we went through the early progression of sending email from the web page using some sort of scripting, quickly realizing what most people do that this isn't a very scalable or maintainable solution.
So we created a table to store the emails and used inserts to put messages in the table and a scheduled task to send them on a periodic basis. In our case, we used a period of one minute, which worked well for quite some time. Our table was very simple with the following schema:
create table email ( messageid int identity(1,1) , datesent datetime , messagefrom varchar(250) , messageto varchar(250) , subject varchar(250) , datecomplete datetime , message text )
In this case most of the values are in the table are probably self explanatory. The messagefrom and to are the addresses to use when sending the message, and the subject and message contain the information. The datesent indicates when the message is inserted into the table and once the sending process retrieves the row and sends the message, it updates the datecomplete column with the current time. This gives us some ability to determine the latency for sending messages and we can also check for any rows with an extraordinary latency and notify an administrator (using a separate email system of course) that there is a problem.
For about a year this system worked extraordinarily well. We occasionally had messages get stuck because someone decided to register their email as something like "email@example.com;firstname.lastname@example.org", despite the instructions on the page to use one and only one valid email, but some quick notifications allowed us to quickly clean those up.
This design worked very well for us and it wasn't until we needed to increase the volume of emails that we had to rework the application to keep up with the load. The design here had worked very well for our needs and would probably work for most of you until you approached a volume of emails in the 100,000 a day range.
Some time last year we started doing our once a month list rental, sending out a volume of emails on behalf of an advertiser that paid for the privilege of reaching our audience. That decision is another story, but we soon learned there was a flaw in our database design. Take a moment and see if you can imagine what might be wrong with the schema above.
If you haven't guessed it, let me give you one more piece of information: each of our vendor messages is around 80kb. If you think about it, you'll realize that sending 100,000 messages (we were actually sending about 130k initially), that means that you need to insert 100,000 rows in this table. That's not a big deal for SQL Server, as it can easily handle a table with millions of rows. But think about the structure of a row. It's roughly 133 bytes + the message size. So in this case, that's roughly 80kb for each message. I've listed the sizes below to make it easy:
That means for this load of messages, I'm adding roughly 80kb * 100,000 or 8,000,000kb. While SQL Server doesn't have problems with that amount of data, it caused some serious log and data fluctuations for my systems. More logs, more backup space required, etc. added up to some nice space fluctuations, especially if you occasionally mess things up, like I've been prone to do. Insert rows, delete them, insert them again and you've got a very large log jump. For our database, it was a significant growth from what we've typically grown used to.
So we decided to try and save some space. The easy way to do this was to change our schema to that shown below. I've included the definitions, not the scripts we ran to make the changes.
create table email ( messageid int identity(1,1) , datesent datetime , messagefrom varchar(250) , messageto varchar(250) , subject varchar(250) , datecomplete datetime , templateid int , message text ) go create table templates ( templateid int identity(1,1) , templatedesc varchar(250) , templatedata text )
I'm sure most of you would come up with this as a quick and easy solution to saving space. We insert one copy of each message in the templates table and then link it to every message that we need to send. This was a good solution, but in our case it actually had a problem. We did not want to insert every message into the template table because there were a large number of one time messages (registration confirmations, forum notifications, etc) that we did not want to put in the templates table. We also did not want to alter all our existing code to put the messages there. So we wanted to be sure our design allowed messages in either the email table or the templates table.
We also wanted to limit the amount of processing we had to do in the stored procedure to retrieve the messages. We typically pulled back 1000 messages at a time for sending and wanted to be sure that we could easily pull back regular messages as well as those stored in the templates.
I wish that I could say that I came up with the final solution, but credit actually goes to Andy. While trying to test this design and still have a single query to retrieve messages, he came up with a query that had this line in it:
... isnull( templates.templatedata, email.Message) as Message, ... from email left outer join templates on email.templateid = templates.templateid ...
The query uses a left outer join on the tables to pull back both message columns. If no template is being used, the left outer join returns a NULL for the templates.templatedata column and the message column value is used. One easy query meets our needs and validates our design's strengths.
The names in the design listed above have been changed to protect the security of our system. And actually we've evolved this design further to handles some other needs we have, but this was a solid design for some time that allowed us to send 100s of thousands of messages a day for many months. We've added a few more columns to allow the application to use a few more bells and whistles rather than because we needed schema changes.
I hope that I've given you a quick idea of how your design could easily be changed to meet new needs and can help you more efficiently use your resources. The amount of data that has to be sent to the client is the same, but we have many less reads to disk and much less data that must be stored, backed up, and managed by simply building a design that takes the data size into consideration. I welcome your comments in the discussion forums and am looking to write a few more of these types of articles that look at other design goals and solutions.