Metadata Structure / Design Issues
Posted Tuesday, October 30, 2007 6:22 PM
Grasshopper

Hi People,

I have been developing an application that needs to store 30 fields of metadata against each item in the system. End users are able to customise the field names on a customer-by-customer basis. I've tried a couple of mock designs and haven't been able to get decent performance out of either. I would be really interested to see what others have come up with, and any suggestions for a design that might work.

I have five key tables relating to the metadata design. The Book table contains 2.1 million rows and is the primary table of the application; each book can have metadata stored against it (though not every single book does):


BookStatus
|
Book -> BookType -> BookTypeGroup
|
Customer


Most of the time the user will only be allowed to search against one metadata column, with a maximum of 5 metadata columns displayed in the results.
Unfortunately I've had little feedback in terms of how much data is expected to go into the metadata table. It may be that users don't store metadata against every item,
but I am assuming the worst-case scenario: that every item will have metadata against it.


The two solutions I have tried so far:

First option: I created a table with 30 varchar(100) columns named MetaColumn1, MetaColumn2, MetaColumn3 ... MetaColumn28, MetaColumn29, MetaColumn30 (the names used in the query below).

I then bulk loaded all 30 fields with random-length data to simulate every book having metadata.
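
As a rough sketch, the table looks something like this (the key and constraint names here are illustrative, not the real ones):

CREATE TABLE [dbo].[BookMetadata] (
    [BookId]       INT NOT NULL,
    [MetaColumn1]  VARCHAR(100) NULL,
    [MetaColumn2]  VARCHAR(100) NULL,
    -- ... MetaColumn3 through MetaColumn29 ...
    [MetaColumn30] VARCHAR(100) NULL,
    CONSTRAINT [PK_BookMetadata] PRIMARY KEY CLUSTERED ([BookId]),
    CONSTRAINT [FK_BookMetadata_Book] FOREIGN KEY ([BookId])
        REFERENCES [dbo].[Book] ([Id])
);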

The query would be:

SELECT TOP 25
[Dev].[dbo].[Book].[Id],
[Dev].[dbo].[Book].[Code],
[Dev].[dbo].[Book].[AltCode],
[Dev].[dbo].[BookType].[Code] AS [Type],
[Dev].[dbo].[BookMetadata].[MetaColumn1],
[Dev].[dbo].[BookMetadata].[MetaColumn2],
[Dev].[dbo].[BookMetadata].[MetaColumn3],
[Dev].[dbo].[BookMetadata].[MetaColumn4],
[Dev].[dbo].[BookStatus].[Status]

FROM [Dev].[dbo].[BookMetadata]

INNER JOIN [Dev].[dbo].[Book] ON [Dev].[dbo].[BookMetadata].[BookId] = [Dev].[dbo].[Book].[Id] AND [Dev].[dbo].[Book].[CustomerId] = 512
INNER JOIN [Dev].[dbo].[BookStatus] ON [Dev].[dbo].[BookStatus].[Id] = [Dev].[dbo].[Book].[BookStatusId]
INNER JOIN [Dev].[dbo].[BookType] ON [Dev].[dbo].[Book].[BookTypeId] = [Dev].[dbo].[BookType].[Id] AND [Dev].[dbo].[BookType].[BookGroupId] = 1

WHERE [Dev].[dbo].[BookMetadata].[MetaColumn28] LIKE '%g%'

This performs relatively slowly in most cases (0.50 - 1.02 seconds), but better than my second option ;)


The second option I tried was a table with a single metadata column, storing each metadata value in its own row:

Id | MetaData | ColumnOrder

SELECT TOP 25
[Dev].[dbo].[Book].[Id],
[Dev].[dbo].[Book].[Code],
[Dev].[dbo].[Book].[AltCode],
[Dev].[dbo].[BookType].[Code] AS [Type],
[Dev].[dbo].[BookMetadataView].[MetaColumn1],
[Dev].[dbo].[BookMetadataView].[MetaColumn2],
[Dev].[dbo].[BookMetadataView].[MetaColumn3],
[Dev].[dbo].[BookMetadataView].[MetaColumn4],
[Dev].[dbo].[BookStatus].[Status]

FROM [Dev].[dbo].[BookMetadataView]

INNER JOIN [Dev].[dbo].[Book] ON [Dev].[dbo].[BookMetadataView].[BookId] = [Dev].[dbo].[Book].[Id] AND [Dev].[dbo].[Book].[CustomerId] = 512
INNER JOIN [Dev].[dbo].[BookStatus] ON [Dev].[dbo].[BookStatus].[Id] = [Dev].[dbo].[Book].[BookStatusId]
INNER JOIN [Dev].[dbo].[BookType] ON [Dev].[dbo].[Book].[BookTypeId] = [Dev].[dbo].[BookType].[Id] AND [Dev].[dbo].[BookType].[BookGroupId] = 1

WHERE [Dev].[dbo].[BookMetadataView].[MetaColumn28] LIKE '%g%'

I then created a view to pivot the metadata into columns so it could be joined on. This worked well until a lot of metadata was imported;
then, as you can imagine, it failed badly: 2.1 million books x 30 rows each is over 60 million rows.

This frequently took up to 8 minutes.
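
The view was along these lines (a sketch only - I'm assuming the single-column table is [dbo].[BookMeta]([BookId], [MetaData], [ColumnOrder]); the real names may differ):

CREATE VIEW [dbo].[BookMetadataView]
AS
SELECT  [BookId],
        MAX(CASE WHEN [ColumnOrder] = 1  THEN [MetaData] END) AS [MetaColumn1],
        MAX(CASE WHEN [ColumnOrder] = 2  THEN [MetaData] END) AS [MetaColumn2],
        MAX(CASE WHEN [ColumnOrder] = 3  THEN [MetaData] END) AS [MetaColumn3],
        MAX(CASE WHEN [ColumnOrder] = 4  THEN [MetaData] END) AS [MetaColumn4],
        -- ... one branch per display position, up to 30 ...
        MAX(CASE WHEN [ColumnOrder] = 28 THEN [MetaData] END) AS [MetaColumn28]
FROM    [dbo].[BookMeta]
GROUP BY [BookId];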

I have analyzed both solutions in Query Analyzer and created the relevant indexes.

So I'm a tad stuck. Anyone have any ideas, or faced something similar? Or have I done something stupid? :)

If you need more info I'm happy to provide it.

Thanks!
Post #416843
Posted Tuesday, October 30, 2007 8:25 PM
Hall of Fame

Have you considered using full-text catalogs for your metadata columns?
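
Something along these lines, as a sketch (assuming the wide BookMetadata table from the first post, and that it has a unique key index named PK_BookMetadata):

CREATE FULLTEXT CATALOG [BookMetaCatalog];

CREATE FULLTEXT INDEX ON [dbo].[BookMetadata]
    ([MetaColumn1], [MetaColumn2], [MetaColumn28])  -- list every searchable column
    KEY INDEX [PK_BookMetadata]
    ON [BookMetaCatalog];

-- Then search with CONTAINS rather than LIKE '%...%'.
-- Note: full text matches whole words or prefixes, not arbitrary substrings.
SELECT [BookId]
FROM   [dbo].[BookMetadata]
WHERE  CONTAINS([MetaColumn28], '"123456*"');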


Post #416858
Posted Wednesday, October 31, 2007 6:58 AM


SSChampion

What about XML for the metadata? It really does sound like it's only semi-structured data.
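
Something like this, as a rough sketch (the element layout is just one possibility):

CREATE TABLE [dbo].[BookMetadataXml] (
    [BookId] INT NOT NULL PRIMARY KEY,
    [Meta]   XML NOT NULL   -- e.g. <meta><f id="1">value</f> ... </meta>
);

-- Find books whose field 28 contains a 'g' anywhere:
SELECT [BookId]
FROM   [dbo].[BookMetadataXml]
WHERE  [Meta].exist('/meta/f[@id=28]/text()[contains(., "g")]') = 1;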

----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of: SQL Server Query Performance Tuning
SQL Server 2012 Query Performance Tuning
SQL Server 2008 Query Performance Tuning Distilled
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #417033
Posted Wednesday, October 31, 2007 5:38 PM
Grasshopper

Thanks heaps for the suggestions!

Full-text indexing - To be honest it's not a technology I've used much before. It does seem to improve the query time significantly if the search string is over 5 characters long: e.g. LIKE '%123456%' takes nearly 8 minutes, whereas CONTAINS(MetaColumn28, '"*123456*"') takes 10 - 30 seconds, so that's a definite help. The only issue is that the ORM I support doesn't handle full text transparently.

XML - I've been using SQL Server 2000 a lot and found XML hard to work with. We are now using 2005, so that could be an option, but wouldn't putting all that data in one field and parsing it out cause a performance overhead? XML might be a better option if I wanted to extend the metadata. I will do some research into this idea.

Unfortunately I'm still getting really slow response times even when using the full-text index. Of the two options I posted, which do you think is the better design?
Post #417353
Posted Thursday, November 1, 2007 5:20 AM


SSChampion

Neither full text nor XML has me terribly excited. If it were my project, I'd be re-examining the assumptions and trying to find a method of defining the requirements more fully so I could achieve a more normalized design. But, shy of that, in 2005, I'd go with the XML.

Post #417469
Posted Thursday, November 1, 2007 5:58 PM
Grasshopper

Hi Grant,

Thanks for the reply. I tend to agree with you; the full-text solution works, but it's not perfect.

In regards to requirements, in simple terms they are:

1. Able to store 30 fields of user data against each book on the system.
2. Each field needs to be able to store at most 100 characters.
3. Need to be able to search against any of the 30 fields with a LIKE or CONTAINS style search.
4. Also be able to search against book information in the same search.

That is what the client has requested and what is in the specification. Unfortunately it is a generic requirement, which leads to a design that needs to be generic. The downside, in my experience, is that a generic solution often ends up slow and hard for SQL Server to optimise if all the fields are varchar.

Would a SQL script with some of the table structures help in finding a solution?

I've got the weekend coming up, so I might try the XML stuff on my laptop at home and see if I can find a solution.

Post #417703
Posted Friday, November 2, 2007 5:37 AM


SSChampion

Unfortunately, I'm going out of town for a week, but posting a bit of code could help someone else to help you.

Based on the requirements... 30 fields, each 100 characters in length... It's just so sloppy. Anyway, you definitely don't want the "One True Lookup Table" approach. What about something like this... You have one table that has the well-defined fields for the book. You also have one table that has your 100-character field. Then you have an interim table between them. Now you can index the 100-character field (without having to index 30 of them), you can index the interim table and you can index the book table, all independent of each other. This allows for more than 30 fields, but you could limit that by adding a FieldType or something to the interim table, making it part of the primary key and only allowing 30 different types (guessing a bit here). But I think this would work - see the sketch below.
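
To make that concrete, a rough sketch (every name here is a guess on my part):

-- The 100-character values, indexable on their own:
CREATE TABLE [dbo].[MetaValue] (
    [MetaValueId] INT IDENTITY NOT NULL PRIMARY KEY,
    [Value]       VARCHAR(100) NOT NULL
);
CREATE INDEX [IX_MetaValue_Value] ON [dbo].[MetaValue] ([Value]);

-- The interim table; FieldType in the key caps each book at 30 fields:
CREATE TABLE [dbo].[BookMetaValue] (
    [BookId]      INT NOT NULL REFERENCES [dbo].[Book] ([Id]),
    [FieldType]   TINYINT NOT NULL,  -- 1 through 30
    [MetaValueId] INT NOT NULL REFERENCES [dbo].[MetaValue] ([MetaValueId]),
    CONSTRAINT [PK_BookMetaValue] PRIMARY KEY ([BookId], [FieldType])
);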

If you can't do that, I'd still go with XML over Text because, while it's certainly not an optimal solution, it does provide for the structure that the data would require.

That's my 1.5 cents worth for the moment.


Post #417862
Posted Friday, November 2, 2007 5:39 AM


SSChampion

Thinking about it more, you should be able to get reasonable performance, even pivoting the data, assuming the indexes are clean, especially the clustered indexes. How did you set those up?
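
For instance, on the single-column design, something like this (assuming the [dbo].[BookMeta]([BookId], [MetaData], [ColumnOrder]) layout sketched earlier) keeps each book's 30 rows together, so the pivot reads one contiguous range per book:

CREATE CLUSTERED INDEX [CIX_BookMeta]
    ON [dbo].[BookMeta] ([BookId], [ColumnOrder]);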

Post #417864
Posted Friday, November 2, 2007 7:19 AM


SSCertifiable

Well - you're going to be running into some serious compromises. For one - "30 100-character fields" sounds like someone's being sloppy/lazy with giving you the right specifications. You really ought to push a little harder to get them to give you some ACTUAL specifications (there are thirty possible, optional pieces of data - let's define what they are, and let's type them appropriately to what they REALLY are). You will likely find out that several of those will be very precise fields, which can be typed correctly, indexed correctly, etc...

You should also push to verify whether they will really use EVERY metadata element in their optional searching.

In the meantime, Grant's solution should still provide you with some amount of gain. That being said - you NEED the full-text index on the varchar(100) field, since traditional indexes are worthless for (and not used in) most LIKE syntaxes. Anything other than a LIKE 'abcdef%' pattern will require a table scan.
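
To illustrate, against the wide BookMetadata table from the first post:

-- Sargable: trailing wildcard only, so an index on MetaColumn28 can be used for a seek.
SELECT [BookId]
FROM   [dbo].[BookMetadata]
WHERE  [MetaColumn28] LIKE 'abcdef%';

-- Not sargable: the leading wildcard forces a scan of every row.
SELECT [BookId]
FROM   [dbo].[BookMetadata]
WHERE  [MetaColumn28] LIKE '%abcdef%';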



----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Post #417907
Posted Friday, November 2, 2007 9:46 PM
Grasshopper

Hi Guys,

I really appreciate the feedback and ideas you have offered so far; obviously my SQL skills aren't as developed as yours :)

I agree on the loose requirements, but unfortunately I just can't see a way around them. Every customer will store whatever they like in each column; every customer comes from a different background and a different business type, so it's almost impossible to guess what they're going to put in and then be able to limit it. One option is to have the fields set so you have 10 varchar(100), 5 smalldatetime, 5 int and 10 money, or some similar configuration. But I bet one customer will want an extra field of one type or another.

By the way, the books were just an example; my client is fussy about information going out, and I like to keep things simple.

I tried the single-column solution; the advantage is that it only stores a value if the customer actually has something in there, and it allows for expansion later on. I had a table containing the column headers, and each metadata field belongs to one of them. I then had an order field to control the column display order, and used that to pivot the data round. This worked well, but performance suffered with 2.5 million rows x 30 fields. I didn't test the full-text index against that solution, which I plan to do tomorrow. I did a bit more digging and discovered from the client that not every item will have metadata stored against it; in fact they generally said that only 30-40% of items will. This results in a bit of a saving, but ideally I want to find the best solution possible that meets the customer's needs.

Grant, attached is a SQL diagram of what I understood you to have proposed as a solution; I wasn't 100% sure what you meant from your description. Also, ignore the contents of the item table and the customer table; they're mock-ups, and normally there are more fields. On the table with the single varchar column I would add a full-text index as suggested. I would likely use a view to transform this data, using a pivot, into a more easily used structure. The type table allows the client to define the type of a column, such as it being a date, int, etc.

Currently the Id columns are the ones in the clustered indexes.

Thanks
Alex


  Post Attachments 
example-v1.0.jpg (10 views, 80.44 KB)
Post #418197