HELP!!! Getting data from two huge tables in a huge DB

  • Hello folks,

    I have a 19 GB database (SQL Server 2000) with only two tables in it, and we need to do some data mining.

    To run an experiment, I extracted one day's data from the original database; it's 600 MB, with 1,300,000 records in one table and 130,000 in the other.

    We want to know whether there are any matches (and if so, how many) between two fields in the two tables. So basically we need to do an inner join like:

    select t1.*, t2.* from table1 t1 inner join table2 t2 on t2.field1=t1.field1

    However, the two tables are so big that, at least on my home PC, the query never finishes.

    What should I do in a case like this? (Both tables have primary keys and a clustered index.) What's the best way to dig data out of the original DB, which is much bigger than my experiment DB, when I can't even finish the query on the smaller one?

    Please help.

    Thanks.

    Happy new year!

  • The query should run fine even on a PC; 19 GB is common these days.

    Just make sure the fields in the join and WHERE clauses are covered by an index.

    We have tables that cross 100 million rows, and such joins are an everyday event without problems on SQL 2000 and 2005 alike.

  • I'd also get rid of the t1.* and t2.*. If you're looking for matches, just get the PKs in the select list and then use that list to get the individual records.
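
    For instance, a minimal sketch (the PK column names t1_id and t2_id are hypothetical, since the real ones weren't posted):

    select t1.t1_id, t2.t2_id
    from table1 t1
      inner join table2 t2
        on t2.field1 = t1.field1

    You can then pull the full rows for only the keys you actually need.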

  • I will agree with Steve - why do you need all columns from both tables? As Salim mentioned, 19 GB of data is nothing these days...

    Check the estimated query plan.

    What are you trying to do with your query?

    Post your sample code so we can see what you're doing.

    MohammedU
    Microsoft SQL Server MVP

  • You say you want to know how many matches there are - this sounds like you don't actually need the matches themselves.  Rephrased, if you have 100,000 matches then you just want a single result of 100,000 rather than 100,000 rows returned in a query.  Is this right?

    If so, then something along the lines of

    select count(*)
    from t1
      inner join t2
        on t1.myField = t2.myField

    should do the trick.  Make sure you have an index on the myField column in each table - it should be the first (leading) field of the index.  Since you're not returning other columns, the index need not be the clustered index (although your tables should have a clustered index) and you needn't include any other fields in it.
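
    For example, a minimal sketch (using the made-up table and column names from my query above):

    create index IX_t1_myField on t1 (myField)
    create index IX_t2_myField on t2 (myField)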

    With your current query (I'm guessing), if you are returning each matching row then this will take time.  It will take even longer, and include what's known as a bookmark lookup, if the myField field from each table (using the made-up names from my sample query) is not the leading field of your two clustered indexes.

    Think of it like a street directory - you have an index of street names in the front.  If your goal was to find all of the schools near a certain street, the index makes you look up the street, find the map page and then scan around that page for the required information.  The index in this case helped you locate the data but it wasn't the data itself.  The double lookup required is similar to what SQL Server would be doing for you - it would find the matching myField rows but then need to get the extra data associated with those rows.

    If you had an index of street names and nearby schools, it would take up a lot more space and only help a certain number of queries, but it would make that query VERY fast - on the flip side, apart from the extra space, you need to keep the index up to date.  For more info, browse SQL Server Books Online for "clustered index".
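
    In index terms, that street-names-plus-schools index is a covering index.  On SQL 2000 (which has no INCLUDE clause - that arrived in 2005) you build one by adding the extra columns as additional key columns - a sketch, again with made-up names:

    create index IX_t1_covering on t1 (myField, extraField1, extraField2)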

    Of course, I may have misunderstood, so could you please, as others have asked, provide some sample code & results (even if made up) so we can see exactly what you want.  Let us know how you get on.  Cheers

    You mentioned doing data mining, so I will assume that you actually DO need all the data from the day's join query so you can run it through some mining algorithm or analyze it with another tool.  What I bet is happening is that you are running Query Analyzer's memory usage up so high that you are starving your PC for RAM, causing massive paging.  To address this issue, put your query in a script file and run it with osql, redirecting the output to a disk file.  It will still be slow (unless you direct the disk file to an external hard drive separate from the one where the SQL Server data file lives).
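
    Something like this in the batch file (a minimal sketch - the server, database and file names are hypothetical, and -E assumes a trusted/Windows connection):

    osql -E -S MYSERVER -d MyDatabase -i join_query.sql -o E:\output\matches.txt

    where join_query.sql contains the big join query.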

    For additional performance gains, consider indexing the join fields (as was pointed out previously).  Also consider partitioning the data based on date (perhaps a month, week or even day per partition).  With partitions, especially ones that, on a server, can be placed on separate spindles, you can see a huge performance increase when extracting data.
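
    Note that SQL 2000 has no native table partitioning (that arrived in 2005), so there "partitioning" usually means a partitioned view - a sketch with hypothetical names, one member table per day:

    create table events_20070101 (
        event_date datetime not null check (event_date >= '20070101' and event_date < '20070102'),
        event_id int not null,
        constraint PK_events_20070101 primary key (event_date, event_id)
    )
    -- ...one table per day, then stitch them together:
    create view events_all as
    select * from events_20070101
    union all
    select * from events_20070102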

    I would also check for data fragmentation within SQL Server as well as disk file fragmentation.  It never ceases to amaze me how many of my clients just install SQL Server, create a database and stuff it with GBs and GBs of data without ever changing the default disk file size and growth increments!!!  I got one client an 18 PERCENT throughput improvement simply by defragging their server hard drives.  They had almost a half-MILLION file fragments!
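
    On SQL 2000 the quick check for index fragmentation is DBCC SHOWCONTIG (the table and index names here are hypothetical):

    dbcc showcontig ('table1')
    -- if scan density comes back low, you can defragment without taking the table offline:
    -- dbcc indexdefrag (MyDatabase, 'table1', 'IX_table1_field1')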

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Happy New Year!

    Thank you all for the responses.

    Yes, I will try Kevin's suggestion to run the query with osql and redirect the output to an external hard drive separate from the one where the SQL Server data file lives.

    I am also considering partitioning the database. As for disk fragmentation, wouldn't defragmenting be a risk on a server?

    Thanks.

    Proper sizing of disk files for databases is something that is optimally done by the DBA, not by SQL Server hitting a full file and then auto-growing it based on db settings.  Figure out how much data you have, what your growth rate is, and how much future growth you want free space available for, then make the file (or files, if you use multiple-file filegroups or partitioning) that big.  Keep a watch on your growth, then grow the files yourself (perhaps using an automated script) during low-activity periods.  Large file growths (pre-SQL 2005) can crush a system's performance if they occur during high I/O loads.
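
    A sketch of growing a file yourself (the database and logical file names are hypothetical):

    alter database MyDB
    modify file (name = MyDB_Data, size = 25600MB)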

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    At the moment the database grows by roughly 1 GB, or less, per day. I can see that in the future all the data will need to be archived so it can be investigated further if necessary. Space is not my concern.

    Because the tables were not created properly from the beginning, no index or primary key has been set up. Is there an efficient way to add them now?

    Thanks lots.

    Well, if you build a clustered index it is gonna hurt!  The entire table may well be rewritten out to the transaction log as the data is sorted and laid down in physical order.  Non-clustered indexes will only have the leaf levels logged, which should not be too bad, assuming you don't put a bunch of columns in them.  I would recommend doing this off-hours too.  I would make the transaction log file a pretty large percentage (25-50 percent at least) of the main db file prior to creating any indexes (especially a clustered one).  Set a pretty large growth increment too in case you guess wrong.  You can drop the size back down a bit after you are done.  But since you have only two tables (IIRC), any maintenance you do on the indexes could use a lot of log space too, so don't make it too small.

    Right now you have a heap table (no clustered index).  If you have done lots of updates to the data it could very well be fragmented and contain lots of forwarding pointers (VERY inefficient).  If that is the case you may want to do a clustered index to remove these problems.  If you haven't done updates and don't need to do sorts/order bys/group bys I wouldn't bother with a clustered index.
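
    If you do decide to build one, a sketch (the table and column names are hypothetical - a primary key can create the clustered index for you):

    alter table table1
      add constraint PK_table1 primary key clustered (field1)

    -- or, keeping the primary key and the clustered index separate:
    create clustered index CIX_table1 on table1 (field1)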


    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
