September 3, 2015 at 2:06 pm
With Luis' help, I just imported a huge data file into my local server/db on my laptop as well as into a remote server (btw: the built-in import/export seems a little faster than bcp; both finished in less than 30 minutes, which is amazing) and started an unexpected journey. Here is the first stop:
The table has 40,110,117 rows. That's right, 40 million.
I'm thinking I should do some pre-analysis first and generate some staging tables for later use.
The table has only 8 columns; please see the screenshot.
There is a primary key ID set for the table.
I want to get average(host_score) grouped by the other columns: network, OS, IP, Vuln_name
This query might work eventually, but I cancelled it after waiting 10 minutes.
SELECT AVG(host_score) AS AvgScore, Network_Group_Name, OS, Vuln_Name, recordsWritten
FROM [IP360_Test].[dbo].z_SIVMS_currentVulns
GROUP BY Network_Group_Name, OS, Vuln_Name, recordsWritten
As you can see, those columns are all varchar of 128+ characters, except for IP at 15 characters.
I hope someone who has worked with data of this size can share their experience here; I basically cannot do anything with it right now.
September 3, 2015 at 2:29 pm
I can't see the screenshot. A screenshot is pretty useless anyway. Actual text would be much better.
But, in short, here's the approach I recommend (a sketch follows the list):
1) Encode the large varchar columns to int values using a text_values conversion table.
2) Create new columns to contain the encoded values.
3) Cluster the table in the proper sequence based on your usage/needs. This is the critical factor.
4) Do the GROUP BY on the int values, with a final outer query to look up the actual (original) text values.
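A minimal sketch of steps 1-3, assuming hypothetical names throughout (text_values, text_id, OS_id, and the other *_id columns are all made up); column sizes and the real schema would need adjusting:

-- 1) Conversion table: one row per distinct text value.
CREATE TABLE dbo.text_values (
    text_id    int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    text_value varchar(300) NOT NULL UNIQUE
);

-- Load distinct values (repeat for each large varchar column,
-- skipping values already present to avoid UNIQUE violations).
INSERT INTO dbo.text_values (text_value)
SELECT DISTINCT v.OS
FROM dbo.z_SIVMS_currentVulns AS v
WHERE NOT EXISTS (SELECT 1 FROM dbo.text_values AS tv
                  WHERE tv.text_value = v.OS);

-- 2) New int column, populated by joining back on the text.
ALTER TABLE dbo.z_SIVMS_currentVulns ADD OS_id int NULL;

UPDATE v
SET    v.OS_id = tv.text_id
FROM   dbo.z_SIVMS_currentVulns AS v
JOIN   dbo.text_values AS tv
  ON   tv.text_value = v.OS;

-- 3) Cluster to match the grouping order. (If the PK on ID is
-- currently the clustered index, it would need to be rebuilt as
-- nonclustered first.)
CREATE CLUSTERED INDEX IX_currentVulns_grouping
    ON dbo.z_SIVMS_currentVulns (Network_Group_Name_id, OS_id, Vuln_Name_id);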
SQL DBA, SQL Server MVP (07, 08, 09)
A socialist is someone who will give you the shirt off *someone else's* back.
September 3, 2015 at 4:47 pm
halifaxdal (9/3/2015)
With Luis' help, I just imported a huge data file into my local server/db on my laptop as well as into a remote server (btw: the built-in import/export seems a little faster than bcp; both finished in less than 30 minutes, which is amazing) and started an unexpected journey. Here is the first stop: The table has 40,110,117 rows. That's right, 40 million.
I'm thinking I should do some pre-analysis first and generate some staging tables for later use.
The table has only 8 columns; please see the screenshot.
There is a primary key ID set for the table.
I want to get average(host_score) grouped by the other columns: network, OS, IP, Vuln_name
This query might work eventually, but I cancelled it after waiting 10 minutes.
SELECT AVG(host_score) AS AvgScore, Network_Group_Name, OS, Vuln_Name, recordsWritten
FROM [IP360_Test].[dbo].z_SIVMS_currentVulns
GROUP BY Network_Group_Name, OS, Vuln_Name, recordsWritten
As you can see, those columns are all varchar of 128+ characters, except for IP at 15 characters.
I hope someone who has worked with data of this size can share their experience here; I basically cannot do anything with it right now.
Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 5, 2015 at 5:46 am
Jeff Moden (9/3/2015)
Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.
Thanks Jeff, good catch! Still hanging after removing this field.
September 5, 2015 at 5:47 am
ScottPletcher (9/3/2015)
I can't see the screenshot. A screenshot is pretty useless anyway. Actual text would be much better. But, in short, here's the approach I recommend:
1) Encode the large varchar columns to int values using a text_values conversion table.
2) Create new columns to contain the encoded values.
3) Cluster the table in the proper sequence based on your usage/needs. This is the critical factor.
4) Do the GROUP BY on the int values, with a final outer query to look up the actual (original) text values.
Thanks Scott, your suggestions definitely deserve a try later, but before I go ahead I do have a concern about performance if the query ends up joining a couple of tables.
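For concreteness, the step-4 query under discussion might look like this (reusing the hypothetical text_values and *_id names from the earlier sketch); note that the lookup joins run against the already-grouped result, not the 40 million base rows:

SELECT n.text_value  AS Network_Group_Name,
       o.text_value  AS OS,
       vn.text_value AS Vuln_Name,
       g.AvgScore
FROM (
    -- The heavy work: grouping 40M rows on narrow int keys.
    SELECT Network_Group_Name_id, OS_id, Vuln_Name_id,
           AVG(host_score) AS AvgScore
    FROM   dbo.z_SIVMS_currentVulns
    GROUP  BY Network_Group_Name_id, OS_id, Vuln_Name_id
) AS g
JOIN dbo.text_values AS n  ON n.text_id  = g.Network_Group_Name_id
JOIN dbo.text_values AS o  ON o.text_id  = g.OS_id
JOIN dbo.text_values AS vn ON vn.text_id = g.Vuln_Name_id;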
September 5, 2015 at 5:00 pm
halifaxdal (9/5/2015)
Jeff Moden (9/3/2015)
Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.
Thanks Jeff, good catch! Still hanging after removing this field.
Removed from where? Did you also remove it from the GROUP BY?
--Jeff Moden
Change is inevitable... Change for the better is not.