Huge headache to work with big data

  • With Luis' help, I just imported a huge data file into my local server/db on my laptop as well as a remote server (by the way, the built-in import/export seems a little faster than bcp; both finished in less than 30 minutes, which is amazing) and started an unexpected journey. Here is the first stop:

    The table has 40,110,117 rows; that's right, 40 million.

    What I want to do first is some pre-analysis, generating some staging tables for later use.

    The table has only 8 columns; please see the screenshot.

    There is a primary key ID set for the table.

    I want to get average(host_score) grouped by the other columns: network, OS, IP, Vuln_name

    This query might work eventually, but I cancelled it after 10 minutes of waiting.

    SELECT AVG(host_score) AS AvgScore, Network_Group_Name, OS, Vuln_Name, recordsWritten
    FROM [IP360_Test].[dbo].z_SIVMS_currentVulns
    GROUP BY Network_Group_Name, OS, Vuln_Name, recordsWritten

    As you can see, those columns are all varchar of 128+ characters, except for IP at 15 characters.

    I hope someone who has worked with data of this size can share their experience here; I basically cannot do anything with it right now.

  • I can't see the screenshot. A screenshot is pretty useless anyway. Actual text would be much better.

    But, in short, here's the approach I recommend:

    1) Encode the large varchar columns to int values using a text_values conversion table.

    2) Create new columns to contain the encoded values.

    3) Cluster the table in the proper sequence based on your usage / needs. This is the critical factor.

    4) Do the GROUP BY on the int values, with a final outer query to look up the actual (original) text values, as sketched below.
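
    A rough T-SQL sketch of that approach, shown for Network_Group_Name only and assuming the same encoding has been done for OS and Vuln_Name (the dbo.text_values table, the varchar(300) sizing, the *_ID column names, and the index name are illustrative assumptions, not the OP's actual schema):

    -- 1) Conversion table: one int per distinct text value.
    CREATE TABLE dbo.text_values (
        text_id    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        text_value varchar(300)      NOT NULL UNIQUE
    );

    INSERT INTO dbo.text_values (text_value)
    SELECT DISTINCT v.Network_Group_Name
    FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns AS v
    WHERE  v.Network_Group_Name IS NOT NULL
      AND  NOT EXISTS (SELECT 1 FROM dbo.text_values AS t
                       WHERE t.text_value = v.Network_Group_Name);
    -- Repeat (or adapt) for OS and Vuln_Name.

    -- 2) New column to hold the encoded value.
    ALTER TABLE [IP360_Test].[dbo].z_SIVMS_currentVulns
        ADD Network_Group_ID int NULL;

    UPDATE v
    SET    v.Network_Group_ID = t.text_id
    FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns AS v
    JOIN   dbo.text_values AS t
           ON t.text_value = v.Network_Group_Name;

    -- 3) Re-cluster to match the grouping (the existing clustered PK would have to move first).
    -- CREATE CLUSTERED INDEX CIX_currentVulns
    --     ON [IP360_Test].[dbo].z_SIVMS_currentVulns (Network_Group_ID, OS_ID, Vuln_ID);

    -- 4) GROUP BY the ints, then look the text back up in an outer query.
    SELECT t.text_value AS Network_Group_Name,
           g.OS_ID, g.Vuln_ID, g.AvgScore
    FROM  (SELECT Network_Group_ID, OS_ID, Vuln_ID,
                  AVG(host_score) AS AvgScore
           FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns
           GROUP BY Network_Group_ID, OS_ID, Vuln_ID) AS g
    JOIN   dbo.text_values AS t
           ON t.text_id = g.Network_Group_ID;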

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • halifaxdal (9/3/2015)



    Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/3/2015)

    Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.

    Thanks Jeff, good catch! Still hanging after removing this field.

  • ScottPletcher (9/3/2015)



    Thanks Scott, your suggestions definitely deserve a try later, but before I go ahead I do have a concern about the performance of joining a couple of tables in the query.

  • halifaxdal (9/5/2015)



    Thanks Jeff, good catch! Still hanging after removing this field.

    Removed from where? Did you also remove it from the GROUP BY?
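
    That is, for the group count to drop, the column has to come out of both clauses, roughly:

    -- recordsWritten removed from both the SELECT list and the GROUP BY
    SELECT AVG(host_score) AS AvgScore, Network_Group_Name, OS, Vuln_Name
    FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns
    GROUP BY Network_Group_Name, OS, Vuln_Name;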

    --Jeff Moden



