Huge headache to work with big data

  • With Luis' help, I just imported a huge data file into my local server/db on my laptop as well as a remote server (by the way, the built-in import/export seems a little faster than bcp; both finished in less than 30 minutes, which is amazing) and started an unexpected journey. Here is the first stop:

    The table has 40,110,117 rows; that's right, 40 million.

    What I want to do first is some pre-analysis, generating some staging tables for later use.

    The table has only 8 columns; please see the screenshot.

    There is a primary key ID set for the table.

    I want to get average(host_score) grouped by the other columns: network, OS, IP, Vuln_name

    This query might work eventually, but I cancelled it after 10 minutes of waiting.

    SELECT AVG(host_score) AS AvgScore, Network_Group_Name, OS, Vuln_Name, recordsWritten
    FROM [IP360_Test].[dbo].z_SIVMS_currentVulns
    GROUP BY Network_Group_Name, OS, Vuln_Name, recordsWritten

    As you can see, those columns are all varchar of 128+ characters, except for IP at 15 characters.

    I hope someone who has worked with data of this size can share their experience here; I basically cannot do anything with it right now.

  • I can't see the screenshot. A screenshot is pretty useless anyway. Actual text would be much better.

    But, in short, here's the approach I recommend:

    1) Encode the large varchar columns to int values using a text_values conversion table.

    2) Create new columns to contain the encoded values.

    3) Cluster the table in the proper sequence based on your usage / needs. This is the critical factor.

    4) Do the GROUP BY on the int values, with a final outer query to look up the actual (original) text values, as sketched below.
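
    A rough T-SQL sketch of that approach, shown for Network_Group_Name only and assuming the same encoding has been done for OS and Vuln_Name (the dbo.text_values table, the varchar(300) sizing, the *_ID column names, and the index name are illustrative assumptions, not the OP's actual schema):

    -- 1) Conversion table: one int per distinct text value.
    CREATE TABLE dbo.text_values (
        text_id    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        text_value varchar(300)      NOT NULL UNIQUE
    );

    INSERT INTO dbo.text_values (text_value)
    SELECT DISTINCT v.Network_Group_Name
    FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns AS v
    WHERE  v.Network_Group_Name IS NOT NULL
      AND  NOT EXISTS (SELECT 1 FROM dbo.text_values AS t
                       WHERE t.text_value = v.Network_Group_Name);
    -- Repeat (or adapt) for OS and Vuln_Name.

    -- 2) New column to hold the encoded value.
    ALTER TABLE [IP360_Test].[dbo].z_SIVMS_currentVulns
        ADD Network_Group_ID int NULL;

    UPDATE v
    SET    v.Network_Group_ID = t.text_id
    FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns AS v
    JOIN   dbo.text_values AS t
           ON t.text_value = v.Network_Group_Name;

    -- 3) Re-cluster to match the grouping (the existing clustered PK would have to move first).
    -- CREATE CLUSTERED INDEX CIX_currentVulns
    --     ON [IP360_Test].[dbo].z_SIVMS_currentVulns (Network_Group_ID, OS_ID, Vuln_ID);

    -- 4) GROUP BY the ints, then look the text back up in an outer query.
    SELECT t.text_value AS Network_Group_Name,
           g.OS_ID, g.Vuln_ID, g.AvgScore
    FROM  (SELECT Network_Group_ID, OS_ID, Vuln_ID,
                  AVG(host_score) AS AvgScore
           FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns
           GROUP BY Network_Group_ID, OS_ID, Vuln_ID) AS g
    JOIN   dbo.text_values AS t
           ON t.text_id = g.Network_Group_ID;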

    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • halifaxdal (9/3/2015)



    Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/3/2015)

    Since "RecordsWritten" is likely different for each instance of any single Network_Group_Name, you'll probably end up with a 40 million row return that simply duplicates the individual values of Host_Score and that's the reason it's taking to long. If you were to SUM or AVG "RecordsWritten", things might be a bit more snappy. Of course, I can't actually tell because I don't know enough about the data itself.

    Thanks Jeff, good catch! Still hanging after removing this field.

  • ScottPletcher (9/3/2015)



    Thanks Scott, your suggestions definitely deserve a try later, but before I go ahead I do have a concern about the performance of joining a couple of tables in the query.

  • halifaxdal (9/5/2015)



    Thanks Jeff, good catch! Still hanging after removing this field.

    Removed from where? Did you also remove it from the GROUP BY?
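
    That is, for the group count to drop, the column has to come out of both clauses, roughly:

    -- recordsWritten removed from both the SELECT list and the GROUP BY
    SELECT AVG(host_score) AS AvgScore, Network_Group_Name, OS, Vuln_Name
    FROM   [IP360_Test].[dbo].z_SIVMS_currentVulns
    GROUP BY Network_Group_Name, OS, Vuln_Name;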

    --Jeff Moden



