What's the best way to get a sample set of a big table without primary key?

  • halifaxdal

    SSCoach

    Points: 19741

    mw112009 - Thursday, June 29, 2017 1:34 PM

    ScottPletcher - Thursday, June 29, 2017 9:57 AM

    Most typically in SQL Server I see this technique used:

    SELECT TOP ...
    FROM dbo.table_name
    ORDER BY NEWID()

    I presume that gives a roughly random sample.  Of course it likely does require sorting the entire record set.

    If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

    How about adding that   WITH ( NOLOCK)   next to the table name

    Thank you.

    Any query with TOP is adding more work here in my situation:

    We need to generate 1% random subset, using TOP would require calculate the TOTAL in advance, every table has a different TOTAL. Getting TOP presumably means it is sorted somehow, which means not random.

  • cyrusbaratt

    SSC Veteran

    Points: 219

    ScottPletcher - Thursday, June 29, 2017 9:57 AM

    Most typically in SQL Server I see this technique used:

    SELECT TOP ...
    FROM dbo.table_name
    ORDER BY NEWID()

    I presume that gives a roughly random sample.  Of course it likely does require sorting the entire record set.

    If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

    You may use the SQL tool : Silverlight to export data in the form of insert statement to other database tables with proper keys.
    I used it many times in the past.  The export file will include SQL insert statements.  You could provide good primary keys for the destination tables. 
    The destination table would be on a different SQL server.  You may use Microsoft excel to build your insert commands for this as well. (record count limitations)

  • halifaxdal

    SSCoach

    Points: 19741

    cyrusbaratt - Thursday, June 29, 2017 11:12 PM

    ScottPletcher - Thursday, June 29, 2017 9:57 AM

    Most typically in SQL Server I see this technique used:

    SELECT TOP ...
    FROM dbo.table_name
    ORDER BY NEWID()

    I presume that gives a roughly random sample.  Of course it likely does require sorting the entire record set.

    If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

    You may use the SQL tool : Silverlight to export data in the form of insert statement to other database tables with proper keys.
    I used it many times in the past.  The export file will include SQL insert statements.  You could provide good primary keys for the destination tables. 
    The destination table would be on a different SQL server.  You may use Microsoft excel to build your insert commands for this as well. (record count limitations)

    Thank you for your suggestion but it doesn't apply to our case.

    I received reply from Cloudera and they confirmed there is a "bug" in their product, in particular it says:
    When a nondeterministic function is used in the WHERE clause of a query like yours, Impala produces an IllegalStateException or, worse, produces incorrect results.

  • Jeff Moden

    SSC Guru

    Points: 996661

    Unless your RDBMS generates some sort of accessible unique row number internally and if random sampling is going to become the norm rather than the exception, then it's seriously worth adding a unique sequential number to the table.

    In the meantime, although your table doesn't have an integer column to use for such a thing, what is the PK for the table that uniquely identifies a row?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • halifaxdal

    SSCoach

    Points: 19741

    Jeff Moden - Friday, June 30, 2017 5:47 PM

    Unless your RDBMS generates some sort of accessible unique row number internally and if random sampling is going to become the norm rather than the exception, then it's seriously worth adding a unique sequential number to the table.

    In the meantime, although your table doesn't have an integer column to use for such a thing, what is the PK for the table that uniquely identifies a row?

    Thank you Jeff for your input, it's not practical to add a unique sequential number: many tables are in the size of billion records 🙂

  • Jeff Moden

    SSC Guru

    Points: 996661

    halifaxdal - Friday, June 30, 2017 7:59 PM

    Jeff Moden - Friday, June 30, 2017 5:47 PM

    Unless your RDBMS generates some sort of accessible unique row number internally and if random sampling is going to become the norm rather than the exception, then it's seriously worth adding a unique sequential number to the table.

    In the meantime, although your table doesn't have an integer column to use for such a thing, what is the PK for the table that uniquely identifies a row?

    Thank you Jeff for your input, it's not practical to add a unique sequential number: many tables are in the size of billion records 🙂

    You didn't answer my question.  What is the PK for the big table that uniquely identifies a row?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden

    SSC Guru

    Points: 996661

    Heh... and why you don't think it's practical to serialize a billion row table is beyond me. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ed Wagner

    SSC Guru

    Points: 286982

    Jeff Moden - Friday, June 30, 2017 9:04 PM

    Heh... and why you don't think it's practical to serialize a billion row table is beyond me. 😉

    Me too.  If you had a sequential integer column, getting 1% of the table would be fairly simple using a MOD function, if your database has one, that is.

  • Jeff Moden

    SSC Guru

    Points: 996661

    Jeff Moden - Friday, June 30, 2017 9:03 PM

    halifaxdal - Friday, June 30, 2017 7:59 PM

    Jeff Moden - Friday, June 30, 2017 5:47 PM

    Unless your RDBMS generates some sort of accessible unique row number internally and if random sampling is going to become the norm rather than the exception, then it's seriously worth adding a unique sequential number to the table.

    In the meantime, although your table doesn't have an integer column to use for such a thing, what is the PK for the table that uniquely identifies a row?

    Thank you Jeff for your input, it's not practical to add a unique sequential number: many tables are in the size of billion records 🙂

    You didn't answer my question.  What is the PK for the big table that uniquely identifies a row?

    Ah... missed the title on this.  There is no PK on this table, at least not one that you can easily use.  I have to admit that I'm a little gob-smacked that no one anticipated the problem of duplicates on a billion row table. 

    Shifting gears on this, it's difficult for me to understand that out of more than 40 columns, there are no candidate keys.  Someone needs to seriously analyze this table and start picking it apart for keys and, maybe, some lookup tables.  As a bit of a sidebar, I'd be interested in what the data is in this heap and what people expect to draw from it.  Perhaps there's something that we could all glean from such knowledge and, perhaps, even come up with a solution to the original problem of detecting duplicates and doing something about them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sgmunson

    SSC Guru

    Points: 110459

    Jeff Moden - Sunday, July 2, 2017 12:14 PM

    Jeff Moden - Friday, June 30, 2017 9:03 PM

    halifaxdal - Friday, June 30, 2017 7:59 PM

    Jeff Moden - Friday, June 30, 2017 5:47 PM

    Unless your RDBMS generates some sort of accessible unique row number internally and if random sampling is going to become the norm rather than the exception, then it's seriously worth adding a unique sequential number to the table.

    In the meantime, although your table doesn't have an integer column to use for such a thing, what is the PK for the table that uniquely identifies a row?

    Thank you Jeff for your input, it's not practical to add a unique sequential number: many tables are in the size of billion records 🙂

    You didn't answer my question.  What is the PK for the big table that uniquely identifies a row?

    Ah... missed the title on this.  There is no PK on this table, at least not one that you can easily use.  I have to admit that I'm a little gob-smacked that no one anticipated the problem of duplicates on a billion row table. 

    Shifting gears on this, it's difficult for me to understand that out of more than 40 columns, there are no candidate keys.  Someone needs to seriously analyze this table and start picking it apart for keys and, maybe, some lookup tables.  As a bit of a sidebar, I'd be interested in what the data is in this heap and what people expect to draw from it.  Perhaps there's something that we could all glean from such knowledge and, perhaps, even come up with a solution to the original problem of detecting duplicates and doing something about them.

    If it matters, I've seen this poster just repeatedly abuse the assistance being received.   The database is NOT a SQL Server database, and rarely does this poster make much effort to solve the problem on their own.  I'm pretty sure he/she doesn't really care one whit about what effort goes into helping, and is only concerned with getting a solution.   If one doesn't magically appear pretty quickly, he/she usually loses interest.   This person has been around the forum for more than long enough to know how to get good help, and yet continues to obfuscate and avoid answering any "inconvenient" questions.   I'll worry about helping when I see a more concerted effort on the part of the original poster to help solve their own problem by providing the right kind of information and sufficient detail to allow folks to help in a practical way.

    Steve
    ‌(aka sgmunson)
    ‌:) 🙂 🙂
    Health & Nutrition

  • halifaxdal

    SSCoach

    Points: 19741

    sgmunson - Monday, July 3, 2017 10:19 AM

    Jeff Moden - Sunday, July 2, 2017 12:14 PM

    Jeff Moden - Friday, June 30, 2017 9:03 PM

    halifaxdal - Friday, June 30, 2017 7:59 PM

    Jeff Moden - Friday, June 30, 2017 5:47 PM

    Unless your RDBMS generates some sort of accessible unique row number internally and if random sampling is going to become the norm rather than the exception, then it's seriously worth adding a unique sequential number to the table.

    In the meantime, although your table doesn't have an integer column to use for such a thing, what is the PK for the table that uniquely identifies a row?

    Thank you Jeff for your input, it's not practical to add a unique sequential number: many tables are in the size of billion records 🙂

    You didn't answer my question.  What is the PK for the big table that uniquely identifies a row?

    Ah... missed the title on this.  There is no PK on this table, at least not one that you can easily use.  I have to admit that I'm a little gob-smacked that no one anticipated the problem of duplicates on a billion row table. 

    Shifting gears on this, it's difficult for me to understand that out of more than 40 columns, there are no candidate keys.  Someone needs to seriously analyze this table and start picking it apart for keys and, maybe, some lookup tables.  As a bit of a sidebar, I'd be interested in what the data is in this heap and what people expect to draw from it.  Perhaps there's something that we could all glean from such knowledge and, perhaps, even come up with a solution to the original problem of detecting duplicates and doing something about them.

    If it matters, I've seen this poster just repeatedly abuse the assistance being received.   The database is NOT a SQL Server database, and rarely does this poster make much effort to solve the problem on their own.  I'm pretty sure he/she doesn't really care one whit about what effort goes into helping, and is only concerned with getting a solution.   If one doesn't magically appear pretty quickly, he/she usually loses interest.   This person has been around the forum for more than long enough to know how to get good help, and yet continues to obfuscate and avoid answering any "inconvenient" questions.   I'll worry about helping when I see a more concerted effort on the part of the original poster to help solve their own problem by providing the right kind of information and sufficient detail to allow folks to help in a practical way.

    For your information: July 1 is Canada Day, long weekend here, I don't take work to camping, I don't think you do.
    Second, I said I trust this forum, like everyone can see, I started here quite some years ago, BUT, I am no longer using SQL now, I am in SQL-like NoSQL now, dataset has been dramitically increased to different level like billions. 
    I like this community and I know there are common usage of grammar between various system, for example, Netezza's SQL, Hive, Impala......
    I appreciate every help/clue I received from here (Steve, Luis, Jeff, .......many many), I also share my thought in this community although my contribution mighe be little compared to you.
    Do not judge people by saying something like "abuse" - that's very negative.
    When you mentioned "inconvenient", don't you know that there are something called "confidential", "privacy" in some companies? 

    BUT, I do want to apologize again for not mentioning the question is to get some enlightening here for a Non-conventional SQL question in the original post.
    And sorry I didn't have chance to check my email because I was in camping.

  • Jeff Moden

    SSC Guru

    Points: 996661

    Didn't know this was for a type of NoSQL.  Considering that you have "billions" of rows, it may be time to bite the bullet and put it all in a relational database engine.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • halifaxdal

    SSCoach

    Points: 19741

    Jeff Moden - Tuesday, July 4, 2017 10:14 AM

    Didn't know this was for a type of NoSQL.  Considering that you have "billions" of rows, it may be time to bite the bullet and put it all in a relational database engine.

    Thank you Jeff, we are in the process of archiving the data from Netezza to Hadoop cluster. Traditional database like SQL has no way to handle this in an economic way. and in Netezza there is no concept like Primary Key.

  • Jeff Moden

    SSC Guru

    Points: 996661

    halifaxdal - Tuesday, July 4, 2017 4:51 PM

    Jeff Moden - Tuesday, July 4, 2017 10:14 AM

    Didn't know this was for a type of NoSQL.  Considering that you have "billions" of rows, it may be time to bite the bullet and put it all in a relational database engine.

    Thank you Jeff, we are in the process of archiving the data from Netezza to Hadoop cluster. Traditional database like SQL has no way to handle this in an economic way. and in Netezza there is no concept like Primary Key.

    Thanks for the feedback.  Like I said, I'd love to see what the data actually looks like before I accept the notion of "Traditional database like SQL has no way to handle this in an economic way". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 14 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply