What's the best way to get a sample set of a big table without primary key?

Question

Post reply

What's the best way to get a sample set of a big table without primary key?

halifaxdal

SSCoach

Points: 19741
More actions
June 29, 2017 at 7:06 am

#338415

This is what is happening here:
We have lots of big data, table's size is often around billion and there is no integer id for primary key.
We need to get a small set of sample data (1%), current solution is for the tables with big int field, we do something like WHERE id%100<1
As lots of pending tables do not have integer field like id, we are stuck here now.
Can anyone share your solution here on how to get the needed sampling dataset?
Please note: it is not acceptable nor realistic to add new field to existing table.

Viewing 15 posts - 1 through 15 (of 29 total)

You must be logged in to reply to this topic. Login to reply

parody Say Hey Kid Points: 682 More actions · Answer 1

The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.

How about tablesample? This should only tocuh pages required to fulfill the sample request

http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

halifaxdal SSCoach Points: 19741 More actions · Answer 2

parody - Thursday, June 29, 2017 7:26 AM
The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.
How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx

Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

parody Say Hey Kid Points: 682 More actions · Answer 3

halifaxdal - Thursday, June 29, 2017 7:32 AM
parody - Thursday, June 29, 2017 7:26 AM
The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.
How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

Ohk, it's just you have posted under SQL Server 2012 - T- SQL seciton. What is the DBMS in question, else no one can guess at the syntax you are expecting to be supported?? Or do you actually mean Google Big Table, not simple a big table. Maybe i'm missing something.

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 4

halifaxdal - Thursday, June 29, 2017 7:32 AM
parody - Thursday, June 29, 2017 7:26 AM
The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.
How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

You're posting on the wrong website, then. If you're using a similar RDBMS, something like SELECT TOP 1 PERCENT might work for you. Otherwise, you're likely to get a quicker, clearer and more comprehensive answer on a discussion board relating to the product you're using.

John

Luis Cazares SSC Guru Points: 183706 More actions · Answer 5

What about using something like this?
SELECT TOP (1) PERCENT * FROM MyTable;
Note that if you add an ORDER BY clause the performance will decrease.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

parody Say Hey Kid Points: 682 More actions · Answer 6

Just to be pedantic, though, this may not give the type of sample desired - it will return in an efficient-first order which may or may not be represantive of the full contents depending on a number of factors (insert order, what is the cluster, partitioning, hardware etc)

The modulo example from the OP mimcks a sampled set across the full contents, hence the equivilent thought of tablesample (in tsql at least!)

Sue_H SSC Guru Points: 90860 More actions · Answer 7

I don't think anyone can tell you the best way (or even a viable approach) since it's not SQL Server and the database platform for the table is unknown.

Sue

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 8

halifaxdal - Thursday, June 29, 2017 7:32 AM
parody - Thursday, June 29, 2017 7:26 AM
The issue with that is it'll result in a ful index or heap scan if there is nothing supprting it.
How about tablesample? This should only tocuh pages required to fulfill the sample request
http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx
Thank you for your input, unfortunately this is not MS SQL and there is no TABLESAMPLE function or similar one.

You have been around long enough to know better than keeping the facts to yourself, if you want an answer then you must at least post all the relevant information. For a start, what system are you working on?
😎

There are hundreds of regulars here which can answer your question, the only requirement is to properly ask and not leave us guessing!

ScottPletcher SSC Guru Points: 101223 More actions · Answer 9

Most typically in SQL Server I see this technique used:

SELECT TOP ...
FROM dbo.table_name
ORDER BY NEWID()

I presume that gives a roughly random sample. Of course it likely does require sorting the entire record set.

If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

halifaxdal SSCoach Points: 19741 More actions · Answer 10

I am so sorry that I forgot to mention that I am working in Cloudera and Netezza, I post my question here because I know this is the best forum for SQL and people like yourself are all the wonderful people and are willing to help, so I can borrow some idea from here and port it to my own environment.

My apology to all.

In this case, the query would be used in Impala (Cloudera)

The raw table has a size of 11978638511, for the query to do the sampling, the table's size is around 6000

I tried Luis suggestion of

SELECT TOP (1) PERCENT *
FROM MyTable;

Impala doesn't take it.

Thank you very much.

Sue_H SSC Guru Points: 90860 More actions · Answer 11

halifaxdal - Thursday, June 29, 2017 10:20 AM
I am so sorry that I forgot to mention that I am working in Cloudera and Netezza, I post my question here because I know this is the best forum for SQL and people like yourself are all the wonderful people and are willing to help, so I can borrow some idea from here and port it to my own environment.
My apology to all.
In this case, the query would be used in Impala (Cloudera)
The raw table has a size of 11978638511, for the query to do the sampling, the table's size is around 6000
I tried Luis suggestion of
SELECT TOP (1) PERCENT *
FROM MyTable;
Impala doesn't take it.
Thank you very much.

I just did a quick search - it won't support top. An equivalent is the limit clause. But it also doesn't mean it's the best and there could be alternatives so you may want to google and get more information on using the limit clause.
select * from MyTable limit 6000;

Sue

parody Say Hey Kid Points: 682 More actions · Answer 12

Rereading your first post actually made me realise you are doing it just fine with modulo, it's just with some tables you don't have an integer column to use.

In that case can you not find some other suitable column with enough variation (timestamp or some random data) on which to calculate a checksum and then perform a modulo?

Alternatively a TOP with calculated instead of percentage and order by something random or a new_id type function?

Also FYI

https://issues.apache.org/jira/plugins/servlet/mobile#issue/IMPALA-1924

halifaxdal SSCoach Points: 19741 More actions · Answer 13

Thank you. I am not sure if I ran into sort of "bug" in Impala, here is what I like to share here and I am waiting for Cloudera's reply:
As there is no integer column to use, I came up with the following query hoping to get 1% random subset:

SELECT round(rand(unix_timestamp(now())) * 10,0) ,cast(round(rand(unix_timestamp(now())) * 10,0) as int) ,* FROM fq_stage.OATS_GTC_ORDER --where cast(round(rand(unix_timestamp(now())) * 10,0) as int) = 0

The second field is not necessary but added there for troubleshooting, the query with where clause is supposed to return me the 1% subset but actually nothing
Comment out the where clause will give all the dataset
Modify it to <=1,2,..... will give me all the dataset.

BTW:
rand(unix_timestamp(now())) generates value like: 0.2900070456275749

mw_sql_developer SSCoach Points: 19445 More actions · Answer 14

ScottPletcher - Thursday, June 29, 2017 9:57 AM
Most typically in SQL Server I see this technique used:
SELECT TOP ...
FROM dbo.table_name
ORDER BY NEWID()
I presume that gives a roughly random sample. Of course it likely does require sorting the entire record set.
If you have an existing index that contains a unique set of columns, you could select just those columns in the "ORDER BY" query, allowing an index scan and sort rather than a full table scan and sort, then join back to the main table after that to get other column(s) from the table.

How about adding that WITH ( NOLOCK) next to the table name