select top 3 records of each type baesd on type id

  • I have a table with 100k records of 10 different types. I need to extract any 2 of each type into a second table for sampling. I can do this with a cursor or loop but can't come up with the set based solution. For example the loop solution is shown below.

    create table #testgrp

    (

    recid decimal(13,0),

    typeid varchar(5)

    CONSTRAINT [PK_#testgrp] PRIMARY KEY CLUSTERED

    (

    [recid]

    ) ON [PRIMARY]

    ) ON [PRIMARY]

    insert into #testgrp(recid,typeid)

    select 1,'N1' union

    select 2,'N1' union

    select 3,'N1' union

    select 4,'N1' union

    select 5,'N2' union

    select 6,'N2' union

    select 7,'N2' union

    select 8,'N2' union

    select 9,'N3' union

    select 10,'N3' union

    select 11,'N3' union

    select 21,'N3'

    select top 0 e.* into #seeds from #testgrp e

    declare @rc int,@qc int

    set @rc = 0

    set @qc = 0

    select typeid from #testgrp

    group by typeid

    set @rc = @@rowcount

    while @qc < @rc

    begin

    insert into #seeds

    select top 2 e.* from #testgrp e left join #seeds s on s.typeid=e.typeid

    where s.typeid is null

    set @qc = @qc + 1

    end

    select * from #seeds

    drop table #testgrp

    drop table #seeds

    Results:

    recidtypeid

    1N1

    2N1

    5N2

    6N2

    9N3

    10N3


  • mrpolecat, with windows 2005 and up, you could easily use the row_number() function to get what you are after; you probably knoew that already.

    For SQL 2000 it's not so intuitive.

    Here's a snippet I've adapted to your sample data that i had in my notes; it seems to do exactly what you were after:

    SELECT T1.[recid],

    T1.[typeid]

    FROM #testgrp T1

    WHERE T1.[recid] IN

    (

    SELECT TOP 2 --Change this number to vary # output

    T2.[recid]

    FROM #testgrp T2

    WHERE T2.[typeid] = T1.[typeid]

    ORDER BY T2.[recid] DESC

    )

    ORDER BY T1.[recid],

    T1.[typeid] DESC

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • That works great but I don't get why. I'd appreciate it if you could give me the explaination as well.


  • i believe it is because of the correlated subquery, where the inner query getting the TOP 2 is connecting to the outer query:

    SELECT TOP 2 --Change this number to vary # output

    T2.[recid]

    FROM #testgrp T2

    WHERE T2.[typeid] = T1.[typeid] -- the correlating part of the query referencing the outer query

    ORDER BY T2.[recid] DESC

    if it was just a vanilla SELECT recid from the table, all rows would show up in the outer query...but because of the WHERE T2.[typeid] = T1.[typeid], only the rows that have a join show up, and hte TOP limits it to only a few rows for each recid;

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • so is the inner query really running one time for each typeid?


  • mrpolecat (7/6/2010)


    I need to extract any 2 of each type into a second table for sampling.

    mrpolecat, it may not be an issue in this case, but for a truly randomized statistical sample, you should probably stay away from ordering by id for extracting the two records from each type. For example, what if someone bulk loaded the data into the table at one time, and had ORDER BY cost, or age, or any other attribute that would cause the top two records to not be representative of the phenomenon. Just a thought.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Thanks Greg, There is actually a whole other part of the query that deals with the randomization of the records that I left out to simplify the issue to the part I was having toruble with. I add a where clause to both the inner and outer queries which limits the records it can select from to a random set.


Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply