Joining with large table

  • Good morning. I have been working on a faster solution for some time, and thought I had it. I have a very large table with all people who work in the state. I need to join that table to a much smaller one of college graduates for outcomes. Since the graduates table will continue to grow, I can't remove anyone from the wage table. So wage table remains huge, and will only get bigger. Now, joining the tables is not a big deal, but for the fact that people often work in more than one job. But I created a table-valued function:

    CREATE FUNCTION dbo.GetWages

    (@SSN char(9))

    RETURNS TABLE

    AS

    RETURN

    SELECT year, quarter, ssn, SUM(hours) hours, SUM(wages) wages, SUM(wages)/nullif(sum(hours),'0') rate

    FROM dbo.Wage_data

    WHERE ssn IN (@SSN)

    GROUP BY year, quarter, ssn;

    GO

    And it works great:

    SELECT student_id, LName, rate

    FROM dbo.Degrees_Data CROSS APPLY dbo.GetWages(SSN)

    However, we also want to know the industry of employment. When there are two jobs, we want the industry of the main job, defined as one with most wages. To get this, I have been using a CTE in a view, which is very slow since I cannot write in index to it. So again I wrote a TVF:

    CREATE FUNCTION dbo.GetWagesAndIndustry (@SSN char(9))

    RETURNS TABLE

    AS

    RETURN

    WITH cte AS

    (

    SELECT year, quarter, ssn, hours, industry

    , ROW_NUMBER() over (partition by year, quarter, ssn ORDER BY wages ASC) rn

    , SUM(wages) over (partition by year, quarter, ssn ) SumWage

    , SUM(hours) over (partition by year, quarter, ssn ) SumHours

    FROM dbo.Wage_data

    WHERE ssn IN (@SSN)

    )

    SELECT year, quarter, ssn, industry, SumHours hours, SumWage wages, SumWage/NULLIF(SumHours, '0') rage

    FROM cte

    WHERE rn = 1

    But this is also painfully slow. Eventually we want to be able to create reports on the aggregated data, but who's going to wait 5-10 minutes for a report to generate? Definitely, not my boss. There must be a creative way to group the wage data to one record per person, yet have the industry information still available. What am I not seeing? Thank you for any help.

  • Just curious, what's the drawback of using a bog standard join using tables? (posting a wild guess at such a query :w00t:)

    SELECT grads.student_ID, grads.Lname, WAGES.SumWage, WAGES.SumHours

    FROM GRADS

    JOIN

    (

    SELECT year, quarter, ssn, hours, industry

    , ROW_NUMBER() over (partition by year, quarter, ssn ORDER BY wages ASC) rn

    , SUM(wages) over (partition by year, quarter, ssn ) SumWage

    , SUM(hours) over (partition by year, quarter, ssn ) SumHours

    FROM dbo.Wage_data

    ) WAGES

    ON GRADS.ssn = WAGES.ssn

    where WAGES.rn = 1

  • I'm sorry, but I don't know what a bog is (relating to SQL, that is). I googled it, and didn't see anything obvious.

  • Amy.G (4/8/2013)


    I'm sorry, but I don't know what a bog is (relating to SQL, that is). I googled it, and didn't see anything obvious.

    "Bog Standard" http://en.wiktionary.org/wiki/bog_standard

    (idiomatic) Especially plain, ordinary, or unremarkable; having no special, excess or unusual features; plain vanilla "She drives a bog standard economy car."

    Sorry about that!

  • OK, thank you for the link on ancient ceramic toilets 😛 But anyway, you are asking why don't I just put it in a regular table? Correct? The answer is space. I have used temp tables, where I can put on index on the table and run queries from there, which is fine for ad hoc queries. But I'm looking for something to use in reporting services. I played around with a stored proc to generate temp tables, but I wanted to throw out the issue to the forums to see if there were any ideas people had.

  • Amy.G (4/8/2013)


    OK, thank you for the link on ancient ceramic toilets 😛 But anyway, you are asking why don't I just put it in a regular table? Correct? The answer is space. I have used temp tables, where I can put on index on the table and run queries from there, which is fine for ad hoc queries. But I'm looking for something to use in reporting services. I played around with a stored proc to generate temp tables, but I wanted to throw out the issue to the forums to see if there were any ideas people had.

    Ok I was assuming the data was already in two tables that could be indexed. My bad!!!

    Also for what its worth, I like temp tables for reporting services, if you use a single hash (like #name), I didn't think there was a problem with it as if I'm not mistaken these are actually only visible to the connection they're created with.

  • Ok I was assuming the data was already in two tables that could be indexed. My bad!!!

    Oops, I think I misunderstood something. The data are in two separate tables that do have indexes on them. The wage data needs to be grouped.... hold on, it just occurred to me the data can be grouped by employee after the tables have been joined. (don't laugh, it seriously just occurred to me). I'll see where this leads.

  • I can't believe this. My query went from nearly 10 minutes to 2 seconds. I had this idea that the data had to be summed before joining. The dangers of tunnel vision, I guess.

  • Amy.G (4/8/2013)


    I can't believe this. My query went from nearly 10 minutes to 2 seconds. I had this idea that the data had to be summed before joining. The dangers of tunnel vision, I guess.

    I'm not an expert or anything, but I've seen plenty of evidence that the sql server software will rearrange query operations in a variety of ways as long as the results would be logically equivalent to what you spec'ed in your query text. Sometimes it can pay to just write your query in the most obvious way and see what the server does with it. Doesn't hurt to eyeball the plan also.

    Good luck!

  • Amy.G (4/8/2013)


    Ok I was assuming the data was already in two tables that could be indexed. My bad!!!

    Oops, I think I misunderstood something. The data are in two separate tables that do have indexes on them. The wage data needs to be grouped.... hold on, it just occurred to me the data can be grouped by employee after the tables have been joined. (don't laugh, it seriously just occurred to me). I'll see where this leads.

    Glad you got it figured out. Sometimes its best to go back to the logical steps SQL Server uses to work through a query. Great job!

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply