Practice SQL querying with dates and aggregations

  • Hello SSC!

    I am looking for some practice with advanced SQL querying. I have a face-to-face interview coming up (not Zoom, actually going to the office), and I want to be sure that I practice. Is there a site, or perhaps some SSC documentation, that has practice queries using the AdventureWorks DB? I am specifically looking for aggregations and date manipulation.
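
    For reference, a rough sketch of the kind of date/aggregation query in question, assuming the standard AdventureWorks Sales.SalesOrderHeader table (the exact query is just illustrative):

    select order_month = datefromparts(year(soh.OrderDate), month(soh.OrderDate), 1)
          ,order_count = count(*)
          ,total_due   = sum(soh.TotalDue)
          ,avg_due     = avg(soh.TotalDue)
    from Sales.SalesOrderHeader soh
    where soh.OrderDate >= '2013-01-01'
      and soh.OrderDate <  '2014-01-01'
    group by datefromparts(year(soh.OrderDate), month(soh.OrderDate), 1)
    order by order_month;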

    Happy holidays, and stay safe!

    Thanks

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • This is the main test question I use to test query-writing skill.  It's not as trivial or obvious as it sounds at first.

    A table contains 1, 2 or 3 rows for each student, each row with one test score.  For example:

    student_scores ( student_id int NOT NULL, assignment_number int NOT NULL, score int NOT NULL )

    Write a single query (no ctes, joins, subqueries, etc.) that lists the following:

    student_id, score_count, avg_score, highest_score, middle_score, lowest_score

    If there are fewer than 3 scores, list from left-to-right (i.e., highest first, then middle, finally lowest), leaving missing scores NULL.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks for this... A few questions.

    Can you clarify the instructions? I will send you the script once I have completed it. I appreciate the help.

    I'm not sure about the table structure. Is it the first CREATE statement below or the second? I think the second is the result set you are looking for.

    create table students (
    student_id int,
    score_count float
    );
    create table students 
    (
    student_id int,
    score_count float,
    avg_score float,
    highest_score float,
    middle_score float,
    lowest_score float
    );

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • Here's the table structure:

    student_scores ( student_id int NOT NULL, assignment_number int NOT NULL, score int NOT NULL )
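
    For example, sample data matching that structure might look something like this (a sketch only; the values are purely illustrative):

    create table student_scores
    (
    student_id int NOT NULL,
    assignment_number int NOT NULL,
    score int NOT NULL
    );

    insert into student_scores (student_id, assignment_number, score)
    values (1, 1, 80), (1, 2, 60), (1, 3, 95),
           (2, 1, 80), (2, 2, 100),
           (3, 1, 70);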

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Acceptable data?

    create table student_scores 
    (
    student_id int NOT NULL,
    score_date date NOT NULL,
    score int NOT NULL
    );

    insert into student_scores VALUES (1,'11/27/2020',80);
    insert into student_scores VALUES (1,'12/01/2019',60);
    insert into student_scores VALUES (1,'09/27/2018',95);
    insert into student_scores VALUES (2,'11/27/2020',80);
    insert into student_scores VALUES (2,'06/24/2019',100);
    insert into student_scores VALUES (3,'07/04/2020',65);
    insert into student_scores VALUES (3,'06/25/2019',82);
    insert into student_scores VALUES (3,'10/17/2018',76);

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • Completed script.

    create table student_scores 
    (
    student_id int NOT NULL,
    score_date date NOT NULL,
    score int NOT NULL
    );

    insert into student_scores VALUES (1,'11/27/2020',80);
    insert into student_scores VALUES (1,'12/01/2019',60);
    insert into student_scores VALUES (1,'09/27/2018',95);
    insert into student_scores VALUES (2,'11/27/2020',80);
    insert into student_scores VALUES (2,'06/24/2019',100);
    insert into student_scores VALUES (3,'07/04/2020',65);
    insert into student_scores VALUES (3,'06/25/2019',82);
    insert into student_scores VALUES (3,'10/17/2018',76);
    insert into student_scores VALUES (4,'10/17/2018',70);

    select student_id
          ,count(score) as score_count
          ,avg(score) as avg_score
          ,max(score) as highest_score
          ,(sum(score)/count(student_id)) as middle_score
          ,min(score) as lowest_score
    from student_scores
    group by student_id;

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • Pretty good overall, reasonably close.

    But:

    the middle score will not necessarily be the avg score (sum/count is the same as avg);

    if there are fewer than 3 scores, leave the extra scores NULL.  For example, if there is only 1 test score for a student, leave middle_score and lowest_score NULL;

    You may modify a data type if you need to get a more accurate result, and you may add a column to the query if you feel it is necessary to display meaningful results (HINT, HINT).
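
    For what it's worth, one possible shape for that (just a sketch, not necessarily the intended answer) is conditional aggregation, deriving the middle score as SUM minus MAX minus MIN when all three scores are present and using COUNT to decide which slots stay NULL:

    select student_id
          ,score_count   = count(*)
          ,avg_score     = avg(cast(score as decimal(5,2))) -- decimal avoids integer division
          ,highest_score = max(score)
          ,middle_score  = case when count(*) = 3 then sum(score) - max(score) - min(score)
                                when count(*) = 2 then min(score)
                           end
          ,lowest_score  = case when count(*) = 3 then min(score) end
    from student_scores
    group by student_id;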

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Lord Slaagh wrote:

    Acceptable data?

    I would have to say "No"... not acceptable.  The table that you've defined is different from the table that Scott very explicitly described not once but twice in this thread.  You'll find that's also one of the things they look for in interviews.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ScottPletcher wrote:

    This is the main test question I use to test query-writing skill.  It's not as trivial or obvious as it sounds at first.

    A table contains 1, 2 or 3 rows for each student, each row with one test score.  For example: student_scores ( student_id int NOT NULL, assignment_number int NOT NULL, score int NOT NULL )

    Write a single query (no ctes, joins, subqueries, etc.) that lists the following:

    student_id, score_count, avg_score, highest_score, middle_score, lowest_score

    If there are fewer than 3 scores, list from left-to-right (i.e., highest first, then middle, finally lowest), leaving missing scores NULL.

    Quick question... when you provide this question, do you provide any test data to go with it or do you leave that up to the interviewee like what might happen in real life?

    In the latter case, do you allow the interviewee to ask clarification questions?  For example, what is the domain of "assignment_number"?  Is it sequential with no gaps starting at 1 for each student?  If not, what is the possible domain of numbers for that column?  I'd ask the same question if such requirements were given to me in real life.

    Also, since the problem description contains no defined PK nor a Clustered Index (a favorite issue of yours, and I'm surprised to see this omission in your problem description), as a person taking this test I'd have to ask whether the combination of student_id and assignment_number forms a unique combination or not.  Do you allow your interviewees to ask about/discuss such things?
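
    For what it's worth, the kind of declaration I'd be asking about might look something like this (a sketch only, assuming student_id and assignment_number are meant to be unique together):

    CREATE TABLE dbo.student_scores
    (
     student_id        int NOT NULL
    ,assignment_number int NOT NULL
    ,score             int NOT NULL
    ,CONSTRAINT PK_student_scores PRIMARY KEY CLUSTERED (student_id, assignment_number)
    );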

    ScottPletcher wrote:

    Write a single query (no ctes, joins, subqueries, etc.) that lists the following:

    Really?  So you're not interested in finding someone who might have an innovation that blows the status-quo solution out of the water?  I went through that during an interview once... it wasn't a written test but a discussion.  The interviewers were hell-bent on an incremental rCTE being the correct answer and they argued vehemently for it.  When I explained both why incremental rCTEs are a bad idea and that the use of a Tally table or function is a highly efficient replacement in the given problem, they thought it was a preposterous pipe dream and wouldn't even allow me to explain how it works.  Even though (at the time) I actually needed a job (company layoff, first time in my life), I explained why I was going to leave, thanked them for their time, and left, because I didn't want to work in a place where innovation is stifled by such a ridiculous posture from the interviewers.

    During that same short out-of-work adventure, I interviewed with another company and the Lead DBA was "too busy" to take part in the interview process.  That's a first troubling sign.  I answered a whole bunch of questions and the final question was "Where is the one place where you wouldn't want to call a stored procedure even if you could?"  What they were looking for was "in a function", which I didn't get because... guess what?  I know how to call stored procedures from a function and it's absolutely the right thing to do in certain cases, so I'd have never guessed they were looking for that answer.  It also turned out that was the make-it-or-break-it question for the interview... they didn't accept anyone who didn't answer "in a function".

    The fun part was that a good while after that, I ran into the new Lead DBA who replaced the old one and we had a discussion about that question, and that's when he told me the answer was "in a function".  It turns out the old Lead DBA was fired.  I explained the question and, thinking about it, explained how I actually do use stored procedure calls in functions and why.  He was totally gobsmacked and said "That's a part of why the old DBA was let go... he suppressed that type of innovation".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Scott - If you are not looking for the test results to be listed in a string, then I think this is what you are looking for.

    create table student_scores 
    (
    student_id int NOT NULL,
    score_date date NOT NULL,
    score int NOT NULL
    );

    insert into student_scores VALUES (1,'11/27/2020',80);
    insert into student_scores VALUES (1,'12/01/2019',60);
    insert into student_scores VALUES (1,'09/27/2018',95);
    insert into student_scores VALUES (2,'11/27/2020',80);
    insert into student_scores VALUES (2,'06/24/2019',100);
    insert into student_scores VALUES (3,'07/04/2020',65);
    insert into student_scores VALUES (3,'06/25/2019',82);
    insert into student_scores VALUES (3,'10/17/2018',76);
    insert into student_scores VALUES (4,'10/17/2018',70);



    select student_id
    ,count(score) as score_count
    ,avg(score) as avg_score
    ,max(score) as highest_score
    ,case when count(student_id) <3 then NULL else (sum(score)/count(*)) end as middle_score
    ,case when count(student_id) <2 then NULL else min(score) end as lowest_score
    from student_scores
    group by student_id;

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • Scott - I agree with Jeff in the sense that this test is somewhat unrealistic, because in real life you have CTEs and subqueries at your disposal. However, if this is a test to take the candidate out of their comfort zone, and landing the job does not depend on the right answer to this specific query, then this could be a one-off question. Perhaps you could ask the candidate to write this query with and without CTEs and subqueries? That would be a nice test.
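
    For example, the "with CTEs" version might look something like this (a sketch only, using ROW_NUMBER to rank each student's scores and conditional aggregation to pivot them into columns):

    with ranked as
    (
        select student_id
              ,score
              ,rn = row_number() over (partition by student_id order by score desc)
        from student_scores
    )
    select student_id
          ,score_count   = count(*)
          ,avg_score     = avg(cast(score as decimal(5,2)))
          ,highest_score = max(case when rn = 1 then score end)
          ,middle_score  = max(case when rn = 2 then score end)
          ,lowest_score  = max(case when rn = 3 then score end)
    from ranked
    group by student_id;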

    Jeff - Do you have any challenge queries that you can provide for practice? I have all versions of the AdventureWorks DBs if that helps. I would really appreciate it, and thank you both for preparing me for this interview!

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • >> Completed script.<<

    Why not start with learning how to write DDL? I've always felt that 80 to 90% of the work in SQL is in declaring your data properly and letting it take care of itself. What you posted is not even a table; a table must have a key, by definition. Then you use some local dialect data display format for your dates. Your student_id seems to be an identifier. But, by definition, an identifier is on a nominal scale and nominal scales are never numeric. This will be covered in the first week or so of the data modeling course which you should take. Let's start over, and do it right.

    CREATE TABLE Student_Scores
    (student_id CHAR(10) NOT NULL, --- nominal scale, identifier
     score_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,
     score INTEGER NOT NULL
      CHECK(score BETWEEN 0 AND 100), --- learn to write a declarative language
     PRIMARY KEY (student_id, score_date) --- not an option!
    );

    Microsoft has allowed the ANSI/ISO insertion statement syntax for quite a few years now. Using the table value constructor, instead of doing it as if you are still reading punch cards one at a time into a 1960s tape file, lets the optimizer make some decisions:

    INSERT INTO Student_Scores
    VALUES
    (1, '2020-11-27', 80),
    (1, '2019-12-01', 60),
    (1, '2018-09-27', 95),
    (2, '2020-11-27', 80),
    (2, '2019-06-24', 100),
    (3, '2020-07-04', 65),
    (3, '2018-10-17', 76),
    (3, '2019-06-25', 82),
    (4, '2018-10-17', 70);

    Your specs have some problems. Say I have only one score for a given student; by definition, that score is the lowest, the highest, and the middle score. I'm going to assume you don't want to see a value appear multiple times. So what should the pattern be? Decide if you want to show it as (score, NULL, NULL), (NULL, score, NULL), or (NULL, NULL, score). Likewise, we have the same problem when there are only two values in the group: (score1, score2, NULL), (NULL, score1, score2), or (score1, NULL, score2)? If you have three scores, there's no problem.

    At this point you need to decide on your specs, and I would use a CASE expression to give me the results I've decided on. The trick is that if the WHEN clauses in a CASE expression are at the same level of aggregation as the THEN clauses, you can write something like this:

    (CASE WHEN COUNT(student_id) = 3
            THEN SUM(score) - MIN(score) - MAX(score)
          WHEN COUNT(student_id) = 2
            THEN MIN(score) --- or whatever
          WHEN COUNT(student_id) = 1
            THEN NULL --- or whatever
          ELSE NULL END) AS middle_score,

    does this help?

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Lord Slaagh wrote:

    Scott - I agree with Jeff in the sense that this test is somewhat unrealistic, because in real life you have CTEs and subqueries at your disposal. However, if this is a test to take the candidate out of their comfort zone, and landing the job does not depend on the right answer to this specific query, then this could be a one-off question. Perhaps you could ask the candidate to write this query with and without CTEs and subqueries? That would be a nice test.

    Jeff - Do you have any challenge queries that you can provide for practice? I have all versions of the AdventureWorks DBs if that helps. I would really appreciate it, and thank you both for preparing me for this interview!

    First of all, and just to reiterate, when you are trying to solve a given problem with specific requirements for an interview question, you must first solve the given problem if you want to pass the interview.  That means that if they provide a given table structure, you have to follow it to the "T".  You can't do what you did and substitute a given column with one of your own, even if you totally disagree with the structure of the table.

    With that being said, and especially for practice on such things, some interviewers (like me) aren't looking for just a correct answer... we're also looking for a correct answer that performs well and uses resources wisely.  After all, we're normally NOT hiring for a junior position.  In most cases, it's for a mid to senior level and, of course, even if it is for a junior position, YOU will need to beat out the other candidates, who may not all be "juniors" in knowledge or skill.

    Of course, when I'm on an interview, after I solve the problem according to their (sometimes crazy) constraints and if there's time, I'll go back and offer an alternate solution in an "Alice's Restaurant" fashion with explanations built into the comments (and don't say things like "incremental rCTEs are stupid" in your comments... avoid insulting your captors! 😀).

    And, with that in mind, if you're going to practice, you need to also practice generating a shedload of data so that you can test for performance and scalability and not have to rely on finding appropriate examples in sample databases, because such example data might not actually be available in those databases.  Here are a couple of "starter" articles on the basics of generating tons of random but constrained data.

    https://www.sqlservercentral.com/articles/generating-test-data-part-1-generating-random-integers-and-floats-1

    https://www.sqlservercentral.com/articles/generating-test-data-part-2-generating-sequential-and-random-dates

    Ok... back to Scott's example problem.

    A small handful of rows like what you've done is ok for the first part of the "Make it work, make it fast, make it pretty (and it ain't done 'til it's pretty)" self-imposed mantra that I follow.  So here's some code that creates more substantial data that follows the "rules" of the data provided in the problem description.  It also contains the original problem and the post/thread reference so I can "go back" to the original post if I later need more understanding, or if I want to see what other people came up with for solutions and then test those for performance/scalability as well.

    /********************************************************************************************************
    Problem Description
    A table contains 1, 2 or 3 rows for each student, each row with one test score.
    For example:
    student_scores ( student_id int NOT NULL, assignment_number int NOT NULL, score int NOT NULL )

    Write a single query (no ctes, joins, subqueries, etc.) that lists the following:
    student_id, score_count, avg_score, highest_score, middle_score, lowest_score

    If there are fewer than 3 scores, list from left-to-right leaving missing scores NULL.
    (i.e., highest first, then middle, finally lowest)

    REF: https://www.sqlservercentral.com/forums/topic/practice-sql-querying-with-dates-and-aggregations

    Note to self... SQLServerCentral.com has code windows only 105 characters wide without scrolling, so
    stop typing at a cursor position of 106 or less.
    ********************************************************************************************************/
    --=======================================================================================================
    -- Create the test data
    --=======================================================================================================
    --===== If the test table already exists, drop the table to make reruns in SSMS easier
    -- I generally use a Temp Table for such things to avoid accidentally dropping a real table.
    DROP TABLE IF EXISTS #student_scores
    ;
    --===== Create the table according to the given Problem Description.
    -- Note that no Primary Key was described although the combination of student_id and
    -- assignment_number would make a proper Clustered PK. We also weren't provided any info on whether
    -- or not the assignment_number column is unique by student_id.
    --
    -- Also note that since I do not know the collation settings (especially for case sensitivity), I
    -- didn't take the liberty of changing the case of any named objects to make them more readable.
    CREATE TABLE #student_scores
    (
    student_id INT NOT NULL
    ,assignment_number INT NOT NULL
    ,score INT NOT NULL
    )
    ;
    --=======================================================================================================
    --===== Generate random test data for 100K students where each student will have from 1 to 3 assignments
    -- each per the problem definition. Note that if you want to first experiment with a much smaller
    -- number of students for the initial "Make it work" effort, then just change the "100000" to a
    -- smaller number in the cte code below.
    WITH cteGenStudent AS
    (
    SELECT student_id = t.N
    ,assignment_count = ABS(CHECKSUM(NEWID())%3)+1
    FROM dbo.fnTally(1,100000) t
    )
    INSERT INTO #student_scores WITH (TABLOCK) --Minimal Logging might be possible
    (student_id, assignment_number, score)
    SELECT gs.student_id
    ,assignment_number = t.N --Problem doesn't use this, so we can take the easy way out.
    ,score = ABS(CHECKSUM(NEWID())%41)+60 --Domain of possible scores is 60 thru 100
    FROM cteGenStudent gs
    CROSS APPLY dbo.fnTally(1,gs.assignment_count) t
    ;
    --===== Visual check of some of the contents of the test table
    SELECT TOP 1000 *
    FROM #student_scores
    ORDER BY student_id, assignment_number
    ;

    To beginners in the craft of generating test data, that seems like a lot of code BUT... you'll quickly become a Ninja at it if you practice it, AND you will no longer have to rely on trying to find something specific in a sample database that doesn't actually provide enough data.  Since you've also documented the problem and the test data generation, you can easily go back to it a year or two from now when you have a sudden epiphany on possibly doing it a better way.  For example, would adding a Clustered Index or a Non-Clustered Index help or hurt for such a thing?  Would using something like a CTE or CROSS APPLY be better than the typical "Must be a Single Select" requirement?  Despite thoughts to the contrary, could a self-join actually be a lot better for performance?
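
    For example, one variation worth timing against the generated data might be as simple as adding a clustered index that matches the grouping (the index name here is just illustrative):

    --===== Add a clustered index on the temp table, then re-run the candidate queries and compare
    CREATE CLUSTERED INDEX CI_student_scores
        ON #student_scores (student_id, assignment_number)
    ;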

    You just don't know until you have enough data.  For example, a million-row table is actually "chump change" in the world of today.  The code above is really easy to change to 10 or even 100 million rows and it'll still fly.  Performance is just as important in building test data as it is in solution code because, the faster you can regenerate it, the more quickly you can try different solutions.

    BTW... the code above generates the data for 10 million students resulting in about 20 million rows in 7 seconds on my box.  There's also something to be learned there and that learning will also help you on interviews and doing a job better once you land it.

    Also... the code above relies on the dbo.fnTally iTVF... you can get that code at the similarly named link in my signature line below.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jcelko212 - That is a good point, but if you recall, DDL definitions and constraints are not required in the spec of this practice test. Plus, this is not a script that is being released to production, so the rules do not have to be so strict. Let's ask Scott. He is the interviewer in this case, and his opinion is the only one that matters in getting the job.

    ScottPletcher - Do you care about the DDL definition? Are you satisfied with the results that I produced? Would you hire me after this script?

    Thanks SSC!

    Everyone has a plan until they get punched in the mouth. --Mike Tyson

  • I just noticed your script will fail. Did you not notice the red highlights? You didn't comment your code properly.

    Everyone has a plan until they get punched in the mouth. --Mike Tyson
