Using Exotic Joins in SQL – Part 1
Chris Cubley, MCSD
www.queryplan.com
When most developers think of joins, they think of
“a.SomethingID = b.SomethingID”.
This type of join, the equijoin, is vitally important to SQL programming;
however, it only scratches the surface of the power of the SQL join.
This is the first in a series of articles that will look at
several different types of “exotic” joins in SQL. This article will focus on using the
BETWEEN operator in joins when dealing with range-based data.
Introducing the BETWEEN Join
When dealing with things like calendars, grading scales, and
other range-based data, the BETWEEN operator comes in very handy in the WHERE
clause. It is often forgotten that
the BETWEEN operator can also be used in join criteria.
In the WHERE clause, the BETWEEN operator is usually used to
test whether some field is between two constants. However, the BETWEEN operator can take
any valid SQL expression for any or all of its three arguments. This includes columns of tables.
One use of a BETWEEN join is to determine in which range a
particular value falls. Joins of
this nature tend to have the following pattern:
<FactData> BETWEEN
<RangeMinimum> AND <RangeMaximum>
In this pattern, the “fact data” is contained in a table with
instances of data such as payments, test scores, login attempts, or clock in/out
events. The other table, the “range
lookup table”, is usually a smaller table which provides a range minimum and
maximum and other data for the various ranges.
For example, consider a scenario in which a student is
enrolled in a class. A student
receives a numeric grade for a class on a scale of 0 to 100. This numeric grade corresponds to a
letter grade of A, B, C, D, or E.
However, the school does not use the traditional grading scale in which
90 to 100 corresponds to an A, 80-89 corresponds to a B, and so forth. Instead, the school uses the following
grading scale:
|
Letter
Grade |
Numeric
Grade |
|
A |
92 - 100 |
|
B |
84 - 91 |
|
C |
76 - 83 |
|
D |
68 - 75 |
|
E |
0 - 68 |
To accommodate the school’s custom grading scale, their
records database has the following table defined:
CREATE TABLE tb_GradeScale(
LetterGrade
char(1) NOT
NULL,
MinNumeric
int
NOT NULL,
MaxNumeric
int
NOT NULL,
IsFailing
smallint NOT
NULL,
CONSTRAINT PK_GradeScale PRIMARY KEY(LetterGrade),
CONSTRAINT CK_MinMax CHECK(MinNumeric <= MaxNumeric)
)
The students’ numeric scores are stored in the following
table:
CREATE TABLE tb_StudentGrade(
StudentID
int
NOT NULL,
ClassID
varchar(5) NOT NULL,
NumericGrade
int
NOT NULL,
CONSTRAINT PK_StudentGrade PRIMARY
KEY(StudentID, ClassID),
CONSTRAINT
CK_StudentGrade_NumericGrade
CHECK(NumericGrade BETWEEN 0 AND 100)
)
In this scenario, the tb_StudentGrade table is the “fact
table” and the tb_GradeScale table is the “range lookup table”. The NumericGrade field serves as “fact
data” while the MinNumeric and MaxNumeric fields serve as the “range minimum”
and “range maximum”. Thus,
following the fact-min-max pattern, we can construct the following join
criteria:
NumericGrade BETWEEN MinNumeric AND
MaxNumeric
If we put these join criteria into the context of a query
which generates a report containing all the students’ letter grades for English
101, we end up with the following:
SELECT
s.StudentID,
g.LetterGrade
FROM
tb_StudentGrade s
INNER JOIN
tb_GradeScale g
ON(
s.NumericGrade BETWEEN g.MinNumeric AND g.MaxNumeric
)
WHERE
ClassID = 'EH101'
In this query, we join the student grade table with the
grading scale table in order to translate a numeric grade to a letter
grade. In order to accomplish this,
we use the BETWEEN operator to specify the relationship between the two tables
being joined.
Using BETWEEN With Temporal Data
Some of the trickiest queries to write are those that deal
with temporal data like calendars, appointment times, and class schedules. For example, many businesses have a
fiscal calendar that they use for accounting. Accounting periods may start on the
26th of the month and end on the 25th of the following
month. The company may vary the
starting and ending dates of each accounting period to even out the number of
days in each accounting period. In
order to generate reports by accounting period, you need to define a table that
lays out the fiscal calendar being used.
Such a table may look like this:
CREATE TABLE tb_FiscalCalendar(
FiscalYear
int
NOT NULL,
AcctPeriod
int
NOT NULL,
StartDatetime
datetime NOT
NULL,
EndDatetime
datetime NOT
NULL,
CONSTRAINT PK_FiscalCalendar PRIMARY
KEY(FiscalYear, AcctPeriod), CONSTRAINT
CK_FiscalCalendar_DateCheck
CHECK(StartDatetime <
EndDatetime)
)
In this table, the FiscalYear column indicates the fiscal
year to which the accounting period belongs. The AcctPeriod column identifies the
accounting period within the fiscal year.
The StartDatetime and EndDatetime columns specify the actual starting and
ending date and time of the accounting period.
Suppose you are trying to write a report as part of a
customer payment processing system.
This report summarizes the total number and amount of payments by
accounting period. The records of
the customer payments are stored in the following table:
CREATE TABLE tb_Payment(
PaymentID
int
NOT NULL
IDENTITY(1, 1),
AccountID
int
NOT NULL,
PostedDatetime
datetime NOT
NULL
DEFAULT(GETDATE()),
PaymentAmt
money
NOT NULL,
CONSTRAINT PK_Payment PRIMARY
KEY(PaymentID)
)
In order to construct the query needed for the report, you
must first determine the fiscal year and accounting period in which each payment
occurred. You must then group by
the fiscal year and accounting period, summing the PaymentAmt field and counting
the number of records in each group.
To determine each payment’s accounting period, you can use a
BETWEEN join to the tb_FiscalCalendar table:
FROM
tb_Payment p
INNER JOIN
tb_FiscalCalendar c
ON(
p.PostedDatetime BETWEEN c.StartDatetime AND c.EndDatetime
)
As do many other joins using the BETWEEN operator, this join
follows the fact-min-max pattern seen in the grading scale example. Each payment record (of which there are
many) provides a “fact” stating that a certain payment occurred at a particular
date and time. The fiscal calendar
table acts more as a configuration table that specifies a range of datetime
values and provides configuration data about this range.
To finish off the payment reporting query, we add the
grouping, aggregate functions, and an ORDER BY clause to make the output more
readable:
SELECT
c.FiscalYear,
c.AcctPeriod,
COUNT(*) AS PaymentCount,
SUM(PaymentAmt) AS TotalPaymentAmt
FROM
tb_Payment p
INNER JOIN
tb_FiscalCalendar c
ON(
p.PostedDatetime BETWEEN c.StartDatetime AND c.EndDatetime
)
GROUP BY
c.FiscalYear,
c.AcctPeriod
ORDER BY
c.FiscalYear,
c.AcctPeriod
The output yields the needed report easily and
efficiently. With proper indexing,
this query should run quite well even against large sets of data.
Other Uses of BETWEEN Joins
The BETWEEN join can be put to use in a number of other
scenarios. Coupling the BETWEEN
operator with a self-join can be a useful technique for concurrency-checking
queries such as validating calendars and appointment schedules. BETWEEN joins can be used to produce
histograms by aggregating ranges of data.
In a situation where you must join precise data to rounded data, the
BETWEEN operator can be used to perform a “fuzzy” join. Once you put the BETWEEN join in your
query toolbox, you’ll find even more uses for it and wonder how you ever did
without it.
About the Author
Chris Cubley is an MCSD with over four years of experience
designing and implementing SQL Server-based solutions in the education,
healthcare, and telecommunications industries. He can be reached at ccubley@queryplan.com.