SQL Assignment - Join 2 Tables

  • The data set that I have has 2 tables. One (trips) contains the start and end date in DATETIME format for bicycle trips as well as the ID code for the station that the bike belongs to, the other table (station) contains the same corresponding ID code for the stations along with the actual name of the station. The questions I need to answer, which means I will have to JOIN these two tables are:

      1. Station name

        • Which station has the longest trips on average?
        • Which station has the shortest trips on average?
        • Extremely long / short trips can skew your results. How would avoid that?

    1. Let's call trips that start and end in the same station "round trips". Calculate the fraction of trips that were round trips and break it down by:

      1. Membership status
      2. Day of the week

  • paige.miller91 - Thursday, January 17, 2019 11:21 AM

    The data set that I have has 2 tables. One (trips) contains the start and end date in DATETIME format for bicycle trips as well as the ID code for the station that the bike belongs to, the other table (station) contains the same corresponding ID code for the stations along with the actual name of the station. The questions I need to answer, which means I will have to JOIN these two tables are:

      1. Station name

        • Which station has the longest trips on average?
        • Which station has the shortest trips on average?
        • Extremely long / short trips can skew your results. How would avoid that?

    1. Let's call trips that start and end in the same station "round trips". Calculate the fraction of trips that were round trips and break it down by:

      1. Membership status
      2. Day of the week

    Okay, what have you tried to answer these questions?  We aren't just going to give you the answers as that doesn't he;p you learn.

  • Haha, of course, ok so I got this so far: 

    CREATE VIEW stationtriplength AS
    SELECT trips.start_station_code, trips.duration_sec, stations.name
    FROM trips JOIN stations ON trips.start_station_code = stations.code 

    ---to get the information I need from both tables into a view
    THEN: 

    Untitled SELECT AVG(duration_sec) trip_length, name
    FROM stationlengthtrip
    GROUP BY name
    --- to get the average trip duration 

    From here I answered which station has the longest and shortest trip durations, that part I figured out by using:

    Untitled SELECT AVG(duration_sec) trip_length, name
    FROM stationlengthtrip
    GROUP BY name
    ORDER BY AVG(duration_sec) DESC 
    LIMIT 1;

    The only part I cannot get is how to do the round trips

  • How do you know which of those trips had a round trip? You are only bringing the start point and how long it takes, unless I am missing something.

  • astrid 69000 - Thursday, January 17, 2019 1:24 PM

    How do you know which of those trips had a round trip? You are only bringing the start point and how long it takes, unless I am missing something.

    I think the length of trip is duration not distance travelled.
    The table has StartTime, EndTime and StationId.
    So the calculation of a trip length (in time) can be done from one row.

  • Ah yeah, I missed, there is an end_station_code as well,

  • and yeah the curation of the trip is time, but there is a start station code and an end station code for where the bikes were picked up and where they were returned. Would that change anything I did above?

  • paige.miller91 - Thursday, January 17, 2019 1:35 PM

    and yeah the curation of the trip is time, but there is a start station code and an end station code for where the bikes were picked up and where they were returned. Would that change anything I did above?

    So if depends what "Which station has the longest trips on average?" means.
    If the length of a trip from Station A to Station B is included as a statistic for Station A as well as Station B then it will make a difference.

  • I also think we are missing information, are you checking trip per user? because you talk about trips, but you need an identifier per trip
    the trip is from point a to b but 10 people can ride from a to b and you.
    still confusing for me. can you show a sample of the data and exactly what are you trying to get?

  • I'm doing a lot of guessing here; moving forward - DDL and sample data will help you get the best answer.  Let's start with this:

    USE tempdb
    GO

    IF OBJECT_ID('dbo.trip')  IS NOT NULL DROP TABLE dbo.trip;
    IF OBJECT_ID('dbo.station') IS NOT NULL DROP TABLE dbo.station;

    CREATE TABLE dbo.station
    (
    stationId INT PRIMARY KEY CLUSTERED,
    stationInfo VARCHAR(100)
    );

    CREATE TABLE dbo.trip
    (
    tripId  INT IDENTITY PRIMARY KEY NONCLUSTERED,
    starttime DATETIME,
    endtime DATETIME,
    stationId INT,
    CONSTRAINT fk_trip__station FOREIGN KEY (stationID) REFERENCES dbo.station(stationId)
    );

    INSERT dbo.station (stationId, stationInfo)
    VALUES (1, 'Greenville'), (2,'BooBerg'), (3,'ElseWhere');

    INSERT dbo.trip (starttime, endtime, stationid)
    VALUES
    (GETDATE()-50.5,GETDATE()-50,1), (GETDATE()-100.775,GETDATE()-100,1),
    (GETDATE()-60.85,GETDATE()-60.77,2), (GETDATE()-60.44,GETDATE()-60,2),
    (GETDATE()-10.76,GETDATE()-10.7,3), (GETDATE()-11,GETDATE()-9.7,3),
    (GETDATE()-11,GETDATE()-10,3), (GETDATE()-11.004,GETDATE()-11,3);

    Then you could just do this:
    SELECT s.stationInfo, AvgMinutes = AVG(f.tm)
    FROM dbo.trip AS t
    CROSS APPLY (VALUES(DATEDIFF(MINUTE,t.starttime,t.endtime))) AS f(tm)
    JOIN dbo.station AS s ON t.stationId = s.stationId
    GROUP BY s.stationInfo;

    Extremely long / short trips can skew your results. How would avoid that?

    For this you could add a couple parameters to define an "Extremely long/short trip". For my example short = less than 10 min, long = more than a day. This will filter out the "short/long" trips prior to aggregation.

    DECLARE @min-2 INT = 10, @max-2 INT = 1440; -- minutes; 1440 = 1 day

    SELECT s.stationInfo, AvgMinutes = AVG(f.tm)
    FROM dbo.trip AS t
    CROSS APPLY (VALUES(DATEDIFF(MINUTE,t.starttime,t.endtime))) AS f(tm)
    JOIN dbo.station AS s ON t.stationId = s.stationId
    WHERE f.tm > @min-2 AND f.tm < @max-2
    GROUP BY s.stationInfo;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Would you mind saving us the trouble of reading your mind and posting some DDL? It would also help if we had some sample data to work with.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Here an extended version of  Alan Burstein's example.

    A End station is added.
    So roundtrips can be handled distinctly.
    More examples are added to show the build in increments.


    --
    -- Example provided by Alan Burstein.
    --
    -- Extended
    -- 20190124
    -- Ben Brugman

    USE tempdb
    GO

    IF OBJECT_ID('dbo.trip') IS NOT NULL DROP TABLE dbo.trip;
    IF OBJECT_ID('dbo.station') IS NOT NULL DROP TABLE dbo.station;

    CREATE TABLE dbo.station
    (
    stationId INT PRIMARY KEY CLUSTERED,
    stationInfo VARCHAR(100)
    );

    CREATE TABLE dbo.trip
    (
    tripId INT IDENTITY PRIMARY KEY NONCLUSTERED,
    starttime DATETIME,
    endtime DATETIME,
    stationId INT,
    end_stationid INT,
    CONSTRAINT fk_trip__station FOREIGN KEY (stationID) REFERENCES dbo.station(stationId)
    ,CONSTRAINT fk_trip__end_station FOREIGN KEY (end_stationID) REFERENCES dbo.station(stationId)
    );

    -- Values for 3 stations.
    INSERT dbo.station (stationId, stationInfo)
    VALUES (1, 'Greenville'), (2,'BooBerg'), (3,'ElseWhere');

    -- Values for a number of trips.
    INSERT dbo.trip (starttime, endtime, stationid, end_stationid)
    VALUES
    (GETDATE()-50.5,GETDATE()-50  ,1, 3),
    (GETDATE()-100.775,GETDATE()-100,1, 2),
    (GETDATE()-60.85,GETDATE()-60.77,2, 1),
    (GETDATE()-60.44,GETDATE()-60 ,2, 3),
    (GETDATE()-10.76,GETDATE()-10.7 ,3, 2),
    (GETDATE()-11,GETDATE()-9.7  ,3, 1),
    (GETDATE()-11,GETDATE()-10   ,3,3),
    (GETDATE()-11.004,GETDATE()-11 ,3,2),
    (GETDATE()-11.1,GETDATE()-11  ,3,3);

    --
    -- Show the data in the two tables.
    --
    select * from trip
    select * from station

    --
    -- Show the timedifference in minutes for each trip.
    --
    select DATEDIFF(MINUTE, starttime, endtime) minutes,* from trip

    --
    -- Show the AVG, Min, Max for each station.
    -- Remark:
    --   The stationid field is used in the GROUP BY
    -- Remark:
    --   Trips of less then 10 minutes and of more than a day are excluded.
    --
    select
      AVG (DATEDIFF(MINUTE, starttime, endtime)) AVG_minutes
     , MIN (DATEDIFF(MINUTE, starttime, endtime)) min_minutes
     , MAX (DATEDIFF(MINUTE, starttime, endtime)) max_minutes
     , stationid from trip
    WHERE DATEDIFF(MINUTE, starttime, endtime) > 10
      AND
      DATEDIFF(MINUTE, starttime, endtime) < 60*24
    GROUP BY stationid

    --
    -- Get the 'name' of the station with the above query.
    --
    select
      AVG (DATEDIFF(MINUTE, starttime, endtime)) AVG_minutes
     , MIN (DATEDIFF(MINUTE, starttime, endtime)) min_minutes
     , MAX (DATEDIFF(MINUTE, starttime, endtime)) max_minutes
     , trip.stationid
     , stationinfo
    from trip join station on trip.stationid = station.stationid
    WHERE DATEDIFF(MINUTE, starttime, endtime) > 10
      AND
      DATEDIFF(MINUTE, starttime, endtime) < 60*24
    GROUP BY trip.stationid, stationinfo

    --
    -- Roundtrips. begin and startstation are the same.
    --
    select
      AVG (DATEDIFF(MINUTE, starttime, endtime)) AVG_minutes
     , MIN (DATEDIFF(MINUTE, starttime, endtime)) min_minutes
     , MAX (DATEDIFF(MINUTE, starttime, endtime)) max_minutes
     , trip.stationid
     , stationinfo
    from trip join station on trip.stationid = station.stationid
    WHERE DATEDIFF(MINUTE, starttime, endtime) > 10
      AND
      DATEDIFF(MINUTE, starttime, endtime) < 60*24
      AND
      trip.stationId = trip.end_stationid
    GROUP BY trip.stationid, stationinfo

    --
    -- Reuse the difference calculation several times.
    -- Use a 'WITH' of CTE (Common Table Expression) construction for this.
    -- A calculate the Delta time and resolve the stationinfo.
    -- B Select the correct Delta times.
    -- C Get the Average, Min and Max values of Delta.
    -- (Roundtrips are still included)
    -- A, B, C can be seen as temporary views. (Only during this statement).
    --
    ;
    WITH
    A AS (SELECT DATEDIFF(MINUTE, starttime, endtime) Delta, stationinfo FROM trip join station on trip.stationId = station.stationId)
    ,B as (SELECT * FROM A WHERE Delta >= 10 AND Delta < 24*60)
    ,C as (SELECT AVG(Delta) AVG_Delta, MIN(Delta) Min_Delta, MAX(Delta) Max_Delta, stationinfo
       FROM B GROUP BY stationinfo)
    SELECT * FROM C

    --
    -- CleanUp. 
    --
    IF OBJECT_ID('dbo.trip') IS NOT NULL DROP TABLE dbo.trip;
    IF OBJECT_ID('dbo.station') IS NOT NULL DROP TABLE dbo.station;

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply