Query running slow across linked server

  • Hi all

    Firstly, apologies for the lengthy post, but I've tried to include everything I thought would be useful.

    I've got a query and I need to use a linked server to get the data I need.

    The query looks like this:-
    SELECT
        --scbs.SourceID
        PL.[pkTheatreID]
        ,PL.[TheatreName]
        ,PL.[pkStartDate]
        ,PL.[pkStartTime]
        ,PL.[EndDate]
        ,PL.[EndTime]
        ,PL.[SessionStatusID]
        ,PL.[SessionStatusDescription]
        ,PL.[CaseTypeID]
        ,PL.[CaseTypeName]
        ,PL.[SessionAnaesthetistID]
        ,PL.[SessionConsultantID]
        ,PL.[SessionSurgeonID]
        ,PL.[SpecialtyDescriptionID]
        ,PL.[SpecialtyDescription]
        ,IsDeleted = 1
        ,SYSDateLastUpdated = PL.SYSDateLastUpdated
        ,SYSDateLoaded = PL.SYSDateLoaded
        ,SYSIsDuplicate = 0
        ,SYSSourceSystem = 'Meditech'
    FROM
        DataWarehouseStaging.dbo.tbl_APC_Theatres_Sessions PL
        LEFT JOIN [MEDITECHDR01-M1].[livedb_daily].[dbo].[SchCalendarBasicSchedule] scbs
            ON PL.[pkTheatreID] = scbs.ResourceID
            AND CAST(PL.[pkStartDate] AS DATETIME) + CAST(PL.[pkStartTime] AS DATETIME) = scbs.[BasicStartDateTime]
            AND scbs.DateTime = CAST(PL.[pkStartDate] AS DATETIME)
            AND scbs.SourceID = 'RFT'
    WHERE
        scbs.ResourceID IS NULL
        AND PL.IsDeleted = 0

    The query looks for data that has been removed from the source database but is still on the target and needs to be marked as removed.
    I've mentioned to the software provider for the source database that data should never be deleted (just marked as such) but apparently, this is "by design".

    The Theatres sessions table looks like this:-
    CREATE TABLE [dbo].[tbl_APC_Theatres_Sessions](
        [pkSYSSourceSystem] [varchar](8) NOT NULL,
        [pkTheatreID] [varchar](10) NOT NULL,
        [TheatreName] [varchar](30) NULL,
        [pkStartDate] [date] NOT NULL,
        [pkStartTime] [time](7) NOT NULL,
        [EndDate] [date] NULL,
        [EndTime] [time](7) NULL,
        [SessionStatusID] [varchar](15) NULL,
        [SessionStatusDescription] [varchar](40) NULL,
        [CaseTypeID] [varchar](2) NULL,
        [CaseTypeName] [varchar](15) NULL,
        [SessionAnaesthetistID] [varchar](15) NULL,
        [SessionConsultantID] [varchar](15) NULL,
        [SessionSurgeonID] [varchar](15) NULL,
        [SpecialtyDescriptionID] [varchar](15) NULL,
        [SpecialtyDescription] [varchar](40) NULL,
        [IsDeleted] [bit] NULL,
        [SYSDateLastUpdated] [datetime] NULL,
        [SYSDateLoaded] [datetime] NOT NULL,
    CONSTRAINT [PK_tbl_APC_Theatres_Sessions] PRIMARY KEY CLUSTERED
    (
        [pkSYSSourceSystem] ASC,
        [pkTheatreID] ASC,
        [pkStartDate] ASC,
        [pkStartTime] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
    ) ON [PRIMARY]

    and the Calendar table looks like this:-
    CREATE TABLE [dbo].[SchCalendarBasicSchedule](
        [SourceID] [varchar](3) NOT NULL,
        [ResourceID] [varchar](15) NOT NULL,
        [DateTime] [datetime] NOT NULL,
        [BasicStartDateTime] [datetime] NOT NULL,
        [EndDate] [varchar](8) NULL,
        [EndTime] [varchar](4) NULL,
        [GroupDescription] [varchar](30) NULL,
        [GroupID] [varchar](40) NULL,
        [Capacity] [int] NULL,
        [ProfileFromTime] [varchar](4) NULL,
        [Profile] [varchar](1) NULL,
        [ProfileRelease] [varchar](4) NULL,
        [ProfileMonogram] [varchar](15) NULL,
        [ResourceGroupLocation] [varchar](10) NULL,
        [ResourceGroupLocationGlDept] [varchar](15) NULL,
        [ResourceGroupLocationName] [varchar](30) NULL,
        [ResourceName] [varchar](30) NULL,
        [RowUpdateDateTime] [datetime] NULL,
        [ProfileIncludeExclude] [varchar](11) NULL,
    CONSTRAINT [mtpk_scalbasc] PRIMARY KEY CLUSTERED
    (
        [SourceID] ASC,
        [ResourceID] ASC,
        [DateTime] ASC,
        [BasicStartDateTime] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    Now, if I run the query as posted, I end up stopping it after 10 minutes because I get bored and it still hasn't returned a single record!

    If I switch the query so the linked server goes the other way (so the FROM section looks like this):-

    FROM
        [DW-DEV].DataWarehouseStaging.dbo.tbl_APC_Theatres_Sessions PL
        LEFT JOIN [livedb_daily].[dbo].[SchCalendarBasicSchedule] scbs
            ON PL.[pkTheatreID] = scbs.ResourceID
            AND CAST(PL.[pkStartDate] AS DATETIME) + CAST(PL.[pkStartTime] AS DATETIME) = scbs.[BasicStartDateTime]
            AND scbs.DateTime = CAST(PL.[pkStartDate] AS DATETIME)
            AND scbs.SourceID = 'RFT'

    the query runs in a few seconds and returns the expected results.

    I've made sure I'm using the primary key on both tables.

    A bit more info:-
    Server DW-DEV - SQL Server 2016 Developer edition - table tbl_APC_Theatres_Sessions contains 42,355 records
    Server MEDITECHDR01-M1 - SQL Server 2012 Enterprise - table SchCalendarBasicSchedule contains 29,734,600 records

    The linked server from DW-DEV to MEDITECHDR01-M1 is set up using SQL Server (as the option in the Linked server properties).
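
    Something along these lines is roughly how a "SQL Server"-type linked server gets defined (a sketch only - the actual security mapping here may well differ):

    -- Rough sketch of a "SQL Server"-type linked server definition.
    EXEC sp_addlinkedserver
        @server = N'MEDITECHDR01-M1',
        @srvproduct = N'SQL Server';

    EXEC sp_addlinkedsrvlogin
        @rmtsrvname = N'MEDITECHDR01-M1',
        @useself = N'TRUE';   -- connect using the caller's own credentials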

    Two questions:-
    1) Why is there such a big difference in run times?
    2) How can I speed it up?

    Any help gratefully received.

  • Take a quick glance at your record counts. Your local server has to pull ALL the records from the linked server across the network in order to perform that join. When you ran the query the other way round, on the remote server, it only had to pull 42,355 records across the network, and that's a LOT less data. Any time you run a query that joins a linked server table to a local table, the entire table on the linked server typically gets brought across the network to the local server before anything else happens, and that's the problem.

    Given your query, there is an opportunity to reduce the number of rows coming across by using OPENQUERY: select only the fields you need from the remote table, and move any conditions from your original query that only touch linked server columns into the OPENQUERY's WHERE clause. If I've deduced correctly, that would be scbs.SourceID = 'RFT'. You could also break the remote date/time values out into separate date and time parts, to match the separate fields on the local server, by doing that in the OPENQUERY's SELECT statement as well.

    The only other alternative is to do all the work on the remote server, as that pulls a LOT less data across the network. Also, be sure not to use SELECT * in the OPENQUERY, as that brings across every field when you don't necessarily need them all. SELECT only the ones you actually need for the query, including the fields being joined on.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Here's how you might use OPENQUERY:

    SELECT
      --scbs.SourceID
      PL.[pkTheatreID]
      ,PL.[TheatreName]
      ,PL.[pkStartDate]
      ,PL.[pkStartTime]
      ,PL.[EndDate]
      ,PL.[EndTime]
      ,PL.[SessionStatusID]
      ,PL.[SessionStatusDescription]
      ,PL.[CaseTypeID]
      ,PL.[CaseTypeName]
      ,PL.[SessionAnaesthetistID]
      ,PL.[SessionConsultantID]
      ,PL.[SessionSurgeonID]
      ,PL.[SpecialtyDescriptionID]
      ,PL.[SpecialtyDescription]
      ,IsDeleted = 1
      ,SYSDateLastUpdated = PL.SYSDateLastUpdated
      ,SYSDateLoaded = PL.SYSDateLoaded
      ,SYSIsDuplicate = 0
      ,SYSSourceSystem = 'Meditech'
    FROM
        DataWarehouseStaging.dbo.tbl_APC_Theatres_Sessions AS PL
            LEFT OUTER JOIN
                OPENQUERY([MEDITECHDR01-M1],
                    'SELECT SourceID, ResourceID, BasicStartDateTime, [DateTime]
                    FROM [livedb_daily].[dbo].[SchCalendarBasicSchedule]
                    WHERE SourceID = ''RFT'';') AS scbs
                ON PL.[pkTheatreID] = scbs.ResourceID
                AND CAST(PL.[pkStartDate] AS datetime) + CAST(PL.[pkStartTime] AS datetime) = scbs.[BasicStartDateTime]
                AND CAST(PL.[pkStartDate] AS datetime) = scbs.[DateTime]
    WHERE
        scbs.ResourceID IS NULL
        AND PL.IsDeleted = 0;

    If you want to know how much data is going to traverse the network, run the query that's inside the OPENQUERY directly on the remote server and see how much data it retrieves. This may still be too much to improve things all that much, but it might at least help some. Again, hard to know for sure, but just taking 4 fields instead of 19 should make a difference. I included SourceID just so you could uncomment that field in the SELECT list of your original query. Ask questions if there's anything you need help with.
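
    For example, running something like this directly on MEDITECHDR01-M1 would show the row counts and reads for exactly what the OPENQUERY has to send across (a sketch using the same SELECT as above):

    -- Run on MEDITECHDR01-M1: row counts and reads for the OPENQUERY text,
    -- i.e. the data that would have to travel across the network.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT SourceID, ResourceID, BasicStartDateTime, [DateTime]
    FROM [livedb_daily].[dbo].[SchCalendarBasicSchedule]
    WHERE SourceID = 'RFT';

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;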

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • If you get the execution plans for both versions (even just the estimated ones will suffice) you will most likely see the reason for it.
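
    If it helps, one way to grab just the estimated plans from T-SQL (a sketch - in SSMS the "Display Estimated Execution Plan" button does the same thing) is:

    -- Capture the estimated plan only; the batch between the GOs is compiled,
    -- not executed.
    SET SHOWPLAN_XML ON;
    GO

    -- paste either version of the query here

    GO
    SET SHOWPLAN_XML OFF;
    GO

    In the slow version you would typically expect to see a Remote Query or Remote Scan operator pulling the whole remote table across before the join happens.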

    If I've got this right, you have two ways of running the query, as follows:

    1 - executed on MEDITECHDR01-M1
    - this pulls the data for the smaller table, tbl_APC_Theatres_Sessions, into a local table and then does the join to the big huge bad table.

    2 - executed on DW-DEV
    - this has 2 options:
      a - pull all of the big table's records into a local table and then do the join
      b - execute a filtered query on the remote server for each record in tbl_APC_Theatres_Sessions and copy the results to a local table

    and in either case tbl_APC_Theatres_Sessions and the temp table from option a/b above are then joined to get the final results.

    There are a few more variations on the above, but this should be enough to highlight the issue.

    Really, joining linked server tables to local tables should be avoided - especially if the remote side is quite big.

    To speed it up - don't use the linked server for the join; do it all on the same server, and if required use a tool to copy the smaller table to the remote server and then do the join there.
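
    If you went the copy-to-remote route, the copy itself could be as simple as something like this (a sketch only - work_TheatreSessions is a made-up work table that would need creating on the remote server first):

    -- Push the small local table to a (hypothetical) work table on the remote
    -- server, so the join can then run entirely on MEDITECHDR01-M1.
    INSERT INTO [MEDITECHDR01-M1].[livedb_daily].[dbo].[work_TheatreSessions]
        (pkTheatreID, pkStartDate, pkStartTime)
    SELECT pkTheatreID, pkStartDate, pkStartTime
    FROM DataWarehouseStaging.dbo.tbl_APC_Theatres_Sessions
    WHERE IsDeleted = 0;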

  • The fundamental problem you're dealing with is that data required for relational joins is split across multiple servers. I'd suggest replicating or ETL copying the table [SchCalendarBasicSchedule] (and any other remote tables) locally to your datawarehouse server on a periodic schedule. You need only copy those specific columns and rowsets required. Not only would it prevent remote joins, but you could then index the tables however best facilitates your DW needs.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Steve - Brilliant, thank you.  That query now runs in under 5 minutes (which is fine for now).  I'd never even considered OPENQUERY to do that bit.

    Eric - That's what we're trying to do but we've hit various stumbling blocks with replication (and other methods).  Our current plan is to use log shipping to move data from the original server to a central point and then use replication to "farm" the data out to where we need it.  Complicated (with too many failure points for my liking) but it should work.

    Thanks once again folks.

  • richardmgreen1 - Friday, September 22, 2017 2:48 AM

    ..
    Eric - That's what we're trying to do but we've hit various stumbling blocks with replication (and other methods).  Our current plan is to use log shipping to move data from the original server to a central point and then use replication to "farm" the data out to where we need it.  Complicated (with too many failure points for my liking) but it should work.
    ...

    Something like replication or log shipping isn't really needed; something as simple as an SSIS package or a T-SQL INSERT... SELECT... statement that copies rows from the remote source table into a local table will suffice.
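
    For example, something along these lines run on DW-DEV and scheduled as often as the reporting needs (a sketch only - Staging_SchCalendarBasicSchedule is a made-up local table name):

    -- Refresh a local copy of just the columns/rows the deletion check needs.
    -- Staging_SchCalendarBasicSchedule is a hypothetical local staging table.
    TRUNCATE TABLE dbo.Staging_SchCalendarBasicSchedule;

    INSERT INTO dbo.Staging_SchCalendarBasicSchedule
        (SourceID, ResourceID, [DateTime], BasicStartDateTime)
    SELECT SourceID, ResourceID, [DateTime], BasicStartDateTime
    FROM OPENQUERY([MEDITECHDR01-M1],
        'SELECT SourceID, ResourceID, [DateTime], BasicStartDateTime
         FROM [livedb_daily].[dbo].[SchCalendarBasicSchedule]
         WHERE SourceID = ''RFT'';');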

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • We could do that, but we need to move a lot of data across to the new server.
    The powers-that-be want near-real-time reporting and we aren't allowed to run too much on the original server (apparently it slows other stuff down too much).

    The 3rd-party software vendor throws a wobbly when we try to set up replication (it upsets a few of their processes apparently) so we're trying to figure out ways round that little lot.

  • richardmgreen1 - Friday, September 22, 2017 7:27 AM

    We could do that, but we need to move a lot of data across to the new server.
    The powers-that-be want near-real-time reporting and we aren't allowed to run too much on the original server (apparently it slows other stuff down too much).

    The 3rd-party software vendor throws a wobbly when we try to set up replication (it upsets a few of their processes apparently) so we're trying to figure out ways round that little lot.

    The thing is that the data gets moved between servers regardless of whether you load it once daily on a schedule you choose, or multiple times daily at random intervals every time someone executes a distributed query. A distributed query is essentially an ad-hoc ETL process.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Very true.
    Like I said, we wanted to use replication to do the transfers real-time (or near enough real-time).  This would then get us away from going cross-server completely.

    The current plan is to use log-shipping to move data from server A to server B and then use replication to move the data from server B to servers C and D.

    We need to check with our IT department that this approach won't upset our backup solution (another 3rd-party application) too much.

  • richardmgreen1 - Monday, September 25, 2017 8:06 AM

    Very true.
    Like I said, we wanted to use replication to do the transfers real-time (or near enough real-time).  This would then get us away from going cross-server completely.

    The current plan is to use log-shipping to move data from server A to server B and then use replication to move the data from server B to servers C and D.

    We need to check with our IT department that this approach won't upset our backup solution (another 3rd-party application) too much.

    Seems like there's a lot of "wobbly" with your 3rd party partners... Might be time to negotiate better solutions from the wobblers... After all, you may be able to take your business elsewhere if they can't start providing better service...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • A quick question:  How often does the data that will be retrieved from the linked server change? 

    Would it be practical (or allowed) to create a materialised view against the linked server that could be stored locally? The view could be refreshed on a schedule (maybe daily) and all the processing could be done locally.

  • The data on server A changes continuously and we need to have near-real-time reporting from it.
    I'm not sure what a materialised view is (never heard of that one) so I don't know.

  • richardmgreen1 - Monday, September 25, 2017 8:24 AM

    The data on server A changes continuously and we need to have near-real-time reporting from it.
    I'm not sure what a materialised view is (never heard of that one) so I don't know.

    I got my Oracle and SQL Server mixed up..... it is an indexed view in SQL Server....

    If a view has a clustered index on it, it persists to disk. My thought was that if it was a slowly changing table, the view could be refreshed on a schedule. The fact that the data changes continuously and you need real-time reporting shoots that plan down in flames....
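
    For reference, an indexed view is just an ordinary view created WITH SCHEMABINDING that then gets a unique clustered index, something like this sketch against your local staging table (local tables only, which is part of why it doesn't help here):

    -- Sketch only: an indexed view over a local table. Indexed views can't
    -- reference linked servers, so this illustrates the concept rather than
    -- solving the cross-server problem.
    CREATE VIEW dbo.vw_CurrentTheatreSessions
    WITH SCHEMABINDING
    AS
    SELECT pkSYSSourceSystem, pkTheatreID, pkStartDate, pkStartTime, TheatreName
    FROM dbo.tbl_APC_Theatres_Sessions
    WHERE IsDeleted = 0;
    GO

    -- The unique clustered index is what makes the view's data persist to disk.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_CurrentTheatreSessions
        ON dbo.vw_CurrentTheatreSessions (pkSYSSourceSystem, pkTheatreID, pkStartDate, pkStartTime);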

  • kevaburg - Monday, September 25, 2017 8:30 AM

    richardmgreen1 - Monday, September 25, 2017 8:24 AM

    The data on server A changes continuously and we need to have near-real-time reporting from it.
    I'm not sure what a materialised view is (never heard of that one) so I don't know.

    I got my Oracle and SQL Server mixed up..... it is an indexed view in SQL Server....

    If a view has a clustered index on it, it persists to disk. My thought was that if it was a slowly changing table, the view could be refreshed on a schedule. The fact that the data changes continuously and you need real-time reporting shoots that plan down in flames....

    SQL Server indexed views can only reference tables contained locally within the same database as the view. There is a reason for that limitation; the I/O and network overhead of pushing inserts/updates/deletes from the remote tables to the indexed view would be significant.

    https://docs.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views

    But it seems to me that simply persisting a distributed query to a local staging table on a schedule would achieve the same result (possibly even better results) as creating an indexed view, were that option possible. Essentially, all we're talking about here is the need to periodically stage data from the remote database locally on the data warehouse server.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
