SQL query performance question

  • I am working on a small research team that has been given access to a set of SQL Server 2008 tables via some views. We can do select queries using those views, and I can see the names of the tables using the Design command when one of the views is selected in SQL Server Management Studio, but we don't have permission to look at the table columns and indexes directly, much less change them in any way.

    The database contains electronic medical records. My task is to populate a Lucene index from the contents of the free text fields in the database. I have been using Solr and Lucene to do the indexing, and JDBC to access the database. For our experiments, we are indexing the documents of randomly selected sets of patients. From the views that we have access to, I can get patients' ID numbers and the documents associated with those patients. Each document may contain a variable number of addendum documents and may contain a variable number of diagnosis records.

    In my first attempt to access these records, I modified some sample code from a Solr web site (http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example) that uses nested SQL selects in a configuration file to obtain records of items for sale, their descriptions, and their categories. I figured I could do the same thing: an outer select that obtains the patient and main document IDs and other main document information, and two inner queries at the same level, one that obtains the addendum information associated with a given document ID and one that obtains the diagnosis information associated with that document ID.

    The data sets we are working with range in size from 1,000 patients to 500,000 patients, and the patients have on average about 100 main documents each. So for a data set of 100,000 patients, one select statement was run to get the patient and main document info, 100,000 select statements were run to get the addendum info, and 100,000 select statements were run to get the diagnosis info. This ran extremely slowly. I don't know whether the problem was the sheer number of selects, each of which returned a small number of rows, or whether some additional work is needed to make the selects execute faster. A database admin looked at logs of the queries I was running and said "each [query] is being recompiled with a new query plan being generated for each one". Unfortunately, both the databases and the administrators are located over a thousand miles from us, and we don't get many chances to talk with them.
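
    If the recompilation is the problem, then presumably the per-document selects need to be sent as parameterized statements so that one plan can be reused; as far as I can tell, the data import handler substitutes the document id into the inner queries as a literal (that is what the ${...} placeholders in the sample config appear to do), which would give a different statement text, and hence a fresh compile, every time. A minimal sketch of what I mean, with table and column names borrowed from the query below and therefore purely illustrative:

    EXEC sp_executesql
        N'SELECT DocID, RefDate, Report
          FROM DocumentView
          WHERE ParentID = @DocID',
        N'@DocID int',       -- assuming the id is an int; the real type may differ
        @DocID = 12345;      -- hypothetical document id

    From JDBC, a PreparedStatement with a ? placeholder should have the same effect, rather than building each statement as a literal string.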

    I changed the way I was accessing the database to use a single query that gets all of the patient, document, addendum and diagnosis info. I needed to write some Java code to remove duplicates for some documents, but that wasn't difficult. This query runs much faster than the nested ones I had been using, but it often takes a long time for the first rows of data to be returned. In a recent run with data sets containing about 25,000 patients each (about 1,600,000 SQL rows each), it took at least 40 minutes and up to 5 hours for the first rows of the query to be returned. After that, the rows came in fast enough for our purposes. I don't know whether the query I am using is at fault or whether the tables underlying the views lack the structure and indexes that would allow the query to run as efficiently as possible. The query I am using looks like this:

    SELECT main.PatID, main.DocID, main.Site, main.RefDate, main.Report,
           ad.DocID AS AddID, ad.RefDate AS AddRefDate, ad.Report AS AddReport,
           dx.Code, dx.Text
    FROM (SELECT pat1.PatID, pat1.DocID, pat1.Site, doc.RefDate, doc.Report
          FROM PatMapTable AS pat1 WITH (NoLock)
          INNER JOIN DocumentView AS doc WITH (NoLock)
                  ON doc.DocID = pat1.DocID) AS main
    LEFT OUTER JOIN DocumentView AS ad WITH (NoLock)
                 ON main.DocID = ad.ParentID
    LEFT OUTER JOIN (SELECT pat2.DocID, diag.Code, diag.Text
                     FROM PatMapTable AS pat2 WITH (NoLock)
                     INNER JOIN DiagnosisView AS diag WITH (NoLock)
                             ON pat2.Site = diag.Site AND pat2.VisitID = diag.VisitID) AS dx
                 ON main.DocID = dx.DocID
    ORDER BY main.DocID, AddID, dx.Code

    Are there log files I could access that could indicate whether the database tables could be improved or whether there are problems with this select statement? Is there a different approach that would work better? I would appreciate any advice anyone can offer about this.

    Thanks,

    Mike

  • The first thing that I would do is improve the readability of your query (slightly) by rewriting it as:

    ;WITH main AS (
        SELECT pat1.PatID, pat1.DocID, pat1.Site, doc.RefDate, doc.Report
        FROM PatMapTable AS pat1 WITH (NoLock)
        INNER JOIN DocumentView AS doc WITH (NoLock)
                ON doc.DocID = pat1.DocID
    ),
    dx AS (
        SELECT pat2.DocID, diag.Code, diag.Text
        FROM PatMapTable AS pat2 WITH (NoLock)
        INNER JOIN DiagnosisView AS diag WITH (NoLock)
                ON pat2.Site = diag.Site AND pat2.VisitID = diag.VisitID
    )
    SELECT main.PatID, main.DocID, main.Site, main.RefDate, main.Report,
           ad.DocID AS AddID, ad.RefDate AS AddRefDate, ad.Report AS AddReport,
           dx.Code, dx.Text
    FROM main
    LEFT OUTER JOIN DocumentView AS ad WITH (NoLock)
                 ON main.DocID = ad.ParentID
    LEFT OUTER JOIN dx
                 ON main.DocID = dx.DocID
    ORDER BY main.DocID, AddID, dx.Code

    That won't help with performance though.

    The first step would be to look at the actual execution plan (menu: Query / Include Actual Execution Plan) and check to see if your tables are being scanned or if indexes are being scanned. I suspect the former, in which case the tables lack the indexing you need to speed up your queries.
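
    If it's easier to capture from a script than from the menu, SET STATISTICS XML returns the actual plan as an extra XML result set; a quick sketch (the view name is just a stand-in for your real query):

    SET STATISTICS XML ON;   -- actual plan comes back as an extra XML result set

    SELECT TOP (10) DocID, RefDate
    FROM DocumentView;       -- substitute the query you are tuning

    SET STATISTICS XML OFF;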

    If you post that execution plan, folks may be able to look at it and make further suggestions.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • A couple of query execution plan files are attached. The one for the query I am using is called SingleQueryExecutionPlan.sqlplan. It shows 99% of the cost assigned to one sort node near the end of the processing. The second file, called SingleQueryExecutionPlanNoSort.sqlplan is from the same query, but with the ORDER BY clause removed, to see which other nodes have a non-zero cost assigned in that case. I need to have the results sorted, so removing the ORDER BY clause is not an option in practice. Is there a way to write the query so sorting the results is not so expensive? Could you also please tell me which of the other nodes or sequences of nodes indicate inefficiently written parts of the query?

    Thanks,

    Mike

    Well, you're trying to sort 1.6 million rows. To do that "quickly" (several seconds) requires a substantial amount of memory, depending on how big these rows are. If you do not have enough memory then it has to do a memory+disk sort, which is substantially slower (minutes to hours).
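
    If your permissions allow querying the DMVs (they may not, given your restricted access), you can see how much workspace memory the sort was granted while the query is running; a sketch, which requires VIEW SERVER STATE:

    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb
    FROM sys.dm_exec_query_memory_grants;   -- one row per query waiting for or holding a memory grant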

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • The second thing I'd do is take the nolock hints out. See - http://blogs.msdn.com/b/davidlean/archive/2009/04/06/sql-server-nolock-hint-other-poor-ideas.aspx

    If the sort has to be there and is the most expensive portion, there's not really much (other than more hardware) that can be done. You can't leverage indexes for the sort because the ORDER BY columns come from three different tables.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
    Hmm, the query plan looks odd to me. The relationship between estimated and actual row counts suggests that things could be improved, but what's really surprising to me is that the cost estimate of the Sort seems to be based on the preceding row count estimate (36M), which is way off the actual (0.26M).

    Since that estimate is off by more than two orders of magnitude, I think the 98% cost estimate for the Sort should be taken with a huge grain of salt.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

    Hmm, the estimates in the no-ORDER BY version are really bad too. I wonder why that is?

    Could you supply the DDL for the tables, with keys and indexes, please? If there are a lot of unused columns, you can just include the columns that are used in this query. Thanks.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

    I see you're also using OPTION (FAST 1). I thought that didn't work anymore?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

    It's FASTFIRSTROW that no longer works; the replacement is FAST 1.

    It's worth noting that the FAST 1 hint will result in the optimiser estimating and optimising for 1 row only. The FAST hint is a row goal; it tells the optimiser how many rows it needs to plan for. That's why both queries have completely incorrect row estimations.
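
    For reference, the hint sits at the end of the statement like this (table and column names are illustrative only):

    SELECT DocID, RefDate
    FROM DocumentView
    ORDER BY DocID
    OPTION (FAST 1);   -- row goal: optimise for getting the first row back quickly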

    I didn't see a hint in the initial query, which is why I didn't mention it, but FAST 1 may very well cause the first row to be returned faster, with the side effect that the entire query takes much longer than it would without the hint.

    Ok, correcting my earlier statement. First thing I'd do is take off the nolock hint. Second thing I'd do is take off the FAST 1 query hint. Get some sensible estimations and see how the plans change and how the performance changes.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Can you show us what the query/execution plan from this looks like:

    SELECT pat2.TIUDocumentSID, icd.ICDCode, icd.DiagnosisText
    INTO #d
    FROM [VHA19\VHASLCOlearT].PatientMap_1K_patients AS pat2
    INNER JOIN Src.Visit_IDs AS vis
            ON pat2.Sta3n = vis.Sta3n AND pat2.VisitIEN = vis.VisitIEN
    INNER JOIN Src.VDiagnosis AS diag
            ON vis.VisitSID = diag.VisitSID
    INNER JOIN [VHA20\VHAPUGHAMMOK].ICD AS icd
            ON diag.ICDSID = icd.ICDSID;

    WITH main AS (
        SELECT pat1.TIUDocumentSID, pat1.ScrSSN, pat1.Sta3n, doc.ReferenceDate,
               doc.VHAEnterpriseStandardTitle, doc.ReportText, doc.TIUDocumentIEN
        FROM [VHA19\VHASLCOlearT].PatientMap_1K_patients AS pat1
        JOIN [Src].TIUDocument_8925_New AS doc ON doc.TIUDocumentSID = pat1.TIUDocumentSID
    ),
    d AS (
        SELECT * FROM #d
    )
    SELECT main.TIUDocumentSID, main.ScrSSN, main.Sta3n, main.ReferenceDate,
           main.VHAEnterpriseStandardTitle, main.ReportText,
           ad.TIUDocumentSID AS TIUAddendumSID, ad.ReferenceDate AS AddendumReferenceDate,
           ad.ReportText AS AddendumReportText, d.ICDCode, d.DiagnosisText
    FROM main
    LEFT OUTER JOIN Src.TIUDocument_8925_New AS ad
                 ON main.Sta3n = ad.Sta3n AND main.TIUDocumentIEN = ad.ParentIEN
    LEFT OUTER JOIN d ON main.TIUDocumentSID = d.TIUDocumentSID
    WHERE main.ScrSSN >= 0
      AND main.ScrSSN < 1000000000
    ORDER BY main.TIUDocumentSID, TIUAddendumSID, d.ICDCode

    Thanks,

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • GilaMonster (8/26/2012)


    It's FASTFIRSTROW that no longer works; the replacement is FAST 1.

    It's worth noting that the FAST 1 hint will result in the optimiser estimating and optimising for 1 row only. The FAST hint is a row goal; it tells the optimiser how many rows it needs to plan for. That's why both queries have completely incorrect row estimations.

    Ah yes, that makes sense. Thanks, Gail!

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Here are two more query execution plans. SingleQueryExecutionPlanWithoutHints.sqlplan is the same query as before, but with the NoLock and FAST 1 hints removed. SingleQueryExecutionPlanWithDiagTempTable.sqlplan is for the requested query in which the diagnosis data is selected into a temp table first. It also doesn't use any hints.

    A question underlying the analysis of the gargantuan query that I posted originally is whether using a single query to get back millions of rows is the best approach for selecting this data or not.

    The original set of queries consisted of one query that selected only the main document information (the TIUDocumentSID, ScrSSN, Sta3n, ReferenceDate, VHAEnterpriseStandardTitle and ReportText), plus separate queries that selected the addendum information (the TIUAddendumSID, AddendumReferenceDate and AddendumReportText) and the diagnosis information (the ICDCode and the DiagnosisText).

    The first of these queries would be executed only once and would return one row for each main document. The second and third queries would each be executed once per main document. If there are 1,000,000 main documents in the data set, then the first select statement would be run once and return 1,000,000 rows, and the second and third queries would each be run 1,000,000 times, returning between 0 and 3 rows in most cases and occasionally as many as 10. In contrast, the single query I am using now is run once and returns, for each main document, roughly the number of addenda times the number of diagnoses (with at least one row per main document even when it has no addenda or no diagnoses, because of the outer joins), summed over all of the main documents. The data set I have been using to generate these query plans contains 129,810 main documents, and the single query returns 261,385 rows.
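
    To make that multiplication concrete, here is a tiny self-contained illustration with throwaway table variables (nothing to do with the real schema):

    DECLARE @docs      TABLE (DocID int);
    DECLARE @addenda   TABLE (DocID int, AddID int);
    DECLARE @diagnoses TABLE (DocID int, Code varchar(10));

    INSERT INTO @docs      VALUES (1);
    INSERT INTO @addenda   VALUES (1, 101), (1, 102);            -- 2 addenda for document 1
    INSERT INTO @diagnoses VALUES (1, 'A'), (1, 'B'), (1, 'C');  -- 3 diagnoses for document 1

    SELECT d.DocID, a.AddID, x.Code
    FROM @docs AS d
    LEFT JOIN @addenda   AS a ON a.DocID = d.DocID
    LEFT JOIN @diagnoses AS x ON x.DocID = d.DocID;
    -- returns 2 x 3 = 6 rows for document 1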

    The way I was submitting the main document query and the many addendum and diagnosis queries was apparently very inefficient, but I would like to find out whether there is a way to submit them that is faster than the single query I showed in my first post. I am running all of this from a Java program via JDBC, so the main document query would return TIUDocumentSIDs for all of the main documents, and Java code would loop through them and obtain addenda and diagnoses using SQL queries whose WHERE clauses specified "WHERE TIUDocumentSID = ?", where ? is bound to the current TIUDocumentSID in the loop. Is there a way to make a set of queries like this run faster than a single query that returns all of the rows (with some duplication of information)? Or is there a third way of accessing the database information that would be faster than either of these?
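
    One variation I have been wondering about (just a sketch, borrowing the temp-table idea from earlier in the thread; the join columns are copied from the queries above, so treat the details as assumptions) is to stage the document keys once and then pull each child set in a single pass, stitching the result sets back together by TIUDocumentSID in the Java code:

    -- Stage the keys for the sampled documents once
    SELECT pat1.TIUDocumentSID, pat1.Sta3n, doc.TIUDocumentIEN
    INTO #docs
    FROM [VHA19\VHASLCOlearT].PatientMap_1K_patients AS pat1
    JOIN Src.TIUDocument_8925_New AS doc ON doc.TIUDocumentSID = pat1.TIUDocumentSID;

    -- All addenda for those documents in one pass, instead of one query per document
    SELECT d.TIUDocumentSID,
           ad.TIUDocumentSID AS TIUAddendumSID,
           ad.ReferenceDate  AS AddendumReferenceDate,
           ad.ReportText     AS AddendumReportText
    FROM #docs AS d
    JOIN Src.TIUDocument_8925_New AS ad
      ON ad.Sta3n = d.Sta3n AND ad.ParentIEN = d.TIUDocumentIEN
    ORDER BY d.TIUDocumentSID;

    A similar single pass could pull the diagnoses. I don't know whether that would actually beat the single large query here, so I'd welcome opinions on it as well.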

    Thanks,

    Mike

    I can't obtain the indexes and keys because I am working through a view onto a table that I can't access directly. That view has 95 columns, and I am using only about 8 of them.

  • tmoleary (8/27/2012)


    Here are two more query execution plans. SingleQueryExecutionPlanWithoutHints.sqlplan is the same query as before, but with the NoLock and FAST 1 hints removed. SingleQueryExecutionPlanWithDiagTempTable.sqlplan is for the requested query in which the diagnosis data is selected into a temp table first. It also doesn't use any hints.

    How long did these take to execute, compared to your original?

    A question underlying the analysis of the gargantuan query that I posted originally is whether using a single query to get back millions of rows is the best approach for selecting this data or not.

    Generally, yes, one big query is usually the best way. Breaking it down into intermediate tables, as I was trying, is something that should only be done as a remedial approach. It sometimes works for big queries, because the optimizer has a lot more trouble figuring out the best approach for big queries than it does for small ones. It might also help in other cases where the optimizer doesn't seem to be figuring things out very well.

    Also, can you tell us how many rows are in the VDiagnosis table?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Ok, try this query:

    SELECT pat1.TIUDocumentSID,
           pat1.ScrSSN,
           pat1.Sta3n,
           doc.ReferenceDate,
           doc.ReportText,
           ad.TIUDocumentSID AS TIUAddendumSID,
           ad.ReferenceDate AS AddendumReferenceDate,
           ad.ReportText AS AddendumReportText,
           icd.ICDCode,
           icd.DiagnosisText
    FROM [VHA19\VHASLCOlearT].PatientMap_1K_patients AS pat1
    JOIN Src.TIUDocument_8925_New AS doc ON doc.TIUDocumentSID = pat1.TIUDocumentSID
    LEFT JOIN Src.TIUDocument_8925_New AS ad ON pat1.Sta3n = ad.Sta3n AND doc.TIUDocumentIEN = ad.ParentIEN
    LEFT JOIN Src.Visit_IDs AS vis ON pat1.Sta3n = vis.Sta3n AND pat1.VisitIEN = vis.VisitIEN
    JOIN Src.VDiagnosis AS diag ON vis.VisitSID = diag.VisitSID
    JOIN [VHA20\VHAPUGHAMMOK].ICD AS icd ON diag.ICDSID = icd.ICDSID
    WHERE pat1.ScrSSN >= 0
      AND pat1.ScrSSN < 1000000000
    ORDER BY pat1.TIUDocumentSID, TIUAddendumSID, icd.ICDCode

    I think that this is about as clean as the query expression can get, so hopefully, the optimizer will be able to figure it out better. Either way, I would be interested to see the query plan and the run times.

    You will also need to check the results for correctness, since I have moved a lot of stuff around and I cannot test it myself.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]
