Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

SQL query performance question Expand / Collapse
Author
Message
Posted Monday, August 20, 2012 5:03 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, November 15, 2012 2:08 PM
Points: 11, Visits: 62
I am working on a small research team that has been given access to a set of SQL Server 2008 tables via some views. We can do select queries using those views, and I can see the names of the tables using the Design command when one of the views is selected in SQL Server Management Studio, but we don't have permission to look at the table columns and indexes directly, much less change them in any way.

The database contains electronic medical records. My task is to populate a Lucene index from the contents of the free text fields in the database. I have been using Solr and Lucene to do the indexing, and JDBC to access the database. For our experiments, we are indexing the documents of randomly selected sets of patients. From the views that we have access to, I can get patients' ID numbers and the documents associated with those patients. Each document may contain a variable number of addendum documents and may contain a variable number of diagnosis records.

In my first attempt to access these records, I modified some sample code from a Solr web site (http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example) that specified nested SQL selects in a configuration file to obtain records of items for sale, their descriptions, and their categories. I figured that I could do the same thing with an outer select statement that obtains the patient and main document ids and other main document information, and two inner queries at the same level, one of which obtains the addendum information associated with a given document id and the other of which obtains the diagnosis information associated with that document id. The data sets we are working with range in size from 1,000 patients to 500,000 patients, and the patients have on average about 100 main documents associated with them. So for a data set of 100,000 patients, one select statement was run to get the patient and main document info, 100,000 select statements were run to get the addendum info, and 100,000 select statements were run to get the diagnosis info. This ran extremely slowly. I don't know if the problem was with the sheer number of selects, each of which obtained a small number of rows, or if some additional work is needed to make the selects execute faster. A database admin looked at logs of the queries I was running, and said "each [query] is being recompiled with a new query plan being generated for each one". Unfortunately, both the databases and the administrators are located over a thousand miles from us, and we don't get many chances to talk with them.

I changed the way I was accessing the database to using a single query that gets all of the patient, document, addendum and diagnosis info. I needed to write some Java code to remove duplicates for some documents, but that wasn't difficult. This query runs much faster than the nested ones I had been using, but it often takes a long time for the first rows of data to be returned. In a recent run with data sets containing about 25,000 patients each (about 1,600,000 SQL rows each), it took at least 40 minutes and up to 5 hours for the first rows of the query to be returned. After that, the rows came in fast enough for our purposes. I don't know if the query I am using is at fault or if the tables that the views we use are based on are lacking the appropriate table structure and indexes that would allow the query to run as efficiently as possible. The query I am using looks like this:

SELECT main.PatID, main.DocID, main.Site, main.RefDate, main.Report, ad.DocID AS AddID,
ad.RefDate AS AddRefDate, ad.Report AS AddReport, dx.Code, dx.Text
FROM (SELECT pat1.PatID, pat1.DocID, pat1.Site, doc.RefDate, doc.Report
FROM PatMapTable AS pat1 WITH (NoLock)
INNER JOIN DocumentView AS doc WITH (NoLock)
ON doc.DocID = pat1.DocID) AS main
LEFT OUTER JOIN DocumentView AS ad WITH (NoLock)
ON main.DocID = ad.ParentID
LEFT OUTER JOIN (SELECT pat2.DocID, diag.Code, diag.Text
FROM PatMapTable AS pat2 WITH (NoLock)
INNER JOIN DiagnosisView AS diag WITH (NoLock)
ON pat2.Site = diag.Site AND pat2.VisitID = diag.VisitID) AS dx
ON main.DocID = dx.DocID
ORDER BY main.DocID, AddID, dx.Code

Are there log files I could access that could indicate whether the database tables could be improved or if there are problems with this select statement? Is there a different approach that would work better? I would appreciate any advice anyone can offer about this.
Thanks,
Mike
Post #1347509
Posted Tuesday, August 21, 2012 1:18 AM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Yesterday @ 6:03 PM
Points: 3,609, Visits: 5,221
The first thing that I would do is improve the readability of your query (slightly) by rewriting it as:

;WITH main AS (
SELECT pat1.PatID, pat1.DocID, pat1.Site, doc.RefDate, doc.Report
FROM PatMapTable AS pat1 WITH (NoLock)
INNER JOIN DocumentView AS doc WITH (NoLock)
ON doc.DocID = pat1.DocID
),
dx AS (
SELECT pat2.DocID, diag.Code, diag.Text
FROM PatMapTable AS pat2 WITH (NoLock)
INNER JOIN DiagnosisView AS diag WITH (NoLock)
ON pat2.Site = diag.Site AND pat2.VisitID = diag.VisitID
)
SELECT main.PatID, main.DocID, main.Site, main.RefDate, main.Report, ad.DocID AS AddID,
ad.RefDate AS AddRefDate, ad.Report AS AddReport, dx.Code, dx.Text
FROM main
LEFT OUTER JOIN DocumentView AS ad WITH (NoLock)
ON main.DocID = ad.ParentID
LEFT OUTER JOIN dx
ON main.DocID = dx.DocID
ORDER BY main.DocID, AddID, dx.Code


That won't help with performance though.

The first step would be to look at the actual execution plan (menu: Query / Include Actual Execution Plan) and check to see if your tables are being scanned or if indexes are being scanned. I suspect the former, in which case the tables lack the indexing you need to speed up your queries.

If you post that execution plan, folks may be able to look at it and make further suggestions.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1347601
Posted Saturday, August 25, 2012 10:35 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, November 15, 2012 2:08 PM
Points: 11, Visits: 62
A couple of query execution plan files are attached. The one for the query I am using is called SingleQueryExecutionPlan.sqlplan. It shows 99% of the cost assigned to one sort node near the end of the processing. The second file, called SingleQueryExecutionPlanNoSort.sqlplan is from the same query, but with the ORDER BY clause removed, to see which other nodes have a non-zero cost assigned in that case. I need to have the results sorted, so removing the ORDER BY clause is not an option in practice. Is there a way to write the query so sorting the results is not so expensive? Could you also please tell me which of the other nodes or sequences of nodes indicate inefficiently written parts of the query?
Thanks,
Mike


  Post Attachments 
SingleQueryExecutionPlan.sqlplan (13 views, 238.07 KB)
SingleQueryExecutionPlanNoSort.sqlplan (10 views, 147.86 KB)
Post #1350089
Posted Sunday, August 26, 2012 10:58 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Thursday, June 5, 2014 10:54 AM
Points: 9,902, Visits: 9,480
Well, you're trying to sort 1.6 million rows. To do that "quickly" (several seconds) requires a substantial amount of memory, depending on how big these rows are. If you do not have enough memory then is has to do a memory+disk sort which is substantially slower (minutes to hours).

-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #1350124
Posted Sunday, August 26, 2012 11:22 AM


SSC-Forever

SSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-Forever

Group: General Forum Members
Last Login: Today @ 11:41 AM
Points: 42,442, Visits: 35,496
The second thing I'd do is take the nolock hints out. See - http://blogs.msdn.com/b/davidlean/archive/2009/04/06/sql-server-nolock-hint-other-poor-ideas.aspx

If the sort has to be there and is the most expensive portion, there's not really much (other than more hardware) that can be done. Can't leverage indexes because the columns are in 3 different tables.



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1350127
Posted Sunday, August 26, 2012 1:54 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Thursday, June 5, 2014 10:54 AM
Points: 9,902, Visits: 9,480
Hmm, the query plan looks odd to me. The relationship between estimated and actual suggests that things could be improved, but what's really suprising to me is that the cost estimates of the Sort seems to be based on the preceeding row count estimate (36M) which is way off the actual (0.26M).

Being off by more than two orders of magnitude, I think that the 98% cost estimate for the Sort should be taken with a huge grain of salt.


-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #1350151
Posted Sunday, August 26, 2012 2:00 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Thursday, June 5, 2014 10:54 AM
Points: 9,902, Visits: 9,480
Hmm, the estimates in the NO-OrderBy version are really bad too. I wonder why that is?

Could you supply the DDL for the tables with keys and indexes please? If there are a lot of unused columns, then you can just included the columns that are used in this query. Thanks.



-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #1350152
Posted Sunday, August 26, 2012 2:10 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Thursday, June 5, 2014 10:54 AM
Points: 9,902, Visits: 9,480
I see you're also using OPTION (FAST 1), I thought that that didn't work anymore?

-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #1350153
Posted Sunday, August 26, 2012 2:16 PM


SSC-Forever

SSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-Forever

Group: General Forum Members
Last Login: Today @ 11:41 AM
Points: 42,442, Visits: 35,496
It's FASTFIRSTROW that no longer works, the replacement is FAST 1

It's worth noting, the FAST 1 hint will result in the optimiser estimating and optimising for 1 row only. The FAST hint is a row goal, tells the optimiser how many rows it needs to plan for. That's why both queries have completely incorrect row estimations

I didn't see a hint in the initial query, that's why I didn't mention it, but FAST 1 may very well cause the first row to be returned faster with the side effect that the entire query takes much longer than it would without the hint.

Ok, correcting my earlier statement. First thing I'd do is take off the nolock hint. Second thing I'd do is take off the FAST 1 query hint. Get some sensible estimations and see how the plans change and how the performance changes.



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1350154
Posted Sunday, August 26, 2012 2:35 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Thursday, June 5, 2014 10:54 AM
Points: 9,902, Visits: 9,480
Can you show us what the query/execution plan from this looks like:

SELECT pat2.TIUDocumentSID, icd.ICDCode, icd.DiagnosisText
into #d
FROM [VHA19\VHASLCOlearT].PatientMap_1K_patients AS pat2
INNER JOIN Src.Visit_IDs AS vis ON pat2.Sta3n = vis.Sta3n AND pat2.VisitIEN = vis.VisitIEN
INNER JOIN Src.VDiagnosis AS diag ON vis.VisitSID = diag.VisitSID
INNER JOIN [VHA20\VHAPUGHAMMOK].ICD AS icd
ON diag.ICDSID = icd.ICDSID


WITH main AS (
SELECT pat1.TIUDocumentSID, pat1.ScrSSN, pat1.Sta3n, doc.ReferenceDate, doc.VHAEnterpriseStandardTitle, doc.ReportText, doc.TIUDocumentIEN
FROM [VHA19\VHASLCOlearT].PatientMap_1K_patients AS pat1
JOIN [Src].TIUDocument_8925_New AS doc ON doc.TIUDocumentSID = pat1.TIUDocumentSID
),
d AS (
Select * From #d
)
SELECT main.TIUDocumentSID, main.ScrSSN, main.Sta3n, main.ReferenceDate, main.VHAEnterpriseStandardTitle, main.ReportText,
ad.TIUDocumentSID AS TIUAddendumSID, ad.ReferenceDate AS AddendumReferenceDate, ad.ReportText AS AddendumReportText,
d.ICDCode, d.DiagnosisText

FROM main
LEFT OUTER JOIN Src.TIUDocument_8925_New AS ad ON main.Sta3n = ad.Sta3n AND main.TIUDocumentIEN = ad.ParentIEN
LEFT OUTER JOIN d ON main.TIUDocumentSID = d.TIUDocumentSID

WHERE main.ScrSSN >= 0
AND main.ScrSSN < 1000000000

ORDER BY main.TIUDocumentSID, TIUAddendumSID, d.ICDCode

Thanks,


-- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc.
"Performance is our middle name."
Post #1350156
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse