Optimization advice for huge tables

  • Hello,

    I need tuning advice for query performance on large tables.

    I have 3 big tables, and some queries need to join all 3 of them. The queries are getting slower every day as the row counts increase.

    I believe I have already done the index tuning.

    I'd be glad for any suggestions on better performance tuning.

    It is not easy to split the tables. I tried creating additional data files, but it didn't make a difference.

    Each table is 5-6 GB in size, by the way.

    Thanks in advance.

    Performance tuning is generally done for a specific performance goal, and without some detail on what you already have in the way of tables, indexes, row sizes, row counts, and the query (or queries) that need help, general advice might not be terribly useful. There are a number of possible techniques to help out, such as partitioning, sparse indexes (typically for fields with large numbers of NULL rows, or for situations where just one value out of all the possible values for a field is of interest), index query hints, query optimization, temp tables, table variables, etc...

    You haven't provided enough information to go into any detail, however... Care to elaborate?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • When working with large tables:

    Retrieve just the data that you absolutely need.

    Understand that no one is going to read several million rows, hence do not query several million rows and return them to the client. It is a waste.

    Make sure your indexes support the queries

    Make sure the queries are written to be able to use indexes (see the sketch after this list)

    Partitioning is not primarily a performance tuning mechanism, it's mostly about maintainability.

    Avoid index hints unless you are absolutely, completely sure that you know better than the optimiser what the best plan for a query is
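
    For instance, a predicate that wraps the indexed column in a function cannot use an index seek. A minimal sketch, assuming a hypothetical Tags table with a nonclustered index on its Tag column:

    -- Non-SARGable: the function on the column forces a scan of the index on Tag
    SELECT TagID FROM dbo.Tags WHERE LEFT(Tag, 10) = N'SQL Server'

    -- SARGable: the bare column comparison lets the optimiser seek that same index
    SELECT TagID FROM dbo.Tags WHERE Tag LIKE N'SQL Server%'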

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hello Steve,

    You're right. Let me give you more detail.

    My website is http://www.websiteanalyzer.info

    It analyzes websites and gives webmasters information for better SEO.

    Websites have keyword meta tags (a one-to-many relationship), and the same keyword may appear on other sites too. For example, sqlservercentral.com has the "SQL Server" tag, and there are lots of websites with the same tag.

    So I have 3 big tables.

    1 - Sites table - 20,000,000 rows

    SiteID - Primary Key

    SiteUrl - Nonclustered Index

    2 - Tags table - 30,000,000 rows

    TagID - PK

    Tag - Nonclustered Index

    3 - SiteTags table - 120,000,000 rows

    SiteID - Nonclustered Index, FK

    TagID - Nonclustered Index, FK

    If you check the following URLs you will see the problem. On the first page there is no performance problem, but as you navigate through the pager it gets slower (you will see why if you check the stored procedure).

    http://www.websiteanalyzer.info/Tag.aspx?tag=SQL%20Server&page=1

    http://www.websiteanalyzer.info/Tag.aspx?tag=SQL%20Server&page=100

    The following is the stored procedure running on these pages.

    ALTER PROCEDURE [dbo].[spGetSiteListFromTagPager]
        @Tag nvarchar(64),
        @page int = 0
    AS
    DECLARE @TagID int

    -- Resolve the tag text to its ID
    SET @TagID = (SELECT TOP 1 TagID FROM Tags WITH (NOLOCK) WHERE Tag = @Tag)

    IF (@page = 1)
    BEGIN
        SELECT TOP 20 Sites.SiteUrl, Sites.Title, SUBSTRING(Sites.Description, 1, 500) AS 'Description'
        FROM Sites WITH (NOLOCK)
        INNER JOIN SiteTags WITH (NOLOCK) ON Sites.SiteID = SiteTags.SiteID
        WHERE SiteTags.TagID = @TagID
    END
    ELSE IF (@page > 1)
    BEGIN
        -- Collect every site carrying the tag
        CREATE TABLE #t (SiteID bigint NOT NULL)
        ALTER TABLE #t ADD PRIMARY KEY CLUSTERED (SiteID)

        INSERT #t
        SELECT DISTINCT SiteID FROM SiteTags WITH (NOLOCK) WHERE TagID = @TagID
        --CREATE INDEX #t_IDX_SiteID1 ON #t(SiteID)

        -- Number the first @page*20 matching rows, then keep only the requested page
        CREATE TABLE #t2 (UID int IDENTITY(1,1) NOT NULL, SiteUrl varchar(120), Title varchar(500), Description nvarchar(500))
        ALTER TABLE #t2 ADD PRIMARY KEY CLUSTERED (UID)

        INSERT #t2
        SELECT TOP (@page*20) Sites.SiteUrl, Sites.Title, SUBSTRING(Sites.Description, 1, 500)
        FROM Sites WITH (NOLOCK)
        INNER JOIN #t WITH (NOLOCK) ON Sites.SiteID = #t.SiteID

        SELECT --distinct
            SiteUrl, Title, Description
        FROM #t2
        WHERE UID > ((@page-1)*20) AND UID <= (@page*20)  -- page N covers UID (N-1)*20+1 through N*20
    END

    By the way, my web server configuration is an Intel Core i5-2400 3.10 GHz (6 MB cache), a 7200 RPM SATA disk (16 MB cache), and 16 GB RAM, and I'm looking for a new server with an SSD.

    If you need more information, please let me know.

    Thanks

    I'm wondering if your SUBSTRING function is causing performance grief. How many bytes is the Description field? Is that field contained in the index? If not, you'll end up with a table scan, so you might want to do something about that. Also, why do the SUBSTRING at the same time as the TOP n? Why not join to the relevant TOP n records and pull the SUBSTRING only for the records that qualify?

    Here's the thought:

    ;WITH CTE AS (
        SELECT TOP (@page*20) Sites.SiteUrl, Sites.Title, Sites.SiteID
        FROM Sites WITH (NOLOCK)
        INNER JOIN #t WITH (NOLOCK) ON Sites.SiteID = #t.SiteID
    )
    INSERT #t2
    SELECT C.SiteUrl, C.Title, SUBSTRING(S.Description, 1, 500)
    FROM CTE AS C
    INNER JOIN Sites AS S
        ON C.SiteID = S.SiteID

    Not sure if that will help, but it might avoid yanking that Description field for a ton of records when you only need it for the qualifying ones. If the CTE doesn't make that happen, you can use a temp table instead and hopefully force it that way (see the sketch below).
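
    Something along these lines, if you go the temp table route (just a sketch; #keys is a name I'm making up here, while #t and #t2 are the temp tables from your procedure):

    -- Materialize the TOP n keys first, without touching Description
    SELECT TOP (@page*20) Sites.SiteUrl, Sites.Title, Sites.SiteID
    INTO #keys
    FROM Sites WITH (NOLOCK)
    INNER JOIN #t ON Sites.SiteID = #t.SiteID

    -- Then pull Description only for the qualifying rows
    INSERT #t2
    SELECT k.SiteUrl, k.Title, SUBSTRING(s.Description, 1, 500)
    FROM #keys AS k
    INNER JOIN Sites AS s ON k.SiteID = s.SiteID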

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

    Thank you for your advice. I changed it as you wrote, but it didn't help; it still took 38 seconds.

    The Description field type is nvarchar(2000).

    There are other queries that take huge execution times.

    I believe I have already spent a lot of time on query optimization across lots of stored procedures, and that I should change the database design instead.

    Splitting tables, or partitioning, or something else. I don't know.

    When I run this stored procedure with "Include Actual Execution Plan", I see 100% of the cost on the following statement.

    WITH CTE AS (
        SELECT TOP (@page*20) Sites.SiteUrl, Sites.Title, Sites.SiteID
        FROM Sites WITH (NOLOCK)
        INNER JOIN #t WITH (NOLOCK) ON Sites.SiteID = #t.SiteID
    )
    INSERT #t2
    SELECT C.SiteUrl, C.Title, SUBSTRING(S.Description, 1, 500)
    FROM CTE AS C
    INNER JOIN Sites AS S
        ON C.SiteID = S.SiteID

    When I run the old script, I see in the execution plan that 100% of the time is spent on the following:

    Seek Keys[1]: Prefix: [OSA].[dbo].[Sites].SiteID = Scalar Operator([#t].[SiteID])

    So as I understand it, because the Sites table has lots of rows, the clustered index seek on the SiteID column takes time.

    I didn't test it, but if I split the Sites table into 10 parts, maybe queries would run faster.

    If I split it into Sites1, Sites2, ... Sites10 and then create a view like the following,

    CREATE VIEW [dbo].[vwSites]
    AS
    SELECT * FROM Sites1 WITH (NOLOCK)
    UNION   -- note: UNION without ALL adds a de-duplication step across all 10 tables
    SELECT * FROM Sites2 WITH (NOLOCK)
    ...
    SELECT * FROM Sites10 WITH (NOLOCK)

    and then replace "Sites" with "vwSites" in all the stored procedures and functions, do you think everything will be faster?

    Just splitting the table doesn't necessarily make things faster. The problem might be what to use as the criteria for the split. If, after all, you just have to put them all back together again, nothing is really gained.

    The question is whether you need to do the SUBSTRING (can the application handle and toss away any excess characters past 500?), or more importantly, is the Description field a part of the index?

    Another question, related to splitting, is the overall design of the webpage. Do you really need to be able to handle a request for ANY given page number? It seems to me that anything beyond the top 100 websites for any given category might be serious overkill for most practical purposes, but then, I know nothing about what drives revenue for your company, so I can only ask the question as opposed to provide the answer. If it turns out that some fixed number of sites is "good enough" at any given time, then maybe you can design the insert / update process around maintaining the TOP N rows in an additional table as opposed to in a view, and then you might well have a lightning fast site.

    Again, just tossing around ideas (IOW, grasping at what might even be a straw here or there - LOL).

    Anyway, let me know what you have in the index. There might even be a way to use a sparse index if we can flag any given record with an additional field to indicate it's in the TOP N, with a NULL value if it's not, and then create a SPARSE INDEX on that field, including the other fields. You'll have to tell me...
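
    To make that concrete (SQL Server's name for this feature is a filtered index), here is a minimal sketch; the TopNFlag column is hypothetical, and your insert / update process would have to maintain it:

    -- Hypothetical flag: non-NULL only while the site is in the TOP N
    ALTER TABLE dbo.Sites ADD TopNFlag tinyint NULL
    GO

    -- Filtered ("sparse") index covering just the flagged rows
    CREATE NONCLUSTERED INDEX IX_Sites_TopNFlag
        ON dbo.Sites (TopNFlag)
        INCLUDE (SiteUrl, Title)
        WHERE TopNFlag IS NOT NULL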

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

    Looking at the way you've done the paging, you should be able to simplify the whole thing. Firstly, the SUBSTRING could be the cause; have you looked at other methods like LEFT or CONVERT to do the trimming?

    I might also be tempted to do something like this:

    DECLARE @StartRow int, @EndRow int

    SET @StartRow = ((@page-1)*20)+1   -- page 1 starts at row 1
    SET @EndRow = @StartRow + 19

    ;WITH CTE AS (
        SELECT
            ROW_NUMBER() OVER (ORDER BY Sites.SiteUrl) AS Rn
            ,Sites.SiteUrl
            ,Sites.Title
            ,LEFT(Sites.Description, 500) AS Description
            --,CONVERT(nvarchar(500), Sites.Description) AS Description
        FROM Sites
        INNER JOIN SiteTags t ON Sites.SiteID = t.SiteID
        WHERE t.TagID = @TagID   -- @TagID resolved from @Tag as in your procedure
    )
    SELECT * FROM CTE
    WHERE Rn BETWEEN @StartRow AND @EndRow

    It would be interesting to see if this works and is more performant. I have doubts about the CTE and the ORDER BY; you might want to change that to SiteID rather than SiteUrl, depending on the order you want to use.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

    Well, the SUBSTRING is not too important, so I removed it.

    I think it's better to share the table DDL.

    CREATE TABLE [dbo].[Sites](
        [SiteID] [bigint] IDENTITY(1,1) NOT NULL,
        [SiteUrl] [varchar](120) NOT NULL,
        [SiteFirstCheckDate] [datetime] NOT NULL,
        [SiteLastCheckDate] [datetime] NOT NULL,
        [SitePoint] [int] NOT NULL,
        [Title] [nvarchar](500) NOT NULL,
        [TitleLength] [int] NOT NULL,
        [Description] [nvarchar](2000) NOT NULL,
        [SiteWorth] [int] NULL,
        [Encoding] [nvarchar](200) NULL,
        [Author] [nvarchar](200) NULL,
        CONSTRAINT [PK_Sites] PRIMARY KEY CLUSTERED
        (
            [SiteID] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    SET ANSI_PADDING OFF
    GO

    ALTER TABLE [dbo].[Sites] ADD CONSTRAINT [DF_Sites_SiteFirstCheckDate] DEFAULT (getdate()) FOR [SiteFirstCheckDate]
    GO
    ALTER TABLE [dbo].[Sites] ADD CONSTRAINT [DF_Sites_SiteLastCheckDate] DEFAULT (getdate()) FOR [SiteLastCheckDate]
    GO
    ALTER TABLE [dbo].[Sites] ADD CONSTRAINT [DF_Sites_SitePoint] DEFAULT ((0)) FOR [SitePoint]
    GO

    The Description column doesn't have any index. SiteUrl, SiteLastCheckDate and SitePoint have nonclustered indexes.

  • Just adding the Description field to the index on SiteID would probably help quite a bit, although at up to 2K per row, the size of that index would be potentially large, so keep that in mind for disk space usage. Also, there will be a hit on INSERT activity, as then that much more data has to hit the index for inserts and updates.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (12/11/2012)


    Just adding the Description field to the index on SiteID would probably help quite a bit, although at up to 2K per row, the size of that index would be potentially large, so keep that in mind for disk space usage. Also, there will be a hit on INSERT activity, as then that much more data has to hit the index for inserts and updates.

    Surely you shouldn't need to add the Description, as SiteID is the clustered index anyway, so at the leaf level it's already included; and as Description isn't part of any filter here, surely it doesn't make any difference.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

    I didn't understand why you asked to add an index on the Description field; it is not used in the WHERE clause.

    I modified it as follows, and I think it has changed a little bit. I removed the temp tables and used built-in paging.

    ALTER PROCEDURE [dbo].[spGetSiteListFromTagPager]
        @Tag nvarchar(64),
        @page int = 0
    AS
    DECLARE @TagID int

    SET @TagID = (SELECT TOP 1 TagID FROM Tags WITH (NOLOCK) WHERE Tag = @Tag)

    IF (@page = 1)
    BEGIN
        SELECT TOP 20 Sites.SiteUrl, Sites.Title, Sites.Description
        FROM Sites WITH (NOLOCK)
        INNER JOIN SiteTags WITH (NOLOCK) ON Sites.SiteID = SiteTags.SiteID
        WHERE SiteTags.TagID = @TagID
    END
    ELSE IF (@page > 1)
    BEGIN
        SELECT Sites.SiteUrl, Sites.Title, Sites.Description
        FROM Sites WITH (NOLOCK)
        INNER JOIN (SELECT DISTINCT SiteID FROM SiteTags WITH (NOLOCK) WHERE TagID = @TagID) AS t
            ON Sites.SiteID = t.SiteID
        ORDER BY t.SiteID
        OFFSET (@page-1)*20 ROWS
        FETCH NEXT 20 ROWS ONLY
    END

  • What's the distinct for? If you have messy data, clean it up. Distinct is not a cheap operation.
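
    If SiteTags can never hold duplicate (SiteID, TagID) rows, the distinct is pure overhead, and even if it can, an EXISTS returns each site once without a separate de-duplication step. A sketch of the paged query written that way (@TagID and @page as in your procedure):

    SELECT s.SiteUrl, s.Title, s.Description
    FROM dbo.Sites AS s
    WHERE EXISTS (SELECT 1 FROM dbo.SiteTags AS st
                  WHERE st.SiteID = s.SiteID AND st.TagID = @TagID)
    ORDER BY s.SiteID
    OFFSET (@page-1)*20 ROWS
    FETCH NEXT 20 ROWS ONLY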

    Watch the nolocks.

    See - http://blogs.msdn.com/b/davidlean/archive/2009/04/06/sql-server-nolock-hint-other-poor-ideas.aspx
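
    If blocking is what pushed you to NOLOCK, the usual alternative is to enable read committed snapshot once at the database level and drop the hints; readers then see a consistent snapshot instead of possibly reading rows twice or not at all. A sketch, assuming the database is named OSA (going by the execution plan output you posted):

    -- Kicks other sessions out while switching; run in a maintenance window
    ALTER DATABASE OSA SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE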

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jason-299789 (12/11/2012)


    sgmunson (12/11/2012)


    Just adding the Description field to the index on SiteID would probably help quite a bit, although at up to 2K per row, the size of that index would be potentially large, so keep that in mind for disk space usage. Also, there will be a hit on INSERT activity, as then that much more data has to hit the index for inserts and updates.

    Surely you shouldn't need to add the Description, as SiteID is the clustered index anyway, so at the leaf level it's already included; and as Description isn't part of any filter here, surely it doesn't make any difference.

    Missed that... oops...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)
