What's a good indexing strategy for this situation?

  • I have a 94,000 row table that looks like:

    CREATE TABLE dbo.GrossPay (

    [CompanyID] [int] NOT NULL,

    [EmployeeID] [int] NOT NULL,

    [PayCode] [varchar](15) NOT NULL,

    [PayPeriod] [int] NOT NULL,

    [Payroll_Year] [int] NOT NULL,

    Amount decimal(18, 4))

    I always select WHERE CompanyID = @CompanyID AND EmployeeID = @EmployeeID

    AND PayPeriod = @PayPeriod AND Payroll_Year = @PayrollYear

    About 88,000 rows of this table are CompanyID 91, and the rest are CompanyID = 98, so CompanyID is not very selective.

    What's the best way to index this table? A composite index? Or separate indexes to cover my most common WHERE clauses? Thanks.

    -- EDIT: PayPeriod is NOT NULL

    There is no "i" in team, but idiot has two.
  • What columns are you returning? And why no primary key? The first index you need is a clustered index...

    Edit: Added question about PK.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I am most often returning all the columns in the table.

    As far as the primary key, that's part of my "what's a good strategy" question. Should I be using an identity column for a PK? Or a composite column of the most common WHERE clause columns. Also, I truncate and load this table every two weeks from a linked DB2 server.

    There is no "i" in team, but idiot has two.
  • Dave-148053 (2/5/2015)


    I am most often returning all the columns in the table.

    As far as the primary key, that's part of my "what's a good strategy" question. Should I be using an identity column for a PK? Or a composite column of the most common WHERE clause columns.

    I hope not to be the final word on this as I am not an Index Master. That said does PayPeriod have to be nullable? Pay period can't be part of a clustered index because it is nullable. If it does not need to be nullable then you could do a composite key that includes CompanyID, EmployeeID, Payroll_Year & PayPeriod. If PayPeriod has to be nullable and the combination of CompanyID, EmployeeID & Payroll_Year does not guarantee a unique set of values then an identity column is the way to go.

    Another question to answer that would help determine a good clustered and non-clustered index strategy is - do you have to do any sorting our grouping on this table? This is where indexes really speed things up.

    Also, I truncate and load this table every two weeks from a linked DB2 server.

    I good practice here is to drop your constraints and indexes before doing this then rebuilding them after.

    This is a good question and I hope that some of the local SSC old timers jumped in on this thread.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (2/5/2015)


    Dave-148053 (2/5/2015)


    I am most often returning all the columns in the table.

    As far as the primary key, that's part of my "what's a good strategy" question. Should I be using an identity column for a PK? Or a composite column of the most common WHERE clause columns.

    I hope not to be the final word on this as I am not an Index Master. That said does PayPeriod have to be nullable? Pay period can't be part of a clustered index because it is nullable. If it does not need to be nullable then you could do a composite key that includes CompanyID, EmployeeID, Payroll_Year & PayPeriod. If PayPeriod has to be nullable and the combination of CompanyID, EmployeeID & Payroll_Year does not guarantee a unique set of values then an identity column is the way to go.

    Another question to answer that would help determine a good clustered and non-clustered index strategy is - do you have to do any sorting our grouping on this table? This is where indexes really speed things up.

    Also, I truncate and load this table every two weeks from a linked DB2 server.

    I good practice here is to drop your constraints and indexes before doing this then rebuilding them after.

    My typo, PayPeriod should be NOT NULL. Also, I only get a calendar year of data at a time, so Payroll_Year doesn't need to be in my where clauses. Good idea on drop and rebuilding constraints and indexes each load. From looking at my report queries, it looks like I mostly join on CompanyID and EmployeeID.

    There is no "i" in team, but idiot has two.
  • When choosing the components to include in the PRIMARY KEY, you must allow for uniqueness. Meaning that you can only have one row per company, employee, pay period and pay year if that is what the composite PRIMARY KEY is built upon.

    Without knowing what data you're storing, that's only my best guess at what to include in the PK.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Do you always query by employee? I'll guess you have more employees than payperiods. Assuming that, your columns should be EmployeeId, PayPeriod, CompanyId. If the queries use more the period than the employee, you could swap your first two columns in the index.

    Have you considered on making payperiod a smallint or even a tinyint? Payroll_Year can also be a smallint.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Looking at the table structure, you are going to need all the columns except amount to create a unique primary key unless you can verify that employee IDs are not (and never will be) duplicated across companies.

    As for indexing strategy, if you look at the execution plans for your queries it will normally advise you if performance could be improved with additional indexes. REMEMBER that there is a cost to creating an index so writes are more expensive with more indexes.

    Normally in the absence of any statistics, I would suggest that you index by the most granular (i.e. has the most distinct entries) in your case, likely to be empID, however you also need to consider page fill and how often you will be inserting records into the index. Think if the index literally as a box full of index cards, if you have the index by empID and pay date then on every run of payroll you will be inserting records INTO the middle of the index and you will eventually need to move some of the cards to a new box (this is paging). If your index is by payroll date then you will probably be ADDING records to the end of the index. you will still need the extra box, but you don't have to waste time moving the existing cards around.

    Likewise, if you need to get all the data for an employee then jumping into the index by empID and puling the 12 cards for the payroll date range is relatively simple. If you needed all the payroll records for a date range then you would effectively have to flick through each card and check the date (table scan). If you used the latter index strategy then the second query become easy (jump to the date and pull all the employee cards) but the first becomes hard (table scan). If you keep both indexes then both queries can be fast but you have to make the time to keep two sets of index records.

  • I'd cluster the table by:

    Payroll_Year, PayPeriod, EmployeeID and, optionally, CompanyID

    That way the clustering key is generally increasing, which will reduce fragmentation. That's not the main criterion for a clus key, but if you can use a natural ascending value and still match your most common search conditions, you should.

    If your business requirements indicate that it should also be a PK, you cluster the table using a PK. If not, just create a clustered index with those columns.

    Here's sample SQL:

    --PK

    ALTER TABLE dbo.table_name ADD CONSTRAINT table_name__PK PRIMARY KEY CLUSTERED ( Payroll_Year, PayPeriod, EmployeeID, CompanyID ) WITH ( FILLFACTOR = 99 ) ON [PRIMARY]

    --Clus Index

    CREATE CLUSTERED INDEX table_name__CL ON dbo.table_name ( Payroll_Year, PayPeriod, EmployeeID ) WITH ( FILLFACTOR = 99 ) ON [PRIMARY]

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Scott,

    I mainly agree with you, but the OP stated that there's always one Payroll_Year at a time. Based on that, I wouldn't include it in the index.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (2/6/2015)


    Scott,

    I mainly agree with you, but the OP stated that there's always one Payroll_Year at a time. Based on that, I wouldn't include it in the index.

    True, I hadn't seen his later comment in it with that at the time.

    If that holds true, and since joins are based on CompanyId and EmpId, there's a much better case for using CompanyId and EmpId as the lead columns, in whichever order.

    It would also be nice to see SQL's missing index stats and index usage stats before finalizing the keys. Those may help resolve which is really the most common seek.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Luis Cazares (2/5/2015)


    Do you always query by employee? I'll guess you have more employees than payperiods. Assuming that, your columns should be EmployeeId, PayPeriod, CompanyId. If the queries use more the period than the employee, you could swap your first two columns in the index.

    Most queries read all employees by CompanyID and PayPeriod, then group on EmployeeID. From reading other replies, I'm thinking non-unique clustered index on CompanyID, PayPeriod, EmployeeID. It's a reporting table loaded from a DB2 server, so it will never be updated.

    Have you considered on making payperiod a smallint or even a tinyint? Payroll_Year can also be a smallint.

    Yes, I certainly should. Thanks for keeping me honest about my datatypes!

    There is no "i" in team, but idiot has two.
  • Do you need the data at the granularity of empID. Maybe get a data feed at a higher level of aggregation or generate another reporting table, but that is stepping into the realms of data warehousing.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply