Fragmented index

  • There's only one index the cluster index on the table.

    PRIMARY KEY CLUSTERED

    ([Guid] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [SECONDARY]


    John Zacharkan

  • Then the import is not done sequentially.

    What are you using to import the data? SSIS, sps, bulk?

  • Okay - I have an update on how this GUID column is generated.

    On a seperate database we have a table with columns

    1) table prefix char(15)

    2) Sequence_no numberic(18,0)

    3) Reset_Date date

    The ETL tool DataStage reads files from disk and concats the three columns above and manages the incrementing of the Sequence_no. These columns combined make up the field they call GUID varchar(32).

    Again however, since the Sequence _no in the field GUID so far has been sequential. Why would the data be fragemented.

    I've rebuild the cluster Index. And the team is going to be loading more data.

    Thanks for the replies - my apologies for being ambiguous.


    John Zacharkan

  • <mini_rant>The use of GUID for this entity is a misnomer in the context of Microsoft SQL Server so I am calling them IDs from now on because the terminology is confusing!</mini_rant>

    zach_john (5/25/2011)


    Okay - I have an update on how this GUID column is generated.

    On a seperate database we have a table with columns

    1) table prefix char(15)

    2) Sequence_no numberic(18,0)

    3) Reset_Date date

    Was your sample data real?

    How does this break down?

    arw20110519000000000000001487001

    What format is the date?

    The ETL tool DataStage reads files from disk and concats the three columns above and manages the incrementing of the Sequence_no. These columns combined make up the field they call GUID varchar(32).

    Is there only one instance of "ETL tool DataStage" running at a time? Or can there be an instance running that gets a block of IDs that takes longer to process and insert than another instance that got a set of IDs later in the sequence?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Ninja's_RGR'us (5/25/2011)


    opc.three (5/25/2011)


    The idea of having an app generate a sequential GUID outside the database and provide it to SQL Server as the clustering key for a table is great on paper...but I have never had the privilege of seeing where the technique was implemented properly.

    Can't find the post, but not long ago someone on SSC brought to the light that Microsoft CRM uses the technique. It never came out whether they did it sequentially in the parent table, but IIRC they were having problems with performance in one of the child tables.

    We were both talking about the same thread... mine ironically enough :w00t:

    http://www.sqlservercentral.com/Forums/FindPost1107898.aspx

    Awesome 😎

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (5/25/2011)


    <mini_rant>The use of GUID for this entity is a misnomer in the context of Microsoft SQL Server so I am calling them IDs from now on because the terminology is confusing!</mini_rant>

    Agreed! Not my doing.

    zach_john (5/25/2011)


    Okay - I have an update on how this GUID column is generated.

    On a seperate database we have a table with columns

    1) table prefix char(15)

    2) Sequence_no numberic(18,0)

    3) Reset_Date date

    opc.three (5/25/2011)


    Was your sample data real?

    No and it can't be due to a PHI requirement

    opc.three (5/25/2011)


    How does this break down?

    arw20110519000000000000001487001

    What format is the date?

    'ARW' is the table prefix - they all seem to be 3 characters why it was defined bigger i'm lost.

    20110519000 is reset date not sure why there is 3 zeros at then end

    000000000001487001 is Reset_data, the sequence number handled by the

    The ETL tool DataStage reads files from disk and concats the three columns above and manages the incrementing of the Sequence_no. These columns combined make up the field they call GUID varchar(32).

    Is there only one instance of "ETL tool DataStage" running at a time? Or can there be an instance running that gets a block of IDs that takes longer to process and insert than another instance that got a set of IDs later in the sequence?

    Only one instance at this time, however I have approached the subject of multiple instances running with concern on the ID being generated.

    We just completed loading another batch approx 1.5M record, again the table has become fragmented

    depth 4, avg fragmentation 45%, #Fragments 22,615, avg page per fragment 2, # of pages 50,816

    The IDs loaded today start at

    'xvp20110525000000000000000000001'

    and end at

    'xvp20110525000000000000001601205'


    John Zacharkan

  • OK...so much for that...

    What about Ninja's thought...are the rows being updated in the table after they are initially inserted?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'm no SSIS guru but I know this for sure, you can commit the whole process in batches and unless you specifically ask for an ordered insert you won't get any order in the insert. And with 1.5 M you can defenitely fragment the heck out of that table.

    Also the LEADING part of the key is a prefix, NOT a date, NOT a sequential GUID. So that alone can cause that much fragmentation.

    The only real "identity" column is concatenated at the END of the key so the "logical" order of insert is not 100% controled by this, NOR the date.

  • Ninja's_RGR'us (5/25/2011)


    Also the LEADING part of the key is a prefix, NOT a date, NOT a sequential GUID. So that alone can cause that much fragmentation.

    The only real "identity" column is concatenated at the END of the key so the "logical" order of insert is not 100% controled by this, NOR the date.

    A very good point indeed...need more info about how the prefix changes from batch to batch...

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (5/25/2011)


    Ninja's_RGR'us (5/25/2011)


    Also the LEADING part of the key is a prefix, NOT a date, NOT a sequential GUID. So that alone can cause that much fragmentation.

    The only real "identity" column is concatenated at the END of the key so the "logical" order of insert is not 100% controled by this, NOR the date.

    A very good point indeed...need more info about how the prefix changes from batch to batch...

    My guess is that this is a type or code column... there's little logic in dumping a 3rd, meaningless key in the "guid". And if there was it probably wouldn't concatenated.

    I've done this for imports, but each import type had a different file... not merged into a single file, single table, single column without some way of extracting that data again.

  • Ninja's_RGR'us (5/25/2011)


    opc.three (5/25/2011)


    Ninja's_RGR'us (5/25/2011)


    Also the LEADING part of the key is a prefix, NOT a date, NOT a sequential GUID. So that alone can cause that much fragmentation.

    The only real "identity" column is concatenated at the END of the key so the "logical" order of insert is not 100% controled by this, NOR the date.

    A very good point indeed...need more info about how the prefix changes from batch to batch...

    My guess is that this is a type or code column... there's little logic in dumping a 3rd, meaningless key in the "guid". And if there was it probably wouldn't concatenated.

    I've done this for imports, but each import type had a different file... not merged into a single file, single table, single column without some way of extracting that data again.

    It would be nice if SQL Server supported "expression indexes" to help with data models like this. I am not counting "indexing a computed column" as exactly the same thing but they are a valuable stand-in.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (5/25/2011)


    OK...so much for that...

    What about Ninja's thought...are the rows being updated in the table after they are initially inserted?

    No they're loaded in directly as is.


    John Zacharkan

  • Hmm...running out of ideas...

    Are the 3-byte "table prefix char(15)" always used in a sequential order? i.e. Could you possibly load a batch with a prefix of "xyz" before a batch that uses a prefix of "abc"?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Ninja's_RGR'us (5/25/2011)


    I'm no SSIS guru but I know this for sure, you can commit the whole process in batches and unless you specifically ask for an ordered insert you won't get any order in the insert. And with 1.5 M you can defenitely fragment the heck out of that table..

    True but the data ID created is incremented and inserted in order in 5K batches, the next 5K batch sequentially followed.

    Ninja's_RGR'us (5/25/2011)


    Also the LEADING part of the key is a prefix, NOT a date, NOT a sequential GUID. So that alone can cause that much fragmentation. .

    So within a batch 1.5 million records inserted sequentially, starting with 'abc20110501000000000001'

    ending with 'abc20110501000001530242', commiting every 5K, you'd expect the index to be fragmented? I don't see how. If we say that because the range between the first insert and the second allows for x number of possiblity I might follow.

    Ninja's_RGR'us (5/25/2011)


    The only real "identity" column is concatenated at the END of the key so the "logical" order of insert is not 100% controled by this, NOR the date.


    John Zacharkan

  • opc.three (5/25/2011)


    Hmm...running out of ideas...

    Are the 3-byte "table prefix char(15)" always used in a sequential order? i.e. Could you possibly load a batch with a prefix of "xyz" before a batch that uses a prefix of "abc"?

    The 3bytes are table specific. So no they will always be the same for this table. Getting late I'll be back after 9am est. Thanks for interest.


    John Zacharkan

Viewing 15 posts - 16 through 30 (of 37 total)

You must be logged in to reply to this topic. Login to reply