Identifying what caused Corruption

  • I had a database in suspect mode this morning.

    Fixed it by:

    -placing in emergency mode.

    -set single_user mode

    -dbcc checkdb(N'dbname', REPAIR_ALLOW_DAT_LOSS)

    -Rebuild the log.

    -bring online.

    DBCC Checkdb returned a bunch of errors:

    Extent (3:394032) in database ID 7 is allocated by more than one allocation object.

    The error has been repaired.

    Msg 8913, Level 16, State 1, Line 2

    Extent (3:394032) is allocated to 'GAM' and at least one other object.

    The error has been repaired.

    --Lots of these

    Repair: Deleted off-row data column with ID 562825134080, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 28.

    Repair: Deleted off-row data column with ID 562825199616, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 29.

    Repair: Deleted off-row data column with ID 562825265152, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 30.

    Lots of these:

    Msg 8945, Level 16, State 1, Line 2

    Table error: Object ID 1076927654, index ID 1 will be rebuilt.

    The error has been repaired.

    Msg 8977, Level 16, State 1, Line 2

    Table error: Object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044088320 (type In-row data). Parent node for page (3:394009) was not encountered.

    The error has been repaired.

    Msg 8977, Level 16, State 1, Line 2

    Table error: Object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044088320 (type In-row data). Parent node for page (3:394024) was not encountered.

    The error has been repaired.

    Msg 8914, Level 16, State 1, Line 2

    Incorrect PFS free space information for page (3:393971) in object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data). Expected value 100_PCT_FULL, actual value 50_PCT_FULL.

    The error has been repaired.

    Msg 8964, Level 16, State 1, Line 2

    Table error: Object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data). The off-row data node at page (3:393971), slot 28, text ID 562825134080 is not referenced.

    The error has been repaired.

    Msg 8964, Level 16, State 1, Line 2

    Table error: Object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data). The off-row data node at page (3:393971), slot 29, text ID 562825199616 is not referenced.

    The error has been repaired.

    Msg 8964, Level 16, State 1, Line 2

    Table error: Object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data). The off-row data node at page (3:393971), slot 30, text ID 562825265152 is not referenced.

    Im googling these errors, but im struggling to find out what could have caused this. Any help would be appreciated.

    Thanks a mil!

  • Pl. find any other error in windows error log.

    ---------------------------------------------------
    "Thare are only 10 types of people in the world:
    Those who understand binary, and those who don't."

  • free_mascot (5/7/2014)


    Pl. find any other error in windows error log.

    There was windows security patching performed, and an unexpected shutdown.

    Unexpected shutdown is worrying, but i assumed SQL can handle these by rolling back anything that was in flight when the cut/shutdown happened.

  • Do you understand the ramifications of the below??

    winston Smith (5/7/2014)


    Repair: Deleted off-row data column with ID 562825134080, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 28.

    Repair: Deleted off-row data column with ID 562825199616, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 29.

    Repair: Deleted off-row data column with ID 562825265152, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 30.

    Congratulations you've just dropped Lords knows how much data from your database, was that your desired intention? Didn't you have a backup you could have restored?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (5/7/2014)


    Do you understand the ramifications of the below??

    winston Smith (5/7/2014)


    Repair: Deleted off-row data column with ID 562825134080, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 28.

    Repair: Deleted off-row data column with ID 562825199616, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 29.

    Repair: Deleted off-row data column with ID 562825265152, for object ID 1076927654, index ID 1, partition ID 72057594039566336, alloc unit ID 72057594044153856 (type LOB data) on page (3:393971), slot 30.

    Congratulations you've just dropped Lords knows how much data from your database, was that your desired intention? Didn't you have a backup you could have restored?

    I 100% completely understand what i did, i wouldn't have done it otherwise. And yes, I have a backup. The db only has 4 tables and is only for logging of agent data (not sql agent). I needed to get it online asap to prevent issues with agents/services/whatever you want to call them.

  • All the errors you're listing are just the symptoms from the allocation errors. They're not going to help you troubleshooting the origin of the problem. Instead, you need to look back in the SQL Server error log and the windows error log to see if you can find indications of what happened prior to you deleting some portion of the database. That will help you figure out what happened.

    If you have a backup, do you know it's clean? Could you restore it to a second location and run DBCC on it (without using REPAIR) to see if you can identify allocation errors?

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Corruption is 99.99% IO subsystem problem. Not necessarily the disks, could be anything anywhere in the IO stack.

    Since you mentioned an uncontrolled shutdown, I'd check and make sure all write caches are battery-backed (with working batteries) or disabled.

    Without seeing the full results of a CheckDB (not one run with repair), it's very, very hard to say more. There's not enough information.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/7/2014)


    Corruption is 99.99% IO subsystem problem. Not necessarily the disks, could be anything anywhere in the IO stack.

    Since you mentioned an uncontrolled shutdown, I'd check and make sure all write caches are battery-backed (with working batteries) or disabled.

    Without seeing the full results of a CheckDB (not one run with repair), it's very, very hard to say more. There's not enough information.

    I agree. But given the extent of the problem (pun intended 😛 ) I doubt it is from the shutdown itself, or even multiple unexpected shutdowns. This seems systemic, and that goes to entire-IO-stack firmware and drivers (and maybe other stuff you may have on your system like antivirus, disk defraggers, file-filter-driver thingies. It could also be a bad chip somewhere leading to corruptions too, but that is WAY less likely than firmware/drivers.

    It is a virtual certainty that you will continue to get these errors until you patch everything. And honestly I would IMMEDIATELY take this system offline and never use it for any work I cared about again until I had beat the sh!t out of it to validate that my patching everything in sight prevented any errors from occurring. And maybe not even then.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (5/7/2014)


    But given the extent of the problem (pun intended 😛 ) I doubt it is from the shutdown itself, or even multiple unexpected shutdowns.

    I'm not willing to guess without a lot more information. There's no where near enough detail for me to make an accurate diagnostic or prediction.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/7/2014)


    TheSQLGuru (5/7/2014)


    But given the extent of the problem (pun intended 😛 ) I doubt it is from the shutdown itself, or even multiple unexpected shutdowns.

    I'm not willing to guess without a lot more information. There's no where near enough detail for me to make an accurate diagnostic or prediction.

    Fair enough, although I do not consider my doubt/assertion a true guess and certainly not a random guess. It is a probabilistic analysis of information (i.e. LOTS of corruptions mentioned multiple times) that I believe is sufficient to rule out a particular cause the (vast?) majority of the time.

    Also note that you stated that 9,999 times out of 10,000 that the IO subsystem was responsible for corruptions. Same type of probabilistic statement - just with a higher level of certainty. 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (5/7/2014)


    Also note that you stated that 9,999 times out of 10,000 that the IO subsystem was responsible for corruptions. Same type of probabilistic statement - just with a higher level of certainty. 😎

    Well that's from Paul, who has a hell of a lot more experience and knowledge in this area, and was based on his time with the dev team and corruptions he's worked on since. I'll take his probabilistic statement as fact. I certainly have no grounds to disagree with his statement.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Update:

    There was windows patching of the VM occurring at the time of the issue. The Virtual Disk Service was flickering between stopped and started. I "guess" this could cause an issue if it was flicking between started and stopped when sql was trying to write to the tran log on a LUN controlled by the Virtual Disk Service.

    Scans of the disks within the VM VHD's and physical storage have shown no issues. No corruption since the issue also.

    Im inclined to think patching caused this issue, but am still looking for hard evidence.

  • Ah so its a virtual machine, this wasnt made known before. Are these virtual hard disks or raw device mappings?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply