Indexing strategy for very long text strings

  • Hey guys, just looking to see if anyone has any bright ideas on how to solve this problem.

    Background

    I have been handed a project to report on files that have gone 'missing'. Ignoring the fact that we are using the wrong tools for the job...

    I have a process which effectively writes the results of a DIR/LS command to the database each night. I am retaining 4 months' worth of data as this table currently supports another report. The table contains the following fields:

    Full Directory Path, FileName, Last Write Date, File Size, Snapshot date.

    With 4 months worth of data, some files are in the table up to 120 times with the only difference being the snapshot date (but sometimes the path will also change).  The table contains approx 36M records.
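
    For reference, the raw table is essentially this (column names taken from the query further down; the data types and lengths here are guesses on my part):

    CREATE TABLE dbo.DocumentTree
    (
        FullPath      NVARCHAR(1024) NOT NULL,  -- full directory path
        NameOfFile    NVARCHAR(400)  NOT NULL,
        LastWriteDate DATETIME2(0)   NOT NULL,
        SizeOfFile    BIGINT         NOT NULL,
        DateOfExtract DATE           NOT NULL   -- snapshot date
    );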

    I am busy breaking it down into 3rd Normal Form:

    Table of distinct file names

    Table of distinct directory Paths

    Table of distinct Path and File name combinations

    Table of distinct Directory Path Fragments and a DirectoryPathID-to-Fragment map table (using Ds8K to build both the unique fragment parts and the map): part of what I need to do is 'permit' movement of files within the directory hierarchy.

    Table of Directory History which takes the DirectoryIndex ID (FilenameID, PathID) and writes away the last update date and size of the file.
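
    In rough DDL terms the target looks like this (a sketch only - the real types and lengths differ, and the fragment tables are omitted):

    -- lookup tables: one row per distinct value, deduplicated by the nightly load
    CREATE TABLE DEV.DirectoryPath
    (
        ID       INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        FullPath NVARCHAR(1024)    NOT NULL
    );

    CREATE TABLE DEV.DirectoryNameOfFile
    (
        ID         INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        NameOfFile NVARCHAR(400)     NOT NULL
    );

    -- one row per distinct path + file name combination
    CREATE TABLE DEV.DirectoryIndex
    (
        ID                     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        DirectoryPath_ID       INT NOT NULL REFERENCES DEV.DirectoryPath (ID),
        DirectoryNameOfFile_ID INT NOT NULL REFERENCES DEV.DirectoryNameOfFile (ID)
    );

    -- one row per file per snapshot: just the bits that change
    CREATE TABLE DEV.DirectoryHistory
    (
        ID                INT          IDENTITY(1,1) NOT NULL PRIMARY KEY,
        DirectoryIndex_ID INT          NOT NULL REFERENCES DEV.DirectoryIndex (ID),
        DateOfExtract     DATE         NOT NULL,
        LastWriteDate     DATETIME2(0) NOT NULL,
        SizeOfFile        BIGINT       NOT NULL
    );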

    I build the data in these tables incrementally every night, as I can limit the logic to looking at a specific snapshot date; however, validating that the Directory History is complete is painful and takes over 20 minutes:

    SELECT  IDX.ID,
            DT.DateOfExtract,
            DT.LastWriteDate,
            DT.SizeOfFile
    FROM    DocumentTree AS DT WITH (NOLOCK)
    JOIN    DEV.DirectoryPath AS DP
            ON DP.FullPath = DT.FullPath
    JOIN    DEV.DirectoryNameOfFile AS NOF
            ON NOF.NameOfFile = DT.NameOfFile
    JOIN    DEV.DirectoryIndex AS IDX
            ON  IDX.DirectoryPath_ID = DP.ID
            AND IDX.DirectoryNameOfFile_ID = NOF.ID
    LEFT JOIN DEV.DirectoryHistory AS HIST
            ON  HIST.DirectoryIndex_ID = IDX.ID
            AND HIST.DateOfExtract = DT.DateOfExtract
    WHERE   HIST.ID IS NULL;

    I already have an index on DocumentTree for FullPath and Filename, and another on DateOfExtract that has all of the other fields as included (covering) columns.

    The Query Optimiser is suggesting an index on FullPathName with LastWriteDate, NameOfFile, SizeOfFile and DateOfExtract as included fields. This is effectively another duplicate of the file data. Indexes already make up over 19GB and the server is extremely limited on disk space.

    I'm looking for suggestions on how to properly manage this table and try to drive better performance.  I am currently thinking:

    1) Drop the current index by path and filename and create a clustered index instead (none of the columns can be NULL, and together FilePath, Filename and DateOfExtract are unique). However, the index will become extremely fragmented because the nightly load inserts into pages right across the index, so I am going to need a really low fill factor, which will likely lead to even more disk space issues.

    2) Have another process that runs at the end of every night's procedure and adds the FilePath ID and FileName ID to the DocumentTree records that have just been processed. This will be painful as it will require a table scan on 36M records, but at least I only have to do it once and it can happen out of hours. The only real reason for having the indexes by Path and Filename is to enable me to look up the IDs in the normalised tables anyway. (Rough sketches of both options below.)
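
    For reference, the two options would look roughly like this - the fill factor is just a placeholder, and option 2 assumes the two ID columns have been added to DocumentTree and that @LatestSnapshot holds the snapshot date just loaded:

    -- Option 1: unique clustered index on the natural key
    CREATE UNIQUE CLUSTERED INDEX CIX_DocumentTree
        ON dbo.DocumentTree (FullPath, NameOfFile, DateOfExtract)
        WITH (FILLFACTOR = 70, SORT_IN_TEMPDB = ON);

    -- Option 2: back-populate the lookup IDs for the snapshot just processed
    DECLARE @LatestSnapshot DATE = CONVERT(DATE, GETDATE());  -- or however the latest snapshot is identified

    UPDATE DT
    SET    DT.DirectoryPath_ID       = DP.ID,
           DT.DirectoryNameOfFile_ID = NOF.ID
    FROM   dbo.DocumentTree        AS DT
    JOIN   DEV.DirectoryPath       AS DP  ON DP.FullPath    = DT.FullPath
    JOIN   DEV.DirectoryNameOfFile AS NOF ON NOF.NameOfFile = DT.NameOfFile
    WHERE  DT.DateOfExtract = @LatestSnapshot
       AND DT.DirectoryPath_ID IS NULL;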

    Eventually I want to be able to drop the DocumentTree down to just a few days (e.g. as a staging table) and rebuild the logic in a view.

    Can I have votes for 1) or 2), or suggestions for another way?

    For LOLs I have suggested that they look at a) a proper document management solution or b) GitHub, as this would allow us both to track the movement of the files and to recover 'lost' data since it is in the repo, but the organisation is not particularly technical and I am a contractor. They know and understand T-SQL, so to the hammer wielder, everything is a nail.

    The database is currently on SQL Server 2012 and will be moving to vNext in Q2, but will still be on-prem.

  • What I would suggest is adding a HASHBYTES (SHA-1) hash for the document name, and another for the full path, and using those in both the incoming and permanent tables to do your joins. While it adds a bit of space overhead (20 bytes per hash), the lookup is faster (and less memory is required for the query).

    Also ensure you are using data compression - again, for both space and memory savings.
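
    Something along these lines (the column and index names are just for illustration, and note that on 2012 data compression needs Enterprise Edition):

    -- narrow, deterministic proxies for the long strings (20 bytes each)
    ALTER TABLE dbo.DocumentTree
        ADD FullPathHash   AS CAST(HASHBYTES('SHA1', FullPath)   AS BINARY(20)) PERSISTED,
            NameOfFileHash AS CAST(HASHBYTES('SHA1', NameOfFile) AS BINARY(20)) PERSISTED;

    -- join/look up on the hashes instead of the full strings
    CREATE INDEX IX_DocumentTree_FullPathHash   ON dbo.DocumentTree (FullPathHash);
    CREATE INDEX IX_DocumentTree_NameOfFileHash ON dbo.DocumentTree (NameOfFileHash);

    -- page compression for both space and memory savings
    ALTER TABLE dbo.DocumentTree REBUILD WITH (DATA_COMPRESSION = PAGE);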

  • Interesting idea.

    The hash would effectively act as a proxy for the DirectoryFullPath.ID and DirectoryNameOfFile.ID, and apart from the fact that they are calculated on the base data rather than added post-processing, why would that be advantageous? Once I have the FullPath and NameOfFile indexed in their lookup tables and back-populated to the DocumentTree, I don't actually need the text representation or the hash. Indexing on the hash is going to cause more page splits because the values are effectively random, although the record size will be smaller so I should get more records into an index page.

  • I would think you would want four ids:

    drive

    full path (other than drive, of course)

    filename

    extension

    When new data comes in, you'd pre-add any new ids before loading the main table. That way you are never "going back" to assign the ids; they are assigned before the initial INSERT and loaded as part of that INSERT.
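
    Something like this as the pre-load step (table names are just illustrative, #Incoming stands for the nightly staging data, and only two of the four ids are shown - the same pattern applies to drive and extension):

    -- 1) add any values we haven't seen before to the lookup tables
    INSERT INTO DEV.DirectoryPath (FullPath)
    SELECT DISTINCT I.FullPath
    FROM   #Incoming AS I
    WHERE  NOT EXISTS (SELECT 1 FROM DEV.DirectoryPath AS DP WHERE DP.FullPath = I.FullPath);

    INSERT INTO DEV.DirectoryNameOfFile (NameOfFile)
    SELECT DISTINCT I.NameOfFile
    FROM   #Incoming AS I
    WHERE  NOT EXISTS (SELECT 1 FROM DEV.DirectoryNameOfFile AS N WHERE N.NameOfFile = I.NameOfFile);

    -- 2) load the main table with the ids already resolved
    INSERT INTO dbo.DocumentTree (DirectoryPath_ID, DirectoryNameOfFile_ID, LastWriteDate, SizeOfFile, DateOfExtract)
    SELECT DP.ID, N.ID, I.LastWriteDate, I.SizeOfFile, I.DateOfExtract
    FROM   #Incoming AS I
    JOIN   DEV.DirectoryPath       AS DP ON DP.FullPath  = I.FullPath
    JOIN   DEV.DirectoryNameOfFile AS N  ON N.NameOfFile = I.NameOfFile;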

  • One thing to point out about the hashing approach and indexing on it, though: is that index useful? You are likely not searching by the hash and are also not likely sorting by it. The ONLY benefit the hash gives you (that I am aware of) is making finding duplicates faster.

    The approach I would use is to do a bit of normalizing. The directory path (for example) is going to be a small list of repeated values, so put them in their own table and link things back by ID. That'd be a quick win in terms of space saving. File names, if there are a lot of unique ones, may be less useful to normalize, but doing so still has benefits.

    I'd also be hesitant to use NOLOCK unless you absolutely have to, as it reads uncommitted data and can therefore return bad data or miss data entirely. Are you SURE you need the NOLOCK hint?
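
    If the hint is only there to avoid blocking the nightly load, one option worth testing is read-committed snapshot at the database level (YourDb is a placeholder for the real name):

    -- readers see the last committed version instead of uncommitted data,
    -- and stop blocking/being blocked by writers (uses the tempdb version store)
    ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;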

    BUT I think the first step is to define what problem you are trying to solve and how you will identify that it is in fact solved. The tables you are talking about creating (or have created) sound to me like they are mostly just different ways of presenting the raw data. Is there a reason you didn't use views or stored procedures for this instead of duplicating the data across multiple tables? Views can be indexed and stored procedures can use the table indexes.

    Also, I am very surprised that such a narrow table (5 columns: two dates which are not going to be that large, one int which isn't that big, and two VARCHARs which probably are not going to be THAT big) with a fairly modest row count (36M rows is not THAT big, and I doubt you are using 8KB per row) would have 19GB worth of indexes... that sounds like something is wrong with your indexes. Do you have a lot of duplicate or unused indexes? If this is strictly a space-saving exercise, cleaning up unused and duplicate indexes to get that 19GB down is where I'd start.

    As a completely random guess, I would say you are likely using under 1KB per row, so roughly 36,000,000 KB in total, which works out to around 34-36GB for the table. Having 19GB of indexes on a table that size is crazy - the indexes would be more than half the size of the data itself. Now, that is me making some assumptions, and I can't really see more than 1KB of data per row from what you described... maybe I am wrong (I've been wrong before), but I'd be curious what your data-to-index ratio is. If the index size is near or over 50% of the space used, I'd be re-evaluating which indexes are actually needed...
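
    To check the ratio and spot dead weight, something like this against the DMVs will do (adjust the schema/table name to suit; usage stats reset when the instance restarts, so treat them with caution):

    SELECT    i.name                             AS IndexName,
              i.type_desc                        AS IndexType,
              SUM(ps.used_page_count) * 8 / 1024 AS SizeMB,
              MAX(us.user_seeks)                 AS UserSeeks,
              MAX(us.user_scans)                 AS UserScans,
              MAX(us.user_lookups)               AS UserLookups,
              MAX(us.user_updates)               AS UserUpdates
    FROM      sys.indexes                 AS i
    JOIN      sys.dm_db_partition_stats   AS ps ON ps.object_id = i.object_id
                                               AND ps.index_id  = i.index_id
    LEFT JOIN sys.dm_db_index_usage_stats AS us ON us.object_id = i.object_id
                                               AND us.index_id  = i.index_id
                                               AND us.database_id = DB_ID()
    WHERE     i.object_id = OBJECT_ID('dbo.DocumentTree')
    GROUP BY  i.name, i.type_desc
    ORDER BY  SizeMB DESC;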

    ALSO, when SQL recommends that you add an index, it is just saying that the ONE query you just ran would probably benefit from that index. It is not saying that it has analyzed the workload on the table and that the index will provide across-the-board cost savings. EACH index you add results in SOME data duplication AND a performance impact on every INSERT, UPDATE, and DELETE, and sometimes even on SELECT statements.

    Another suggestion - do you have a clustered index right now or is it just a heap? IF you are doing mostly inserts and SELECTs are rare, then a HEAP may be a better option in your case in terms of performance. BUT if you do need a clustered index, it will contain ALL of the columns by definition, but will be sorted on disk based on the columns you add in the order you add them. SO if your primary searches are done by the file path, then have that first. If filename is more common, then do that first. If date is the most common, put that one first.

    Another thought - do they care about the historical data? What I mean is do they care if file ABC was modified Jan 1st, Jan 10th, Jan 25th OR do they only care about the LAST updated time? If it is just the last update, then doing an INSERT for new files and an UPDATE on the modified ones will likely save a bunch of disk space.
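
    If it is just the last update they care about, the nightly load collapses to an upsert - MERGE shown below for brevity, but a separate UPDATE plus INSERT works just as well and is often easier to tune (the table names here are only illustrative):

    MERGE dbo.CurrentFiles AS tgt
    USING #Incoming        AS src
          ON  tgt.FullPath   = src.FullPath
          AND tgt.NameOfFile = src.NameOfFile
    WHEN MATCHED AND (tgt.LastWriteDate <> src.LastWriteDate OR tgt.SizeOfFile <> src.SizeOfFile) THEN
        UPDATE SET tgt.LastWriteDate = src.LastWriteDate,
                   tgt.SizeOfFile    = src.SizeOfFile,
                   tgt.DateOfExtract = src.DateOfExtract
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (FullPath, NameOfFile, LastWriteDate, SizeOfFile, DateOfExtract)
        VALUES (src.FullPath, src.NameOfFile, src.LastWriteDate, src.SizeOfFile, src.DateOfExtract);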

    Another question - do you NEED to normalize the data? Generally, normalizing is a compromise between storage and performance. It removes data duplication at the column level and can offer faster lookups IF you are comparing integers, since numeric comparison is MUCH faster than string comparison. BUT if your searches are frequently against the strings (ie file name or file path), normalizing isn't going to help performance any.

    So to summarize my above suggestions, to save disk space - clean up unused and duplicate indexes. To improve performance, determine the interesting data (what do the users care about and what is OK to remove) and remove the uninteresting data; also create views/stored procedures to get the data out rather than duplicating it across multiple tables.

    Now, like you said, I would still recommend a document control system of some sort. Git has some caveats around non-text files (ie binary files), so if you throw in Word or Excel documents for example, you end up with a LOT of data duplication as if the file is updated, git stores the whole file multiple times rather than just what has changed. Getting a tool like Sharepoint or Alfresco (2 that I have helped implement, there are others out there. Sharepoint is nice as it is quick and easy to set up and maintain; Alfresco is a bit more work but has a "free" tier last time I implemented it) is going to be a nicer long term solution as the database solution feels like a screw that you are trying to make fit by hitting it with a  hammer. It may work, but you are going to have problems down the line...

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I think you are better off staying with 1NF and then implementing a status and status-change date for historical locations of files; use filtered indexes for both the active and the historical file locations.

    You should be able to use page-level compression without risk of performance issues, since any change would only update the fixed-width columns related to status.
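
    A sketch of what I mean, assuming a Status flag ('A' = active location, 'H' = historical) and a StatusChangeDate column (the names are illustrative):

    -- small indexes that only cover the rows each type of query touches
    CREATE INDEX IX_DocumentTree_Active
        ON dbo.DocumentTree (FullPath, NameOfFile)
        WHERE Status = 'A';

    CREATE INDEX IX_DocumentTree_Historical
        ON dbo.DocumentTree (FullPath, NameOfFile, StatusChangeDate)
        WHERE Status = 'H';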

  • I would just leave it all in one table.

    How many rows has this table got?

  • ScottPletcher wrote:

    I would think you would want four ids:

    drive

    full path (other than drive, of course)

    filename

    extension

    When new data comes in, you'd pre-add any new ids before loading the main table.  That way you are never "going back" to assign the ids, they are assigned during the initial INSERT.

    Just to be clear, the main table would then have only the Ids and not the strings. This will both shorten the table and make lookups more efficient. You could then (page) compress the full path and filename tables if you wanted to. That usually adds more efficiency.

    Encoding the strings using Ids also makes looking up (partial) path(s) easier, and makes changing the drive and/or path and/or filename vastly easier.
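
    For example, a partial-path search only touches the (comparatively tiny) path table once and then joins on integers - the path and column names below are just an example, assuming the main table carries the two ids:

    -- everything currently recorded under a given directory subtree
    SELECT  N.NameOfFile, DT.DateOfExtract, DT.LastWriteDate, DT.SizeOfFile
    FROM    DEV.DirectoryPath       AS DP
    JOIN    dbo.DocumentTree        AS DT ON DT.DirectoryPath_ID = DP.ID
    JOIN    DEV.DirectoryNameOfFile AS N  ON N.ID = DT.DirectoryNameOfFile_ID
    WHERE   DP.FullPath LIKE 'D:\Projects\%';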
