Document Storage - Ideas and opinion

  • Hi all,
    We have a lot of documents due to the nature of our business and I'm looking at how best to store them in the future. I thought it would be good to get some outside opinions and see if anyone else had some other ideas or thoughts.
    Our options seem to;

    1. File system for the files themselves and then store links in the database.
    2. SQL Filestream - we currently have this and partition the database based on data.
    3. SQL File table - I like the idea of the drag and drop but I think we may have issues with meta data (how to capture it - i.e documents can be linked to policy records).
    4. Some form of other document store (some form of NoSQL or cloud based style db).

    My thoughts at this point are that SQL File Tables sound cool but I think I'll have issue with associating meta data to the files (policy refs etc). Option 4 again sounds cool but I know better than just doing things for the sake of it and since we already use SQL FS so why not stick with that, Its served us well this far.

    So I just wondered on what your thoughts were? Any comments or suggestions would be greatly appreciated.

    Thanks,

    Nic

  • It's something we've done at my current place.  As the files in this system are quite small, they've been stored in a SQL Server table (partitioned due to the volume - tens of millions of documents - and this works well.  Other than that I'd go Filestream, which we also have on other systems where the documents are larger (>1Mb).
    Both of these offer suitable performance and the ability to store easily queryable metadata (type of document, created, status, creator, department etc, etc).  Also, as these are critically important documents, both allow us to both secure the documents (problematic with simple file system) as well as ensure data loss is minimized by having a 5 minute log backup.
    NoSQL.  IMO, fine if you're effectively using it like a Dewey box - i.e. you find the key and grab the whole book.  If you're going to be using SQL like interfaces to shred content for searches, to my mind you've lost the plot and are using the wrong solution completely.  Not what that tech is for, and kludging a mess to do it still means it's wrong.  Also, any solution of this type, make sure to investigate your recovery and potential dataloss in the event something goes wrong.  Look at reputable independent research - most of these solutions are prone to potentially significant loss when things go pecs up - and they're not going to highlight this.  Not a problem for certain solutions (losing 10000 tweets is neither here nor there, 15 notes in a clinical oncology system certainly is), it depends on how important your data is.  Security can be iffy too.
    Storing in a directory - don't go there.  Security is a mare, and if something goes wrong - when was that directory last backed up - how do you know if it's even running, do you have any visibility at all there, how long does it take?  How much data do you want to lose here.  Any personally identifiable data and your security and compliance is a nightmare.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • Hi Andrew,

    Thanks for taking the time to reply, I really appreciate it.
    I think based on a variety of things our best bet is to stick with SQL FileStream. Its worked this far, I understand it and can ensure its backed up and secure etc.
    Thanks again,
    Nic

  • Glad it was helpful.
    Just in case you're not familiar with setting up filestreams to ensure best performance, the following link
    Considerations for Filestream Configuration may be of help.  However, regarding the following advice
    " If planning to use a GUID, use the NEWSEQUENTIALID() instead of NEWID()" my advice would be, Just Don't.

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply