Document storage

  • Hi all,

    Design question for you, if I may.

    Suppose you had a SQL Server 2019 database storing documents in varbinary format alongside other data. Let's say the documents table is 100GB in size, but we predict that to increase rapidly over the coming years to over 1TB, and perhaps even beyond that.

    I'm looking at how best to store these and I wondered what you thought. My ideas so far (I have free rein to do it whichever way is best):

    1. Keep it in the SQL database and partition the table by something to aid performance (perhaps date, perhaps something else) - see the sketch after this list
    2. Azure Cosmos DB, but this may be massive overkill and would need some more research
    3. Azure BLOB storage; this will scale, but I may have issues searching for data
    4. Some NoSQL database, perhaps MongoDB or CrateDB
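
    To show what I mean by option 1, something like the below is the shape I have in mind (the table, column names, and boundary values are just placeholders, not our real schema):

        -- Hypothetical date-based partitioning for a documents table.
        CREATE PARTITION FUNCTION pfDocsByYear (date)
        AS RANGE RIGHT FOR VALUES ('2021-01-01', '2022-01-01', '2023-01-01');
        GO
        CREATE PARTITION SCHEME psDocsByYear
        AS PARTITION pfDocsByYear ALL TO ([PRIMARY]); -- or one filegroup per range
        GO
        CREATE TABLE dbo.Documents
        (
            DocumentId  int IDENTITY(1,1) NOT NULL,
            CreatedDate date              NOT NULL,
            Content     varbinary(max)    NOT NULL,
            -- partition column must be part of the clustered key
            CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (CreatedDate, DocumentId)
        ) ON psDocsByYear (CreatedDate);

    Partition elimination would then help date-ranged queries, and old partitions could be switched out for archiving.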

    So what do you think?

    Thanks,

    Nic

    Personally, I'd look at FILESTREAM and FileTables: everything is in the database, but separate in a way, as it's on its own filegroup etc.

    Backups, restores, etc. all flow as normal.

    Obviously factor in storage.

    I haven't tried it with Stretch Database yet to offload cold/lukewarm data to Azure, if that's needed.
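
    Roughly, the FILESTREAM setup looks something like this (the database name and path are placeholders, and FILESTREAM has to be enabled at the instance level first):

        -- Assumes FILESTREAM is enabled on the instance; names are placeholders.
        ALTER DATABASE DocsDb ADD FILEGROUP fgDocs CONTAINS FILESTREAM;
        GO
        ALTER DATABASE DocsDb
        ADD FILE (NAME = 'DocsFs', FILENAME = 'D:\SQLData\DocsFs')
        TO FILEGROUP fgDocs;
        GO
        CREATE TABLE dbo.DocumentStore
        (
            -- FILESTREAM requires a unique ROWGUIDCOL on the table
            DocumentId uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
            FileName   nvarchar(260)    NOT NULL,
            Content    varbinary(max)   FILESTREAM NULL
        ) FILESTREAM_ON fgDocs;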

    BLOB storage could work: if you store the metadata about the files in the DB, you then search the DB for the right pointer to the storage container to pull the document back. The same could be said for on-prem too; it's all about the design phase, as long as you're not searching data within the varbinary itself.
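
    As a rough sketch of that pointer pattern (the column names and URI format are just illustrative):

        -- Metadata lives in SQL; the bytes live in the storage container.
        CREATE TABLE dbo.DocumentMetadata
        (
            DocumentId  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
            CustomerId  int            NOT NULL,
            FileName    nvarchar(260)  NOT NULL,
            ContentType nvarchar(100)  NOT NULL,
            UploadedAt  datetime2(0)   NOT NULL DEFAULT SYSUTCDATETIME(),
            BlobUri     nvarchar(400)  NOT NULL -- pointer into the container
        );

        -- Search the metadata, then let the app fetch the blob by its URI.
        SELECT BlobUri
        FROM dbo.DocumentMetadata
        WHERE CustomerId = 42;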

  • Honestly, for a few documents, probably filestream. For lots of documents as you're describing, don't use SQL Server. It's not going to do a good job. Determine if you need a literal document storage engine like MongoDB or a search engine like ElasticSearch and use one of those. You'll be much, much happier. Yeah, you'll need to set up all sorts of extended searches & what not, but there's technology for that today.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • My question would be... how would you be accessing the documents and what would be the search criteria for finding a given document?  Also, are you going to be searching the content of the document based on the type of document it is?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you for the responses.

    To pick up on Jeff's point, I agree usage is going to be key, and I have a meeting with the project team to establish the usage, where I intend to ask:

    1. What sort of documents are we storing? All I've had so far is the word "documents"; for all I know they could be in the same formats as we use now (PDF, CSV, etc.), but they could also mean JSON documents, which is obviously a very different path from a physical document
    2. How will they be accessing them? I suspect it will be in small batches, such as "give me the documents for customer A"
    3. How will they be importing them? Do they need to be versioned?
    4. How long do they need to keep them 'hot' and accessible? Can I archive them?
    5. What search capabilities do they need for the documents?

    And a whole host of other ones. The current db that is doing this is SQL Server 2019 on a VM. I'd like to move this to Azure SQL DB, but I think (and I may be wrong) that if I were to utilize FILESTREAM rather than VARBINARY (as it does now), then this would need to be done via a managed instance.

    Decisions, decisions but first let me get answers.

    Thanks again for the responses.

    Nic

    Several decades ago I put in a document database (text base) for Southern California Edison. The project was to build a document base for the legal department that handles their union contracts. I can't remember after all this time, but they had at least a dozen different unions, federal and state regulations, and a whole bunch of legal documents like you would not believe. Their company lawyers needed to research such wonderfully interesting things as "what constitutes the equivalent of a restaurant meal to a power line worker in a line shack in the middle of Northern California, a hundred miles from any kind of civilization, as guaranteed by their last union negotiations?" Yes, that was an actual search condition the lawyers had to run! My opinion was they should have just said, "buy Wolfgang Puck instead of TV dinners!" But I am not a labor attorney.

    They hired me because they expected me to write SQL on an IBM mainframe in DB2. This was completely the wrong choice for the project, so I didn't do it. Instead of spending $50,000+ per month, I made them get a copy of a product called Zyindex, burn the documents onto LaserDiscs, and get a document scanner. The advantage is that a LaserDisc is a write-once medium that is virtually indestructible, reads really fast, and is cheap. Document scanning can be done by day labor, and many of the things they needed were also available in electronic formats from the federal and state governments. Just hire a bunch of college kids to scan them.

    I do not know if Zyindex is still the "gold standard" of document processing, but you can start researching. Documents are controlled by the ANSI Z39 group, not the old ANSI X3H2 database group. They have their own standards and their own tools; they are a separate industry.

    If you want to get a good idea of the difference between the RDBMS and the text-base models of data, try to write an SQL query for those TV dinners when you didn't have anything like it in your database to start with. 🙂

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Hi,

    So after some discussion, it turns out that our documents are going to be about 5-10MB in size each, on average. The Microsoft guidance on FILESTREAM is "Objects that are being stored are, on average, larger than 1 MB"

    and

    "For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.".

    So the issue I have here is that we wanted this to be in Azure SQL, but Azure SQL (managed instance, elastic pool, or single database) doesn't support FILESTREAM.

    Turning this back round, I asked the business how the current 250k documents they have in the db perform, and they said absolutely fine, no issues.

    Given this is on a db with no partitioning or such, I'm inclined to do one of two things now:

    1. Use VARBINARY(MAX) in an Azure SQL Database - see the sketch after this list
    2. Set up separate BLOB storage in Azure and have the application guys manage the documents
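
    For option 1, the table itself would stay simple and work unchanged in an Azure SQL Database; a placeholder sketch (names are illustrative, not our schema):

        CREATE TABLE dbo.CustomerDocuments
        (
            DocumentId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
            CustomerId int            NOT NULL,
            FileName   nvarchar(260)  NOT NULL,
            Content    varbinary(max) NOT NULL
        );

        -- The expected access pattern: "give me the documents for customer A".
        SELECT DocumentId, FileName, Content
        FROM dbo.CustomerDocuments
        WHERE CustomerId = 42;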

    I'm going to chat it over with them and then we'll proceed down one of these paths.

    Thanks for taking the time to read, consider and respond.

    Nic
