Designing Persistence Storage for a new Project in Azure

  • Can someone share from their experience on various design options pros / cons, please?

    The project:

    1. In Azure environment (given, cannot be changed) there is a need to store many various types files: video, voice, images, etc.
    2. The expected annual volume of data (provided to the team as guesstimate) is circa 500GB [Edited from MB] per annum.
    3. The files are normally grouped against given ID and may contain metadata (provided in JSON form, cannot be changed).

      Normally means there are cases the files will be uploaded on their own and later own either linked (manually) or removed

    4. It is assumed 20% of the data can be queried within a short amount of time (hours), and 80% - sporadically, from time to time.
    5. The design must support multiple search criteria: duration, date, type of the file, etc. Circa 100 (!) different criteria. Cannot be negotiated.

     

    The proposal would be:

    • Store the the files in Azure BLOB Storage
    • Store the metadata JSON object with some additional, normalised information (created on / by / etc.) in Azure SQL DB (e.g. DB_data)
    • Store the auditing in a separate Azure SQL DB (e.g., DB_audit)
    • Also store the JSON in Azure ElasticSearch for better performance of all the queries, based on various, multiple search criteria.

     

    What are your thoughts, please?

     

     

  • As a different approach, any chance you could spin up SharePoint? It solves most of your problems as it allows for file storage, quick searching (no need for ElasticSearch), auditing (no need for special SQL DB), metadata can be stored along with the files (created on/by, modified on/by, approved on/by, etc). PLUS you get file versioning, file history, and a "recycle bin" for accidental deletion. To solve the "grouping against an ID", you can just create a folder or a new "site" in SharePoint.

    It sounds to me like you are trying to create your own document control system and in my experience, it is MUCH cheaper and easier to use an out of box solution rather than trying to build it yourself. MAY be higher initial cost, but is cheaper long term cost. Build it yourself is going to be a lot of work and a lot of overhead and you are going to forever have "feature requests" that may or may not be possible and may or may not blow you out of that 500 MB per year expected growth. Plus, I've found that SOME end users expect that new features (such as "approved on/by" being added) will somehow, magically be retroactive which is 100% impossible to do.

    I strongly encourage you to look at the out of box solutions (such as SharePoint, but Alfresco is also an option that I have used... not quite as nice, but is another option), rather than build it yourself.

    BUT if you really want to build it yourself, you do run into potential issues if you are storing the data in 2 places - BLOB and ElasticSearch. What do you do if the file in BLOB doesn't match ElasticSearch? Which is your "source of truth"? I imagine you will want something to force-sync the files, but how do you determine which is the "source of truth" file and which is "safe to overwrite"? Plus, if you are expecting 500MB of data per year, duplicated to 2 locations, that means you are looking at 1 GB of data per year which, based on my personal experience, actually grows year of year in most cases. MAY not grow in your specific use case, but I've seen it grow with all of my systems when we do 5 year growth estimates. Year 1 may be 500 MB, year 2 may be 600 MB, and by year 5 we could be looking at 1.5 GB. Now this isn't that big in the grand scheme of things, it causes impact on future growth estimates. At some point, you run into disk space issues and costs grow quickly. Now, your system MAY be 100% static growth (ie you have 100 5 MB files that get uploaded per year and that will never change due to how your business works), then the 500MB estimate will be good, but if I don't have hard numbers where I can say with 100% confidence that the data won't grow year over year, I would add some extra buffer room. How much depends on how previous year growth looks like, but I like to have 2X the space reserved for growth or "surprises". So if year 1 is 500 MB disk required, I budget for 1 GB. Worst case, I wasted 500 MB of disk. Best case, we went over 500 MB and had no business impact.

    Are you planning on storing the JSON in SQL or will that only exist on disk and have a reference to the file in the DB around the metadata you are interested in? If you store the JSON in the database, you may not need the BLOB storage?

    You may also want to look at what the end user interface is going to be for uploading these files. Is it going to be a manual process or is it part of an automated process? If it is automated, is it a scheduled process or trigger based or near-real-time? If it is manual, you run the risk of incorrect uploads (duplicate uploads, missed uploads, incorrect uploads, etc).

    Another question (for you internally, doesn't really matter to me specifically) - how will users access the data when they need it? Are they logging into Azure and pulling the data as needed or are you building a custom app for it or something else? The reason this is important is that you want to make sure that the search tool you are using makes sense and can work with the solution you implement. For example, if the search is "windows search" (bad example, but just an example), I am pretty sure that has no way to connect to ElasticSearch so the solution may be "wrong" from the start.

    Some of the questions above are just to make you think about more than just the technical part of the solution design. That may be the only part YOU are involved in, but doesn't hurt to get the answers. It COULD be that nobody thought about some of that stuff before telling you what needs to be done. I've had solutions come my way and I asked some questions and the project got pushed in a new direction or cancelled completely. I've also had solutions pushed my way where I was told "this is what the company bought, so you and the team have to make this work". You may be in the second boat which is less fun to be in. BUT if Azure is the "solution" they want, then SharePoint meets the "Azure" requirement because the data is stored in Azure if I am not mistaken. Plus, SharePoint licenses give you a ton of storage. You'd need 1 license per user and I think it is 1 TB storage/user licensed if I remember right, plus a chunk (10 TB I think?) for getting the SharePoint license to begin with. Not 100% sure on those numbers, but they should be more than enough for your data growth.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thank you for such a detailed reply. I have just noticed I wrote 500MB instead of 500GB - mia culpa and I apologise.

    My experience from SharePoint from two last places is less than positive, although I wasn't involved in either:

    both attempted to migrate from an old version of SharePoint to the new one and both couldn't easily do it - just abandoned the idea

    and kept two separate versions. My only experience with SharePoint as a user, hence I cannot comment on what were the issues.

    My thoughts were that Azure BLOB Storage will store the media, where SQL Server and ElasticSearch will allow me to implement CQRS with SQL simply storing all the data, including metadata, and ElasticSearch - used for the queries.

    I will read through your suggestion again, but I wonder if my mistake between MB and GB will change your approach.

    Much obliged.

     

  • If you are looking at Blob Storage you might also like to consider Object Drift and Corruption. We were looking at something similar last year but it got abandoned for more pressing problems:

    https://www.sqlservercentral.com/forums/topic/how-do-i-add-files-to-the-server/page/2#post-4318826

    I am not totally convinced about Sharepoint especially as few of our files will change. Even if they do change S3/Blob Storage have the option of versioning. If time is available, it might be best to try and prototype various options and see if any problems emerge.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply