Files vs. Managed Tables - how is the data stored physically?

  • Hello,

    I am quite new to ADL and U-SQL. I went through quite a lot of documentation and presentations, but I am afraid I am still missing many answers.

    Simplifying a bit, I have a large set of data (with daily increments), but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility of doing a cross-client analysis (a much less common scenario). I am aware of the importance of correctly partitioning this data (keeping one client's data together probably makes the most sense). I was looking into two scenarios:

    1. I will partition the data myself by splitting the files into a folder-file structure where I have full control over how it is done, how big the files are, etc. (see the sketch after this list)
    2. I will use managed tables and set up table partitioning
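
    To make scenario #1 a bit more concrete, this is roughly what I have in mind (paths and names are just illustrative, so treat it as a sketch rather than my actual script): one file per client, read back through a file set, so that a single-client report only has to touch that client's file.

        // Scenario #1 sketch: one file per client under /processed, read back
        // through a file set. {ClientId} in the path becomes a virtual column.
        @clients =
            EXTRACT ClientId string,      // virtual column taken from the file name
                    EventDate DateTime,
                    Amount double
            FROM "/processed/clients/{ClientId}.csv"
            USING Extractors.Csv();

        // One report = one client: ideally only ClientA.csv has to be read here.
        @oneClient =
            SELECT EventDate, Amount
            FROM @clients
            WHERE ClientId == "ClientA";

        OUTPUT @oneClient
        TO "/reports/ClientA.csv"
        USING Outputters.Csv();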

    I am now looking into pros and cons of both scenarios. Some things that come to my mind are:

    1. The ability to compress data in scenario #1 (at the cost of performance of course)
    2. The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)
    3. On the other hand, using tables is much more convenient, as I will be dealing with just one data source and will not have to worry about extracting the correct files - only about the correct filters in a query; in theory U-SQL should do the rest (see the query sketch after this list)
    4. I would expect the tables to offer better performance
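
    For point 3, what I have in mind is something like the query below (table and column names are made up; the DDL sketch for dbo.ClientData is further down). I would only change the filter value from report to report and rely on U-SQL to pick the right partition.

        // Scenario #2 sketch: the same report against a managed table.
        // Only the filter value changes from report to report, and ideally
        // U-SQL only reads the partition for that one client.
        @report =
            SELECT EventDate, Amount
            FROM dbo.ClientData
            WHERE ClientId == "ClientA";

        OUTPUT @report
        TO "/reports/ClientA.csv"
        USING Outputters.Csv();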

    One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):

    First we can read that: 

    "U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?

    Later we can read that:

    "..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."

    Now this makes things more complicated. If I read it correctly, this means that if I never rebuild a table, my data will be stored physically in exactly the same way as the original raw files, and thus I will experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was still not able to match the files with partitions, as the GUIDs were different (the .ss files in the store had different GUIDs than the partitions in the U-SQL catalog views), so this is just my guess.
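
    In other words, my reading of the quoted sentence is that a daily load pattern like the one below (illustrative names again, same made-up table as above) keeps adding files until the table is rebuilt:

        // Day 1 job:
        @day1 =
            EXTRACT ClientId string, EventDate DateTime, Amount double
            FROM "/raw/2018/01/01/daily.csv"
            USING Extractors.Csv();

        INSERT INTO dbo.ClientData (ClientId, EventDate, Amount)
        SELECT ClientId, EventDate, Amount FROM @day1;

        // Day 2 job: another INSERT, i.e. another file per touched partition.
        @day2 =
            EXTRACT ClientId string, EventDate DateTime, Amount double
            FROM "/raw/2018/01/02/daily.csv"
            USING Extractors.Csv();

        INSERT INTO dbo.ClientData (ClientId, EventDate, Amount)
        SELECT ClientId, EventDate, Amount FROM @day2;

        // Occasional maintenance job: compact the files back together.
        ALTER TABLE dbo.ClientData REBUILD;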

    Therefore I have several questions:

    1. Is there some documentation explaining in more detail how TABLE REBUILD works?
    2. What is the performance of TABLE REBUILD? Will it work better than my idea of appending (EXTRACT -> UNION ALL -> OUTPUT) only the files that need to be appended?
    3. How can I monitor the size of my partitions? In my case (running locally, I have not checked it online yet) the GUIDs of the files and partitions in the store do not match even after a REBUILD (they do match for the database, schema and table)
    4. Is there any documentation explaining the .ss files in more detail?
    5. Which of the scenarios would you take and why?

    Many thanks for your help,

    Jakub

  • What I do is store data files down to the YYYY/MM/DD hierarchy. I also create two main splits, RAW and PROCESSED. The RAW structure is for all the data you ingest. PROCESSED has the same layout, but holds the data that has been processed using Azure Data Lake Analytics. Remember, data sits in Azure Data Lake Storage first and is processed later with Analytics using U-SQL. These are two entirely separate services that you pay for separately, even if they feel like one.
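
    As a rough sketch of what I mean (paths are just examples), this reads one day out of the RAW zone and writes the result into the matching PROCESSED folder; the {date:...} parts of the path become a virtual column:

        @raw =
            EXTRACT ClientId string,
                    EventDate DateTime,
                    Amount double,
                    date DateTime   // virtual column built from the folder names
            FROM "/RAW/{date:yyyy}/{date:MM}/{date:dd}/{*}.csv"
            USING Extractors.Csv();

        @oneDay =
            SELECT ClientId, EventDate, Amount
            FROM @raw
            WHERE date == new DateTime(2018, 1, 15);

        // Write the processed result into the matching PROCESSED folder.
        OUTPUT @oneDay
        TO "/PROCESSED/2018/01/15/daily.csv"
        USING Outputters.Csv();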

    The reasoning for that partition breakout is that, for a group of files, U-SQL reads every file in that directory regardless of any filters in the statement. The same is true for single files, such as when you build a single table off one file. Therefore, to control the size of that single file, you have to constantly rebuild it, either using Analytics or some other means like a third-party application or an ETL process that gets data into Storage. There is no INSERT for these files, nor any appending to them. You have to rebuild them, as far as I know.

    Now, when it comes to the physical tables you are referring to: I assume the INSERT adds a new physical file because, again, you cannot INSERT into an existing document, regardless of whether they give you the option to build a table or not. The system will simply make another document for that table for each INSERT. Instead of being a single-document table, it becomes a multi-document table. Say 20 CSV files attached to one table, where when you READ from that table, it's like reading all 20 CSV files.

    The reason it likely tells you that you can ALTER REBUILD is in case you want to rebuild that single document yourself and then execute that command to ensure the table re-reads the new document you just built. It will re-read the entire document. That means, if you started off with a 1 million record CSV file and then rebuilt that file using Analytics so it now has 2 million records, when you execute that ALTER REBUILD command it will read the first million it already read PLUS the additional 1 million records you just added.

    If it were me, I would steer clear of re-reading what you have already read. The whole purpose of going to the YYYY/MM/DD structure is to help reduce re-reads. You are charged for every read and transaction in Storage. ALTER REBUILD will likely cost the business more, because it has to re-read everything you put into that single document every time it's updated, versus letting it create a new document for every INSERT. At least that is my understanding.

    Personally, I do not use this feature yet. I am using EXTERNAL TABLES, because Analytics still ensures the document behind the table is consistent, as opposed to a managed table. Then I pull the data into Azure SQL Data Warehouse using PolyBase, where it is read from the EXTERNAL TABLE into a PHYSICAL TABLE within a proper data warehouse. I do not think MANAGED TABLES can be read into Azure SQL Data Warehouse or Azure SQL Database the way EXTERNAL TABLES can. Ensuring there is a proper data warehouse on top of the data lake was my goal.

  • Looks like this site has a good write-up on managed tables too. As I assumed, they all live within Azure Data Lake Analytics, which sits on top of Azure Data Lake Storage. You should be able to create schemas on these tables and so on. I assume what I mentioned before is still correct: if you rebuild the physical documents that power the table, ALTER REBUILD will be like a TRUNCATE TABLE and INSERT for the new document you made, versus allowing it to split off into multiple documents based on the partition key (or whatever you defined) when using just INSERT.

    http://www.sqlservercentral.com/articles/U-SQL/142866/

  • Many thanks for your reply.
    I did some more tests and they only made things more intriguing.

    1. I took a sample of 7 days of data
    2. I created a table partitioned by date
    3. I created 8 partitions - one for each day + one default
    4. I imported data from the 7 days - as a result, in the catalogue I got 8 files corresponding (probably) to my partitions
    5. I imported the same file once again - as a result, in the catalogue I got 16 files (1 per partition per import - the sizes of the files matched exactly)
    6. To be triple sure I did it once again and got 24 files (again 1 per partition per import, sizes match)
    7. I did the TABLE REBUILD - ended up again with 8 files (8 partitions) - makes sense
    8. I imported the file once again - ended up having 16 files (sizes don't match, so I guess I got 8 files for the partitions and 8 files for the import - 1 per partition)
    9. I did the TABLE REBUILD - ended up again with 8 files - sizes still growing - still makes sense, but... this is where it gets funny
    10. I then imported another file containing only 2 days of data
    11. I ended up with... nope, you didn't guess! - 16 files. So I got 8 files with the large partitions, 2 larger files for the 2 days in the new import, and 6 very small files
    12. Being even more intrigued I ran the TABLE REBUILD 
    13. I ended up with 8 files (one for each partition) but... they were all just recently modified

    Conclusion? If I am not mistaken, this looks like the rebuild will actually touch all my files no matter what I just inserted. If this is the case, it means that the whole scenario will become more and more expensive over time as the data grows. Could anyone please explain where I am wrong?
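
    For completeness, these are roughly the statements behind the test (run as separate steps rather than one script; names are simplified, and I may be misremembering the exact syntax of the date partition labels):

        // Date-partitioned test table (simplified).
        CREATE TABLE IF NOT EXISTS dbo.DailyData
        (
            EventDate DateTime,
            ClientId string,
            Amount double,
            INDEX cix_DailyData CLUSTERED (ClientId ASC)
            PARTITIONED BY (EventDate)
            DISTRIBUTED BY HASH (ClientId) INTO 4
        );

        // One partition per day (plus a catch-all one in my actual test).
        ALTER TABLE dbo.DailyData
        ADD IF NOT EXISTS
            PARTITION (new DateTime(2018, 1, 1)),
            PARTITION (new DateTime(2018, 1, 2)),
            // ...and so on up to...
            PARTITION (new DateTime(2018, 1, 7));

        // Repeated per test step: each run added one more file per partition.
        @week =
            EXTRACT EventDate DateTime, ClientId string, Amount double
            FROM "/raw/sample_week.csv"
            USING Extractors.Csv();

        INSERT INTO dbo.DailyData (EventDate, ClientId, Amount)
        SELECT EventDate, ClientId, Amount FROM @week;

        // The compaction step that brought it back down to 8 files.
        ALTER TABLE dbo.DailyData REBUILD;
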
  • If I'm reading this right, it makes sense. It is basically taking your single file and splitting it up into the partitions. I think where you are getting hung up is that you assume 1 partition means 1 file. It looks like it's making multi-document partitions, meaning 1 partition can have N number of files tied to it. Then when you rebuild the table, it consolidates those N files into one single file per partition. Thus, it has to read all the N files across Y partitions when you do that rebuild. It might be more expensive to rebuild it every time versus just letting it partition the data automatically into N files for you.

    I wouldn't stress too much over how it partitioned the data physically, but it is something to consider, as there seems to be a cap on how many partitions a table can have, and maybe on the file size of each partition as well. That, and whether the size and number of the files impact performance to the point where you should do rebuilds every so often.

    But other than that, it sounds like what I was thinking, minus that I thought you had to build the single file yourself. It's cool that it will do it for you with the rebuild. So, you keep inserting new data and then you can rebuild it to a single file per partition. Just test to see that, with each insert, you are still able to read all the data you inserted prior.

  • I think my abstraction with dates brought in a bit of confusion. I just used it to see how the engine handles the files.

    What I ultimately need to do is partition the data by client, not by date. The reason is that I will mostly be doing reports with all the data of one client rather than reports with the data of all clients over time. The performance should be better if I need to read just one file for a report instead of a few hundred. This is the main idea behind partitioning, as far as I understand.

    Now let's assume I've been collecting and processing my data for a year. Then I get a new file from day 365 + 1. 

    This file will most probably contain data for some new clients and some old clients, but many clients will not be there at all. If I do the partitioning "manually", I can append only the files/partitions for those clients that had new data (see the sketch below). It seems that if I have a table with partitions and rebuild it, all the files for all the clients will be processed.
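
    To illustrate the "manual" append I mean (made-up paths, one client shown), only the clients that actually appear in the new file get their data rewritten:

        // New daily drop (all clients mixed together in one file).
        @newDay =
            EXTRACT ClientId string, EventDate DateTime, Amount double
            FROM "/raw/2019/01/01/daily.csv"
            USING Extractors.Csv();

        // The existing file for one of the clients that appears in the new drop.
        @existing =
            EXTRACT ClientId string, EventDate DateTime, Amount double
            FROM "/processed/clients/ClientA.csv"
            USING Extractors.Csv();

        // "Append" = old rows + the new rows for this client, rewritten as one file.
        @appended =
            SELECT * FROM @existing
            UNION ALL
            SELECT * FROM @newDay WHERE ClientId == "ClientA";

        // Written to a new file here; swapping it in place of the old one
        // would be a separate step. Clients with no new data are not touched.
        OUTPUT @appended
        TO "/processed/clients/ClientA_new.csv"
        USING Outputters.Csv();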

    Or am I still missing something?

  • As before, every new document you INSERT is going to add more files per partition until you rebuild. That means, if you partition on client and you start off with 100 unique clients, you may end up with 100 files. If you add in more clients, both new and existing ones, you may end up with additional files added to existing partitions or new files added to new partitions. If you then ALTER REBUILD, you will see partitions that have more than one file assigned to them consolidate down to one file per partition. That means, if you eventually have 300 unique clients, your REBUILD will shrink things down to 300 documents, one for each client partition, where before the rebuild you may have had 600 files.

    Just test it out, see how it works out for you.
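
    If you do test it, one thing I would also check is whether the rebuild can be scoped to a single partition. I believe the language reference mentions a PARTITION clause on REBUILD, but I have not verified it myself, so treat this as a sketch (against the made-up dbo.ClientData table used earlier in the thread):

        // Full rebuild: compacts every partition (what you observed).
        ALTER TABLE dbo.ClientData REBUILD;

        // Partition-scoped rebuild: worth testing whether this only rewrites
        // the one client partition that actually received new files.
        ALTER TABLE dbo.ClientData REBUILD PARTITION ("ClientA");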

  • Yes, this is absolutely correct and I do understand it. This is also what my tests showed. 

    The only problem I see is that when I end up with 300 files for 300 unique clients, then add 5 more clients and REBUILD, it doesn't rebuild only the files for the partitions I am inserting into. It looks like it rebuilds all 300 files (at least that's what the timestamps on the files show me). If this is the case, it means that the cost of rebuilding rises with the amount of data. If I do the append manually, I only append the files that need to change, which means the cost of appending stays more or less the same every time I add 5 new clients.
