What I do is store data files down to a YYYY/MM/DD hierarchy. I also create two main splits, RAW and PROCESSED. The RAW structure holds all the data you ingest; PROCESSED mirrors the same layout but holds the data after it has been processed with Azure Data Lake Analytics. Remember, data sits in Azure Data Lake Storage first, then is processed later with Analytics using U-SQL. These are two entirely separate services that you pay for separately, regardless of whether they feel like one.
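A minimal U-SQL sketch of reading the RAW zone through that date hierarchy, using file-set syntax so the path segments become a virtual column. The /RAW/sales path and the column list are hypothetical placeholders, not part of the original setup:

```sql
// U-SQL: read the RAW zone via a file set over the YYYY/MM/DD hierarchy.
// EventDate is a virtual column populated from the folder names.
@rows =
    EXTRACT OrderId   int,
            Amount    decimal,
            EventDate DateTime
    FROM "/RAW/sales/{EventDate:yyyy}/{EventDate:MM}/{EventDate:dd}/{*}.csv"
    USING Extractors.Csv(skipFirstNRows: 1);
```

Filtering on EventDate in a later SELECT can let the compiler prune whole date folders, which is the payoff of the hierarchy.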
The reason for that partition breakout is that when you point U-SQL at a group of files, it reads every file in that directory regardless of any filters in the U-SQL statement. The same is true for single files, such as when you build a table off one file. Therefore, to control the size of that single file, you have to constantly rebuild it, either using Analytics or some other means such as a third-party application or ETL process that is landing data in Storage. There is no INSERT for these files, and no appending to them either. You have to rebuild them, as far as I know.
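Since there is no append, "rebuilding" a consolidated file means re-reading everything and writing a fresh output. A sketch in U-SQL, with hypothetical paths and columns:

```sql
// U-SQL: rebuild a single consolidated file by re-reading all inputs
// and writing a new output document. Paths/columns are hypothetical.
@all =
    EXTRACT OrderId int,
            Amount  decimal
    FROM "/RAW/sales/{*}/{*}/{*}/{*}.csv"
    USING Extractors.Csv();

OUTPUT @all
TO "/PROCESSED/sales/consolidated.csv"
USING Outputters.Csv(outputHeader: true);
```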
Now, on the physical tables you are referring to: I assume each INSERT adds a new physical file because, again, you cannot INSERT into an existing document regardless of whether they give you the option to build a table. The system simply creates another document for that table on each INSERT. Instead of a single-document table, it becomes a multi-document table: say, 20 CSV files attached to one table, where a READ from that table is like reading all 20 CSV files.
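As a sketch of that behavior, here is a U-SQL managed table where each INSERT lands as a new file set behind the table rather than appending to an existing one. The table name, columns, and inline rowset are hypothetical:

```sql
// U-SQL: managed table; every INSERT statement writes new files
// behind the table instead of appending to existing ones.
CREATE TABLE IF NOT EXISTS dbo.Sales
(
    OrderId int,
    Amount  decimal,
    INDEX idx_Sales CLUSTERED (OrderId)
    DISTRIBUTED BY HASH (OrderId)
);

// Hypothetical source rowset; in practice this would come from an EXTRACT.
@new =
    SELECT * FROM (VALUES (1, 10.0m)) AS t(OrderId, Amount);

INSERT INTO dbo.Sales
SELECT OrderId, Amount FROM @new;
```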
The reason it tells you that you can ALTER REBUILD is likely so that, if you rebuild that single document yourself, you can execute that command to make the table re-read the new document you just built. It will re-read the entire document. That means if you started with a 1-million-record CSV file and then rebuilt that file using Analytics so it now holds 2 million records, executing ALTER REBUILD will read the first million it already read PLUS the additional 1 million records you just added.
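The command itself is a one-liner in U-SQL; the cost is hidden in the fact that it re-reads and rewrites everything behind the table:

```sql
// U-SQL: compact the accumulated per-INSERT files behind a managed
// table back into an optimized set. This re-reads the entire table.
ALTER TABLE dbo.Sales REBUILD;
```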
If it were me, I would steer clear of re-reading what you have already read. The whole point of the YYYY/MM/DD structure is to reduce re-reads, since you are charged for every read and transaction in Storage. ALTER REBUILD will likely cost the business more because it has to re-read everything you put into that single document every time it is updated, versus letting it create a new document for every INSERT. At least that is my understanding.
Personally, I do not use this feature yet. I use EXTERNAL TABLES, because Analytics still ensures the document behind the table is consistent, as opposed to a managed table. Then I pull the data into Azure SQL Data Warehouse using PolyBase, where it is read from the EXTERNAL TABLE into a PHYSICAL TABLE inside a proper data warehouse. I do not think MANAGED TABLES can be read into Azure SQL Data Warehouse or Azure SQL Database the way EXTERNAL TABLES can. Ensuring there is a proper data warehouse on top of the data lake was my goal.
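A sketch of that last hop, this time in T-SQL on the Azure SQL Data Warehouse side (PolyBase). The data source name, credential, paths, and table names are all hypothetical placeholders:

```sql
-- T-SQL (Azure SQL Data Warehouse): expose the lake files via PolyBase,
-- then land them in a physical, distributed warehouse table with CTAS.
CREATE EXTERNAL DATA SOURCE AzureDataLake
WITH (TYPE = HADOOP,
      LOCATION = 'adl://myadls.azuredatalakestore.net',
      CREDENTIAL = AdlCredential);  -- hypothetical database-scoped credential

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.SalesExternal
(
    OrderId int,
    Amount  decimal(18, 2)
)
WITH (LOCATION = '/PROCESSED/sales/',
      DATA_SOURCE = AzureDataLake,
      FILE_FORMAT = CsvFormat);

-- Read from the external table into a physical warehouse table.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(OrderId))
AS SELECT * FROM dbo.SalesExternal;
```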