When we have an Azure SQL Data Warehouse (ASDW), it is possible to query NoSql files using PolyBase. It is very common to work with csv files and query them using ASDW. However, these csv files are constantly changing and they need to be uploaded to an Azure Storage account. How can we compare the content of two folders in order to upload only the new ones?
When the files inside a folder are constantly updated, it is necessary to update the files in Azure and a comparision between the old and the new folders is necessary. In this article, we will see different ways to compare two folders in the local file system. We will also show how to synchronize them. We will try the following methods:
After that, we will show how to upload the files or folers to the Azure Storage in order to be queried in ASDW (we will show how to query these files in later chapters).
- A local machine with Windows and PowerShell
- An Azure account
- Microsoft Azure Storage Explorer (MASE) installed
We will first compare file and folders using PowerShell.
PowerShell is a powerful tool to handle files and folders. In this example, we will have two folders. Folder "a" and folder "b" as follows:
Table 1. Folders and files
|Folder a||Folder b|
|cities.csv||This file is equal to the file in folder a|
|countries.csv||This file is different from the file in folder a|
|russiancities.csv||This file does not exist in folder b|
You can download these folders from the resource files at the bottom of this article.
The following cmdlets will allow you to compare Folder a and Folder b:
PS C:\Users\OSCAR> $source = Get-ChildItem -Recurse -path C:\a PS C:\Users\OSCAR> $destination = Get-ChildItem -Recurse -path C:\b PS C:\Users\OSCAR> Compare-Object -ReferenceObject $source -DifferenceObject $destination
The Get-ChildItem will store the files inside the folder a in the $source variable. In the $destination variable, we will store all the files in folder b. The Compare-Object cmdlet compares two objects. It will show the file differences between the folders:
As you can see, the compare-object cmdlet detected that the folder "a" has a file named russiancities.csv not included in folder b. The problem is that the Compare-Object cmdlet does not detect the files modified with the code used before. It only detects what files exist and what files are missing. You can compare files using the compare-object cmdlet:
Compare-Object $(Get-Content c:\a\countries.csv) $(Get-Content c:\b\countries.csv)
Compare-Object compares the content of the countries in folder "a" and folder "b". Get-Content obtains the content of the files. The results displayed are the following:
The rows with Austria and Azerbaijan are included in the countries.csv file in folder "a", but they are not included in the same file in folder "b".
Another way to compare two files is to check the last write time. This attribute will show when was the last time that the file was modified (run this cmdlets in the a or b folder):
gci | select name,lastwritetime
The gci cmdlet will show the file names and the last write time:
You can also compare the size of the files (length) using the gci cmdlet:
gci | select name,length
The length of the files is measured in bytes. We can store the information in files of the folder "a" and folder "b" and after that compare the results. The following example stores the last write time and the file names in a file named source.txt of the files in folder "a":
cd cd a PS C:\a> gci | select lastwritetime,name | out-file c:\output\source.txt
Next, we will do the same with the folder "b" and we will call the file destination.txt:
PS C:\a> cd .. PS C:\> cd b PS C:\b> gci | select lastwritetime,name | out-file c:\output\destination.txt
Finally, compare both files:
Compare-Object $(Get-Content c:\output\source.txt) $(Get-Content c:\output\destination.txt)
The results displayed are the following:
21:43:53 is the last write time of the countries.csv file in the folder "b" and 21:29:44 is the last write time of the same file in the folder "a". Finally, 21:42:02 is the last write time of the file russiancities.csv.
Once you have the files that changed, you can upload them to Azure.
You can also synchronize both folders using the robocopy command. This command is included in the command line (cmd).
C:\a>robocopy c:\a c:\b /mir
Robocopy has nothing to do with Robocop, the OCP or Detroit. It is a command to copy files with some advanced options. Robocopy with the /mir option, creates a mirror from the source into the destination. It compares the source and the destination folder and makes sure to remove additional files or copies missing files from the source to the destination.
This option is good if the files are not so big and it is easier to upload all the files from the folder (including the ones that did not change) to the Azure Storage. The problem is that sometimes we only want to know the files that were changed or added and we do not want to just synchronize them. The next option would be WinMerge.
WinMerge is an Open Source software that can compare files and folders. You can get this free software here: WinMerge. Once installed, go to the menu and go to File>Open:
Select the folders to compare:
As you can see, it shows the files that are identical, the files that are different and the files that are left. In the Filename column, you can see the icons to compare identical, different and missing files:
If you click countries.csv, you will see the differences between the file in folder "a"and "b":
You can also copy the files from the right to the left or vice versa:
In these examples, we used three methods. Here you have some advantages and disadvantages:
|PowerShell||You can customize and create very advanced solutions.||It takes time to learn PowerShell and sometimes it is hard to maintain the code. You can automate this tool integrating with the SQL Agent, the Windows Task Scheduler or any other tool of your preference.|
|Robocopy||It is easy to use.||It synchronizes two folders, but it does not show the differences between two folders. This tool can also run automatically using the SQL Agent, the Windows Task Scheduler or any other tool of your preference. Use this option if you want just to synchronize the folders and you do not need to detect which ones changed.|
|WinMerge||It is extremely friendly and easy to use.||You cannot easily customize it. It is open source, but it takes time to make modifications. You cannot automate tasks with this tool.|