I ran across Denis Gobo’s post about working with names from Social Security cards and wanted to play with the dataset. However, rather than use xp_cmdshell, which I might have, I decided to use PowerShell. I need to repeat this with Python, but for now, this worked.
There is a file you can download that has a lot of .txt files, each one with a CSV. You can read about this in more detail in Denis’ post. Essentially there were three fields, and the year of the file in the name. I downloaded and unzipped the files into a folder.
Here was my process:
- Loop through the files
- Extract the year from the file name
- bcp in the file into a staging table
- call a proc that takes the year as a parameter and moves all data
With that in mind, here’s how I set things up.
The SQL Server Side
I started by creating two tables.
CREATE TABLE Names
( FirstName VARCHAR(500),
Gender CHAR(1),
NameCount INT,
YearInFile INT
);
CREATE TABLE Names2
( FirstName VARCHAR(500),
Gender CHAR(1),
NameCount INT
);
With these in place, I created a procedure:
CREATE PROCEDURE MoveNames @year INT
AS
INSERT Names SELECT FirstName, Gender, NameCount, @year FROM Names2;
TRUNCATE TABLE Names2;
GO
I tested these and they worked fine for my process. I want to load into Names2, then to get the year, I move the data to the other table, adding the year. There are probably better ways, but this worked fine for a relatively small load (few million rows).
PowerShell
With PoSh, I started with a simple loop. I tend to set variables first since this makes this easier to turn into a function.
$sourcefolder = “E:\Downloads\names”
$sourcefiles = Get-ChildItem $sourcefolder -Filter “*.txt”
foreach($file in $sourcefiles ){
$yearinfile = $file.Name.Substring(3, 4)
write-host “Loading File:” $file.Name ” year: ” $yearinfile
}
Running this gets me a list of all files along with the years. That’s what I want, since I’m going to process each file and load it in with bcp. Thanks to Mike Fal for the outline of how I’ll do this.
Next, I set more variables at the top:
$InstanceName = “.\SQL2016”
$DatabaseName = “Sandbox”
$StagingTableName = “Names2”
$StagingProc = “MoveNames”
I’ll use these in this code. I call Invoke-Expression to run bcp and then Invoke-Sqlcmd to run my proc.
$cmd = “bcp ‘$DatabaseName.dbo.[$StagingTableName]’ in ‘$file’ -S’$InstanceName’ -T -c -t’,'”
Invoke-Expression $cmd
Invoke-Sqlcmd -ServerInstance $InstanceName -Database $DatabaseName -Query “$StagingProc $yearinfile”
From here, I put this in a file and ran it from my download location. The result:
and from SQL Server:
Now I can run some queries, and find out where I stand. #26, it turns out.
Filed under: Blog Tagged: ETL, powershell, syndicated