Zip XML data

  • Hi Experts,

    We are creating an archival plan for one of our tables, which stores JSON data in NVARCHAR(MAX). We planned to load data from the original table into a warehouse database once a day. The data will be selected as XML, zipped, and then inserted into a table. We were able to select the data as XML, but when we tried to create an assembly to zip the data it threw an error:

    Msg 10301, Level 16, State 1, Line 11
    Assembly 'Zip' references assembly 'netstandard, version=2.0.0.0, culture=neutral, publickeytoken=cc7b13ffcd2ddd51.', which is not present in the current database. SQL Server attempted to locate and automatically load the referenced assembly from the same location where referring assembly came from, but that operation has failed (reason: 2(The system cannot find the file specified.)). Please load the referenced assembly into the current database and retry your request.

    We used this article to create the zip DLL:
    https://www.codeproject.com/Articles/27203/GZipStream-Compress-Decompress-a-string.

    Does anyone have experience with this, or am I going way off track in archiving the data?
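
    As a possible alternative, if the instance is SQL Server 2016 or later, the built-in COMPRESS() and DECOMPRESS() functions (GZip) avoid the CLR assembly entirely; the Msg 10301 error typically appears when the DLL was compiled against .NET Standard rather than the .NET Framework that SQLCLR expects. A minimal sketch, assuming SQL Server 2016+ and hypothetical object names (dbo.SourceTable, JsonDoc, CreatedOn, dbo.ArchiveTable, ZippedDoc):

    -- Build one XML document for yesterday's rows, GZip it with COMPRESS(),
    -- and store the result as VARBINARY(MAX). All object names are placeholders.
    DECLARE @day date = CAST(DATEADD(DAY, -1, GETDATE()) AS date);

    DECLARE @dayXml xml =
        (SELECT JsonDoc AS Doc
         FROM   dbo.SourceTable
         WHERE  CreatedOn >= @day
           AND  CreatedOn <  DATEADD(DAY, 1, @day)
         FOR XML PATH('row'), ROOT('day'), TYPE);

    INSERT INTO dbo.ArchiveTable (ArchiveDate, ZippedDoc)
    VALUES (@day, COMPRESS(CAST(@dayXml AS NVARCHAR(MAX))));

    -- Reading it back:
    SELECT CAST(DECOMPRESS(ZippedDoc) AS NVARCHAR(MAX)) AS DayXml
    FROM   dbo.ArchiveTable
    WHERE  ArchiveDate = @day;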

  • VastSQL - Thursday, January 18, 2018 12:34 AM

    Quick thought: the XML data type is quite an efficient storage type; I've only seen around 20-25% effective compression when XML is stored in zipped form. Accessing the data as XML is also much easier, since having to decompress it before querying is quite an effort. A quick way to check the ratio on your own data is shown below.
    😎
    Question, why NVARCHAR and not VARBINARY for the zipped XML?
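
    For what it's worth, a rough sketch for measuring the actual ratio with the built-in COMPRESS() (SQL Server 2016+; dbo.SourceTable and JsonDoc are placeholder names):

    -- Compare raw NVARCHAR storage with the GZip'ed VARBINARY equivalent.
    SELECT SUM(DATALENGTH(JsonDoc))           AS RawBytes,
           SUM(DATALENGTH(COMPRESS(JsonDoc))) AS CompressedBytes
    FROM   dbo.SourceTable;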

  • Did you ever consider using a relational database?

    I heard they have a fascinating feature named "data normalization".

    Being designed appropriately, such a database would take 5 to 10 times less storage space than the corresponding set of JSON or XML strings, before any compression.

    Well, that's the rumour...

    _____________
    Code for TallyGenerator

  • Eirikur Eiriksson - Thursday, January 18, 2018 12:47 AM

    Thanks Eirikur for the reply. We haven't even reached the zipping stage. Currently the JSON is stored as NVARCHAR and we are converting it to XML, then planning to zip it; right now I am stuck at zipping the XML data.

  • Sergiy - Thursday, January 18, 2018 12:58 AM

    We need to keep the JSON as it is. The JSON is produced by BizTalk, which converts all the normalized data to JSON for communicating with other systems, if I am not wrong.

  • VastSQL - Thursday, January 18, 2018 1:07 AM

    Why are you converting the JSON to XML?
    😎

  • Eirikur Eiriksson - Thursday, January 18, 2018 1:25 AM

    One column's value is JSON; there are 4 other columns. The JSON value is stored as NVARCHAR(MAX). I will convert the whole day's data into one XML record and am planning to zip it.

  • Hello everybody,
    I use a PowerShell script to read the content table and zip the content to a specified path. Save this as a PS script and you can create a SQL Server Agent job to run the file (a sketch of such a job follows after the script).

    Best regards
    Mike

    # Configuration data
    [string] $server   = "Servername\";       # SQL Server Instance
    [string] $database = "DatabaseName";     # Database containing the BLOB data.
    [string] $folder   = "C:\TEMP\Exportpath\";         # Path to export to
    [string] $filename;
    [int]    $bufferSize = 32768;               # Stream buffer size in bytes.   
    $sql = "SELECT
      CAST(Filename AS VARCHAR(6)) + '.XML' as [FileName] 
          ,CAST([XMLDoc] as varbinary(MAX)) as [XMLDoc]
      FROM TableToExport
      WHERE XMLDOC IS NOT NULL;";
      
    # Open ADO.NET Connection with Windows authentification.
    $con = New-Object Data.SqlClient.SqlConnection;
    $con.ConnectionString = "Data Source=$server;Initial Catalog=$database;Integrated Security=True;"
    $con.Open();

    Write-Output ((Get-Date -format yyyy-MM-dd-HH:mm:ss) + ": Started ...");

    # New command and reader
    $cmd = New-Object Data.SqlClient.SqlCommand $Sql, $con;
    $rd = $cmd.ExecuteReader(); #([System.Data.CommandBehavior]::SequentialAccess);

    # Create a byte array for the stream.
    $out = [array]::CreateInstance('Byte', $bufferSize)

    # Looping through all selected records.
    While ($rd.Read())
    {
        try
        {
        $filename = $folder + $rd.GetString(0)
            Write-Output ((Get-Date -format yyyy-MM-dd-HH:mm:ss) + ": Exporting {0}" -f $filename);

            # New FileStream + BinaryWriter; an existing file will be overwritten.
            $fs = New-Object System.IO.FileStream($filename, [IO.FileMode]::Create, [IO.FileAccess]::Write);
            $bw = New-Object System.IO.BinaryWriter $fs;

            $start = 0;
            # Read first byte stream chunk.
            $received = $rd.GetBytes(1, $start, $out, 0, $bufferSize - 1);
            While ($received -gt 0)
            {
               $bw.Write($out, 0, $received);
               $bw.Flush();
               $start += $received;
               # Read next byte stream chunk.
               $received = $rd.GetBytes(1, $start, $out, 0, $bufferSize - 1);
            }

            $bw.Close();
            $fs.Close();
        }
        catch
        {
            Write-Output ($_.Exception.Message)
        }
        finally
        {
            $fs.Dispose();        
        }        
    }

    # Closing & Disposing all objects.
    $rd.Close();
    $cmd.Dispose();
    $con.Close();

    $srcdir = "C:\TEMP\ExportToPath"
    $zipFilename = "ZipName.zip"
    $zipFilepath = "C:\TEMP\Export\"
    $zipFile = "$zipFilepath$zipFilename"

    #Prepare zip file
    if(-not (test-path($zipFile))) {
        set-content $zipFile ("PK" + [char]5 + [char]6 + ("$([char]0)" * 18))
        (dir $zipFile).IsReadOnly = $false 
    }

    $shellApplication = new-object -com shell.application
    $zipPackage = $shellApplication.NameSpace($zipFile)
    $files = Get-ChildItem -Path $srcdir | where{! $_.PSIsContainer}

    foreach($file in $files) {
        $zipPackage.CopyHere($file.FullName)
    #using this method, sometimes files can be 'skipped'
    #this 'while' loop checks each file is added before moving to the next
        while($zipPackage.Items().Item($file.name) -eq $null){
            Start-sleep -seconds 1
        }
    }

    Write-Output ((Get-Date -format yyyy-MM-dd-HH:mm:ss) + ": Finished.");
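
    If it helps, here is a rough sketch of wiring the script into a daily SQL Server Agent job (job, step and schedule names plus the script path are placeholders; the PowerShell job-step subsystem needs SQL Server 2008 or later):

    -- Sketch of a daily Agent job that runs the export/zip script above.
    -- Job name, schedule and script path are placeholders.
    USE msdb;
    GO
    EXEC dbo.sp_add_job         @job_name = N'Export and zip XML docs';
    EXEC dbo.sp_add_jobstep     @job_name = N'Export and zip XML docs',
                                @step_name = N'Run export script',
                                @subsystem = N'PowerShell',
                                @command   = N'& "C:\Jobs\ExportAndZip.ps1"';
    EXEC dbo.sp_add_jobschedule @job_name = N'Export and zip XML docs',
                                @name = N'Daily 01:00',
                                @freq_type = 4,             -- daily
                                @freq_interval = 1,         -- every day
                                @active_start_time = 10000; -- 01:00:00
    EXEC dbo.sp_add_jobserver   @job_name = N'Export and zip XML docs',
                                @server_name = N'(local)';
    GO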

  • One column's value is JSON; there are 4 other columns. The JSON value is stored as NVARCHAR(MAX). I will convert the whole day's data into one XML record and am planning to zip it.

    OK, let's see.
    You take copies of outgoing messages generated by BizTalk and save them into a table (what's wrong with the file system?).
    Then you collect all the saved messages for a day and merge them into a huge XML. That CPU-intensive process requires all the inbound JSON messages and the outbound XML message to be placed into SQL Server memory, and they'll have to go through the buffer pool, which will make SQL Server resources unavailable for other, actual data-manipulation tasks.
    And at the end of it you want to dump the bulky output by zipping it (as was mentioned earlier, zip is not very effective at compressing XML) into an archive on some remote location (returning to the roots - apparently, there is nothing wrong with file system storage after all), so you can never see it again.

    And all of this instead of having a log table with a description of each JSON message (those other columns you mentioned) and PK link(s) to the data which went into it, so any message can be recreated when requested (a rough sketch follows below).
    It should not exceed 50 bytes per record.

    Look, if you have so much spare computing power at your disposal, why not consider other, more productive uses of it?
    They say bitcoin mining is quite profitable.

    _____________
    Code for TallyGenerator
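
    To make the log-table idea above concrete, a minimal sketch might look like this (all object names are hypothetical, and the link column would point at whatever source tables the messages are built from):

    -- Hypothetical log of generated JSON messages: a small describing row per
    -- message, plus links back to the source rows, so any message can be rebuilt.
    CREATE TABLE dbo.MessageLog
    (
        MessageLogId bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        GeneratedOn  datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
        MessageType  varchar(30)  NOT NULL
    );

    CREATE TABLE dbo.MessageLogRow
    (
        MessageLogId bigint NOT NULL REFERENCES dbo.MessageLog (MessageLogId),
        SourceRowId  bigint NOT NULL,   -- PK of a source row included in the message
        PRIMARY KEY (MessageLogId, SourceRowId)
    );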

  • Sergiy - Friday, January 19, 2018 6:12 AM

    Thanks Sergiy for the advice.

    When you talk about memory: even if I do an insert into an archive table from the main table, it will use the same amount of memory as the huge XML would, right?

    The XML data includes all columns. Is there a better way of archiving the data that won't use much disk space or many resources?

  • mike.knospe - Friday, January 19, 2018 3:06 AM

    Thanks Mike,

    My understanding of your method is that you are converting data in a table to a file? What advantage do you get from this, and how do you revert back to tabular data?

    Are you compressing the file?

  • Is there a better way of archiving the data that won't use much disk space or many resources?

    Yes, of course.

    First, you have to keep the JSON and XML messages out of SQL Server.

    Placing a string into a table incurs a huge overhead compared to simply storing it in a file - an overhead measured not in percent, but in multiples.

    Second, store the JSON messages on a separate file server.

    This way you'll perform 2 I/O operations per message - 1 for writing, and 1 for reading when archiving. Repeated backups, writes to the transaction log, page splits, reindexing, etc. will be out of the picture.

    Zip the files using an appropriate tool in combination with a script similar to the one posted by Mike.

    Except you'll read the uncompressed messages not from a table but from a file folder.

    Which core of which CPU of which server should be used is up to you to decide - whichever has the most free time.

    _____________
    Code for TallyGenerator

  • Sergiy - Sunday, January 21, 2018 4:49 AM

    Thanks Sergiy,

    What I understood from your words is that the JSON data should be moved from the table in the main OLTP database into files. We have 1000 messages a day, which means 1000 files per day (correct me if I am wrong). How do we achieve this, and how can we get the old data back if it is ever required in tabular form?

    You didn't comment on the memory usage of the XML approach vs. INSERT...SELECT from the main OLTP table. If I am not wrong, both will take the same amount of buffer.

  • OK, let's clarify matters first.

    The JSON messages do not originate from a DB table, so they cannot be "moved out" of the table - they should not be put in there in the first place.

    According to your words, the JSON messages are built by a BizTalk orchestration from data which is sitting in a relational database.

    So there is no valuable data to be archived in those messages, except for the fact that the messages were actually generated and were correctly built.

    All the data pieces which populate the messages are already in the database, and there is no point in placing them in there once more.

    What I'm suggesting is to take the output of the BizTalk process and save it in a file rather than in a SQL table. If my memory serves me well, 1000 files per folder was not a problem even for DOS 6.0, not to mention any Windows OS.

    You may wish to create daily folders for the messages; then you could zip the whole folder at once.

    The XML will always take more memory, as it will always be bigger than the total size of all the files it wraps up.

    I don't want to discuss the best way of doing the "INSERT INTO ... SELECT" part, because - honestly - I cannot see any point in doing it at all. Except, well, for keeping the kids (I mean servers) occupied.

    Can you explain why you need to move BizTalk history records around?

    _____________
    Code for TallyGenerator

  • Sergiy - Sunday, January 21, 2018 4:41 PM

    Thanks a lot, Sergiy.
    Your explanations cleared up many of my questions.
    The BizTalk team needs those JSON messages so they can reprocess a message/transaction if ever needed. Keeping them in the main database puts a huge toll on disk space, as we take a full backup every day and keep it for a week. We will move the data to a different database that is also backed up every day, but whose backups are kept for only 2 days.
