Importing Poorly Formatted Text File to SQL

  • I have below formatted text file each record for a user is separated with space, userid is unique value.

    i want to import to SQL either using SSIS or bulk insert, i need o/p as below format

    Text file

    dn: userid=test123,ou=company.com

    objectClass: top

    userid: test123

    companyVal: 1234

    preferredLanguage:: IA==

    company: Yahoo

    job-title: Admin

    job-title: approve_perm

    mail: a.b@test.com

    createdby:

    modifiedby:

    createdTime: 20110917180228Z

    modifiedTime: 20110917180228Z

    dn: userid=test123,ou=company.com

    objectClass: top

    userid: test1235

    companyVal: 12345

    preferredLanguage:: IA==

    pyramid-group: 5

    company: Yahoo

    mail: abc@test.com

    createdby:

    modifiedby:

    createdTime: 20110917180228Z

    modifiedTime: 20110917180228Z

    dn: userid=test12345,ou=company.com

    objectClass: top

    userid: test12345

    companyVal: 12345

    preferredLanguage:: IA==

    company: google

    job-title: Admin

    job-title: approve_perm

    mail: a.b@test.com

    createdby:

    modifiedby:

    createdTime: 20110917180228Z

    modifiedTime: 20110917180228Z

    Desired o/p format

    Userid CompanyVal Company Job-title Mail Createdby Modifiedby Createdtime Modifiedtime

    test123 1234 Yahoo admin,approve_perm a.b@test.com 20110917 20110917

  • I am far from an SSIS expert, but with the little I know I think that you'll have to use an Execute Script task for this. Which means you'll need to have some C# or VB coding skills to pull this off. I do not have such skills (if I had to do this it would cost me a full day with lots of trial and error, lots of internet searches, lots of frustration and a great sense of achievement when it finally works). But the rough outline of the code would be as follows:

    Initialise and set up

    While (not end of data)

    Initialise output row

    While (not end of output row) /* I think you can recognise this from the empty line */

    Read input line

    Parse line and store in appropriate column

    End

    Create output row

    End


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo is heading in the right direction. This is quite tricky. Here is an outline solution.

    Create a flat file connection pointing to your source file. Set the connection to have only one column.

    Add a dataflow task, with the flat file as the source.

    Add a script component transformation and connect your source to it.

    Make the script component asynchronous (number of rows in <> number of rows out).

    Create an output from the SC which contains the columns you want.

    Edit the code inside the SC to consume the rows from the flat file, assigning the results to output columns as each line is processed and then, when a new record is found, to send the row to the output you have defined.

    This code will take a while to write, but it's not that difficult. You will find examples if you search for them.

    The SC should be connected to your target table (using the output you defined) and columns mapped accordingly.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • I don't know SSIS, but I don't think you need it for this. That's not that horrible of a format to deal with. It looks like you have a key/value pair consistently delimited by a colon. You mentioned BULK INSERT in your original post and I think that's the right direction for this.

    The first step will be to import the text file into a single column in a staging table. From there, you can use DelimitedSplit8K to split the variable-length strings (key and value) to their own columns. If you aren't familiar with DelimitedSplit8K yet, it's explained in Jeff Moden's article at http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Now that you have the key and value in separate columns, you can start working with it. You'll need to use several techniques to get to where you need to be. I have an article on detecting differences in files on servers at http://www.sqlservercentral.com/articles/T-SQL/127804/. The overall topic is completely unrelated, but if you scroll down to the heading "Parsing the Data", you'll find a walkthrough of the techniques you need. The end goal is to get the different values consolidated into a single row and different columns for each value.

    If you're on SQL 2012 and the order of the key/value pairs is consistent in all blocks you need to import, I think you'll find the LEAD and LAG functions to be especially useful after you import and split your strings.

    I'm sorry that I don't have time to write the code out for you. I enjoy this type of thing and wish I had the time to devote to it.

  • Thanks everyone for your response. I dont know much about c#, will try solutions provided by Phil and Wagner.

    thanks

  • I wanted to do this with only native SSIS objects but am not seeing a way. Pivot was promising but the source data does not have a grouping attribute, e.g. an ID on every line of the file that would tie each profile together. My second inclination after bailing on Pivot was to use a Script Component as a Source but ": " (notice we need to type in a custom 2-byte delimiter with a space after the colon because preferredLanguage has a double-colon) in the data is a natural delimiter so I went with a Flat File Source followed by a Script Component setup for Transformation, basically what Phil described while I was playing around with this. Attached is a proof-of-concept solution.

    The zip contains:

    1. ActiveDirectoryLog.sql - DDL to create destination table in tempdb

    2. ActiveDirectoryLog.txt - sample input file (from OP)

    3. ImportActiveDirectoryLog.dtsx - SSIS Package that loads the sample file into the table in tempdb

    Note, the SSIS is in SQL 2016 format so you'll need to download the CTP and choose to install SSDT to be able to open this and try it out.

    Some things I noticed about your input file that will add some wrinkles:

    EDIT: update image

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • @orlando,

    I've gotten away from SSIS so much that I don't even know how to open SSIS and load a package. I'm really interested in seeing what you've done. Can you give me a quick lesson on how to load the package. I think I can follow my nose after that. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mxy (1/30/2016)


    I have below formatted text file each record for a user is separated with space, userid is unique value.

    i want to import to SQL either using SSIS or bulk insert, i need o/p as below format

    Text file

    dn: userid=test123,ou=company.com

    objectClass: top

    userid: test123

    companyVal: 1234

    preferredLanguage:: IA==

    company: Yahoo

    job-title: Admin

    job-title: approve_perm

    mail: a.b@test.com

    createdby:

    modifiedby:

    createdTime: 20110917180228Z

    modifiedTime: 20110917180228Z

    dn: userid=test123,ou=company.com

    objectClass: top

    userid: test1235

    companyVal: 12345

    preferredLanguage:: IA==

    pyramid-group: 5

    company: Yahoo

    mail: abc@test.com

    createdby:

    modifiedby:

    createdTime: 20110917180228Z

    modifiedTime: 20110917180228Z

    dn: userid=test12345,ou=company.com

    objectClass: top

    userid: test12345

    companyVal: 12345

    preferredLanguage:: IA==

    company: google

    job-title: Admin

    job-title: approve_perm

    mail: a.b@test.com

    createdby:

    modifiedby:

    createdTime: 20110917180228Z

    modifiedTime: 20110917180228Z

    Desired o/p format

    Userid CompanyVal Company Job-title Mail Createdby Modifiedby Createdtime Modifiedtime

    test123 1234 Yahoo admin,approve_perm a.b@test.com 20110917 20110917

    You still need to tell us what you want to do with duplicate "field names" as Orlando has suggested. I just figure that we can't correct a blank piece of paper so here's my go at it using BULK INSERT.

    For a BULK INSERT solution, the first thing we need to do with code is to create a BCP format file because of the columns in the staging table that don't exist in the data. This is essential because absolutely need the IDENTITY column to preserve the order of input or this whole thing falls apart. Here's the format file that I used. Think of this as the "mapping" step.

    Here's the BCP format file I used. I stored it in my C:\Temp directory on my box. You can certainly change that. I know nothing of your file names so I just called it JobFile.fmt.

    [font="Arial Black"]{EDIT} NOTE THAT THIS FORUM ATE THE "\ n" (without the space) SO YOU NEED TO REMOVE THE SPACE BEFORE YOU SAVE THE BCP FORMAT FILE BELOW!!![/font]

    9.0

    6

    1 SQLCHAR 0 0 "" 0 LineNumber ""

    2 SQLCHAR 0 0 "" 0 RecordNumber ""

    3 SQLCHAR 0 0 "" 0 ItemName ""

    4 SQLCHAR 0 0 "" 0 ItemValue ""

    5 SQLCHAR 0 0 "" 0 ItemInstance ""

    6 SQLCHAR 0 1000 "\ n" 6 LineData ""

    Next, I stored your example file in my C:\Temp directory, as well. Again, you can certainly change that.

    Having done those two things and ensuring that your server can see them and that the files are not open by anything else, the following code accomplishes what you want. Details are in the comments, where they belong. I strongly recommend you lookup and study BCP FORMAT FILES and CROSSTABs.

    --Reference: http://www.sqlservercentral.com/Forums/Topic1757028-3077-1.aspx

    --=====================================================================================================================

    -- Prepare the file staging table

    --=====================================================================================================================

    --===== If the file staging table exists, drop it to make reruns in SSMS easier.

    IF OBJECT_ID('tempdb..#FileStaging','U') IS NOT NULL

    DROP TABLE #FileStaging

    ;

    --===== Create the initial file table.

    -- Notice the PERSISTED computed columns that do some parsing as soon as a row appears in the table.

    CREATE TABLE #FileStaging

    (

    LineNumber INT IDENTITY(1,1) PRIMARY KEY CLUSTERED

    ,RecordNumber INT

    ,ItemName AS RTRIM(LTRIM(REPLACE(SUBSTRING(LineData,1,NULLIF(CHARINDEX(':',LineData),0)-1),CHAR(13),''))) PERSISTED

    ,ItemValue AS RTRIM(LTRIM(REPLACE(SUBSTRING(LineData,DATALENGTH(LineData)-NULLIF(CHARINDEX(':',REVERSE(LineData)),0)+2,500),CHAR(13),''))) PERSISTED

    ,ItemInstance INT

    ,LineData VARCHAR(500)

    )

    ;

    --=====================================================================================================================

    -- Load the data from the file

    -- (Hard coded for this example. Convert to dynamic SQL if desired or necessary.)

    --=====================================================================================================================

    --===== Load the file staging table from the file.

    -- Because of the BCP format file, the first 5 columns of the table are ignored.

    BULK INSERT #FileStaging

    FROM 'C:\Temp\Jobfile.txt'

    WITH (

    BATCHSIZE = 2000000000

    ,CODEPAGE = 'RAW'

    ,DATAFILEtYPE = 'CHAR'

    ,FORMATFILE = 'C:\Temp\JobFile.fmt'

    ,TABLOCK

    )

    ;

    --=====================================================================================================================

    -- Add the record numbers to the staging table so that we can group on them when we pivot the data for

    -- final result.

    --=====================================================================================================================

    --===== Local obviously named variables

    DECLARE @SafetyCounter INT

    ,@RecordNumber INT;

    SELECT @SafetyCounter = 1

    ,@RecordNumber = 1

    ;

    --===== Use a Quirky Update (with a safety counter) to isolate the records and assign record numbers based on the

    -- blank lines between the records. Hopefully, the provider of the data is consistent at least about that.

    -- If you're using 2012, I think you could use PRECEEDING ROWS to do similar but this works in 2005 and up.

    -- Note that a "record" consists of multiple "lines", in this case.

    WITH cteRN AS

    (

    SELECT LineNumber

    ,RecordNumber

    ,LineData

    FROM #FileStaging

    )

    UPDATE tgt

    SET @RecordNumber

    = RecordNumber --The two back-to-back equals signs are a key part of the Quirky Update.

    = CASE --It's improperly documented in Books Online (they don't know how to use it) and definitely NOT supporte by MicroSoft.

    WHEN LineNumber = @SafetyCounter

    THEN CASE WHEN LineData > ' ' THEN @RecordNumber ELSE @RecordNumber+1 END

    ELSE 'Sequence Error'

    END

    , @SafetyCounter = @SafetyCounter + 1

    FROM cteRN tgt WITH (TABLOCKX, INDEX(1))

    OPTION (MAXDOP 1)

    ;

    --=====================================================================================================================

    -- Enumerate the instances of the ItemName so we can handle multiple instances per record.

    -- Note that we haven't yet been told how you want to handle such a problem.

    --=====================================================================================================================

    WITH cteEnumerate AS

    (

    SELECT InstanceNumber = ROW_NUMBER() OVER (PARTITION BY RecordNumber,ItemName ORDER BY LineNumber)

    ,ItemInstance

    FROM #FileStaging

    )

    UPDATE cteEnumerate

    SET ItemInstance = InstanceNumber

    ;

    --=====================================================================================================================

    -- Create the final output.

    -- AGAIN, WE HAVEN'T YET BEEN TOLD HOW YOU WANT TO HANDLE DUPLICATED ITEMNAMES AND RELATED VALUES.

    -- THIS ONLY USES THE FIRST VALUE WHEN SUCH DUPLICATION IS PRESENT.

    -- I also recommend that you change the dates to a real date/time datatype but didn't do so here.

    --=====================================================================================================================

    SELECT UserID = MAX(CASE WHEN ItemName = 'userid' THEN ItemValue ELSE '' END)

    ,CompanyVal = MAX(CASE WHEN ItemName = 'companyVal' THEN ItemValue ELSE '' END)

    ,Company = MAX(CASE WHEN ItemName = 'company' THEN ItemValue ELSE '' END)

    ,JobTitle = MAX(CASE WHEN ItemName = 'job-title' THEN ItemValue ELSE '' END)

    ,eMail = MAX(CASE WHEN ItemName = 'mail' THEN ItemValue ELSE '' END)

    ,CreatedBy = MAX(CASE WHEN ItemName = 'createdby' THEN ItemValue ELSE '' END)

    ,ModifiedBy = MAX(CASE WHEN ItemName = 'modifiedby' THEN ItemValue ELSE '' END)

    ,CreatedDate = MAX(CASE WHEN ItemName = 'createdTime' THEN LEFT(ItemValue,8) ELSE '' END)

    ,ModifiedDate = MAX(CASE WHEN ItemName = 'modifiedTime' THEN LEFT(ItemValue,8) ELSE '' END)

    FROM #FileStaging

    WHERE ItemInstance = 1

    GROUP BY RecordNumber

    ;

    Using the file data you provided, here's what the above code returns...

    UserID CompanyVal Company JobTitle eMail CreatedBy ModifiedBy CreatedDate ModifiedDate

    ----------- ----------- ------- ----------- --------------- ----------- ----------- ----------- ------------

    test123 1234 Yahoo Admin a.b@test.com 20110917 20110917

    test1235 12345 Yahoo abc@test.com 20110917 20110917

    test12345 12345 google Admin a.b@test.com 20110917 20110917

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @mxy,

    BTW, what was the DOS command that you used to create the file? The reason I ask is because there may be a simpler method available but I need to know what the full syntax of the command that was used to produce the file. Of course, do not include any real computer or domain names. Replace those with something else.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/30/2016)


    @Orlando,

    I've gotten away from SSIS so much that I don't even know how to open SSIS and load a package. I'm really interested in seeing what you've done. Can you give me a quick lesson on how to load the package. I think I can follow my nose after that. Thanks.

    1. Download latest SQL Server 2016 CTP and minimum install SSDT*.

    2. Download the attachment from my previous post and extract the .dtsx file to a local location.

    3. Open SSDT then select File>Open>File to find and open the .dtsx.

    That should be all the steps you need to do although step 1 could take hours to accomplish due to the size of the download and installation.

    Here are some snaps of the SSIS screens:

    All the integration work is done in configuring the C# Script Component to setup the behavior (asynchronous, i.e. more rows going into the Component than being output), the Output Columns (i.e. the shape of the dataset being output) and the C# code that decides how to make sense of the rows going in to get the right rows going out.

    * SSDT = SQL Server Data Tools was introduced with SQL 2012 and is the replacement for BIDS from SQL 2005 and 2008. SSDT, like BIDS, is just a brand name for "Visual Studio Shell plus Visual Studio Project Types for Database, SSRS, SSIS and SSAS Projects.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Jeff Moden (1/30/2016)


    [font="Arial Black"]{EDIT} NOTE THAT THIS FORUM ATE THE "\ n" (without the space) SO YOU NEED TO REMOVE THE SPACE BEFORE YOU SAVE THE BCP FORMAT FILE BELOW!!![/font]

    I have had this issue too and you can get around it by inputing &#92;n as the text which will yield \n when printed on this forum. Hit the "Quote" button on this post to see what I mean.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Jeff Moden (1/30/2016)


    @mxy,

    BTW, what was the DOS command that you used to create the file? The reason I ask is because there may be a simpler method available but I need to know what the full syntax of the command that was used to produce the file. Of course, do not include any real computer or domain names. Replace those with something else.

    I did recognize this as Active Directory output as evidenced by my choice of file names in my demo but it didn't occur to me to ask about reshaping the upstream input file. I think because I was so excited about trying to solve the tech challenge around loading such a strange file-format. I think you are onto something though. Now that you raised this it looks like dsquery output which has options for output format, although I do not know how a tool would rationalize multiple instances of properties like job-title and if they did if it would suit the requirement.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thank you for the post on how to load the SSIS package. I appreciate the time you took. I'll have to pass, though. I was looking forward to seeing an SSIS solution that had no script component.

    Shifting gears, I had a comment in my reply to the op about beating on the provider to get the data in a more normal format. I didn't realize this was an AD output until you mentioned it. As you say, it seems like DSQuery or WMIC for AD could do a much better job of producing more readily importable data.

    Agreed on the problem of multiple instances of a property.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Orlando Colamatteo (1/30/2016)


    I think because I was so excited about trying to solve the tech challenge around loading such a strange file-format.

    +1000 to that. It was a bit o' fun!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I have received file from other team,i will check with them, how it was generated.

    About duplicate columns names, even though column name is repeated value is unique and it can be n number of times per user.

    all attributes are not in all profiles - we can ignore those, we dont need those attributes in o/p

    Double colon attributes are not required we can ignore those fields

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply