How to iterate thru 10million plus records

  • Hi,

    I am trying to loop thru a current dataset of 10million+ records.

    I receive XML data in a text file. One column with 10million rows. I am then stripping out the tags to obtain the columns and values. I will then be pivotting this information. The only snag is when i import this info it obviously needs to groupd certain records together ie tags between <album> and </ablum> is associated with 1 record. I have scripted a cursor with will loop throug the info to obtain the first <ablum> and insert that info into a secondary table with the ID, ColumnName, and Value. It will then get the next record and will have to associate the same ID to the next record but with the new columnName and Value and so forth. Till it reaches the next <ablum> it will set ID + 1 and perform the same loop.

    What is the best way to handle a process like this. Currently its run 2hrs30mins+ and i have stopped the query.

    Please help?

  • Please post table definitions, sample data and desired output. Read this to see the best way to post this to get quick responses.

    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi,

    These are the two tables being used:

    --This one holds the inital imported data

    CREATE TABLE

    [dbo].[XMLLoading]

    (

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [ColumnSource] [varchar](8000) NULL

    )

    --This one is used to hold the new cleanedData with associated RecordID

    CREATE TABLE

    [dbo].[XMLPivotTable]

    (

    [RecordID] [int] NOT NULL,

    [ColumnName] [varchar](1000) NULL,

    [ItemName] [varchar](max) NULL

    )

    The Imported data into [dbo].[XMLLoading] looks something like this:

    1<?xml version="1.0"?>

    2<rimco>

    3 <album>

    4 <producttype>music</producttype>

    6 <barcode>034571173344</barcode>

    8 <barcodetype>UPC</barcodetype>

    10 <origin>USA / Canada</origin>

    12 <title>Complete Songs 2</title>

    14 <artist>Johnson, Graham</artist>

    16 <altartist>Graham Johnson</altartist>

    18 <composer>Faure</composer>

    20 <stock>CDA 67334</stock>

    22 <coupling>CDA 67334</coupling>

    24 <source>AMARANTH</source>

    26 <label>Hyperion</label>

    28 <media>CD</media>

    30 <mediatype>CD</mediatype>

    32 <mediacount>1</mediacount>

    34 <releasedate>2005-03-31 00:00:00</releasedate>

    36 <coding>None</coding>

    I then clean up the info using the following script and insert into [dbo].[XMLPivotTable]:

    DECLARE

    @RecordIDINT

    ,@GetPreviousRecordINT

    ,@ColumnNameVARCHAR(1000)

    ,@ItemNameVARCHAR(MAX)

    ,@getRecordIDCURSOR

    SET @getRecordID = CURSOR FOR

    SELECT

    CASE WHEN

    charindex('</', ColumnSource) = 0

    THEN NULL

    ELSE SubString(ColumnSource, charindex('<', ColumnSource)+1, (charindex('>', ColumnSource)-1)-charindex('<', ColumnSource)) END AS ColumnName,

    CASE WHEN

    charindex('</', ColumnSource) = 0

    THEN NULL

    ELSE SubString(ColumnSource, charindex('>', ColumnSource)+1, (charindex('</', ColumnSource)-2)-charindex('>', ColumnSource)+1) END AS ItemName

    FROM

    dbo.XMLLoading WITH(NOLOCK)

    WHERE

    (CHARINDEX('</', ColumnSource) > CHARINDEX('<', ColumnSource)

    ORColumnSource like '%<album>%')

    OPEN @getRecordID

    SET @RecordID = 0

    FETCH NEXT FROM @getRecordID INTO @ColumnName, @ItemName

    WHILE @@FETCH_STATUS = 0

    BEGIN

    IF @ColumnName IS NULL

    SET @GetPreviousRecord = @RecordID + 1

    INSERT INTO

    XMLPivotTable

    SELECT

    @GetPreviousRecord AS RecordID

    ,@ColumnName AS ColumnName

    ,@ItemName AS ItemName

    FETCH NEXT FROM @getRecordID INTO @ColumnName, @ItemName

    SET @RecordID = @GetPreviousRecord

    END

    CLOSE @getRecordID

    DEALLOCATE @getRecordID

    --I also PIVOT this information dynamically.

    --DECLARE

    --@ColumnNamePvtVARCHAR(MAX)

    --,@SQLNVARCHAR(MAX)

    --

    --

    --SELECT

    -- @ColumnNamePvt = Coalesce(@ColumnNamePvt +',['+ColumnName+']' , '['+ColumnName+']')

    --FROM

    --(

    --SELECT DISTINCT

    --ColumnName

    --FROM

    --XMLPivotTable

    --WHERE

    --ColumnName IS NOT NULL

    --) AS A

    --

    --SET @sql =

    --

    --'SELECT RecordID, ' +

    --@ColumnNamePvt +

    --'FROM

    --(

    --SELECT

    --RecordID,

    --ColumnName,

    --ItemName

    --FROM

    --XMLPivotTable

    --WHERE

    --ColumnName IS NOT NULL

    --) AS a

    --PIVOT

    --(

    --MAX(ItemName)

    --FOR ColumnName IN (' + @ColumnNamePvt + ')

    --) AS pvt'

    --

    --EXEC sp_executesql @sql

    --PRINT @sql

    This process is way to slow.

    Any suggestions??

  • Hello,

    you can try to do it without using cursors, that should speed up your process.

    I think in a solution using intermediate tables, let me explain:

    You coded a query to strip a row and obtain its columnName, itemName; use this query to populate a new table,

    CREATE TABLE

    [dbo].[IntermediateResults1]

    ([ID] [int] NOT NULL,

    [ColumnName] [varchar](1000) NULL,

    [ItemName] [varchar](max) NULL)

    this table should contain

    1, NULL, NULL

    2, NULL, NULL

    3, NULL, NULL

    4, producttype, music

    6, barcode, 034571173344

    8, barcodetype, UPC

    10, origin, USA / Canada

    12, title, Complete Songs 2

    14, artist, Johnson, Graham

    16, altartist, Graham Johnson

    18, composer, Faure

    20, stock, CDA 67334

    ...

    Add a column albumId to this table, the albumId to assign to a row is the previous row where columnName is NULL.

    Go to create a new IntermediateAlbumId table:

    SELECT ID as AlbumId

    INTO IntermediateAlbumId

    FROM IntermediateResults1

    WHERE columnName IS NULL

    Now there are some options to populate the column IntermediateResults1.AlbumId. That's one:

    UPDATE IntermediateResults1 SET AlbumId =

    (SELECT max(albumId)

    FROM IntermediateAlbumId A

    WHERE A.AlbumId <= Id)

    WHERE albumId IS NULL

    or, if it is to slow, iterate this query

    UPDATE IntermediateResults1 SET AlbumId = newAlbumId FROM

    (SELECT TOP 10000 Id,

    (SELECT max(albumId)

    FROM IntermediateAlbumId A

    WHERE A.AlbumId <= Id) AS newAlbumId

    FROM IntermediateResults1

    WHERE albumId IS NULL)

    I haven't checked these queries, please correct them.

    Hope that this helps,

    Francesc

  • Is there a reason to import the XML this way? If you were to use SSIS on the source XML file you wouldn't be having to play some outlandish games like you are now.

    Even without that - using OPENXML would allow you to retrieve whole albums at a time, preserving the relations.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Hi,

    With SSIS 2005 you can only import 1 column at a time using the XML source component. i know 2008 handles the XML better so for now rather than creating many data flows into different temp tables then joining them all together again.

    I was trying some other way until somebody could help me otherwise.

    Please advise.

  • ismailmallum (2/16/2012)


    Hi,

    With SSIS 2005 you can only import 1 column at a time using the XML source component. i know 2008 handles the XML better so for now rather than creating many data flows into different temp tables then joining them all together again.

    I was trying some other way until somebody could help me otherwise.

    Please advise.

    How are you using it? I took a mocked up model of the format you previewed, repeated the album element in there a dozen times, and ask it to import them, and it brought them in as 12 records (in 2005). No need to cycle through each element and re-pivot them: just define the album table, and point the XML source at that node and it will auto-parse the inner elements for you.

    I've done a lot of these feeds in previous jobs, and it will chew through huge XML files in a few minutes.

    Yes - 2008 allows me to play some extra games with the data, but it looks like 2005 can handle your file structure. Unless there's an extra snag that doesn't show in the sample you offered up.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • When i try importing using the XML source it gives me a message:

    Error at Data Flow Task 1 [XML source [1]]: The XML Source Adapter does not support mixed content model on Complex Types.

    What does this mean?

  • That means you have instances of elements with text AND sub-elements.

    So - this would be an example:

    <myBadNode>

    This is a sample piece of text for <Context>Sample</Context> purposes.

    </myBadNode>

    This kind of content is tough to deal with because there's an ambiguity as to what you're supposed to do with the contents (it is one piece of info? 2? 3?).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply