Parsing data

Question

Parsing data

jkw

SSC Eights!

Points: 815
More actions
August 10, 2018 at 6:01 am

#369649

I've imported a large flat file that I need to clean up. Each record should begin with a 4 digit integer and then a string following the integer. A new line would start with another 4 digit integer and string. Currently the records look like this
1375 Order includes
All of the items you ordered plus
a few
other items that you did not order
1376 There are no orders in this entry
1377 We always shop for apples and
oranges, but we do not like grapes
I would like to get it to be 3 individual records starting with the integer followed by the string.
1375 Order includes All of the items you ordered plus a few other items that you did not order
1376 There are no orders in this entry
1377 We always shop for apples and oranges, but we do not like grapes

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 246695 More actions · Answer 1

jon.wilson - Friday, August 10, 2018 6:01 AM
I've imported a large flat file that I need to clean up. Each record should begin with a 4 digit integer and then a string following the integer. A new line would start with another 4 digit integer and string. Currently the records look like this
1375 Order includes
All of the items you ordered plus
a few
other items that you did not order
1376 There are no orders in this entry
1377 We always shop for apples and
oranges, but we do not like grapes
I would like to get it to be 3 individual records starting with the integer followed by the string.
1375 Order includes All of the items you ordered plus a few other items that you did not order
1376 There are no orders in this entry
1377 We always shop for apples and oranges, but we do not like grapes

OK, so is your question: "Can you assist me in writing a query to remove all rows which do not begin with "nnnn "? If not, what is your question?

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

jkw SSC Eights! Points: 815 More actions · Answer 2

Sorry, I would like to have the rows that don't begin with nnnn to be attached to the previous row. The data source I recieved has a Item number (nnnn) and then a description. However, most of the descriptions after the item number start on a new row or rows. The next item starts with a new 4 digit integer. So, the final output would be
1234 String with the description

Phil Parkin SSC Guru Points: 246695 More actions · Answer 3

jon.wilson - Friday, August 10, 2018 6:23 AM
Sorry, I would like to have the rows that don't begin with nnnn to be attached to the previous row. The data source I recieved has a Item number (nnnn) and then a description. However, most of the descriptions after the item number start on a new row or rows. The next item starts with a new 4 digit integer. So, the final output would be
1234 String with the description

If I had read your first post more carefully, I would have noticed that, sorry!

How have you guaranteed that the rows are in the correct order after performing the import? Is there some identity column in your import table which can be used to ensure order?

Or are you looking to modify the import itself, such that the rows are imported 'correctly' without requiring a subsequent clean-up?

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

jkw SSC Eights! Points: 815 More actions · Answer 4

Hopefully this will help, right now there are 7 records that need to be combined into 2 records each beginning with the 4 digit integer and the corresponding string.

SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO SET ANSI_PADDING ON GO CREATE TABLE [dbo].

( [Column 0] [varchar](5000) NULL ) ON [PRIMARY]

GO SET ANSI_PADDING OFF GO INSERT [dbo].

([Column 0]) VALUES (N'1376 Beginning of string ''A''') GO INSERT [dbo].

([Column 0]) VALUES (N'The rest of string ''A''') GO INSERT [dbo].

([Column 0]) VALUES (N'1377 Part of string ''B''') GO INSERT [dbo].

([Column 0]) VALUES (N'more of string ''B''') GO INSERT [dbo].

([Column 0]) VALUES (N'even more of string ''B''') GO INSERT [dbo].

([Column 0]) VALUES (N'still more of string ''B''') GO INSERT [dbo].

([Column 0]) VALUES (N'the last part of string ''B''') GO

jkw SSC Eights! Points: 815 More actions · Answer 5

The rows are in the correct order from what I can tell

jkw SSC Eights! Points: 815 More actions · Answer 6

The flat file has the data in the correct order, every record does begin with the 4 digit integer and then the description, but there are carrage returns at the end of many of the strings creating essentially a new record.

Phil Parkin SSC Guru Points: 246695 More actions · Answer 7

jon.wilson - Friday, August 10, 2018 6:46 AM
The rows are in the correct order from what I can tell

Tables in SQL Server have no natural order. If you have the time/inclination, you may find this interesting.

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

Phil Parkin SSC Guru Points: 246695 More actions · Answer 8

jon.wilson - Friday, August 10, 2018 6:50 AM
The flat file has the data in the correct order, every record does begin with the 4 digit integer and then the description, but there are carrage returns at the end of many of the strings creating essentially a new record.

Understood. Here are two ways to fix this:

1) Pre-process the file (use PoSh / C# / Python or whatever language you feel comfortable with) to put it in the desired format prior to import.

2) Use SSIS to perform the import. A Script Component source would be able to read the file in its current form and output the file in its tidied form (after writing some code).

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

jkw SSC Eights! Points: 815 More actions · Answer 9

Ideally, I would like to get this into 2 columns. One for the 4 digit integer (which would actually be my key) and then the description

jkw SSC Eights! Points: 815 More actions · Answer 10

I'll take a look at these two paths, thanks for your suggestions

Phil Parkin SSC Guru Points: 246695 More actions · Answer 11

jon.wilson - Friday, August 10, 2018 6:56 AM
I'll take a look at these two paths, thanks for your suggestions

If you're not already familiar with SSIS, option 1 is likely to be the faster option, as there's a bit of a learning curve.

If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

drew.allen SSC Guru Points: 76944 More actions · Answer 12

There is a third option if you can add an identity column to your staging table. Is that possible?

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

jkw SSC Eights! Points: 815 More actions · Answer 13

Yes, the staging table can be anything right now

drew.allen SSC Guru Points: 76944 More actions · Answer 14

I've updated the staging table definition. I used a temp table here, but you should use your regular table.

CREATE TABLE #table( id INT IDENTITY, [Column 0] [varchar](5000) NULL ) ON [PRIMARY]

INSERT #table ([Column 0])
VALUES
    (N'1376 Beginning of string ''A''')
,   (N'The rest of string ''A''')
,   (N'1377 Part of string ''B''')
,   (N'more of string ''B''')
,   (N'even more of string ''B''')
,   (N'still more of string ''B''')
,   (N'the last part of string ''B''')
;

WITH grouped AS
(
    SELECT t.id, t.[Column 0], i.is_start, SUM(i.is_start)OVER(ORDER BY t.id ROWS UNBOUNDED PRECEDING) grp
    FROM #table t
    CROSS APPLY (VALUES(CASE WHEN t.[Column 0] LIKE '[0-9][0-9][0-9][0-9]%' THEN 1 ELSE 0 END)) i(is_start)
)
SELECT
(
    SELECT sub.[Column 0] + ' '
    FROM grouped sub
    WHERE sub.grp = g.grp
    ORDER BY sub.id
    FOR XML PATH(''), TYPE
).value('.', 'VARCHAR(5000)') AS Column0
FROM grouped g
WHERE g.is_start = 1
ORDER BY g.id

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA