Are ETL/DW techniques best for incremental extracts?

  • Hello experts,

    I've been reading about ETL and data warehousing lately to get ideas on how I might create a kind of export framework for various data feeds that I need to run. The feeds need to either export a spreadsheet for upload into another system or need to output a recordset that can be transferred to a system using an API to process the rows. In both cases there may be updates to the data, and if so, only those rows would be selected. If nothing has changed, the rows should not be sent again.

    This is not for a full-blown data warehouse. I just want to be able to put in place a system to pull incremental differences from the data as well as preserve a history of changes.

    My basic questions are:

    (1) Am I right to pursue using an ETL-style design, or is there another, simpler variant just for incremental extracts?

    (2) And either way, are there examples of doing either way that don't involve using the SQL MERGE statement, which I've heard has problems?

    (3) Finally, is SSIS the way to handle this, or can a combination of tables, views, and stored procedures also be used?

    I've tried too many links to list them all here, but for starters I can say that I've reviewed the examples from Andy Leonard, read some of the Ralph Kimball material on slowly changing dimensions, and found these pages helpful:

    http://stackoverflow.com/questions/20882742/how-to-create-staging-table-to-handle-incremental-load

    http://www.dwbiconcepts.com/etl/27-basic-etl-concepts/109-methods-of-incremental-loading-in-data-warehouse.html

    http://stackoverflow.com/questions/13860331/ssis-incremental-import

    I'm just concerned that I am heading into overkill for this kind of process by trying to use data warehouse techniques. But maybe that is the way I need to go.

    Thanks for any help.

    - webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • webrunner (11/17/2014)


    This is not for a full-blown data warehouse. I just want to be able to put in place a system to pull incremental differences from the data as well as preserve a history of changes.

    These are two different tasks. Setting up an auditable history of changes involves, well, setting up an audit system for the data. Knowing what's changed from point to point is a different matter.

    Let's skip the audit system for now, that's not what you're after, at least in this post.

    The simplest way to determine a delta change is to put a 'last updated' field on your tables, and trigger it to update to GETDATE() whenever an insert/update is applied. Then, in your data pickups, you store the most recent datetime you've transimitted data for. From there, you use a where clause on your data pickups you get the most recent time in the table (so that you don't have skips between different points), and simply grab everything WHERE LastUpdated > @LastPullEndTime AND LastUpdate <= @MaxCurrentDateInTable.

    (2) And either way, are there examples of doing either way that don't involve using the SQL MERGE statement, which I've heard has problems?

    That's curious, what problems have you heard? Either way, you're determining deltas for outbounds, not trying to merge a delta. MERGE doesn't help you here.

    (3) Finally, is SSIS the way to handle this, or can a combination of tables, views, and stored procedures also be used?

    SSIS can be used to handle this, yes, and is probably the best bet for creating the spreadsheet. If you're actually trying to load to an API, however, is it an exposed EXE? If so, you might want to look into loading Service Broker to directly call the API and skip the middle step of generating a hard file.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Thanks for this reply!

    I'll review your suggestions and come up with a design. I see what you mean about tracking the history/auditing can be treated separately.

    The particular problem I have concerns a course with meeting times, and if the meeting time changes, I need to make sure the update is captured.

    The tricky thing here is that the meetings (as I currently understand it) are sessions, so it's possible that a session could change not only location but also start or end time in such a way that it's order in the session list changes. For example, if session 3 has its time changed to a date that is before session 1, then as I see it, session 3 should become the new session 1 and the other sessions need to be renumbered, etc. The session number is used in the title so I need to get that part right if there's an update. I want to solve this puzzle on my own if possible, but if I get stumped I will return humbly and ask for help.

    Regarding MERGE, this is the page I read:

    http://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/[/url]

    FYI, the last update for that page is listed as 10/17/2013, so maybe things have changed regarding that information.

    Thanks again,

    webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • webrunner (11/17/2014)


    For example, if session 3 has its time changed to a date that is before session 1, then as I see it, session 3 should become the new session 1 and the other sessions need to be renumbered, etc. The session number is used in the title so I need to get that part right if there's an update. I want to solve this puzzle on my own if possible, but if I get stumped I will return humbly and ask for help.

    We've moved beyond delta location and temporal logic. Good luck with it. My guess is you'll end up leaning on the Row_Number() function and/or calculated columns. Both have their merits and pitfalls. Drop on by if you want help. For now, a hint: Don't put the session # into the varchar title. Concatonate the field on its way out and put the session# into it's own column. Persisting the calculated column to do this would give you an easy access point for it as well.

    Regarding MERGE, this is the page I read:

    http://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/[/url]

    FYI, the last update for that page is listed as 10/17/2013, so maybe things have changed regarding that information.

    Heh, I doubt it. Connect doesn't move quickly.

    I'd heard about these before. Most of the concerns are handled by using an explicit transaction and not using it against tables where you also have filtered indexes. Filtered Indexes are rare enough in my world that they're a non-concern. There are some RI concerns as well, but I generally already sanitize any deltas inbound to my systems during the ETL process into error tables for ease of locating and avoiding FK errors and the like during the inclusion.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply