Stored procedures and creating tables (problem)

  • Hello All,

In a single stored procedure I often have a number of functions:

    Initialize,

Prepare something (for example the creation of an empty table).

    Process something (for example filling the table with content).

    Problem:

Sometimes I get an error from the stored procedure because a table does not exist.

So a call to the procedure fails.

But I am still in the initialize phase or the prepare phase.

The failure does not occur when using the table, but when calling the stored procedure for a function which does not use the table, for example the Prepare function.

    Questions:

Is this description clear enough?

Does anybody recognise this behaviour?

What are the rules for these situations?

Is there a proper (and elegant) solution to create and use a table in the same stored procedure? (Not necessarily in the same call.)

    Thanks for your time and attention,

    Ben

(In general these stored procedures are kept in the master database; their names start with SP_ and sp_ms_marksystemobject is used to mark them as system objects.)

Sometimes (rarely) I get an "Invalid column name" error, for tables where the metadata can change.

  • If I understand you correctly, you have procs that create persisted (not temp) tables, and then do things with those tables. Sometimes you don't need the proc to create the table, but it errors out if you don't.

    If that's reasonably accurate, I recommend breaking the functionality up into multiple sub-procs. One proc that creates a table, one that does "process 1", etc. Then conditionally execute those procs from within your main proc.

    So the main proc would just have flow-control ("IF <condition> <do>"), and would execute code in sub-procs.

    That should eliminate the errors you're running into.
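
A minimal sketch of this pattern (the @Step parameter and the proc/table names dbo.MainProc, dbo.PrepareTable, dbo.ProcessTable and dbo.MyTable are placeholders, not the poster's actual objects):

    -- The main proc does only flow control; every statement that touches dbo.MyTable
    -- lives in a sub-proc, so this proc runs even when the table does not exist yet.
    CREATE PROCEDURE dbo.MainProc
        @Step VARCHAR(20)
    AS
    BEGIN
        IF @Step = 'Prepare'
            EXEC dbo.PrepareTable;          -- creates the table if it is missing
        ELSE IF @Step = 'Process'
        BEGIN
            IF OBJECT_ID('dbo.MyTable', 'U') IS NULL
                EXEC dbo.PrepareTable;      -- guard clause before any table access
            EXEC dbo.ProcessTable;          -- only this sub-proc references dbo.MyTable
        END
    END
    GO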

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (7/15/2016)


    If that's reasonably accurate, I recommend breaking the functionality up into multiple sub-procs. One proc that creates a table, one that does "process 1", etc. Then conditionally execute those procs from within your main proc.

    This very probably is a solution.

    But I had hoped for a more elegant solution AND

the specific rules for the situation and when this occurs. (It only fails occasionally and not consistently.)

    If I understand you correctly, you have procs that create persisted (not temp) tables, and then do things with those tables. Sometimes you don't need the proc to create the table, but it errors out if you don't.

Yes, I create persisted tables.

But it is a bit different.

    Example:

    SP_Ben Help -- Would give a number of prints for Help.

    SP_Ben Init -- Would initialise some things.

    SP_Ben Prepare -- Would for example create a table.

    SP_Ben Read -- Would read the table.

    SP_Ben Convert -- Would do a conversion of the table.

    SP_Ben Cleanup -- Would remove intermediate tables.

    Starting on an empty database this works. (Almost always).

    But when something goes awry, sometimes even the Help function reports the missing table.

Creating all the tables every time in a separate stored procedure is something I try to avoid. Sometimes the metadata of the tables depends on the 'environment' and is not known at design time.

    Ben

Why persistent tables (yes, it is a choice):

    Better performance. (In some situations).

    The results are 'persistent' and so give a better overview when testing and debugging.

    Easier to restart the process if something fails.

(You can only have one global temp table with a given ## name at a time.)

(A table starting with # can only be reached from the session that created it.)
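
A small illustration of that temp-table scoping (the table name is made up):

    CREATE TABLE #Work  (id INT);   -- local temp table: visible only to the creating session
    CREATE TABLE ##Work (id INT);   -- global temp table: visible to every session, but only
                                    -- one table named ##Work can exist on the instance at a time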

  • It helps if you get more specific, then we can work on that solution, perhaps making it more general.

    If you need to build a table and work with it, you can do this in a single call, but remember that you need to consider concurrency. Or will this never be called by multiple people/clients?

    The best way, as Gus noted, is to use IF, or guard, clauses to check. Then branch and call the proc yourself, or return an error. So:

CREATE PROCEDURE dbo.PrepareTable
AS
BEGIN
    IF NOT EXISTS (SELECT name FROM sys.objects
                   WHERE type = 'U' AND name = 'MyTable' AND schema_id = 1)
        CREATE TABLE dbo.MyTable (myid INT, mydata VARCHAR(200));
END
GO

CREATE PROCEDURE dbo.FillTable
AS
BEGIN
    IF NOT EXISTS (SELECT name FROM sys.objects
                   WHERE type = 'U' AND name = 'MyTable' AND schema_id = 1)
        EXEC dbo.PrepareTable;

    INSERT dbo.MyTable (myid, mydata)
    VALUES (1, 'My data');
END
GO

Instead of calling the proc in the second one, I can return an error or code.
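
A sketch of that "return an error or code" variant, reusing the MyTable example above (shown as a standalone CREATE for clarity):

    CREATE PROCEDURE dbo.FillTable
    AS
    BEGIN
        IF OBJECT_ID('dbo.MyTable', 'U') IS NULL
        BEGIN
            RAISERROR('dbo.MyTable does not exist; run dbo.PrepareTable first.', 16, 1);
            RETURN 1;   -- non-zero return code signals the failure to the caller
        END

        INSERT dbo.MyTable (myid, mydata)
        VALUES (1, 'My data');
    END
    GO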

If you need something more complex, you need to include more checks. I'm not sure if you want a different type of elegance, but you'll have to explain more. If multiple people can call the procs, usually you use a temp table, or you have some other scheme, like using the SPID to insert/remove rows.

  • Steve Jones - SSC Editor (7/15/2016)


    It helps if you get more specific, then we can work on that solution, perhaps making it more general.

    At the moment I am in a process converting a number of CSV files into XML files.

Using SQL Server to do the grunt of the work. (Why? Because I am familiar with SQL Server and do not want other dependencies on software/versions etc.)

So I start off with a fresh, empty database. I have a number of general and/or generic stored procedures and some specific ones.

SP_CSV does all the CSV handling: reading a folder and importing the files from that folder, each into a specific CSV table. I started off importing all CSV files, just over 60, but now I am only importing 28 (faster, and the others are not used at the moment).

An SP_.....processing procedure, which does the mapping from the CSV content to 24 tables.

An SP_....XML... procedure, which produces 5 types of XML files. For four types only one XML file is generated; for one type thousands of XML files are generated.

    Roughly:

28 tables which are each generated from a CSV file.

24 tables which hold the data after conversion/transformation.

5 tables holding the converted XML strings.

10-12 supporting tables:

For reading the CSVs, I use a few metadata tables (5 to 6) and a few content tables (5).

And a few metadata tables for the other processes.

    At the moment I use 3 Stored procedures which are dedicated to this process.

    These 3 stored procedures generate all the tables and a number of Views.

During development and for mapping purposes more tables were used, also generated by the set of stored procedures. (For example, the mapping was done using a CSV file which was deployed and altered using Excel.)

In general this works OK.

The problem is that occasionally the stored procedures refuse to work because of a missing table. *) (Or sometimes because an existing table does not match the definition the stored procedure thinks it should have.) **)

    *)

This is a problem, because with a completely empty database the stored proc didn't have any problems. Creating the table is a problem because the stored procedure does not want to start.

    **)

No problem: I just delete the offending table. Then it normally works, and sometimes it turns into problem one.

    Ben

Creating and deleting static tables on the fly is one of the worst possible ways to do things in SQL Server.

    There are many other ways to do what you're doing, which won't cause errors on run time.

One of the obvious ones: create static tables with an extra SessionID column (it's worth having clustered indexes on all such tables with this column as the leading key).

    SessionID would be a number generated when you start the process.

Then you can operate on the data specific to each particular process by using WHERE SessionID = @SessionID.

You can pass this parameter from procedure to procedure until you're done, and then you can

DELETE ... WHERE SessionID = @SessionID

If a process was interrupted by something, you can always restart from where it stopped; the data is already there, no need to reload.

    You'll surely find many more advantages of doing things right, if you go this way.
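
A minimal sketch of this SessionID pattern (the sequence, table and column names are examples only; CREATE SEQUENCE needs SQL Server 2012 or later):

    CREATE SEQUENCE dbo.SessionSeq AS INT START WITH 1;
    GO
    CREATE TABLE dbo.Staging
    (
        SessionID INT          NOT NULL,
        RowNo     INT          NOT NULL,
        Payload   VARCHAR(MAX) NULL,
        CONSTRAINT PK_Staging PRIMARY KEY CLUSTERED (SessionID, RowNo)  -- SessionID leads the clustered key
    );
    GO
    DECLARE @SessionID INT = NEXT VALUE FOR dbo.SessionSeq;             -- one number per run

    -- load and process only the rows belonging to this run
    INSERT dbo.Staging (SessionID, RowNo, Payload) VALUES (@SessionID, 1, 'first row');
    SELECT Payload FROM dbo.Staging WHERE SessionID = @SessionID;

    -- clean up when the run has finished
    DELETE dbo.Staging WHERE SessionID = @SessionID;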

    _____________
    Code for TallyGenerator

  • Sergiy (7/18/2016)


Creating and deleting static tables on the fly is one of the worst possible ways to do things in SQL Server.

Why?

One of the obvious ones: create static tables with an extra SessionID column (it's worth having clustered indexes on all such tables with this column as the leading key).

The CSV tables which get created depend on the content of the CSV files, so they cannot be created at the start of the process, only after the content is known.

For the intermediate tables it would be possible to generate them at the start, because their format is known. *)

    SessionID

Normally I use the SPID when it is a multi-user procedure; this procedure is strictly single-user, so a SessionID is not needed.

    You'll surely find many more advantages of doing things right, if you go this way.

Maybe you are right that this is the right method. But I would like to see arguments for why it is the right method. I have very often followed the advice of people claiming the 'right method', and very often discovered later in the process that that 'right method' also comes with its own problems. **)

    Thanks for your reply,

    Ben

    *)

In the running version the created intermediate tables are dropped on the fly as well. This is to preserve space (especially cache). Dropping and recreating has some speed advantages. (Some of the larger, time-consuming tables worked faster with a persistent table: create table, fill table, cluster index, use table, drop table.)

    **)

In my experience it is good to question the advice that is given on the internet. Often the advice is very good. But for more complex matters the advice given is very often less than perfect.

So yes, I have learned from previous experiences; I try to evaluate each piece of advice.

  • ben.brugman (7/19/2016)


    Sergiy (7/18/2016)


Creating and deleting static tables on the fly is one of the worst possible ways to do things in SQL Server.

Why?

    Because it's not a single-user environment.

    Because a process can be delayed, another file can arrive before the previous one is done.

    Because - see your starting topic.

    You listed there several problems which are caused by that approach.

One of the obvious ones: create static tables with an extra SessionID column (it's worth having clustered indexes on all such tables with this column as the leading key).

The CSV tables which get created depend on the content of the CSV files, so they cannot be created at the start of the process, only after the content is known.

    How exactly do you become aware of the content?

    Does your code parse each file before loading it?

Does your code read the tags/headers in an attempt to allocate random, unpredictable labels to known items in the database schema?

    I really doubt that.

You know exactly the schema definition of each CSV file your code is about to upload.

So, there is no problem having prepared staging tables for each type of CSV file you're dealing with.

    And keep them static in the database.

For the intermediate tables it would be possible to generate them at the start, because their format is known. *)

SessionID

Normally I use the SPID when it is a multi-user procedure; this procedure is strictly single-user, so a SessionID is not needed.

SPID is too unreliable.

If your connection is dropped for some reason, the data associated with that SPID becomes orphaned.

You cannot really continue processing it; you had better drop it and start over.

It's better to abandon the process and delete the data, because a new @@SPID allocated to the "restored" connection may have been used by another uploading process just before the connection was lost.

Not to mention, 2 consecutive steps of the same Agent job will most certainly have different SPIDs, so the data sets they operate on would have no relevance to each other.

    *)

In the running version the created intermediate tables are dropped on the fly as well. This is to preserve space (especially cache). Dropping and recreating has some speed advantages. (Some of the larger, time-consuming tables worked faster with a persistent table: create table, fill table, cluster index, use table, drop table.)

How much space do you preserve by dropping a table? Truncating it deallocates the data pages and actually releases the space.

    So, by dropping tables you might actually increase the demand for space.

    Dropping and recreating does not have any speed advantages.

It actually creates the overhead of updating the system catalogue, which slows things down.

It might even cause heavy locking and deadlocking on the system catalogue, which would stall all the processes on the database completely and bring the system down (speaking from personal experience).

    **)

In my experience it is good to question the advice that is given on the internet. Often the advice is very good. But for more complex matters the advice given is very often less than perfect.

So yes, I have learned from previous experiences; I try to evaluate each piece of advice.

    It's good indeed.

    _____________
    Code for TallyGenerator

  • How are you doing the CSV imports? OpenRowset? Linked servers? BCP?

    For most of those, you need to have the file definition already in-place.

If you're pulling data in via OPENROWSET as SINGLE_CLOB and then using a parser to pull column headers out of the VARCHAR blob, you would just need to add a check against sys.columns that validates the new metadata against the existing column definitions. You could then either error out, drop the table in question, or dynamically issue the necessary ALTER TABLE statements to make the existing table work with the new file.

    Making something like that work in a datastream where you have dynamically defined source-files is a complex thing to build. It's either going to require heavy human intervention whenever the system tries to plug a square peg in a round hole, or it's going to need complex coding and algorithms to know when to modify the peg and when to adjust the hole.

    I've built systems like that. They aren't easy and they take a lot of work to get running, but they can handle an amazingly complex array of file definitions if you do them right.
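
A rough sketch of such a check against sys.columns (dbo.Csv_Import and #ParsedColumns are hypothetical names; the temp table would be filled by whatever parses the headers):

    CREATE TABLE #ParsedColumns (ColumnNo INT, ColumnName SYSNAME, DataType SYSNAME);

    -- any rows returned mean the file no longer matches the existing table:
    -- error out, drop the table, or generate ALTER TABLE statements from this result
    SELECT p.ColumnNo, p.ColumnName, p.DataType
    FROM   #ParsedColumns AS p
    LEFT JOIN sys.columns AS c
           ON c.object_id = OBJECT_ID('dbo.Csv_Import')
          AND c.name      = p.ColumnName
    WHERE  c.column_id IS NULL                          -- column missing from the table
       OR  TYPE_NAME(c.system_type_id) <> p.DataType;   -- or declared with a different type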

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Sergiy (7/19/2016)


    Because it's not a single-user environment.

    Because a process can be delayed, another file can arrive before the previous one is done.

    Because - see your starting topic.

    You listed there several problems which are caused by that approach.

    It is a single-user environment.

    The files do not arrive but are present before the process is started.

Most complex situations have problems you have to deal with. If a situation were abandoned because there are problems, most products wouldn't get built. So I like to analyse the problem before abandoning the current solution.

    How exactly do you become aware of the content?

    Does your code parse each file before loading it?

Does your code read the tags/headers in an attempt to allocate random, unpredictable labels to known items in the database schema?

A CSV file is read as a string. Then it is parsed.

After parsing, the number of columns is known.

The code does read the tags/headers (if available) and allocates those names to the columns.

In my current set of files the second line contains the datatypes. But these do not match the SQL Server datatypes, so they are substituted with SQL Server datatypes. For files not containing datatypes, all fields are defined as varchar.

If no tags/headers are available, the columns get a generic name and a number.

For the current project I have 6 different sets of CSVs to process. I have only seen 2 of the sets up to now. For the other 4 sets I do not know if there is a tag/header, if there is a type line, or what the delimiter, the EOL and/or the textqualifier is.
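
As an illustration, a rough sketch of creating such a table dynamically from the parsed headers and already-substituted SQL Server types (#CsvColumns, its sample contents and the name dbo.Csv_Import are made up):

    CREATE TABLE #CsvColumns (ColumnNo INT, ColumnName SYSNAME, SqlType NVARCHAR(128));
    INSERT #CsvColumns VALUES (1, N'Id', N'INT'), (2, N'Name', NULL);      -- pretend parser output

    DECLARE @sql NVARCHAR(MAX);

    SELECT @sql = N'CREATE TABLE dbo.Csv_Import ('
                + STUFF((SELECT N', ' + QUOTENAME(ColumnName) + N' '
                                + COALESCE(SqlType, N'VARCHAR(MAX)')       -- default when no type line exists
                         FROM #CsvColumns
                         ORDER BY ColumnNo
                         FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, N'')
                + N');';

    EXEC sys.sp_executesql @sql;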

    SPID is too unreliable.

If your connection is dropped for some reason, the data associated with that SPID becomes orphaned.

You cannot really continue processing it; you had better drop it and start over.

It's better to abandon the process and delete the data, because a new @@SPID allocated to the "restored" connection may have been used by another uploading process just before the connection was lost.

Not to mention, 2 consecutive steps of the same Agent job will most certainly have different SPIDs, so the data sets they operate on would have no relevance to each other.

As said, in this case it is a single-user process.

And in the cases where I do use the SPID, it is sufficient for the purpose.

    Dropping and recreating does not have any speed advantages.

I tried a number of setups. Temp tables were slower than persistent tables.

After the table is filled, it is clustered. For the next filling, the table has to be emptied (truncated) and the clustered key removed (to my knowledge the removal of a clustered key involves rebuilding the table, so that is more work than dropping the table). So I am not convinced that removing the clustered index has advantages over dropping the table.

Thanks for your remarks. I will keep them in mind. For the current project I will probably go ahead on the road taken. For the next project the remarks will be taken into account. *)

    Ben

    *)

Too many times I have followed the suggestions of the wise, only to discover that the advice was not based on a decent analysis of the problem; the suggestion was given because it might work, but it also might not. Often the advice is to abandon the road taken and start over, and often the suggested road was no better than the initial road. (And restarting took a huge amount of time without solving the problem.)

  • GSquared (7/19/2016)


    How are you doing the CSV imports? OpenRowset? Linked servers? BCP?

    For most of those, you need to have the file definition already in-place.

If you're pulling data in via OPENROWSET as SINGLE_CLOB and then using a parser to pull column headers out of the VARCHAR blob, you would just need to add a check against sys.columns that validates the new metadata against the existing column definitions. You could then either error out, drop the table in question, or dynamically issue the necessary ALTER TABLE statements to make the existing table work with the new file.

    Making something like that work in a datastream where you have dynamically defined source-files is a complex thing to build. It's either going to require heavy human intervention whenever the system tries to plug a square peg in a round hole, or it's going to need complex coding and algorithms to know when to modify the peg and when to adjust the hole.

    I've built systems like that. They aren't easy and they take a lot of work to get running, but they can handle an amazingly complex array of file definitions if you do them right.

The information of a folder is acquired using xp_cmdShell.

Reading of individual files is again done using xp_cmdShell.

After that the file is parsed into independent fields.

This involves splitting the string on the different delimiters.

And repairing the splits which fall between textqualifiers; this repairing was faster than figuring out which delimiters were between textqualifiers.

After this the number of columns is known, and the number of rows/lines is known.

If the first line contains tags/headers, these are used. (If not, numbers.)

If the second line contains datatypes, these are used. (If not, varchar.)
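
A sketch of those two xp_cmdshell steps (the folder path and file name are examples; xp_cmdshell must be enabled on the instance):

    -- list the files in the folder, one row per file
    CREATE TABLE #FileList (FileName NVARCHAR(260));
    INSERT #FileList (FileName)
    EXEC master.sys.xp_cmdshell 'dir /b C:\Import\*.csv';
    DELETE #FileList WHERE FileName IS NULL;          -- remove the trailing NULL row xp_cmdshell typically returns

    -- read one file, one row per line, ready for the parser
    CREATE TABLE #RawLines (Line NVARCHAR(MAX));
    INSERT #RawLines (Line)
    EXEC master.sys.xp_cmdshell 'type "C:\Import\example.csv"';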

I have looked on the internet and asked questions here about how to parse a CSV file where the definition is not known at design time and create a table from it. I could not find a solution within SQL Server. (There were solutions presented, sometimes not complete, sometimes dependent on other technologies.)

So I have created a solution that reads files and creates tables from them. It is still not perfect, but it works. It can handle different delimiters, different textqualifiers, and delimiters and EOLs within textqualifiers. *)

See the exception *), but I think I can handle all other constructions. And I might publish the result on this forum, but at the moment it is still too rough.

    Thanks for your reply,

    Ben

    *)

One thing I cannot handle at the moment is where two different textqualifiers are used, for example:

    " 'I am Ben' oep's this is wrong", 'A double " is used in the line instead of a double single qoute ('''') '

    These should be processed as two fields. The fields being:

    'I am Ben' oep's this is wrong

    A double " is used in the line instead of a double single qoute ('')

I don't even know whether a construction like that is 'legal'.

  • I think you've thought this through well. Certainly you can check for existence of the table before processing, but that doesn't help with a schema mismatch.

    The best way may be to actually create the table and process it in one proc, perhaps having a proc that drives this, but once you have a file, using a single proc to load/parse/process it. Seems backwards, but you really have dependencies here that make it seem as though having separate calls for the steps might not be the best architecture.

  • ben.brugman (7/19/2016)


The information of a folder is acquired using xp_cmdShell.

Reading of individual files is again done using xp_cmdShell.

After that the file is parsed into independent fields.

This involves splitting the string on the different delimiters.

And repairing the splits which fall between textqualifiers; this repairing was faster than figuring out which delimiters were between textqualifiers.

After this the number of columns is known, and the number of rows/lines is known.

If the first line contains tags/headers, these are used. (If not, numbers.)

If the second line contains datatypes, these are used. (If not, varchar.)

    Been there, done that.

    Exactly that.

But in my case I had to do it not because I did not know the expected schema, but because I could not predict what kind of errors they'd make on the other side.

So, here is how I did it.

1. Import the file into a 2-column #table:

    RowNo IDENTITY

    RowText

I used BULK INSERT with a format file. The format file was needed to bypass the RowNo (identity) column.

    2. Register the file in table dbo.Files and get a new @FileID for it.

    That was probably not the best way to do it.

In the case of a BULK INSERT failure I did not have any trace of the attempt recorded in the database.

Now I'd generate @FileID first, then load into a static 3-column table:

    FileID,

    RowNo IDENTITY

    RowText

    PRIMARY KEY CLUSTERED (FileID, RowNo)

3. Parse the uploaded rows into an EAV-style table:

    FileID (big)int,

    RowNo int,

    ColumnNo int,

    Value sql_variant

    PRIMARY KEY CLUSTERED (FileID, RowNo, ColumnNo)

    I was parsing the rows according to the fixed schema definitions saved in another table:

    FileTypeID

    ColumnNo

    DataType

IsNullable

    -- something else I can't remember right now.

If you don't know what to expect in the file, you're gonna need to populate such a table on the go by parsing the headers, and it's gonna be FileID instead of FileTypeID then.

Once this table is loaded, the #table is dropped (or the records relevant to @FileID are deleted from the "BULK LOAD" table).

4. Data in the Value column is validated according to the rules imposed by the data type and probably some other limitations/definitions applied.

    Validation is done by columns, so it pays to have an index on FileID, ColumnNo with Value included.

    5. If validation is successful it's time to transfer the data to tables in the actual database.

I used a dynamically generated cross-tab query for that.

    It was based on a mapping table defining a correlation between fields in an inbound file and table-columns in the destination database.

    6. If there are no errors till now it's time to delete all the records referenced by the @FileID from the staging table(s).

    I had it all in a separate database named "Integration" and it was used to load files into several systems having databases hosted on the same server.

I must admit, it was not lightning fast, but it was faster than pathetic DTS packages anyway.

It processed hundreds of files per day, submitted on no particular schedule; some of them had over a million rows each.

    Self-deleting dynamically created jobs were processing the data in parallel, so big files were not delaying tiny data feeds.

    I honestly do not see how this process can benefit from dynamically created static tables.
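
A DDL sketch of the staging tables described in steps 1-4 above (table and index names are illustrative, not the originals):

    CREATE TABLE dbo.RawRows
    (
        FileID  INT               NOT NULL,
        RowNo   INT IDENTITY(1,1) NOT NULL,
        RowText NVARCHAR(MAX)     NULL,
        CONSTRAINT PK_RawRows PRIMARY KEY CLUSTERED (FileID, RowNo)
    );

    CREATE TABLE dbo.ParsedValues
    (
        FileID   INT         NOT NULL,
        RowNo    INT         NOT NULL,
        ColumnNo INT         NOT NULL,
        Value    SQL_VARIANT NULL,
        CONSTRAINT PK_ParsedValues PRIMARY KEY CLUSTERED (FileID, RowNo, ColumnNo)
    );

    -- supports the column-by-column validation in step 4
    CREATE NONCLUSTERED INDEX IX_ParsedValues_Column
        ON dbo.ParsedValues (FileID, ColumnNo) INCLUDE (Value);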

    *)

One thing I cannot handle at the moment is where two different textqualifiers are used, for example:

    " 'I am Ben' oep's this is wrong", 'A double " is used in the line instead of a double single qoute ('''') '

    These should be processed as two fields. The fields being:

    'I am Ben' oep's this is wrong

    A double " is used in the line instead of a double single qoute ('')

I don't even know whether a construction like that is 'legal'.

I'm pretty sure it's not 'legal', and to me it's not reasonable.

A single quote in the first position may be a part of the value in the field; you cannot assume it's a delimiter, or you'll distort the data.

The only case where it can be interpreted correctly is if you have static data definitions, and textqualifiers are defined for each column, not for a file type (like I had it in my system).

    _____________
    Code for TallyGenerator

  • ben.brugman (7/19/2016)


I tried a number of setups. Temp tables were slower than persistent tables.

    That's odd.

    It should not be like that.

Tempdb is normally set up to be the fastest database in the neighbourhood.

If it's not, then there is something wrong with your server setup.

After the table is filled, it is clustered. For the next filling, the table has to be emptied (truncated) and the clustered key removed (to my knowledge the removal of a clustered key involves rebuilding the table, so that is more work than dropping the table). So I am not convinced that removing the clustered index has advantages over dropping the table.

    I don't see why the clustered index has to be removed.

    Can you name a single reason for doing that?

Thanks for your remarks. I will keep them in mind. For the current project I will probably go ahead on the road taken. For the next project the remarks will be taken into account. *)

    Ben

    Your path is not so much different from mine, actually.

    Just don't remove the staging tables and clustered indexes on them.

If you insert new records in an order matching the clustered index definition, it hardly creates any overhead at all.
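
As an illustration (reusing the hypothetical dbo.Staging table sketched earlier): TRUNCATE TABLE empties the table but keeps its definition and indexes, so neither the table nor the clustered index needs to be dropped between runs.

    TRUNCATE TABLE dbo.Staging;            -- empties the table; the clustered index stays in place

    INSERT dbo.Staging (SessionID, RowNo, Payload)
    SELECT SessionID, RowNo, Payload
    FROM   dbo.NewLoad                     -- hypothetical source of the next run's rows
    ORDER BY SessionID, RowNo;             -- feed rows in clustered-key order to keep the overhead low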

    _____________
    Code for TallyGenerator

  • Sergiy (7/19/2016)



I built something very similar, but it uses XML instead of EAV, and NULLs out empty columns. It used less space, and XML parsing was faster than EAV reconstitution. (I got the idea from Sparse Columns tables, which use XML for their actual storage.)

    My solution keeps a record of the rows received, and assigns a hash value to each, since many of the files have very low delta rates and send the exact same rows day after day after day. Dramatically reduced the amount of processing needed for subsequent files after the first one.

    But mine's a specialized solution for a specific datastream.

    There are other variations. Different ones are "best" for different situations.
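
A sketch of that row-hash idea, reusing the hypothetical dbo.RawRows table from the earlier sketch (note that HASHBYTES input is limited to 8,000 bytes before SQL Server 2016):

    CREATE TABLE dbo.SeenRows
    (
        RowHash VARBINARY(32) NOT NULL PRIMARY KEY    -- SHA2_256 produces 32 bytes
    );

    -- only rows whose hash has not been seen before need any further parsing
    SELECT r.FileID, r.RowNo, r.RowText
    FROM   dbo.RawRows AS r
    WHERE  r.RowText IS NOT NULL
      AND  NOT EXISTS (SELECT 1 FROM dbo.SeenRows AS s
                       WHERE s.RowHash = HASHBYTES('SHA2_256', r.RowText));

    -- afterwards, record the new hashes so identical rows in tomorrow's file can be skipped
    INSERT dbo.SeenRows (RowHash)
    SELECT DISTINCT HASHBYTES('SHA2_256', r.RowText)
    FROM   dbo.RawRows AS r
    WHERE  r.RowText IS NOT NULL
      AND  NOT EXISTS (SELECT 1 FROM dbo.SeenRows AS s
                       WHERE s.RowHash = HASHBYTES('SHA2_256', r.RowText));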

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
