Storing VARCHAR(MAX) off-row

  • Hi Jeff

     

    We have to pull in everything from the documents as we don't know what is needed for reporting purposes (orders from the higher-ups).

    This is the only document that is causing an issue (presumably the one with the most fields, although I haven't checked).

     

    The "documents" database sounds interesting, have you got any links with a decent amount of detail on this?

  • You should take all those varchar(max) columns off the table and put them into a good old EAV table:

     

    DocumentId (fk to the main table)

    InfoType NOT NULL

    InfoContents VARCHAR(MAX) NOT NULL

     

    _____________
    Code for TallyGenerator
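
    A minimal sketch of that EAV table, assuming an integer DocumentId key on the main table. The thread only gives the three column names, so the types, schema, and constraint names here are assumptions:

    ```sql
    -- Hypothetical EAV table per the suggestion above; names/types are assumptions.
    CREATE TABLE dbo.DocumentInfo
    (
        DocumentId   INT          NOT NULL,   -- FK to the main document table
        InfoType     VARCHAR(128) NOT NULL,   -- which original column this row represents
        InfoContents VARCHAR(MAX) NOT NULL,
        CONSTRAINT PK_DocumentInfo PRIMARY KEY CLUSTERED (DocumentId, InfoType)
    );
    ```

    One practical upside of this shape: only populated fields get rows, so the hundreds of NULL varchar(max) columns in the wide table simply disappear.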

  • I think I've just had a brainwave.

     

    We currently store all the document data vertically in Table1.  It contains (amongst other things):-

    Document name

    Field name

    Field value

     

    Now, what we then do is split this data out into individual tables using the Document name from Table1.

    This obviously takes a while and creates 600+ tables, each holding the data just for that document.  This is done automagically.

     

    Now, what I thought was, instead of creating tables to hold the data, why not create views instead to do the same thing.

     

    Pros :-

    1. Less transfer of data on disc (i.e. we are not copying data from Table1 to individual document tables)
    2. Less processing time (see above)
    3. We can get rid of all the individual tables, thereby saving a lot of disc space (I've worked out about 100-150GB)
    4. As far as I know, there is no limit on the number of fields in a view and no restrictions on the size of the data being brought back

     

    Cons :-

    I can only see one potential con, which is that it may take longer to get data back from a view than from a table.
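
    As a rough sketch of the idea, each per-document view would be a cross-tab over Table1 rather than a copy of its data. This assumes Table1 also carries a row/document identifier to group one document instance's fields together; the view name, document name, and field names below are invented for illustration:

    ```sql
    -- Hypothetical per-document view instead of a physical table.
    -- DocumentId, 'DischargeSummary', and the field names are assumptions.
    CREATE VIEW dbo.vw_DischargeSummary
    AS
    SELECT  t.DocumentId,
            MAX(CASE WHEN t.FieldName = 'PatientName'   THEN t.FieldValue END) AS PatientName,
            MAX(CASE WHEN t.FieldName = 'AdmissionDate' THEN t.FieldValue END) AS AdmissionDate
    FROM    dbo.Table1 AS t
    WHERE   t.DocumentName = 'DischargeSummary'
    GROUP BY t.DocumentId;
    ```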

     

    Can anyone see any other flaws in my plan?

     

    Thanks

     

    Richard

  • richardmgreen1 wrote:

    Hi Jeff

    We have to pull in everything from the documents as we don't know what is needed for reporting purposes (orders from the higher-ups).

    This is the only document that is causing an issue (presumably the one with the most fields, although I haven't checked).

    The "documents" database sounds interesting, have you got any links with a decent amount of detail on this?

    I am, by no stretch of the imagination, even a casual user of document databases, never mind any kind of expert on the subject.  My understanding is that a lot of them use JSON "documents" to store information and that this can be very useful, but I've never worked with one closely enough to know if that's actually correct.

    With that understanding, I've been told that document databases are good at this type of thing (storing lots of non-relational data).  One of those database engines is MongoDB and it seems to be supported by MS and a fair number of people I know seem to think it's ok.  That is not a judgement on my part.  It's only what I've heard.  There are other "document" databases that you can search for.  Here's a "top level" link for MongoDB.

    https://www.mongodb.com/what-is-mongodb

    As with any possible new endeavor, especially one of this nature, there's going to be a need on your part for a fair bit of research.  The first thing I'd do is research a whole lot more about what a document database actually is, and what its capabilities, limits, caveats, requirements, and level of support are.  Then I'd start checking and comparing different providers of such databases.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff, looks like I've got plenty of reading to do.

     

    Any thoughts on my new plan?

  • Also, when you have columns that have the same names except ending with a 1 or a 2, you may also want to look at normalizing the data into multiple tables, or possibly even an EAV table.  I know, Jeff, you hate that, but it may be worthwhile in this case.

     

  • Using Mongo is a possibility, and JSON can be fairly easy to use. I personally have not used MongoDB, but the product is used in development for a product we sell. There are costs involved even if you are using the free version.  It is different to use, and many of the features you expect in an RDBMS aren't there and need to be built manually, such as indexes. Be careful there.  The developers here turned MongoDB into a relational database with all the indexes they felt were needed.  Not sure if that has changed or not.

     

  • Lynn Pettis wrote:

    Also, when you have columns that have the same names except ending with a 1 or a 2, you may also want to look at normalizing the data into multiple tables, or possibly even an EAV table.  I know, Jeff, you hate that, but it may be worthwhile in this case.

    Heh... I don't hate EAVs.  They're quite useful... "It Depends".  I just hate some of the reasons people use them.

     

    --Jeff Moden



  • I would recommend reviewing the data being populated and seeing if that data really should be in VARCHAR(MAX) columns.  For example, SNOMED is a code and generally doesn't exceed 18 characters, and a MNEMONIC is generally a single-word short description or code.

    It would also be a much better design to move all of these codes/identifiers/etc... to another table.  For example, instead of separate columns in this one big table as:

    [PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_Mnemonic] [VARCHAR](MAX) NULL,
    [PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_SNOMED] [VARCHAR](MAX) NULL,
    [PracticalConcerns_ProblemsWithAlcoholAndDrugs_2] [VARCHAR](MAX) NULL,
    [PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_WithComments] [VARCHAR](MAX) NULL,
    [PracticalConcerns_MyMedication_2_Mnemonic] [VARCHAR](MAX) NULL,
    [PracticalConcerns_MyMedication_2_SNOMED] [VARCHAR](MAX) NULL,
    [PracticalConcerns_MyMedication_2] [VARCHAR](MAX) NULL,
    [PracticalConcerns_MyMedication_2_WithComments] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_RegretAboutThePast_1_Mnemonic] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_RegretAboutThePast_1_SNOMED] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_RegretAboutThePast_1] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_RegretAboutThePast_1_WithComments] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_AngerOrFrustration_1_Mnemonic] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_AngerOrFrustration_1_SNOMED] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_AngerOrFrustration_1] [VARCHAR](MAX) NULL,
    [EmotionalConcerns_AngerOrFrustration_1_WithComments] [VARCHAR](MAX) NULL,

    These types of fields all repeat and could be moved to another table named: '...Concerns'

    ConcernType      varchar(20)
    ConcernSubType varchar(50)
    Sequence int
    Mnemonic varchar(10) --size this to largest mnemonic/code being sent
    SNOMEDCode varchar(18) --SNOMED codes will not be larger than 18 characters
    WithComments varchar(max)

    Set the Sequence to a default of 1 - for those entries that don't have repeating segments.  Take the ConcernType from the first part of the column name as either: Physical, Practical, Emotional, FamilyOrRelationship, Spiritual, InformationOrSupport, etc...

    The ConcernSubType will be the second part of your naming (you probably need to identify a default value for those columns that do not include this, or use a blank or null value).

    Looking at this a bit more - it looks like you are receiving data in a single field and parsing it into the separate distinct elements.  I am guessing that the column [PhysicalConcerns] contains all of the data - either an XML or JSON document.  Then it is parsed out to get the further discrete data elements.

    If that is the case - then keep that column in the primary table - but the parsed data is separated out to the secondary table as above.

    Either way, I am guessing that a lot of these columns are not populated on every row since a patient would not have every possible concern (problem) available.  By moving to a separate table you only insert rows where there is data which will reduce the overall table usage.
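
    To illustrate, each group of wide columns would be unpivoted into a row of the proposed table. The source table name and the DocumentId key are assumptions (the original post doesn't name them), and only one of the repeating groups is shown:

    ```sql
    -- Hypothetical unpivot of one repeating group into the proposed Concerns table.
    -- dbo.BigDocumentTable and DocumentId are invented names for illustration.
    INSERT INTO dbo.Concerns
            (DocumentId, ConcernType, ConcernSubType, Sequence, Mnemonic, SNOMEDCode, WithComments)
    SELECT  s.DocumentId, 'Practical', 'ProblemsWithAlcoholAndDrugs', 2,
            s.[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_Mnemonic],
            s.[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_SNOMED],
            s.[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_WithComments]
    FROM    dbo.BigDocumentTable AS s
    WHERE   s.[PracticalConcerns_ProblemsWithAlcoholAndDrugs_2_Mnemonic] IS NOT NULL;  -- skip empty groups
    ```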

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • That leads us back to the question I asked that hasn't yet been answered... before the OP makes any "plans", the first thing that should be done is an evaluation of the columns to see if they really need to be VARCHAR(MAX).  As you've pointed out, I seriously doubt it.

    --Jeff Moden



  • Jeff Moden wrote:

    That leads us back to the question I asked that hasn't yet been answered... before the OP makes any "plans", the first thing that should be done is an evaluation of the columns to see if they really need to be VARCHAR(MAX).  As you've pointed out, I seriously doubt it.

    Funny you did not question varchar(30) for IDs

    🙂

     

    _____________
    Code for TallyGenerator

  • Hi all

     

    Apologies for missing any questions, I'll get to them now.

     

    Jeff - VARCHAR(MAX) is used as we don't know how much data is in each column.  The other consideration for using it was to try and push the data off-row (this hasn't worked in this case).

     

    Sergiy - The IDs are internal application IDs and are, in fact, 30 characters long, and we need to store them (I'm assuming you mean the DocumentID field?)

     

    Jeffrey - if only it was that simple.  The individual fields are stored and then the application combines them all together.  The documents are built in "sections" which is why the fields are named that way.  The bit before the underscore is the section, with the bit after it being the actual question.  A lot of the time, all the questions are completed which is why we need all the fields to be imported.

     

    Hopefully this covers everything, but feel free to point out anything I've missed.

     

    Richard

  • richardmgreen1 wrote:

    Sergiy - The ID's are internal application IDs and are, in fact, 30 characters long and we need to store them (I'm assuming you mean the DocumentID field?)

    Well, I cannot know the reasons behind the decision to use 30-character strings as IDs; I can only make my assumptions, but they're not gonna be flattering - ask Jeff, he knows. So, I'd better keep them to myself and share some food for thought with you.

    30 characters (non-unicode) occupy 8 times more storage space than an integer number.

    It inflates not only the storage for a single table; it applies to every table referencing DocumentID, and it also increases the memory demand for every query mentioning that identifier.

    It takes at least 10 times more CPU time (depending on the collation) to compare two varchar(30) strings than two integer numbers.

    Front-end developers tend to ignore that processing overhead, as they deal with a handful of records at most, but in databases with millions of records queried at any time such overhead makes a noticeable difference.

    Positive integers can hold more than 2 billion IDs (2 billion seconds is about 60 years). From what I see in your samples, your database won't get anywhere near this kind of capacity without a major refactoring. So, you're not gonna run out of IDs.

    If you want to store some kind of GUID, then the [uniqueidentifier] data type takes 16 bytes of storage, which is half of what's needed for varchar(30), and is 3-4 times faster to process (again, due to collation rules).

    If those ID strings are some kind of mnemonic codes, then it would be better to map them to internally generated database ID's (int) and use those ID's for processing within the database, retrieving those mnemonic codes only for presentation layer.

    Same applies to ClientID or any other varchar ID you may have in the database.

    ClientID actually adds another overhead, but with so many updateable varchar(MAX) columns that overhead is so insignificant it's not even worth discussing here.
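
    A sketch of that mapping, assuming the vendor key is the 30-character DocumentID (the table and constraint names are invented): the vendor string is stored once, and everything else joins on the integer surrogate.

    ```sql
    -- Hypothetical surrogate-key map: the vendor's varchar(30) ID stored once,
    -- with an internal INT used for joins and foreign keys everywhere else.
    CREATE TABLE dbo.DocumentIdMap
    (
        DocumentId  INT IDENTITY(1, 1) NOT NULL
            CONSTRAINT PK_DocumentIdMap PRIMARY KEY CLUSTERED,
        VendorDocId VARCHAR(30) NOT NULL
            CONSTRAINT UQ_DocumentIdMap_VendorDocId UNIQUE
    );
    ```

    The 30-character code is then retrieved only at the presentation layer, as described above.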

    _____________
    Code for TallyGenerator

  • Hi Sergiy

     

    The reasoning escapes me as well, I'd need to speak to the vendor on that one.

    All I do know is that we need to store it in that fashion to link various sections together (you'd need to speak to the people who develop the documents for that one).

     

    As I mentioned earlier, I'm attempting to write some views to bring all the bits together rather than fixed tables.

    I've written one view (it uses a table with 1.3 million rows that is quite wide).

    It took 4 and a half minutes to display all the records from the table and 5 minutes 18 seconds using the view.

     

  • Jeff Moden wrote:

    That leads us back to the question I asked that hasn't yet been answered... before the OP makes any "plans", the first thing that should be done is an evaluation of the columns to see if they really need to be VARCHAR(MAX).  As you've pointed out, I seriously doubt it.

    Yeah - and it doesn't appear that there is any initiative to actually analyze the data being sent to determine what is needed, so of course everything defaults to varchar(max).

     

    Jeffrey Williams
