New Database Options

Question

New Database Options

Steve Jones - SSC Editor

SSC Guru

Points: 734449
More actions
November 15, 2018 at 9:18 pm

#380874

Comments posted to this topic are about the item New Database Options

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 1

Perhaps a bit deeper than just a database option, I'd like to see some table options that would give NoSQL a run for their money when it comes to ETL and "Minimal Logging".

There are times where not only don't I care about such things as delayed durability, I also don't care at all about being able to "roll back" a transaction". Those times are when I'm importing files or copying certain data from one database to another and then updating the staging tables that I've loaded them into.

For example, we load a shedload of data from files every day. We load the files into staging tables (which are easily made to be minimally logged) and then we update those staging tables with other data including certain condition flags that identify the validity of the data, etc. Even in the SIMPLE Recovery Model, nothing about updates is minimally logged. This is ALL highly expendable data until we copy that newly validated data into the final permanent tables and yet we have to tolerate all of the problems associated with fully logged UPDATEs, SIMPLE Recovery Model or not. Those problems include but aren't limited to the extra amount of time it takes to write to the log file and the amount of space the log file requires.

MS made it so you can achieve minimal logging for inserts even with a Clustered Index in place and even if the table isn't empty. What I'd like to see them do (which would really help import and validation processes in expendable staging tables) is to make it so you can actually assign an option to a table saying this is a totally non-logged table and still be able to have CI, NCIs, and DRI (still super important for these tables) in place. I'd like it at the table level because then they could also make a modification to backups that would simply not include such tables instead of screwing up Point-In-Time restores because you happen to have even 1 byte of data stored in "Minimally Logged" fashion.

Yes... you CAN create a separate database for such things but UPDATEs (and the much lesser used DELETE) are still not minimally logged even in the Simple Recovery Model and, even with the use of Trace Flag 610 (which is built into 2016 so you don't have to set it), even Inserts into a table that has NCIs in place is NOT guaranteed in any way, shape or fashion.

Perhaps an even easier option would be the ability to create a File Group that you could tag as a NOT LOGGED File Group that could still use DRI to LOGGED file groups and would come up as a PreAllocated but empty space after a restart. It would be like a TempDB for each database but with no logging whatsoever. It would be the ultimate "Sandbox" for ETL and report table generation purposes.

Yep... I get around all of this with some T-SQL prestidigitation but life would be a whole lot easier (it's currently a royal PITA even with some helper procs in place) if you could assign a guaranteed minimally logged option to tables and actually have backups ignore such tables. And, no... Posh isn't the right tool for this... it's way to slow and it shouldn't be necessary anyway. Same goes for things like SSIS and other "applications").

It would also be nice if MS would make it easy to generate the code to build tables (even if limited to "just" non-partitioned monoliths) and to separately be able to generate the code for constraints and indexes in an easy "here's the name, go build the code for it" fashion. They do it in SSMS... they should be able to do it quite easily in T-SQL.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eric M Russell SSC Guru Points: 125522 More actions · Answer 2

Jeff Moden - Friday, November 16, 2018 6:28 AM
Perhaps a bit deeper than just a database option, I'd like to see some table options that would give NoSQL a run for their money when it comes to ETL and "Minimal Logging".
...

Jeff, I have considered this 100 times in the past, and I was thinking this again just yesterday when an ETL job ingesting 200 million rows filled the log drive. SQL Server (and most any RDMS) is an excellent tool for querying, transforming, and modeling relational data, but it's mediocre at aggregate querying and simply terrible at bulk loading data. What I've done in the past was select / stage into flat text files, perform column transformation, analysis, and QA on the text files, and then finally batch load prepped and cleansed data from there into SQL Server tables. On those occasions where a table load went wrong and needed to be "rolled back", I would delete the load, in batches of 10,000 rows at a time using ETL RunID. I was ingesting 100s millions of rows on a low end 2 CPU / 16 GB RAM server back in the early 2000s with little DBA support, and it worked because I was circumventing most of the transaction logging during normal daily usage.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 3

Eric M Russell - Friday, November 16, 2018 8:13 AM
Jeff Moden - Friday, November 16, 2018 6:28 AM
Perhaps a bit deeper than just a database option, I'd like to see some table options that would give NoSQL a run for their money when it comes to ETL and "Minimal Logging".
...
Jeff, I have considered this 100 times in the past, and I was thinking this again just yesterday when an ETL job ingesting 200 million rows filled the log drive. SQL Server (and most any RDMS) is an excellent tool for querying, transforming, and modeling relational data, but it's mediocre at aggregate querying and simply terrible at bulk loading data. What I've done in the past was select / stage into flat text files, perform column transformation, analysis, and QA on the text files, and then finally batch load prepped and cleansed data from there into SQL Server tables. On those occasions where a table load went wrong and needed to be "rolled back", I would delete the load, in batches of 10,000 rows at a time using ETL RunID. I was ingesting 100s millions of rows on a low end 2 CPU / 16 GB RAM server back in the early 2000s with little DBA support, and it worked because I was circumventing most of the transaction logging during normal daily usage.

Ditto that and that's exactly what I'm talking about. I've rather grown to like the fact that I don't have to write any loops in SQL Server (SELECT is a loop behind the scenes) and it makes life easy except for all the bloody logging it does for updates or you happen to miss the mark for minimal logging on an insert. The work arounds do work like you say but, lordy, it could be so much easier.

That and fix BULK INSERT/BCP to automagically handle true CSV and create a BULK EXPORT. Then we'd be rockin'. "D

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 4

I personally never cared about what minimal logging exists when loading billions of rows from CSV files in SQL Server 2008. I mean, it's large data already and having a large transaction log is going to happen. I too never cared about rolling anything back. The CSV file is the raw file, we just reprocess it in case of fire and the transaction log just gets replaced regardless along with ensuring there is a bulk load of space to fill up. Just the nature of the game. Not a huge deal.

When it comes to NoSQL style tables etc. You have Polybase. You have a window into the NoSQL data layer or NoSQL-Like data layers (i.e.: data buckets). That's all you really need. You can create external tables that you can just read into a physical table. Most of your transactional updates can happen outside of SQL Server where all you need to focus on is the INSERT and NEVER THE UPDATE. At least, that has been my approach most recently. Not going to update the data warehouse. I'm just going to INSERT with at most DROP and REPLACE with fresh new data that has been built elsewhere. That way SQL Server can focus on serving the data and providing a easy layer for analyzing the data with the magic of T-SQL, which is a good language that is extremely powerful outside of coding Python or learning other complicated languages to query.

Some things I would love to see though would be more machine learning integration as well service integration across the cloud based products. Some products have more advances in these than others. Would love to see them all come up to the same level so it's not a constant battle of switching to whatever one has more of that integration than the other. Like for example, would love Azure DB to have the same integration of ML as that of SQL Server 2016. Be happy to pay extra for it as long as we can further enhance what exists etc.

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 5

xsevensinzx - Friday, November 16, 2018 8:39 PM
You have Polybase...
...{snip}...
where all you need to focus on is the INSERT and NEVER THE UPDATE.

Now THAT's an interesting idea! I'm going to have to look into that and you should write an article on it. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

HighPlainsDBA SSCrazy Points: 2611 More actions · Answer 6

I think I would like to see each database have it's own workspace inside the database object and only have truly global data in tempdb. At least I don't understand why they built it that way.

xsevensinzx One Orange Chip Points: 25560 More actions · Answer 7

Jeff Moden - Saturday, November 17, 2018 12:59 PM
xsevensinzx - Friday, November 16, 2018 8:39 PM
You have Polybase...
...{snip}...
where all you need to focus on is the INSERT and NEVER THE UPDATE.
Now THAT's an interesting idea! I'm going to have to look into that and you should write an article on it. 😀

Always lack the time to do so from a technical standpoint others would want. Maybe I should just write something less technical in terms of syntax and just talk about it to start. I can certainly try.

But yes, the idea here is something I do a lot here and influence from Kimball's books when he talks a lot about doing as much of the transformation on disk as possible. If you treat your data lake storage or document store as your physical hard disks, then the idea is the same. Do as much in the document store as possible and then find a way to link the final output to your data warehouse. In the case of SQL Server, that link or window as Microsoft calls it, is Polybase. It's an amazing piece of technology that many are not really utilizing to it's fullest. You can basically have the power of hundreds of computers behind your SMP system. Then of course, reduce the amount of UPDATES and massive CPU sucks your warehouse is doing.

This is why I had said in the past, I will likely always use document stores (NoSQL) with all my future projects. It's just too damn powerful not to use it.

Steve Jones - SSC Editor SSC Guru Points: 734449 More actions · Answer 8

HighPlainsDBA - Tuesday, November 20, 2018 4:19 PM
I think I would like to see each database have it's own workspace inside the database object and only have truly global data in tempdb. At least I don't understand why they built it that way.

I so want this