Data Lake On-Premises and Distributed Computing

  • Recently, our senior management attended a seminar on Data Lakes at an event. They came back very excited to implement one, for the following reasons:

    1. To explore open source technologies and parallel computing. We are planning to explore Apache Hadoop and related tools such as Hive, Sqoop, Ambari and ZooKeeper, and we shall also explore Apache Spark (see the sketch after this list).
    2. Reduce the time taken by our batch activities as far as possible. Currently we use SQL Server 2016 with SSIS, and some processes run for up to 5 hours, which we want to bring down to about 10 minutes (roughly a 97% reduction). I know that's a very aggressive expectation, but it seems achievable with the parallel computing offered by a Hadoop cluster.
    3. We have too many SQL Server production instances, and the number keeps growing. We wish to save on SQL Server licensing costs by replacing the instances that are not referenced by any application and are used only for batch processing.
    4. We want to consolidate the multiple batch processes that run across multiple servers and SQL Server instances into a data lake, which can then be used for reporting and data analytics.
    5. We want common storage for all raw and processed data, for better control.
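
    For illustration, here is a minimal sketch of what items 1, 4 and 5 might look like in practice: a Hive external table defined over raw files landed in HDFS, plus the kind of aggregation an SSIS batch might perform today. All paths, table names and columns below are hypothetical:

      -- Hive: external table over raw files already landed in HDFS
      -- (all paths and names are hypothetical)
      CREATE EXTERNAL TABLE sales_raw (
          sale_id   INT,
          sale_date STRING,
          amount    DECIMAL(18,2)
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/datalake/raw/sales/';

      -- The same aggregation an SSIS batch might do today, executed
      -- in parallel across the cluster as a MapReduce/Tez/Spark job
      SELECT sale_date, SUM(amount) AS total_amount
      FROM sales_raw
      GROUP BY sale_date;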

    A few important facts worth sharing:

    1. We do not want to use the cloud due to regulatory as well as data security constraints.
    2. None of the team members assigned to the POC are experienced in Hadoop. All of us are SQL Server people with programming experience in other technologies such as .NET, C#, VB, etc. However, the leader who initiated the project is experienced in Hadoop.

    This is a very interesting topic, but I'm not experienced with either data lakes or Hadoop, so my fingers are crossed. Any feedback on whether we are headed in the right direction, and if so, on the correct approach to take, would be really appreciated.

  • Any suggestions would be much appreciated!

  • I'm hoping to get some useful suggestions on this topic.

  • Hi,

    Bit late to the party, so hopefully you'll still read this.

    Using a document store or some form of data lake alongside an RDBMS is a good step. I've been using a managed data lake solution, Azure Data Lake Store with Azure Data Lake Analytics, within Azure. Choosing this option means I don't have to manage the full Hadoop ecosystem (Sqoop, ZooKeeper, etc.), and I can fully integrate Active Directory across the stack, from the data lake -> data warehouse -> data mart.

    Leveraging the data lake gives me an accessible location to store all my raw data, which end users can query in raw format BEFORE it's processed by whatever ETL or ELT you may use. It also allows me to offload, as you are looking to do, much of that processing into another system, whether for real-time loading or batch processing of data. That means fewer stored procedures in SQL Server and, most importantly, less work.

    If SQL Server were a wheelbarrow, it would be constantly carrying all the rocks. Leveraging a data lake means we have more than one wheelbarrow helping distribute those rocks, so SQL Server can focus on more important things, like serving and securing the data.

    So yes, it's good thinking. Again, though, I would leverage managed services rather than trying to manage Hadoop yourself. Newer versions of SQL Server and Azure SQL Data Warehouse now ship with PolyBase, which plays much the same role for SQL Server that Hive does for Hadoop. You can connect it to Azure Data Lake Store or S3-like buckets such as Blob storage. It's really nice.
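
    To make that concrete, here is a minimal PolyBase sketch over Azure Blob storage, in the spirit of what's described above. All names (credential, storage account, container, table, columns) are hypothetical, and the storage key is a placeholder:

      -- Assumes a database master key already exists in this database
      CREATE DATABASE SCOPED CREDENTIAL LakeCredential
      WITH IDENTITY = 'blob_user', SECRET = '<storage-account-key>';

      -- Point PolyBase at a hypothetical blob container holding raw files
      CREATE EXTERNAL DATA SOURCE LakeStorage
      WITH (
          TYPE = HADOOP,
          LOCATION = 'wasbs://rawdata@mystorageaccount.blob.core.windows.net',
          CREDENTIAL = LakeCredential
      );

      -- Describe the flat-file layout of the raw files
      CREATE EXTERNAL FILE FORMAT CsvFormat
      WITH (
          FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
      );

      -- External table over the raw files; no data is copied into SQL Server
      CREATE EXTERNAL TABLE dbo.SalesRaw (
          sale_id   INT,
          sale_date DATE,
          amount    DECIMAL(18, 2)
      )
      WITH (
          DATA_SOURCE = LakeStorage,
          LOCATION    = '/sales/',
          FILE_FORMAT = CsvFormat
      );

      -- Queried with plain T-SQL, and joinable to local tables
      SELECT sale_date, SUM(amount) AS total_amount
      FROM dbo.SalesRaw
      GROUP BY sale_date;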
