Data Lake On-Premises and Distributed Computing

  • Recently, our senior management attended a seminar on Data Lakes at an event. They came back very excited to implement one, for the following reasons:

    1. To explore open source technologies and parallel computing. We are planning to explore Apache Hadoop and related tools such as Hive, Sqoop, Ambari and ZooKeeper, and we shall also explore Apache Spark (see the sketch after this list).
    2. Reduce the time taken by our batch activities as far as possible. Currently we use SQL Server 2016 with SSIS, and some processes run for up to 5 hours, which we want to bring down to about 10 minutes (roughly a 97% reduction). I know that's a very aggressive expectation, but it seems achievable with the parallel computing offered by a Hadoop cluster.
    3. We have too many SQL Server production instances, and the number keeps growing. We wish to save on SQL Server licensing costs by replacing the instances that are not referenced by any application and are used only for batch processing.
    4. We want to consolidate the multiple batch processes that run across multiple servers and SQL Server instances into a data lake, which can then be used for reporting and data analytics.
    5. We want common storage for all raw and processed data, for better control.
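
    For illustration, here is a minimal sketch of what items 1, 4 and 5 might look like in practice: a Hive external table defined over raw files landed in HDFS, plus the kind of aggregation an SSIS batch might perform today. All paths, table names and columns below are hypothetical:

      -- Hive: external table over raw files already landed in HDFS
      -- (all paths and names are hypothetical)
      CREATE EXTERNAL TABLE sales_raw (
          sale_id   INT,
          sale_date STRING,
          amount    DECIMAL(18,2)
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/datalake/raw/sales/';

      -- The same aggregation an SSIS batch might do today, executed
      -- in parallel across the cluster as a MapReduce/Tez/Spark job
      SELECT sale_date, SUM(amount) AS total_amount
      FROM sales_raw
      GROUP BY sale_date;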

    A few important facts worth sharing:

    1. We do not want to use the cloud due to regulatory as well as data security constraints.
    2. None of the team members assigned to the POC are experienced in Hadoop. All of us are SQL Server people with programming experience in other technologies such as .NET, C#, VB, etc. However, the leader who initiated the project is experienced in Hadoop.

    This is a very interesting topic, but I'm not experienced with either data lakes or Hadoop, so my fingers are crossed. Any feedback on whether we are headed in the right direction, and if so, on the correct approach to take, would be really appreciated.

  • Any suggestions would be much appreciated!

  • I'm hoping to get some useful suggestions on this topic.

  • Hi,

    Bit late to the party, so hopefully you'll still read this.

    Using a document store or some form of data lake alongside an RDBMS is a good step. I've been using a managed data lake solution, Azure Data Lake Store with Azure Data Lake Analytics, within Azure. Choosing this option means I don't have to manage the full Hadoop ecosystem (Sqoop, ZooKeeper, etc.), and I can fully integrate Active Directory across the stack, from the data lake -> data warehouse -> data mart.

    Leveraging the data lake gives me an accessible location to store all my raw data, which end users can query in raw format BEFORE it's processed by whatever ETL or ELT you may use. It also allows me to offload, as you are looking to do, much of that processing into another system, whether for real-time loading or batch processing of data. That means fewer stored procedures in SQL Server and, most importantly, less work.

    If SQL Server were a wheelbarrow, it would be constantly carrying all the rocks. Leveraging a data lake means we have more than one wheelbarrow helping distribute those rocks, so SQL Server can focus on more important things, like serving and securing the data.

    So yes, it's good thinking. Again, though, I would leverage managed services rather than trying to manage Hadoop yourself. Newer versions of SQL Server and Azure SQL Data Warehouse now ship with PolyBase, which plays much the same role for SQL Server that Hive does for Hadoop. You can connect it to Azure Data Lake Store or S3-like buckets such as Blob storage. It's really nice.
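
    To make that concrete, here is a minimal PolyBase sketch over Azure Blob storage, in the spirit of what's described above. All names (credential, storage account, container, table, columns) are hypothetical, and the storage key is a placeholder:

      -- Assumes a database master key already exists in this database
      CREATE DATABASE SCOPED CREDENTIAL LakeCredential
      WITH IDENTITY = 'blob_user', SECRET = '<storage-account-key>';

      -- Point PolyBase at a hypothetical blob container holding raw files
      CREATE EXTERNAL DATA SOURCE LakeStorage
      WITH (
          TYPE = HADOOP,
          LOCATION = 'wasbs://rawdata@mystorageaccount.blob.core.windows.net',
          CREDENTIAL = LakeCredential
      );

      -- Describe the flat-file layout of the raw files
      CREATE EXTERNAL FILE FORMAT CsvFormat
      WITH (
          FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
      );

      -- External table over the raw files; no data is copied into SQL Server
      CREATE EXTERNAL TABLE dbo.SalesRaw (
          sale_id   INT,
          sale_date DATE,
          amount    DECIMAL(18, 2)
      )
      WITH (
          DATA_SOURCE = LakeStorage,
          LOCATION    = '/sales/',
          FILE_FORMAT = CsvFormat
      );

      -- Queried with plain T-SQL, and joinable to local tables
      SELECT sale_date, SUM(amount) AS total_amount
      FROM dbo.SalesRaw
      GROUP BY sale_date;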
