Recently, our senior management attended a seminar on Data Lakes at an industry event. They came back very excited to implement one, for the following reasons:
- To explore open-source technologies and parallel computing. We plan to evaluate Apache Hadoop and related tools such as Hive, Sqoop, Ambari, and ZooKeeper, and also Apache Spark.
- To drastically reduce the time taken by our batch activities. Currently we use SQL Server 2016 with SSIS, and some processes run for up to 5 hours; we want to bring that down to about 10 minutes. I know that's a very aggressive target, but it seems achievable with the parallel computing offered by a Hadoop cluster.
- To save on SQL Server licensing costs. We have too many SQL Server production instances, and the number keeps growing. We wish to retire the instances that are not referenced by any application and are used only for batch processing.
- To consolidate the batch processes that currently run across multiple servers and SQL Server instances into a single data lake, which could then also serve reporting and data analytics.
- To have common storage for all raw and processed data, for better control.
A few important facts worth sharing:
- We cannot use the cloud, due to regulatory and data-security constraints.
- None of the team members assigned to the POC are experienced in Hadoop. We are all SQL Server people, with programming experience in other technologies such as .NET, C#, and VB. However, the leader who initiated the project is experienced in Hadoop.
This is a very interesting topic, but I have no experience with either data lakes or Hadoop, so my fingers are crossed. Any feedback on whether we are heading in the right direction, and if so, what the correct approach would be, would be really appreciated.