I am working with a client to install and test Hadoop. They are looking to use it to off-load some of their resource intensive ETL work off of the Microsoft SQL Server platform and data archive. There are many reasons for this move but a strong one is cost. Hadoop is designed to run on low-cost commodity servers so are planning to use servers rotating out of other areas within the company to build our physical cluster. Hadoop replicates data across the data nodes in the cluster so HA is built-in. The POC is using Cloudera Hadoop Distribution 5.1.0, Cloudera Manager 5.1.1, and Impala 1.4 installed on a 15 Node cluster in a VM environment. Each VM has 32 GB of memory, 8 cpu cores, and 1 TB of space. The VM setup is to learn how Hadoop mechanically works and to gain Admin experience and not for Performance Testing.
At this point, we have begun testing oozie jobs running Sqoop commands to import data from SQL Tables into the Hive Metastore and Impala for transformations. Our testing has not been free of issues but Cloudera’s support has been very responsive.
I will keep you updated!!
Written by Montrial Harrell