
Hadoop and SQL Server Integration – Part 1

For the last few months I have been learning about Hadoop using the Hortonworks Data Platform (HDP) Sandbox and thinking about use cases for integrating it with a client’s current SQL Server environment to access and process “Big Data” and deliver results.
To begin, I had to find out what Big Data is. According to Gartner, Big Data is defined as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Examples of Big Data found in an enterprise include streams of data generated by reports, logs, video footage, and so on. Big Data can also result from monitoring customer behavior (web clicks), scientific investigations (the Human Genome Project) or other data sources.
My next question was: what is Hadoop? Wikipedia defines Hadoop as an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. It uses a distributed file system (HDFS) that stores data on commodity (that is, inexpensive) hardware and provides high aggregate bandwidth across the Hadoop cluster. The Hadoop framework itself is mostly written in Java, with some native code in C and command-line utilities written as shell scripts.
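To make HDFS a little more concrete, here is a minimal sketch, in Java since that is Hadoop’s native language, of copying a local file into HDFS using the org.apache.hadoop.fs.FileSystem client API. The NameNode address and the file paths are placeholder assumptions for an HDP Sandbox-style setup, not values from any particular cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; host and port are
        // placeholders for whatever your sandbox or cluster uses.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS, where it is split into blocks
        // and replicated across the commodity nodes of the cluster.
        fs.copyFromLocalFile(new Path("/tmp/weblogs.csv"),
                             new Path("/user/hue/raw/weblogs.csv"));

        fs.close();
    }
}

Once the file lands in HDFS it is replicated (three copies by default) so that the loss of any single commodity node does not lose data.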
The technology that was to become Hadoop originated at Google, which built it to help index the Web. Google published papers describing its approach, those ideas were implemented in the open-source Nutch project, and Hadoop was then spun off from Nutch with heavy contributions from Yahoo.
The next question I needed to answer was how Hadoop is different from Microsoft SQL Server. The answer lies in the very nature of Big Data, commonly described as the four V’s: volume (amount of data), velocity (speed of data in and out), variety (range of data types and sources) and veracity (uncertainty of data). On top of all this, much of the data is unstructured, which means it does not fit well into structured tables. Those of us who have worked as SQL DBAs and developers on relational database management systems were taught to create the database, schemas and tables before loading any data. That schema-driven approach can be very time-intensive and often frustrating because of the unknowns. Hadoop specializes in bringing the computing power to the data: it lets data professionals extract terabytes of data from disparate source systems, load it onto cheap disks across a large number of commodity servers for initial analysis, and produce results in less time than traditional methods, then transform the results for consumption by end users with self-service BI tools like Power View or export them into structured objects. The sketch below shows what “bringing the computation to the data” looks like in practice.
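Here is the classic MapReduce word-count job in Java. Each mapper runs on the node that holds its block of input and emits (word, 1) pairs, and the reducers sum the counts per word; the computation travels to the data rather than the other way around. Input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs locally on each data node, tokenizes its
    // split of the input and emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives all counts for a given word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that no schema is declared anywhere: the job reads raw text files as they sit in HDFS, which is the schema-on-read approach described above.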

Hadoop is actually the name for a package of different components that work together to give users a flexible platform. Some of the main parts of Hadoop are HDFS, HCatalog, MapReduce, HBase, Hive, Pig, ZooKeeper and Ambari.

[Figure: the Hadoop component stack]
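Of these components, Hive is the natural on-ramp for SQL Server professionals, since it projects a table schema onto raw HDFS files and accepts SQL-like queries. As a rough sketch, the following Java snippet connects to HiveServer2 over JDBC; the host, port, user, table definition and file location are all illustrative assumptions for a sandbox-style environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host, port and database are placeholders.
        String url = "jdbc:hive2://sandbox.hortonworks.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hue", "");
             Statement stmt = conn.createStatement()) {

            // Project a schema onto raw files already in HDFS at read
            // time ("schema on read") -- the files themselves are untouched.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS weblogs "
                       + "(ip STRING, ts STRING, url STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                       + "LOCATION '/user/hue/raw'");

            // Query the raw data with familiar SQL syntax.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM weblogs GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

Results coming back through JDBC like this can then be landed in SQL Server or consumed directly by BI tools.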

One of the other differentiators between Hadoop and a traditional RDBMS is price. Hadoop is free to install, and because it is designed to bring the processing to the data, an organization can invest in commodity servers that are cheaper per unit and quickly prove that the technology is viable. This also frees an organization to invest more in its staff and training.
Hadoop is a relatively new technology, so experts and technical staff with deep experience are still scarce. Companies either have to train current staff on smaller projects to build their skill level or hire a consulting firm. Hadoop and other NoSQL tools are not a replacement for schema-based databases such as SQL Server, Oracle or Access; they are tools for a specific purpose, not a one-size-fits-all solution.
Over the next couple of months I am going to be blogging about the different parts of Hadoop and how to integrate them with SQL Server. My goal is to build an understanding of how to work with Hadoop and different data sets and to share that information with readers.