
A Little Spark

What about Spark?

Many times you hear a lot about some new application or service that has been added to the Big Data ecosystem.  Some are all hype, and others are really awesome but take a little thinking about how you or your enterprise could use them.  One of the awesome tools is Apache Spark.  Apache Spark is a cluster computing platform designed to be fast and general purpose.  Spark is general because it can run different types of workloads, and fast because it prefers to do its work in-memory.  I say prefers because you can persist data to disk, and if there is not enough memory to complete a job, Spark will spill to disk.  Spark also has APIs in SQL, Python, Scala, and Java.  I have been using the Python shell (pyspark) to interact with it on my new HDP 2.3 Hadoop cluster hosted by BitRefinery.

Spark, like most Apache projects I have worked with, is a stack.  Spark Core is responsible for basic functionality like memory management and task scheduling.  Spark SQL works with structured data, including SQL and the Hive variant of SQL (HiveQL).  Spark Streaming enables processing of streaming data such as log files.  There are other portions of the Spark stack, and I encourage you to read about each layer to become familiar with it.
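
To give you a feel for one of those layers, here is a minimal Spark SQL sketch you can paste into the pyspark shell.  It assumes the shell has already created the SparkContext as sc, and the names and data are made up purely for illustration.

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)  # sc is the SparkContext the pyspark shell creates for you

    # A tiny, made-up data set turned into a DataFrame
    people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=27)])
    df = sqlContext.createDataFrame(people)

    df.registerTempTable("people")  # expose the DataFrame to SQL
    sqlContext.sql("SELECT name FROM people WHERE age > 30").show()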

Spark Example

The easiest way to see how something works is to try it out. I am going to walk you through creating your first RDD (Resilient Distributed Dataset).  An RDD is a fault-tolerant, parallel data structure that lets users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators. In my example, we are going to load data into an RDD from a README file in the Spark directory on the Linux file system and execute some functions on it.  This example is based on the Apache Spark quick start guide.  My screenshots may look different than yours because I am running mine on an HDP cluster, but the steps should work.  (I have also collected all of the commands into a single script at the end of the walkthrough.)
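
Before the walkthrough, here is a tiny illustration of those three RDD properties (persistence, partitioning, and operators) in the pyspark shell.  The numbers are made up, and sc is the SparkContext the shell creates for you.

    nums = sc.parallelize(range(1, 101), 4)    # explicitly ask for 4 partitions
    squares = nums.map(lambda x: x * x)        # transformations lazily build a new RDD
    squares.cache()                            # ask Spark to keep the result in memory
    print(squares.getNumPartitions())          # 4
    print(squares.reduce(lambda a, b: a + b))  # 338350, computed here and then kept cached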

  1. Log into a data node that is running Spark and on the command line type pyspark. Your Spark shell should open, and your screen should look like the one below.

[Screenshot 1]

  2. Create an RDD from the README.md file contained in the Spark folder path. Type lines = sc.textFile("/FilePath/README.md"). Note: in my example screenshot I did not define lines correctly at first, but the error message pointed out the issue.

[Screenshot 2]

  3. Verify that the RDD was created successfully by counting the lines in it. Type lines.count(). You should get a result of 98.

[Screenshot 3]

  4. Filter the lines that contain "Python" into another RDD. Type pythonLines = lines.filter(lambda line: "Python" in line), then pythonLines.count() to see how many lines matched. Your result should be 3.

[Screenshot 4]

  5. Find the first line in the pythonLines RDD. Type pythonLines.first().

[Screenshot 5]

  6. Display the contents of pythonLines. Type pythonLines.collect().

[Screenshot 6]
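
If you would rather run the whole walkthrough at once instead of typing each line into the shell, here is the same sequence as a standalone script you could submit with spark-submit.  This is just a sketch: the README path is the placeholder from step 2, the application name is made up, and you should point the path at the README.md in your own Spark directory.

    from pyspark import SparkContext

    sc = SparkContext(appName="ReadmeWalkthrough")             # made-up application name

    lines = sc.textFile("/FilePath/README.md")                 # step 2: load the file into an RDD
    print(lines.count())                                       # step 3: 98 lines in my copy
    pythonLines = lines.filter(lambda line: "Python" in line)  # step 4: keep lines mentioning Python
    print(pythonLines.count())                                 # 3 matching lines
    print(pythonLines.first())                                 # step 5: first matching line
    print(pythonLines.collect())                               # step 6: all matches as a Python list

    sc.stop()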

Done!!

There is much more we can do with Spark, and big companies such as Yahoo, Google, Amazon, eBay, and others use it to examine data in a very fast and efficient way. This was just a simple introduction, but realize that we moved a file into memory, counted the number of lines in it, read the first line, and output the lines in our RDD. This was all done in a matter of seconds. Granted, the file was small, but imagine how you can use it to analyze something such as log data from any source in real time, or anything else that you can dream up.
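
To make that log idea a little more concrete, here is a rough Spark Streaming sketch that counts error lines arriving over a socket.  The host, port, and "ERROR" filter are all made-up examples, not something from my cluster.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="LogErrorCount")
    ssc = StreamingContext(sc, 10)                     # process the stream in 10-second batches

    logs = ssc.socketTextStream("localhost", 9999)     # hypothetical source feeding log lines
    errors = logs.filter(lambda line: "ERROR" in line)
    errors.count().pprint()                            # print a running count for each batch

    ssc.start()
    ssc.awaitTermination()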

Thank you for reading and have an awesome day!!