"What's the Big Deal?"

Definition

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications... The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." Big data is commonly characterized along three dimensions:
  1. Volume
  2. Velocity
  3. Variety

Example

  • One Boeing 737 engine generates 20 terabytes of data an hour. Each aircraft has two of those engines, so that's 40 TB of data an hour.
  • A standard 6-hour cross-country flight therefore produces roughly 240 TB of new data.
  • Now multiply that by the number of flights flown by all commercial jetliners in the US each day.
  • For Southwest Airlines alone... Total data generated every day by Southwest Airlines’ fleet of 607 Boeing 737 aircraft:
    (20TB/ hr x 10.8hr avg operation per day) x 2 engines x 607 =
    262,224 terabytes (or 256 petabytes) PER DAY
http://avoa.com/2014/01/20/are-enterprises-prepared-for-the-data-tsunami/

2012

http://magazine.good.is/infographics/the-world-of-data-we-re-creating-on-the-internet#open

2015

http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

Increase 2012-2014

                                      2012            2014            Diff
Mobile traffic consumed / month       1.5 exabytes    2.5 exabytes    69% increase
Video uploaded to YouTube / minute    20 hrs          100 hrs         5X more
Tweets / minute                       34,722          347,222         10X more
Warning!!!

How can we deal with all this data?

Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware.

Windows Azure HDInsight provides ease of management, agility and an open Enterprise-ready Hadoop service in the cloud.

Agenda

  • Setup the Problem Space
  • Define Hadoop and its Core Technologies
  • Demo/Discuss Wordcount
  • Demonstrate Running Locally using Cloudera's Single-Node Hadoop Cluster (or the HDInsight Emulator for Microsoft Azure on the .NET Track)
  • Demo/Discuss MaxTemperature in Java / C#
  • Demo/Discuss MaxTemperature in Pig Latin
  • Demo/Discuss MaxTemperature in Hive
  • Discuss Cloud Offerings
  • Wrapup

What is Hadoop? Mike Olson, Cloudera CEO

Silicon Angle. (Jan 28, 2011). "What is Hadoop? Hadoop 101 with Mike Olson." Retrieved from http://www.youtube.com/watch?v=qNP4_ICDeqE

What is Hadoop Composed of?

HDFS: HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode plus a set of datanodes that together form the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication.

MapReduce: MapReduce is a programming model for processing large data sets, and the name of Google's implementation of the model. It is typically used for distributed computing on clusters of computers.

Writing a parallel program has proven over the years to be a very challenging task. MapReduce simplifies the process by requiring coders to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the "MapReduce system" handles marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, providing redundancy and fault tolerance, and overall management of the whole process.

HDFS Features

  • Designed to store large files
  • Fault tolerant and self-healing distributed file system
  • Stores files as large blocks (64 to 128 MB)
  • Each block stored on multiple servers
  • Data is automatically re-replicated as needed
  • Accessed from command line, Java API, C API, and... C# (see the Java sketch below)
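
The last bullet is worth making concrete. As a hedged illustration (the namenode address and file path below are hypothetical, and would normally come from core-site.xml rather than code), reading a file through the Java FileSystem API looks roughly like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode address; normally picked up from core-site.xml.
            conf.set("fs.default.name", "hdfs://localhost:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/cloudera/ncdc/input/sample.txt"); // hypothetical path

            // Open the file and stream it line by line; HDFS fetches the blocks,
            // possibly from several datanodes, transparently.
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }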

Demo

What is MapReduce?

MapReduce is a programming model for processing large data sets, and the name of Google's implementation of the model. It is typically used for distributed computing on clusters of computers.

Writing a parallel program has proven over the years to be a very challenging task. MapReduce simplifies the process by requiring coders to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the "MapReduce system" handles marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, providing redundancy and fault tolerance, and overall management of the whole process.
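
To make that division of labor concrete, here is a minimal sketch of the two user-supplied functions in the standard Hadoop Java API. It shows the classic word-count logic rather than code from this deck's demos, and the class names are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // The coder writes only these two small functions; the framework handles
    // splitting the input, scheduling tasks, shuffling data, and retrying failures.
    public class WordCountFunctions {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every token in the input line.
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // All counts for the same word arrive together; add them up.
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

Everything else (input splitting, task scheduling, the shuffle, retries) is supplied by the framework; a driver that wires such classes into a job appears later alongside the Java run commands.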

This is not a new concept (it just wasn't parallel):

     #!/usr/bin/env bash
     for year in ../input/ncdc/all/*
     do
        echo -ne `basename $year .gz`"\t"
        gunzip -c $year | \
        awk '{ temp = substr($0, 88, 5) + 0;
        q = substr($0, 93, 1);
        if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
        END { print max }'
     done
		  

MapReduce Overview

MapReduce Features

  • Fine grained Map and Reduce tasks
    • Improved load balancing
    • Faster recovery from failed tasks
  • Automatic re-execution on failure
    • In a large cluster, some nodes are always slow or flaky
    • Introduces long tails or failures in computation
    • Framework re-executes failed tasks
  • Locality optimizations
    • With big data, bandwidth to data is a problem
    • Map-Reduce + HDFS is a very effective solution
    • Map-Reduce queries HDFS for locations of input data
    • Map tasks are scheduled local to the inputs when possible
http://www.cs.wright.edu/~tkprasad/courses/cs707/ProgrammingHadoop.pdf

Putting it All Together With Hadoop's Version of Hello World

“45% of all Hadoop tutorials count words. 25% count sentences. 20% are about paragraphs. 10% are log parsers. The remainder are helpful.”
- jandersen (http://twitter.com/jandersen/)

  • Talk about how to create a local Hadoop environment using Microsoft technology
  • Copy some data into HDFS
  • Execute some code (using Visual Studio & C#)
  • Count some words!

Demo

Setting up Environments

For Java, all you need is Maven, Java (I'm using 7), and the VM from Cloudera (or any installed Hadoop environment).

For .NET, you can install the HDInsight Emulator and configure Visual Studio with some NuGet plugins. The next couple of slides talk specifically about .NET and C#.

Installing HDInsight Emulator

Setting Up Visual Studio

  • Startup Visual Studio
  • Create a Console App
  • Use NuGet to add Microsoft .NET API for Hadoop, Map Reduce API for Hadoop, & Hadoop Web Client
  • Let's See Some Code

WordCount::Map Function

  public class WordCountMapper : MapperBase
  {
      private static int one = 1;

      public override void Map(string inputLine, MapperContext context)
      {
          string[] words = inputLine.Split(' ');
          foreach (string word in words)
          {
              // skip empty tokens (and single characters) left by the split
              if (word.Length > 1)
              {
                  // emit each word on its own line with the count 1
                  context.EmitKeyValue(word, one.ToString());
              }
          }
      }
  }
		  

WordCount::Flow

WordCount::Reduce Function

    public class WordCountReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            // sum all the ones emitted for this key
            int sum = 0;
            foreach (string value in values)
            {
                sum += int.Parse(value);
            }
            context.EmitKeyValue(key, sum.ToString());
        }
    }
			  

WordCount::Flow

WordCount::Driver

	public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
	{
	    public override HadoopJobConfiguration Configure(ExecutorContext context)
	    {
	        HadoopJobConfiguration config = new HadoopJobConfiguration();
	        config.InputPath = "Input/wordcount";
	        config.OutputFolder = "Output/wordcount";
	        return config;
	    }
	}

	class Program
	{
	    static void Main(string[] args)
	    {
	        // connect to the local (or emulated) cluster and run the job
	        var hadoop = Hadoop.Connect();
	        var result = hadoop.MapReduceJob.ExecuteJob<WordCountJob>();

	        Console.In.Read();
	    }
	}
			  

Compile and Run It (Java)

		
  <dependencies>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>1.1.2</version>
      </dependency>
   </dependencies>
		  
		
mvn clean install
cd target/
hadoop jar WordCount-0.0.1-SNAPSHOT.jar com.wesleyreisz.bigdata.wordcount.v1.WordCount \
	/home/reiszwt/samples/bible/bible+shakes.nopunc /home/reiszwt/samples/bible/output
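
The WordCount class invoked above is not reproduced in these notes. A minimal Hadoop 1.x driver matching that command line would look roughly like the sketch below, which reuses the mapper and reducer sketched earlier; the real com.wesleyreisz.bigdata.wordcount.v1.WordCount may differ, and the package declaration is omitted here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: wires the mapper and reducer sketched earlier into a job,
    // taking the input and output paths from the command line as in the call above.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: WordCount <input path> <output path>");
                System.exit(-1);
            }

            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);

            job.setMapperClass(WordCountFunctions.TokenMapper.class);
            job.setCombinerClass(WordCountFunctions.SumReducer.class); // optional local pre-aggregation
            job.setReducerClass(WordCountFunctions.SumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }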
	  

Compile and Run it (C#)

"I have an idea... let's build something!"

National Climatic Data Center (NCDC) Weather Station Data

Purpose: Write a program to find the highest temperature per year across a large volume of NCDC datasets

In the process, we will:

  • Write AND TEST using sample data
  • Test against the local filesystem
  • Test against the localhost installation of hadoop

National Climatic Data Center (NCDC) Weather Station Data

0029029720999991901072213004+60450+022267FM-12+001499999V0200201N001019999999N0000001N9+03171+99999101681ADDGF101991999999999999999999			
		
http://www1.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf

POS: 88-92
AIR-TEMPERATURE-OBSERVATION air temperature
The temperature of the air.
MIN: -0932 MAX: +0618 UNITS: Degrees Celsius
SCALING FACTOR: 10
DOM: A general domain comprised of the numeric characters (0-9), a plus sign (+), and a minus sign (-).
+9999 = Missing

			# verify data
			gunzip 1901.gz
			cat 1901 | cut -c88-92 | sort
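
The Java MaxTemperature code demoed in this session is likewise not reproduced here, but its mapper is essentially fixed-width parsing of the record layout above. The sketch below follows the well-known example in Hadoop: The Definitive Guide (credited in the Shoutouts) and should be read as an approximation of the demo code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (year, temperature) for every valid reading; a reducer then takes the max per year.
    public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();

            String year = line.substring(15, 19);          // year of observation (NCDC format)

            // Air temperature lives at positions 88-92 (1-based), scaled by 10.
            int airTemperature;
            if (line.charAt(87) == '+') {                  // parseInt doesn't like a leading '+'
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }

            String quality = line.substring(92, 93);       // quality code at position 93
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }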
		

Steps

	
# show project in IntelliJ 
# discuss maven
# discuss testcases

(Java Way)
# build it & run locally
hadoop jar target/hadoop-examples.jar com.wesleyreisz.lionsAndTigersAndPig.MaxTemperatureDriver \
  --conf ../../conf/hadoop-local.xml ../../input/ncdc/all/ output
# build it & run against localhost
hadoop jar target/hadoop-examples.jar com.wesleyreisz.lionsAndTigersAndPig.MaxTemperatureDriver \
  --conf ../../conf/hadoop-localhost.xml /user/cloudera/ncdc/input /user/cloudera/ncdc/output
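
For context, the conf option on those commands is one of Hadoop's generic options, which a driver picks up by implementing Tool and running through ToolRunner. A hedged sketch of such a driver is shown below; MaxTemperatureReducer, which would simply keep the maximum reading per year, is hypothetical and not shown:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hedged driver sketch: implementing Tool is what lets Hadoop's generic options
    // (such as the conf file passed on the command lines above) be parsed for us.
    public class MaxTemperatureDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: MaxTemperatureDriver [generic options] <input> <output>");
                return -1;
            }

            Job job = new Job(getConf(), "Max temperature");
            job.setJarByClass(getClass());

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MaxTemperatureMapper.class);    // sketched earlier
            job.setReducerClass(MaxTemperatureReducer.class);  // hypothetical: keeps the max per year

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
        }
    }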

	

Steps

	
(.NET Way)
http://www.windowsazure.com/en-us/manage/services/hdinsight/submit-hadoop-jobs-programmatically/
configure core-site.xml for azure (when needed)
first make sure the old output is removed:

hadoop fs -rmr Output/ncdc

hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar ^
  -conf {path to core-site.xml} ^
  -input "Input/ncdc" ^
  -output "Output/ncdc" ^
  -mapper "MaxTemperatureMapper.exe" ^
  -file "C:\Users\Wes\projects\lionsAndTigersAndPigOhhMy\dotnet\MaxTemperatureMapper\bin\Debug\MaxTemperatureMapper.exe"
	

Demo

There Are Also Other SaaS Solutions Using Hadoop

Amazon
  • Amazon offers IaaS (EC2) plus a managed Hadoop service, Elastic MapReduce (EMR)
  • S3 can be used as a Hadoop filesystem... you can read anything loaded there as input and write results back out
Microsoft
  • Azure has HDInsight
Google
  • Click to Deploy
  • Uses Google Cloud Storage buckets as storage
  • gcloud compute ssh --zone= \ --ssh-flag="-t" --command="sudo su -l hadoop"

"What!? What happened to Pig?"

Different Ways to Write your Hadoop Queries

  • Java/C/C#
  • Pig
  • Hive

Pig's Philosophy

What does it mean to be a pig?
The Apache Pig Project has some founding principles that help pig developers decide how the system should grow over time. This page presents those principles.

Pigs Eat Anything
Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

Pigs Live Anywhere
Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but we do not intend that to be only on Hadoop.

Pigs Are Domestic Animals
Pig is designed to be easily controlled and modified by its users.

Pigs Fly
Pig processes data quickly. We want to consistently improve performance, and not implement features in ways that weigh pig down so it can't fly.

http://pig.apache.org/

Let's Take a Look at Pig

c:\Hadoop\hadoop-1.1.0-SNAPSHOT>pig
2013-11-20 21:46:25,949 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:50300
grunt>
		
records = LOAD '/user/Wes/Input/sample/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality == 1;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
		

Hive's Approach

Apache Hive
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop™, it provides:

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
  • Query execution via MapReduce
Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.

https://cwiki.apache.org/confluence/display/Hive/Home
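
Since the passage above mentions extending QL with UDFs, here is a hedged sketch of a trivial scalar UDF against the Hive API of this era; the class name and behavior are made up for illustration:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A trivial scalar UDF: Hive calls evaluate() once per row.
    public class StripPlusUDF extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            String s = input.toString();
            // strip a leading '+' (e.g. from NCDC-style signed readings)
            return new Text(s.startsWith("+") ? s.substring(1) : s);
        }
    }

It would be registered from the Hive shell with ADD JAR and CREATE TEMPORARY FUNCTION strip_plus AS 'StripPlusUDF'; before being used in a query (names again hypothetical).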

Let's Take a Look at Hive

hive
		
CREATE TABLE records (year STRING, temperature INT, quality INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH 'Input/ncdc/micro-tab/sample.txt' OVERWRITE INTO TABLE records;
SELECT year, MAX(temperature) FROM records WHERE temperature != 9999 AND quality = 1 GROUP BY year;
		

But wait... we're not quite done now, are we?

Hadoop is a massive ecosystem.



Summary

  • The Hadoop ecosystem is huge
  • It supports multiple ways to get at and work with data, depending on your needs and skills
  • It is open source!!!
  • It is based on Java, but can be used from .NET!
  • Documentation is still weak...
  • There is a multitude of options, from local machines to IaaS and SaaS offerings
  • The market potential is huge with IoT and Big Data... worth knowing the basics

Shoutouts

Hadoop: The Definitive Guide, 3rd Edition
Storage and Analysis at Internet Scale
By Tom White
Publisher: O'Reilly Media / Yahoo Press
Released: May 2012

*NCDC Samples came directly from this book
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white_paper_c11-520862.html
http://www.cloudera.com
Provided the VM I used
Provided the installer for the cloud-deployed cluster
Wonderful help via the Google Groups forums @ https://groups.google.com/a/cloudera.org/groups/dir
Special Shoutout to Sandy Ryza of Cloudera for helping me with an issue I had with the CDH VM! Thank you!!!

Thank You!

Wesley Reisz