"What's the Big Deal?"

Definition

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications... The trend toward larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." Big data is commonly characterized along three dimensions:
  1. Volume
  2. Velocity
  3. Variety

Example

  • One Boeing 737 engine generates 20 terabytes of data an hour. Each aircraft has two of those engines, so that's 40 TB of data an hour.
  • A standard 6-hour cross-country flight therefore produces roughly 240 TB of new data.
  • Now multiply that by the number of flights flown by all commercial jetliners in the US each day.
  • For Southwest Airlines alone... Total data generated every day by Southwest Airlines’ fleet of 607 Boeing 737 aircraft:
    (20TB/ hr x 10.8hr avg operation per day) x 2 engines x 607 =
    262,224 terabytes (or 256 petabytes) PER DAY
http://avoa.com/2014/01/20/are-enterprises-prepared-for-the-data-tsunami/

2012

http://magazine.good.is/infographics/the-world-of-data-we-re-creating-on-the-internet#open

2015

http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

Increase 2012-2014

                                      2012            2014            Diff
Mobile traffic consumed / month       1.5 exabytes    2.5 exabytes    69% increase
Video uploaded to YouTube / minute    20 hrs          100 hrs         5X more
Tweets / minute                       34,722          347,222         10X more
Warning!!!

How can we deal with all this data?

Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware.

Windows Azure HDInsight provides ease of management, agility and an open Enterprise-ready Hadoop service in the cloud.

Agenda

  • Setup the Problem Space
  • Define Hadoop and its Core Technologies
  • Demo/Discuss Wordcount
  • Demonstrate Running Locally using Cloudera's Single-Node Hadoop Cluster (or the HDInsight Emulator for Microsoft Azure on the .NET Track)
  • Demo/Discuss MaxTemperature in Java / C#
  • Demo/Discuss MaxTemperature in Pig Latin
  • Demo/Discuss MaxTemperature in Hive
  • Discuss Cloud Offerings
  • Wrapup

What is Hadoop? Mike Olson, Cloudera CEO

Silicon Angle. (Jan 28, 2011). "What is Hadoop? Hadoop 101 with Mike Olson." Retrieved from http://www.youtube.com/watch?v=qNP4_ICDeqE

What is Hadoop Composed of?

HDFS: HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster typically has a single namenode plus a set of datanodes that together form the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication.

MapReduce: MapReduce is a programming model for processing large data sets, and the name of Google's implementation of the model. It is typically used for distributed computing on clusters of computers.

Writing a parallel program has proven over the years to be a very challenging task. MapReduce simplifies the process by requiring coders to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the "MapReduce system" handles marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, providing redundancy and fault tolerance, and overall management of the whole process.

HDFS Features

  • Designed to store large files
  • Fault tolerant and self-healing distributed file system
  • Stores files as large blocks (64 to 128 MB)
  • Each block stored on multiple servers
  • Data is automatically re-replicated as needed
  • Accessed from command line, Java API, C API, and... C# (see the Java sketch below)
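
The last bullet is worth making concrete. As a hedged illustration (the namenode address and file path below are hypothetical, and would normally come from core-site.xml rather than code), reading a file through the Java FileSystem API looks roughly like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode address; normally picked up from core-site.xml.
            conf.set("fs.default.name", "hdfs://localhost:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/cloudera/ncdc/input/sample.txt"); // hypothetical path

            // Open the file and stream it line by line; HDFS fetches the blocks,
            // possibly from several datanodes, transparently.
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }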

Demo

What is MapReduce?

MapReduce is a programming model for processing large data sets, and the name of Google's implementation of the model. It is typically used for distributed computing on clusters of computers.

Writing a parallel program has proven over the years to be a very challenging task. MapReduce simplifies the process by requiring coders to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the "MapReduce system" handles marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, providing redundancy and fault tolerance, and overall management of the whole process.
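
To make that division of labor concrete, here is a minimal sketch of the two user-supplied functions in the standard Hadoop Java API. It shows the classic word-count logic rather than code from this deck's demos, and the class names are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // The coder writes only these two small functions; the framework handles
    // splitting the input, scheduling tasks, shuffling data, and retrying failures.
    public class WordCountFunctions {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every token in the input line.
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // All counts for the same word arrive together; add them up.
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

Everything else (input splitting, task scheduling, the shuffle, retries) is supplied by the framework; a driver that wires such classes into a job appears later alongside the Java run commands.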

This is not a new concept (it just wasn't parallel):

     #!/usr/bin/env bash
     for year in ../input/ncdc/all/*
     do
        echo -ne `basename $year .gz`"\t"
        gunzip -c $year | \
        awk '{ temp = substr($0, 88, 5) + 0;
        q = substr($0, 93, 1);
        if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
        END { print max }'
     done
		  

MapReduce Overview

MapReduce Features

  • Fine grained Map and Reduce tasks
    • Improved load balancing
    • Faster recovery from failed tasks
  • Automatic re-execution on failure
    • In a large cluster, some nodes are always slow or flaky
    • Introduces long tails or failures in computation
    • Framework re-executes failed tasks
  • Locality optimizations
    • With big data, bandwidth to data is a problem
    • Map-Reduce + HDFS is a very effective solution
    • Map-Reduce queries HDFS for locations of input data
    • Map tasks are scheduled local to the inputs when possible
http://www.cs.wright.edu/~tkprasad/courses/cs707/ProgrammingHadoop.pdf

Putting it All Together With Hadoop's Version of Hello World

“45% of all Hadoop tutorials count words. 25% count sentences. 20% are about paragraphs. 10% are log parsers. The remainder are helpful.”
- jandersen (http://twitter.com/jandersen/)

  • Talk about how to create a local Hadoop environment using Microsoft technology
  • Copy some data into HDFS
  • Execute some code (using Visual Studio & C#)
  • Count some words!

Demo

Setting up Environments

For Java, all you need is Maven, Java (I'm using 7), and the VM from Cloudera (or any installed Hadoop environment).

For .NET, you can install the HDInsight Emulator and configure Visual Studio with some NuGet plugins. The next couple of slides talk specifically about .NET and C#.

Installing HDInsight Emulator

Setting Up Visual Studio

  • Startup Visual Studio
  • Create a Console App
  • Use NuGet to add Microsoft .NET API for Hadoop, Map Reduce API for Hadoop, & Hadoop Web Client
  • Let's See Some Code

WordCount::Map Function

  public class WordCountMapper : MapperBase
  {
      private static int one = 1;

      public override void Map(string inputLine, MapperContext context)
      {
          string[] words = inputLine.Split(' ');
          foreach (string word in words)
          {
              // skip empty tokens (and single characters) left by the split
              if (word.Length > 1)
              {
                  // emit each word on its own line with the count 1
                  context.EmitKeyValue(word, one.ToString());
              }
          }
      }
  }
		  

WordCount::Flow

WordCount::Reduce Function

    public class WordCountReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
        {
            // sum all the ones emitted for this key
            int sum = 0;
            foreach (string value in values)
            {
                sum += int.Parse(value);
            }
            context.EmitKeyValue(key, sum.ToString());
        }
    }
			  

WordCount::Flow

WordCount::Driver

	public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
	{
	    public override HadoopJobConfiguration Configure(ExecutorContext context)
	    {
	        HadoopJobConfiguration config = new HadoopJobConfiguration();
	        config.InputPath = "Input/wordcount";
	        config.OutputFolder = "Output/wordcount";
	        return config;
	    }
	}

	class Program
	{
	    static void Main(string[] args)
	    {
	        // connect to the local (or emulated) cluster and run the job
	        var hadoop = Hadoop.Connect();
	        var result = hadoop.MapReduceJob.ExecuteJob<WordCountJob>();

	        Console.In.Read();
	    }
	}
			  

Compile and Run It (Java)

		
  <dependencies>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>1.1.2</version>
      </dependency>
   </dependencies>
		  
		
mvn clean install
cd target/
hadoop jar WordCount-0.0.1-SNAPSHOT.jar com.wesleyreisz.bigdata.wordcount.v1.WordCount \
	/home/reiszwt/samples/bible/bible+shakes.nopunc /home/reiszwt/samples/bible/output
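
The WordCount class invoked above is not reproduced in these notes. A minimal Hadoop 1.x driver matching that command line would look roughly like the sketch below, which reuses the mapper and reducer sketched earlier; the real com.wesleyreisz.bigdata.wordcount.v1.WordCount may differ, and the package declaration is omitted here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: wires the mapper and reducer sketched earlier into a job,
    // taking the input and output paths from the command line as in the call above.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: WordCount <input path> <output path>");
                System.exit(-1);
            }

            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);

            job.setMapperClass(WordCountFunctions.TokenMapper.class);
            job.setCombinerClass(WordCountFunctions.SumReducer.class); // optional local pre-aggregation
            job.setReducerClass(WordCountFunctions.SumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }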
	  

Compile and Run it (C#)

"I have an idea... let's build something!"

National Climatic Data Center (NCDC) Weather Station Data

Purpose: Write a program to find the highest temperature per year across a large volume of NCDC datasets

In the process, we will:

  • Write AND TEST using sample data
  • Test against the local filesystem
  • Test against the localhost installation of hadoop

National Climatic Data Center (NCDC) Weather Station Data

0029029720999991901072213004+60450+022267FM-12+001499999V0200201N001019999999N0000001N9+03171+99999101681ADDGF101991999999999999999999			
		
http://www1.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf

POS: 88-92
AIR-TEMPERATURE-OBSERVATION air temperature
The temperature of the air.
MIN: -0932 MAX: +0618 UNITS: Degrees Celsius
SCALING FACTOR: 10
DOM: A general domain comprised of the numeric characters (0-9), a plus sign (+), and a minus sign (-).
+9999 = Missing

			# verify data
			gunzip 1901.gz
			cat 1901 | cut -c88-92 | sort
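
The Java MaxTemperature code demoed in this session is likewise not reproduced here, but its mapper is essentially fixed-width parsing of the record layout above. The sketch below follows the well-known example in Hadoop: The Definitive Guide (credited in the Shoutouts) and should be read as an approximation of the demo code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (year, temperature) for every valid reading; a reducer then takes the max per year.
    public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();

            String year = line.substring(15, 19);          // year of observation (NCDC format)

            // Air temperature lives at positions 88-92 (1-based), scaled by 10.
            int airTemperature;
            if (line.charAt(87) == '+') {                  // parseInt doesn't like a leading '+'
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }

            String quality = line.substring(92, 93);       // quality code at position 93
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }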
		

Steps

	
# show project in IntelliJ 
# discuss maven
# discuss testcases

(Java Way)
# build it & run locally
hadoop jar target/hadoop-examples.jar com.wesleyreisz.lionsAndTigersAndPig.MaxTemperatureDriver \
  --conf ../../conf/hadoop-local.xml ../../input/ncdc/all/ output
# build it & run against localhost
hadoop jar target/hadoop-examples.jar com.wesleyreisz.lionsAndTigersAndPig.MaxTemperatureDriver \
  --conf ../../conf/hadoop-localhost.xml /user/cloudera/ncdc/input /user/cloudera/ncdc/output
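
For context, the conf option on those commands is one of Hadoop's generic options, which a driver picks up by implementing Tool and running through ToolRunner. A hedged sketch of such a driver is shown below; MaxTemperatureReducer, which would simply keep the maximum reading per year, is hypothetical and not shown:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hedged driver sketch: implementing Tool is what lets Hadoop's generic options
    // (such as the conf file passed on the command lines above) be parsed for us.
    public class MaxTemperatureDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: MaxTemperatureDriver [generic options] <input> <output>");
                return -1;
            }

            Job job = new Job(getConf(), "Max temperature");
            job.setJarByClass(getClass());

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MaxTemperatureMapper.class);    // sketched earlier
            job.setReducerClass(MaxTemperatureReducer.class);  // hypothetical: keeps the max per year

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
        }
    }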

	

Steps

	
(.NET Way)
http://www.windowsazure.com/en-us/manage/services/hdinsight/submit-hadoop-jobs-programmatically/
configure core-site.xml for azure (when needed)
first make sure the old output is removed:

hadoop fs -rmr Output/ncdc

hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar ^
  -conf {path to core-site.xml} ^
  -input "Input/ncdc" ^
  -output "Output/ncdc" ^
  -mapper "MaxTemperatureMapper.exe" ^
  -file "C:\Users\Wes\projects\lionsAndTigersAndPigOhhMy\dotnet\MaxTemperatureMapper\bin\Debug\MaxTemperatureMapper.exe"
	

Demo

There Are Also Other SaaS Solutions Using Hadoop

Amazon
  • Amazon offers IaaS (EC2) plus a managed Hadoop service, Elastic MapReduce (EMR)
  • S3 can be used as a Hadoop filesystem... you can read anything loaded there as input and write results back out
Microsoft
  • Azure has HDInsight
Google
  • Click to Deploy
  • Uses Google Cloud Storage buckets as storage
  • gcloud compute ssh --zone= \ --ssh-flag="-t" --command="sudo su -l hadoop"

"What!? What happened to Pig?"

Different Ways to Write your Hadoop Queries

  • Java/C/C#
  • Pig
  • Hive

Pig's Philosophy

What does it mean to be a pig?
The Apache Pig Project has some founding principles that help pig developers decide how the system should grow over time. This page presents those principles.

Pigs Eat Anything
Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

Pigs Live Anywhere
Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but we do not intend that to be only on Hadoop.

Pigs Are Domestic Animals
Pig is designed to be easily controlled and modified by its users.

Pigs Fly
Pig processes data quickly. We want to consistently improve performance, and not implement features in ways that weigh pig down so it can't fly.

http://pig.apache.org/

Let's Take a Look at Pig

c:\Hadoop\hadoop-1.1.0-SNAPSHOT>pig
2013-11-20 21:46:25,949 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:50300
grunt>
		
records = LOAD '/user/Wes/Input/sample/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality == 1;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
		

Hive's Approach

Apache Hive
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop™, it provides:

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
  • Query execution via MapReduce
Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.

https://cwiki.apache.org/confluence/display/Hive/Home
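
Since the passage above mentions extending QL with UDFs, here is a hedged sketch of a trivial scalar UDF against the Hive API of this era; the class name and behavior are made up for illustration:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A trivial scalar UDF: Hive calls evaluate() once per row.
    public class StripPlusUDF extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            String s = input.toString();
            // strip a leading '+' (e.g. from NCDC-style signed readings)
            return new Text(s.startsWith("+") ? s.substring(1) : s);
        }
    }

It would be registered from the Hive shell with ADD JAR and CREATE TEMPORARY FUNCTION strip_plus AS 'StripPlusUDF'; before being used in a query (names again hypothetical).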

Let's Take a Look at Hive

hive
		
CREATE TABLE records (year STRING, temperature INT, quality INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH 'Input/ncdc/micro-tab/sample.txt' OVERWRITE INTO TABLE records;
SELECT year, MAX(temperature) FROM records WHERE temperature != 9999 AND quality = 1 GROUP BY year;
		

But wait... we're not quite done now, are we?

Hadoop is a massive ecosystem.



Summary

  • The Hadoop ecosystem is huge
  • It supports multiple ways to get at and work with data, depending on your needs and skills
  • It is open source!!!
  • It is based on Java, but can be used from .NET!
  • Documentation is still weak...
  • There is a multitude of options, from local machines to IaaS and SaaS offerings
  • The market potential is huge with IoT and Big Data... worth knowing the basics

Shoutouts

Hadoop: The Definitive Guide, 3rd Edition
Storage and Analysis at Internet Scale
By Tom White
Publisher: O'Reilly Media / Yahoo Press
Released: May 2012

*NCDC Samples came directly from this book
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white_paper_c11-520862.html
http://www.cloudera.com
Provided the VM I used
Provided the installer for the cloud-deployed cluster
Wonderful help via the Google Groups forums @ https://groups.google.com/a/cloudera.org/groups/dir
Special Shoutout to Sandy Ryza of Cloudera for helping me with an issue I had with the CDH VM! Thank you!!!

Thank You!

Wesley Reisz