|  | 2012 | 2014 | Diff |
|---|---|---|---|
| Mobile traffic consumed / month | 1.5 exabytes | 2.5 exabytes | 69% increase |
| Video uploaded to YouTube / min | 20 hrs | 100 hrs | 5x more |
| Tweets / min | 34,722 | 347,222 | 10x more |
MapReduce is a programming model for processing large data sets, and also the name of Google's implementation of the model. MapReduce is typically used for distributed computing on clusters of computers.
Writing a program that executes in parallel has proven over the years to be a very challenging task. MapReduce simplifies the process by requiring coders to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the MapReduce system handles everything else: marshalling the distributed servers, running the various tasks in parallel, managing all communication and data transfer between the parts of the system, providing redundancy and fault tolerance, and overall management of the whole process.
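To make the division of labor concrete, here is a toy single-process model of the two functions a MapReduce programmer actually writes. The class and method names are illustrative only (not any real Hadoop API); the `run` method stands in for everything the framework does for you (shuffle, grouping, fault tolerance):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-process model of MapReduce word count.
// The programmer writes map() and reduce(); run() plays the
// role of the framework (shuffle/sort/group between phases).
public class WordCountModel {

    // Map(): one input record in, zero or more (key, value) pairs out
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce(): one key plus all of its values in, one result out
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // The "framework": run map over every line, group pairs by key
    // (the shuffle), then run reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (SimpleEntry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
        System.out.println(run(lines).get("the")); // prints 2
    }
}
```

In a real cluster, `map` and `reduce` run on different machines over partitions of the data; the logic the programmer supplies is no bigger than this.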
Not a new concept (it just wasn't parallel):
#!/usr/bin/env bash
for year in ../input/ncdc/all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
For Java, all you need is Maven, a JDK (I'm using Java 7), and the Cloudera VM (or an installed Hadoop environment).
For .NET, you can install the HDInsight Emulator and configure Visual Studio with a few NuGet packages. The next couple of slides talk specifically about .NET and C#.
public class WordCountMapper : MapperBase
{
    private static int one = 1;

    public override void Map(string inputLine, MapperContext context)
    {
        // split the line on spaces and emit each word with a count of 1
        string[] words = inputLine.Split(' ');
        foreach (string word in words)
        {
            if (word.Length > 1) // skip empty tokens and single characters
            {
                context.EmitKeyValue(word, one.ToString());
            }
        }
    }
}
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // sum all the ones emitted by the mapper for this key
        int sum = 0;
        foreach (string value in values)
        {
            sum += int.Parse(value);
        }
        context.EmitKeyValue(key, sum.ToString());
    }
}
public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "Input/wordcount";
        config.OutputFolder = "Output/wordcount";
        return config;
    }
}

static void Main(string[] args)
{
    var hadoop = Hadoop.Connect();
    var result = hadoop.MapReduceJob.ExecuteJob<WordCountJob>();
    Console.In.Read();
}
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.1.2</version>
  </dependency>
</dependencies>
mvn clean install
cd target/
hadoop jar WordCount-0.0.1-SNAPSHOT.jar com.wesleyreisz.bigdata.wordcount.v1.WordCount \
  /home/reiszwt/samples/bible/bible+shakes.nopunc /home/reiszwt/samples/bible/output
Purpose: Write a program that finds the highest temperature per year across a large volume of NCDC weather datasets.
In the process, we will:
Sample record (one line):
0029029720999991901072213004+60450+022267FM-12+001499999V0200201N001019999999N0000001N9+03171+99999101681ADDGF101991999999999999999999
Format document: http://www1.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf
POS: 88-92
AIR-TEMPERATURE-OBSERVATION air temperature
The temperature of the air.
MIN: -0932  MAX: +0618
UNITS: Degrees Celsius
SCALING FACTOR: 10
DOM: A general domain comprised of the numeric characters (0-9), a plus sign (+), and a minus sign (-).
+9999 = Missing
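The field spec above maps directly onto a few lines of string handling. A minimal sketch of a parser for the temperature field in plain Java (no Hadoop dependency; the class and method names are mine, not from the talk's codebase):

```java
// Parses the AIR-TEMPERATURE-OBSERVATION field described above:
// positions 88-92 (sign included), scaling factor 10, +9999 = missing;
// the quality code sits at position 93 ([01459] = acceptable).
public class NcdcRecordParser {
    static final int MISSING = 9999;

    // returns the temperature in tenths of a degree Celsius
    static int parseTemperature(String record) {
        // String.substring is 0-based; the format document is 1-based,
        // so 1-based positions 88-92 become substring(87, 92)
        String field = record.substring(87, 92);   // e.g. "+0317"
        return Integer.parseInt(field);            // parseInt accepts a leading sign (Java 7+)
    }

    static boolean isGoodQuality(String record) {
        char q = record.charAt(92);                // 1-based position 93
        return "01459".indexOf(q) >= 0;
    }

    public static void main(String[] args) {
        // the sample record shown earlier in this section
        String record = "0029029720999991901072213004+60450+022267FM-12"
            + "+001499999V0200201N001019999999N0000001N9+03171+99999101681ADD"
            + "GF101991999999999999999999";
        int t = parseTemperature(record);
        if (t != MISSING && isGoodQuality(record)) {
            System.out.println(t / 10.0 + " degrees C"); // 31.7 degrees C
        }
    }
}
```

This is exactly the extraction that both the awk one-liner and the MapReduce mapper perform on every record.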
# verify data
gunzip 1901.gz
cat 1901 | cut -c88-92 | sort
# show project in IntelliJ
# discuss maven
# discuss testcases

(Java Way)

# build it & run local
hadoop jar target/hadoop-examples.jar com.wesleyreisz.lionsAndTigersAndPig.MaxTemperatureDriver \
  --conf ../../conf/hadoop-local.xml ../../input/ncdc/all/ output

# build it & run against localhost
hadoop jar target/hadoop-examples.jar com.wesleyreisz.lionsAndTigersAndPig.MaxTemperatureDriver \
  --conf ../../conf/hadoop-localhost.xml /user/cloudera/ncdc/input /user/cloudera/ncdc/output
(.NET Way)

http://www.windowsazure.com/en-us/manage/services/hdinsight/submit-hadoop-jobs-programmatically/

# configure core-site.xml for Azure (when needed)
# first make sure the old output is removed:
hadoop fs -rmr Output/ncdc
hadoop jar %HADOOP_HOME%\lib\hadoop-streaming.jar -conf={path to core-site.xml} \
  -input "Input/ncdc" -output "Output/ncdc" \
  -file "C:\Users\Wes\projects\lionsAndTigersAndPigOhhMy\dotnet\MaxTemperatureMapper\bin\Debug\MaxTemperatureMapper.exe"
What does it mean to be a pig?
The Apache Pig Project has some founding principles that help pig developers decide how the system should grow over time. This page presents those principles.
Pigs Eat Anything
Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
Pigs Live Anywhere
Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It was implemented first on Hadoop, but it is not intended to run only on Hadoop.
Pigs Are Domestic Animals
Pig is designed to be easily controlled and modified by its users.
Pigs Fly
Pig processes data quickly. We want to consistently improve performance, and not implement features in ways that weigh Pig down so it can't fly.
http://pig.apache.org/
c:\Hadoop\hadoop-1.1.0-SNAPSHOT>pig
2013-11-20 21:46:25,949 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:50300
grunt>
records = LOAD '/user/Wes/Input/sample/sample.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND quality == 1;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
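The Pig script expresses a filter, a group-by, and a per-group max. For comparison, the same dataflow written out in plain Java (the tuples below are invented sample data; field names follow the Pig schema):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java version of the Pig dataflow:
// LOAD -> FILTER -> GROUP BY year -> MAX per group.
public class MaxTempPipeline {
    static class Record {
        final String year; final int temperature; final int quality;
        Record(String y, int t, int q) { year = y; temperature = t; quality = q; }
    }

    static Map<String, Integer> maxTempPerYear(List<Record> records) {
        Map<String, Integer> max = new HashMap<>();
        for (Record r : records) {
            if (r.temperature == 9999 || r.quality != 1) continue; // FILTER
            max.merge(r.year, r.temperature, Math::max);           // GROUP + MAX
        }
        return max;
    }

    public static void main(String[] args) {
        // invented sample readings (temperatures in tenths of a degree)
        List<Record> records = Arrays.asList(
            new Record("1950", 111, 1),
            new Record("1950", 78, 1),
            new Record("1950", 9999, 1),  // missing reading, filtered out
            new Record("1949", 22, 1),
            new Record("1949", 120, 2));  // bad quality code, filtered out
        System.out.println(maxTempPerYear(records).get("1950")); // prints 111
    }
}
```

Pig generates essentially this logic as MapReduce jobs, so you write five lines of Pig Latin instead of a driver, mapper, and reducer.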
Apache Hive
The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop™, it provides:

- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase
- Query execution via MapReduce
Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).
Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.
https://cwiki.apache.org/confluence/display/Hive/Home
hive
LOAD DATA LOCAL INPATH 'Input/ncdc/micro-tab/sample.txt' OVERWRITE INTO TABLE records;

SELECT year, MAX(temperature) FROM records WHERE temperature != 9999 AND quality = 1 GROUP BY year;
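The LOAD statement assumes a `records` table already exists. A sketch of a definition that would match the tab-separated sample file (column names are taken from the query above; the tab delimiter is an assumption about the sample data):

```sql
-- hypothetical table definition matching the query above;
-- the sample file is assumed to be tab-delimited text
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';
```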