Wednesday 31 July 2013

Apache Cassandra Training @ BigDataTraining.IN

Apache Cassandra
Key improvements
Cassandra Query Language (CQL) Enhancements
   One of the main objectives of Cassandra 1.1 was to bring CQL up to parity with the legacy API and command line interface (CLI) that has shipped with Cassandra for several years. This release achieves that goal. CQL is now the primary interface into the DBMS.

Composite Primary Key Columns
    The most significant enhancement of CQL is support for composite primary key columns and wide rows. Composite keys distribute column family data among the nodes. New querying capabilities are a beneficial side effect of wide-row support. You use an ORDER BY clause to sort the result set.
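As an illustration, here is a minimal sketch of a table with a composite primary key and a query with ORDER BY. It uses CQL 3 through the DataStax Java driver (2.x-style API, which requires the native protocol of Cassandra 1.2 or later); the keyspace, table, and column names are hypothetical.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CompositeKeyExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");   // the 'demo' keyspace is assumed to exist

        // user_id is the partition key; tweet_id is a clustering column, so all of a
        // user's tweets are stored in one wide row, physically sorted by tweet_id.
        session.execute("CREATE TABLE tweets ("
                + " user_id  text,"
                + " tweet_id timeuuid,"
                + " body     text,"
                + " PRIMARY KEY (user_id, tweet_id))");

        // ORDER BY applies to the clustering column within a single partition.
        ResultSet rs = session.execute(
                "SELECT tweet_id, body FROM tweets"
                + " WHERE user_id = 'alice' ORDER BY tweet_id DESC LIMIT 10");
        for (Row row : rs) {
            System.out.println(row.getUUID("tweet_id") + " : " + row.getString("body"));
        }

        cluster.close();
    }
}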

Global Row and Key Caches

    Memory caches for column families are now managed globally instead of at the individual column family level, simplifying configuration and tuning. Cassandra automatically distributes memory for the various column families based on the overall workload and specific column family usage. Administrators can choose to include or exclude column families from being cached via the caching parameter that is used when creating or modifying column families.
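For example, reusing the Session from the sketch above, a table can be tuned or excluded with an ALTER statement (shown here in the string form used through Cassandra 2.0; later releases switched to a map syntax). The table name is illustrative.

// Cache only the row keys of this table:
session.execute("ALTER TABLE tweets WITH caching = 'keys_only'");

// Or exclude the table from both the key and row caches entirely:
session.execute("ALTER TABLE tweets WITH caching = 'none'");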

Row-Level Isolation
    Full row-level isolation is now in place so that writes to a row are isolated to the client performing the write and are not visible to any other user until they are complete. From a transactional ACID (atomic, consistent, isolated, durable) standpoint, this enhancement now gives Cassandra transactional AID support. Consistency in the ACID sense typically involves referential integrity with foreign keys among related tables, which Cassandra does not have. Cassandra offers tunable consistency not in the ACID sense, but in the CAP theorem sense where data is made consistent across all the nodes in a distributed database cluster. A user can pick and choose on a per operation basis how many nodes must receive a DML command or respond to a SELECT query.
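To make the per-operation choice concrete, here is a small sketch using the DataStax Java driver (again reusing the Session from the earlier sketch); the statements and table are illustrative only.

// ConsistencyLevel and SimpleStatement come from com.datastax.driver.core.

// Require a quorum of replicas to acknowledge this write...
SimpleStatement insert = new SimpleStatement(
        "INSERT INTO tweets (user_id, tweet_id, body) VALUES ('alice', now(), 'hello')");
insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
session.execute(insert);

// ...but let a read succeed as soon as a single replica responds.
SimpleStatement select = new SimpleStatement(
        "SELECT body FROM tweets WHERE user_id = 'alice'");
select.setConsistencyLevel(ConsistencyLevel.ONE);
ResultSet rs = session.execute(select);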

Hadoop Integration
The following low-level features have been added to Cassandra’s support for Hadoop:

  • Secondary index support for the column family input format. Hadoop jobs can now make use of Cassandra secondary indexes.
  • Wide row support. Previously, wide rows that had, for example, millions of columns could not be accessed, but now they can be read and paged through in Hadoop.
  • The bulk output format provides a more efficient way to load data into Cassandra from a Hadoop job.

Basic architecture
   A Cassandra instance is a collection of independent nodes that are configured together into a cluster. In a Cassandra cluster, all nodes are peers, meaning there is no master node or centralized management process. A node joins a Cassandra cluster based on certain aspects of its configuration. This section explains those aspects of the Cassandra cluster architecture.

     Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about.

    In Cassandra, the gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.

    When a node first starts up, it looks at its configuration file to determine the name of the Cassandra cluster it belongs to and which node(s), called seeds, to contact to obtain information about the other nodes in the cluster. These cluster contact points are configured in the cassandra.yaml configuration file for a node.

    Failure detection is a method for locally determining, from gossip state, if another node in the system is up or down. Failure detection information is also used by Cassandra to avoid routing client requests to unreachable nodes whenever possible.


BigDataTraining.IN has a strong focus and established thought leadership in the area of Big Data and Analytics. We use a global delivery model to help you to evaluate and implement solutions tailored to your specific technical and business context.

http://www.bigdatatraining.in/hadoop-development/training-schedule/

http://www.bigdatatraining.in/contact/

Mail:
info@bigdatatraining.in

Call:
+91 9789968765
044 - 42645495

Visit Us:
#67, 2nd Floor, Gandhi Nagar 1st Main Road, Adyar, Chennai - 20
[Opp to Adyar Lifestyle Super Market]

Monday 29 July 2013

Alternatives to DAS in Hadoop storage - Hadoop Training in Chennai

Alternatives to DAS in Hadoop storage

We concluded the last article in this series on managing "big data" with a question: Does Hadoop storage have to be direct-attached storage (DAS)? This article introduces some alternatives to the strict use of embedded DAS for hardware-based storage within the Hadoop MapReduce framework.

Direct-attached storage (DAS) is computer storage that is directly attached to one computer or server and is not, without special support, directly accessible to other ones. The main alternatives to direct-attached storage are network-attached storage (NAS) and the storage area network (SAN).

For an individual computer user, the hard drive is the usual form of direct-attached storage. In an enterprise, providing for storage that can be shared by multiple computers and their users tends to be more efficient and easier to manage.


To answer that question, we'll examine the alternatives in terms of a three-stage model:

Stage one: DAS in the form of small amounts of disk (JBOD/RAID) embedded within each cluster node is replaced with larger, high-performance arrays that are external to the cluster nodes but still directly attached to them, preserving data locality. In a way, we're rephrasing our original question:

Does Hadoop data storage have to be relatively small groupings of DAS embedded within each cluster node? No, but the larger external storage arrays that replace embedded DAS still function as DAS.

Stage two: The node-based DAS layer used as primary storage by the cluster is augmented with the addition of a second storage layer consisting of network-attached storage (NAS) or a storage-area network (SAN).

Stage three: The node-based DAS layer used as primary storage is replaced by a networked storage layer consisting of NAS or a SAN.

Learn Big Data from Big Data Solutions Architects! Hadoop Training Chennai with Hands-On Practical 


 BigDataTraining.IN - India's Leading BigData Consulting & Training Provider
BigDataTraining.IN is a leading Global Talent Development Corporation, building a skilled manpower pool for global industry requirements. BigDataTraining.IN has today grown to be amongst the world's leading talent development companies offering learning solutions to Individuals, Institutions & Corporate Clients.
BigDataTraining.IN is the only software institute which offers in-depth classroom & online training in different cutting-edge technologies.

Classroom / Online / Corporate Training

Get connected with BigDataTraining.IN

http://www.hadooptrainingchennai.in/

http://www.hadooptrainingchennai.in/courses/

http://www.hadooptrainingchennai.in/hadoop-training/

http://www.hadooptrainingchennai.in/course-content/


Contact us:
Mail: info@bigdatatraining.in
Call: +91 97899 68765 / 044 – 42645495
Visit us: #67,2nd Floor, 1st Main Road, Gandhi Nagar, Adyar, Chennai- 600020

 

Saturday 27 July 2013

Apache Pig Hadoop Training in Chennai - BigDataTraining.IN

Pig Latin Statements

A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE which read data from and write data to the file system.) Pig Latin statements can span multiple lines and must end with a semi-colon ( ; ). Pig Latin statements are generally organized in the following manner:
  1. A LOAD statement reads data from the file system.
  2. A series of "transformation" statements process the data.
  3. A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.

 Pig Latin is a relatively simple language that executes statements. A statement is an operation that takes input (such as a bag, which represents a set of tuples) and emits another bag as its output. A bag is a relation, similar to a table in a relational database (where tuples represent the rows, and individual tuples are made up of fields).

A script in Pig Latin often follows a specific format in which data is read from the file system, a number of operations are performed on the data (transforming it in one or more ways), and then the resulting relation is written back to the file system.

 BigDataTraining.IN - India's Leading BigData Consulting & Training Provider, Request a Quote!
Pig has a rich set of data types, supporting not only high-level concepts like bags, tuples, and maps, but also simple types such as ints, longs, floats, doubles, chararrays, and bytearrays. With the simple types, you'll find a range of arithmetic operators (such as add, subtract, multiply, divide, and modulo) in addition to a conditional operator called bincond that operates like the C ternary operator. As you'd expect, there is also a full suite of comparison operators, including rich pattern matching using regular expressions.
All Pig Latin statements operate on relations (and are called relational operators). There's an operator for loading data from and storing data in the file system. There's a means to FILTER data by iterating the rows of a relation. This functionality is commonly used to remove data from the relation that is not needed for subsequent operations. Alternatively, if you need to iterate the columns of a relation instead of the rows, you can use the FOREACH operator. FOREACH permits nested operations such as FILTER and ORDER to transform the data during the iteration.
The ORDER operator provides the ability to sort a relation based on one or more fields. The JOIN operator performs an inner or outer join of two or more relations based on common fields. The SPLIT operator provides the ability to split a relation into two or more relations based on a user-defined expression. Finally, the GROUP operator groups the data in one or more relations based on some expression.
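To tie these operators together, the following is a minimal, illustrative sketch that drives a Pig Latin script through Pig's Java API (org.apache.pig.PigServer). The input file, schema, and field names are made up for the example.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinSketch {
    public static void main(String[] args) throws IOException {
        // Run Pig in local mode; use ExecType.MAPREDUCE to run on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // 1. LOAD reads data from the file system into a relation.
        pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray, bytes:long);");

        // 2. Transformation statements: FILTER rows, GROUP them, aggregate with FOREACH, and ORDER the result.
        pig.registerQuery("big = FILTER logs BY bytes > 1024;");
        pig.registerQuery("by_user = GROUP big BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group AS user, SUM(big.bytes) AS total;");
        pig.registerQuery("sorted = ORDER totals BY total DESC;");

        // 3. STORE writes the resulting relation back to the file system.
        pig.store("sorted", "bytes_per_user");
    }
}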

 http://www.bigdatatraining.in/contact/
http://www.bigdatatraining.in/hadoop-development/training-schedule/


Hadoop Training Chennai with Hands-On Practical Approach !

 


Mail:
info@bigdatatraining.in

Call:
+91 9789968765
044 - 42645495

Visit Us:
#67, 2nd Floor, Gandhi Nagar 1st Main Road, Adyar, Chennai - 20
[Opp to Adyar Lifestyle Super Market]

Wednesday 24 July 2013

Apache Sqoop for importing data from a relational DB

Sqoop


Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

This document describes how to get started using Sqoop to move data between databases and Hadoop, and provides reference information for the operation of the Sqoop command-line tool suite.

This document is intended for:
  • System and application programmers
  • System administrators
  • Database administrators
  • Data analysts
  • Data engineers


Using Apache Sqoop for Data Import from Relational DBs

 

Apache Sqoop can be used to import data from any relational DB into HDFS, Hive or HBase.
To import data into HDFS, use the sqoop import command and specify the relational DB table and connection parameters:

sqoop import --connect <JDBC connection string> --table <tablename> \
    --username <username> --password <password>

This will import the data and store it as comma-delimited text files in a directory in HDFS.

To import data into Hive, use the sqoop import command and add the --hive-import option:

sqoop import --connect <JDBC connection string> --table <tablename> \
    --username <username> --password <password> --hive-import

 

http://www.bigdatatraining.in/hadoop-development/training-schedule/
Mail:
info@bigdatatraining.in

Call:
+91 9789968765
044 - 42645495

Visit Us:
#67, 2nd Floor, Gandhi Nagar 1st Main Road, Adyar, Chennai - 20
[Opp to Adyar Lifestyle Super Market]

Saturday 13 July 2013

HBase -Big Data Hadoop Training in Bangalore @ BigDataTraining.IN

What is HBase?
“HBase is a distributed, persistent, strictly consistent storage system with near-optimal write -- in terms of I/O channel saturation -- and excellent read performance, and it makes efficient use of disk space by supporting pluggable compression algorithms that can be selected based on the nature of the data in specific column families”

Why HBase?
Handles unstructured or semi-structured data.
Handles enormous data volumes.
Flexible. Ad-Hoc access as well as Full/Partial Table Scans.
Cost-effective scalability.
Near Linear scalability.
Part of the Hadoop Ecosystem.

HBase Architecture

HBase: HMaster

• Responsible for assigning regions to Region Servers (RS)
• Handles load balancing of regions across Region Servers
• Has nothing to do with the actual data access
• Schema and metadata management
• Fairly lightweight process

HBase: Region Server
• Responsible for serving read/write requests
• Handles region splits
• Clients communicate directly with the Region Servers
• Each Region Server creates its own ephemeral node in ZooKeeper
• Region Servers contain the following:
   • WAL (HLog)
   • Memstore
   • HFile
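
To make the read/write path concrete, here is a small, illustrative client sketch against the 0.9x-era HBase Java API; the table and column family names are hypothetical. The client locates the owning region and talks to that Region Server directly; the HMaster is not involved in the data path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "users");            // hypothetical table with an 'info' column family

        // Write: the Put is recorded in the WAL and Memstore of the owning Region Server.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Read: served from the Memstore and/or HFiles of the same Region Server.
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

        table.close();
    }
}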

Learn what the industry is in need of – showcase the required BigData expertise with real experience integrating a BigData workflow as Proof of Concept (PoC) project work. Get hands-on expertise – powered by our Expert Team!



BigDataTraining.IN is an organization of learners who aim to advance individual and organizational competitiveness by acquiring and applying right knowledge.

24/7 Technical Support

To facilitate smooth training, In addition to our CloudLab Portal- we offer 24/7 Technical Support.

Our Expertise

* Work / Learn on production level Cloud Servers
* Big Data Thought Leadership
* Primary focus - hands-on sessions
* Leaders in Big Data Consulting Services in India.
http://www.bigdatatraining.in/bigdata-consulting/big-data-services/
* Founders of Big Data Real Time Analytics Platform
http://www.bigdatatraining.in/announcing-our-big-data-analytics-platform/

http://www.bigdatatraining.in/amazon-elastic-mapreduce-training/

http://www.bigdatatraining.in/launching-apache-cassandra-training/

http://www.bigdatatraining.in/launching-mongodb-training/

http://www.hadooptrainingindia.in/hadoop-bigdata-training-online/

BigDataTraining.IN is a leading Global Talent Development Corporation, building a skilled manpower pool for global industry requirements. BigDataTraining.IN has today grown to be amongst the world's leading talent development companies offering learning solutions to Individuals, Institutions & Corporate Clients.

BigDataTraining.IN has a strong focus and established thought leadership in the area of Big Data and Analytics. We use a global delivery model to help you to evaluate and implement solutions tailored to your specific technical and business context.

http://www.bigdatatraining.in/hadoop-development/training-schedule/

http://www.bigdatatraining.in/contact/

Mail:
info@bigdatatraining.in

Call:
+91 9789968765
044 - 42645495

Visit Us:
#67, 2nd Floor, Gandhi Nagar 1st Main Road, Adyar, Chennai - 20
[Opp to Adyar Lifestyle Super Market]

Monday 8 July 2013

Secondary Name Node - hadoop HDFS Training

Secondary Name Node
 
 
Hadoop has a server role called the Secondary Name Node.  A common misconception is that this role provides a high-availability backup for the Name Node.  This is not the case.
The Secondary Name Node periodically connects to the Name Node (by default, every hour) and grabs a copy of the Name Node's in-memory metadata and the files used to store that metadata (both of which may be out of sync).  The Secondary Name Node combines this information into a fresh set of files and delivers them back to the Name Node, while keeping a copy for itself.
Should the Name Node die, the files retained by the Secondary Name Node can be used to recover the Name Node.  In a busy cluster, the administrator may configure the Secondary Name Node to perform this housekeeping much more frequently than the default of one hour, perhaps as often as every minute.

Big Data to create a new boom in job market

 
The 'Big Data' industry - the ability to access, analyze and use humongous volumes of data through specific technology - will require a whole new army of data workers globally. India itself will require a minimum of 1,00,000 data scientists in the next couple of years, in addition to scores of data managers and data analysts, to support the fast-emerging Big Data space.

The exponentially decreasing cost of data storage, combined with the soaring volume of data being captured, presents challenges and opportunities to those who work in the new frontiers of data science. Businesses, government agencies, and scientists leveraging data-based decisions are more successful than those relying on decades of trial-and-error. But taming and harnessing big data can be a herculean undertaking. The data must be collected, processed and distilled, analyzed, and presented in a manner humans can understand. Because there are no degrees in data science, data scientists must grow into their roles. If you are looking for resources to help you better understand big data and analytics, we have the knowledge and experience needed to help make your systems contribute to the success of your business. Form a tandem with us and take advantage of our capacity to manage, process and analyze big data effectively, quickly and economically.

BigDataTraining.IN has a strong focus and established thought leadership in the area of Big Data and Analytics. We use a global delivery model to help you to evaluate and implement solutions tailored to your specific technical and business context.

Get Hands-on Training @ BigDataTraining.IN

email : info@bigdatatraining.in

Phone: +91 9789968765, 044-42645495
Contact us:
#67,2nd Floor, 1st Main Road, Gandhi Nagar, Adyar, Chennai- 600020

Name Node in HDFS - Hadoop Classroom / Online Training

Name Node in HDFS
The Name Node holds all the file system metadata for the cluster, oversees the health of Data Nodes, and coordinates access to data.  The Name Node is the central controller of HDFS.  It does not hold any cluster data itself.  The Name Node only knows what blocks make up a file and where those blocks are located in the cluster.  The Name Node points Clients to the Data Nodes they need to talk to, keeps track of the cluster's storage capacity and the health of each Data Node, and makes sure each block of data meets the minimum defined replica policy.
Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, using the same port number defined for the Name Node daemon, usually TCP 9000.  Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.  The block reports allow the Name Node to build its metadata and ensure that three copies of each block (the default replication factor) exist on different nodes, in different racks.
The Name Node is a critical component of the Hadoop Distributed File System (HDFS).  Without it, Clients would not be able to write or read files from HDFS, and it would be impossible to schedule and execute MapReduce jobs.  Because of this, it's a good idea to equip the Name Node with a highly redundant enterprise-class server configuration: dual power supplies, hot-swappable fans, redundant NIC connections, etc.
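As a brief illustration of that flow, the following sketch reads a file through the Hadoop FileSystem API; the Name Node URI and the file path are placeholders. Opening the file asks the Name Node for the block locations, and the bytes are then streamed directly from the Data Nodes.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The Name Node address is a placeholder; it normally comes from fs.default.name in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // open() contacts the Name Node for block locations;
        // the actual bytes are then read from the Data Nodes.
        FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}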
Get Hands-on Training @ BigDataTraining.IN

http://www.bigdatatraining.in/

email : info@bigdatatraining.in

Phone: +91 9789968765, 044-42645495
Contact us:
#67,2nd Floor, 1st Main Road, Gandhi Nagar, Adyar, Chennai- 600020

Friday 5 July 2013

HBase MapReduce Examples - Training

          HBase MapReduce Examples

        HBase MapReduce Read Example:

The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined as follows...
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from the mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

 

public static class MyMapper extends TableMapper<Text, Text> {

  public void map(ImmutableBytesWritable row, Result value, Context context) 
                                 throws InterruptedException, IOException {
    // process data for the row from the Result instance.
   }
}
    

       HBase MapReduce Read/Write Example

The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
 sourceTable,      // input table
 scan,           // Scan instance to control CF and attribute selection
 MyMapper.class,   // mapper class
 null,           // mapper output key
 null,           // mapper output value
 job);
TableMapReduceUtil.initTableReducerJob(
 targetTable,      // output table
 null,             // reducer class
 job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
    
An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. TableOutputFormat is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable. These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
The following is the example mapper, which creates a Put matching the input Result and emits it. Note: this is what the CopyTable utility does.
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put>  {

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
    context.write(row, resultToPut(row, value));
  }

  private static Put resultToPut(ImmutableBytesWritable key, Result result)
      throws IOException {
    Put put = new Put(key.get());
    for (KeyValue kv : result.raw()) {
      put.add(kv);
    }
    return put;
  }
}
    
There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put to the target table.

Get Hands-on Training @ BigDataTraining.IN
BigDataTraining.IN - India's Leading BigData Consulting & Training Provider, Request a Quote!

Hadoop & Big Data Training | Development | Consulting | Projects

http://www.bigdatatraining.in/hadoop-training-chennai/

http://www.hadooptrainingchennai.in/hadoop-training-in-chennai/

http://www.bigdatatraining.in/

email : info@bigdatatraining.in



 Call:
+91 9789968765
044 - 42645495

Visit Us:
#67, 2nd Floor, Gandhi Nagar 1st Main Road, Adyar, Chennai - 20
[Opp to Adyar Lifestyle Super Market]

 

 

Thursday 4 July 2013

Hadoop Streaming API Training from Big Data Hadoop Experts

Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

How Does Streaming Work

In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.
When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
You can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status count as a failure or a success, respectively. By default, streaming tasks exiting with a non-zero status are considered to be failed tasks.

Package Files With Job Submissions

You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use "-file" option to tell the framework to pack your executable files as a part of job submission. For example:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py 
The above example specifies a user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of job submission.
In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc) that may be used by the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt

Streaming Options and Usage


Mapper-Only Jobs

Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".

Specifying Other Plugins for Jobs

Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job:
   -inputformat JavaClassName
   -outputformat JavaClassName
   -partitioner JavaClassName
   -combiner JavaClassName
The class you supply for the input format should return key/value pairs of Text class. If you do not specify an input format class, the TextInputFormat is used as the default. Since the TextInputFormat returns keys of LongWritable class, which are actually not part of the input data, the keys will be discarded; only the values will be piped to the streaming mapper.
The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.



Large files and archives in Hadoop Streaming

The -files and -archives options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable.
Here are examples of the -files option:
-files hdfs://host:fs_port/user/testfile.txt#testlink
In the above example, the part of the URL after # is used as the symlink name that is created in the current working directory of tasks. So the tasks will have a symlink called testlink in the cwd that points to a local copy of testfile.txt. Multiple entries can be specified as:
-files hdfs://host:fs_port/user/testfile1.txt#testlink1 \
    -files hdfs://host:fs_port/user/testfile2.txt#testlink2
The -archives option allows you to copy jars locally to the cwd of tasks and automatically unjar the files. For example:
-archives hdfs://host:fs_port/user/testfile.jar#testlink3
In the example above, a symlink testlink3 is created in the current working directory of tasks. This symlink points to the directory that stores the unjarred contents of the uploaded jar file.
Here's another example of the -archives option. Here, the input.txt file has two lines specifying the names of the two files: testlink/cache.txt and testlink/cache2.txt. "testlink" is a symlink to the archived directory, which has the files "cache.txt" and "cache2.txt".
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
                  -input "/user/me/samples/cachefile/input.txt"  \
                  -mapper "xargs cat"  \
                  -reducer "cat"  \
                  -output "/user/me/samples/cachefile/out" \
                  -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar#testlink' \
                  -D mapred.map.tasks=1 \
                  -D mapred.reduce.tasks=1 \
                  -D mapred.job.name="Experiment"

$ ls test_jar/
cache.txt  cache2.txt

$ jar cvf cachedir.jar -C test_jar/ .
added manifest
adding: cache.txt(in = 30) (out= 29)(deflated 3%)
adding: cache2.txt(in = 37) (out= 35)(deflated 5%)

$ hadoop dfs -put cachedir.jar samples/cachefile

$ hadoop dfs -cat /user/me/samples/cachefile/input.txt
testlink/cache.txt
testlink/cache2.txt

$ cat test_jar/cache.txt 
This is just the cache string

$ cat test_jar/cache2.txt 
This is just the second cache string

$ hadoop dfs -ls /user/me/samples/cachefile/out      
Found 1 items
/user/me/samples/cachefile/out/part-00000  <r 3>   69

$ hadoop dfs -cat /user/me/samples/cachefile/out/part-00000
This is just the cache string   
This is just the second cache string


Specifying Additional Configuration Variables for Jobs

You can specify additional configuration variables by using "-D <n>=<v>". For example:
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc \
    -D mapred.reduce.tasks=2
The -D mapred.reduce.tasks=2 in the above example specifies to use two reducers for the job.

Contact us:
#67,2nd Floor, 1st Main Road, Gandhi Nagar, Adyar, Chennai- 600020