Sunday 30 June 2013

Introduction to HDFS - Big Data Hadoop Training

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS Architecture Guide describes HDFS in detail. This user guide primarily deals with the interaction of users and administrators with HDFS clusters. The HDFS architecture diagram depicts basic interactions among the NameNode, the DataNodes, and the clients. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.



The following are some of the salient features that could be of interest to many users.
  • Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability to a large set of distributed applications, is an integral part of Hadoop.
  • HDFS is highly configurable with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
  • Hadoop is written in Java and is supported on all major platforms.
  • Hadoop supports shell-like commands to interact with HDFS directly.
  • The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
  • New features and improvements are regularly implemented in HDFS. The following is a subset of useful features in HDFS:
    • File permissions and authentication.
    • Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.
    • Safemode: an administrative mode for maintenance.
    • fsck: a utility to diagnose health of the file system, to find missing files or blocks.
    • fetchdt: a utility to fetch DelegationToken and store it in a file on the local system.
    • Rebalancer: tool to balance the cluster when the data is unevenly distributed among DataNodes.
    • Upgrade and rollback: after a software upgrade, it is possible to roll back to HDFS's state before the upgrade in case of unexpected problems.
    • Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of file containing log of HDFS modifications within certain limits at the NameNode.
    • Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to HDFS. It replaces the role previously filled by the Secondary NameNode, though it is not yet battle hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
    • Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.

    SHELL COMMANDS:

    Hadoop includes various shell-like commands that directly interact with HDFS and other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell. Furthermore, the command bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations like copying files, changing file permissions, etc. They also support a few HDFS-specific operations like changing the replication of files. For more information, see the File System Shell documentation.
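
    For instance, a few common operations look like the following (the directory and file names here are only placeholders):

        # List the contents of a directory in HDFS
        bin/hdfs dfs -ls /user/hadoop

        # Copy a local file into HDFS
        bin/hdfs dfs -put localfile.txt /user/hadoop/localfile.txt

        # Change the permissions of a file in HDFS
        bin/hdfs dfs -chmod 644 /user/hadoop/localfile.txt

        # Change the replication factor of a file (an HDFS-specific operation)
        bin/hdfs dfs -setrep -w 3 /user/hadoop/localfile.txt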

    DFSAdmin Command

    The bin/hadoop dfsadmin command supports a few HDFS administration related operations. The bin/hadoop dfsadmin -help command lists all the commands currently supported. For example (a few sample invocations are sketched after this list):
    • -report: reports basic statistics of HDFS. Some of this information is also available on the NameNode front page.
    • -safemode: though usually not required, an administrator can manually enter or leave Safemode.
    • -finalizeUpgrade: removes previous backup of the cluster made during last upgrade.
    • -refreshNodes: Updates the namenode with the set of datanodes allowed to connect to the namenode. Namenodes re-read datanode hostnames in the files defined by dfs.hosts and dfs.hosts.exclude. Hosts defined in dfs.hosts are the datanodes that are part of the cluster. If there are entries in dfs.hosts, only the hosts in it are allowed to register with the namenode. Entries in dfs.hosts.exclude are datanodes that need to be decommissioned. Datanodes complete decommissioning when all the replicas from them are replicated to other datanodes. Decommissioned nodes are not automatically shut down and are not chosen for writing new replicas.
    • -printTopology: Print the topology of the cluster. Display a tree of racks and datanodes attached to the racks as viewed by the NameNode.
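
    As a rough sketch, these options are invoked from the command line as shown below (the output will of course vary per cluster):

        # Print basic cluster statistics (capacity, live and dead DataNodes, etc.)
        bin/hadoop dfsadmin -report

        # Check, enter, or leave Safemode manually
        bin/hadoop dfsadmin -safemode get
        bin/hadoop dfsadmin -safemode enter
        bin/hadoop dfsadmin -safemode leave

        # Re-read dfs.hosts / dfs.hosts.exclude to add or decommission DataNodes
        bin/hadoop dfsadmin -refreshNodes

        # Show the rack/DataNode topology as seen by the NameNode
        bin/hadoop dfsadmin -printTopology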

     SECONDARY NAMENODE
    The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since the NameNode merges the fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster. Another side effect of a larger edits file is that the next restart of the NameNode takes longer.
    The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.
    The start of the checkpoint process on the secondary NameNode is controlled by two configuration parameters.
  • dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints, and
  • dfs.namenode.checkpoint.txns, set to 40000 by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.
The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory, so that the checkpointed image is always ready to be read by the primary NameNode if necessary.
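
A quick sketch of checking these settings and running the daemon (the getconf tool reads the effective configuration; the exact way the secondary NameNode is started may differ between Hadoop releases):

    # Inspect the checkpoint trigger settings the NameNode will use
    bin/hdfs getconf -confKey dfs.namenode.checkpoint.period
    bin/hdfs getconf -confKey dfs.namenode.checkpoint.txns

    # Run the secondary NameNode (usually on a different machine than the primary)
    bin/hdfs secondarynamenode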

CHECKPOINT NODE

NameNode persists its namespace using two files: fsimage, which is the latest checkpoint of the namespace and edits, a journal (log) of changes to the namespace since the checkpoint. When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata. The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.

The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node specified in the configuration file.
The location of the Checkpoint (or Backup) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.

The start of the checkpoint process on the Checkpoint node is controlled by two configuration parameters.
  • dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints
  • dfs.namenode.checkpoint.txns, set to 40000 by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.
The Checkpoint node stores the latest checkpoint in a directory that is structured the same as the NameNode's directory. This allows the checkpointed image to be always available for reading by the NameNode if necessary. See Import checkpoint.
Multiple checkpoint nodes may be specified in the cluster configuration file.
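
As a brief sketch of bringing up a Checkpoint node, assuming the two backup address properties above are already set in that node's configuration:

    # Confirm where the Checkpoint node and its web interface will listen
    bin/hdfs getconf -confKey dfs.namenode.backup.address
    bin/hdfs getconf -confKey dfs.namenode.backup.http-address

    # Start the Checkpoint node on that machine
    bin/hdfs namenode -checkpoint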

BACKUP NODE

The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. Along with accepting a journal stream of file system edits from the NameNode and persisting this to disk, the Backup node also applies those edits into its own copy of the namespace in memory, thus creating a backup of the namespace.

The Backup node does not need to download fsimage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary NameNode, since it already has an up-to-date state of the namespace in memory. The Backup node checkpoint process is more efficient as it only needs to save the namespace into the local fsimage file and reset edits.

As the Backup node maintains a copy of the namespace in memory, its RAM requirements are the same as those of the NameNode.

The NameNode supports one Backup node at a time. No Checkpoint nodes may be registered if a Backup node is in use. Using multiple Backup nodes concurrently will be supported in the future.
The Backup node is configured in the same manner as the Checkpoint node. It is started with bin/hdfs namenode -backup.

The location of the Backup (or Checkpoint) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.

Use of a Backup node provides the option of running the NameNode with no persistent storage, delegating all responsibility for persisting the state of the namespace to the Backup node. To do this, start the NameNode with the -importCheckpoint option, along with specifying no persistent storage directories of type edits (dfs.namenode.edits.dir) in the NameNode configuration.
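
Putting these pieces together, a minimal sketch of the two start commands involved (all other configuration is omitted here):

    # On the backup machine: run the Backup node, which streams edits from the NameNode
    bin/hdfs namenode -backup

    # On the NameNode machine: start the NameNode from the most recent checkpoint
    # (used, for example, when the NameNode keeps no persistent edits storage of its own)
    bin/hdfs namenode -importCheckpoint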

http://www.bigdatatraining.in/contact/

http://www.bigdatatraining.in/hadoop-development/training-schedule/

http://www.bigdatatraining.in/amazon-elastic-mapreduce-training/



 Mail:
info@bigdatatraining.in
Call:
+91 9789968765
044 - 42645495

Visit Us:
#67, 2nd Floor, Gandhi Nagar 1st Main Road, Adyar, Chennai - 20
[Opp to Adyar Lifestyle Super Market]

Wednesday 12 June 2013

Installing CouchDB - Get Training with Real time POC Project

Installing CouchDB

 CouchDB allows you to write a client-side application that talks directly to the Couch without the need for a server-side middle layer, significantly reducing development time. With CouchDB, you can handle increased demand simply by adding more replication nodes. CouchDB allows you to replicate the database to your client, and with filters you could even replicate that specific user’s data.
Having the database stored locally means your client side application can run with almost no latency. CouchDB will handle the replication to the cloud for you. Your users could access their invoices on their mobile phone and make changes with no noticeable latency, all whilst being offline. When a connection is present and usable, CouchDB will automatically replicate those changes to your cloud CouchDB.
CouchDB is a database designed to run on the internet of today for today’s desktop-like applications and the connected devices through which we access the internet.

Step 1 

The easiest way to get CouchDB up and running on your system is to head to CouchOne and download a CouchDB distribution for your OS (OS X in my case). Download the zip, extract it, and drop CouchDBX into your Applications folder (instructions for other OSes are on CouchOne).
Finally, open CouchDBX.

Step 2 – Welcome to Futon

After CouchDB has started, you should see the Futon control panel in the CouchDBX application. In case you can’t, you can access Futon via your browser. Looking at the log, CouchDBX tells us CouchDB was started at http://127.0.0.1:5984/ (may be different on your system). Open a browser and go to http://127.0.0.1:5984/_utils/ and you should see Futon.
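
A quick way to confirm CouchDB is actually listening is to hit the root URL with curl (the version string in the response is only indicative):

    curl http://127.0.0.1:5984/
    # {"couchdb":"Welcome","version":"1.2.0"}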

CouchDB jQuery Plugin

Futon is actually using a jQuery plugin to interact with CouchDB. You can view that plugin at http://127.0.0.1:5984/_utils/script/jquery.couch.js (bear in mind your port may be different). This gives you a great example of interacting with CouchDB.

Step 3 – Users in CouchDB

CouchDB, by default, is completely open, giving every user admin rights to the instance and all its databases. This is great for development but obviously bad for production. Let’s go ahead and set up an admin. In the bottom right, you will see “Welcome to Admin Party! Everyone is admin! Fix this”.
Go ahead and click “Fix this” and give yourself a username and password. This creates an admin account; anonymous users can still read and write all the databases, but no longer have configuration privileges.
Users in CouchDB can be a little confusing to grasp initially, especially if you’re used to creating a single user for your entire application and then managing users yourself within a users table (not the MySQL users table). In CouchDB, it would be unwise to create a single super user and have that user do all the reads and writes, because if your app is client-side then this super user’s credentials will be in plain sight in your JavaScript source code.
CouchDB has user creation and authentication baked in. You can create users with the jQuery plugin using $.couch.signup(). These essentially become the users of your system. Users are just JSON documents like everything else, so you can store any additional attributes you wish, like email for example. You can then use groups within CouchDB to control what documents each user has write access to. For example, you can create a database for that user to which they can write and then add them to a group with read access to the other databases as required.
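
As a rough sketch of the HTTP calls behind this on CouchDB 1.x (the usernames and passwords here are placeholders), an admin is created by writing to the server configuration, and a regular user signs up by inserting a document into the _users database:

    # Create a server admin (this is what ends the "Admin Party")
    curl -X PUT http://127.0.0.1:5984/_config/admins/gavin -d '"mysecretpassword"'

    # Sign up a regular user as a document in the _users database
    curl -X PUT http://127.0.0.1:5984/_users/org.couchdb.user:jill \
         -H "Content-Type: application/json" \
         -d '{"name": "jill", "password": "secret", "roles": [], "type": "user"}'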

Step 4 – Creating a Product Document

Now let’s create our first document using Futon through the following steps:
  1. Open the mycouchshop database (see the note after these steps if it does not exist yet).
  2. Click “New Document”.
  3. Click “Add Field” to begin adding data to the JSON document. Notice how an ID is pre-filled out for you; I would highly advise not changing it. Add the key “name” with the value “Nettuts CouchDB Tutorial One”.
  4. Make sure you click the tick next to each attribute to save it.
  5. Click “Save Document”.
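
Note that these steps assume the mycouchshop database already exists. If it does not, it can be created from Futon’s “Create Database” button or, equivalently, with a single PUT over HTTP (include your admin credentials if you completed Step 3; admin and password below are placeholders):

    # Create the database used throughout this tutorial
    curl -X PUT http://admin:password@127.0.0.1:5984/mycouchshop
    # {"ok":true}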

Step 5 – Updating a Document

CouchDB is an append-only database: new updates are appended to the database and do not overwrite the old version. Each new update to a JSON document with a pre-existing ID will add a new revision. This is what the automatically inserted revision key signifies. Follow the steps below to see this in action (a curl equivalent is sketched after the steps):
  • Viewing the contents of the mycouchshop database, click the only record visible.
  • Add another attribute with the key “type” and the value “product”.
  • Hit “Save Document”.
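
The same update can be made over HTTP. Because every write must reference the current revision, a sketch of the round trip looks like this (the document ID and revision values are placeholders):

    # Fetch the current document to obtain its _rev
    curl -X GET http://127.0.0.1:5984/mycouchshop/<doc-id>

    # Send the full updated document back, including the _rev just received
    curl -X PUT http://127.0.0.1:5984/mycouchshop/<doc-id> \
         -H "Content-Type: application/json" \
         -d '{"_rev": "<current-rev>", "name": "Nettuts CouchDB Tutorial One", "type": "product"}'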

Step 6 – Creating a Document Using cURL

I’ve already mentioned that CouchDB uses a RESTful interface, and the eagle-eyed reader would have noticed Futon using this via the console in Firebug. In case you didn’t, let’s prove this by inserting a document using cURL via the Terminal.
First, let’s create a JSON document with the below contents and save it to the desktop, calling the file person.json.
  {
      "forename": "Gavin",
      "surname":  "Cooper",
      "type":     "person"
  }
Next, open the terminal and execute cd ~/Desktop/ to put yourself in the correct directory, then perform the insert with curl -X POST http://127.0.0.1:5984/mycouchshop/ -d @person.json -H "Content-Type: application/json". CouchDB should return a JSON document similar to the one below.
  {"ok":true,"id":"c6e2f3d7f8d0c91ce7938e9c0800131c","rev":"1-abadd48a09c270047658dbc38dc8a892"}
This is the ID and revision number of the inserted document. CouchDB follows the RESTful convention and thus:
  • POST – creates a new record
  • GET – reads records
  • PUT – updates a record
  • DELETE – deletes a record
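
To make the mapping concrete, here is roughly what each verb looks like against the mycouchshop database (document IDs, revisions, and file names are placeholders; updates and deletes must name the current revision):

    # Create (the server assigns the ID)
    curl -X POST http://127.0.0.1:5984/mycouchshop/ -H "Content-Type: application/json" -d @person.json

    # Read
    curl -X GET http://127.0.0.1:5984/mycouchshop/<doc-id>

    # Update (full document body, including the current _rev)
    curl -X PUT http://127.0.0.1:5984/mycouchshop/<doc-id> -H "Content-Type: application/json" -d @person-updated.json

    # Delete (current revision passed as a query parameter)
    curl -X DELETE "http://127.0.0.1:5984/mycouchshop/<doc-id>?rev=<current-rev>"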

Step 7 – Viewing All Documents

We can further verify our insert by viewing all the documents in our mycouchshop database by executing curl -X GET http://127.0.0.1:5984/mycouchshop/_all_docs.

Step 8 – Creating a Simple Map Function

Viewing all documents is fairly useless in practical terms. What would be more ideal is to view all product documents. Follow the steps below to achieve this:
  • Within Futon, click on the view drop down and select “Temporary View”.
  • This is the map reduce editor within Futon. Copy the code below into the map function.
    function (doc) {
        if (doc.type === "product" && doc.name) {
            emit(doc.name, doc);
        }
    }
  • Click run and you should see the single product we added previously.
  • Go ahead and make this view permanent by saving it.
After creating this simple map function, we can now request this view and see its contents over HTTP using the following command: curl -X GET http://127.0.0.1:5984/mycouchshop/_design/products/_view/products.
A small thing to notice is how we get the document’s ID and revision by default.
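
Saving the temporary view as a permanent view is equivalent to storing it in a design document. A minimal sketch of doing the same thing over HTTP, matching the _design/products/_view/products path used above (admin credentials may be required):

    curl -X PUT http://127.0.0.1:5984/mycouchshop/_design/products \
         -H "Content-Type: application/json" \
         -d '{"views": {"products": {"map": "function (doc) { if (doc.type === \"product\" && doc.name) { emit(doc.name, doc); } }"}}}'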

Step 9 – Performing a Reduce

To perform a useful reduce, let’s add another product to our database and add a price attribute with the value of 1.75 to our first product.
  {
      "name":     "My Product",
      "price":    2.99,
      "type":     "product"
  }
For our new view, we will include a reduce as well as a map. First, we need a map function, defined below.
  function (doc) {
      if (doc.type === "product" && doc.price) {
          emit(doc._id, doc.price);
      }
  }
The above map function simply checks whether the inputted document is a product and has a price. If these conditions are met, the product's price is emitted. The reduce function is below.
  function (keys, prices) {
      return sum(prices);
  }
The above function takes the prices and returns the sum using one of CouchDB’s built-in reduce functions. Make sure you check the reduce option in the top right of the results table, as you may otherwise be unable to see the results of the reduce. You may need to do a hard refresh on the page to view the reduce option.
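
Once this map/reduce pair is saved as a permanent view, the same reduce can be queried over HTTP; the design document and view names below (price/total) are only an assumption for illustration:

    # Run the reduce (returns the summed price of all products)
    curl -X GET "http://127.0.0.1:5984/mycouchshop/_design/price/_view/total?reduce=true"

    # Or skip the reduce and see the raw mapped rows instead
    curl -X GET "http://127.0.0.1:5984/mycouchshop/_design/price/_view/total?reduce=false"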

Get Hands-on Training @ BigDataTraining.IN
Contact us:

#67,2nd Floor, 1st Main Road, Gandhi Nagar, Adyar, Chennai- 600020