With my latest assignment I have started exploring Hadoop and related technologies. While exploring HDFS and playing with it, I came across these two syntaxes for querying HDFS:

> hadoop dfs
> hadoop fs

Initially I could not differentiate between the two and kept wondering why we have two different syntaxes for a common purpose. I googled and found that other people had the same question; below is their reasoning:

Per Chris's explanation, it looks like there is no difference between the two syntaxes. If we look at the definitions of the two commands (hadoop fs and hadoop dfs) in $HADOOP_HOME/bin/hadoop:
...
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
elif [ "$COMMAND" = "fs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
  CLASS=org.apache.hadoop.fs.FsShell
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
  CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
...
we can see that both fs and dfs resolve to the same class, org.apache.hadoop.fs.FsShell; that is his reasoning for there being no difference.

I was not fully convinced by this, so I looked for a more convincing answer, and here are a few excerpts that made better sense to me:

fs relates to a generic file system, which can point to any file system such as the local file system, HDFS, etc., whereas dfs is very specific to HDFS. So when we use fs it can perform operations from/to the local file system or the Hadoop distributed file system, but specifying dfs ties the operation to HDFS.
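As a rough illustration of that distinction (the paths, user name and host name below are placeholders of my own, not taken from the post), the generic shell accepts any supported scheme, while dfs is documented against HDFS:

# local file system, addressed explicitly through the file:// scheme
hadoop fs -ls file:///tmp
# HDFS, using whatever default scheme the configuration provides
hadoop fs -ls /user/someuser
# fully qualified HDFS URI; namenodehost is a placeholder host name
hadoop dfs -ls hdfs://namenodehost/user/someuser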

Below are the excerpts from the Hadoop documentation which describe these two as different shells.


FS Shell
The FileSystem (FS) shell is invoked by bin/hadoop fs. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost). Most of the commands in FS shell behave like corresponding Unix commands.

DFShell
The HDFS shell is invoked by bin/hadoop dfs. All the HDFS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenode:namenodeport/parent/child or simply as /parent/child (given that your configuration is set to point to namenode:namenodeport). Most of the commands in HDFS shell behave like corresponding Unix commands.

So from the above it can be concluded that it all depends upon the configured scheme. When these two commands are used with an absolute URI, i.e. scheme://a/b, the behavior is identical. It is only the default configured scheme value, file for fs and hdfs for dfs respectively, that is the cause of the difference in behavior.
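To make that concrete (the property name, file location and host below reflect my assumption of a classic Hadoop 1.x layout, not something stated in the post), the default scheme comes from the cluster configuration, and once it is set the short and fully qualified forms behave identically:

# the default scheme is taken from fs.default.name (fs.defaultFS on newer
# releases) in core-site.xml; here it is assumed to point at hdfs://namenodehost
grep -A 2 'fs.default' $HADOOP_HOME/conf/core-site.xml
# with that in place, these two listings are equivalent for both fs and dfs
hadoop fs -ls /parent/child
hadoop fs -ls hdfs://namenodehost/parent/child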

You might be looking for a way to profile (capture method execution times in) your Spring application. Spring provides different ways to profile an application. Profiling should be treated as a separate concern, and Spring AOP facilitates an easy approach to separating this concern.

Why is locking required?

When two concurrent users try to update a database row simultaneously, there is a real chance of losing data integrity. Locking comes into the picture to avoid simultaneous updates and ensure data integrity.

I came across a problem in a coding contest. In this post, I would like to share an approach to solving it. I would definitely not say that I have invented something new, and I am not trying to reinvent the wheel.

Jenkins is an open source tool which provides continuous integration services for software development. If you want more detail about Jenkins and its history, I would suggest referring to this link. This post will help you install and configure Jenkins and create jobs to trigger Maven builds.

Peer code review is an important activity for finding and fixing mistakes that are overlooked during development. It helps improve both software quality and developers' skills. Though it is a good process for quality improvement.

If you are using Hudson as your continuous integration server and feel too lazy to open Hudson explicitly to check the build status, or to check the Hudson build status mails, there is an option to monitor Hudson builds and perform build activities from within the Eclipse IDE itself.

As a developer or architect, you often need to draw sequence diagrams to demonstrate or document your functionality. And of course, if you do this manually, you have to spend a lot of time on this activity.

If you are using JPA 2.0 with Hibernate and you want to do audit logging from the middleware itself, I believe you have landed exactly where you should be. You can try audit logging in your local environment by following this post.

When an issue is observed in production, there are two major aspects to providing a solution: first, how quickly you can analyze the root cause, and second, how quickly you can fix the issue. The story starts with analyzing the root cause.

A few months back we had a debate about using Camel vs. an enterprise service bus in a new project. I was on the Camel side; I found it hard to justify an ESB just for integration and service chaining. With this blog post, based upon my understanding, I will try to summarize when to use what.

The last time our team worked on Esper for complex event processing, it was version 3.4.0. One of the requirements we envisaged was for EPL statements to be externalized into configuration files rather than kept in code.

More than a year back, during some research related to CEP, I came across Storm, which was "touted" as a CEP engine, and it was very difficult to come to terms with these assertions.

From day one of working with Storm, I was mistaken about the modus operandi of spouts. I believed spouts could both pull and push data from their sources. But when I was about to implement a push-styled spout, I stumbled upon a few challenges.

Yesterday our Cassandra development cluster broke down. Mahendra reported that executing any statement on cassandra-cli failed with a weird 'schema disagreement error' message on the console. I googled, my usual way of being :), and found this FAQ on the Cassandra wiki.

We faced a weird issue with Hector: one of the APIs we developed to read data from Cassandra crashed when we tried integrating the DAO layer with other app dependencies. We were shocked and had no clue what went wrong, even though the code we had developed was thoroughly tested.

Cassandra composites, as discussed on the DataStax blog, influenced us to adopt composite modelling in one of our pilot projects. We used the Hector API as the client library for this assignment. Below is a column family example which uses a composite comparator type.

Prologue

Rich Internet applications (RIA) have led to a tremendous acceptance of web applications. And along with this, instead of HTML travelling back and forth, XML is interchanged to communicate the information.

Here I am writing to cover the difference between orphanRemoval=true and CascadeType.REMOVE in JPA 2.0 with Hibernate.

Having installed Graylog2 for centralized logging, we quickly wanted to test its syslog capability. Configuring the syslog daemon to redirect its logs to a remote server is easy to set up, but requires administrator privileges.

Criteria queries were introduced in JPA 2.0. With the help of criteria queries you can write your queries in a type-safe way, building them through the construction of object-based query definitions rather than query strings.


One of our current assignments demanded migrating data from HDFS to a Mongo database. The data contained in HDFS was in JSON format, and this was a plus since Mongo explicitly supports JSON documents.

I started looking for strategies for how this migration should be executed.

Liferay is one of the portal frameworks based on Java. You can create portlets in Liferay using the Spring MVC framework. This post may give you the answer to the question "How do I create a portlet in Liferay using Spring MVC?". You can set up Liferay from here.

As it happens, there are many instances one comes across while programming the client side of a web application where one has to make a trade-off between the so-called 'best practices' and the current scope of change in the application.

Having gone through Hortonworks' predictions for 2013, I find one item missing from the list. I feel this is the time when most enterprises will be looking at building PaaS infrastructure to support their business, which is what a Gartner study also reveals.

The word count example explained @ http://static.springsource.org/spring-hadoop/docs/current/reference/html/batch-wordcount.html didn’t run for me.

In a previous blog post we got to know about enterprise integration patterns (EIPs). Now, in this post, we will look into the Apache Camel framework, which realizes those patterns.

About Camel:

Apache Camel is an open source project which is almost 5 years old and has a large community of users.

A couple of issues I encountered when I started experimenting with Spring for Apache Hadoop:

One: the Hadoop job that I was running was not appearing on the Map/Reduce administration console or the JobTracker interface.

And the other: I was trying to run the job from Spring Tool Suite (STS) IDE on

A couple of setup issues observed while installing and using Hive.

In this blog entry we will go through some of the Enterprise Integration Patterns. These are well-known design patterns that aim to solve integration challenges. After reading this, you should get a head start in designing integration solutions.

[A new approach to building next-gen enterprise applications]

JavaScript has been around in the web world for a while and has become one of the most popular, well-known, well-understood, and also most hated languages available.


In a previous blog, 'Marking the map', we discussed the different frameworks and APIs available for marking geographical coordinates on a map. This edition is an extension of that; here we will discuss an algorithm which targets extracting geographical coordinates (i.e.

Yesterday we were working on setting up our first Hadoop cluster. Though there is plenty of online documentation on this, we still faced a few challenges getting it working.


UMLGraph allows the declarative specification and drawing of UML class and sequence diagrams. The specification is done in text diagrams that are then transformed into the appropriate graphical representations.


Maps representing geography-based analysis are mostly composed of multiple layers overlaid together to represent the information. Most of these maps have at least two distinct sets of layers, which can be termed base layers and information layers.

We were struggling to improve the performance of an Oracle stored procedure which extensively uses cursors. Yesterday Aditya did an amazing job improving the performance of this procedure from ~3 minutes to ~15 seconds.

OpenGeo hosts a very good reference on the technology and software stack for mapping applications. However, for my simple use case, which involved fetching geography-aware data from a relational database and rendering it on a map, this seemed to be overkill.

Dependency mediation - this determines which version of a dependency will be used when multiple versions of an artifact are encountered. Currently, Maven 2.0 only supports using the "nearest definition", which means that it will use the version of the closest dependency in the tree of dependencies.
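As a quick way to see this mediation in action (a sketch against a hypothetical project, not something from the original post), the Maven dependency plugin can print the resolved tree along with the versions it discarded:

# prints the resolved dependency tree; with -Dverbose it also shows the
# versions that were omitted because a nearer definition won the mediation
mvn dependency:tree -Dverbose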

We faced a serialization/deserialization issue wherein a cyclic dependency between entities prevented us from using a bidirectional relationship.

Every so often we come across situations where we need to initialize containers with static data. The data is read once from data sources and cached in memory. The code that initializes the static cache needs to read the data once and store it in a collection object.

Before starting to build the UI for our new project, I did some research on different AJAX RIA frameworks. Below are my findings:

1.
