Hadoop in the Cloud – Amazon EMR vs Cloudera on EC2


Cloud platforms (AWS, Microsoft Azure) free organizations from IT capacity forecasting and the associated capital expenditure of setting up a physical data center. This lets organizations scale their IT infrastructure as and when they grow.

Furthermore, the complexity of modern analytics requirements has increased significantly with the explosion of data (variety, velocity and volume). As a result, organizations are increasingly offloading their data integration to the cloud and exploring options to commoditize the underlying processing capacity, namely Hadoop clusters. In doing so, they face an important question: whether to use a managed Hadoop service (Amazon EMR, Microsoft HDInsight etc.) or a Hadoop distribution (Cloudera, Hortonworks etc.) deployed on cloud infrastructure. The answer is crucial because, as cloud services continue to evolve, the customizations built around them (scripts, configurations etc.) accumulate and eventually increase the switching costs.

Let’s look at two widely used options on AWS: an Amazon EMR cluster and a Cloudera cluster on EC2.

Auto-Scaling Cluster

Amazon EMR

EMR segregates slave nodes into two subtypes: core nodes and task nodes. A core node acts as both a data node and a worker node, whereas a task node acts only as a worker node. Apart from scalability, this segregation gives users the following key advantages:

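    • No loss of HDFS data – You can remove (scale in) task nodes without losing HDFS data, since these nodes do not act as DataNodes.
    • Lower costs – Using spot instances for the task nodes can cut their cost by up to a factor of 10, depending on spot market prices. Spot instances can be reclaimed at short notice, which task nodes tolerate because they hold no HDFS data.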
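
The core/task split described above can be sketched as an instance-group layout. The snippet below is a minimal illustration, assuming the dictionary shape that boto3's EMR `run_job_flow` call accepts under `Instances.InstanceGroups`. The helper names, instance type and node counts are hypothetical placeholders, and nothing is sent to AWS.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Return an EMR-style instance group layout (hypothetical helper).

    Core nodes hold HDFS data, so they stay on on-demand capacity.
    Task nodes only run work, so they can use cheaper spot capacity.
    The instance type is an assumption chosen for illustration.
    """
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]


def spot_roles(groups):
    """List the roles that run on spot capacity."""
    return [g["InstanceRole"] for g in groups if g["Market"] == "SPOT"]


groups = build_instance_groups(core_count=2, task_count=4)
print(spot_roles(groups))  # ['TASK']
```

Keeping only the task group on spot capacity means a reclaimed instance costs running work but never an HDFS replica.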

Additionally, Amazon CloudWatch can be used to monitor the cluster and scale it based on pre-defined rules, such as memory utilization or the number of free containers remaining.
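
A rule of this kind can be sketched as a small decision function. This is only an illustration of the logic, not how CloudWatch or EMR implements it: the function names, the two input metrics (percentage of YARN memory available and number of pending containers) and all thresholds are assumptions chosen for the example.

```python
def scaling_decision(memory_available_pct, pending_containers,
                     scale_out_below=15.0, scale_in_above=75.0):
    """Return +1 (add a task node), -1 (remove one) or 0 (no change).

    Thresholds are hypothetical defaults chosen for illustration only.
    """
    # Scale out when memory is scarce or work is queued waiting for containers.
    if memory_available_pct < scale_out_below or pending_containers > 0:
        return 1
    # Scale in when most of the memory sits idle.
    if memory_available_pct > scale_in_above:
        return -1
    return 0


def apply_decision(task_nodes, decision, min_nodes=0, max_nodes=10):
    """Clamp the new task node count to the allowed range."""
    return max(min_nodes, min(max_nodes, task_nodes + decision))


print(apply_decision(4, scaling_decision(10.0, 0)))  # 5
print(apply_decision(4, scaling_decision(90.0, 0)))  # 3
```

Only the task node count is adjusted here, in line with the point above that task nodes can be removed without touching HDFS data.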

Cloudera on EC2

Cloudera, on the other hand, does not segregate the slave nodes into core and task nodes. Removing nodes from an auto-scaling cluster is therefore risky: HDFS data is lost if every replica of a block sits on the removed nodes.
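
The risk can be made concrete with a small check that finds blocks whose replicas all live on the nodes about to be removed. This is a sketch over a hand-written block map; the function name, block IDs and node names are hypothetical, and a real cluster would obtain this information from HDFS itself.

```python
def blocks_at_risk(block_locations, nodes_to_remove):
    """Return block IDs whose every replica lives on a node being removed.

    block_locations: dict mapping block ID -> set of node names holding a replica.
    nodes_to_remove: iterable of node names planned for removal.
    """
    removing = set(nodes_to_remove)
    return sorted(
        block for block, holders in block_locations.items()
        if holders and holders <= removing  # all replicas are on removed nodes
    )


# Hypothetical cluster with a replication factor of 2.
locations = {
    "blk_1": {"node-a", "node-b"},
    "blk_2": {"node-b", "node-c"},
    "blk_3": {"node-c", "node-d"},
}
print(blocks_at_risk(locations, ["node-c", "node-d"]))  # ['blk_3']
print(blocks_at_risk(locations, ["node-d"]))            # []
```

Removing one node at a time and letting HDFS re-replicate in between keeps the second case true, at the cost of a slower scale-in.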

Advantage: Amazon EMR (for use cases where scaling is important and frequent)

High Availability

Amazon EMR

EMR continuously monitors the slave nodes and replaces any unhealthy node with a new one behind the scenes. While this reduces the work for administrators, it might lead to the loss of data stored on the unhealthy (replaced) node.

Additionally, the master node of an EMR cluster is not highly available, which makes it a single point of failure.

Cloudera on EC2

Cloudera, on the other hand, does not automatically replace an unhealthy node with a new one. However, a high-availability architecture can be deployed with Cloudera, ensuring that Hadoop services are available at all times. (For more information: http://www.cloudera.com/documentation/director/latest/topics/director_create_ha_clusters.html)

Advantage: Cloudera on EC2 (if Hadoop services need to be highly available)

Dynamic Provisioning of Cluster

Amazon EMR

You can provision a new cluster on demand within a very short span of time. Once the jobs complete successfully, the cluster can be terminated, which improves utilization and reduces costs drastically.
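
A transient cluster of this kind can be sketched as a request that runs its steps and then shuts itself down. The snippet assumes the parameter shape of boto3's EMR `run_job_flow` call and the default EMR role names. The helper name, cluster name, release label, bucket, script path and instance sizing are hypothetical placeholders, and the request is only built, not submitted.

```python
def build_transient_cluster_request(name, release_label, log_uri, step_args,
                                    instance_type="m5.xlarge", node_count=3):
    """Build a run_job_flow-style request for a run-and-terminate cluster."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            # False asks EMR to terminate the cluster once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-data",
            # Do not leave a failed cluster running and accruing cost.
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar",
                              "Args": list(step_args)},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    name="nightly-batch",                      # hypothetical
    release_label="emr-5.30.0",                # placeholder release
    log_uri="s3://example-bucket/emr-logs/",   # hypothetical bucket
    step_args=["spark-submit", "s3://example-bucket/jobs/etl.py"],
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])  # False
```

With credentials configured, a request like this would typically be passed as `boto3.client("emr").run_job_flow(**request)`; that call is left out here so the sketch runs without AWS access.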

Cloudera on EC2

Cloudera does provide Cloudera Manager/Director to create a cluster with minimal effort, but this option is not as intuitive or easy as launching an EMR cluster.

Advantage: Amazon EMR (if the cluster is only used as a commodity for data processing)

Managing Hadoop Services

Amazon EMR

Amazon EMR does not provide a management console for the Hadoop services on your cluster. This makes monitoring and managing those services complicated and frustrating.

Cloudera on EC2

Cloudera provides an easy-to-use, web-based GUI, ‘Cloudera Manager’, which helps manage and monitor the Hadoop services, the cluster and the physical host hardware. Additionally, Cloudera provides Cloudera Director to enable self-service use of Cloudera in the cloud.

Advantage: Cloudera on EC2 (if managing the Hadoop services is important)

Accessing Data on S3

Amazon EMR

Data on S3 can be accessed directly through the S3 CLI or through Hive tables. In both cases, EMR uses proprietary binaries that are highly tuned for performance. For Hive tables on S3, EMR also provides a proprietary ‘s3://’ URI scheme that invokes this proprietary code base.

Cloudera on EC2

Cloudera cannot use the AWS proprietary code base. Instead, it uses Apache binaries and the ‘s3a’ URI scheme to access data on S3.
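
One practical consequence is that paths written for EMR need their scheme changed before they work on a Cloudera cluster. The helper below is a minimal sketch of that rewrite; the function name and the example bucket are hypothetical.

```python
def to_s3a(uri):
    """Rewrite an EMR-style 's3://' URI to the Apache 's3a://' scheme.

    URIs with any other scheme (hdfs://, s3a://, ...) are returned unchanged.
    """
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


print(to_s3a("s3://example-bucket/warehouse/sales/"))   # s3a://example-bucket/warehouse/sales/
print(to_s3a("hdfs://namenode/warehouse/sales/"))       # hdfs://namenode/warehouse/sales/
```

The same substitution applies to table locations in Hive DDL when scripts are moved from one platform to the other.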

Advantage: Amazon EMR (for high-velocity, high-volume data transfer between S3 and EMR)

Customer Support Services

Amazon EMR

AWS manages both the EMR Hadoop service and the underlying AWS infrastructure. This results in less friction and overhead when getting expert support from AWS.

Cloudera on EC2

In this case, the responsibility is divided between Cloudera and AWS. While Cloudera supports Hadoop and the software on your cluster, AWS takes care of the underlying hardware. This can mean more time and overhead to resolve complex technical issues.

Advantage: Amazon EMR

Security Configurations

Amazon EMR

AWS provides various out-of-the-box options to secure the EMR cluster. These options focus primarily on network-level security using Virtual Private Cloud, private and public subnets, Security Groups, Network ACLs etc. EMR does not have an intuitive, automated way of securing your cluster from the inside. For example, EMR does not have an option to configure Kerberos seamlessly.
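
The network-level controls mentioned above can be audited with a short check for security group rules that are open to the whole internet. The sketch assumes rules shaped like the `IpPermissions` entries returned by the EC2 `describe_security_groups` call; the function name and the sample rules are hypothetical.

```python
def open_to_world(ip_permissions):
    """Return (from_port, to_port) pairs reachable from any IPv4 address."""
    exposed = []
    for rule in ip_permissions:
        cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        if "0.0.0.0/0" in cidrs:
            exposed.append((rule.get("FromPort"), rule.get("ToPort")))
    return exposed


# Hypothetical ingress rules for a cluster's master node.
rules = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},    # SSH from inside the VPC only
    {"IpProtocol": "tcp", "FromPort": 8088, "ToPort": 8088,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},      # YARN web UI open to the world
]
print(open_to_world(rules))  # [(8088, 8088)]
```

A check like this covers only the perimeter; it says nothing about authentication inside the cluster, which is the gap described above.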

Cloudera on EC2

A Cloudera cluster can have the same network-level security as described above, because the underlying infrastructure is provided by AWS.

Additionally, Cloudera makes it intuitive to configure security (Kerberos, Sentry etc.) through Cloudera Director. Cloudera also provides extensive documentation and has on-premise expertise in securing Hadoop clusters.

Advantage: Cloudera on EC2 (Ease of configuring additional security)

On-Premise and Cloud Options

Amazon EMR

An Amazon EMR cluster can only be provisioned in the AWS cloud.

Cloudera on EC2

Cloudera offers both on-premise and on-cloud options. Besides giving users more choice, this helps them reuse their on-premise expertise: experience, human resources and learnings.

Advantage: Cloudera on EC2

In summary, both Amazon EMR and Cloudera on EC2 have their advantages and limitations. Organizations can choose one over the other based on their use case, after weighing the parameters above.
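
The weighing step can be sketched as a simple weighted score. The winner per criterion is taken from the "Advantage" lines of the sections above; the function names, criterion keys and example weights are hypothetical, and each organization would supply its own weights.

```python
# Winner per criterion, as stated in the sections above.
ADVANTAGE = {
    "auto_scaling": "EMR",
    "high_availability": "Cloudera",
    "dynamic_provisioning": "EMR",
    "managing_services": "Cloudera",
    "s3_access": "EMR",
    "support": "EMR",
    "security": "Cloudera",
    "on_premise_and_cloud": "Cloudera",
}


def weighted_scores(weights):
    """Sum each option's score over the criteria an organization cares about."""
    scores = {"EMR": 0.0, "Cloudera": 0.0}
    for criterion, weight in weights.items():
        scores[ADVANTAGE[criterion]] += weight
    return scores


def recommend(weights):
    """Return the higher-scoring option, or 'tie' if the scores are equal."""
    scores = weighted_scores(weights)
    if scores["EMR"] == scores["Cloudera"]:
        return "tie"
    return max(scores, key=scores.get)


# Hypothetical weights for a team running bursty batch jobs.
weights = {"auto_scaling": 5, "dynamic_provisioning": 4,
           "security": 3, "high_availability": 2}
print(weighted_scores(weights))  # {'EMR': 9.0, 'Cloudera': 5.0}
print(recommend(weights))        # EMR
```

With equal weights on all eight criteria the result is a tie, which is why the use case has to drive the weighting.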

Figure: Comparative Analysis (Amazon EMR vs Cloudera on EC2)

Key Takeaway: Factors to consider when choosing your cloud Hadoop distribution.
