Cloudera Hadoop Cluster Deployment Automation Options

If you need transient/ephemeral Cloudera Hadoop (CDH) clusters for testing or training purposes, or you simply want to re-spawn a cluster from the same profile whenever necessary, you can choose from a wide range of tools that support automated CDH deployments on both cloud and bare metal platforms. This post gives an overview of the automation options and collects the related publicly available documents.

Deployment Automation in Public Cloud

On Amazon AWS, Microsoft Azure, and Google Cloud Platform, use Cloudera Director. Director can be downloaded for free, and it is officially supported by Cloudera if you purchase a Cloudera Enterprise license.

Cloudera Director 2.3 and earlier versions were far from production ready, but about a year ago Cloudera sped up its development and promoted Director to be the company's second flagship product after Cloudera Manager.

After two minor versions were released within a few months, Director got rid of a bunch of annoying issues and limitations. For example, before version 2.5 Cloudera Director ran into an unrecoverable failure each time it could not allocate enough VM instances or any of the VMs was killed during cluster bootstrap or expansion. Unless you knew the trick, you had to call Cloudera Support, who tidied up Cloudera Director using some unpublished CRaSH shell commands, as also discussed here. However, as the release notes say, this vexatious issue has been fixed as of Cloudera Director 2.5.

Though Cloudera Director has a pretty handy user interface, from an automation point of view one of its most important features is that clusters can be auto-deployed from the command line (CLI) using the JSON-style Director cluster configuration file. Auto-deploying a Kerberized, HA cluster of 25 nodes from the Director CLI takes around 30-60 minutes, whereas a manual setup can easily take 2-3 days. A few basic cluster configuration files can be found here, but, unfortunately, no comprehensive documentation is available. For example, you will not find examples of how to configure HOST properties or how to deploy Spark 2.0 from CSD+parcel.
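To make the CLI route a bit more concrete, here is a minimal Python sketch that assembles the Director client command line for a given configuration file. It assumes the client binary is called cloudera-director, that it offers the bootstrap and bootstrap-remote subcommands, and that the remote variant takes --lp.remote.* options, as I recall from the Director client documentation; verify these against your installed version. The file name cluster.conf and the host director.example.com:7189 are placeholders, not values from this post.

```python
import shlex
from typing import List, Optional


def director_bootstrap_command(
    config_path: str,
    remote_host_port: Optional[str] = None,
    username: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a Director CLI cluster bootstrap.

    Assumption: without a remote server the client runs standalone via
    'bootstrap'; with one it uses 'bootstrap-remote' plus --lp.remote.* flags.
    """
    if remote_host_port is None:
        return ["cloudera-director", "bootstrap", config_path]

    command = [
        "cloudera-director",
        "bootstrap-remote",
        config_path,
        f"--lp.remote.hostAndPort={remote_host_port}",
    ]
    if username is not None:
        command.append(f"--lp.remote.username={username}")
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    argv = director_bootstrap_command(
        "cluster.conf", "director.example.com:7189", "admin"
    )
    print(shlex.join(argv))
```

Returning a list rather than a shell string means the result can be handed to subprocess.run as is, without worrying about quoting.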

There are also some very important missing features. For example, Director cannot use an LDAP service as an external authentication provider, and Director gets disconnected from Cloudera Manager as soon as you turn on HTTPS for the Cloudera Manager UI. (The latter also enables HTTPS for the Cloudera Manager REST API, which prevents Director from communicating with Cloudera Manager.)
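If you want to check which scheme the Cloudera Manager REST API currently answers on, a small Python sketch like the following may help. It assumes the default Cloudera Manager ports (7180 for HTTP, 7183 for HTTPS) and the /api/version endpoint. The host name and the function names are made up for illustration. With a self-signed certificate, the HTTPS call will fail certificate verification unless you supply a suitable trust store.

```python
import base64
import urllib.request


def cm_api_base_url(
    host: str,
    tls_enabled: bool,
    http_port: int = 7180,   # assumed default HTTP port of Cloudera Manager
    https_port: int = 7183,  # assumed default HTTPS port of Cloudera Manager
) -> str:
    """Return the REST API base URL that matches the TLS setting."""
    if tls_enabled:
        return f"https://{host}:{https_port}/api"
    return f"http://{host}:{http_port}/api"


def fetch_api_version(base_url: str, user: str, password: str) -> str:
    """Ask Cloudera Manager for its API version string using basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/version",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode().strip()


if __name__ == "__main__":
    # Only build and print the URLs; no network call is made here.
    print(cm_api_base_url("cm.example.com", tls_enabled=False))
    print(cm_api_base_url("cm.example.com", tls_enabled=True))
```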

Despite the above-mentioned hiccups, we can say that Cloudera Director (with Cloudera Manager 5.12) has been nearly production ready since version 2.5. However, be sure to read the limitations thoroughly and understand their impact on the operation of your deployments.

Cloudera Director in Private Cloud

If the cluster is to be deployed on vSphere, you can opt for the pre-implemented vSphere Cloudera Director plugin. The specifics of this kind of cluster installation are documented here.

The community has long been waiting for support for deployments on OpenStack. Now it has arrived. As of version 5.12, Cloudera Enterprise is capable of spinning clusters up and down on OpenStack. It would be good to see the reference architecture mentioned in the article, but it has not been published on Cloudera’s web site yet.

Update: A few days after the announcement Cloudera published the OpenStack Reference Architecture.

If you use another private IaaS provider, Cloudera Director allows you to implement your own provider plugin. For an example implementation of the Bring-Your-Own-Nodes concept, go to Cloudera’s BYON GitHub project.

Deployment Automation on Bare Metal Servers

  • If you have a pool of bare metal servers, consider downloading Cloudera’s BYON GitHub project and tailoring it to your needs. The existing sources are not suitable for production use, so you will have to revise the existing code as well.
  • If implementing a plugin is more effort than you want to put into cluster automation, and configuration management and orchestration tools sound better, consider using the Cloudera Ansible Playbook implemented by James Kinley (Principal Solutions Architect, Cloudera). The tool is not officially supported by Cloudera, but I strongly encourage you to use it.
  • If you prefer Puppet to the agentless Ansible, try Mike Arnold’s deployer. Unfortunately, the project has not been touched since March 2015 and is probably no longer maintained, but it can be a good starting point for a Puppet-backed custom auto-deployer.
  • Finally, pick your favourite configuration management tool (for example, Chef, Salt, or simply shell scripts) and implement installation Path B or C based on the Cloudera Manager and CDH installation procedure given on Cloudera’s web site.
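For the configuration-management options in the list above, the first practical step is usually to describe your pool of servers to the tool. As a minimal sketch, the following Python function renders an INI-style Ansible inventory from a mapping of group names to hosts. The group names (cm_server, workers) and host names are purely hypothetical; the groups that a given playbook expects must be taken from that playbook's own documentation.

```python
from typing import Dict, List


def render_inventory(groups: Dict[str, List[str]]) -> str:
    """Render an INI-style Ansible inventory: one [group] header per key,
    followed by its hosts, with a blank line between groups."""
    lines: List[str] = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        lines.extend(hosts)
        lines.append("")  # blank separator line after each group
    # Collapse the trailing separator into a single final newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    # Hypothetical bare metal pool: one management host and two workers.
    pool = {
        "cm_server": ["cm01.example.com"],
        "workers": ["w01.example.com", "w02.example.com"],
    }
    print(render_inventory(pool), end="")
```

Generating the inventory from data rather than editing it by hand makes it easy to re-spawn the same cluster profile on a different set of servers.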

Hope it helps.
