Deploying an ad-hoc IPython cluster on Joyent in 20 minutes for 15 cents

This is a how-to for deploying an ad hoc IPython cluster for parallel computation on the Joyent Public Cloud (or other compatible ones, like Telefónica’s InstantServers or a private SmartDataCenter deployment). Most of this blog post — beyond the provisioning and some configuration detection — is adaptable to a private SmartOS deployment, as well.

One of the motivating factors for this work was seeing the easy integration of IPython and StarCluster. I enjoy the computing platform that Joyent has developed, and find it far easier, faster, and more trustworthy to use than other choices, so I sought to establish the foundations for similarly quick provisioning of an ad hoc cluster. This page by no means presents a complete solution, but I hope that it can get others started working on their own variants.

Before continuing, please be aware of IPython’s security model. To keep this post to a readable length, this solution does not cover provisioning with SSH key management, so it is not well-suited for a long-running cluster deployed on a public cloud.

This is a long article that looks at a lot of details along the way. If you’re impatient, you may “safely” skip to the deployment section, below, so long as you solemnly vow to eventually read the details of the scripts I wrote and that you just installed on your machine.

Examining the internals

In the previous installment in this series, I discussed how to use Joyent’s flavor of pkgsrc to build packages outside of the webapp mainstream. I used IPython as an example, partly as a preface to this tutorial, partly as an example of writing a custom SMF manifest. Let’s look at that manifest now to see what we’re running.

<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='pkgsrc/py27-ipython' type='service' version='0'>
    <dependency name='network' grouping='require_all' restart_on='error' type='service'>
      <service_fmri value='svc:/milestone/network:default'/>
    </dependency>
    <dependency name='filesystem-local' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/system/filesystem/local:default'/>
    </dependency>
    <dependency name='mdata' grouping='require_all' restart_on='error' type='service'>
      <service_fmri value='svc:/smartdc/mdata:fetch'/>
    </dependency>
    <method_context>
      <method_credential group='www' user='www'/>
      <method_environment>
        <envvar name='PATH' value='/opt/local/bin:/opt/local/sbin:/usr/sbin:/usr/bin'/>
        <envvar name='IPYTHONDIR' value='/opt/local/share/ipython'/>
      </method_environment>
    </method_context>
    <instance name='engine' enabled='false'>
      <exec_method name='start' type='method' exec='/opt/local/bin/ipengine --url=%{config/url} --Session.key=%{config/exec_key} --timeout=%{config/timeout} --profile=%{config/profile} %{config/args}' timeout_seconds='60'/>
      <exec_method name='stop' type='method' exec=':kill' timeout_seconds='60'/>
      <property_group name='config' type='application'>
        <propval name='url' type='astring' value='tcp://127.0.0.1:59999'/>
        <propval name='exec_key' type='astring' value=''/>
        <propval name='timeout' type='integer' value='4'/>
        <propval name='profile' type='astring' value='engine'/>
        <propval name='args' type='astring' value='--IPEngineApp.auto_create=True'/>
      </property_group>
      <property_group name="startd" type="framework">
          <propval name="duration" type="astring" value="child"/>
      </property_group>
    </instance>
    <instance name='controller' enabled='false'>
      <exec_method name='start' type='method' exec='/opt/local/bin/ipcontroller --ip=%{config/ip} --profile=%{config/profile} %{config/args} --reuse' timeout_seconds='60'/>
      <exec_method name='stop' type='method' exec=':kill' timeout_seconds='60'/>
      <property_group name='config' type='application'>
        <propval name='ip' type='astring' value='127.0.0.1'/>
        <propval name='profile' type='astring' value='default'/>
        <propval name='args' type='astring' value='--HeartMonitor.period=7500'/>
      </property_group>
      <property_group name="startd" type="framework">
          <propval name="duration" type="astring" value="child"/>
      </property_group>
    </instance>
    <stability value='Unstable'/>
    <template>
      <common_name>
        <loctext xml:lang='C'>IPython</loctext>
      </common_name>
    </template>
  </service>
</service_bundle>

If you’re not familiar with the Service Management Facility (SMF) framework from Solaris, don’t be intimidated. This defines a general service named svc:/pkgsrc/py27-ipython with two instances: controller and engine.

An IPython Controller coordinates the cluster and provides an interface for clients to connect to. In this quickstart-oriented implementation of IPython as SMF services, I chose to drive basic configuration primarily through command-line options, while still allowing finer-grained customization via IPython profiles. IPython profile management is out of the scope of this how-to.

We are interested in the start method of the controller:

<exec_method name='start' type='method' 
  exec='/opt/local/bin/ipcontroller --ip=%{config/ip} --profile=%{config/profile} %{config/args} --reuse'
  timeout_seconds='60'/>

This runs the ipcontroller command, telling it to reuse its existing connection information, to use a configurable profile name, and to listen on a configurable IP address, with some additional arguments that can be customized later.

The custom INSTALL script I wrote for pkgsrc installation of IPython (currently SmartOS-centric) creates the default profile the service uses, in a system-wide standard directory. It defaults to listening on the machine’s private IP address, and implicitly requests IPython to create its own randomized session key and listening port. These are written to the /opt/local/share/ipython/profile_default directory:

IPYTHONDIR=${PREFIX}/share/ipython ${PREFIX}/bin/ipython profile create --parallel --profile=default
/opt/local/gnu/bin/timeout 4 ${PREFIX}/bin/ipcontroller --profile=default --ip=${PRIVATE} --location=${PRIVATE} --reuse --ipython-dir=${PREFIX}/share/ipython

Another design choice was to run the IPython services as a lower-privileged user. I chose www as an expedient, but this is customizable.

chown -R www ${PREFIX}/share/ipython
chgrp -R www ${PREFIX}/share/ipython

Keep in mind that each IPython installation via pkgsrc can act as a controller by default, and thus creates its own default profile, with its own randomized values, defaulting to listen on its own private IP address. However, if we want an IPython engine to connect to a cluster, we need to be able to pass along the correct configuration for the central controller that the clients will connect to. Most of the IPython documentation presupposes a shared filesystem. That’s not currently (March 2013) very easy with Joyent, so we want to be able to configure engines out-of-band.

The service instance definition of the engine therefore allows for more command-line configuration than most examples show:

<instance name='engine' enabled='false'>
  <exec_method name='start' type='method' 
    exec='/opt/local/bin/ipengine --url=%{config/url} --Session.key=%{config/exec_key} --timeout=%{config/timeout} --profile=%{config/profile} %{config/args}' 
    timeout_seconds='60'/>
  <exec_method name='stop' type='method' exec=':kill' timeout_seconds='60'/>
  <property_group name='config' type='application'>
    <propval name='url' type='astring' value='tcp://127.0.0.1:59999'/>
    <propval name='exec_key' type='astring' value=''/>
    <propval name='timeout' type='integer' value='4'/>
    <propval name='profile' type='astring' value='engine'/>
    <propval name='args' type='astring' value='--IPEngineApp.auto_create=True'/>
  </property_group>
  <property_group name="startd" type="framework">
      <propval name="duration" type="astring" value="child"/>
  </property_group>
</instance>

An engine is launched with the ipengine command and is passed a URL and a session token along with some other custom arguments. This executes as a long-running process managed by SMF.

Boot scripts for deployment

The preceding is all encapsulated in the pkgsrc package, and should auto-run and yield a default configuration that’s fine for a single-machine setup.

To configure a controller capable of coordinating more machines, here’s a basic boot script that can be run upon initial boot to help automate deployment:

#!/usr/bin/bash

# startup-controller.sh
# by Adam T. Lindsay, 2013
# free to use and modify

export PRIVATE=$(ifconfig -a | /opt/local/gnu/bin/grep -A 1 net1 | /opt/local/gnu/bin/awk '/inet/ { print $2 }');
export ABI=`/opt/local/bin/bmake -D BSD_PKG_MK -f /opt/local/etc/mk.conf -V ABI`

/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-zmq-2.1.11.tgz
/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-tornado-2.3.tgz
/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-ipython-0.13.1.tgz
/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-pygments-1.5.tgz     
/opt/local/bin/pkgin -y install py27-sqlite3
/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-matplotlib-1.2.0.tgz

/usr/sbin/svccfg -s py27-ipython:controller setprop config/ip=astring: $PRIVATE
/usr/sbin/svcadm refresh py27-ipython:controller
/usr/sbin/svcadm enable py27-ipython:controller

After setting basic environment variables, the script connects to my small binary pkgsrc repository — you’re free to build your own if you don’t want to trust an unknown blogger as a source for binaries. It installs the IPython cluster basics, the dependencies needed for running an IPython notebook instance (but configuration thereof is not discussed here), and matplotlib, which is not only very cool to use with the IPython notebook, but also pulls in NumPy as a dependency, which gives some power to our parallel computation.

(I also have built packages like Pandas, SymPy, SciPy, and NetworkX for SmartOS with pkgsrc. Feel free to customize the script to install what you like. The script itself is hosted on a Gist.)

The key configuration here is to set the IPython controller’s listening interface to the private IP address, not localhost. If configuring for SSH tunneling, you may want to change it back to localhost.

The script then ends by telling SMF to start the controller service.

The crucial connection information is now held in /opt/local/share/ipython/profile_default/security/ipcontroller-engine.json. All we need to do is fetch that file, and we have everything needed to connect engines to the controller. Although most IPython tutorials instruct you to distribute that file to the engines wholesale, we’re going to be more focused in our approach: we’ll parse the file and pass the relevant values along using the SmartDataCenter metadata service.
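
For the curious, here is roughly what you would see if you parsed that file on the controller itself. This is a sketch assuming IPython 0.13’s connection-file format; the port and session key are randomized per installation, so your values will differ:

import json

# Inspect the engine connection file on the controller itself.
# (A sketch assuming IPython 0.13's format; values are randomized per install.)
with open('/opt/local/share/ipython/profile_default/security/ipcontroller-engine.json') as f:
    engine_info = json.load(f)

print engine_info['url']       # e.g. tcp://192.168.24.24:57697, the registration URL
print engine_info['exec_key']  # the randomized session key engines must present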

Presuming that such data is passed (we’ll see how when we run through the deployment), we use it in the boot script for the engine:

#!/usr/bin/bash

# startup-engine.sh
# by Adam T. Lindsay, 2013
# free to use and modify

export ABI=`/opt/local/bin/bmake -D BSD_PKG_MK -f /opt/local/etc/mk.conf -V ABI`

/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-zmq-2.1.11.tgz
/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-ipython-0.13.1.tgz
/opt/local/sbin/pkg_add http://pkgsrc.atl.me/2012Q4/$ABI/All/py27-numpy-1.6.2.tgz

/usr/sbin/svccfg -s py27-ipython:engine setprop config/url=astring: `/usr/sbin/mdata-get ipython.url`
/usr/sbin/svccfg -s py27-ipython:engine setprop config/exec_key=astring: `/usr/sbin/mdata-get ipython.key`
/usr/sbin/svcadm refresh py27-ipython:engine
/usr/sbin/svcadm enable py27-ipython:engine

This script only installs the IPython parallel basics, plus NumPy for some more computational oomph. The script gets the IPython URL and session key via the mdata-get command, one of my favorite hidden treasures on Joyent’s platform, and passes them into the IPython engine’s service configuration. (Other, more serious deployments might use services like ZooKeeper, but we’re starting from scratch, creating our own ad hoc deployment with a minimum of external dependencies.)

The service starts, and if everything goes as expected, it connects with the IPython controller.

Deploying a cluster

With all the tedious details of what goes on behind the scenes out of the way, let’s look at how setting up an IPython cluster could be scripted. The tool I favor (and show below) is my own py-smartdc library for connecting to the Joyent SDC API, used in conjunction with the IPython REPL. There’s nothing in the following instructions that gets in the way of shell or Node.js scripting with the standard sdc command-line tools, but I find the easy scriptability from within Python (or, say, using Fabric) makes for a seamless, familiar, flexible experience. Again, the idea was to use a minimum of external tools, both so that people can get started quickly and to maximize adaptability to their own workflows.

If you would like more of the hows and whys of connecting to Joyent’s Cloud API with Python, you can go to the py-smartdc tutorial and my quick provisioning walkthrough written earlier. They provide more detail and context than the current article.

Also be aware that executing these commands requires a valid Joyent Cloud account, and will cost approximately US$0.15 to start a controller and four engines and tear them down within an hour. Fortunately, at the time of this writing, Joyent appears to be running a free trial with enough credit to run the small cluster of five minimal (512 MiB RAM) instances, as described below, for over one month.

After installing smartdc, start an interactive Python session. If you want to follow along, this session is available as an IPython Notebook (HTML) (Gist) (raw):

import time, sys
import json
from smartdc import DataCenter
import paramiko

sdc = DataCenter('eu-ams-1', key_id='/username/keys/keyname', verbose=True)

Connect to the DataCenter of your choosing, filling in your username and key identifier.

ipc = sdc.create_machine(dataset='sdc:sdc:base64:', package='Extra Small 512 MB', 
  tags={'cluster': 'ipython', 'role': 'controller'},
  boot_script='./startup-controller.sh', name='ipcontroller')
ipc.poll_until('running')

Choose your dataset (I chose the most recent baseline 64-bit image) and machine size. I attach machine tags to this instance in order to quickly find the machine instance group via the Joyent API later. I also point to the controller boot script (listed earlier), and give the instance a name. We then pause until the machine is running.
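
If you’re not sure which dataset or package values are valid in your datacenter, py-smartdc can list what’s available. This is a sketch that assumes the datasets() and packages() listing calls described in the py-smartdc documentation, and the field names returned by the underlying Cloud API:

# List available datasets (by URN) and packages (by name) to pick values
# for create_machine(). Assumes py-smartdc's listing helpers.
for ds in sdc.datasets():
    print ds['urn']

for pkg in sdc.packages():
    print pkg['name']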

Let’s define a little helper function that takes an open (paramiko) SSH connection and polls the machine until an SMF service is up (or times out). We want this function as a roadblock, pausing further operation until the boot script has completely finished running and we have some confidence that the controller’s engine configuration has been written.

def wait_for_svc(connection, fmri, interval=3, timeout=315):
    SERVICE_POLL = 'svcs -H -o STA,NSTA %s' % fmri
    for _ in xrange(timeout//interval):
        # Take stdout and split it into a two-tuple of "state, nextstate"
        states = tuple(connection.exec_command(SERVICE_POLL)[1].read().strip().split())
        if states == ('ON', '-'):
            print >>sys.stderr
            return True
        elif states == ('MNT', '-'):
            raise StandardError('Bootscript failed: now in maintenance mode')
        elif states == ('OFF', 'ON'):
            # heartbeat
            print >>sys.stderr, '.',
        else:
            # slightly unusual state
            print >>sys.stderr, '?',
        time.sleep(interval)
    raise StandardError('Timeout')

It’s no “expect”-level tool, but it’s been good in limited testing as a minimal indicator of service status, reusing the SSH connection that we already need in order to grab data from the server.

Let’s open that connection, wait for the boot script to run (on SmartOS, the boot script is executed by the svc:/smartdc/mdata:execute service, which is why that is the FMRI we poll), and then grab that data:

ssh_conn = paramiko.SSHClient()
ssh_conn.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_conn.connect(ipc.public_ips[0], username='root')
wait_for_svc(ssh_conn, 'svc:/smartdc/mdata:execute')

_, rout, _ = ssh_conn.exec_command('cat /opt/local/share/ipython/profile_default/security/ipcontroller-engine.json')
ipcontroller = json.load(rout)

ssh_conn.close()

The key take-away is that we get a small dict containing the crucial connection information from the IPython controller node. We can then pass that on as metadata to as many other machines as we like. Let’s instantiate four engines:

for _ in xrange(4):
    sdc.create_machine(dataset='sdc:sdc:base64:', 
        package='Extra Small 512 MB', 
        metadata={'ipython.url': ipcontroller['url'], 
                  'ipython.key': ipcontroller['exec_key']},
        tags={'cluster': 'ipython', 'role': 'engine'},
        boot_script='./startup-engine.sh')

We use the engine boot script examined above. We’ve also added machine tags to group these instances in the same cluster as the controller, while identifying them with a different role in case more fine-grained management is needed. Most importantly, we have passed the connection information obtained from the IPython controller instance to the new machines as metadata, which is then used by the engine boot script.

There’s no need to bother with a unique name, as an anonymous UUID-style name will be assigned automatically; nor do we even need to refer to these instances again directly through our provisioning scripts, as it’s more important to identify each machine by its role as an IPython engine. Each is an anonymous cog in the cluster.
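
Those tags pay off whenever you do need to single the engines out again. For instance, you can look them up by role with the same machines() call we’ll use at teardown time:

# Find the engine instances in this cluster by their tags
engines = sdc.machines(tags={'cluster': 'ipython', 'role': 'engine'})
print len(engines), 'engines provisioned'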

However, the IPython controller machine matters. Look at the public IP address of that machine instance so that you can log in:

>>> ipc.public_ips
[u'37.153.98.199']

If you were to log into the server at this point and watch the controller service, you might see something like the following scroll by:

$ ssh root@37.153.98.199
...
# tail -f  /var/svc/log/pkgsrc-py27-ipython:controller.log
2013-03-01 22:00:09.287 [IPControllerApp] Hub listening on tcp://192.168.24.24:57697 for registration.
2013-03-01 22:00:09.310 [IPControllerApp] Hub using DB backend: 'NoDB'
2013-03-01 22:00:09.564 [IPControllerApp] hub::created hub
2013-03-01 22:00:09.565 [IPControllerApp] task::using Python leastload Task scheduler
2013-03-01 22:00:09.565 [IPControllerApp] Heartmonitor started
2013-03-01 22:00:09.646 [IPControllerApp] Creating pid file: /opt/local/share/ipython/profile_default/pid/ipcontroller.pid
2013-03-01 22:00:09.691 [scheduler] Scheduler started [leastload]
2013-03-01 22:00:09.914 [IPControllerApp] client::client '4ab01efb-6500-4b37-bdb9-fdf7cf92550f' requested 'registration_request'
2013-03-01 22:00:24.565 [IPControllerApp] registration::finished registering engine 0:'4ab01efb-6500-4b37-bdb9-fdf7cf92550f'
2013-03-01 22:00:24.567 [IPControllerApp] engine::Engine Connected: 0
2013-03-01 22:05:46.023 [IPControllerApp] client::client '2935378d-bb37-4997-8c5f-d87104968b07' requested 'registration_request'
2013-03-01 22:05:54.565 [IPControllerApp] registration::finished registering engine 1:'2935378d-bb37-4997-8c5f-d87104968b07'
2013-03-01 22:05:54.566 [IPControllerApp] engine::Engine Connected: 1
2013-03-01 22:06:17.258 [IPControllerApp] client::client 'f2c6de71-2976-43b5-ba70-6915f5154f85' requested 'registration_request'
2013-03-01 22:06:21.174 [IPControllerApp] client::client '0a276722-936f-4736-a54f-2011ed4952c2' requested 'registration_request'
2013-03-01 22:06:32.065 [IPControllerApp] registration::finished registering engine 3:'0a276722-936f-4736-a54f-2011ed4952c2'
2013-03-01 22:06:32.066 [IPControllerApp] engine::Engine Connected: 3
2013-03-01 22:06:32.067 [IPControllerApp] registration::finished registering engine 2:'f2c6de71-2976-43b5-ba70-6915f5154f85'
2013-03-01 22:06:32.067 [IPControllerApp] engine::Engine Connected: 2

Using the cluster

Once confident that the cluster is up and connected, you can try running some example files to get a feel for how parallel computation can improve performance:

# export IPYTHONDIR=/opt/local/share/ipython 
# cd /opt/local/share/doc/ipython/examples/parallel/wave2D
# python parallelwave.py -g 400 400 -t 10. -p 1 1
Running [400, 400] system on [1, 1] processes until 10.000000
vector inner-version, Wtime=44.9788, norm=0.129738
# python parallelwave.py -g 400 400 -t 10. -p 4 1 
Running [400, 400] system on [4, 1] processes until 10.000000
vector inner-version, Wtime=18.2465, norm=0.129738

… or run IPython to play with data in parallel interactively:

# IPYTHONDIR=/opt/local/share/ipython ipython

(Note that we have created profile directories that are not specific to any one user. By pointing the IPYTHONDIR environment variable at the system-wide location, we let the IPython libraries find the default profile there.)

>>> from IPython.parallel import Client
>>> rc = Client()
>>> rc.ids
[0, 1, 2, 3]
>>> dv = rc[:]
>>> with dv.sync_imports():
...     import numpy
...
importing numpy on engine(s)
>>> %px a = numpy.random.rand(2,2)
>>> %px numpy.linalg.eigvals(a)
Out[0:2]: array([ 1.08148666, -0.45514645])
Out[1:2]: array([ 0.16029289,  0.90288354])
Out[2:2]: array([ 0.09095063,  1.00347168])
Out[3:2]: array([ 1.30304228, -0.04250421])
>>> dv.block=True
>>> %autopx
%autopx enabled
>>> max_evals = []
>>> for i in range(100):
...     a = numpy.random.rand(10,10)
...     a = a+a.transpose()
...     evals = numpy.linalg.eigvals(a)
...     max_evals.append(evals[0].real)
>>> print "Average max eigenvalue is: %f" % (sum(max_evals)/len(max_evals))
[stdout:0] Average max eigenvalue is: 10.082603
[stdout:1] Average max eigenvalue is: 10.037152
[stdout:2] Average max eigenvalue is: 10.072430
[stdout:3] Average max eigenvalue is: 10.227973
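
As an aside, rather than exporting IPYTHONDIR as above, you can point the parallel client directly at the client connection file that ipcontroller wrote. A minimal sketch, assuming IPython 0.13’s Client accepts a connection-file path as its first argument and the system-wide profile path used by this pkgsrc setup:

from IPython.parallel import Client

# Point the client straight at the controller's client connection file
rc = Client('/opt/local/share/ipython/profile_default/security/ipcontroller-client.json')
print rc.ids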

Shutting down your machines

When you’re finished computing the next large prime number, you can return to your SmartDataCenter session on your local machine and turn off all of the machines you’ve just provisioned.

from operator import methodcaller
cluster = sdc.machines(tags={'cluster': 'ipython'})
map(methodcaller('stop'), cluster)
map(methodcaller('poll_until', 'stopped'), cluster)
map(methodcaller('delete'), cluster)
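
Once the deletions have gone through, listing the cluster tag again is a cheap way to confirm that nothing is left running (and billing):

# Sanity check: after deletion completes, this should come back empty
print sdc.machines(tags={'cluster': 'ipython'})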

Conclusion

Although there’s a lot more work to be done in refining the simple IPython deployment picture presented here, allowing for more complex deployments, SSH-tunneled connections, and remote clients, I believe a lot is possible with these simple pieces.

This blog post looked at the custom scripts and declarations that turned IPython into an SMF service and set up a default profile upon installation. That integration as an easily installable and configurable service was essential to the rapid provisioning and deployment that followed. The article then dissected the provisioning scripts that install, configure, and start the IPython services on initial machine boot. Again, installation and configuration via adaptive scripts was instrumental in getting the two types of machines in the cluster to connect to each other without having to log in interactively or otherwise deal with heavy configuration management.

There was then a brief walk-through, using Python, of connecting to the Joyent Cloud API, provisioning a controller machine, extracting information from it once running and configured, and passing that information to an arbitrary number of compute engine machines. Each of those machines automatically connects to the central controller once it has been provisioned and has run its boot script. We then ran some example commands on the controller machine to verify that the cluster was indeed computing in parallel. Finally, we shut down and deleted the machines we had provisioned for the cluster.


Personal note: I will be in the San Francisco Bay Area for PyCon2013 and a half-week before and after (10—21 March 2013). I’d love to speak to folks interested in this and similar topics. If you want to say hi, you should contact me on Twitter or by email to hello at this.domain.

(Although the chosen subjects in this series might suggest some qualities in this direction, I wouldn’t primarily consider myself a “DevOps” person. I’m a hacker with an interest in distributed systems who enjoys making tools that enable greater things. To me, Joyent and Python/IPython are each fantastic platforms that can enable a lot of exciting things. I like to synthesize systems to achieve more than the sum of their parts, and to share that knowledge with others.)