I'm currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its infrastructure for Fedora. (My slides are online, but they aren't particularly useful without the talk. I'll post a link to the video when it's available, though.)
If you're interested in learning more about Spark, a great place to start is the guided exercises that the Spark team put together; simply follow their instructions to fire up an EC2 cluster with Spark installed and then work through the exercises. In one of the exercises, you'll have an opportunity to build up one of the classic Spark demos: distributed k-means clustering in about a page of code.
Implementing k-means on resilient distributed datasets is an excellent introduction to key Spark concepts and idioms. With recent releases of Spark, though, machine learning can be simpler still: MLLib includes an implementation of k-means clustering (as well as several other fundamental algorithms). One of my spare-time projects has been experimenting with featurizing bicycling telemetry data (coordinates, altitude, mean maximal power, and heart rate) in order to aid self-coaching, and I've been using MLLib for this project. I don't have any results yet that are interesting from a coaching perspective, but simply using GPS coordinates as feature vectors leads naturally to an expressive visualization:
The above map visualizes about six weeks of road rides in late summer and early fall. It does so by plotting the centers of clusters; darker markers correspond to clusters that contain more trackpoints. I've generated similar maps by hand before, and Strava offers automatic activity heatmaps now, but I like the clustering visualization since it can plot routes (when run with hundreds of clusters) or plot hot areas (when run with dozens of clusters).
Some fairly rough code to generate such a map is available in my cycling data analysis sandbox; you can download and run the app yourself. First, place a bunch of TCX files in a directory (here we're using "activities"). Then build and run the app, specifying the location of your activities directory with the "-d" parameter:
You can influence the output and execution of the app with several environment variables: SLP_MASTER sets the Spark master (defaults to local with 8 threads); SLP_OUTPUT_FILE sets the name of the GeoJSON output file (defaults to slp.json), SLP_CLUSTERS sets the number of clusters and SLP_ITERATIONS sets the number of k-means iterations. Once you have the GeoJSON file, you can publish it by posting it to GitHub or your favorite map hosting service.
To get started with MLLib in your own projects, make sure to add spark-mllib to your build.sbt file:
From there, it's extremely straightforward to get k-means running; here are the relevant lines from my app (vectors is an RDD of Array[Double]):
val km = new KMeans()
val model = km.run(vectors)
val labeledVectors = vectors.map((arr:Array[Double]) => (model.predict(arr), arr))
In just a few lines of code, this code initializes a k-means object, optimizes a model, and labels each trackpoint with the cluster the model expects it to belong to. Since this functionality is blazing fast and available interactively from the Spark shell, we can easily experiment with different feature extraction policies and see what helps us get some insight from our data.
You probably already know that Apache Thrift is a framework for developing distributed services and clients to access these in multiple languages. You probably also knew that Thrift is extremely popular among the sorts of cool projects that those of us in the Fedora Big Data SIG find interesting. What you might not have known before you saw the title of this post is that Thrift is currently available in Fedora 19 and later versions for your testing, development, and general yum install-based happiness! Please check it out and let us know how it works with your favorite upstream projects.
(Thanks to Gil Cattaneo for an extremely heroic effort reviewing this package.)
Scala combines a lot of excellent features (functional-style pattern matching, an expressive type system, closures, etc.) with JVM compatibility and a very interesting developer ecosystem (e.g., Akka, Play, Lift, scalacheck, and Spark, just to name a few notable projects). Fedora has included a package for Scala itself for some time, but it doesn't include any of the ecosystem projects. The main obstacle to having Scala ecosystem projects in Fedora is that many projects use sbt, the Simple Build Tool, but there is no native Fedora sbt package. In this post, I'm going to discuss some of the things that make sbt very interesting as a build tool but challenging to package for Fedora, as well as the solutions I've come up with to these problems. First, however, we'll discuss some background.
I've noted in the past that there is a big tension between the Fedora model of dependency management and the models adopted by many language-specific dependency managers. Put simply, the Fedora model is that projects should depend upon system copies of the latest versions of libraries, which have been built with system tools, which were themselves built from pristine sources in a controlled environment. Language-specific models, such as the ones we see with RubyGems and rvm; Python eggs; Java projects using Maven or Ivy; and Erlang releases (especially those managed with rebar), typically allow developers more flexibility to install multiple versions of libraries, fetch dependencies from canonical locations on the web or from source repositories, and rely on different versions of language environments and language runtimes. sbt, which provides both build and dependency management, is no exception in this regard; in fact, it provides as much flexibility as any other language-specific build tool I've encountered.
You don't actually download sbt. Instead, you download a small, self-contained JAR file that will run on any Java 1.6 JRE and includes enough of sbt, Apache Ivy, and the Scala standard library to fetch the whole Scala standard library and compiler, sbt itself, and its dependencies. It can also fetch multiple versions of each of these. This approach means that it's absolutely straightforward to get started using sbt in almost any environment with a JVM, but it conflicts with Fedora policies on bundling, single versions of libraries, and pristine sources.
My solution to this problem is to develop a Fedora-specific sbt launcher that is willing to run against system copies of sbt itself, the Scala compiler and libraries, and other locally-installed JAR files.
sbt uses Apache Ivy to manage dependencies. Fedora has excellent support for building packages that use Maven, but Ivy is still not well-represented in Fedora. Just as with Maven, most of the concerns that Ivy is meant to handle are either addressed by RPM itself (specifying versions of dependencies, finding transitive dependencies, etc.) or do not apply to packages that meet Fedora guidelines (e.g. running different projects against different versions of their dependencies).
However, /usr/share/java isn't set up as a proper Ivy repository; it contains no Ivy module descriptor files (i.e., ivy.xml files). This isn't a problem if we're using Ivy from Ant or standalone, but sbt calls out to Ivy in a way that requires module descriptors and doesn't expose the setting to make them optional.
I have solved this problem in two ways: the first is a simple script that makes an ersatz Ivy repository from locally-installed packages, which can then be used by an sbt build. The second is a small patch to sbt that exposes the Ivy setting to make module descriptor files optional. (I use the former to build sbt binaries that include the latter.)
sbt is used to build itself, as well as some of its dependencies. Fedora has a policy for packaging projects that need to bootstrap in this way, and some other build tools (like rebar) also depend on libraries that are built with that tool. Because of how sbt uses a launcher and because of its dependency management, it is trickier to bootstrap in Fedora than other similar projects (since the initial sbt binary must run locally and must incorporate other Fedora-specific patches, like the module descriptor patch above and patches to work with versions of libraries that ship in Fedora).
Where from here?
Having sbt in Fedora would remove the biggest barrier to getting a lot of the Scala ecosystem in to Fedora, and sbt is a really interesting framework in its own right. However, it's one of the projects where the mismatch between what Fedora requires of upstream projects and the assumptions that contemporary developers work under is particularly pronounced. These difficulties aren't insurmountable, although I found the way that they combine and interweave somewhat daunting when I started investigating sbt.
I'm envisioning a sbt package for Fedora that provides the best of both worlds: an unrestricted sbt environment for developers who want to use Fedora but have the flexibility to target development to other Scala versions (or to use libraries that are available in Ivy repositories but not in Fedora) and a Fedora-specific sbt script that builds software against system packages in a Fedora-friendly way, much like the xmvn and mvn-rpmbuild tools were for Maven. This way, Fedora packagers would have a straightforward way to generate high-quality RPMs from Scala sources and Scala hackers who just want Fedora to meet their needs today could use the system package without restrictions (while having a path to package their projects for Fedora in the future should they choose to).
I welcome feedback and collaboration from Scala hackers who'd like to use Fedora (or other downstream distributions with similar packaging constraints) and from Fedora hackers who'd like to see Fedora as a better place for Scala.
The Spark project is an actively-developed open-source engine for data analytics on clusters using Scala, Python, or Java. It offers map, filter, and reduce operations over in-memory collections, data from local files, or data taken from HDFS, but unlike standard map-reduce frameworks, it offers the opportunity to cache intermediate results across the cluster (and can thus offer orders-of-magnitude improvements over standard map-reduce when implementing iterative algorithms). I’ve been using it lately and have been really impressed. However — as with many cool projects in the “big data” space — the chain of dependencies to get a working installation can be daunting.
In this post, we’ll walk through setting up Spark to run on a stock Fedora 18 installation. We’ll also build the Mesos cluster manager so that we can run Spark jobs under Mesos, and we’ll build Hadoop with support for Mesos (so that we’ll have the option to run standard Hadoop MapReduce jobs under Mesos as well). By following these steps, you should be up and running with Spark quickly and painlessly.
Recall that Wallaby applies partial configurations to groups of nodes. Groups can be either explicit —- that is, a named subset of nodes created by the user, or special groups that are built-in to Wallaby; each node’s group memberships have a priority ordering, so that an individual node’s configuration can favor the partial configuration from one group over another. There are two kinds of special groups: the default group, which contains every node and a set of identity groups, each of which only contains a single node. In addition, Wallaby includes a skeleton group, which combines aspects of explicit and special groups: while it can be managed like an explicit group, all newly-created nodes become members of the skeleton group automatically. The default group is always the lowest-priority membership for a node and its identity group is always the highest-priority membership; a node’s skeleton group membership can be reprioritized or removed as necessary.
The default group is an appealing target for common pool configuration, since it is guaranteed to be applied to every node. However, because Wallaby’s configuration model is additive, it is likely not the best place for every configured feature or parameter setting that you might initially consider applying to the whole pool. For example, if a group’s partial configuration installs a feature, every node that is a member of that group will install that feature (there is no way to say “this group’s configuration removes feature F, if it happens to be installed already). Similarly, if a parameter is set on a group, every node that is a member of that group will have that parameter set; individual node configurations can override the value that the parameter is set to within the group, but there is no way to unset the parameter altogether. Therefore, if you need to enable a feature on almost every node, the default group is not the right place to install that feature. (Indeed, the default group is not the right place to put a putatively universal parameter setting or feature installation even if you can imagine a future exception to its universality.) The default group is also not a great place to put configuration that you expect to take priority over other possible group configurations, since it will always be at the lowest priority.
At this point, you may be asking yourself why you’d want to put any configuration in the default group. While I tend towards a minimal default group configuration myself, I absolutely see several use cases for the default group:
Setting configuration parameters that are actually uniform across the whole pool (and will require a new value, not the absence of a value, if they change). One such parameter is FILESYSTEM_DOMAIN, which you can set to an arbitrary string in order to identify machines that access the same shared filesystem. If you were to extend your pool with machines that couldn’t access that same filesystem, you’d provide a new value for this parameter.
Installing Wallaby features that you actually want installed on every node. I’d include features like Master, which ensures that the condor_master daemon is running (since Wallaby’s configuration daemon runs under the condor_master; a node can’t be configured if it isn’t running) and NodeAccess, which controls access to the pool (although the specific policy parameters required by NodeAccess may change in pool subsets).
Rapidly prototyping configurations for homogeneous or small, experimental pools. When you’re first using Wallaby, the default group is a convenient way to get things running. Similarly, if most of your nodes are execute nodes with the same policies, you may be able to put most of your configuration in the default group, especially if your submit and central manager configurations are generally a superset of your execute node configuration. Fortunately, Wallaby makes it straightforward to move configuration from the default group to an explicit group if you should require more flexibility.
I’m sure there are other cases in which using the default group makes good sense, but in general, you should strongly consider using an explicit group or the skeleton group for almost all of your nearly-universal configuration. If you haven’t used the skelton group before, read more about it!
Rob Rati and I gave a tutorial on highly-available job queues at Condor Week this year. While it was not a Wallaby-specific tutorial, we did point out that configuring highly-available job queues is easier for users who manage and deploy their configurations with Wallaby; compare the manual and automated approaches in the following slides:
Configuring highly-available central managers (HA CMs) is rather more involved than configuring highly-available job queues. Here’s what a successful HA CM setup requires:
all hosts that serve as candidate central managers (CMs) must be included in the CONDOR_HOST variable across the pool
the had and replication daemons must be set up to run on candidate CMs
the HAD_LIST and REPLICATION_LIST configuration variables must include a list of candidate CMs and the ports on which the had and replication daemons are running on these hosts
various tunable settings related to shared-state and failure detection must be set
Wallaby includes HACentralManager, a ready-to-install feature that has sensible defaults for setting up a candidate CM. The tedious work of constructing lists of hostnames and ports — and ensuring that these are set everywhere that they must be — can take great advantage of Wallaby’s scriptability. At the bottom of this post is a simple Wallaby shell command that sets up a highly-available central manager with several nodes serving as candidate CMs. To use it, download it and place it in your WALLABY_COMMAND_DIR (review how to install Wallaby shell command extensions if necessary). Then invoke it with
wallaby setup-ha-cms fred barney wilma betty
The above invocation will set up fred, barney, wilma, and betty as candidate CMs, place the candidate CMs in the PotentialCMs group (creating this group if necessary), and configure Wallaby’s default group to use the highly-available CM cluster. (The setup-ha-cms command takes options to put candidate CMs in a different group or apply this configuration to some subset of the pool; invoke it with --help for more information.)
Once you’ve set up your candidate CMs, be sure to activate the new configuration:
Of course, wallaby activate will alert you to any problems that prevent your configuration from taking effect. Correct any errors that come up and activate again, if necessary. The setup-ha-cms command is a pretty simple example of automating configuration, but it saves a lot of repetitive and error-prone effort!
UPDATE: The command will now remove all nodes from the candidate CM group before adding any nodes to it. This ensures that if the command is run multiple times with different candidate CM node sets, only the most recent set will receive the candidate CM configuration. (The command as initially posted would apply the candidate CM configuration to every node that was in the candidate CM group at invocation time, but only those nodes that were named in its most recent invocation would actually become candidate CMs.) Thanks to Rob Rati for the observation.
Wallaby 0.16.0, which updates the Wallaby API version to 20101031.6, includes support for authorizing broker users with various roles that can interact with Wallaby in different ways. This post will explain how the authorization support works and show how to get started using it. If you just want to get started using Wallaby with authorization support as quickly as possible, skip ahead to the section titled “Getting Started” below. Detailed information about which role is required for each Wallaby API method is after the jump.
Users must authenticate to the AMQP broker before using Wallaby (although some installations may allow users to authenticate as “anonymous”), but previous versions of Wallaby implicitly authorized any user who had authenticated to the broker to perform any action. Wallaby now includes a database mapping from user names to roles, which allows installations to define how each broker user can interact with Wallaby. Each method is annotated with the role required to invoke it, and each method invocation is checked to ensure that the currently-authenticated user is authorized to assume the role required by the method. The roles Wallaby recognizes are NONE, READ, WRITE, or ADMIN, where each role includes all of the capabilities of the role that preceded it.
If WALLABY_USERDB_NAME is set in the Wallaby agent’s environment upon startup and represents a valid pathname, Wallaby will use that as the location of the user-role database. If this variable is set to a valid pathname but no file exists at that pathname, the Wallaby user-role database will be created upon agent startup. If WALLABY_USERDB_NAME is not set, the user-role database will be initialized in memory only and thus will not persist across agent restarts.
When Wallaby is about to service an API request, it:
checks the role required to invoke the method.
checks the authorization level specified for the user. There are several possibilities under which a user could be authorized to invoke a method:
the user is explicitly authorized for a role that includes the required role (e.g. the user has an ADMIN role but the method only requires READ);
the user is implicitly authorized for a role that includes the required role (e.g. there is an entry for the wildcard user * giving it READ access and the method requires READ access)
the role database is empty, in which case all authenticated users are implicitly authorized for all actions (this is the same behavior as in older versions of Wallaby)
the invocation is of a user-role database maintenance method and the client is authorized via shared secret (see below)
if none of the conditions of the above step hold, the method invocation is unauthorized and fails with an API-level error. If the API method is invoked over the Ruby client library, it will raise an exception. If it is invoked via a wallaby shell command-line tool, it will print a human-readable error message and exit with a nonzero exit status.
if the user is authorized to invoke the method, invocation proceeds normally.
Authorization with secret-based authentication
This version of the Wallaby API introduces three new methods: Store#set_user_privs, Store#del_user, and Store#users. These enable updating and reading the user-role database; the first two require ADMIN access, while the last requires READ access. Because changes in the user-role database may result in an administrator inadvertently removing administrator rights from his or her broker user, Wallaby provides another mechanism to authorize access to these methods. Each of these three methods supports a special secret option in its options argument. When the Wallaby service starts up, it loads a secret string from a file. Clients that supply the correct secret as an option to one of these calls will be authorized to invoke these calls, even if the broker user making the invocation is not authorized by the user-role database.
The pathname to the secret file is given by the environment variable WALLABY_SECRET_FILE. If this variable is unset upon agent startup, Wallaby will not use a shared secret (and secret-based authorization will not be available to API clients). It this variable is set and names an existing file that the Wallaby agent user can read, the Wallaby shared secret will be set to the entire contents of this file. If this variable is set and names a nonexistent file in a path that does exist, Wallaby will create a file at this path upon startup with a randomly-generated secret (consisting of a digest hash of some data read from /dev/urandom). If this variable is set to a pathname that includes nonexistent directory components, the Wallaby agent will raise an error. If you create your own secret file, ensure that it is only readable by the UNIX user that the Wallaby agent runs as (typically wallaby).
The Wallaby agent’s authorization support is designed to prevent broker users from altering Condor pool configurations in excess of their authority. It is not intended to keep all configuration data strictly confidential. (This is not as bad as it might sound, since Wallaby-generated configurations are available for inspection by Condor users.) Furthermore, due to technical limitations, it is not possible to protect object property accesses over the API with the same authorization support that we use for API method invocations. Therefore, if concealing configuration data from some subset of users is important for your installation, you should prevent these users from authenticating to the broker that the Wallaby agent runs on.
Here is a quick overview of how to get started with auth-enabled Wallaby:
Stop your running Wallaby and restart your broker before starting the new Wallaby (this is necessary to pick up the new API methods). Set WALLABY_USERDB_NAME in your environment to a path where you can store the user-role database. Install and start your new Wallaby.
If you’re using the RPM package, it will create a “secret file” for you in /var/lib/wallaby/secret. If not, you will need to set WALLABY_SECRET_FILE in the environment to specify a location for this secret file and then restart Wallaby. The Wallaby secret is a special token that can be passed to certain API methods (specifically, those related to user database management) in order to authorize users who aren’t authorized in the user database.
Try using some of the new shell commands: wallaby set-user-role, wallaby list-users, and wallaby delete-user.
Make sure that you have a secret in your secret file. Make a note of it. Try setting the role for your current broker user to READ or NONE (e.g. “wallaby set-user-role anonymous NONE”) and then see what happens when you try and run some other Wallaby shell commands. You can recover from this by passing the Wallaby secret to “wallaby set-user-role”; see its online help for details.
The default user database is empty, which will result in the same behavior as in older versions of Wallaby (viz., all actions are available to all broker users), but only until a user role is added, at which point all actions must be explicitly or implicitly authorized.
Many Condor users are interested in high-availability (HA) services: they don't want their compute resources to become unavailable due to the failure of a single machine that is running an important Condor daemon. (See this talk that Rob Rati and I gave at Condor Week this year for a couple of solutions to HA with the Condor schedd.) So it's only natural that Condor users who are interested in configuring their pools with Wallaby might wonder how Wallaby responds in the face of failure.
Some of the technologies that the current version of Wallaby is built upon do not lend themselves to traditional active-active high-availability solutions. However, the good news is that due to Wallaby's architecture, almost all running Condor nodes will not be affected by a failure of the Wallaby service or the machine it is running on. Nodes that have already checked in with Wallaby will have their latest activated configurations as of the last checking. The only limitations in the event of service failure are:
new nodes will not be able to check in and get the default configuration;
it will not be possible to access older activated configurations; and
it will not be possible to alter, activate, or deploy the configuration.
These limitations, of course, disappear when the service is restored. For most users, losing the ability to update or deploy configurations due to a service failure — but not losing their deployed configurations or otherwise affecting normal pool operation — represents an acceptable risk. Some installations, especially those who aggressively exploit Wallaby's scripting interface or versioning capability, may want a more robust solution: these users might want to be able to access older versions of their activated configurations even if Wallaby is down, or they might want a mechanism to speed recovery by starting a replica of their service on another machine. In the remainder of this post, we'll discuss some approaches to provide more access to Wallaby data when Wallaby is down.
Accessing older versioned configurations
If you need to access historical versioned configurations, the easiest way to do it is to set up a cron job on the machine running Wallaby that periodically runs wallaby vc-export on your snapshot database and outputs versioned configurations to a shared filesystem. wallaby vc-export, which is documented in this post, exports all historical snapshots to plain-text files, so you can access the configuration for foo.example.com at version 1234 in a file called something like 1234/nodes/foo.example.com. This cron job needs to be able to access the filesystem path of the Wallaby snapshot database; furthermore, to run it efficiently, you'll probably want to limit the number of snapshots it processes each time; see wallaby vc-export's online help for more details.
Exporting Wallaby state to a file
If you're just interested in the state of the Wallaby service (including possibly unactivated changes), you can periodically run wallaby dump over the network. This will produce a YAML file consisting of the serialized state of the Wallaby; you can later load this file by using the wallaby load command, possibly against another Wallaby agent.
Backing up the raw database files
The easiest way to pick up and recover from a Wallaby node failure is to start a new Wallaby service with the same databases as the failed node. In turn, the easiest way to do this is by periodically copying these database files from their locations on the Wallaby node (typically in /var/lib/wallaby) to some location on a shared filesystem. The following Ruby script will safely copy the SQLite files that Wallaby uses even while Wallaby is running:
Wallaby 0.15.0 includes a new feature called the skeleton group. (This feature was available in earlier versions of Wallaby, too, but it was experimental and had some rough edges.) Find out how the skeleton group makes configuration more flexible by reading all about it.
Often, if you're trying to reproduce a problem someone else is having with Condor, you'll need their configuration. Likewise, if you're trying to help someone reproduce a problem you're having, you'll want to send along your configuration to aid them in replicating your setup. For installations that use legacy flat-file configurations (optionally with a local configuration directory), this can be a pain, since you'll need to copy several files from site to site (ensuring that you've included all the files necessary to replicate your configuration, perhaps across multiple machines on the site experiencing the problem).
If everyone involved uses Wallaby for configuration management, things can be a lot simpler: the site experiencing the problem can use wallaby dump to save the state of their configuration for an entire pool to a flat file, which the troubleshooting site can then inspect or restore with wallaby load. If the problem appears in some configuration snapshots but not in others, the reporting site can use wallaby vc-export to generate a directory of all of their configurations over time, so that the troubleshooting site can attempt to pinpoint the differences between what worked and what didn't.
(Thanks to Matt for pointing out the value of versioned semantic configuration management in reproducing problems!)
On this site, I write about topics related to things I'm working on now and things I've worked on in the past: distributed computing and programming languages. I don't speak for my employer, and any opinions on this site are mine alone.
Will Benton: Erik, I absolutely agree; this should be considered early-access stuff. read more
Erik Erlandson: It might be better to organize as: import wallaby.tagging read more
ferkeltongs: Hi Will, I came across your post while looking for read more