One problem that a lot of enthusiastic amateur cyclists encounter is how to make sense of all the workout telemetry data that their smartphone or cycle computer captures. Most riders have some sense of how their cadence, heart rate, speed, road grade, and wattage come into play at any given moment in a ride as it’s happening, but answering bigger-picture questions about how these fit together over time remains more difficult. I’ve been experimenting with cycling data analytics using Apache Spark for some time now, and I thought I’d share some visualizations that I put together recently to answer a question that’s been nagging me as the weather warms up here in Wisconsin.

In my last post on using Spark to process fitness data, I presented a very simple visualization based on plotting the centers of clustered GPS traces. By plotting darker center markers for denser clusters (and generating a large number of clusters), I was able to picture which roads and intersections I spent the most time riding on in the set of activities that I analyzed. This time, however, I was more interested in a visualization that would tell me what to do rather than a visualization that would tell me what I had already done.

Background

One of the most useful tools for a cyclist who is interested in quantifying his or her performance and training is a direct-force power meter. By measuring the actual force applied at some point on the bicycle drivetrain, these devices can accurately tell riders how many calories they’re burning in a ride, whether or not they’re “burning matches” (that is, using anaerobic metabolism instead of aerobic metabolism) at a given point in a race, how to pace long steady efforts to maximize performance, and precisely how hard to work in interval training in order to improve various kinds of fitness. The last of these capabilities will be our focus in this post.

It’s obvious that there is a difference between ultra-endurance efforts and sprint efforts; no one would try to sprint for an entire 40km time trial (or run a marathon at their 100m pace), and it would be pointless to do sprint-duration efforts at the sort of pace one could maintain for a 12-hour race. More generally, every athlete has a power-duration curve of the best efforts they could produce over time: an athlete’s best 5-second power might be double their best one-minute power and four times their best one-hour power, for example. There are several points where the shape of this curve changes for most people, and these correspond to various physiological systems (for example, the shift from anaerobic to aerobic metabolism). By targeting interval workouts to certain power zones, athletes can improve the corresponding physiological systems.

Technique

I began by clustering points from GPS traces, but instead of plotting the cluster centers, I plotted the convex hull of the points in each cluster. Since these hulls are polygons containing every point in my data set, they gave me a pretty good picture of where I’d actually been. I then calculated my mean power over three durations — corresponding roughly to anaerobic, VO2max, and just-above-aerobic efforts — at every point in each activity. In other words, I mapped each point in each ride to the mean power I was about to produce in that ride. Then, for each duration, I found the best efforts starting in each cluster and used these data to shade the convex hulls, so that hulls where better “best efforts” originated appear more saturated.
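
To make the windowed-power step concrete, here is a minimal sketch of the idea in Scala. This is not the actual code from my analysis sandbox: the Trackpoint type, the helper names, and the assumption of one-second samples are all illustrative, and the clustering and convex-hull computations are omitted.

// one record per second of riding; fields beyond position and power are omitted
case class Trackpoint(activity: String, offsetSecs: Int, lat: Double, lon: Double, watts: Double)

// for one activity, sampled at 1 Hz and sorted by time: map each point to the
// mean power produced over the next `duration` seconds
def forwardMeanPower(points: Vector[Trackpoint], duration: Int): Vector[(Trackpoint, Double)] =
  points.indices.map { i =>
    val window = points.slice(i, i + duration).map(_.watts)
    (points(i), window.sum / window.size)
  }.toVector

// given (cluster label, windowed power) pairs, keep the best effort starting in
// each cluster; these values drive the saturation of the hull shading
def bestEffortPerCluster(labeled: Seq[(Int, Double)]): Map[Int, Double] =
  labeled.groupBy(_._1).mapValues(_.map(_._2).max).toMap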

Because Spark is expressive and supports interactive use, it was straightforward to experiment with various techniques and constant factors to make the most sense of these data. Debugging is also simple: since I stick to effect-free code as much as possible, I can test my logic without running it under Spark. Furthermore, Spark is fast enough to make trying a bunch of different options completely painless, even on my desktop computer.

Results

I’m including here three plots of cluster hulls, shaded by the best mean power I achieved starting in that cluster for one minute (green), three minutes (blue), and ten minutes (red). With these visualizations (and with increasingly friendly road cycling weather here in Wisconsin), I can decide where to go to do interval workouts based on where I’ve had my best efforts in the past. The data tell me that if I want to work on my one-minute power, I should focus on the Timber Lane climb up from Midtown; if I want to work on my three-minute power, it’s either Barlow Road or the east side of Indian Lake; and if I want to work on my ten-minute power, it’s off to Mounds Park Road for the same climb that made everyone suffer in the national championship road race last year.

(Click and drag or zoom to inspect any map; if one is missing polygons, drag and they should render.)

Future work

I have many ideas for where to take this work next and have some implementation in progress that is producing good results but not (yet) perspicuous visualizations. However, even the more mundane things on my to-do list are pretty interesting: among other things, I’d like to do some performance evaluation and see just how much cycling data we could feasibly process on a standard workstation or small cluster (my code is currently unoptimized); to add a web-based front end allowing more interactive analysis; and to improve my (currently very simple) computational geometry and power-analysis code to make better use of Spark’s abstractions and distributed execution. (The code itself, of course, is available under the Apache license and I welcome your feedback or pull requests.)

I love tools that make it easy to sketch solutions to hard problems interactively (indeed, I spent a lot of time in graduate school developing an interactive tool for designing program analyses — although in general it’s more fun to think about bicycling problems than whether or not two references alias one another), and Spark is one of the most impressive interactive environments I’ve seen for solving big problems. I’m looking forward to prototyping and refining more tools for understanding cycling training and performance in the future.

sbt is in Fedora 20


Longtime Chapeau readers may recall last summer’s lament about the state of the Scala ecosystem in Fedora. We’ve taken a lot of steps since then. After a rough patch for the Fedora Scala package, Scala 2.10 is available and works again on all current Fedora releases and Rawhide. We’ve added more Scala packages to Fedora as well, including scalacheck, sbinary, and test-interface. Today I’m especially pleased to announce that, by the time you read this, sbt 0.13.1 will be available in Fedora 20 testing.

Having sbt available in Fedora means that we can start packaging more of the Scala ecosystem in Fedora. In fact, the sbt package in Fedora is primarily intended for building other Scala packages for Fedora. While you'll be able to use it for general Scala development, it has several important limitations in order to comply with the Fedora packaging guidelines: most notably, you won't be able to cross-compile libraries for different Scala versions with it or launch different versions of sbt for different projects. (I use sbt-extras, which I've renamed to xsbt, for my general Scala development.)

In the future (i.e. F21 and on), sbt-based Fedora builds will be greatly streamlined by improved Ivy support in an upcoming version of xmvn. For now, we have to manage dependencies somewhat manually using scripts and macros, but it's still straightforward. To get started building Scala projects for Fedora right now, check out these guidelines I wrote up for the Big Data SIG and let me know if you have any trouble. There are many example spec files using sbt among my GitHub repositories.

This was a big effort and thanks are due to several people, including Mark Harrah, who offered a lot of advice on sbt itself and gave prompt and thorough feedback on my patches; Mikołaj Izdebski, who helped a lot with my understanding of Java dependency resolution in Fedora and implemented improved support for Ivy resolution in xmvn; Rob Rati, who took on the task of reviewing the sbt package and did a thoughtful and careful job; and the Fedora Packaging Committee for their quick and helpful response to my request for a bootstrap-binary exception.

I’m looking forward to seeing new Scala projects packaged for Fedora soon!

The GNU project’s Four Freedoms present the essential components of Free software: freedom to use the program for any purpose, freedom to study and change the program’s source code, freedom to redistribute the author’s version of the code, and freedom to distribute your modifications. While not everyone who publishes their source code for others to use cares about freedom as the FSF defines it, these principles also motivate common open-source licenses. This post will discuss a surprisingly common obstacle to software freedom and show how you can avoid it in your own projects.

I'm currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its infrastructure for Fedora. (My slides are online, but they aren't particularly useful without the talk. I'll post a link to the video when it's available, though.)

If you're interested in learning more about Spark, a great place to start is the guided exercises that the Spark team put together; simply follow their instructions to fire up an EC2 cluster with Spark installed and then work through the exercises. In one of the exercises, you'll have an opportunity to build up one of the classic Spark demos: distributed k-means clustering in about a page of code.

Implementing k-means on resilient distributed datasets is an excellent introduction to key Spark concepts and idioms. With recent releases of Spark, though, machine learning can be simpler still: MLLib includes an implementation of k-means clustering (as well as several other fundamental algorithms). One of my spare-time projects has been experimenting with featurizing bicycling telemetry data (coordinates, altitude, mean maximal power, and heart rate) in order to aid self-coaching, and I've been using MLLib for this project. I don't have any results yet that are interesting from a coaching perspective, but simply using GPS coordinates as feature vectors leads naturally to an expressive visualization:

The above map visualizes about six weeks of road rides in late summer and early fall. It does so by plotting the centers of clusters; darker markers correspond to clusters that contain more trackpoints. I've generated similar maps by hand before, and Strava offers automatic activity heatmaps now, but I like the clustering visualization since it can plot routes (when run with hundreds of clusters) or plot hot areas (when run with dozens of clusters).
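
The shading logic itself is easy to sketch. This isn't the actual GPSClusterApp code, and the "opacity" property name is illustrative, but it shows the idea: emit one GeoJSON point feature per cluster center, scaled by that cluster's share of trackpoints.

// centers are (lat, lon) pairs from the k-means model; counts are trackpoints per cluster
def centersToGeoJson(centers: Array[Array[Double]], counts: Array[Long]): String = {
  val maxCount = counts.max.toDouble
  val features = centers.zip(counts).map { case (center, count) =>
    val (lat, lon) = (center(0), center(1))
    s"""{"type": "Feature",
       | "geometry": {"type": "Point", "coordinates": [$lon, $lat]},
       | "properties": {"opacity": ${count / maxCount}}}""".stripMargin
  }
  s"""{"type": "FeatureCollection", "features": [${features.mkString(", ")}]}"""
}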

Some fairly rough code to generate such a map is available in my cycling data analysis sandbox; you can download and run the app yourself. First, place a bunch of TCX files in a directory (here we're using "activities"). Then build and run the app, specifying the location of your activities directory with the "-d" parameter:

% sbt console
scala> com.freevariable.surlaplaque.GPSClusterApp.main(Array("-dactivities"))

You can influence the output and execution of the app with several environment variables: SLP_MASTER sets the Spark master (defaults to local with 8 threads); SLP_OUTPUT_FILE sets the name of the GeoJSON output file (defaults to slp.json); SLP_CLUSTERS sets the number of clusters; and SLP_ITERATIONS sets the number of k-means iterations. Once you have the GeoJSON file, you can publish it by posting it to GitHub or your favorite map hosting service.

To get started with MLLib in your own projects, make sure to add spark-mllib to your build.sbt file:

libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-incubating"

From there, it's extremely straightforward to get k-means running; here are the relevant lines from my app (vectors is an RDD of Array[Double]):

import org.apache.spark.mllib.clustering.KMeans

val km = new KMeans()
km.setK(numClusters)
km.setMaxIterations(numIterations)

// optimize a model over the RDD of feature vectors
val model = km.run(vectors)

// pair each vector with the cluster the model assigns it to
val labeledVectors = vectors.map((arr:Array[Double]) => (model.predict(arr), arr))

In just a few lines of code, we initialize a k-means object, optimize a model, and label each trackpoint with the cluster the model expects it to belong to. Since this functionality is blazing fast and available interactively from the Spark shell, we can easily experiment with different feature extraction policies and see what helps us get some insight from our data.

Apache Thrift in Fedora


You probably already know that Apache Thrift is a framework for developing distributed services and clients to access them in multiple languages. You probably also know that Thrift is extremely popular among the sorts of cool projects that those of us in the Fedora Big Data SIG find interesting. What you might not have known before you saw the title of this post is that Thrift is currently available in Fedora 19 and later versions for your testing, development, and general yum install-based happiness! Please check it out and let us know how it works with your favorite upstream projects.

(Thanks to Gil Cattaneo for an extremely heroic effort reviewing this package.)

Scala combines a lot of excellent features (functional-style pattern matching, an expressive type system, closures, etc.) with JVM compatibility and a very interesting developer ecosystem (e.g., Akka, Play, Lift, scalacheck, and Spark, just to name a few notable projects). Fedora has included a package for Scala itself for some time, but it doesn't include any of the ecosystem projects. The main obstacle to having Scala ecosystem projects in Fedora is that many projects use sbt, the Simple Build Tool, but there is no native Fedora sbt package. In this post, I'm going to discuss some of the things that make sbt very interesting as a build tool but challenging to package for Fedora, as well as the solutions I've come up with to these problems. First, however, we'll discuss some background.

I've noted in the past that there is a big tension between the Fedora model of dependency management and the models adopted by many language-specific dependency managers. Put simply, the Fedora model is that projects should depend upon system copies of the latest versions of libraries, which have been built with system tools, which were themselves built from pristine sources in a controlled environment. Language-specific models, such as the ones we see with RubyGems and rvm; Python eggs; Java projects using Maven or Ivy; and Erlang releases (especially those managed with rebar), typically allow developers more flexibility to install multiple versions of libraries, fetch dependencies from canonical locations on the web or from source repositories, and rely on different versions of language environments and language runtimes. sbt, which provides both build and dependency management, is no exception in this regard; in fact, it provides as much flexibility as any other language-specific build tool I've encountered.

Launching

You don't actually download sbt. Instead, you download a small, self-contained JAR file that will run on any Java 1.6 JRE and includes enough of sbt, Apache Ivy, and the Scala standard library to fetch the whole Scala standard library and compiler, sbt itself, and its dependencies. It can also fetch multiple versions of each of these. This approach means that it's absolutely straightforward to get started using sbt in almost any environment with a JVM, but it conflicts with Fedora policies on bundling, single versions of libraries, and pristine sources.

My solution to this problem is to develop a Fedora-specific sbt launcher that is willing to run against system copies of sbt itself, the Scala compiler and libraries, and other locally-installed JAR files.

Dependency management

sbt uses Apache Ivy to manage dependencies. Fedora has excellent support for building packages that use Maven, but Ivy is still not well-represented in Fedora. Just as with Maven, most of the concerns that Ivy is meant to handle are either addressed by RPM itself (specifying versions of dependencies, finding transitive dependencies, etc.) or do not apply to packages that meet Fedora guidelines (e.g. running different projects against different versions of their dependencies).

It is possible (but clearly suboptimal) to build Ivy-managed projects against RPM-installed dependencies by specifying an Ivy resolver pattern that ignores version numbers and finds JAR artifacts where Fedora packages put them in the filesystem, like this:

/usr/share/java/[artifact].[ext]

However, /usr/share/java isn't set up as a proper Ivy repository; it contains no Ivy module descriptor files (i.e., ivy.xml files). This isn't a problem if we're using Ivy from Ant or standalone, but sbt calls out to Ivy in a way that requires module descriptors and doesn't expose the setting to make them optional.

I have solved this problem in two ways: the first is a simple script that makes an ersatz Ivy repository from locally-installed packages, which can then be used by an sbt build. The second is a small patch to sbt that exposes the Ivy setting to make module descriptor files optional. (I use the former to build sbt binaries that include the latter.)
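
For reference, a resolver like this looks roughly as follows in an sbt 0.13 build definition. This is a sketch rather than the exact configuration that the Fedora tooling generates, and the repository name is arbitrary.

// resolve artifacts from the system JAR directory, ignoring organisation and
// version, the way Fedora lays them out under /usr/share/java
resolvers += Resolver.file("fedora-system-jars", file("/usr/share/java"))(
  Patterns("[artifact].[ext]"))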

Bootstrapping

sbt is used to build itself, as well as some of its dependencies. Fedora has a policy for packaging projects that need to bootstrap in this way, and some other build tools (like rebar) also depend on libraries that are built with that tool. Because of how sbt uses a launcher and because of its dependency management, it is trickier to bootstrap in Fedora than other similar projects (since the initial sbt binary must run locally and must incorporate other Fedora-specific patches, like the module descriptor patch above and patches to work with versions of libraries that ship in Fedora).

Where from here?

Having sbt in Fedora would remove the biggest barrier to getting a lot of the Scala ecosystem into Fedora, and sbt is a really interesting framework in its own right. However, it's one of the projects where the mismatch between what Fedora requires of upstream projects and the assumptions that contemporary developers work under is particularly pronounced. These difficulties aren't insurmountable, although I found the way that they combine and interweave somewhat daunting when I started investigating sbt.

I'm envisioning a sbt package for Fedora that provides the best of both worlds: an unrestricted sbt environment for developers who want to use Fedora but have the flexibility to target development to other Scala versions (or to use libraries that are available in Ivy repositories but not in Fedora) and a Fedora-specific sbt script that builds software against system packages in a Fedora-friendly way, much like the xmvn and mvn-rpmbuild tools were for Maven. This way, Fedora packagers would have a straightforward way to generate high-quality RPMs from Scala sources and Scala hackers who just want Fedora to meet their needs today could use the system package without restrictions (while having a path to package their projects for Fedora in the future should they choose to).

I welcome feedback and collaboration from Scala hackers who'd like to use Fedora (or other downstream distributions with similar packaging constraints) and from Fedora hackers who'd like to see Fedora as a better place for Scala.

Installing Spark on Fedora 18


The Spark project is an actively-developed open-source engine for data analytics on clusters using Scala, Python, or Java. It offers map, filter, and reduce operations over in-memory collections, data from local files, or data taken from HDFS, but unlike standard map-reduce frameworks, it offers the opportunity to cache intermediate results across the cluster (and can thus offer orders-of-magnitude improvements over standard map-reduce when implementing iterative algorithms). I’ve been using it lately and have been really impressed. However — as with many cool projects in the “big data” space — the chain of dependencies to get a working installation can be daunting.
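
As a toy illustration of why that caching matters before we get to the installation itself, consider computing a few summaries over a big file of per-second power readings from the Spark shell. The input path is a placeholder and sc is the SparkContext the shell provides; the data are read and parsed once, and every later pass reuses the in-memory copy, which is exactly the access pattern of an iterative algorithm.

val watts = sc.textFile("hdfs:///data/watts.txt")
  .map(_.trim.toDouble)
  .cache()                                   // mark the parsed values for in-memory caching

val count = watts.count()                    // first action: reads, parses, and caches
val mean  = watts.sum() / count              // later actions hit the cached copy
val hard  = watts.filter(_ > 400).count()    // e.g., seconds spent above 400 watts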

In this post, we’ll walk through setting up Spark to run on a stock Fedora 18 installation. We’ll also build the Mesos cluster manager so that we can run Spark jobs under Mesos, and we’ll build Hadoop with support for Mesos (so that we’ll have the option to run standard Hadoop MapReduce jobs under Mesos as well). By following these steps, you should be up and running with Spark quickly and painlessly.

Recall that Wallaby applies partial configurations to groups of nodes. Groups can be either explicit (that is, named subsets of nodes created by the user) or special groups that are built in to Wallaby; each node’s group memberships have a priority ordering, so that an individual node’s configuration can favor the partial configuration from one group over another. There are two kinds of special groups: the default group, which contains every node, and identity groups, each of which contains only a single node. In addition, Wallaby includes a skeleton group, which combines aspects of explicit and special groups: while it can be managed like an explicit group, all newly-created nodes become members of the skeleton group automatically. The default group is always the lowest-priority membership for a node and its identity group is always the highest-priority membership; a node’s skeleton group membership can be reprioritized or removed as necessary.

The default group is an appealing target for common pool configuration, since it is guaranteed to be applied to every node. However, because Wallaby’s configuration model is additive, it is likely not the best place for every configured feature or parameter setting that you might initially consider applying to the whole pool. For example, if a group’s partial configuration installs a feature, every node that is a member of that group will install that feature (there is no way to say “this group’s configuration removes feature F if it happens to be installed already”). Similarly, if a parameter is set on a group, every node that is a member of that group will have that parameter set; individual node configurations can override the value that the parameter is set to within the group, but there is no way to unset the parameter altogether. Therefore, if you need to enable a feature on almost every node, the default group is not the right place to install that feature. (Indeed, the default group is not the right place to put a putatively universal parameter setting or feature installation even if you can imagine a future exception to its universality.) The default group is also not a great place to put configuration that you expect to take priority over other possible group configurations, since it will always be at the lowest priority.

At this point, you may be asking yourself why you’d want to put any configuration in the default group. While I tend towards a minimal default group configuration myself, I absolutely see several use cases for the default group:

  1. Setting configuration parameters that are actually uniform across the whole pool (and will require a new value, not the absence of a value, if they change). One such parameter is FILESYSTEM_DOMAIN, which you can set to an arbitrary string in order to identify machines that access the same shared filesystem. If you were to extend your pool with machines that couldn’t access that same filesystem, you’d provide a new value for this parameter.
  2. Installing Wallaby features that you actually want installed on every node. I’d include features like Master, which ensures that the condor_master daemon is running (since Wallaby’s configuration daemon runs under the condor_master; a node can’t be configured if it isn’t running) and NodeAccess, which controls access to the pool (although the specific policy parameters required by NodeAccess may change in pool subsets).
  3. Rapidly prototyping configurations for homogeneous or small, experimental pools. When you’re first using Wallaby, the default group is a convenient way to get things running. Similarly, if most of your nodes are execute nodes with the same policies, you may be able to put most of your configuration in the default group, especially if your submit and central manager configurations are generally a superset of your execute node configuration. Fortunately, Wallaby makes it straightforward to move configuration from the default group to an explicit group if you should require more flexibility.

I’m sure there are other cases in which using the default group makes good sense, but in general, you should strongly consider using an explicit group or the skeleton group for almost all of your nearly-universal configuration. If you haven’t used the skeleton group before, read more about it!

Rob Rati and I gave a tutorial on highly-available job queues at Condor Week this year. While it was not a Wallaby-specific tutorial, we did point out that configuring highly-available job queues is easier for users who manage and deploy their configurations with Wallaby; compare the manual and automated approaches in the following slides:

(Slides: highly-available schedd configuration)

Configuring highly-available central managers (HA CMs) is rather more involved than configuring highly-available job queues. Here’s what a successful HA CM setup requires:

  • all hosts that serve as candidate central managers (CMs) must be included in the CONDOR_HOST variable across the pool
  • the had and replication daemons must be set up to run on candidate CMs
  • the HAD_LIST and REPLICATION_LIST configuration variables must include a list of candidate CMs and the ports on which the had and replication daemons are running on these hosts
  • various tunable settings related to shared-state and failure detection must be set

Wallaby includes HACentralManager, a ready-to-install feature that has sensible defaults for setting up a candidate CM. The tedious work of constructing lists of hostnames and ports — and ensuring that these are set everywhere that they must be — is a perfect fit for Wallaby’s scriptability. At the bottom of this post is a simple Wallaby shell command that sets up a highly-available central manager with several nodes serving as candidate CMs. To use it, download it and place it in your WALLABY_COMMAND_DIR (review how to install Wallaby shell command extensions if necessary). Then invoke it with

wallaby setup-ha-cms fred barney wilma betty

The above invocation will set up fred, barney, wilma, and betty as candidate CMs, place the candidate CMs in the PotentialCMs group (creating this group if necessary), and configure Wallaby’s default group to use the highly-available CM cluster. (The setup-ha-cms command takes options to put candidate CMs in a different group or apply this configuration to some subset of the pool; invoke it with --help for more information.)

Once you’ve set up your candidate CMs, be sure to activate the new configuration:

wallaby activate

Of course, wallaby activate will alert you to any problems that prevent your configuration from taking effect. Correct any errors that come up and activate again, if necessary. The setup-ha-cms command is a pretty simple example of automating configuration, but it saves a lot of repetitive and error-prone effort!

UPDATE: The command will now remove all nodes from the candidate CM group before adding any nodes to it. This ensures that if the command is run multiple times with different candidate CM node sets, only the most recent set will receive the candidate CM configuration. (The command as initially posted would apply the candidate CM configuration to every node that was in the candidate CM group at invocation time, but only those nodes that were named in its most recent invocation would actually become candidate CMs.) Thanks to Rob Rati for the observation.

Authorization for Wallaby clients


Wallaby 0.16.0, which updates the Wallaby API version to 20101031.6, includes support for assigning broker users roles that authorize them to interact with Wallaby in different ways. This post will explain how the authorization support works and show how to get started using it. If you just want to get started using Wallaby with authorization support as quickly as possible, skip ahead to the “Getting started” section below. Detailed information about which role is required for each Wallaby API method is after the jump.

Overview

Users must authenticate to the AMQP broker before using Wallaby (although some installations may allow users to authenticate as “anonymous”), but previous versions of Wallaby implicitly authorized any user who had authenticated to the broker to perform any action. Wallaby now includes a database mapping from user names to roles, which allows installations to define how each broker user can interact with Wallaby. Each method is annotated with the role required to invoke it, and each method invocation is checked to ensure that the currently-authenticated user is authorized to assume the role required by the method. The roles Wallaby recognizes are NONE, READ, WRITE, and ADMIN, where each role includes all of the capabilities of the roles that precede it.
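
Wallaby itself is implemented in Ruby, but the role ordering is simple enough to sketch in a few lines of Scala, purely as an illustration of the rule just described rather than the agent’s actual code:

object Role extends Enumeration {
  val NONE, READ, WRITE, ADMIN = Value       // declared from lowest to highest
}

// a user may invoke a method if their role is at least the role the method requires
def authorized(userRole: Role.Value, required: Role.Value): Boolean =
  userRole >= required

// authorized(Role.ADMIN, Role.READ) is true; authorized(Role.READ, Role.WRITE) is false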

If WALLABY_USERDB_NAME is set in the Wallaby agent’s environment upon startup and represents a valid pathname, Wallaby will use that as the location of the user-role database. If this variable is set to a valid pathname but no file exists at that pathname, the Wallaby user-role database will be created upon agent startup. If WALLABY_USERDB_NAME is not set, the user-role database will be initialized in memory only and thus will not persist across agent restarts.

Standard authorization

When Wallaby is about to service an API request, it:

  1. checks the role required to invoke the method.
  2. checks the authorization level specified for the user. There are several possibilities under which a user could be authorized to invoke a method:
    • the user is explicitly authorized for a role that includes the required role (e.g. the user has an ADMIN role but the method only requires READ);
    • the user is implicitly authorized for a role that includes the required role (e.g. there is an entry for the wildcard user * giving it READ access and the method requires READ access)
    • the role database is empty, in which case all authenticated users are implicitly authorized for all actions (this is the same behavior as in older versions of Wallaby)
    • the invocation is of a user-role database maintenance method and the client is authorized via shared secret (see below)
  3. if none of the conditions of the above step hold, the method invocation is unauthorized and fails with an API-level error. If the API method is invoked over the Ruby client library, it will raise an exception. If it is invoked via a wallaby shell command-line tool, it will print a human-readable error message and exit with a nonzero exit status.
  4. if the user is authorized to invoke the method, invocation proceeds normally.

Authorization with secret-based authentication

This version of the Wallaby API introduces three new methods: Store#set_user_privs, Store#del_user, and Store#users. These enable updating and reading the user-role database; the first two require ADMIN access, while the last requires READ access. Because changes to the user-role database may result in an administrator inadvertently removing administrator rights from his or her broker user, Wallaby provides another mechanism to authorize access to these methods. Each of these three methods supports a special secret option in its options argument. When the Wallaby service starts up, it loads a secret string from a file. Clients that supply the correct secret as an option to one of these calls will be authorized to invoke them, even if the broker user making the invocation is not authorized by the user-role database.

The pathname to the secret file is given by the environment variable WALLABY_SECRET_FILE. If this variable is unset upon agent startup, Wallaby will not use a shared secret (and secret-based authorization will not be available to API clients). If this variable is set and names an existing file that the Wallaby agent user can read, the Wallaby shared secret will be set to the entire contents of this file. If this variable is set and names a nonexistent file in a path that does exist, Wallaby will create a file at this path upon startup with a randomly-generated secret (consisting of a digest hash of some data read from /dev/urandom). If this variable is set to a pathname that includes nonexistent directory components, the Wallaby agent will raise an error. If you create your own secret file, ensure that it is only readable by the UNIX user that the Wallaby agent runs as (typically wallaby).
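
The secret-file behavior is easier to see as code. Here is an illustrative sketch, again in Scala rather than the agent’s actual Ruby, with the digest algorithm, random source, and buffer size chosen arbitrarily:

import java.io.{File, FileWriter}
import java.security.{MessageDigest, SecureRandom}

def loadOrCreateSecret(secretPath: Option[String]): Option[String] = secretPath.map { path =>
  val f = new File(path)
  if (f.isFile) {
    // existing file: the secret is its entire contents
    scala.io.Source.fromFile(f).mkString
  } else if (f.getParentFile != null && f.getParentFile.isDirectory) {
    // nonexistent file in an existing directory: generate and store a random digest
    val bytes = new Array[Byte](64)
    new SecureRandom().nextBytes(bytes)
    val secret = MessageDigest.getInstance("SHA-256").digest(bytes).map("%02x".format(_)).mkString
    val out = new FileWriter(f); out.write(secret); out.close()
    secret
  } else {
    // pathname with nonexistent directory components: refuse to start
    sys.error("secret file path " + path + " contains nonexistent directories")
  }
}
// if secretPath is None, no shared secret is configured and secret-based authorization is unavailable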

Caveats

The Wallaby agent’s authorization support is designed to prevent broker users from altering Condor pool configurations in excess of their authority. It is not intended to keep all configuration data strictly confidential. (This is not as bad as it might sound, since Wallaby-generated configurations are available for inspection by Condor users.) Furthermore, due to technical limitations, it is not possible to protect object property accesses over the API with the same authorization support that we use for API method invocations. Therefore, if concealing configuration data from some subset of users is important for your installation, you should prevent these users from authenticating to the broker that the Wallaby agent runs on.

Getting started

Here is a quick overview of how to get started with auth-enabled Wallaby:

  1. Stop your running Wallaby and restart your broker before starting the new Wallaby (this is necessary to pick up the new API methods). Set WALLABY_USERDB_NAME in your environment to a path where you can store the user-role database. Install and start your new Wallaby.
  2. If you’re using the RPM package, it will create a “secret file” for you in /var/lib/wallaby/secret. If not, you will need to set WALLABY_SECRET_FILE in the environment to specify a location for this secret file and then restart Wallaby. The Wallaby secret is a special token that can be passed to certain API methods (specifically, those related to user database management) in order to authorize users who aren’t authorized in the user database.
  3. Try using some of the new shell commands: wallaby set-user-role, wallaby list-users, and wallaby delete-user.
  4. Make sure that you have a secret in your secret file. Make a note of it. Try setting the role for your current broker user to READ or NONE (e.g. “wallaby set-user-role anonymous NONE”) and then see what happens when you try to run some other Wallaby shell commands. You can recover from this by passing the Wallaby secret to “wallaby set-user-role”; see its online help for details.

The default user database is empty, which will result in the same behavior as in older versions of Wallaby (viz., all actions are available to all broker users), but only until a user role is added, at which point all actions must be explicitly or implicitly authorized.


About Chapeau

  • I work for Red Hat on distributed computing projects. I hold a PhD in computer sciences from the University of Wisconsin, where I mainly worked on program analysis and concurrency.
  • On this site, I write about topics related to things I'm working on now and things I've worked on in the past: distributed computing and programming languages. I don't speak for my employer, and any opinions on this site are mine alone.
