Highly-available configuration data with Wallaby

htcondor
mrg
wallaby
Published August 29, 2012

Many Condor users are interested in high-availability (HA) services: they don’t want their compute resources to become unavailable due to the failure of a single machine that is running an important Condor daemon. (See this talk that Rob Rati and I gave at Condor Week this year for a couple of solutions to HA with the Condor schedd.) So it’s only natural that Condor users who are interested in configuring their pools with Wallaby might wonder how Wallaby responds in the face of failure.

Some of the technologies that the current version of Wallaby is built upon do not lend themselves to traditional active-active high-availability solutions. The good news, however, is that due to Wallaby’s architecture, almost all running Condor nodes will be unaffected by a failure of the Wallaby service or the machine it is running on: nodes that have already checked in with Wallaby will retain the latest configurations activated as of their last check-in. The only limitations in the event of a service failure are:

  1. new nodes will not be able to check in and get the default configuration;
  2. it will not be possible to access older activated configurations; and
  3. it will not be possible to alter, activate, or deploy the configuration.

These limitations, of course, disappear when the service is restored. For most users, a service failure that temporarily prevents updating or deploying configurations, but leaves deployed configurations and normal pool operation intact, represents an acceptable risk. Some installations, especially those that aggressively exploit Wallaby’s scripting interface or versioning capability, may want a more robust solution: such users might want to access older versions of their activated configurations even while Wallaby is down, or they might want a mechanism to speed recovery by starting a replica of the service on another machine. In the remainder of this post, we’ll discuss some approaches to providing more access to Wallaby data when Wallaby is down.

Accessing older versioned configurations

If you need to access historical versioned configurations, the easiest approach is to set up a cron job on the machine running Wallaby that periodically runs wallaby vc-export on your snapshot database and writes the versioned configurations to a shared filesystem. wallaby vc-export, which is documented in this post, exports all historical snapshots to plain-text files, so the configuration for foo.example.com at version 1234 winds up in a file called something like 1234/nodes/foo.example.com. This cron job needs access to the filesystem path of the Wallaby snapshot database; furthermore, to keep each run efficient, you’ll probably want to limit the number of snapshots it processes at a time (see wallaby vc-export’s online help for details).
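
As a sketch of this setup, the crontab entry below exports snapshots hourly into a directory on a shared filesystem. The path is hypothetical, it assumes vc-export writes its output beneath the current working directory, and any options for limiting the number of snapshots processed per run are omitted; check wallaby vc-export’s online help before adopting it.

# hypothetical crontab entry on the Wallaby node: export versioned
# configurations to a shared directory once an hour
0 * * * * cd /mnt/shared/wallaby-vc && wallaby vc-export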

Exporting Wallaby state to a file

If you’re just interested in the state of the Wallaby service (including possibly unactivated changes), you can periodically run wallaby dump over the network. This command produces a YAML file containing the serialized state of the Wallaby service; you can later load this file with the wallaby load command, possibly against another Wallaby agent.
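
A minimal sketch of this approach appears below. It assumes that wallaby dump writes its YAML serialization to standard output and that wallaby load accepts a filename; check each command’s online help to confirm, and note that the backup path is hypothetical.

# dump the current Wallaby state over the network (run this periodically)
wallaby dump > /mnt/shared/wallaby-backups/wallaby-state.yaml

# later, load that state back, possibly into a different Wallaby agent
wallaby load /mnt/shared/wallaby-backups/wallaby-state.yaml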

Backing up the raw database files

The easiest way to recover from a Wallaby node failure is to start a new Wallaby service with the same databases as the failed node. In turn, the easiest way to make that possible is to periodically copy these database files from their locations on the Wallaby node (typically in /var/lib/wallaby) to some location on a shared filesystem. The following Ruby script will safely copy the SQLite database files that Wallaby uses, even while Wallaby is running:

#!/usr/bin/env ruby
 
# Acquires a shared lock on a SQLite database file and copies it to a backup
# usage:  backup-db.rb DBFILE.db BACKUPFILE.db
# author:  William Benton (willb@redhat.com)
# Public domain.
 
require 'sqlite3'
require 'fileutils'
 
def backup_db(db_file, backup_file)
  begin
    db = SQLite3::Database.new(db_file)
 
    db.transaction(:immediate) do |ignored|
      # starting the transaction with :immediate takes a reserved lock,
      # so any pending db writes (unlikely!) complete before we copy the file
      FileUtils.cp(db_file, backup_file, :preserve => true)
    end
  ensure
    # guard against Database.new having failed before db was assigned
    db.close if db
  end
end
 
backup_db(ARGV[0], ARGV[1])
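
To put this script to work, you might run it from cron on the Wallaby node, once per database file. The filenames below are hypothetical stand-ins for whatever SQLite files you actually find under /var/lib/wallaby:

# copy a (hypothetically named) Wallaby database to a shared filesystem
ruby backup-db.rb /var/lib/wallaby/snapshot.db /mnt/shared/wallaby/snapshot.db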