In this installment of our series on type coercions, we’re going to introduce a way to support type widening in a language interpreter. We’ll present a general approach based on semilattices of types and a particular implementation of this approach that uses a straightforward encoding in Scala’s type system. We’ll work from a simple interpreter to allow for a clear exposition, but our general techniques will be applicable to more involved languages as well.

Widening functions

A widening function maps values from some type T to a wider type U. We can implement a trivial but generic type-widening method based on the partial ordering of types encoded in Scala’s type system:

Widening example
scala> def widen[T,U](t: T)(implicit w: T => U): U = w(t)

scala> val longFive = widen[Int,Long](5)
longFive: Long = 5

scala> val doubleFive = widen[Int,Double](5)
doubleFive: Double = 5.0

scala> val arbitraryFive = widen[Int,BigDecimal](5)
arbitraryFive: BigDecimal = 5

Invoking widen[A,B] on an argument of type A will succeed if there is a witness object for A => B. By default, we’ll be able to see the predefined widenings in Scala, including reflexive widenings. Note that there are no implicitly defined narrowings, though:

Exploring witness objects for predefined widenings
scala> implicitly[Int => Double]
res0: Int => Double = <function1>

scala> implicitly[Int => Int]
res1: Int => Int = <function1>

scala> implicitly[Double => Int]
<console>:8: error: No implicit view available from Double => Int.
              implicitly[Double => Int]
                        ^

It’s important to note that we could declare other witnesses to A => B for other A and B types and have them in scope; we aren’t constrained by Scala’s predefined definitions or by static relationships between implementation types. (We’ll come back to this point later, when we’re thinking about how to model types in the interpreted language.)
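
For instance, a witness that Scala doesn’t predefine can simply be declared and brought into scope. Here’s a small illustrative sketch (the Int => String view below isn’t part of Scala’s defaults; it exists only for this example):

// A hypothetical, user-supplied widening: Scala defines no Int => String
// view, so this witness is entirely ours.
implicit val intAsString: Int => String = _.toString

val stringFive = widen[Int, String](5)   // "5", via our witness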

A simply-typed interpreter

We’ll start with a simple interpreter with four kinds of values (integers, doubles, strings, and nulls) and one kind of expression representing numeric addition or string concatenation, depending on the types of its operands. It is a stretch to call the language embodied in this interpreter “typed” (since it has only literal values and expressions but no variables). However, because of the way we’ve encoded the interpreter in Scala, it is impossible to express programs with runtime errors. In particular, it is only possible to create an AddExpression with two arguments that evaluate to values of the same type.

SimpleInterpreter.scala
object SimpleInterpreter {

  trait Addable[T] {
    def plus(self: T, other: T): Value[T]
  }

  implicit object IntAddable extends Addable[Int] {
    def plus(self: Int, other: Int) = IntValue(self + other)
  }

  implicit object DoubleAddable extends Addable[Double] {
    def plus(self: Double, other: Double) = DoubleValue(self + other)
  }

  implicit object StringAddable extends Addable[String] {
    def plus(self: String, other: String) = StringValue(self + other)
  }

  abstract class Expression[T] {
    def eval: Value[T]
  }

  abstract class Value[T](v: T) extends Expression[T] {
    def eval: Value[T] = this
    def get: T = v
  }

  case class IntValue(v: Int) extends Value(v) {}
  case class DoubleValue(v: Double) extends Value(v) {}
  case class StringValue(v: String) extends Value(v) {}
  case object NullValue extends Value(null) {}

  case class AddExpression[T](lhs: Expression[T], rhs: Expression[T])(implicit ev: Addable[T]) extends Expression[T] {
    def eval: Value[T] = {
      val lv = lhs.eval
      val rv = rhs.eval
      ev.plus(lv.get, rv.get)
    }
  }
}

Adding widening

If we have an expression of the form t1 • t2, where the left-hand side is of type T1 and the right-hand side is of type T2, we will be able to convert this to an expression in which both operands have the same type if the following conditions are met:

  1. There must exist some type U such that T1 ⊆ U and T2 ⊆ U, and
  2. There must exist widening functions with the signatures T1 ⇒ U and T2 ⇒ U.

Finding U is simply finding the least upper bound of T1 and T2 on a semilattice of types. Once we have this least upper bound, if we also have appropriate widening functions, we can convert both t1 and t2 to values of the same type. In the following code, we extend our simple interpreter by modeling interpreter types in Scala, making types properties of values, and adding widening conversions to AddExpression by explicitly encoding the partial ordering of types — in this case, based on a relationship between the nativeType type member of each Type class. (We’re doing widening statically through Scala’s type system in this case, but there’s no reason why we couldn’t take a similar approach dynamically, handling errors by raising exceptions at runtime.)

WideningInterpreter.scala
object WideningInterpreter {
  import scala.language.implicitConversions

  sealed abstract class Type {
    type nativeType <: Any
  }
  class IntType extends Type {
    override type nativeType = Int
  }
  class DoubleType extends Type {
    override type nativeType = Double
  }
  class StringType extends Type {
    override type nativeType = String
  }

  abstract class Expression[K <: Type] {
    def eval: Value[K]
  }

  class Value[K <: Type](v: K#nativeType) extends Expression[K] {
    def eval: Value[K] = this
    def get: K#nativeType = v
  }

  case class IntValue(v: Int) extends Value[IntType](v) {}
  case class DoubleValue(v: Double) extends Value[DoubleType](v) {}
  case class StringValue(v: String) extends Value[StringType](v) {}

  sealed trait Addable[K <: Type] {
    def plus(self: K#nativeType, other: K#nativeType): Value[K]
  }

  implicit object IntAddable extends Addable[IntType] {
    def plus(self: Int, other: Int) = IntValue(self + other)
  }

  implicit object DoubleAddable extends Addable[DoubleType] {
    def plus(self: Double, other: Double) = DoubleValue(self + other)
  }

  implicit object StringAddable extends Addable[StringType] {
    def plus(self: String, other: String) = StringValue(self + other)
  }

  // We need some way to constrain our generic widening operators so
  // that an expression with identical operand types won't have an
  // ambiguous implicit argument.  One way to do this is to make sure
  // that one of the widening functions will only apply if the arguments
  // are of different types.  

  // These type inequality instances are taken from an answer Miles Sabin 
  // gave on StackOverflow:  http://stackoverflow.com/a/6944070

  trait =!=[A, B]
  implicit def neq[A, B] : A =!= B = null
  implicit def neqAmbig1[A] : A =!= A = null
  implicit def neqAmbig2[A] : A =!= A = null

  implicit def leftWiden[T <: Type, U <: Type](v1: Value[T], v2: Value[U])
      (implicit conv: (T#nativeType => U#nativeType)): (Value[U], Value[U]) =
    (new Value[U](conv(v1.get)), v2)

  implicit def rightWiden[T <: Type, U <: Type](v1: Value[U], v2: Value[T])
      (implicit neq: T =!= U,
                conv: (T#nativeType => U#nativeType)): (Value[U], Value[U]) =
    (v1, new Value[U](conv(v2.get)))

  case class AddExpression[T <: Type, U <: Type, V <: Type]
      (lhs: Expression[T], rhs: Expression[U])
      (implicit widen: (Value[T], Value[U]) => (Value[V], Value[V]), adder: Addable[V]) extends Expression[V] {
    def eval = {
      val lv = lhs.eval
      val rv = rhs.eval
      val args = widen(lv, rv)
      adder.plus(args._1.get, args._2.get)
    }
  }
}

In WideningInterpreter we extend AddExpression by allowing it to have two potentially distinct argument types (Value[T] and Value[U]) and by requiring evidence of an implicit conversion from a pair of values with distinct types to a pair of values with the same type.1 We define two witness functions, leftWiden for the case in which the left element of the pair is narrower than the right, and rightWiden for the case in which the right element of the pair is narrower than the left. In both cases, we determine that a type T is narrower than another type U if Scala knows how to widen values of the representation type (Type#nativeType) of T to the representation type of U; this is the case if an implicit resolution exists for the conv argument.
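
To see how the pieces fit together, here’s the sort of use we might exercise in the REPL, assuming the definitions above are in scope (this snippet isn’t from the original listing, and the expected results appear as comments rather than as REPL output):

import WideningInterpreter._

// The operands have different types: leftWiden supplies the implicit pair
// conversion (via the predefined Int => Double view), so V is inferred as
// DoubleType and DoubleAddable performs the addition.
AddExpression(IntValue(1), DoubleValue(2.5)).eval.get      // 3.5

// Monomorphic expressions still work; only leftWiden applies, reflexively.
AddExpression(StringValue("a"), StringValue("b")).eval.get // "ab"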

The problem we might encounter is that, because our partial ordering is reflexive, if T and U are the same type, then there will be witnesses both for T#nativeType => U#nativeType and U#nativeType => T#nativeType. So if we were to have naive implementations of leftWiden and rightWiden that only depended on evidence of such a conversion, Scala would be unable to unambiguously resolve which would apply in the case of monomorphic AddExpressions. We resolve this problem by adding a test for type inequality (due to Miles Sabin) to the implicit argument list of rightWiden, so that it will not apply if the arguments are of the same type.2
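
The behavior of the inequality witness is easy to check in isolation; the sketch below (not part of the original listing) shows both cases:

import WideningInterpreter._

// Distinct types: only neq applies, so the witness resolves.
implicitly[IntType =!= DoubleType]

// Identical types: neqAmbig1 and neqAmbig2 are both applicable and neither
// is more specific than the other, so the following would not compile.
// implicitly[IntType =!= IntType]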

Note that the partial ordering among interpreter types (IntType, DoubleType, and StringType) does not depend on Scala-level subtyping relationships between interpreter types. This is important because in a more realistic language we will want the flexibility to model data types independently of properties of our object-language implementation. Instead, the generic partial ordering in this example is based on predefined relationships between representation types, and we could extend it by adding instances of A => B for other types of interest.
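
As a concrete (and purely hypothetical) illustration, we could add a decimal type whose representation is BigDecimal. Since Type and Addable are sealed, these definitions would live inside WideningInterpreter; because Scala already supplies an implicit Int => BigDecimal view, the existing leftWiden and rightWiden definitions immediately treat IntType as narrower than the new type:

// Hypothetical additions inside WideningInterpreter (Type and Addable are
// sealed, so extensions must appear in the same file).
class DecimalType extends Type {
  override type nativeType = BigDecimal
}

case class DecimalValue(v: BigDecimal) extends Value[DecimalType](v)

implicit object DecimalAddable extends Addable[DecimalType] {
  def plus(self: BigDecimal, other: BigDecimal) = DecimalValue(self + other)
}

// With these in scope, AddExpression(IntValue(1), DecimalValue(BigDecimal(2)))
// widens its left operand via the predefined Int => BigDecimal view.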

For a small number of interpreter types, we could also explicitly encode the partial ordering, as in the example below:

A more explicit partial ordering encoding
implicit def intDoubleWiden(v1: Value[IntType], v2: Value[DoubleType]): (Value[DoubleType], Value[DoubleType]) =
  (DoubleValue(v1.get.toDouble), v2)

implicit def doubleIntWiden(v1: Value[DoubleType], v2: Value[IntType]): (Value[DoubleType], Value[DoubleType]) =
  (v1, DoubleValue(v2.get.toDouble))

Since in this example we have a total ordering among types, we can also easily widen one argument at a time by adding a witness object for the least upper bound of T and U,3 as in the example below:

A partial ordering encoding that finds appropriate widenings one argument at a time
implicit def widen[T <: Type, U <: Type](v1: Value[T])
    (implicit conv: (T#nativeType => U#nativeType)): (Value[U]) =
  new Value[U](conv(v1.get))

implicit def reflexiveWiden[T <: Type](v1: Value[T]): Value[T] = v1

trait LT[T <: Type, U <: Type] {}
implicit def intDoubleLT: LT[IntType, DoubleType] = null

trait WiderThan[T <: Type, U <: Type, V <: Type] {}
implicit def rightWider[T <: Type, U <: Type, V <: Type]
  (implicit rw: LT[T, U],
            conv2: (U#nativeType => V#nativeType),
            conv1: (T#nativeType => V#nativeType)): WiderThan[T,U,V] = null

implicit def leftWider[T <: Type, U <: Type, V <: Type]
  (implicit rw: LT[U, T],
            conv2: (T#nativeType => V#nativeType),
            conv1: (U#nativeType => V#nativeType)): WiderThan[T,U,V] = null

implicit def reflWider[T <: Type]: WiderThan[T, T, T] = null

case class AddExpression[T <: Type, U <: Type, V <: Type]
    (lhs: Expression[T], rhs: Expression[U])
    (implicit lub: WiderThan[T, U, V],
              widenLeft: Value[T] => Value[V],
              widenRight: Value[U] => Value[V],
              adder: Addable[V]) extends Expression[V] {
  def eval = {
    val lv = widenLeft(lhs.eval)
    val rv = widenRight(rhs.eval)
    adder.plus(lv.get, rv.get)
  }
}

Again, a similar approach would be applicable in an untyped Scala representation of interpreter-language types: we could represent types as terms, implement the least-upper-bound relation as a partial function mapping from a pair of terms to the least upper bound of the pair, and implement widenings as functions taking a value and a term representing the type to widen to.
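
Here’s a minimal sketch of what that untyped approach might look like (names and details are illustrative, not taken from any of the listings above):

object DynamicWidening {
  // Interpreter types represented as ordinary terms.
  sealed trait Ty
  case object IntTy extends Ty
  case object DoubleTy extends Ty
  case object StringTy extends Ty

  // A dynamically-typed value: a type term plus an untyped payload.
  case class DynValue(ty: Ty, v: Any)

  // The least-upper-bound relation as a partial function over pairs of terms.
  val lub: PartialFunction[(Ty, Ty), Ty] = {
    case (a, b) if a == b  => a
    case (IntTy, DoubleTy) => DoubleTy
    case (DoubleTy, IntTy) => DoubleTy
  }

  // Widening takes a value and the term naming the type to widen to,
  // failing at run time if no conversion applies.
  def widen(value: DynValue, to: Ty): DynValue = (value, to) match {
    case (DynValue(t, _), u) if t == u       => value
    case (DynValue(IntTy, i: Int), DoubleTy) => DynValue(DoubleTy, i.toDouble)
    case _ => throw new RuntimeException(s"cannot widen ${value.ty} to $to")
  }

  def add(lhs: DynValue, rhs: DynValue): DynValue = {
    val target = lub((lhs.ty, rhs.ty)) // MatchError if there is no upper bound
    (widen(lhs, target), widen(rhs, target)) match {
      case (DynValue(IntTy, a: Int), DynValue(IntTy, b: Int))             => DynValue(IntTy, a + b)
      case (DynValue(DoubleTy, a: Double), DynValue(DoubleTy, b: Double)) => DynValue(DoubleTy, a + b)
      case (DynValue(StringTy, a: String), DynValue(StringTy, b: String)) => DynValue(StringTy, a + b)
      case _ => throw new RuntimeException(s"no addition defined at $target")
    }
  }
}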


  1. See the signature for the widen implicit argument to AddExpression: (Value[T], Value[U]) => (Value[V], Value[V])

  2. This type-inequality test will fail if it is given two identical types because neqAmbig1 and neqAmbig2 will both be applicable.

  3. It occurred to me while developing these examples that using witness objects in this way is a lot like forward-chaining logic programming. (Note that we use negation-as-failure in implicit resolution when testing for type inequality and we use implicit arguments to guide further implicit resolution with the WiderThan witness.) Unsurprisingly, it turns out that other people have had the same idea! See the discussion on this post or this talk for two examples.

In my last post, I introduced two kinds of implicit type coercions that can appear in database query languages: type widenings, in which values are converted to wider types (e.g. from an int to a long or double), and type translations, in which a value of some type T might be converted to one of an unrelated type U if it is used where a value of U is expected. In this post, we’ll look at what sort of type coercions are available in Apache Hive and (in less detail) Microsoft SQL Server.

Implicit conversions in Apache Hive

Apache Hive features several kinds of types, many of which are also present in ANSI SQL with similar definitions:

  1. hardware-supported integral types, such as tinyint (one byte), smallint (two bytes), int (four bytes), and bigint (eight bytes);
  2. hardware-supported floating-point types, such as float (single-precision, four bytes) and double (double-precision, eight bytes);
  3. decimal values (38 digits precision in Hive 0.11 and 0.12; arbitrary-precision in Hive 0.13.0 and later);
  4. date and time types, such as timestamp and date;
  5. string types, including string (of arbitrary length), varchar[N] (of variable length, up to N characters), and char[N] (of exactly N characters, possibly padded with spaces);
  6. boolean values;
  7. binary values (sequences of bytes); and
  8. compound values made up of Hive types: homogeneous arrays with some element type, maps containing keys of one type and values of another, and C-style struct and union types.

Hive supports some widenings and narrowings between these types.1 Among the hardware-supported numeric types, values can be widened but not narrowed.2 Strings can be narrowed to be used as varchar values; converting a string value to a varchar[N], where N is insufficient to hold the contents of the string, will cause the string to be truncated to N characters. It is also possible (as of Hive 0.13) to supply a decimal argument to many numeric functions that expect a double input, although in most cases the function will only process a double approximating the supplied arbitrary-precision value.

Hive also supports type translations to and from string values. Hive permits implicitly converting a value of any type (with the exception of boolean and binary) to a string. String representations of double or decimal values (but not the smaller integral or floating-point types) can also be converted to values of those types.

Hive supports widenings as part of object comparisons; the FunctionRegistry.getCommonClassForComparison method returns the least upper bound of two types. The code excerpt below shows how Hive also explicitly encodes which widenings and translations are permissible:

excerpted from Hive 0.12’s FunctionRegistry.java
public static boolean implicitConvertable(PrimitiveCategory from, PrimitiveCategory to) {
  if (from == to) {
    return true;
  }

  PrimitiveGrouping fromPg = PrimitiveObjectInspectorUtils.getPrimitiveGrouping(from);
  PrimitiveGrouping toPg = PrimitiveObjectInspectorUtils.getPrimitiveGrouping(to);

  // Allow implicit String to Double conversion
  if (fromPg == PrimitiveGrouping.STRING_GROUP && to == PrimitiveCategory.DOUBLE) {
    return true;
  }
  // Allow implicit String to Decimal conversion
  if (fromPg == PrimitiveGrouping.STRING_GROUP && to == PrimitiveCategory.DECIMAL) {
    return true;
  }
  // Void can be converted to any type
  if (from == PrimitiveCategory.VOID) {
    return true;
  }

  // Allow implicit String to Date conversion
  if (fromPg == PrimitiveGrouping.DATE_GROUP && toPg == PrimitiveGrouping.STRING_GROUP) {
    return true;
  }
  // Allow implicit Numeric to String conversion
  if (fromPg == PrimitiveGrouping.NUMERIC_GROUP && toPg == PrimitiveGrouping.STRING_GROUP) {
    return true;
  }
  // Allow implicit String to varchar conversion, and vice versa
  if (fromPg == PrimitiveGrouping.STRING_GROUP && toPg == PrimitiveGrouping.STRING_GROUP) {
    return true;
  }

  // Allow implicit conversion from Byte -> Integer -> Long -> Float -> Double
  // Decimal -> String
  Integer f = numericTypes.get(from);
  Integer t = numericTypes.get(to);
  if (f == null || t == null) {
    return false;
  }
  if (f.intValue() > t.intValue()) {
    return false;
  }
  return true;
}

To see how Hive actually performs type coercions, we’ll have to take a step back and look at Hive’s architecture for defining functions.3 Hive has two interfaces for defining functions: UDF, which models a simple function with simply-typed arguments and a simply-typed return value, and GenericUDF, which models functions that can operate on and return values of compound types.

Subclasses of UDF include at least one method called evaluate (of arbitrary argument and return types); this is what gets called when the user-defined function is evaluated. Due to their flexible signatures, these evaluate methods are not specified in any interface and are instead found via Java reflection. By contrast, a GenericUDF must support an initialize method that takes an array of ObjectInspector instances (essentially adapters from arbitrary types to concrete object values) and an evaluate method taking an array of DeferredObject instances (essentially futures representing objects).

The initialize method in GenericUDF is invoked with ObjectInspector instances corresponding to actual parameters; if the actuals aren’t implicitly convertible to the proper types, it will fail. Otherwise, it will return an ObjectInspector instance for the return type. As a simple example, see the initialize method in the class providing Hive’s implementation of the SQL CONCAT function:

excerpted from Hive 0.12’s GenericUDFConcatWS.java
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
  if (arguments.length < 2) {
    throw new UDFArgumentLengthException(
        "The function CONCAT_WS(separator,[string | array(string)]+) "
          + "needs at least two arguments.");
  }

  // check if argument is a string or an array of strings
  for (int i = 0; i < arguments.length; i++) {
    switch(arguments[i].getCategory()) {
      case LIST:
        if (isStringOrVoidType(
            ((ListObjectInspector) arguments[i]).getListElementObjectInspector())) {
          break;
        }
      case PRIMITIVE:
        if (isStringOrVoidType(arguments[i])) {
        break;
        }
      default:
        throw new UDFArgumentTypeException(i, "Argument " + (i + 1)
          + " of function CONCAT_WS must be \"" + serdeConstants.STRING_TYPE_NAME
          + " or " + serdeConstants.LIST_TYPE_NAME + "<" + serdeConstants.STRING_TYPE_NAME
          + ">\", but \"" + arguments[i].getTypeName() + "\" was found.");
    }
  }

  argumentOIs = arguments;
  return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}

Note that the above verifies both the correct number of arguments and the correct types of each argument before returning an ObjectInspector instance for writable strings. The evaluate method then invokes DeferredObject.get() on each argument, converts them to String values using built-in coercions, and concatenates them together, returning the result as a text value.

Plain UDF instances and GenericUDF instances alike are stored in Hive’s function registry, but the former are converted to GenericUDF instances first by wrapping them in GenericUDFBridge, which is a proxy that uses Java introspection on the underlying UDF instance to determine what a function’s expected argument types are; it can then convert actual parameters to values of appropriate types using built-in coercions at execution time.

Implicit conversions in Microsoft SQL Server

While we can’t examine conversions supported in Microsoft SQL Server in as great detail as we can with Apache Hive (since the source for SQL Server isn’t available), the published documentation indicates which conversions are supported. In brief, SQL Server supports most of the same kinds of type coercions as Hive, with the following additions:

  1. bidirectional implicit translation from char[N] and varchar[N] to all numeric types (not merely double and decimal, as in Hive);
  2. financial types (money and smallmoney) are supported and can be implicitly translated to and from numeric types;
  3. bidirectional implicit translation between timestamp values and both character and integral types;
  4. the sql_variant type, which can receive values of most types via implicit conversions but must be converted with an explicit CAST in contexts expecting a value of a different type; and
  5. various other types (xml, uniqueidentifier, and user-defined types from the CLR) with varying conversion semantics.

These additions are useful but their absence does not limit Hive’s expressive power. In the next post in this series, we’ll look at a general approach to implementing type widening, along with a specific (and statically-safe) realization of this approach using Scala’s type system.


  1. The Hive wiki includes a full conversion matrix.

  2. For example, it is permissible to use a tinyint where an int or double is expected, but not vice versa.

  3. Actually implementing new functions in Hive is outside the scope of this post, but there are lots of resources online if you’re interested. In particular, Matthew Rathbone has a great article about extending Hive with new functions.

In this post, we’re going to introduce two kinds of implicit type conversions that are common in database query languages:

  1. Type widening, in which a value of type T is converted to a value of some wider type T’, where T ⊆ T’. As an example, given the expression b / d, where b holds an unsigned 8-bit value and d holds a double-precision floating-point value, a type-widening conversion would convert the value of b to a double-precision value and use this value as the left-hand operand of a floating-point division.
  2. Type translation, in which a value is transformed to produce a value of some unrelated type. As an example, consider a programming language that evaluates the expression sd / d as floating-point division where d is a double-precision floating-point value and sd is a string value that represents a number.

Type widening is common in typed general-purpose programming languages. For example, in C or Java, a / b is integer division with an integer result if both a and b are integer values but floating-point division — evaluating to the wider of the types of its operands — if either a or b is a floating-point value. Similarly, it is generally possible in such languages to assign a value of a narrower type to a variable of a wider one.
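
To make the widening case concrete, here’s how it plays out in Scala (C and Java behave the same way for this expression):

7 / 2     // integer division: 3
7 / 2.0   // the Int operand is widened to Double, so this is 3.5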

However, type translation (by necessity) only appears in untyped1 languages. This is the case because types are static properties but we cannot in general statically determine whether or not type translations will succeed. By allowing type translation, we are thus necessarily deferring decisions about whether or not a program fragment makes sense in some type system until it executes, and trading some flexibility for the possibility of runtime errors. Some programming communities regard this as an acceptable tradeoff.

Several extant database systems, including Hive, PostgreSQL, SQLite (see also here), and Microsoft SQL Server, support both type widening and type translations, so that one can, for example, take the cosine of the string representing a double-precision floating point value.

In subsequent posts, we’ll look at what type coercions some of these systems support and present general approaches to implementing support for type widening and type coercion in language interpreters, with specific techniques for realizing these general approaches in interpreters implemented in Scala.


  1. It may also appear in languages in which all programs are trivially well-typed and in which the type system thus cannot provide useful static guarantees about runtime behavior. An example such language is Tcl, where everything is a string. (By most definitions of typedness, though, “trivially typed” languages are untyped.)

Yesterday’s post provided a couple of minimal Docker images for spinning up containers to do Scala and Java builds and tests. The second of the images had a pre-loaded Ivy cache for Apache Spark’s dependencies installed in /root. The setup I suggested for launching these images included mounting a directory from your host inside the container using Docker’s -v option, but this is less than ideal: any files that are written or modified during a build or test run will be owned by root instead of by whoever owned the directory on the host!

As I understand it, Docker will be able to map container users to host users in the future, but for now we can work around this problem with a quick hack. The docker run command lets us pass environment variables that will be set inside the container, and we can use this to make sure that any files written in the container get owned by the right user on the host. Here’s what I did:

  1. I created another image based on my Spark development image, which I called willb/java-dev:centos7-spark-uid.
  2. This new image adds a normal user called dockerdev and copies the Ivy, Maven, and sbt caches from /root to that user’s home directory.
  3. The new image runs a small script on container launch that fixes up the user ID and group ID of the dockerdev user and its home directory based on the DEV_UID and DEV_GID environment variables supplied to docker run before substituting to the dockerdev user and executing a shell.

So to launch a container for Spark development in which your ~/devel directory is mapped to /devel and in which all changes to that directory wind up owned by you on the host, you’d do something like this:

docker run --privileged=true -e DEV_UID=$UID -e DEV_GID=$GID -i -t -v ${HOME}/devel:/devel willb/java-dev:centos7-spark-uid

Again, this is a hack — use it at your own risk! — but it’s useful until Docker supports user mapping more cleanly. I’m interested to hear about your approaches to solving this problem as well!




Appendix: boot.sh, the user-mapping boot script

boot.sh
#!/bin/sh

export ORIGPASSWD=$(cat /etc/passwd | grep dockerdev)
export ORIG_UID=$(echo $ORIGPASSWD | cut -f3 -d:)
export ORIG_GID=$(echo $ORIGPASSWD | cut -f4 -d:)

export DEV_UID=${DEV_UID:=$ORIG_UID}
export DEV_GID=${DEV_GID:=$ORIG_GID}

ORIG_HOME=$(echo $ORIGPASSWD | cut -f6 -d:)

sed -i -e "s/:$ORIG_UID:$ORIG_GID:/:$DEV_UID:$DEV_GID:/" /etc/passwd
sed -i -e "s/dockerdev:x:$ORIG_GID:/dockerdev:x:$DEV_GID:/" /etc/group

chown -R ${DEV_UID}:${DEV_GID} ${ORIG_HOME}

exec su - dockerdev

I recently experienced some bizarre failures running Akka actors and sbt tests in forked mode on my Fedora laptop. As far as I can tell, the root of both problems was that my network configuration changed when switching wireless networks, which caused java.net.InetAddress.getLocalHost() to return my WAN IP address (which processes on my machine couldn’t bind to or connect to) instead of my LAN IP address.1 This was especially frustrating since the relevant changes to my network configuration happened not when I switched networks, but after waking my laptop from sleep on the new network — so tests succeeded before sleep but failed after it!

My patience for troubleshooting networking configurations has diminished substantially since the first time I set up pppd on a Linux 1.1 system and I was unwilling to disconnect from the network just to run tests. Fortunately, running my tests in a Docker container proved to be a fairly straightforward workaround — Java was able to correctly identify the right address to bind to when running on a bridged network and the tests ran at native speed. Maintaining a Docker container for Java development has other benefits, as well: while I usually want to work within the Fedora Java ecosystem (to ensure that I can package my work for Fedora), if I’m working on patches to go to upstream projects, I want an environment that more closely resembles the one that upstream CI and conventional deployments will run under.

I’ve created and published a minimal Docker image for a conventional JVM testing environment. This image is based on the CentOS 7 image; while I love Fedora for day-to-day and development use, CentOS is a much slower-moving target and is thus far more common as a deployment target. My image also installs java, java-devel, maven, ivy, git, and emacs via yum and pulls down Paul Phillips’ excellent sbt-extras script to provide ready access to multiple versions of sbt.

Here’s how to get started using it on Fedora: first, you’ll need to install Docker and start up the Docker daemon (if you haven’t already). The gpasswd line is optional but will enable you to run Docker commands as yourself without su or sudo:

sudo yum install -y docker-io
sudo gpasswd -a ${USER} docker
sudo systemctl start docker

Then you’ll want to pull my image from the Docker Hub (run this and all Docker commands under sudo if you didn’t add yourself to the Docker group above):

docker pull willb/java-dev

You should then be able to see that the image file is installed:

docker images willb/java-dev

Now you’re ready to start a container from the image. I keep most of my ongoing development work in subdirectories of ${HOME}/devel and I’d like to mount that tree from within my container; replace ${HOME}/devel below with whatever path from the host you’d like to expose to the container:

docker run -i -t -v ${HOME}/devel:/devel willb/java-dev:centos7

This will start up the container interactively (-i), allocating a TTY (-t), mounting ${HOME}/devel from the host under /devel in the container, and giving you a zsh prompt. You can now run sbt and Maven builds and tests in a vanilla, upstream-friendly CentOS environment while still using Fedora as your desktop OS.

Regular readers of this site know that I’ve been interested in using and contributing to Apache Spark for quite some time now, and I was running Spark’s test suite when I ran into the network trouble that motivated this post. If you’re also working on Spark, you might want to try an alternate image that has a prepopulated Ivy cache for Spark’s dependencies:2

docker run -i -t -v ${HOME}/devel:/devel willb/java-dev:centos7-spark

Running your initial Spark build and tests in a container with the centos7-spark image will be faster since the necessary artifacts are baked into the image. (If you don’t have the Ivy cache, you’ll need to download dependencies every time you run your first build in a container based on this image.) The easiest way to build such a cache for sbt projects is to have the Dockerfile clone the relevant source repository into a temporary directory and then run sbt update from that clone. Let me know if you build other images with prepopulated dependency caches!


  1. Java networking on Linux seems to have a lot of corner cases; StackOverflow and other question/answer sites are full of people who are frustrated that InetAddress.getLocalHost() returns the loopback address instead of a LAN IP, but I had the opposite problem.

  2. This image should include all of the necessary build and test dependencies for the Spark master branch as of when I last generated it; I will update it periodically but it may not necessarily contain everything necessary to build and test the current Spark master branch in the future.

I’m looking forward to giving a talk at Spark Summit this week about some of my recent work using Apache Spark to make sense of my bike data (see also previous posts here and here).

I’ll post a link to the video of my talk once it’s online, but in the meantime, I’ve made a short (~3 minute) video demonstrating one of the visualizations I was able to make with Spark; it’s embedded below:

Demo of cycling analytics with Apache Spark from William Benton on Vimeo.

In an earlier post, I showed how I had used Apache Spark to cluster points from GPS traces of bike rides and plot the convex hulls of each cluster, coloring each hull based on the best power outputs at given durations that I’d produced starting from within each cluster. This visualization had some interesting results and pointed me to some roads that I hadn’t thought of for interval workouts, but it was a little coarser than one might like.

To get a better picture of exactly where I should be working out, I set out to plot the actual road segments that I’d set my power bests on, but the naïve approach was completely unsatisfactory. Because I regularly do repeats on a local hill-climb time trial course, I often produce power for three to four minutes on this hill that is good enough to beat many of my other three-minute efforts (which happen during rides in which I am not working on those sorts of efforts). As a consequence, I might have thirty to sixty overlapping three-minute sample windows from the same climb, many of which could represent top-20 or top-50 three minute efforts. When I plotted my top 20 three-minute efforts, the result looked like this:

Twin valley

This visualization is correct but trivial. I already know that I can produce a lot of power for three to four minutes on this hill; that’s one of the reasons that I ride it so often. I changed my analysis to only consider the best output from several temporally-overlapping windows:

Non overlapping

This represented an improvement over just taking the best efforts overall, since it wasn’t merely a plot of the best windows of my many climbs up the Twin Valley hill. It also showed something interesting that was abstracted away in the cluster-based analysis: some strong three-minute efforts began in the same cluster but involved different roads. In particular, the inverted “T” shape at the bottom of the map excerpt shows two different route fragments beginning on the same hill, both of which lend themselves well to high three-minute average power. However, merely excluding windows that temporally overlapped still wound up plotting many (just not quite as many) trips up Twin Valley from separate climbs.

My next approach was to combine the clustering information with a best-effort-path visualization. Instead of considering only the best effort starting in a given cluster (as the hull visualization from the previous post did) or considering only non-temporally-overlapping efforts (but potentially considering multiple spatially-overlapping efforts), I considered only the best efforts starting and ending in a given pair of clusters. This is obviously far more straightforward than identifying which roads were involved or applying similarity metrics to efforts (as Strava does to match efforts to segments), but it manages to eliminate many (but not all) overlapping efforts while being fine-grained enough to not exclude nearby but non-overlapping efforts.
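
In Spark terms, this pruning step is little more than a reduction keyed on cluster pairs. The sketch below is only illustrative; the Effort type and its field names are stand-ins rather than the actual code from my project:

import org.apache.spark.SparkContext._ // pair-RDD operations (for older Spark releases)
import org.apache.spark.rdd.RDD

// Stand-in record: an effort of some fixed duration, tagged with the clusters
// its window starts and ends in, along with its mean power.
case class Effort(startCluster: Int, endCluster: Int, meanWatts: Double)

// Keep only the strongest effort for each (start cluster, end cluster) pair.
def bestPerClusterPair(efforts: RDD[Effort]): RDD[Effort] =
  efforts
    .keyBy(e => (e.startCluster, e.endCluster))
    .reduceByKey((a, b) => if (a.meanWatts >= b.meanWatts) a else b)
    .values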

Here are the results from plotting my best one-, three-, and ten-minute efforts, considering only the best effort starting and ending in a given pair of clusters; as usual, you can click and drag or zoom to inspect the map:

On this map, each path is labeled with the average wattage for the effort (click on a path to see its label). This showed me a few surprising things:

  • I only get really strong, consistent shorter efforts in a few places; once we filter out all but the best results from these few places, the “top efforts” for a given duration include some surprisingly quotidian ones. I don’t have a lot of trouble consistently producing strong efforts on the trainer, but I’d rather be outside whenever possible, so I should probably do more structured workouts and fewer “rides.”
  • Many of the one- and three-minute efforts overlap, as do the three- and ten-minute efforts. I should probably also plot best efforts with low power variability in order to plan workouts.
  • One of my ten-minute bests (shown below) was during a local criterium. Since I wasn’t in a break at the time, I was clearly racing inefficiently.

GDVC criterium laps

If you’re interested in trying this visualization out on your own data, download the code, put a bunch of TCX files with wattage and GPS coordinates in a directory called activities, and fire up sbt console. Then you can run the application from the Scala REPL. Here are the commands I used to generate the above plot:

Finally, I’ll be talking about my work with Spark for bicycling data at Spark Summit this year; if you’re interested in this stuff and will be there, find me and we can chat!

This post contains embedded maps; you’ll need to view it in a browser with JavaScript support in order to see them.

One problem that a lot of enthusiastic amateur cyclists encounter is how to make sense of all the workout telemetry data that their smartphone or cycle computer captures. Most riders have some sense of how their cadence, heart rate, speed, road grade, and wattage come into play at any given moment in a ride as it’s happening, but answering questions about the bigger picture about how these fit together over time remains more difficult. I’ve been experimenting with cycling data analytics using Apache Spark for some time now, but I thought I’d share some visualizations that I put together recently to answer a question that’s been nagging me as the weather warms up here in Wisconsin.

In my last post on using Spark to process fitness data, I presented a very simple visualization based on plotting the centers of clustered GPS traces. By plotting darker center markers for denser clusters (and generating a large number of clusters), I was able to picture which roads and intersections I spent the most time riding on in the set of activities that I analyzed. This time, however, I was more interested in a visualization that would tell me what to do rather than a visualization that would tell me what I had already done.

Background

One of the most useful tools for a cyclist who is interested in quantifying his or her performance and training is a direct-force power meter. By measuring the actual force applied at some point on the bicycle drivetrain, these devices can accurately tell riders how many calories they’re burning in a ride, whether or not they’re “burning matches” (that is, using anaerobic metabolism instead of aerobic metabolism) at a given point in a race, how to pace long steady efforts to maximize performance, and precisely how hard to work in interval training in order to improve various kinds of fitness. The last of these capabilities will be our focus in this post.

It’s obvious that there is a difference between ultra-endurance efforts and sprint efforts; no one would try to sprint for an entire 40km time trial (or run a marathon at their 100m pace), and it would be pointless to do sprint-duration efforts at the sort of pace one could maintain for a 12-hour race. More generally, every athlete has a power-duration curve of the best efforts they could produce over time: one’s best 5-second power might be double their best one-minute power and four times their best one-hour power, for example. There are several points where this curve changes for most people, and these correspond to various physiological systems (for example, the shift from anaerobic to aerobic metabolism). By targeting interval workouts to certain power zones, athletes can improve the corresponding physiological systems.

Technique

I began by clustering points from GPS traces, but instead of plotting the cluster centers, I plotted the convex hulls of all of the points in each cluster. By giving me polygons containing every point from my data set, this gave me a pretty good picture of where I’d actually been. I then calculated my mean power for three durations – corresponding roughly to anaerobic, VO2max, and just-above-aerobic efforts – at every point in each activity. In other words, I mapped each point in each ride to the mean power I was about to produce in that ride. Then, for each duration, I found the best efforts starting in each cluster and used these data to shade the convex hulls so that hulls where better “best efforts” originated would thus appear more saturated.
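
The per-point mean-power calculation is essentially a sliding window over the wattage samples; a minimal sketch (assuming, purely for illustration, one sample per second) looks like this:

// Mean power over every window of `seconds` consecutive samples, assuming one
// wattage sample per second; element i is the mean power "about to be
// produced" starting at sample i.
def rollingMeanPower(watts: Vector[Double], seconds: Int): Vector[Double] =
  watts.sliding(seconds).map(w => w.sum / w.size).toVector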

Because Spark is expressive and can work interactively, it was straightforward to experiment with various techniques and constant factors to make the most sense of these data. Debugging is straightforward; since I stick to effect-free code as much as possible, I can test my logic without running it under Spark. Furthermore, Spark is fast enough to make trying a bunch of different options completely painless, even on my desktop computer.

Results

I’m including here three plots of cluster hulls, shaded by the best mean power I achieved starting in that cluster for one minute (green), three minutes (blue), and ten minutes (red). With these visualizations (and with increasingly friendly road cycling weather here in Wisconsin), I can decide where to go to do interval workouts based on where I’ve had my best efforts in the past. The data tell me that if I want to work on my one-minute power, I should focus on the Timber Lane climb up from Midtown; if I want to work on my three-minute power, it’s either Barlow Road or the east side of Indian Lake; and if I want to work on my ten-minute power, it’s off to Mounds Park Road for the same climb that made everyone suffer in the national championship road race last year.

(Click and drag or zoom to inspect any map; if one is missing polygons, drag and they should render.)

Future work

I have many ideas for where to take this work next and have some implementation in progress that is producing good results but not (yet) perspicuous visualizations. However, even the more mundane things on my to-do list are pretty interesting: among other things, I’d like to do some performance evaluation and see just how much cycling data we could feasibly process on a standard workstation or small cluster (my code is currently unoptimized); to add a web-based front end allowing more interactive analysis; and to improve my (currently very simple) computational geometry and power-analysis code to make better use of Spark’s abstractions and distributed execution. (The code itself, of course, is available under the Apache license and I welcome your feedback or pull requests.)

I love tools that make it easy to sketch solutions to hard problems interactively (indeed, I spent a lot of time in graduate school developing an interactive tool for designing program analyses – although in general it’s more fun to think about bicycling problems than whether or not two references alias one another), and Spark is one of the most impressive interactive environments I’ve seen for solving big problems. I’m looking forward to prototyping and refining more tools for understanding cycling training and performance in the future.

Longtime Chapeau readers may recall last summer’s lament about the state of the Scala ecosystem in Fedora. We’ve taken a lot of steps since then. After a rough patch for the Fedora Scala package, Scala 2.10 is available and works again on all current Fedora releases and Rawhide. We’ve added more Scala packages to Fedora as well, including scalacheck, sbinary, and test-interface. Today I’m especially pleased to announce that, by the time you read this, sbt 0.13.1 will be available in Fedora 20 testing.

Having sbt available in Fedora means that we can start packaging more of the Scala ecosystem in Fedora. In fact, the sbt package in Fedora is primarily intended for use building Scala packages for Fedora. While you’ll be able to use it for general Scala development, it has several important limitations in order to comply with the Fedora packaging guidelines: most notably, you won’t be able to cross-compile libraries for different Scala versions with it or launch different versions of sbt for different projects. (I use sbt-extras, which I’ve renamed to xsbt, for my general Scala development.)

In the future (i.e. F21 and on), sbt-based Fedora builds will be greatly streamlined by improved Ivy support in an upcoming version of xmvn. For now, we have to manage dependencies somewhat manually using scripts and macros, but it’s absolutely straightforward. To get started building Scala projects for Fedora right now, check out these guidelines I wrote up for the Big Data SIG and let me know if you have any trouble. There are many example spec files using sbt available among my Github repositories.

This was a big effort and thanks are due to several people, including Mark Harrah, who offered a lot of advice on sbt itself and gave prompt and thorough feedback on my patches; Mikołaj Izdebski, who helped a lot with my understanding of Java dependency resolution in Fedora and implemented improved support for Ivy resolution in xmvn; Rob Rati, who took on the task of reviewing the sbt package and did a thoughtful and careful job; and the Fedora Packaging Committee for their quick and helpful response to my request for a bootstrap-binary exception.

I’m looking forward to seeing new Scala projects packaged for Fedora soon!