My last post covered some considerations for using Spark SQL on a real-world JSON dataset. In particular, schema inference can suffer when you’re ingesting a dataset of heterogeneous objects. In this post, I’d like to sketch out some ways to connect schema inference to type inference, in order to point to automated solutions to some of the problems we’ve seen.
We’ll start out by reviewing some definitions from the world of type systems:
- a type is a range of values.
- a product type is a compound type that describes structures with multiple values of (possibly) different types (e.g., tuples or records). You can think of a product type as being like the Cartesian product of several other types: if we have types A and B, the type A × B (i.e., the type of pairs of one value in A and one value in B) can represent the Cartesian product of A and B.
- a sum type is a compound type that represents the disjoint union of several other types. Given types A and B, the type A + B represents the sum of A and B (i.e., any value in A or any value in B).
In Scala terms, a simple example of a product type is a tuple:
and a simple example of a sum type is a basic linked list:
1 2 3 4
Note that, in the case of the linked list (as with other standard recursive data structures), we’re actually looking at a sum type in which one of the summed types is a product type: an
IntList can either be an
EmptyList or a cons cell, consisting of a pair of an
Int value (the head of the list) and an
IntList (the tail of the list).
Also note that, in Scala, sum types are tagged. You can statically distinguish between cons cells and the empty list in a case match.1 C programmers are familiar with a related kind of sum type: the untagged union, which describes a structure that can hold one value that can be of several different types. If you have a C
union, you’ll need to explicitly track what kind of value it’s holding since you won’t be able to tell what kind of value it contains statically.
Given a collection of JSON objects,2 a schema inference algorithm produces a type that is the supertype of every object in the collection. (This is an interesting problem only if the collection contains objects of different types.) As an example, consider the following two objects:
The first models some characteristics of a bicycle and the second models some characteristics of a car. Both have
gears, but bicycles have data about what kind of
handlebars they include; cars have data about whether or not they include anti-lock brakes (
abs) and specify what kind of
fuel they expect. If we ask Spark SQL to infer a schema for a collection including these two objects, we’ll get something like this:
root |-- abs: boolean (nullable = true) |-- fuel: string (nullable = true) |-- gears: integer (nullable = true) |-- handlebars: string (nullable = true) |-- pedals: string (nullable = true)
Given a collection of objects with types T₁, T₂, ⋯, Tₓ, what we’re trying to get is the sum type T₁ + T₂ + ⋯ + Tₓ . The way we get this type is as an untagged union: we get an object that has the union of all of the fields of T₁, all of the fields of T₂, and so on. (Put another way, we’re really treating a heterogeneous collection as a union of homogeneous collections and identifying its schema as the union of the schemas for each included homogeneous collection.) Just as if we were using a C
union, we’d have to figure out whether or not each object was a bicycle or a car. Fortunately, since we’re be in a safe-language context, we’d know that an object with a value in the
pedals field wouldn’t have a value in the
It should be clear that this is a totally reasonable approach, since it combines a straightforward implementation with a pleasant and unsurprising user experience: by adding a simple filter to my queries, I can select only objects of a given type from a heterogeneous collection. For example, if I want to find the mean number of gears for every bicycle with drop handlebars, I could write the following query (assuming I’d registered my
SchemaRDD as a table called
handlebars will be
NULL for every car, this query will only consider bicycles. Other collections may not admit such straightforward differentiation between object kinds, of course.
Problems and possible solutions
Tagging diverging types
As I mentioned last week, things begin to break down when we attempt to construct the sum of types that have the same field names but diverging types for those fields, as in the following example:
A few days ago, Spark would have failed to infer a schema for this collection of objects, but open-source development moves quickly and now it will handle the conflicting types by giving users a schema in which the
branches field contains an array of strings (and in which every object in the collection has its
branches field cast to an array of strings).
In general, it seems like whatever technique we might adopt schema inference should prefer imprecise results to a crash. But are there ways Spark could do a better job of inferring schemas for heterogeneous collections with diverging field types? I believe there might be, and part of the answer is going from untagged unions to tagged unions.
The fedmsg data set is a heterogeneous collection of objects; in some of these objects, the
branches field models one concept as an array of strings, and in others, it models a different concept as an array of objects. I solved this problem by preprocessing the JSON to rename the different
branches fields — in effect, dividing the
branches fields into two tagged possibilities, like this. We could imagine that Spark could handle schema divergence by automatically identifying equivalence classes of field types, perhaps naming them with an integer subscript (e.g.,
branches_2) or with some abbreviated identifier for the value type, as follows:
Here’s an alternative way we could make the tagging explicit, by wrapping the diverging field’s values in an object:
The advantage to this tagged-union approach is that it makes it possible to write queries that handle heterogeneous collections. The disadvantage is that the path from the schema root to the tagged value will be different from the path to that value in the schemas of any of the objects in the original heterogeneous collection. The schema change may be surprising to the user, but — by definition — any approach to derive a fixed schema for a collection of objects with diverging schemas will not result in a schema that is totally compatible with the objects in the original collection.
The other problem I observed related to encoding maps as JSON objects, as in the following example where we have a map from bicycle names to vehicle objects:
1 2 3 4 5 6
The inferred schema for a collection including an object like this will have a distinct field for every key in every map. In this case, we don’t have an immediately great solution from the broader world of type systems. However, it seems that some heuristic — for example, “if an object has more than n fields and each has a value of the same type, it is a map” — might be able to do a good job of identifying objects that model maps.
In any case, it is possible for the user to preprocess their objects in order to encode maps in a different way if that will make it easier to query the data after ingest.
Spark SQL’s support for schema inference is an extremely cool feature, although it obviously isn’t able to work magic by ascribing rigid structure to heterogeneous collections of ad hoc objects. Some minor improvements could make inferred schemas more useful for these kinds of collections, though.
Scala doesn’t support untagged union types directly — although there are clever ways to encode them in the type system — and Scala won’t actually be able to infer sum types from collections of untagged values: the inferred type of
Array[String + Int].↩
(Slightly) more precisely: objects are defined recursively as unordered collections of key-value pairs, where keys are strings and values can be numbers, booleans, strings, objects, or arrays of values.↩