In my last post, I introduced two kinds of implicit type coercions that can appear in database query languages: type widenings, in which values are converted to wider types (e.g. from an `int` to a `double`), and type translations, in which a value of some type T might be converted to one of an unrelated type U if it is used where a value of U is expected. In this post, we'll look at what sort of type coercions are available in Apache Hive and (in less detail) Microsoft SQL Server.
Implicit conversions in Apache Hive
Hive's type system includes the following kinds of values:

- hardware-supported integral types, such as `int` (four bytes);
- hardware-supported floating-point types, such as `float` (single-precision, four bytes) and `double` (double-precision, eight bytes);
- `decimal` values (38 digits of precision in Hive 0.11 and 0.12; user-definable precision and scale in Hive 0.13.0 and later);
- date and time types, such as `timestamp` and `date`;
- string types, including `string` (of arbitrary length), `varchar[N]` (of arbitrary length but less than N characters), and `char[N]` (of exactly N characters, possibly padded with spaces);
- `binary` values (sequences of bytes); and
- compound values made up of Hive types: homogeneous arrays with some element type, maps containing keys of one type and values of another, and C-style structs with named fields.
Hive supports some widenings and narrowings between these types.1 Among the hardware-supported numeric types, values can be widened but not narrowed.2 Strings can be narrowed to be used as `varchar` values; converting a `string` value to a `varchar[N]`, where N is insufficient to hold the contents of the string, will cause the string to be truncated to N characters. It is also possible (as of Hive 0.13) to supply a `decimal` argument to many numeric functions that expect a `double` input, although in most cases the function will only process a `double` approximating the supplied arbitrary-precision value.
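That approximation is easy to see in plain Java, since Hive's `decimal` values are backed by `java.math.BigDecimal`. The class below is only a demonstration of the narrowing, not Hive code:

```java
import java.math.BigDecimal;

public class DecimalApprox {
    public static void main(String[] args) {
        // An arbitrary-precision value with more significant digits than a
        // double can represent...
        BigDecimal d = new BigDecimal("1.00000000000000000001");
        // ...collapses to the nearest representable double when narrowed,
        // which is the sort of approximation a double-taking function sees.
        System.out.println(d.doubleValue()); // 1.0
    }
}
```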
Hive also supports type translations to and from
string values. Hive permits implicitly converting a value of any type (with the exception of
binary) to a string. String representations of
decimal values (but not the smaller integral or floating-point types) can also be converted to values of those types.
Hive supports widenings as part of object comparisons; the `FunctionRegistry.getCommonClassForComparison` method returns the least upper bound of two types. `FunctionRegistry` also explicitly encodes, in a conversion table, which widenings and translations are permissible.
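As a rough, self-contained model of such a table (the `HiveType` names, table entries, and `implicitConvertible` method below are simplified stand-ins, not Hive's actual `FunctionRegistry` code), one might encode permissible conversions like this:

```java
import java.util.*;

public class ImplicitConversions {
    public enum HiveType { BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DECIMAL, STRING, VARCHAR }

    // For each type, the set of types it may be implicitly converted to.
    private static final Map<HiveType, EnumSet<HiveType>> CONVERTIBLE =
        new EnumMap<>(HiveType.class);

    static {
        // Widenings among the numeric types: each may widen rightward.
        register(HiveType.BYTE, HiveType.SHORT, HiveType.INT, HiveType.LONG,
                 HiveType.FLOAT, HiveType.DOUBLE, HiveType.DECIMAL);
        register(HiveType.SHORT, HiveType.INT, HiveType.LONG, HiveType.FLOAT,
                 HiveType.DOUBLE, HiveType.DECIMAL);
        register(HiveType.INT, HiveType.LONG, HiveType.FLOAT, HiveType.DOUBLE,
                 HiveType.DECIMAL);
        register(HiveType.LONG, HiveType.FLOAT, HiveType.DOUBLE, HiveType.DECIMAL);
        register(HiveType.FLOAT, HiveType.DOUBLE);
        register(HiveType.DECIMAL, HiveType.DOUBLE); // lossy: approximated

        // Translations: the numeric types may be converted to string, and
        // strings may be narrowed to varchar or read back as decimal.
        for (HiveType t : EnumSet.range(HiveType.BYTE, HiveType.DECIMAL)) {
            register(t, HiveType.STRING);
        }
        register(HiveType.STRING, HiveType.VARCHAR, HiveType.DECIMAL);
        register(HiveType.VARCHAR, HiveType.STRING);
    }

    private static void register(HiveType from, HiveType... tos) {
        CONVERTIBLE.computeIfAbsent(from, t -> EnumSet.noneOf(HiveType.class))
                   .addAll(Arrays.asList(tos));
    }

    public static boolean implicitConvertible(HiveType from, HiveType to) {
        return from == to
            || CONVERTIBLE.getOrDefault(from, EnumSet.noneOf(HiveType.class)).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(implicitConvertible(HiveType.INT, HiveType.DOUBLE));  // true
        System.out.println(implicitConvertible(HiveType.DOUBLE, HiveType.INT));  // false
    }
}
```

The important property is asymmetry: the table admits `int` → `double` but not `double` → `int`, so narrowing can only happen through an explicit cast.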
To see how Hive actually performs type coercions, we'll have to take a step back and look at Hive's architecture for defining functions.3 Hive has two interfaces for defining functions:

- `UDF`, which models a simple function with simply-typed arguments and a simply-typed return value, and
- `GenericUDF`, which models functions that can operate on and return values of compound types.

Classes implementing `UDF` include at least one method called `evaluate` (of arbitrary argument and return types); this is what gets called when the user-defined function is evaluated. Because of their flexible signatures, these `evaluate` methods are not specified in any interface and are instead found via Java reflection. By contrast, a `GenericUDF` must support an `initialize` method that takes an array of `ObjectInspector` instances (essentially adapters from arbitrary types to concrete object values) and an `evaluate` method taking an array of `DeferredObject` instances (essentially futures representing objects).
The `initialize` method in a `GenericUDF` is invoked with `ObjectInspector` instances corresponding to the actual parameters; if the actuals aren't implicitly convertible to the proper types, it will fail. Otherwise, it will return an `ObjectInspector` instance for the return type. As a simple example, consider the `initialize` method in the class providing Hive's implementation of the SQL `concat` function.
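A minimal, self-contained model of that `initialize`/`evaluate` protocol is sketched below. The `ObjectInspector` and `DeferredObject` interfaces here are simplified stand-ins for Hive's (which live in `org.apache.hadoop.hive.serde2.objectinspector` and `org.apache.hadoop.hive.ql.udf.generic`), and `GenericUDFConcat` is an illustrative reimplementation, not Hive's class:

```java
public class ConcatSketch {
    // Stand-in for ObjectInspector: just reports a type name.
    interface ObjectInspector { String getTypeName(); }
    static final ObjectInspector STRING_OI = () -> "string";

    // Stand-in for DeferredObject: a lazily-evaluated actual parameter.
    interface DeferredObject { Object get(); }

    static class GenericUDFConcat {
        // Verify the argument count and the type of each argument, then
        // return the inspector describing the result type.
        ObjectInspector initialize(ObjectInspector[] args) {
            if (args.length < 2) {
                throw new IllegalArgumentException("concat needs at least two arguments");
            }
            for (ObjectInspector oi : args) {
                if (!"string".equals(oi.getTypeName())) {
                    throw new IllegalArgumentException("concat expects string arguments");
                }
            }
            return STRING_OI;
        }

        // Force each deferred argument, coerce it to a String, and
        // concatenate the results.
        Object evaluate(DeferredObject[] args) {
            StringBuilder sb = new StringBuilder();
            for (DeferredObject arg : args) {
                sb.append(String.valueOf(arg.get()));
            }
            return sb.toString();
        }
    }

    public static void main(String[] argv) {
        GenericUDFConcat concat = new GenericUDFConcat();
        concat.initialize(new ObjectInspector[] { STRING_OI, STRING_OI });
        Object result = concat.evaluate(new DeferredObject[] { () -> "foo", () -> "bar" });
        System.out.println(result); // foobar
    }
}
```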
Note that the above verifies both the correct number of arguments and the correct types of each argument before returning an `ObjectInspector` instance for writable strings. The `evaluate` method then invokes `DeferredObject.get()` on each argument, converts the results to `String` values using built-in coercions, and concatenates them together, returning the result as a text value.
`UDF` instances and `GenericUDF` instances alike are stored in Hive's function registry, but the former are converted to `GenericUDF` instances first by wrapping them in `GenericUDFBridge`, which is a proxy that uses Java introspection on the underlying `UDF` instance to determine what a function's expected argument types are; it can then convert actual parameters to values of the appropriate types, using built-in coercions, at execution time.
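A self-contained sketch of that reflective dispatch follows, using plain `java.lang.reflect`; the `UDFTwice` class and the `coerce` helper are hypothetical stand-ins (`coerce` models only one of the widenings Hive's bridge supports), not `GenericUDFBridge` itself:

```java
import java.lang.reflect.Method;

public class BridgeSketch {
    // A simple UDF in the classic style: its evaluate method is not declared
    // in any interface and must be discovered by reflection.
    public static class UDFTwice {
        public double evaluate(double x) { return x * 2; }
    }

    // Coerce an actual parameter to the declared formal type; only the
    // int/long-to-double widening is modeled here.
    static Object coerce(Object value, Class<?> target) {
        if (target == double.class && value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        return value;
    }

    // Find an applicable evaluate method, convert the actuals, and invoke it.
    public static Object invoke(Object udf, Object... actuals) throws Exception {
        for (Method m : udf.getClass().getMethods()) {
            if (!m.getName().equals("evaluate")
                    || m.getParameterCount() != actuals.length) {
                continue;
            }
            Class<?>[] formals = m.getParameterTypes();
            Object[] converted = new Object[actuals.length];
            for (int i = 0; i < actuals.length; i++) {
                converted[i] = coerce(actuals[i], formals[i]);
            }
            return m.invoke(udf, converted);
        }
        throw new NoSuchMethodException("no applicable evaluate method");
    }

    public static void main(String[] args) throws Exception {
        // The int actual 21 is widened to the double formal before invocation.
        System.out.println(invoke(new UDFTwice(), 21)); // 42.0
    }
}
```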
Implicit conversions in Microsoft SQL Server
While we can’t examine conversions supported in Microsoft SQL Server in as great detail as we can with Apache Hive (since the source for SQL Server isn’t available), the published documentation indicates which conversions are supported. In brief, SQL Server supports most of the same kinds of type coercions as Hive, with the following additions:
- bidirectional implicit translation from `varchar[N]` to all numeric types (not merely `decimal`, as in Hive);
- financial types (`money` and `smallmoney`), which can be implicitly translated to and from numeric types;
- bidirectional implicit translation between `timestamp` values and character and integral types;
- a `sql_variant` type, which can receive values of most types via implicit conversions but must be converted with an explicit `CAST` in contexts expecting a value of a different type; and
- various other types (`uniqueidentifier` and user-defined types from the CLR) with varying conversion semantics.
These additions are useful, but their absence does not limit Hive's expressive power. In the next post in this series, we'll look at a general approach to implementing type widening, along with a specific (and statically-safe) realization of this approach using Scala's type system.
For example, it is permissible to supply a value of a narrower numeric type where a `double` is expected, but not vice versa. ↩
Actually implementing new functions in Hive is outside the scope of this post, but there are lots of resources online if you’re interested. In particular, Matthew Rathbone has a great article about extending Hive with new functions. ↩