In my last post, I introduced two kinds of implicit type coercions that can appear in database query languages: type widenings, in which values are converted to wider types (e.g. from an `int` to a `double`), and type translations, in which a value of some type T might be converted to one of an unrelated type U if it is used where a value of U is expected. In this post, we'll look at what sort of type coercions are available in Apache Hive and (in less detail) Microsoft SQL Server.
Implicit conversions in Apache Hive
Hive's type system includes the following kinds of values:

- hardware-supported integral types, such as `int` (four bytes);
- hardware-supported floating-point types, such as `float` (single-precision, four bytes) and `double` (double-precision, eight bytes);
- `decimal` values (38 digits of precision in Hive 0.11 and 0.12; user-definable precision and scale in Hive 0.13.0 and later);
- date and time types, such as `timestamp` and `date`;
- string types, including `string` (of arbitrary length), `varchar[N]` (of arbitrary length but less than N characters), and `char[N]` (of exactly N characters, possibly padded with spaces);
- `binary` values (sequences of bytes); and
- compound values made up of Hive types: homogeneous arrays with some element type, maps containing keys of one type and values of another, and C-style structs with named fields.
Hive supports some widenings and narrowings between these types.1 Among the hardware-supported numeric types, values can be widened but not narrowed.2 Strings can be narrowed to be used as `varchar` values; converting a `string` value to a `varchar[N]`, where N is insufficient to hold the contents of the string, will cause the string to be truncated to N characters. It is also possible (as of Hive 0.13) to supply a `decimal` argument to many numeric functions that expect a `double` input, although in most cases the function will only process a `double` approximating the supplied arbitrary-precision value.
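That approximation is easy to see in plain Java, since Hive's `decimal` values are backed by `java.math.BigDecimal`. The class below is only a demonstration of the narrowing, not Hive code:

```java
import java.math.BigDecimal;

public class DecimalApprox {
    public static void main(String[] args) {
        // An arbitrary-precision value with more significant digits than a
        // double can represent...
        BigDecimal d = new BigDecimal("1.00000000000000000001");
        // ...collapses to the nearest representable double when narrowed,
        // which is the sort of approximation a double-taking function sees.
        System.out.println(d.doubleValue()); // 1.0
    }
}
```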
Hive also supports type translations to and from
string values. Hive permits implicitly converting a value of any type (with the exception of
binary) to a string. String representations of
decimal values (but not the smaller integral or floating-point types) can also be converted to values of those types.
Hive supports widenings as part of object comparisons; the `FunctionRegistry.getCommonClassForComparison` method returns the least upper bound of two types. `FunctionRegistry` also explicitly encodes, in a conversion table, which widenings and translations are permissible.
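As a rough, self-contained model of such a table (the `HiveType` names, table entries, and `implicitConvertible` method below are simplified stand-ins, not Hive's actual `FunctionRegistry` code), one might encode permissible conversions like this:

```java
import java.util.*;

public class ImplicitConversions {
    public enum HiveType { BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DECIMAL, STRING, VARCHAR }

    // For each type, the set of types it may be implicitly converted to.
    private static final Map<HiveType, EnumSet<HiveType>> CONVERTIBLE =
        new EnumMap<>(HiveType.class);

    static {
        // Widenings among the numeric types: each may widen rightward.
        register(HiveType.BYTE, HiveType.SHORT, HiveType.INT, HiveType.LONG,
                 HiveType.FLOAT, HiveType.DOUBLE, HiveType.DECIMAL);
        register(HiveType.SHORT, HiveType.INT, HiveType.LONG, HiveType.FLOAT,
                 HiveType.DOUBLE, HiveType.DECIMAL);
        register(HiveType.INT, HiveType.LONG, HiveType.FLOAT, HiveType.DOUBLE,
                 HiveType.DECIMAL);
        register(HiveType.LONG, HiveType.FLOAT, HiveType.DOUBLE, HiveType.DECIMAL);
        register(HiveType.FLOAT, HiveType.DOUBLE);
        register(HiveType.DECIMAL, HiveType.DOUBLE); // lossy: approximated

        // Translations: the numeric types may be converted to string, and
        // strings may be narrowed to varchar or read back as decimal.
        for (HiveType t : EnumSet.range(HiveType.BYTE, HiveType.DECIMAL)) {
            register(t, HiveType.STRING);
        }
        register(HiveType.STRING, HiveType.VARCHAR, HiveType.DECIMAL);
        register(HiveType.VARCHAR, HiveType.STRING);
    }

    private static void register(HiveType from, HiveType... tos) {
        CONVERTIBLE.computeIfAbsent(from, t -> EnumSet.noneOf(HiveType.class))
                   .addAll(Arrays.asList(tos));
    }

    public static boolean implicitConvertible(HiveType from, HiveType to) {
        return from == to
            || CONVERTIBLE.getOrDefault(from, EnumSet.noneOf(HiveType.class)).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(implicitConvertible(HiveType.INT, HiveType.DOUBLE));  // true
        System.out.println(implicitConvertible(HiveType.DOUBLE, HiveType.INT));  // false
    }
}
```

The important property is asymmetry: the table admits `int` → `double` but not `double` → `int`, so narrowing can only happen through an explicit cast.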
To see how Hive actually performs type coercions, we'll have to take a step back and look at Hive's architecture for defining functions.3 Hive has two interfaces for defining functions:

- `UDF`, which models a simple function with simply-typed arguments and a simply-typed return value, and
- `GenericUDF`, which models functions that can operate on and return values of compound types.

Classes implementing `UDF` include at least one method called `evaluate` (of arbitrary argument and return types); this is what gets called when the user-defined function is evaluated. Because of their flexible signatures, these `evaluate` methods are not specified in any interface and are instead found via Java reflection. By contrast, a `GenericUDF` must support an `initialize` method that takes an array of `ObjectInspector` instances (essentially adapters from arbitrary types to concrete object values) and an `evaluate` method taking an array of `DeferredObject` instances (essentially futures representing objects).
The `initialize` method in a `GenericUDF` is invoked with `ObjectInspector` instances corresponding to the actual parameters; if the actuals aren't implicitly convertible to the proper types, it will fail. Otherwise, it will return an `ObjectInspector` instance for the return type. As a simple example, consider the `initialize` method in the class providing Hive's implementation of the SQL `concat` function.
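A minimal, self-contained model of that `initialize`/`evaluate` protocol is sketched below. The `ObjectInspector` and `DeferredObject` interfaces here are simplified stand-ins for Hive's (which live in `org.apache.hadoop.hive.serde2.objectinspector` and `org.apache.hadoop.hive.ql.udf.generic`), and `GenericUDFConcat` is an illustrative reimplementation, not Hive's class:

```java
public class ConcatSketch {
    // Stand-in for ObjectInspector: just reports a type name.
    interface ObjectInspector { String getTypeName(); }
    static final ObjectInspector STRING_OI = () -> "string";

    // Stand-in for DeferredObject: a lazily-evaluated actual parameter.
    interface DeferredObject { Object get(); }

    static class GenericUDFConcat {
        // Verify the argument count and the type of each argument, then
        // return the inspector describing the result type.
        ObjectInspector initialize(ObjectInspector[] args) {
            if (args.length < 2) {
                throw new IllegalArgumentException("concat needs at least two arguments");
            }
            for (ObjectInspector oi : args) {
                if (!"string".equals(oi.getTypeName())) {
                    throw new IllegalArgumentException("concat expects string arguments");
                }
            }
            return STRING_OI;
        }

        // Force each deferred argument, coerce it to a String, and
        // concatenate the results.
        Object evaluate(DeferredObject[] args) {
            StringBuilder sb = new StringBuilder();
            for (DeferredObject arg : args) {
                sb.append(String.valueOf(arg.get()));
            }
            return sb.toString();
        }
    }

    public static void main(String[] argv) {
        GenericUDFConcat concat = new GenericUDFConcat();
        concat.initialize(new ObjectInspector[] { STRING_OI, STRING_OI });
        Object result = concat.evaluate(new DeferredObject[] { () -> "foo", () -> "bar" });
        System.out.println(result); // foobar
    }
}
```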
Note that the above verifies both the correct number of arguments and the correct types of each argument before returning an `ObjectInspector` instance for writable strings. The `evaluate` method then invokes `DeferredObject.get()` on each argument, converts the results to `String` values using built-in coercions, and concatenates them together, returning the result as a text value.
`UDF` instances and `GenericUDF` instances alike are stored in Hive's function registry, but the former are converted to `GenericUDF` instances first by wrapping them in `GenericUDFBridge`, which is a proxy that uses Java introspection on the underlying `UDF` instance to determine what a function's expected argument types are; it can then convert actual parameters to values of the appropriate types, using built-in coercions, at execution time.
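A self-contained sketch of that reflective dispatch follows, using plain `java.lang.reflect`; the `UDFTwice` class and the `coerce` helper are hypothetical stand-ins (`coerce` models only one of the widenings Hive's bridge supports), not `GenericUDFBridge` itself:

```java
import java.lang.reflect.Method;

public class BridgeSketch {
    // A simple UDF in the classic style: its evaluate method is not declared
    // in any interface and must be discovered by reflection.
    public static class UDFTwice {
        public double evaluate(double x) { return x * 2; }
    }

    // Coerce an actual parameter to the declared formal type; only the
    // int/long-to-double widening is modeled here.
    static Object coerce(Object value, Class<?> target) {
        if (target == double.class && value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        return value;
    }

    // Find an applicable evaluate method, convert the actuals, and invoke it.
    public static Object invoke(Object udf, Object... actuals) throws Exception {
        for (Method m : udf.getClass().getMethods()) {
            if (!m.getName().equals("evaluate")
                    || m.getParameterCount() != actuals.length) {
                continue;
            }
            Class<?>[] formals = m.getParameterTypes();
            Object[] converted = new Object[actuals.length];
            for (int i = 0; i < actuals.length; i++) {
                converted[i] = coerce(actuals[i], formals[i]);
            }
            return m.invoke(udf, converted);
        }
        throw new NoSuchMethodException("no applicable evaluate method");
    }

    public static void main(String[] args) throws Exception {
        // The int actual 21 is widened to the double formal before invocation.
        System.out.println(invoke(new UDFTwice(), 21)); // 42.0
    }
}
```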
Implicit conversions in Microsoft SQL Server
While we can’t examine conversions supported in Microsoft SQL Server in as great detail as we can with Apache Hive (since the source for SQL Server isn’t available), the published documentation indicates which conversions are supported. In brief, SQL Server supports most of the same kinds of type coercions as Hive, with the following additions:
- bidirectional implicit translation from `varchar[N]` to all numeric types (not merely `decimal`, as in Hive);
- financial types (`money` and `smallmoney`), which can be implicitly translated to and from numeric types;
- bidirectional implicit translation between `timestamp` values and character and integral types;
- a `sql_variant` type, which can receive values of most types via implicit conversions but must be converted with an explicit `CAST` in contexts expecting a value of a different type; and
- various other types (`uniqueidentifier` and user-defined types from the CLR) with varying conversion semantics.
These additions are useful, but their absence does not limit Hive's expressive power. In the next post in this series, we'll look at a general approach to implementing type widening, along with a specific (and statically-safe) realization of this approach using Scala's type system.
For example, it is permissible to supply a value of a narrower numeric type where a `double` is expected, but not vice versa. ↩
Actually implementing new functions in Hive is outside the scope of this post, but there are lots of resources online if you’re interested. In particular, Matthew Rathbone has a great article about extending Hive with new functions. ↩