In my last post, I introduced two kinds of implicit type coercions that can appear in database query languages: type widenings, in which values are converted to wider types (e.g. from an `int` to a `long` or `double`), and type translations, in which a value of some type T might be converted to one of an unrelated type U if it is used where a value of U is expected. In this post, we’ll look at what sort of type coercions are available in Apache Hive and (in less detail) Microsoft SQL Server.
Implicit conversions in Apache Hive
Apache Hive features several kinds of types, many of which are also present in ANSI SQL with similar definitions:

- hardware-supported integral types, such as `tinyint` (one byte), `smallint` (two bytes), `int` (four bytes), and `bigint` (eight bytes);
- hardware-supported floating-point types, such as `float` (single-precision, four bytes) and `double` (double-precision, eight bytes);
- `decimal` values (38 digits of precision in Hive 0.11 and 0.12; arbitrary precision in Hive 0.13.0 and later);
- date and time types, such as `timestamp` and `date`;
- string types, including `string` (of arbitrary length), `varchar[N]` (of arbitrary length but at most N characters), and `char[N]` (of exactly N characters, possibly padded with spaces);
- `boolean` values;
- `binary` values (sequences of bytes); and
- compound values made up of Hive types: homogeneous arrays with some element type, maps containing keys of one type and values of another, and C-style `struct` and `union` types.
Hive supports some widenings and narrowings between these types.¹ Among the hardware-supported numeric types, values can be widened but not narrowed.² Strings can be narrowed for use as `varchar` values; converting a `string` value to a `varchar[N]`, where N is insufficient to hold the contents of the string, will cause the string to be truncated to N characters. It is also possible (as of Hive 0.13) to supply a `decimal` argument to many numeric functions that expect a `double` input, although in most cases the function will only process a `double` approximating the supplied arbitrary-precision value.
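The truncation behavior described above can be modeled in a few lines of plain Java. This is a hypothetical sketch (the `toVarchar` helper is mine, not Hive's implementation) showing the semantics of narrowing a `string` to a `varchar[N]`:

```java
public class VarcharDemo {
    // Hypothetical model of Hive's string -> varchar[N] narrowing:
    // values longer than maxLength are silently truncated, not rejected.
    public static String toVarchar(String value, int maxLength) {
        return value.length() <= maxLength ? value : value.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(toVarchar("implicit", 4)); // prints "impl"
        System.out.println(toVarchar("ok", 4));       // prints "ok"
    }
}
```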
Hive also supports type translations to and from `string` values. Hive permits implicitly converting a value of any type (with the exception of `boolean` and `binary`) to a string. String representations of `double` or `decimal` values (but not the smaller integral or floating-point types) can also be converted to values of those types.
Hive supports widenings as part of object comparisons; the `FunctionRegistry.getCommonClassForComparison` method returns the least upper bound of two types. The code excerpt below shows how Hive also explicitly encodes which widenings and translations are permissible:
```java
// excerpted from Hive 0.12's FunctionRegistry.java
// https://github.com/apache/hive/blob/51dafbacab14eb57e603d04fdcecfec6ed6f7fd1/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L1011
public static boolean implicitConvertable(PrimitiveCategory from, PrimitiveCategory to) {
  if (from == to) {
    return true;
  }

  PrimitiveGrouping fromPg = PrimitiveObjectInspectorUtils.getPrimitiveGrouping(from);
  PrimitiveGrouping toPg = PrimitiveObjectInspectorUtils.getPrimitiveGrouping(to);

  // Allow implicit String to Double conversion
  if (fromPg == PrimitiveGrouping.STRING_GROUP && to == PrimitiveCategory.DOUBLE) {
    return true;
  }
  // Allow implicit String to Decimal conversion
  if (fromPg == PrimitiveGrouping.STRING_GROUP && to == PrimitiveCategory.DECIMAL) {
    return true;
  }
  // Void can be converted to any type
  if (from == PrimitiveCategory.VOID) {
    return true;
  }

  // Allow implicit String to Date conversion
  if (fromPg == PrimitiveGrouping.DATE_GROUP && toPg == PrimitiveGrouping.STRING_GROUP) {
    return true;
  }
  // Allow implicit Numeric to String conversion
  if (fromPg == PrimitiveGrouping.NUMERIC_GROUP && toPg == PrimitiveGrouping.STRING_GROUP) {
    return true;
  }
  // Allow implicit String to varchar conversion, and vice versa
  if (fromPg == PrimitiveGrouping.STRING_GROUP && toPg == PrimitiveGrouping.STRING_GROUP) {
    return true;
  }

  // Allow implicit conversion from Byte -> Integer -> Long -> Float -> Double
  // Decimal -> String
  Integer f = numericTypes.get(from);
  Integer t = numericTypes.get(to);
  if (f == null || t == null) {
    return false;
  }
  if (f.intValue() > t.intValue()) {
    return false;
  }
  return true;
}
```
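The last few lines of the excerpt depend on a `numericTypes` map that assigns each numeric type an ascending precedence, so that a conversion is implicit only when it moves "upward." The stand-alone sketch below illustrates that idea; the type names and their relative ordering follow the comment in the excerpt, but the map itself is my reconstruction, not Hive's code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WideningDemo {
    // Illustrative precedence ordering in the spirit of Hive's numericTypes
    // map: lower positions may be implicitly converted to higher ones.
    static final Map<String, Integer> NUMERIC_TYPES = new HashMap<>();
    static {
        List<String> ordering = List.of(
            "tinyint", "smallint", "int", "bigint",
            "float", "double", "decimal", "string");
        for (int i = 0; i < ordering.size(); i++) {
            NUMERIC_TYPES.put(ordering.get(i), i);
        }
    }

    // Mirrors the final check in implicitConvertable:
    // widenings succeed, narrowings (and unknown types) fail.
    public static boolean implicitlyConvertible(String from, String to) {
        Integer f = NUMERIC_TYPES.get(from);
        Integer t = NUMERIC_TYPES.get(to);
        return f != null && t != null && f <= t;
    }

    public static void main(String[] args) {
        System.out.println(implicitlyConvertible("int", "double"));  // true: widening
        System.out.println(implicitlyConvertible("double", "int"));  // false: narrowing
    }
}
```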
To see how Hive actually performs type coercions, we’ll have to take a step back and look at Hive’s architecture for defining functions.³ Hive has two interfaces for defining functions: `UDF`, which models a simple function with simply-typed arguments and a simply-typed return value, and `GenericUDF`, which models functions that can operate on and return values of compound types.

Subclasses of `UDF` include at least one method called `evaluate` (of arbitrary argument and return types); this is what gets called when the user-defined function is evaluated. Due to their flexible signatures, these `evaluate` methods are not specified in any interface and are instead found via Java reflection. By contrast, a `GenericUDF` must support an `initialize` method that takes an array of `ObjectInspector` instances (essentially adapters from arbitrary types to concrete object values) and an `evaluate` method taking an array of `DeferredObject` instances (essentially futures representing objects).
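To make the `UDF` convention concrete, here is a hypothetical "plain UDF"-style class. A real implementation would extend `org.apache.hadoop.hive.ql.exec.UDF` and use Hadoop writable types; both are omitted here so the sketch stands alone, but the essential point survives: the `evaluate` overloads are declared in no interface, and Hive locates a compatible one by reflection.

```java
// A hypothetical plain-UDF-style function (base class omitted): Hive would
// find a compatible evaluate overload for the actual argument types.
public class RepeatUDF {
    public String evaluate(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // Multiple overloads with different signatures are permitted.
    public String evaluate(String s) {
        return evaluate(s, 2);
    }

    public static void main(String[] args) {
        System.out.println(new RepeatUDF().evaluate("ab", 3)); // prints "ababab"
    }
}
```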
The `initialize` method in `GenericUDF` is invoked with `ObjectInspector` instances corresponding to the actual parameters; if the actuals aren’t implicitly convertible to the proper types, it will fail. Otherwise, it will return an `ObjectInspector` instance for the return type. As a simple example, see the `initialize` method in the class providing Hive’s implementation of the SQL `CONCAT_WS` function:
```java
// excerpted from Hive 0.12's GenericUDFConcatWS.java
// https://github.com/apache/hive/blob/b8250ac2f30539f6b23ce80a20a9e338d3d31458/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFConcatWS.java#L56
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
  if (arguments.length < 2) {
    throw new UDFArgumentLengthException(
        "The function CONCAT_WS(separator,[string | array(string)]+) "
            + "needs at least two arguments.");
  }

  // check if argument is a string or an array of strings
  for (int i = 0; i < arguments.length; i++) {
    switch(arguments[i].getCategory()) {
      case LIST:
        if (isStringOrVoidType(
            ((ListObjectInspector) arguments[i]).getListElementObjectInspector())) {
          break;
        }
      case PRIMITIVE:
        if (isStringOrVoidType(arguments[i])) {
          break;
        }
      default:
        throw new UDFArgumentTypeException(i, "Argument " + (i + 1)
            + " of function CONCAT_WS must be \"" + serdeConstants.STRING_TYPE_NAME
            + " or " + serdeConstants.LIST_TYPE_NAME + "<" + serdeConstants.STRING_TYPE_NAME
            + ">\", but \"" + arguments[i].getTypeName() + "\" was found.");
    }
  }

  argumentOIs = arguments;
  return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
}
```
Note that the above verifies both the correct number of arguments and the correct type of each argument before returning an `ObjectInspector` instance for writable strings. The `evaluate` method then invokes `DeferredObject.get()` on each argument, converts the results to `String` values using built-in coercions, and concatenates them together, returning the result as a text value.
Plain `UDF` instances and `GenericUDF` instances alike are stored in Hive’s function registry, but the former are first converted to `GenericUDF` instances by wrapping them in `GenericUDFBridge`, a proxy that uses Java introspection on the underlying `UDF` instance to determine the function’s expected argument types; it can then convert actual parameters to values of the appropriate types using built-in coercions at execution time.
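The bridging trick can be sketched with plain Java reflection. The code below is a toy stand-in for `GenericUDFBridge`, not Hive's implementation: the `UpperUDF` class and `invokeViaReflection` helper are hypothetical, and where a real bridge would attempt Hive's built-in coercions on mismatched arguments, this sketch simply skips the overload.

```java
import java.lang.reflect.Method;

// A toy stand-in for GenericUDFBridge: given a plain-UDF-style object,
// use introspection to find an evaluate overload whose parameter types
// accept the actual arguments, then invoke it.
public class BridgeDemo {
    // A hypothetical plain UDF with a reflectively-discovered evaluate method.
    public static class UpperUDF {
        public String evaluate(String s) {
            return s.toUpperCase();
        }
    }

    public static Object invokeViaReflection(Object udf, Object... actuals) {
        try {
            for (Method m : udf.getClass().getMethods()) {
                if (!m.getName().equals("evaluate")
                        || m.getParameterCount() != actuals.length) {
                    continue;
                }
                Class<?>[] formals = m.getParameterTypes();
                boolean compatible = true;
                for (int i = 0; i < actuals.length; i++) {
                    if (!formals[i].isInstance(actuals[i])) {
                        compatible = false; // a real bridge would try coercions here
                        break;
                    }
                }
                if (compatible) {
                    return m.invoke(udf, actuals);
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
        throw new RuntimeException("no matching evaluate overload");
    }

    public static void main(String[] args) {
        System.out.println(invokeViaReflection(new UpperUDF(), "hive")); // prints "HIVE"
    }
}
```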
Implicit conversions in Microsoft SQL Server
While we can’t examine the conversions supported in Microsoft SQL Server in as great detail as we can those in Apache Hive (since the source for SQL Server isn’t available), the published documentation indicates which conversions are supported. In brief, SQL Server supports most of the same kinds of type coercions as Hive, with the following additions:

- bidirectional implicit translation between `char[N]` and `varchar[N]` and all numeric types (not merely `double` and `decimal`, as in Hive);
- financial types (`money` and `smallmoney`), which can be implicitly translated to and from numeric types;
- bidirectional implicit translation between `timestamp` values and character and integral types;
- the `sql_variant` type, which can receive values of most types via implicit conversions but must be converted with an explicit `CAST` in contexts expecting a value of a different type; and
- various other types (`xml`, `uniqueidentifier`, and user-defined types from the CLR) with varying conversion semantics.
These additions are useful but their absence does not limit Hive’s expressive power. In the next post in this series, we’ll look at a general approach to implementing type widening, along with a specific (and statically-safe) realization of this approach using Scala’s type system.
Footnotes

1. The Hive wiki includes a full conversion matrix.↩︎
2. For example, it is permissible to use a `tinyint` where an `int` or `double` is expected, but not vice versa.↩︎
3. Actually implementing new functions in Hive is outside the scope of this post, but there are lots of resources online if you’re interested. In particular, Matthew Rathbone has a great article about extending Hive with new functions.↩︎