Natural join is a useful special case of the relational join operation (and is extremely common when denormalizing data pulled in from a relational database). Spark’s DataFrame API provides an expressive way to specify arbitrary joins, but it would be nice to have some machinery to make the simple case of natural join as easy as possible. Here’s what a natural join needs to do:
For relations R and S, identify the columns they have in common, say c1 and c2;
join R and S on the condition that R.c1 == S.c1 and R.c2 == S.c2; and
project away the duplicated columns.
so, in Spark, a natural join would look like this:
We can generalize this as follows (note that joining two frames with no columns in common will produce an empty frame):
Furthermore, we can make this operation available to any DataFrame via implicit conversions:
If you’re interested in using this code in your own projects, simply add the Silex library to your project and import com.redhat.et.silex.frame._. (You can also get Silex via bintray.)