Spark/SparkSQL

Dependencies

  • spark
  • pyhive

Setup

We recommend you install Spark via conda from the blaze binstar channel:

$ conda install pyhive spark -c blaze

The package is known to work well on Ubuntu Linux and Mac OS X. Issues may arise when installing on other Linux distributions; in particular, there's a known issue with Arch Linux.
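To sanity-check the install, make sure both packages import cleanly. This is a minimal check; it confirms only that the Python packages are on your path, not that Spark itself is runnable:

>>> import pyspark  # Spark's Python bindings, installed above
>>> import pyhive   # Hive support, from the same channel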

Interface

Spark diverges a bit from other odo backends because of how it manages state. With Spark, every object is attached to a special object called a SparkContext, and only one SparkContext can be running at a time. SparkSQL objects, in turn, live inside one or more SQLContext objects, and each SQLContext must itself be attached to a SparkContext.

Here’s an example of how to set up a SparkContext:

>>> from pyspark import SparkContext
>>> sc = SparkContext('local', 'app')  # master URL first, then the application name

Next we create a SQLContext:

>>> from pyspark.sql import SQLContext
>>> sql = SQLContext(sc)  # from the previous code block

From here, you can start using odo to create SchemaRDD objects, which are the SparkSQL version of a table:

>>> from odo import odo
>>> data = [('Alice', 300.0), ('Bob', 200.0), ('Donatello', -100.0)]
>>> type(sql)
<class 'pyspark.sql.SQLContext'>
>>> srdd = odo(data, sql, dshape='var * {name: string, amount: float64}')
>>> type(srdd)
<class 'pyspark.sql.SchemaRDD'>

Note the type of srdd. Usually odo(A, B) returns an instance of B if B is a type. With Spark and SparkSQL, whatever we create must be attached to a context, so we “append” to an existing SparkContext/SQLContext. Instead of returning the context object, odo returns the SchemaRDD that we just created, which is far more convenient to work with than the context itself.
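Because odo hands back the SchemaRDD itself, it can be fed straight into further conversions. A minimal sketch using the SchemaRDD <-> list path described under Conversions below:

>>> rows = odo(srdd, list)  # pull the rows back out of Spark into a Python list
>>> len(rows)
3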

This functionality is nascent, so try it out and don’t hesitate to report a bug or request a feature!

URIs

URI syntax isn’t currently implemented for Spark objects.

Conversions

The main paths into and out of RDD and SchemaRDD are through Python list objects:

RDD <-> list
SchemaRDD <-> list

Additionally, there’s a specialized one-way path for going directly to SchemaRDD from RDD:

RDD -> SchemaRDD
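
As a sketch of how these paths compose, reusing sc, sql, and data from the Interface section (and assuming that the list -> RDD conversion follows the same append-to-a-context pattern as SchemaRDD):

>>> rdd = odo(data, sc)  # list -> RDD, appended to the SparkContext
>>> srdd2 = odo(rdd, sql, dshape='var * {name: string, amount: float64}')  # RDD -> SchemaRDD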

TODO

  • Resource/URIs
  • Native loaders for JSON and possibly CSV
  • HDFS integration