Hive Metastore

The Hive metastore relates SQL metadata to files on the Hadoop File System (HDFS). It is similar to a SQL database in that it contains information about SQL tables, but dissimilar in that the data isn’t stored within Hive itself; it remains as ordinary files on HDFS.

Odo interacts with Hive mostly through SQLAlchemy, with a bit of custom code to handle Hive's peculiarities.

URIs

Hive URIs match SQLAlchemy connection strings exactly, using the hive:// protocol. Additionally, Impala, another SQL database on HDFS, can connect to the same tables.

hive://hostname
hive://user@hostname:port/database-name
hive://user@hostname:port/database-name::table-name

impala://hostname::table-name
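
For example, such a URI can be handed to odo's resource function to get a handle on the underlying table. A minimal sketch, where the user, hostname, port, database, and table name are all placeholders:

>>> from odo import resource
>>> t = resource('hive://hdfs@hostname:10000/default::mytable')  # placeholder names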

You should also consult the docs on HDFS, given the tight integration between the two systems.

Options

Hive tables have a few non-standard options on top of normal SQL:

stored_as - File format on disk, like TEXTFILE, PARQUET, or ORC
path - Absolute path to the file location on HDFS
external=True - Whether to keep the files external or move them into the Hive directory

See Hive DDL docs for more information.
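
These options are passed as keyword arguments through odo when creating the table. A minimal sketch, assuming placeholder hostnames, paths, and a table name:

>>> from odo import odo
>>> odo('hdfs://hostname::/path/to/data/*.csv',   # placeholder source files
...     'hive://hostname::mytable',               # placeholder target table
...     stored_as='TEXTFILE',
...     path='/path/to/data/',
...     external=True)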

Conversions

We commonly load CSV files into Hive, either from HDFS or from the local disk of one of the machines that comprise the HDFS cluster:

HDFS(Directory(CSV)) -> Hive
SSH(Directory(CSV)) -> Hive
SSH(CSV) -> Hive
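
A sketch of one such load, using placeholder user, hostnames, paths, and table name:

>>> from odo import odo
>>> odo('ssh://user@hostname:/path/to/data/*.csv',  # placeholder CSV files on a cluster machine
...     'hive://hostname::mytable')                 # placeholder Hive table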

Additionally, we can use Hive to efficiently migrate this data into a new table stored in a different format (see the Parquet example at the end of the Impala section):

Hive -> Hive

And, as with all SQL systems accessed through SQLAlchemy, we can convert a Hive table into a Python iterator, though this is somewhat slow:

Hive -> Iterator
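
A sketch of pulling a Hive table into Python, with placeholder host and table names. Converting to a list materializes every row in local memory, so this is only sensible for small results:

>>> from odo import odo
>>> rows = odo('hive://hostname::mytable', list)  # placeholder names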

Impala

Impala operates on the same data as Hive and is generally faster, though it also has a couple of quirks.

While Impala connects to the same metastore, it must connect to one of the worker nodes rather than the head node to which Hive connects. After you load data into Hive you need to send the invalidate metadata command to Impala.

>>> from odo import odo, resource
>>> odo('hdfs://hostname::/path/to/data/*.csv', 'hive://hostname::table')

>>> imp = resource('impala://workernode')
>>> imp.connect().execute('invalidate metadata')

This is arguably something that odo should handle in the future. After this, all tables in Hive are also available to Impala.

You may want to transform your data into Parquet format for efficient querying. A two-minute query on Hive over CSV might take one minute on Hive over Parquet, and only three seconds on Impala over Parquet.

>>> odo('hive://hostname::table', 'hive://hostname::table_parquet',
...     external=False, stored_as='PARQUET')