Data Types

We can resolve errors and increase efficiency by explicitly specifying data types. Odo uses DataShape to specify datatypes across all of the formats that it supports.

First we motivate the use of datatypes with two examples, then we talk about how to use DataShape.

Datatypes prevent errors

Consider the following CSV file:

name,balance
Alice,100
Bob,200
...
<many more lines>
...
Zelda,100.25

When odo loads this file into a new container (DataFrame, new SQL Table, etc.) it needs to know the datatypes of the source so that it can create a matching target. If the CSV file is large then it looks only at the first few hundred lines and guesses a datatype from that. In this case it might incorrectly guess that the balance column is of integer type because it doesn’t see a decimal value until very late in the file with the line Zelda,100.25. This will cause odo to create a target with the wrong datatypes which will foul up the transfer.

Odo will err unless we provide an explicit datatype. So we had this datashape:

var * {name: string, balance: int64)

But we want this one:

var * {name: string, balance: float64)

Datatypes increase efficiency

If we move that same CSV file into a binary store like HDF5 then we can significantly increase efficiency if we use fixed-length strings rather than variable length. So we might choose to push all of the names into strings of length 100 instead of leaving their lengths variable. Even with the wasted space this is often more efficient. Good binary stores can often compress away the added space but have trouble managing things of indeterminate length.

So we had this datashape:

var * {name: string, balance: float64}

But we want this one:

var * {name: string[100], balance: float64}

What is DataShape?

DataShape is a datatype system that includes scalar types:

string, int32, float64, datetime, ...

Option / missing value types:

?string, ?int32, ?float64, ?datetime, ...

Fixed length Collections:

10 * int64

Variable length Collections:

var * int64

Record types:

{name: string, balance: float64}

And any composition of the above:

10 * 10 * {x: int32, y: int32}

var * {name: string,
       payments: var * {when: ?datetime, amount: float32}}

DataShape and odo

If you want to be explicit you can add a datashape to an odo call with the dshape= keyword

>>> odo('accounts.csv', pd.DataFrame,
...      dshape='var * {name: string, balance: float64}')

This removes all of the guesswork from the odo heuristics. This can be necessary in tricky cases.

Use discover to get approximate datashapes

We rarely write out a full datashape by hand. Instead, use the discover function to get the datashape of an object.

>>> import numpy as np
>>> from odo import discover

>>> x = np.ones((5, 6), dtype='f4')
>>> discover(x)
dshape("5 * 6 * float32")

In self describing formats like numpy arrays this datashape is guaranteed to be correct and will return very quickly. In other cases like CSV files this datashape is only a guess and might need to be tweaked.

>>> from odo import odo, resource, discover
>>> csv = resource('accounts.csv')  # Have to use resource to discover URIs
>>> discover(csv)
dshape("var * {name: string, balance: int64}")

>>> ds = dshape("var * {name: string, balance: float64")  # copy-paste-modify
>>> odo('accounts.csv', pd.DataFrame, dshape=ds)

In these cases we can copy-paste the datashape and modify the parts that we need to change. In the example above we couldn’t call discover directly on the URI, 'accounts.csv', so instead we called resource on the URI first. discover returns the datashape string on all strings, regardless of whether or not we intend them to be URIs.

Learn More

DataShape is a separate project from odo. You can learn more about it at http://datashape.pydata.org/