Type Modifiers
Odo decides what conversion functions to run based on the type (e.g. pd.DataFrame, sqlalchemy.Table, odo.CSV) of the input. In many cases we want slight variations to signify different circumstances, such as the difference between the following CSV files:
- A local CSV file
- A sequence of CSV files
- A CSV file on a remote machine
- A CSV file on HDFS
- A CSV file on S3
- A temporary CSV file that should be deleted when we’re done
In principle we need to create subclasses for each of these and for their JSON, TextFile, etc. equivalents. To assist with this, we provide functions that create these subclasses for us. These functions are named as follows:
chunks - a sequence of data in chunks
SSH - data living on a remote machine
HDFS - data living on the Hadoop Distributed File System
S3 - data living on Amazon's S3
Directory - a directory of data
Temp - a temporary piece of data to be garbage collected
We use these functions on types to construct new types.
>>> SSH(CSV)('/path/to/data', delimiter=',', user='ubuntu')
>>> Directory(JSON)('/path/to/data/')
We compose these functions to specify more complex situations, such as a temporary directory of JSON data living on S3:
>>> Temp(S3(Directory(JSONLines)))
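To illustrate how a function can build such subclasses on demand, here is a minimal sketch in plain Python. It is not odo's implementation: `make_modifier`, `Remote`, `Temporary`, and `Csv` are hypothetical names used only for illustration, and the real modifiers may work differently.

```python
def make_modifier(prefix):
    """Return a function that maps a class to a cached, prefixed subclass."""
    cache = {}

    def modifier(cls):
        # Reuse the subclass if this modifier was already applied to cls,
        # so that modifier(cls) is modifier(cls) holds.
        if cls not in cache:
            name = '%s(%s)' % (prefix, cls.__name__)
            cache[cls] = type(name, (cls,), {})
        return cache[cls]

    return modifier


class Csv(object):
    """Hypothetical stand-in for a resource type such as odo.CSV."""

    def __init__(self, path, **kwargs):
        self.path = path
        self.options = kwargs


# Hypothetical modifiers, loosely analogous to SSH and Temp.
Remote = make_modifier('Remote')
Temporary = make_modifier('Temporary')

RemoteCsv = Remote(Csv)                       # a new type built from Csv
remote_file = RemoteCsv('/path/to/data', delimiter=',')
nested = Temporary(Remote(Csv))               # modifiers compose
```

Caching matters in a design like this: because applying the same modifier to the same class always returns the same subclass object, code that selects a function based on type sees a stable type.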
Use URIs
Most users don’t interact with these types. They are for internal use by developers to specify the situations in which a function should be called.
chunks
A particularly important type modifier is chunks, which signifies an iterable of some other type. For example, chunks(list) means an iterable of Python lists and chunks(pd.DataFrame) an iterable of DataFrames. The chunks modifier is often used to convert between two out-of-core formats via an in-core format. This is also a nice mechanism to interact with data in an online fashion:
>>> from odo import odo, chunks
>>> import pandas as pd
>>> seq = odo('postgresql://localhost::mytable', chunks(pd.DataFrame))
>>> for df in seq:
...     pass  # work on each DataFrame sequentially
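Working in an online fashion means keeping only a small running state while each chunk is loaded, used, and discarded. The following is a minimal sketch that uses plain Python lists in place of DataFrames so that it runs without a database. `fake_chunks` and `running_mean` are hypothetical names, and the generator merely stands in for whatever iterable of chunks odo would return.

```python
def fake_chunks():
    """Stand-in for an iterable of chunks; yields one list at a time."""
    yield [1.0, 2.0, 3.0]
    yield [4.0, 5.0]
    yield [6.0]


def running_mean(chunk_iter):
    """Compute the mean over all chunks while holding one chunk at a time."""
    total = 0.0
    count = 0
    for chunk in chunk_iter:
        # Only the running total and count survive between chunks.
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float('nan')


mean = running_mean(fake_chunks())
```

The same loop shape applies to an iterable of DataFrames: update a small accumulator per chunk rather than concatenating all chunks first.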
chunks may also be used to write an iterable of chunks into another resource. For example, we may use chunks to write a sequence of NumPy arrays into a PostgreSQL table while only ever holding one whole array in memory, like so:
>>> import numpy as np
>>> from odo import odo, chunks
>>> seq = (np.random.randn(5, 3) for _ in range(3))
>>> odo(chunks(np.ndarray)(seq), 'postgresql://localhost::mytable')
chunks(type_)(seq) is merely a small box wrapping the inner sequence that allows odo to know the types of the elements in the sequence. We may still use this sequence as we would any other, including looping over it.
Because this wraps the inner sequence, we may only iterate over the chunks multiple times if the inner sequence supports being iterated over more than once. For example:
>>> from odo import chunks
>>> CL = chunks(list)
>>> multiple_iteration_seq = CL([[0, 1, 2], [3, 4, 5]])
>>> tuple(multiple_iteration_seq)
([0, 1, 2], [3, 4, 5])
>>> tuple(multiple_iteration_seq)
([0, 1, 2], [3, 4, 5])
>>> single_iteration_seq = CL(iter([[0, 1, 2], [3, 4, 5]]))
>>> tuple(single_iteration_seq)
([0, 1, 2], [3, 4, 5])
>>> tuple(single_iteration_seq)
()
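The behavior above follows from plain Python iteration rules rather than anything specific to odo. As a rough sketch, a hypothetical wrapper class (here called `Box`, not odo's actual class) that only forwards iteration to its inner object shows the same effect, along with one possible workaround:

```python
class Box(object):
    """Hypothetical wrapper that forwards iteration to the inner sequence."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Delegates to the wrapped object, so iterating again works only
        # if the wrapped object itself can be iterated more than once.
        return iter(self.data)


def gen():
    yield [0, 1, 2]
    yield [3, 4, 5]


one_shot = Box(gen())
first = tuple(one_shot)
second = tuple(one_shot)  # the generator is already exhausted here

# Materializing the chunks in a list makes them re-iterable, at the cost
# of holding every chunk in memory at once.
reusable = Box(list(gen()))
```

Materializing into a list defeats the memory savings of chunking, so it only makes sense when the full set of chunks is known to fit in memory.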