dask.dataframe.read_orc


dask.dataframe.read_orc(path, engine='pyarrow', columns=None, index=None, split_stripes=1, aggregate_files=None, storage_options=None)

Read a DataFrame from one or more ORC files.

Parameters:

path: str or list(str): Location of the file(s). This can be a full URL with a protocol specifier, and may include glob characters if given as a single string.
engine: 'pyarrow' or ORCEngine: Backend ORC engine to use for I/O. Default is 'pyarrow'.
columns: None or list(str): Columns to load. If None, loads all.
index: str: Column name to set as index.
split_stripes: int or False: Maximum number of ORC stripes to include in each output-DataFrame partition. Use False to specify a 1-to-1 mapping between files and partitions. Default is 1.
aggregate_files: bool, default False: Whether distinct file paths may be aggregated into the same output partition. True means that any two file paths may be aggregated into the same output partition, while False means that inter-file aggregation is prohibited.
storage_options: None or dict: Further parameters to pass to the bytes backend.

Returns:

dask.dataframe.DataFrame (even if there is only one column)

Examples

>>> import dask.dataframe as dd
>>> df = dd.read_orc('https://github.com/apache/orc/raw/'
...                  'master/examples/demo-11-zlib.orc')
