dask.dataframe.to_parquet

dask.dataframe.to_parquet(df, path, engine='auto', compression='default', write_index=True, append=False, overwrite=False, ignore_divisions=False, partition_on=None, storage_options=None, custom_metadata=None, write_metadata_file=True, compute=True, compute_kwargs=None, schema=None, **kwargs)[source]

Store Dask.dataframe to Parquet files

Parameters
df : dask.dataframe.DataFrame
path : string or pathlib.Path

Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.

engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’

Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.

compression : string or dict, default ‘default’

Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.
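
For illustration, a minimal sketch (the column names "name" and "values" and the output path are hypothetical) requesting a different codec per column:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = dd.from_pandas(
...     pd.DataFrame({"name": ["a", "b"], "values": [1, 2]}), npartitions=1
... )
>>> # compress the "name" column with gzip and the "values" column with snappy
>>> df.to_parquet("out/", compression={"name": "gzip", "values": "snappy"})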

write_index : boolean, default True

Whether or not to write the index. Defaults to True.

append : bool, default False

If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
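
A minimal sketch of appending, assuming hypothetical yearly DataFrames df_2021 and df_2022 with matching schemas and a hypothetical "sales/" directory:

>>> # write the initial dataset, then add the new rows as extra row-group(s)
>>> df_2021.to_parquet("sales/", write_index=False)
>>> df_2022.to_parquet("sales/", append=True, write_index=False)

If both frames carry an index whose divisions overlap, ignore_divisions=True (described below) can be passed to skip the divisions check.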

overwrite : bool, default False

Whether or not to remove the contents of path before writing the dataset. The default is False. If True, the specified path must correspond to a directory (but not the current working directory). This option cannot be set to True if append=True. NOTE: overwrite=True will remove the original data even if the current write operation fails. Use at your own risk.
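
A short hedged sketch; "results/" is a hypothetical output directory whose existing contents are deleted before the new files are written:

>>> df.to_parquet("results/", overwrite=True)  # clears results/ first, then writes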

ignore_divisions : bool, default False

If False (default), an error is raised when the divisions of the existing dataset overlap with the divisions of the newly appended data. Ignored if append=False.

partition_on : list, default None

Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more data files; there will be no global groupby.
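
As an illustration, assuming the frame has hypothetical "year" and "month" columns, a Hive-style directory layout can be produced like this:

>>> # produces directories such as events/year=2023/month=1/ (file names vary by engine)
>>> df.to_parquet("events/", partition_on=["year", "month"])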

storage_options : dict, default None

Key/value pairs to be passed on to the file-system backend, if any.
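
For example, when writing to S3 via s3fs, credentials can be passed through storage_options; the bucket name and credential values below are placeholders:

>>> df.to_parquet(
...     "s3://my-bucket/output/",
...     storage_options={"key": "<aws-access-key>", "secret": "<aws-secret-key>"},
... )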

custom_metadata : dict, default None

Custom key/value metadata to include in all footer metadata (and in the global “_metadata” file, if applicable). Note that the custom metadata may not contain the reserved b"pandas" key.
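
A hedged sketch; the metadata key and value below are arbitrary byte strings chosen for illustration:

>>> df.to_parquet(
...     "out/",
...     custom_metadata={b"written_by": b"nightly-etl-job"},
... )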

write_metadata_file : bool, default True

Whether to write the special “_metadata” file.

compute : bool, default True

If True (default) then the result is computed immediately. If False then a dask.dataframe.Scalar object is returned for future computation.
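
For instance, the write can be deferred and triggered later; the object returned with compute=False supports .compute(), and "out/" is a hypothetical path:

>>> task = df.to_parquet("out/", compute=False)
>>> # ... build up other lazy work here ...
>>> task.compute()  # the files are only written at this point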

compute_kwargs : dict, default None

Options to be passed on to the compute method.

schema : Schema object, dict, or {“infer”, None}, default None

Global schema to use for the output dataset. Alternatively, a dict of pyarrow types can be specified (e.g. schema={"id": pa.string()}). In that case, fields excluded from the dictionary will be inferred from _meta_nonempty. If “infer”, the first non-empty and non-null partition will be used to infer the type for “object” columns. If None (default), the backend infers the schema for each distinct output partition. If the partitions produce inconsistent schemas, pyarrow will throw an error when writing the shared _metadata file. Note that this argument is ignored by the “fastparquet” engine.
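
As a small illustration with the pyarrow engine, a partial schema can pin the type of a hypothetical "id" column while the remaining columns are inferred:

>>> import pyarrow as pa
>>> df.to_parquet("out/", engine="pyarrow", schema={"id": pa.string()})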

**kwargs :

Extra options to be passed on to the specific backend.

See also

read_parquet

Read parquet data to dask.dataframe

Notes

Each partition will be written to a separate file.

Examples

>>> df = dd.read_csv(...)  
>>> df.to_parquet('/path/to/output/', ...)
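
A slightly fuller, hedged variation; the S3 paths and the "year" column are hypothetical:

>>> import dask.dataframe as dd
>>> df = dd.read_csv("s3://my-bucket/raw/*.csv")
>>> df.to_parquet(
...     "s3://my-bucket/clean/",
...     engine="pyarrow",
...     compression="snappy",
...     partition_on=["year"],
...     write_metadata_file=True,
... )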