dask.dataframe.DataFrame.to_parquet
- DataFrame.to_parquet(path, *args, **kwargs)
Store a Dask DataFrame to Parquet files.
- Parameters
- df : dask.dataframe.DataFrame
- path : string or pathlib.Path
Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.
- engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’
Parquet library to use. Options include: ‘auto’, ‘fastparquet’, and ‘pyarrow’. Defaults to ‘auto’, which uses fastparquet if it is installed, and falls back to pyarrow otherwise. Note that in the future this default ordering for ‘auto’ will switch, with pyarrow being used if it is installed, and falling back to fastparquet.
- compression : string or dict, default ‘snappy’
Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. Defaults to "snappy".
- write_index : boolean, default True
Whether or not to write the index. Defaults to True.
- append : bool, default False
If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
- overwrite : bool, default False
Whether or not to remove the contents of path before writing the dataset. The default is False. If True, the specified path must correspond to a directory (but not the current working directory). This option cannot be set to True if append=True. NOTE: overwrite=True will remove the original data even if the current write operation fails. Use at your own risk.
- ignore_divisions : bool, default False
If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.
- partition_on : list, default None
Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more data files; there will be no global groupby. See the sketch following the parameter list for an example.
- storage_options : dict, default None
Key/value pairs to be passed on to the file-system backend, if any.
- custom_metadata : dict, default None
Custom key/value metadata to include in all footer metadata (and in the global “_metadata” file, if applicable). Note that the custom metadata may not contain the reserved b"pandas" key.
- write_metadata_file : bool or None, default None
Whether to write the special _metadata file. If None (the default), a _metadata file will only be written if append=True and the dataset already has a _metadata file.
- compute : bool, default True
If True (default) then the result is computed immediately. If False then a dask.dataframe.Scalar object is returned for future computation.
- compute_kwargs : dict, default None
Options to be passed in to the compute method.
- schema : Schema object, dict, or {“infer”, None}, default None
Global schema to use for the output dataset. Alternatively, a dict of pyarrow types can be specified (e.g. schema={“id”: pa.string()}). For this case, fields excluded from the dictionary will be inferred from _meta_nonempty. If “infer”, the first non-empty and non-null partition will be used to infer the type for “object” columns. If None (default), we let the backend infer the schema for each distinct output partition. If the partitions produce inconsistent schemas, pyarrow will throw an error when writing the shared _metadata file. Note that this argument is ignored by the “fastparquet” engine.
- name_function : callable, default None
Function to generate the filename for each output partition. The function should accept an integer (partition index) as input and return a string which will be used as the filename for the corresponding partition. Should preserve the lexicographic order of partitions. If not specified, files will be created using the convention part.0.parquet, part.1.parquet, part.2.parquet, … and so on for each partition in the DataFrame.
- **kwargs
Extra options to be passed on to the specific backend.
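As a sketch of how compression and partition_on can be combined (the column names name, value, and year and the output path below are hypothetical), the call writes one subdirectory per distinct year and applies a different codec to each remaining column:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({"name": ["a", "b", "c", "d"],
...                     "value": [1, 2, 3, 4],
...                     "year": [2020, 2020, 2021, 2021]})
>>> df = dd.from_pandas(pdf, npartitions=2)
>>> df.to_parquet('/path/to/output/',
...               compression={"name": "gzip", "value": "snappy"},  # per-column codecs
...               partition_on=["year"])  # hive-style year=... subdirectories
Each distinct value of year becomes its own subdirectory on disk, and each Dask partition may still contribute one or more files within those subdirectories.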
See also
read_parquet
Read parquet data to dask.dataframe
Notes
Each partition will be written to a separate file.
Examples
>>> df = dd.read_csv(...)
>>> df.to_parquet('/path/to/output/', ...)
By default, files will be created in the specified output directory using the convention part.0.parquet, part.1.parquet, part.2.parquet, … and so on for each partition in the DataFrame. To customize the names of each file, you can use the name_function= keyword argument. The function passed to name_function will be used to generate the filename for each partition and should expect a partition’s index integer as input and return a string which will be used as the filename for the corresponding partition. Strings produced by name_function must preserve the order of their respective partition indices.
For example:
>>> name_function = lambda x: f"data-{x}.parquet"
>>> df.to_parquet('/path/to/output/', name_function=name_function)
will result in the following files being created:
/path/to/output/
├── data-0.parquet
├── data-1.parquet
├── data-2.parquet
└── ...
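The same call can target remote storage by prefixing the path with a protocol and passing file-system options through storage_options. A minimal sketch, assuming an S3 bucket reachable via s3fs (the bucket name and credential placeholders are not real values):
>>> df.to_parquet('s3://bucket/output/',
...               storage_options={"key": "...", "secret": "..."})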
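To schedule the write together with other work instead of running it immediately, one pattern is to pass compute=False and trigger the computation later; the name write_task below is only for illustration:
>>> import dask
>>> write_task = df.to_parquet('/path/to/output/', compute=False)
>>> dask.compute(write_task)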
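Appending to an existing dataset is sketched below; df_new stands in for a DataFrame with a schema matching the data already on disk, and ignore_divisions=True is only needed when the new divisions overlap the existing ones:
>>> df.to_parquet('/path/to/output/')
>>> df_new.to_parquet('/path/to/output/', append=True, ignore_divisions=True)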