dask.dataframe.from_pandas

dask.dataframe.from_pandas(data, npartitions=None, chunksize=None, sort=True, name=None)[source]

Construct a Dask DataFrame from a Pandas DataFrame

This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel. By default, the input dataframe will be sorted by the index to produce cleanly-divided partitions (with known divisions). To preserve the input ordering, make sure the input index is monotonically-increasing. The sort=False option will also avoid reordering, but will not result in known divisions.

Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.

Parameters
datapandas.DataFrame or pandas.Series

The DataFrame/Series with which to construct a Dask DataFrame/Series

npartitionsint, optional

The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

chunksizeint, optional

The number of rows per index partition to use.

sort: bool

Sort the input by index first to obtain cleanly divided partitions (with known divisions). If False, the input will not be sorted, and all divisions will be set to None. Default is True.

name: string, optional

An optional keyname for the dataframe. Defaults to hashing the input

Returns
dask.DataFrame or dask.Series

A dask DataFrame/Series partitioned along the index

Raises
TypeError

If something other than a pandas.DataFrame or pandas.Series is passed in.

See also

from_array

Construct a dask.DataFrame from an array that has record dtype

read_csv

Construct a dask.DataFrame from a CSV file

Examples

>>> from dask.dataframe import from_pandas
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))