dask.dataframe.read_csv

dask.dataframe.read_csv

dask.dataframe.read_csv(urlpath, blocksize='default', lineterminator=None, compression='infer', sample=256000, sample_rows=10, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)

Read CSV files into a Dask.DataFrame

This parallelizes the pandas.read_csv() function in the following ways:

  • It supports loading many files at once using globstrings:

    >>> df = dd.read_csv('myfiles.*.csv')  
    
  • In some cases it can break up large files:

    >>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks  
    
  • It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:

    >>> df = dd.read_csv('s3://bucket/myfiles.*.csv')  
    >>> df = dd.read_csv('hdfs:///myfiles.*.csv')  
    >>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')  
    

Internally dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_csv() for more information on available keyword arguments.

Parameters
urlpathstring or list

Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

blocksizestr, int or None, optional

Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.

sampleint, optional

Number of bytes to use when determining dtypes

assume_missingbool, optional

If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.

storage_optionsdict, optional

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.

include_path_columnbool or str, optional

Whether or not to include the path to each particular file. If True a new column is added to the dataframe called path. If str, sets new column name. Default is False.

**kwargs

Extra keyword arguments to forward to pandas.read_csv().

Notes

Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:

  • Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.

  • Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.

  • Increase the size of the sample using the sample keyword.

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.