- dask.dataframe.read_csv(urlpath, blocksize='default', lineterminator=None, compression='infer', sample=256000, sample_rows=10, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)
Read CSV files into a Dask.DataFrame
This parallelizes the pandas.read_csv() function in the following ways:
It supports loading many files at once using globstrings:
>>> df = dd.read_csv('myfiles.*.csv')
In some cases it can break up large files:
>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:
>>> df = dd.read_csv('s3://bucket/myfiles.*.csv')
>>> df = dd.read_csv('hdfs:///myfiles.*.csv')
>>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')
Internally dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_csv() for more information on available keyword arguments.
- urlpath : string or list
Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
- blocksize : str, int or None, optional
Number of bytes by which to cut up larger files. The default value is computed based on available physical memory and the number of cores, up to a maximum of 64MB. Can be a number like 64000000 or a string like "64MB". If None, a single block is used for each file.
- sample : int, optional
Number of bytes to use when determining dtypes.
- assume_missing : bool, optional
If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
- storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
- include_path_column : bool or str, optional
Whether or not to include the path to each particular file. If True, a new column called path is added to the dataframe. If str, sets the new column name. Default is False.
- **kwargs
Extra keyword arguments to forward to pandas.read_csv().
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:
Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
Increase the size of the sample using the sample keyword.
Note also that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.