dask.bag.read_text

dask.bag.read_text(urlpath, blocksize=None, compression='infer', encoding='utf-8', errors='strict', linedelimiter=None, collection=True, storage_options=None, files_per_partition=None, include_path=False)

Read lines from text files

Parameters
urlpath: string or list

Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

blocksize: None, int, or str

Size, in bytes, of the blocks into which larger files are split. The default, None, streams each file without splitting it. Accepts None, an integer number of bytes, or a string like “128MiB”.

compression: string

Compression format like ‘gzip’ or ‘xz’. Defaults to ‘infer’, which picks the format from the file extension.

encoding: string
errors: string
linedelimiter: string or None
collection: bool, optional

Return a dask.bag.Bag if True, or a list of delayed values if False.

storage_options: dict

Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.

files_per_partition: None or int

If set, group input files into partitions of the requested size, instead of one partition per file. Mutually exclusive with blocksize.

include_path: bool

Whether or not to include the path in the bag. If true, elements are tuples of (line, path). Default is False.
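The effect of files_per_partition can be checked on a few local files. The following is a minimal sketch, not part of the original docstring: the file names and contents are hypothetical, and the synchronous scheduler is chosen only to keep the example self-contained.

```python
import os
import tempfile

import dask.bag as db

with tempfile.TemporaryDirectory() as tmp:
    # Write four small, hypothetical input files.
    for i in range(4):
        with open(os.path.join(tmp, f"part.{i}.txt"), "w") as f:
            f.write(f"alpha {i}\nbeta {i}\n")

    pattern = os.path.join(tmp, "part.*.txt")

    # Default behaviour: one partition per input file.
    per_file = db.read_text(pattern)
    # Group two files into each partition instead.
    grouped = db.read_text(pattern, files_per_partition=2)

    per_file_partitions = per_file.npartitions
    grouped_partitions = grouped.npartitions

    # Strip any trailing newline from each line before collecting.
    # Compute inside the `with` block, while the files still exist.
    lines = grouped.map(str.strip).compute(scheduler="synchronous")

print(per_file_partitions, grouped_partitions, len(lines))
```

Grouping files this way changes only how the lines are partitioned, not which lines the bag contains.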

Returns
dask.bag.Bag or list

dask.bag.Bag if collection is True or list of Delayed lists otherwise.
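As an illustration of the two return types, the sketch below passes collection=False and computes the resulting delayed values directly. The file names and contents are hypothetical, and the synchronous scheduler is an assumption made only to keep the example simple.

```python
import os
import tempfile

import dask
import dask.bag as db

with tempfile.TemporaryDirectory() as tmp:
    # Three hypothetical files with two lines each.
    for i in range(3):
        with open(os.path.join(tmp, f"log.{i}.txt"), "w") as f:
            f.write(f"start {i}\nstop {i}\n")

    pattern = os.path.join(tmp, "log.*.txt")

    # collection=False returns a plain list of delayed values
    # (one per partition) instead of a dask.bag.Bag.
    parts = db.read_text(pattern, collection=False)
    is_list = isinstance(parts, list)
    n_parts = len(parts)

    # Each delayed value computes to a list of lines.
    results = dask.compute(*parts, scheduler="synchronous")
    total_lines = sum(len(chunk) for chunk in results)

print(is_list, n_parts, total_lines)
```

Working with the delayed values can be convenient when the lines feed into other dask.delayed code rather than bag operations.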

See also

from_sequence

Build bag from Python sequence

Examples

>>> from dask.bag import read_text
>>> b = read_text('myfiles.1.txt')
>>> b = read_text('myfiles.*.txt')  
>>> b = read_text('myfiles.*.txt.gz')  
>>> b = read_text('s3://bucket/myfiles.*.txt')  
>>> b = read_text('s3://key:secret@bucket/myfiles.*.txt')  
>>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt')  

Parallelize a large file by providing the number of uncompressed bytes to load into each partition.

>>> b = read_text('largefile.txt', blocksize='10MB')  
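The blocksize behaviour can be verified on a small local file. This is a sketch under stated assumptions: the file name, line format, and the 2500-byte block size are made up for illustration, and the exact number of partitions is not asserted because it depends on how the file is divided.

```python
import os
import tempfile

import dask.bag as db

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "largefile.txt")

    # 1000 lines of 10 bytes each (9 characters plus the newline).
    with open(path, "w", newline="\n") as f:
        for i in range(1000):
            f.write(f"line {i:04d}\n")

    # Without blocksize the whole file is a single partition.
    whole = db.read_text(path)
    # With blocksize the file is cut into several partitions.
    split = db.read_text(path, blocksize=2500)

    whole_partitions = whole.npartitions
    split_partitions = split.npartitions

    # Splitting should not lose or duplicate lines.
    n_lines = split.count().compute(scheduler="synchronous")

print(whole_partitions, split_partitions, n_lines)
```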

Include each line's source file path in the bag by setting include_path=True:

>>> b = read_text('myfiles.*.txt', include_path=True) 
>>> b.take(1) 
(('first line of the first file', '/home/dask/myfiles.0.txt'),)
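Because each element becomes a (line, path) tuple, the path can be used to aggregate per file. The sketch below counts lines per file; the file names and contents are hypothetical, and the synchronous scheduler is an assumption made for simplicity.

```python
import os
import tempfile

import dask.bag as db

with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical files with different line counts.
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write("one\ntwo\n")
    with open(os.path.join(tmp, "b.txt"), "w") as f:
        f.write("one\ntwo\nthree\n")

    b = db.read_text(os.path.join(tmp, "*.txt"), include_path=True)

    # Each element is (line, path); keep the file name and count occurrences.
    counts = dict(
        b.map(lambda pair: os.path.basename(pair[1]))
        .frequencies()
        .compute(scheduler="synchronous")
    )

print(counts)
```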