dask.bag.read_text

dask.bag.read_text(urlpath, blocksize=None, compression='infer', encoding='utf-8', errors='strict', linedelimiter=None, collection=True, storage_options=None, files_per_partition=None, include_path=False)

Read lines from text files

Parameters:

urlpath: string or list: Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files, pass a globstring or a list of paths; all paths must use the same protocol.
blocksize: None, int, or str: Size of the blocks into which larger files are cut. Defaults to None, which streams each file instead of splitting it. Can also be an integer number of bytes or a string like “128MiB”.
compression: string: Compression format like ‘gzip’ or ‘xz’. Defaults to ‘infer’
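encoding: string: Text encoding used to decode the files. Defaults to ‘utf-8’
errors: string: How decoding errors are handled. Defaults to ‘strict’
linedelimiter: string or None: String on which lines are split. Defaults to None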
collection: bool, optional: Return dask.bag if True, or list of delayed values if False
storage_options: dict: Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.
files_per_partition: None or int: If set, group input files into partitions of the requested size, instead of one partition per file. Mutually exclusive with blocksize.
include_path: bool: Whether or not to include the path in the bag. If True, elements are tuples of (line, path). Default is False.

Returns:

dask.bag.Bag or list: dask.bag.Bag if collection is True or list of Delayed lists otherwise.
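
The two return types can be compared with a small sketch. The temporary directory, the file names, and the file contents below are made up for illustration, and the synchronous scheduler is chosen only to keep the example simple to run.

```python
import os
import tempfile

import dask
import dask.bag as db

# Hypothetical sample data: two small text files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f"sample.{i}.txt"), "w") as f:
        f.write(f"file {i} line a\nfile {i} line b\n")

pattern = os.path.join(tmpdir, "sample.*.txt")

# collection=True (the default) returns a Bag.
bag = db.read_text(pattern)

# collection=False returns a plain list of Delayed objects instead.
parts = db.read_text(pattern, collection=False)

# Each Delayed evaluates to a list of lines; trailing newlines are
# stripped here only to make the result easier to read.
(first,) = dask.compute(parts[0], scheduler="synchronous")
first_lines = [line.rstrip("\n") for line in first]
```

The list form can be useful when the per-file results need to be combined with other delayed computations before a collection is built.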

See also

from_sequence: Build bag from Python sequence

Examples

>>> from dask.bag import read_text
>>> b = read_text('myfiles.1.txt')
>>> b = read_text('myfiles.*.txt')
>>> b = read_text('myfiles.*.txt.gz')
>>> b = read_text('s3://bucket/myfiles.*.txt')
>>> b = read_text('s3://key:secret@bucket/myfiles.*.txt')
>>> b = read_text('hdfs://namenode.example.com/myfiles.*.txt')
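
When a glob matches many small files, the files_per_partition parameter described above groups several files into each partition. The following is a minimal sketch; the six sample files and their contents are invented for the example.

```python
import os
import tempfile

import dask.bag as db

# Hypothetical sample data: six one-line files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for i in range(6):
    with open(os.path.join(tmpdir, f"myfiles.{i}.txt"), "w") as f:
        f.write(f"only line of file {i}\n")

pattern = os.path.join(tmpdir, "myfiles.*.txt")

# Default behaviour: one partition per file.
per_file = db.read_text(pattern)

# Group three files into each partition instead.
grouped = db.read_text(pattern, files_per_partition=3)

n_per_file = per_file.npartitions
n_grouped = grouped.npartitions
n_lines = grouped.count().compute(scheduler="synchronous")
```

Here the grouped bag holds the same lines in fewer partitions, which tends to reduce task overhead when individual files are very small.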

Parallelize a large file by providing the number of uncompressed bytes to load into each partition.

>>> b = read_text('largefile.txt', blocksize='10MB')
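
The effect of blocksize on partitioning can be checked on a small file. This is only a sketch: the file is generated on the spot, and the 1000-byte block size is far smaller than anything sensible for real data.

```python
import os
import tempfile

import dask.bag as db

# Hypothetical sample data: one file with 1000 short lines (a few kilobytes).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "largefile.txt")
with open(path, "w", newline="\n") as f:
    for i in range(1000):
        f.write(f"line {i}\n")

# Without blocksize, the whole file ends up in a single partition.
streamed = db.read_text(path)

# A small blocksize (in bytes) cuts the same file into several partitions.
chunked = db.read_text(path, blocksize=1000)

n_streamed = streamed.npartitions
n_chunked = chunked.npartitions

# The total number of lines is unchanged by the partitioning.
n_lines = chunked.count().compute(scheduler="synchronous")
```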

Get file paths of the bag by setting include_path=True

>>> b = read_text('myfiles.*.txt', include_path=True)
>>> b.take(1)
(('first line of the first file', '/home/dask/myfiles.0.txt'),)
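
Because each element is a (line, path) tuple when include_path=True, downstream steps can use the path as a key. The sketch below assumes three made-up files and reduces each record to a (file name, stripped line) pair.

```python
import os
import tempfile

import dask.bag as db

# Hypothetical sample data: three one-line files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"myfiles.{i}.txt"), "w") as f:
        f.write(f"hello from file {i}\n")

b = db.read_text(os.path.join(tmpdir, "myfiles.*.txt"), include_path=True)

# rec[0] is the line and rec[1] is the path it came from.
pairs = b.map(lambda rec: (os.path.basename(rec[1]), rec[0].strip()))
result = pairs.compute(scheduler="synchronous")
```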
