IndexedParquetReader#

class IndexedParquetReader#

Reads an index file containing paths to the parquet files to be read and batched.
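As a rough illustration only: assuming the index file is plain text with one parquet path per line (an assumption, since this page does not spell out the format), such a file could be written and parsed as sketched here. The helper `read_index_lines` and the file names are hypothetical and are not part of the library.

```python
import tempfile
from pathlib import Path


def read_index_lines(index_path):
    """Return the non-empty, stripped lines of a text index file.

    Hypothetical helper: assumes one parquet path per line.
    """
    with open(index_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = Path(tmp_dir) / "index.txt"
    # Illustrative paths only; the parquet files need not exist to parse the index.
    index_path.write_text(
        "data/part_000.parquet\ndata/part_001.parquet\n\n", encoding="utf-8"
    )
    parquet_paths = read_index_lines(index_path)
```

Blank lines are skipped so that a trailing newline in the index does not produce an empty path.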

chunksize#

Maximum number of rows to process at once. Files larger than this are processed in chunks; smaller files are concatenated. Also passed to pyarrow.dataset.Dataset.to_batches as batch_size.

Type: int
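The chunk-or-concatenate behavior described for chunksize can be pictured with a small pure-Python sketch. It is not the reader's actual implementation: it only tracks row counts, the function name `accumulate_chunks` is hypothetical, and it assumes each incoming batch already holds at most `chunksize` rows.

```python
def accumulate_chunks(batch_sizes, chunksize):
    """Group consecutive batch sizes so each group totals at most `chunksize` rows.

    Hypothetical sketch: assumes no single batch exceeds `chunksize`.
    """
    pending, pending_rows = [], 0
    for size in batch_sizes:
        # Emit the pending group before it would overflow the limit.
        if pending and pending_rows + size > chunksize:
            yield pending
            pending, pending_rows = [], 0
        pending.append(size)
        pending_rows += size
    # Emit whatever is left after the last batch.
    if pending:
        yield pending


groups = list(accumulate_chunks([400, 300, 500, 1000, 100], chunksize=1000))
```

Here the two small batches are merged into one group, while the batches that would push a group past the limit each start a new one.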

batch_readahead#

Number of batches to read ahead. Passed to pyarrow.dataset.Dataset.to_batches.

Type: int

fragment_readahead#

Number of fragments to read ahead. Passed to pyarrow.dataset.Dataset.to_batches.

Type: int

use_threads#

Whether to use multiple threads for reading. Passed to pyarrow.dataset.Dataset.to_batches.

Type: bool

column_names#

Names of columns to use from the input dataset. If None, use all columns.

Type: list[str] or None

kwargs#

Additional arguments to pass along to InputReader.read_index_file.

Methods

`__init__`([chunksize, batch_readahead, ...])
`read`(input_file[, read_columns])	Read the input file, or chunk of the input file.
`read_index_file`(input_file[, upath_kwargs])	Read an "indexed" file.
`regular_file_exists`(input_file, **_kwargs)	Check that input_file points to a single regular file.

__init__(chunksize=500000, batch_readahead=16, fragment_readahead=4, use_threads=True, column_names=None, upath_kwargs=None, **kwargs)#

classmethod __new__(*args, **kwargs)#
