IndexedParquetReader

class IndexedParquetReader

Reads an index file containing paths to Parquet files, which are then read and batched.

chunksize

Maximum number of rows to process at once. Large files are processed in chunks; small files are concatenated. Also passed to pyarrow.dataset.Dataset.to_batches as batch_size.
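The chunking behavior described above can be pictured with a small pure-Python sketch. This is an illustration only, not the reader's actual code: the helper names `split_into_batches` and `chunk_rows` are hypothetical, plain lists stand in for Parquet row data, and the exact grouping rule (flush before a chunk would exceed `chunksize`) is an assumption.

```python
from typing import Iterator, List


def split_into_batches(rows: List[int], batch_size: int) -> Iterator[List[int]]:
    """Split one file's rows into batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def chunk_rows(files: List[List[int]], chunksize: int) -> Iterator[List[int]]:
    """Yield chunks of at most chunksize rows.

    Large files are split across several chunks; rows from small files
    are concatenated until the next batch would overflow the chunk.
    """
    pending: List[int] = []
    for rows in files:
        for batch in split_into_batches(rows, chunksize):
            # Flush the accumulated rows before they would exceed chunksize.
            if pending and len(pending) + len(batch) > chunksize:
                yield pending
                pending = []
            pending = pending + batch
    if pending:
        yield pending


# One "large" file (7 rows) followed by two small ones (2 rows and 1 row).
files = [list(range(7)), [100, 101], [200]]
chunks = list(chunk_rows(files, chunksize=3))
print(chunks)
```

With `chunksize=3`, the 7-row file spans three chunks, and the tail of that file is combined with the 2-row file into a single chunk.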

Type:

int

batch_readahead

Number of batches to read ahead. Passed to pyarrow.dataset.Dataset.to_batches.

Type:

int

fragment_readahead

Number of fragments to read ahead. Passed to pyarrow.dataset.Dataset.to_batches.

Type:

int

use_threads

Whether to use multiple threads for reading. Passed to pyarrow.dataset.Dataset.to_batches.

Type:

bool

column_names

Names of columns to use from the input dataset. If None, use all columns.

Type:

list[str] or None

kwargs

Additional arguments to pass along to InputReader.read_index_file.

Methods

__init__([chunksize, batch_readahead, ...])

read(input_file[, read_columns])

Read the input file, or chunk of the input file.

read_index_file(input_file[, upath_kwargs])

Read an "indexed" file.

regular_file_exists(input_file, **_kwargs)

Check that the input_file points to a single regular file.

__init__(chunksize=500000, batch_readahead=16, fragment_readahead=4, use_threads=True, column_names=None, upath_kwargs=None, **kwargs)
classmethod __new__(*args, **kwargs)