CsvPyarrowReader

class CsvPyarrowReader

CSV reader that uses the pyarrow library for reading.

This can be faster than the pandas reader, and it may handle data types differently.

chunksize

Number of bytes (not rows) of the file to process at once. Note that this differs from the chunksize used by other readers. For large files, this avoids loading the entire file into memory at once.

Type:

int

column_names

Names of columns to use from the input dataset. If None, use all columns.

Type:

list[str] or None

schema_file

Path to a parquet schema file. If provided, column names and types will match those of the schema.

Type:

str

read_options

Options for reading CSV files using pyarrow. The block_size argument is set from the value of chunksize. See https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html

Type:

csv.ReadOptions

convert_options

Options for converting CSV data to a pyarrow Table. The pyarrow schema read from schema_file is passed to the column_types property. See https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html

Type:

csv.ConvertOptions

kwargs

Arguments to pass along to pyarrow.parquet.ParquetFile. See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html

Methods

__init__(*[, chunksize, compression, ...])

read(input_file[, read_columns])

Read the input file, or chunk of the input file.

read_index_file(input_file[, upath_kwargs])

Read an "indexed" file.

regular_file_exists(input_file, **_kwargs)

Check that the input_file points to a single regular file.

__init__(*, chunksize=10485760, compression=None, column_names=None, schema_file=None, read_options=None, convert_options=None, **kwargs)
classmethod __new__(*args, **kwargs)