CsvPyarrowReader
- class CsvPyarrowReader
CSV reader that uses the pyarrow library for reading.
This can be faster than the pandas reader, and can have different handling of data types.
- chunksize
Number of bytes of the file to process at once. Note that this differs from the chunksize used by other readers. For large files, this can prevent loading the entire file into memory at once.
- Type:
int
- column_names
Names of columns to use from the input dataset. If None, use all columns.
- Type:
list[str] or None
- schema_file
Path to a parquet schema file. If provided, column names and types will match those of the schema.
- Type:
str
- read_options
Options for reading CSV files using pyarrow. The block_size argument is set based on the value for chunksize. See https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
- Type:
csv.ReadOptions
- convert_options
Options for converting CSV data to a pyarrow Table. The pyarrow schema from schema_file is passed to the column_types property. See https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
- Type:
csv.ConvertOptions
- kwargs
Arguments to pass along to pyarrow.parquet.ParquetFile. See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html
Methods
- __init__(*[, chunksize, compression, ...])
- read(input_file[, read_columns])
Read the input file, or chunk of the input file.
- read_index_file(input_file[, upath_kwargs])
Read an "indexed" file.
- regular_file_exists(input_file, **_kwargs)
Check that the input_file points to a single regular file.
- __init__(*, chunksize=10485760, compression=None, column_names=None, schema_file=None, read_options=None, convert_options=None, **kwargs)
- classmethod __new__(*args, **kwargs)