get_file_reader

get_file_reader(file_format, chunksize=500000, schema_file=None, column_names=None, skip_column_names=None, type_map=None, **kwargs)

Get a generator file reader for common file types.

Currently supported formats include:

  • "csv", comma-separated values. May also be tab- or pipe-delimited. Includes .csv.gz and other compressed CSV files.

  • "fits", Flexible Image Transport System. Often used for astropy tables.

  • "parquet", compressed columnar data format

  • "ecsv", astropy’s enhanced CSV

  • "indexed_csv", “index” style reader that accepts a file with a list of CSV files that are appended in-memory

  • "indexed_parquet", “index” style reader that accepts a file with a list of parquet files that are appended in-memory
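
The “index” style of reading described above can be sketched with the standard library alone. This is only an illustration of the idea, not the implementation behind get_file_reader: the helper name read_indexed_csv, the one-path-per-line index layout, and the rule of emitting a batch once at least chunksize rows are in memory are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def read_indexed_csv(index_path, chunksize=500_000):
    """Yield batches of rows built by appending whole files listed in an index file.

    Hypothetical helper: assumes the index file holds one CSV path per line.
    """
    batch = []
    for line in Path(index_path).read_text().splitlines():
        csv_path = line.strip()
        if not csv_path:
            continue  # skip blank lines in the index
        with open(csv_path, newline="") as handle:
            batch.extend(csv.DictReader(handle))  # append this file's rows in memory
        if len(batch) >= chunksize:
            yield batch
            batch = []
    if batch:
        yield batch  # leftover rows from the final files


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    paths = []
    for i in range(3):
        part = tmp_dir / f"part_{i}.csv"
        part.write_text(f"id,ra\n{2 * i},10.0\n{2 * i + 1},11.0\n")
        paths.append(str(part))
    index_file = tmp_dir / "index.txt"
    index_file.write_text("\n".join(paths) + "\n")

    # Three files of two rows each, batched once at least three rows are in memory.
    indexed_batch_sizes = [len(b) for b in read_indexed_csv(index_file, chunksize=3)]

print(indexed_batch_sizes)
```

Because whole files are appended before the row count is checked, a batch in this sketch can hold more rows than chunksize.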

Parameters:
  • file_format (str) – specifier for the file type and extension. If using an input_path argument, we will look for files with this string as the extension.

  • chunksize (int) – number of rows to read in a single iteration. For single-file readers, large files are split into batches based on this value. For index-style readers, files are read until this chunksize is reached, and their rows are combined into a single in-memory batch.

  • schema_file (str) – path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.

  • column_names (list[str]) – for CSV files, the names of columns if no header is available. For FITS files, a list of columns to keep.

  • skip_column_names (list[str]) – for FITS files, a list of columns to remove.

  • type_map (dict) – for CSV files, the data types to use for columns

  • kwargs – additional keyword arguments to pass to the underlying file reader.
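
For single-file readers, the chunksize behavior can be pictured as a generator that hands back one batch of rows at a time. The following is a minimal sketch using only the standard library. It is not the code behind get_file_reader, and the helper name read_csv_in_chunks and the sample columns are made up for illustration.

```python
import csv
import io
from itertools import islice


def read_csv_in_chunks(handle, chunksize=500_000):
    """Yield lists of row dicts, at most `chunksize` rows per batch (hypothetical helper)."""
    rows = csv.DictReader(handle)
    while True:
        batch = list(islice(rows, chunksize))  # pull up to chunksize rows
        if not batch:
            return  # input exhausted
        yield batch


# Five data rows read three at a time give batches of 3 and 2 rows.
sample = io.StringIO(
    "id,ra,dec\n1,10.0,-5.0\n2,11.0,-4.0\n3,12.0,-3.0\n4,13.0,-2.0\n5,14.0,-1.0\n"
)
batch_sizes = [len(batch) for batch in read_csv_in_chunks(sample, chunksize=3)]
print(batch_sizes)
```

A generator keeps only one batch in memory at a time, which is the point of splitting a large file by chunksize.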