hipscat_import.catalog.file_readers#

File reading generators for common file types.

Module Contents#

Classes#

InputReader

Base class for chunking file readers.

CsvReader

CSV reader for the most common CSV reading arguments.

AstropyEcsvReader

Reads astropy ascii .ecsv files.

FitsReader

Chunked FITS file reader.

ParquetReader

Parquet reader for the most common Parquet reading arguments.

Functions#

get_file_reader(file_format[, chunksize, schema_file, ...])

Get a generator file reader for common file types.

get_file_reader(file_format, chunksize=500000, schema_file=None, column_names=None, skip_column_names=None, type_map=None, **kwargs)[source]#

Get a generator file reader for common file types.

Parameters:
  • file_format (str) –

    specifier for the file type and extension. Currently supported formats include:

    • csv, comma-separated values. May also be tab- or pipe-delimited. Includes .csv.gz and other compressed CSV files.

    • fits, Flexible Image Transport System. Often used for astropy tables.

    • parquet, compressed columnar data format.

  • chunksize (int) – Number of rows to read in a single iteration.

  • schema_file (str) – Path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.

  • column_names (list[str]) – For CSV files, the names of columns if no header is available. For FITS files, a list of columns to keep.

  • skip_column_names (list[str]) – For FITS files, a list of columns to remove.

  • type_map (dict) – For CSV files, the data types to use for columns.
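
A minimal usage sketch (the file name below is a placeholder, not part of the library): the returned reader exposes a read method that yields pandas DataFrames of at most chunksize rows.

    from hipscat_import.catalog.file_readers import get_file_reader

    # Build a chunked CSV reader; "my_catalog.csv" is a hypothetical path.
    reader = get_file_reader("csv", chunksize=100_000)
    for chunk in reader.read("my_catalog.csv"):
        # Each chunk is a pandas DataFrame with up to 100_000 rows.
        print(len(chunk), list(chunk.columns))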

class InputReader[source]#

Bases: abc.ABC

Base class for chunking file readers.

abstract read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

abstract provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.

regular_file_exists(input_file, storage_options: Dict[Any, Any] | None = None, **_kwargs)[source]#

Check that the input_file points to a single regular file.

Raises:

FileNotFoundError – if nothing exists at the path, or a directory is found.
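
Since InputReader is an abstract base class, a new format can be supported by subclassing it and implementing read and provenance_info. The reader below, for newline-delimited JSON, is a hypothetical sketch (it is not part of the library, and the provenance key names are illustrative):

    import pandas as pd

    from hipscat_import.catalog.file_readers import InputReader

    class JsonLinesReader(InputReader):
        """Hypothetical chunked reader for newline-delimited JSON files."""

        def __init__(self, chunksize=500_000):
            self.chunksize = chunksize

        def read(self, input_file, read_columns=None):
            # Validate that the path is a single regular file before reading.
            self.regular_file_exists(input_file)
            # pandas.read_json with lines=True and a chunksize returns an
            # iterator of DataFrame chunks.
            for chunk in pd.read_json(input_file, lines=True, chunksize=self.chunksize):
                if read_columns:
                    chunk = chunk[read_columns]
                yield chunk

        def provenance_info(self) -> dict:
            # Illustrative argument_name -> argument_value pairs.
            return {"input_reader_type": "JsonLinesReader", "chunksize": self.chunksize}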

class CsvReader(chunksize=500000, header='infer', schema_file=None, column_names=None, type_map=None, parquet_kwargs=None, **kwargs)[source]#

Bases: InputReader

CSV reader for the most common CSV reading arguments.

This uses pandas.read_csv, and you can find more information on additional arguments in the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

chunksize#

Number of rows to read in a single iteration.

Type:

int

header#

Row number(s) to use as the header containing the column names.

Type:

int, list of int, None, default ‘infer’

schema_file#

Path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.

Type:

str

column_names#

The names of columns, if no header is available.

Type:

list[str]

type_map#

The data types to use for columns.

Type:

dict

parquet_kwargs#

Additional keyword arguments to use when reading the parquet schema metadata.

Type:

dict

kwargs#

Additional keyword arguments to use when reading the CSV files.

Type:

dict

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
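
A sketch of reading a headerless, pipe-delimited file (the path, column names, and types are illustrative); extra keyword arguments such as sep are forwarded to pandas.read_csv:

    from hipscat_import.catalog.file_readers import CsvReader

    reader = CsvReader(
        chunksize=250_000,
        header=None,                       # the file has no header row...
        column_names=["id", "ra", "dec"],  # ...so supply the names explicitly
        type_map={"id": "int64", "ra": "float64", "dec": "float64"},
        sep="|",                           # forwarded to pandas.read_csv
    )
    for chunk in reader.read("objects.psv"):
        print(chunk.dtypes)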

class AstropyEcsvReader(**kwargs)[source]#

Bases: InputReader

Reads astropy ascii .ecsv files.

Note that this is NOT a chunked reader. Use caution when reading large ECSV files with this reader.

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info()[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
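
A sketch (the file name is a placeholder); because the reader is not chunked, read yields the entire file in a single iteration:

    from hipscat_import.catalog.file_readers import AstropyEcsvReader

    reader = AstropyEcsvReader()
    for frame in reader.read("small_table.ecsv"):
        # Yields once, with the whole (small!) file's contents.
        print(len(frame))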

class FitsReader(chunksize=500000, column_names=None, skip_column_names=None, **kwargs)[source]#

Bases: InputReader

Chunked FITS file reader.

There are two column-level arguments for reading FITS files: column_names and skip_column_names (see the usage sketch below).

  • If neither is provided, we will read and process all columns in the fits file.

  • If column_names is given, we will use only those names, and skip_column_names will be ignored.

  • If skip_column_names is provided, we will remove those columns from processing stages.

NB: Uses astropy table memmap to avoid reading the entire file into memory. See: https://docs.astropy.org/en/stable/io/fits/index.html#working-with-large-files

chunksize#

Number of rows of the file to process per iteration. For large files, this can prevent loading the entire file into memory at once.

Type:

int

column_names#

List of column names to keep. Only use one of column_names or skip_column_names.

Type:

list[str]

skip_column_names#

List of column names to skip. Only use one of column_names or skip_column_names.

Type:

list[str]

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
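
A sketch of the two mutually exclusive column selections described above (the file and column names are hypothetical):

    from hipscat_import.catalog.file_readers import FitsReader

    # Keep only the named columns...
    keep_reader = FitsReader(column_names=["ID", "RA", "DEC"])

    # ...or keep everything except a bulky column.
    skip_reader = FitsReader(skip_column_names=["IMAGE_STAMP"])

    for chunk in keep_reader.read("observations.fits"):
        print(list(chunk.columns))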

class ParquetReader(chunksize=500000, **kwargs)[source]#

Bases: InputReader

Parquet reader for the most common Parquet reading arguments.

chunksize#

Number of rows of the file to process per iteration. For large files, this can prevent loading the entire file into memory at once.

Type:

int

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
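
A sketch (the path and column names are placeholders); read_columns restricts which columns are read from the parquet file:

    from hipscat_import.catalog.file_readers import ParquetReader

    reader = ParquetReader(chunksize=1_000_000)
    for chunk in reader.read("catalog.parquet", read_columns=["ra", "dec"]):
        print(len(chunk))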