hipscat_import.catalog.file_readers#

File reading generators for common file types.

Module Contents#

Classes#

InputReader

Base class for chunking file readers.

CsvReader

CSV reader for the most common CSV reading arguments.

AstropyEcsvReader

Reads astropy ascii .ecsv files.

FitsReader

Chunked FITS file reader.

ParquetReader

Parquet reader for the most common Parquet reading arguments.

Functions#

get_file_reader(file_format[, chunksize, schema_file, ...])

Get a generator file reader for common file types.

get_file_reader(file_format, chunksize=500000, schema_file=None, column_names=None, skip_column_names=None, type_map=None, **kwargs)[source]#

Get a generator file reader for common file types.

Parameters:
  • file_format (str) –

    specifier for the file type and extension. Currently supported formats include:

    • csv, comma-separated values. May also be tab- or pipe-delimited. Includes .csv.gz and other compressed CSV files.

    • fits, Flexible Image Transport System. Often used for astropy tables.

    • parquet, compressed columnar data format.

  • chunksize (int) – Number of rows to read in a single iteration.

  • schema_file (str) – Path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.

  • column_names (list[str]) – For CSV files, the names of columns if no header is available. For FITS files, a list of columns to keep.

  • skip_column_names (list[str]) – For FITS files, a list of columns to remove.

  • type_map (dict) – For CSV files, the data types to use for columns.
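
A minimal usage sketch (the file name below is a placeholder, not part of the library): the returned reader exposes a read method that yields pandas DataFrames of at most chunksize rows.

    from hipscat_import.catalog.file_readers import get_file_reader

    # Build a chunked CSV reader; "my_catalog.csv" is a hypothetical path.
    reader = get_file_reader("csv", chunksize=100_000)
    for chunk in reader.read("my_catalog.csv"):
        # Each chunk is a pandas DataFrame with up to 100_000 rows.
        print(len(chunk), list(chunk.columns))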

class InputReader[source]#

Bases: abc.ABC

Base class for chunking file readers.

abstract read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

abstract provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.

regular_file_exists(input_file, storage_options: Dict[Any, Any] | None = None, **_kwargs)[source]#

Check that the input_file points to a single regular file.

Raises:

FileNotFoundError – if nothing exists at the path, or a directory is found.
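
Since InputReader is an abstract base class, a new format can be supported by subclassing it and implementing read and provenance_info. The reader below, for newline-delimited JSON, is a hypothetical sketch (it is not part of the library, and the provenance key names are illustrative):

    import pandas as pd

    from hipscat_import.catalog.file_readers import InputReader

    class JsonLinesReader(InputReader):
        """Hypothetical chunked reader for newline-delimited JSON files."""

        def __init__(self, chunksize=500_000):
            self.chunksize = chunksize

        def read(self, input_file, read_columns=None):
            # Validate that the path is a single regular file before reading.
            self.regular_file_exists(input_file)
            # pandas.read_json with lines=True and a chunksize returns an
            # iterator of DataFrame chunks.
            for chunk in pd.read_json(input_file, lines=True, chunksize=self.chunksize):
                if read_columns:
                    chunk = chunk[read_columns]
                yield chunk

        def provenance_info(self) -> dict:
            # Illustrative argument_name -> argument_value pairs.
            return {"input_reader_type": "JsonLinesReader", "chunksize": self.chunksize}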

class CsvReader(chunksize=500000, header='infer', schema_file=None, column_names=None, type_map=None, parquet_kwargs=None, **kwargs)[source]#

Bases: InputReader

CSV reader for the most common CSV reading arguments.

This uses pandas.read_csv, and you can find more information on additional arguments in the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

chunksize#

Number of rows to read in a single iteration.

Type:

int

header#

Row number(s) to use as the header containing the column names.

Type:

int, list of int, None, default ‘infer’

schema_file#

Path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.

Type:

str

column_names#

The names of columns, if no header is available.

Type:

list[str]

type_map#

The data types to use for columns.

Type:

dict

parquet_kwargs#

Additional keyword arguments to use when reading the parquet schema metadata.

Type:

dict

kwargs#

Additional keyword arguments to use when reading the CSV files.

Type:

dict

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
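
A sketch of reading a headerless, pipe-delimited file (the path, column names, and types are illustrative); extra keyword arguments such as sep are forwarded to pandas.read_csv:

    from hipscat_import.catalog.file_readers import CsvReader

    reader = CsvReader(
        chunksize=250_000,
        header=None,                       # the file has no header row...
        column_names=["id", "ra", "dec"],  # ...so supply the names explicitly
        type_map={"id": "int64", "ra": "float64", "dec": "float64"},
        sep="|",                           # forwarded to pandas.read_csv
    )
    for chunk in reader.read("objects.psv"):
        print(chunk.dtypes)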

class AstropyEcsvReader(**kwargs)[source]#

Bases: InputReader

Reads astropy ascii .ecsv files.

Note that this is NOT a chunked reader. Use caution when reading large ECSV files with this reader.

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info()[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
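
A sketch (the file name is a placeholder); because the reader is not chunked, read yields the entire file in a single iteration:

    from hipscat_import.catalog.file_readers import AstropyEcsvReader

    reader = AstropyEcsvReader()
    for frame in reader.read("small_table.ecsv"):
        # Yields once, with the whole (small!) file's contents.
        print(len(frame))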

class FitsReader(chunksize=500000, column_names=None, skip_column_names=None, **kwargs)[source]#

Bases: InputReader

Chunked FITS file reader.

There are two column-level arguments for reading FITS files: column_names and skip_column_names (see the usage sketch below).

  • If neither is provided, we will read and process all columns in the fits file.

  • If column_names is given, we will use only those names, and skip_column_names will be ignored.

  • If skip_column_names is provided, we will remove those columns from processing stages.

NB: Uses astropy table memmap to avoid reading the entire file into memory. See: https://docs.astropy.org/en/stable/io/fits/index.html#working-with-large-files

chunksize#

Number of rows of the file to process per iteration. For large files, this can prevent loading the entire file into memory at once.

Type:

int

column_names#

List of column names to keep. Only use one of column_names or skip_column_names.

Type:

list[str]

skip_column_names#

List of column names to skip. Only use one of column_names or skip_column_names.

Type:

list[str]

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
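
A sketch of the two mutually exclusive column selections described above (the file and column names are hypothetical):

    from hipscat_import.catalog.file_readers import FitsReader

    # Keep only the named columns...
    keep_reader = FitsReader(column_names=["ID", "RA", "DEC"])

    # ...or keep everything except a bulky column.
    skip_reader = FitsReader(skip_column_names=["IMAGE_STAMP"])

    for chunk in keep_reader.read("observations.fits"):
        print(list(chunk.columns))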

class ParquetReader(chunksize=500000, **kwargs)[source]#

Bases: InputReader

Parquet reader for the most common Parquet reading arguments.

chunksize#

Number of rows of the file to process per iteration. For large files, this can prevent loading the entire file into memory at once.

Type:

int

read(input_file, read_columns=None)[source]#

Read the input file, or chunk of the input file.

Parameters:
  • input_file (str) – Path to the input file.

  • read_columns (List[str]) – Subset of columns to read. If None, all columns are read.

Yields:

DataFrame containing a chunk of the file's data.

provenance_info() dict[source]#

Create dictionary of parameters for provenance tracking.

Returns:

Dictionary with all argument_name -> argument_value as key -> value pairs.
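
A sketch (the path and column names are placeholders); read_columns restricts which columns are read from the parquet file:

    from hipscat_import.catalog.file_readers import ParquetReader

    reader = ParquetReader(chunksize=1_000_000)
    for chunk in reader.read("catalog.parquet", read_columns=["ra", "dec"]):
        print(len(chunk))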