hipscat_import.catalog.file_readers
File reading generators for common file types.
Module Contents#
Classes#
InputReader: Base class for chunking file readers.
CsvReader: CSV reader for the most common CSV reading arguments.
AstropyEcsvReader: Reads astropy ascii .ecsv files.
FitsReader: Chunked FITS file reader.
ParquetReader: Parquet reader for the most common Parquet reading arguments.
Functions#
get_file_reader: Get a generator file reader for common file types.
- get_file_reader(file_format, chunksize=500000, schema_file=None, column_names=None, skip_column_names=None, type_map=None, **kwargs)[source]#
Get a generator file reader for common file types
- Parameters:
file_format (str) –
specifier for the file type and extension. Currently supported formats include:
csv: comma-separated values. May also be tab- or pipe-delimited. Includes .csv.gz and other compressed CSV files.
fits: Flexible Image Transport System. Often used for astropy tables.
parquet: compressed columnar data format.
chunksize (int) – number of rows to read in a single iteration.
schema_file (str) – path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.
column_names (list[str]) – for CSV files, the names of columns if no header is available. For fits files, a list of columns to keep.
skip_column_names (list[str]) – for fits files, a list of columns to remove.
type_map (dict) – for CSV files, the data types to use for columns.
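For orientation, a minimal usage sketch follows; the file name is hypothetical, and the reader returned for "csv" is iterated chunk by chunk through its read method.

    from hipscat_import.catalog.file_readers import get_file_reader

    # Build a chunked CSV reader; "my_catalog.csv" is an illustrative path.
    reader = get_file_reader("csv", chunksize=100_000)

    # read() yields pandas DataFrames of at most `chunksize` rows each.
    for chunk in reader.read("my_catalog.csv"):
        print(len(chunk))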
- class InputReader[source]#
Bases:
abc.ABC
Base class for chunking file readers.
- abstract read(input_file, read_columns=None)[source]#
Read the input file, or chunk of the input file.
- Parameters:
input_file (str) – path to the input file.
read_columns (List[str]) – subset of columns to read. If None, all columns are read.
- Yields:
DataFrame containing a chunk of the file's data.
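A custom reader only needs to satisfy this contract. The following is a hypothetical sketch (not part of the package) of a whitespace-delimited reader built on pandas; the class name, constructor, and file handling are assumptions, and it implements only the read method listed above.

    import pandas as pd

    from hipscat_import.catalog.file_readers import InputReader

    class WhitespaceReader(InputReader):
        """Hypothetical reader for whitespace-delimited text files."""

        def __init__(self, chunksize=500_000):
            self.chunksize = chunksize

        def read(self, input_file, read_columns=None):
            # Yield DataFrames of at most `chunksize` rows, honoring the
            # optional column subset, to satisfy the contract above.
            with pd.read_csv(
                input_file,
                sep=r"\s+",
                usecols=read_columns,
                chunksize=self.chunksize,
            ) as chunks:
                for chunk in chunks:
                    yield chunk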
- class CsvReader(chunksize=500000, header='infer', schema_file=None, column_names=None, type_map=None, parquet_kwargs=None, **kwargs)[source]#
Bases:
InputReader
CSV reader for the most common CSV reading arguments.
This uses pandas.read_csv, and you can find more information on additional arguments in the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- chunksize#
number of rows to read in a single iteration.
- Type:
int
- header#
rows to use as the header with column names
- Type:
int, list of int, None, default ‘infer’
- schema_file#
path to a parquet schema file. If provided, header names and column types will be pulled from the parquet schema metadata.
- Type:
str
- column_names#
the names of columns if no header is available
- Type:
list[str]
- type_map#
the data types to use for columns
- Type:
dict
- parquet_kwargs#
additional keyword arguments to use when reading the parquet schema metadata.
- Type:
dict
- kwargs#
additional keyword arguments to use when reading the CSV files.
- Type:
dict
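A construction sketch for a headerless, pipe-delimited file; the path, column names, and dtypes are illustrative, and the sep argument is forwarded to pandas.read_csv through kwargs.

    from hipscat_import.catalog.file_readers import CsvReader

    # Headerless, pipe-delimited file with explicit column names and dtypes.
    # The path, names, and dtypes below are illustrative only.
    reader = CsvReader(
        chunksize=250_000,
        header=None,
        column_names=["id", "ra", "dec"],
        type_map={"id": "int64", "ra": "float64", "dec": "float64"},
        sep="|",  # forwarded to pandas.read_csv via **kwargs
    )

    total_rows = 0
    for chunk in reader.read("survey_part_00.csv"):
        total_rows += len(chunk)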
- class AstropyEcsvReader(**kwargs)[source]#
Bases:
InputReader
Reads astropy ascii .ecsv files.
Note that this is NOT a chunked reader. Use caution when reading large ECSV files with this reader.
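A short sketch assuming a small ECSV file (the path is hypothetical); because the reader is not chunked, the generator yields the file's contents in a single frame.

    from hipscat_import.catalog.file_readers import AstropyEcsvReader

    reader = AstropyEcsvReader()
    # Not chunked: the whole table is read and yielded at once.
    for frame in reader.read("small_table.ecsv"):  # hypothetical path
        print(len(frame), "rows read")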
- class FitsReader(chunksize=500000, column_names=None, skip_column_names=None, **kwargs)[source]#
Bases:
InputReader
Chunked FITS file reader.
There are two column-level arguments for reading fits files: column_names and skip_column_names.
If neither is provided, we will read and process all columns in the fits file.
If column_names is given, we will use only those names, and skip_column_names will be ignored.
If skip_column_names is provided, we will remove those columns from processing stages.
NB: Uses astropy table memmap to avoid reading the entire file into memory. See: https://docs.astropy.org/en/stable/io/fits/index.html#working-with-large-files
- chunksize#
number of rows of the file to process at once. For large files, this can prevent loading the entire file into memory at once.
- Type:
int
- column_names#
list of column names to keep. Only use one of column_names or skip_column_names.
- Type:
list[str]
- skip_column_names#
list of column names to skip. Only use one of column_names or skip_column_names.
- Type:
list[str]
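A hedged construction example; the file path and column names are illustrative. Per the note above, pass only one of column_names or skip_column_names.

    from hipscat_import.catalog.file_readers import FitsReader

    # Keep only the listed columns; alternatively pass skip_column_names
    # to drop a few columns and keep the rest (use one or the other).
    reader = FitsReader(
        chunksize=50_000,
        column_names=["RA", "DEC", "MAG_G"],  # illustrative column names
    )

    for chunk in reader.read("images_catalog.fits"):  # hypothetical path
        print(chunk.columns.tolist())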
- class ParquetReader(chunksize=500000, **kwargs)[source]#
Bases:
InputReader
Parquet reader for the most common Parquet reading arguments.
- chunksize#
number of rows of the file to process at once. For large files, this can prevent loading the entire file into memory at once.
- Type:
int
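A minimal sketch with a hypothetical file path; read_columns (from the base read signature) restricts which columns are read for each chunk.

    from hipscat_import.catalog.file_readers import ParquetReader

    reader = ParquetReader(chunksize=1_000_000)

    # Iterate the file in row-count-bounded chunks, reading only two columns.
    for chunk in reader.read("catalog.parquet", read_columns=["ra", "dec"]):
        print(len(chunk))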