NEOWISE#

Getting the data#

See docs at IRSA.

https://irsa.ipac.caltech.edu/data/download/neowiser_year8/

Challenges with this data set#

  • The individual files are large, and so we want to use a chunked CSV reader.

  • The rows are wide, so each chunk must be limited to a modest number of rows (the example below uses 250,000 rows per chunk).

  • The CSV files don’t have a header, so we need to provide the column names and type hints to the reader.

  • Integer fields may be null, which is not supported by the default int64 type in pandas, so we must use the nullable Int64 type for them.

  • Some fields are sparsely populated, and this can create type conversion issues. We use a schema parquet file to address these issues.
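The first four points can be illustrated with plain pandas. This is a minimal sketch, not the importer's internal code: the three-column sample, the column names, and the `read_chunks` helper are hypothetical stand-ins for the much wider NEOWISE rows.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited sample with no header row.
# The last column is an integer field that is sometimes empty.
SAMPLE = "1|10.5|7\n2|11.0|\n3|12.25|9\n4|13.0|\n5|14.5|3\n"

# Because the file has no header, names and types are supplied explicitly.
COLUMN_NAMES = ["source_id", "ra", "n_obs"]
TYPE_MAP = {"source_id": "int64", "ra": "float64", "n_obs": "Int64"}


def read_chunks(buffer, chunksize):
    """Yield DataFrames of at most `chunksize` rows from a headerless CSV."""
    reader = pd.read_csv(
        buffer,
        header=None,
        sep="|",
        names=COLUMN_NAMES,
        dtype=TYPE_MAP,
        chunksize=chunksize,
    )
    for chunk in reader:
        yield chunk


chunks = list(read_chunks(io.StringIO(SAMPLE), chunksize=2))

# Number of rows in each chunk, and how many nulls the nullable column holds.
chunk_sizes = [len(chunk) for chunk in chunks]
null_count = sum(int(chunk["n_obs"].isna().sum()) for chunk in chunks)
print(chunk_sizes, null_count)
```

With the nullable `Int64` type, the empty values are kept as missing and the column keeps an integer type in every chunk. With plain `int64`, pandas would refuse to read the empty values.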

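You can download our reference files (the neowise_types.csv column list and the neowise_schema.parquet schema file used in the example below), if you find that helpful.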

Example import#

import pandas as pd

import hipscat_import.pipeline as runner
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.catalog.file_readers import CsvReader

# Load the column names and types from a side file.
type_frame = pd.read_csv("neowise_types.csv")
type_map = dict(zip(type_frame["name"], type_frame["type"]))

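args = ImportArguments(
    output_artifact_name="neowise_1",
    input_path="/path/to/neowiser_year8/",
    # The files are pipe-delimited and have no header row, so the
    # column names and types come from the side file loaded above.
    file_reader=CsvReader(
        header=None,
        sep="|",
        column_names=type_frame["name"].values.tolist(),
        type_map=type_map,
        # Rows are wide, so keep each chunk modest.
        chunksize=250_000,
    ).read,
    ra_column="RA",
    dec_column="DEC",
    pixel_threshold=2_000_000,
    highest_healpix_order=9,
    # Fixed schema, to avoid type conversion issues in sparse fields.
    use_schema_file="neowise_schema.parquet",
    sort_columns="SOURCE_ID",
    output_path="/path/to/catalogs/",
)
runner.run(args)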
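The type hints and schema file above guard against the type conversion issues that sparsely populated fields can cause. The following sketch shows the underlying problem with plain pandas only. The two-column sample, the column names, and the `chunk_dtypes` helper are hypothetical, and the sketch does not reproduce how the importer applies its schema file.

```python
import io

import pandas as pd

# Hypothetical sample: the "flag" column is empty in the first two rows
# and holds text in the last two.
SAMPLE = "1|\n2|\n3|A\n4|B\n"
NAMES = ["source_id", "flag"]


def chunk_dtypes(dtype=None):
    """Return the dtype of the "flag" column in each two-row chunk."""
    reader = pd.read_csv(
        io.StringIO(SAMPLE),
        header=None,
        sep="|",
        names=NAMES,
        dtype=dtype,
        chunksize=2,
    )
    return [str(chunk["flag"].dtype) for chunk in reader]


# Without hints, pandas infers a type per chunk, so the results can disagree.
inferred = chunk_dtypes()

# With explicit hints, every chunk gets the same type.
hinted = chunk_dtypes({"source_id": "int64", "flag": "string"})

print(inferred)
print(hinted)
```

In the first chunk the column is entirely empty, so pandas infers a floating-point type there and a text type in the second chunk. Fixing the types up front keeps all chunks consistent.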

Our performance#

Running on dedicated hardware, with 10 workers, this import took around a week.

  • Mapping stage: around 58 hours

  • Splitting stage: around 72 hours

  • Reducing stage: around 14 hours
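As a quick check of these figures, the approximate stage timings can be added up. The numbers are the rounded values listed above, so the result is only indicative.

```python
# Approximate stage durations, in hours, as listed above.
stage_hours = {"mapping": 58, "splitting": 72, "reducing": 14}

total_hours = sum(stage_hours.values())
total_days = total_hours / 24

# Share of the total run time spent in each stage, as a rounded percentage.
shares = {
    stage: round(100 * hours / total_hours)
    for stage, hours in stage_hours.items()
}

print(total_hours, total_days, shares)
```

The stages sum to about 144 hours, or six days, which matches the "around a week" estimate. Splitting accounts for roughly half of the total.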