SDSS#

Getting the data#

See docs at SDSS: https://data.sdss.org/sas/dr12/boss/sweeps/dr9/301/

We wanted to use the SDSS dr16q stars slice, so that started with this wget_script.

Challenges with this data set#

The FITS files contain several multidimensional fields. These are not easily convertible into numpy/pandas (this will just fail in the astropy library).

The approach we took was to convert the FITS files into parquet in a separate process. In this process, we expand all of the multidimensional fields into simple scalars. This works on this data set because these multidimensional are just encoding data on the five bands as an array with 5 elements.

Example conversion#

import glob
import re
import pandas as pd
from tqdm.auto import tqdm
from astropy.table import Table
from astropy.table.table import descr

files = glob.glob("/data/sdss/**.fits.gz")
files.sort()

for in_file in tqdm(files, bar_format='{l_bar}{bar:80}{r_bar}'):

    match = re.match(r".*(calibObj-.*-star).fits.gz", str(file))
    file_prefix = match.group(1)
    out_file = f"/data/sdss/parquet/{file_prefix}.parquet"

    table = Table.read(in_file)
    new_table = Table()

    for col in table.columns.values():
        descriptor = descr(col)
        col_name = descriptor[0]
        col_shape = descriptor[2]
        if col_shape == (5,):
            data_t = col.data.T
            for index, band_char in enumerate('ugriz'):
                new_table.add_column(data_t[index], name=f"{col_name}_{band_char}")
        elif col_shape == ():
            new_table.add_column(col)

    new_table.to_pandas().to_parquet(out_file)

Example import#

from hipscat_import.catalog.arguments import ImportArguments
import hipscat_import.pipeline as runner

args = ImportArguments(
    output_artifact_name="sdss_dr16q",
    input_path="/data/sdss/parquet/",
    file_reader="parquet",
    ra_column="RA",
    dec_column="DEC",
    sort_columns="ID",
    pixel_threshold=1_000_000,
    highest_healpix_order=7,
    output_path="/path/to/catalogs/",
)
runner.pipeline(args)