Margin Cache#

For more discussion of the whys and hows of margin caches, please see Max’s AAS iPoster.

This page discusses how to set up a pipeline to generate a margin cache from an existing hipscat catalog on disk.

At a minimum, you need arguments that include where to find the input files, and where to put the output files. A minimal arguments block will look something like:

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="./output",
    margin_threshold=10.0,
    output_artifact_name="my_catalog_10arcs",
)

More details on each of these parameters are provided in the sections below.

For the curious, see the API documentation for hipscat_import.margin_cache.margin_cache_arguments.MarginCacheArguments, and its superclass hipscat_import.runtime_arguments.RuntimeArguments.

Dask setup#

We will either use a user-provided dask Client, or create a new one with the following arguments (see the sketch after this list):

dask_tmp - str - directory for dask worker space. This should be local to the execution of the pipeline, for speed of reads and writes. For much more information, see Temporary files and disk usage.

dask_n_workers - int - number of workers for the dask client. Defaults to 1.

dask_threads_per_worker - int - number of threads per dask worker. Defaults to 1.
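
A sketch of passing these arguments directly, assuming the defaults aren’t sufficient (the scratch path and worker count here are illustrative, not recommendations):

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="./output",
    output_artifact_name="my_catalog_10arcs",
    dask_tmp="/tmp/margin_pipeline",  # illustrative; local scratch space for dask workers
    dask_n_workers=4,                 # illustrative; size to your machine
    dask_threads_per_worker=1,
)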

If you find that you need additional parameters for your dask client (e.g. you are creating a SLURM worker pool), you can instead create your own dask client and pass it along to the pipeline, ignoring the above arguments. This would look like:

from dask.distributed import Client
from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from hipscat_import.pipeline import pipeline_with_client

args = MarginCacheArguments(...)
with Client('scheduler:port') as client:
    pipeline_with_client(args, client)

If you’re running within a .py file, we recommend you use a main guard to potentially avoid some Python threading issues with dask:

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from hipscat_import.pipeline import pipeline

def margin_pipeline():
    args = MarginCacheArguments(...)
    pipeline(args)

if __name__ == '__main__':
    margin_pipeline()

Input Catalog#

For this pipeline, you will need to have already transformed your catalog into hipscat parquet format. Provide the path to the catalog data with the argument input_catalog_path.

The input hipscat catalog provides its own right ascension and declination columns, which will be used when computing margin populations.

Margin calculation parameters#

When creating a margin catalog, we need to know how large of a margin to include around each pixel in the input catalog.

margin_threshold is the size of the margin cache boundary, given in arcseconds. This defaults to 5 arcseconds, but you should set this value to whatever is appropriate for the astrometry error/PSF width for your instruments. If you’re not sure how to determine this, please reach out! We’d love to help! Contact us.

Setting margin_order, the healpix order used in the coarse filtering steps below, can make your pipeline run faster (see the sketch after this list):

  1. For each input catalog partition, we can quickly determine all possible neighboring healpix pixels at the given margin_order. Any of these pixels may contain points that fall within the margin_threshold.

  2. For each point in the input catalog, we can quickly determine its healpix pixel at margin_order and filter points based on this.

  3. Using this smaller, constrained data set, we do precise boundary checking to determine if the points are within the margin_threshold.
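
As a sketch, margin_order is passed alongside the other arguments (the order chosen here is illustrative; an appropriate value depends on your catalog’s partitioning):

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="./output",
    margin_threshold=10.0,
    output_artifact_name="my_catalog_10arcs",
    margin_order=8,  # illustrative healpix order used for the coarse filtering above
)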

Progress Reporting#

By default, we will display some progress bars during pipeline execution. To disable these (e.g. when you expect no output to standard out), you can set progress_bar=False.

There are several stages to the pipeline execution, and you can expect progress reporting to look like the following:

Mapping  : 100%|██████████| 2352/2352 [9:25:00<00:00, 14.41s/it]
Reducing : 100%|██████████| 2385/2385 [00:43<00:00, 54.47it/s]
Finishing: 100%|██████████| 4/4 [00:03<00:00,  1.15it/s]

For very long-running pipelines (e.g. multi-TB inputs), you can get an email notification when the pipeline completes using the completion_email_address argument. This will send a brief email on either pipeline success or failure.
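
A sketch combining both options (the email address is a placeholder):

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="./output",
    output_artifact_name="my_catalog_10arcs",
    progress_bar=False,  # disable progress bars on standard out
    completion_email_address="someone@example.com",  # placeholder address
)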

Output#

You must specify a name for the margin catalog, using output_artifact_name. A good convention is the name of the primary input catalog, followed by the margin threshold, e.g. gaia_10arcs would be a margin catalog based on gaia that uses 10 arcseconds for margins.

You must specify where you want your margin data to be written, using output_path. This path should be the base directory for your catalogs, as the full path for the margin will take the form of output_path/output_artifact_name.

If there is already catalog or margin data in the indicated directory, you can force new data to be written there with the overwrite flag. It’s preferable to delete any existing contents first, however, as overwriting may cause unexpected side effects.

If you’re writing to cloud storage, or otherwise have a filesystem credential dict, pass it as output_storage_options.
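
Putting the output arguments together, a sketch for writing to cloud storage (the protocol, bucket, and credential keys are placeholders; the exact keys depend on your filesystem):

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="abfs://my_bucket/catalogs",  # placeholder; base directory for catalogs
    output_artifact_name="my_catalog_10arcs",  # final path: abfs://my_bucket/catalogs/my_catalog_10arcs
    overwrite=True,  # replace any existing contents (prefer deleting them first)
    output_storage_options={"account_name": "placeholder", "account_key": "placeholder"},
)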

In addition, you can specify directories to use for various intermediate files:

  • dask worker space (dask_tmp)

  • sharded parquet files (tmp_dir)

Most users are going to be ok with simply setting tmp_dir for all intermediate file use. For more information on these parameters, when you would use each, and demonstrations of temporary file use, see Temporary files and disk usage.
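
For example, a sketch routing all intermediate files through a single scratch directory (the path is illustrative):

from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="./output",
    output_artifact_name="my_catalog_10arcs",
    tmp_dir="/data/scratch/margin_pipeline",  # illustrative; holds intermediate files such as sharded parquet
)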