Margin Cache#
For more discussion of the whys and hows of margin caches, please see Max’s AAS iPoster.
This page discusses topics around setting up a pipeline to generate a margin cache from an existing hipscat catalog on disk.
At a minimum, you need arguments that include where to find the input files, and where to put the output files. A minimal arguments block will look something like:
from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments

args = MarginCacheArguments(
    input_catalog_path="./my_data/my_catalog",
    output_path="./output",
    margin_threshold=10.0,
    output_artifact_name="my_catalog_10arcs",
)
More details on each of these parameters are provided in the sections below.
For the curious, see the API documentation for hipscat_import.margin_cache.margin_cache_arguments.MarginCacheArguments, and its superclass hipscat_import.runtime_arguments.RuntimeArguments.
Dask setup#
We will either use a user-provided dask Client, or create a new one with arguments:
dask_tmp - str - directory for dask worker space. This should be local to the execution of the pipeline, for speed of reads and writes. For much more information, see Temporary files and disk usage.
dask_n_workers - int - number of workers for the dask client. Defaults to 1.
dask_threads_per_worker - int - number of threads per dask worker. Defaults to 1.
If you find that you need additional parameters for your dask client (e.g. you are creating a SLURM worker pool), you can instead create your own dask client and pass it along to the pipeline, ignoring the above arguments. This would look like:
from dask.distributed import Client
from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from hipscat_import.pipeline import pipeline_with_client

args = MarginCacheArguments(...)
with Client('scheduler:port') as client:
    pipeline_with_client(args, client)
If you’re running within a .py file, we recommend you use a main guard to potentially avoid some python threading issues with dask:
from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from hipscat_import.pipeline import pipeline

def margin_pipeline():
    args = MarginCacheArguments(...)
    pipeline(args)

if __name__ == '__main__':
    margin_pipeline()
Input Catalog#
For this pipeline, you will need to have already transformed your catalog into hipscat parquet format. Provide the path to the catalog data with the argument input_catalog_path.
The input hipscat catalog will provide its own right ascension and declination that will be used when computing margin populations.
Margin calculation parameters#
When creating a margin catalog, we need to know how large a margin to include around each pixel in the input catalog.
margin_threshold is the size of the margin cache boundary, given in arcseconds.
This defaults to 5 arcseconds, but you should set this value to whatever is
appropriate for the astrometry error/PSF width for your instruments. If you’re
not sure how to determine this, please reach out! We’d love to help! Contact us.
Setting margin_order can make your pipeline run faster.
For each input catalog partition, we can quickly determine all possible neighboring healpix pixels at the given margin_order. All of these partitions may contain points that are inside the margin_threshold.
For each point in the input catalog, we can quickly determine the healpix pixel at margin_order and filter points based on this.
Using this smaller, constrained data set, we do precise boundary checking to determine if the points are within the margin_threshold.
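To get a feel for how margin_order and margin_threshold relate, the following is a minimal sketch that estimates healpix pixel widths from the standard relations nside = 2**order and npix = 12 * nside**2. The function names are hypothetical and not part of hipscat_import, and the rule of thumb that margin_order pixels should be at least as wide as margin_threshold is an assumption made for illustration, not a documented requirement of the pipeline.

```python
import math

ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi


def approx_pixel_size_arcsec(order: int) -> float:
    """Approximate side length of a healpix pixel at `order`, in arcseconds.

    Treats each equal-area pixel as a square, so this is only an estimate.
    """
    nside = 2 ** order
    npix = 12 * nside ** 2
    pixel_area_sr = 4.0 * math.pi / npix  # the full sky covers 4*pi steradians
    return math.sqrt(pixel_area_sr) * ARCSEC_PER_RADIAN


def finest_order_wider_than(margin_threshold_arcsec: float, max_order: int = 29) -> int:
    """Hypothetical helper: finest order whose pixels are at least as wide
    as the margin threshold (an assumed rule of thumb, see the text above)."""
    if margin_threshold_arcsec > approx_pixel_size_arcsec(0):
        raise ValueError("margin threshold is wider than an order-0 pixel")
    order = 0
    # Pixel width halves with each order, so step up until the next one is too small.
    while order < max_order and approx_pixel_size_arcsec(order + 1) >= margin_threshold_arcsec:
        order += 1
    return order


if __name__ == "__main__":
    threshold = 10.0  # arcseconds, matching the example arguments block
    order = finest_order_wider_than(threshold)
    print(order, approx_pixel_size_arcsec(order))
```

A finer margin_order means fewer candidate points survive the coarse filter, which leaves less work for the precise boundary check.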
Progress Reporting#
By default, we will display some progress bars during pipeline execution. To disable these (e.g. when you expect no output to standard out), you can set progress_bar=False.
There are several stages to the pipeline execution, and you can expect progress reporting to look like the following:
For very long-running pipelines (e.g. multi-TB inputs), you can get an email notification when the pipeline completes using the completion_email_address argument. This will send a brief email, for either pipeline success or failure.
Output#
You must specify a name for the margin catalog, using output_artifact_name. A good convention is the name of the primary input catalog, followed by the margin threshold, e.g. gaia_10arcs would be a margin catalog based on gaia that uses 10 arcseconds for margins.
You must specify where you want your margin data to be written, using output_path. This path should be the base directory for your catalogs, as the full path for the margin will take the form of output_path/output_artifact_name.
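As a small illustration of the naming convention and path layout described above, here is a sketch using only the standard library. Both helper functions are hypothetical and are not part of hipscat_import; they only show how the pieces are expected to combine.

```python
from pathlib import PurePosixPath


def margin_artifact_name(catalog_name: str, margin_threshold_arcsec: float) -> str:
    """Hypothetical helper: build a name like "gaia_10arcs"."""
    # The ":g" format drops a trailing ".0", so 10.0 becomes "10" and 2.5 stays "2.5".
    return f"{catalog_name}_{margin_threshold_arcsec:g}arcs"


def margin_catalog_path(output_path: str, output_artifact_name: str) -> PurePosixPath:
    """Hypothetical helper: join the base directory and the artifact name."""
    return PurePosixPath(output_path) / output_artifact_name


if __name__ == "__main__":
    name = margin_artifact_name("gaia", 10.0)
    print(name)
    print(margin_catalog_path("./output", name))
```

Deriving the name from the threshold keeps the two in sync if you later regenerate the margin with a different margin_threshold.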
If there is already catalog or margin data in the indicated directory, you can force new data to be written in the directory with the overwrite flag. It’s preferable to delete any existing contents first, however, as overwriting may cause unexpected side effects.
If you’re writing to cloud storage, or otherwise have some filesystem credential dict, put those in output_storage_options.
In addition, you can specify directories to use for various intermediate files:
dask worker space (dask_tmp)
sharded parquet files (tmp_dir)
Most users are going to be ok with simply setting the tmp_dir for all intermediate file use. For more information on these parameters, when you would use each, and demonstrations of temporary file use, see Temporary files and disk usage.