hipscat_import.index#

Create a performance index for a single column of an already-hipscatted catalog.

Package Contents#

Classes#

IndexArguments

Data class for holding indexing arguments

Functions#

create_index(args, client)

Read primary column, indexing column, and other payload data, and write to catalog directory.

run(args, client)

Run index creation pipeline.

class IndexArguments[source]#

Bases: hipscat_import.runtime_arguments.RuntimeArguments

Data class for holding indexing arguments

input_catalog_path: str = ''#
input_catalog: hipscat.catalog.Catalog | None#
input_storage_options: Dict[Any, Any] | None#

Optional dictionary of abstract filesystem credentials for the INPUT.

indexing_column: str = ''#
extra_columns: List[str]#
include_hipscat_index: bool = True#

Include the hipscat spatial partition index.

include_order_pixel: bool = True#

Include partitioning columns, Norder, Dir, and Npix. You probably want to keep these!

drop_duplicates: bool = True#

Should we check for duplicate rows (including the new indexing column), and remove duplicates before writing to the new index catalog? If you know that your data will not have duplicates (e.g. an index over a unique primary key), set to False to avoid unnecessary work.

compute_partition_size: int = 1000000000#

Partition size used when creating leaf parquet files.

division_hints: List | None#

Hints used when splitting up the rows by the new index. If you have some prior knowledge about the distribution of your indexing_column, providing it here can speed up calculations dramatically. Note that these will NOT necessarily be the divisions that the data is partitioned along.

__post_init__()[source]#
_check_arguments()[source]#
to_catalog_info(total_rows) hipscat.catalog.index.index_catalog_info.IndexCatalogInfo[source]#

Catalog-type-specific dataset info.

additional_runtime_provenance_info() dict[source]#

Any additional runtime args to be included in provenance info from subclasses.
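
For orientation, here is a minimal sketch of constructing IndexArguments. It uses only the index-specific fields documented above; the catalog path, column names, and the output_path / output_artifact_name fields (taken to come from the RuntimeArguments base class) are hypothetical placeholders and may differ in your installed version.

from hipscat_import.index import IndexArguments

# A minimal sketch: every path and column name below is a hypothetical
# placeholder, and output_path / output_artifact_name are assumed to be
# inherited from RuntimeArguments.
args = IndexArguments(
    input_catalog_path="/data/hipscat/my_survey",       # hypothetical input catalog
    indexing_column="object_id",                         # hypothetical column to index
    extra_columns=["ra", "dec"],                         # additional payload columns to carry along
    include_hipscat_index=True,                          # keep the hipscat spatial partition index
    include_order_pixel=True,                            # keep the Norder / Dir / Npix columns
    drop_duplicates=True,                                # de-duplicate rows before writing
    compute_partition_size=1_000_000_000,                # target leaf parquet partition size
    division_hints=None,                                 # or a sorted list of boundary values
    output_path="/data/hipscat",                         # assumed RuntimeArguments field
    output_artifact_name="my_survey_object_id_index",    # assumed RuntimeArguments field
)

As noted above for division_hints, supplying prior knowledge of the indexing_column distribution here can speed up the division calculation considerably, even though the final partition boundaries may differ.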

create_index(args, client)[source]#

Read primary column, indexing column, and other payload data, and write to catalog directory.

run(args, client)[source]#

Run index creation pipeline.
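
A minimal end-to-end sketch of the pipeline follows. It assumes client is a dask.distributed.Client (hipscat_import pipelines distribute their work with Dask) and reuses the hypothetical argument values from the sketch above.

from dask.distributed import Client

from hipscat_import.index import IndexArguments, run

# A minimal sketch: paths, column names, and the output_* fields are
# hypothetical placeholders; the client is assumed to be a Dask
# distributed client.
if __name__ == "__main__":
    args = IndexArguments(
        input_catalog_path="/data/hipscat/my_survey",     # hypothetical input catalog
        indexing_column="object_id",                       # hypothetical column to index
        output_path="/data/hipscat",                       # assumed RuntimeArguments field
        output_artifact_name="my_survey_object_id_index",  # assumed RuntimeArguments field
    )
    with Client(n_workers=4, threads_per_worker=1) as client:
        run(args, client)  # reads the input catalog and writes the new index catalog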