Estimate pixel threshold#

For best performance, we try to keep catalog parquet files between 200 MB and 800 MB in size.

Background

When creating a new catalog through the hipscat-import process, we try to create partitions with approximately the same number of rows per partition. This isn’t perfect, because the sky is uneven, but we still try: we create smaller-area pixels in denser areas, and larger-area pixels in less dense areas. We use the pixel_threshold argument and will split a partition into smaller HEALPix pixels until the number of rows is smaller than pixel_threshold.
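
As a rough illustration of that splitting idea, here is a minimal sketch. This is only a sketch, not the actual hipscat-import implementation, and rows_in_pixel is a hypothetical stand-in for a real per-pixel row count:

## Sketch only: keep subdividing a HEALPix pixel into its four children
## (nested scheme) until each piece holds fewer rows than pixel_threshold,
## or we hit a maximum order. `rows_in_pixel` is a hypothetical callable.
def split_pixel(order, pixel, rows_in_pixel, pixel_threshold, max_order=10):
    if order == max_order or rows_in_pixel(order, pixel) < pixel_threshold:
        return [(order, pixel)]
    partitions = []
    for child in range(4 * pixel, 4 * pixel + 4):
        partitions.extend(split_pixel(order + 1, child, rows_in_pixel, pixel_threshold, max_order))
    return partitions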

We do this to increase parallelization of reads and downstream analysis: if the files are around the same size, and operations on each partition take around the same amount of time, we’re less likely to be stuck waiting on a single process before the whole pipeline can finish.

In addition, a single catalog file should not exceed a couple GB - we’re going to need to read the whole thing into memory, so it needs to fit!

Objective

In this notebook, we’ll go over one strategy for estimating the pixel_threshold argument you can use when importing a new catalog into hipscat format.

This is not guaranteed to give you optimal results, but it could give you some hints toward better results.

Create a sample parquet file#

The first step is to read in your survey data in its original form, and convert a sample into parquet. This has a few benefits:

- parquet uses compression in various ways, and by creating the sample, we can get a sense of both the overall and field-level compression with real data
- using the importer FileReader interface now sets you up for more success when you get around to importing!

If your data is already in parquet format, just change the sample_parquet_file path to an existing file, and don’t run the second cell.

[1]:
### Change this path!!!
import os
import tempfile

tmp_path = tempfile.TemporaryDirectory()
sample_parquet_file = os.path.join(tmp_path.name, "sample.parquet")
[2]:
from hipscat_import.catalog.file_readers import CsvReader

### Change this path!!!
input_file = "../../tests/hipscat_import/data/small_sky/catalog.csv"

file_reader = CsvReader(chunksize=5_000)

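## Read the first chunk of rows from the CSV and write it out as a sample parquet file.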
next(file_reader.read(input_file)).to_parquet(sample_parquet_file)

Inspect parquet file and metadata#

Now that we have parsed our survey data into parquet, we can check what the data will look like when it’s imported into hipscat format.

If you’re just here to get a naive estimate for your pixel threshold, we’ll do that first, then take a look at some other parquet characteristics later for the curious.

[3]:
import os
import pyarrow.parquet as pq

sample_file_size = os.path.getsize(sample_parquet_file)
parquet_file = pq.ParquetFile(sample_parquet_file)
num_rows = parquet_file.metadata.num_rows

## 100MB
ideal_file_small = 100 * 1024 * 1024
## 800MB
ideal_file_large = 800 * 1024 * 1024

threshold_small = ideal_file_small / sample_file_size * num_rows
threshold_large = ideal_file_large / sample_file_size * num_rows

print(f"threshold between {int(threshold_small):_} and {int(threshold_large):_}")
threshold between 2_491_628 and 19_933_024
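
To sanity-check a value you’re leaning toward, you can invert the same arithmetic and turn a candidate row count back into an expected partition file size. The 5_000_000 below is just a hypothetical pick from inside the suggested range:

## Hypothetical candidate threshold, chosen from inside the suggested range.
candidate_threshold = 5_000_000
bytes_per_row = sample_file_size / num_rows
expected_mb = candidate_threshold * bytes_per_row / (1024 * 1024)
print(f"a partition with {candidate_threshold:_} rows would be roughly {expected_mb:.0f} MB on disk")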

Want to see more?#

I’m so glad you’re still here! I have more to show you!

The first step below shows the file-level metadata, as stored by parquet. The number of columns here SHOULD match the number of columns you expect in your survey data.

The serialized_size value is just the size of the total metadata, not the size of the file.

[4]:
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile(sample_parquet_file)
print(parquet_file.metadata)
<pyarrow._parquet.FileMetaData object at 0x7f5546fc2110>
  created_by: parquet-cpp-arrow version 16.0.0
  num_columns: 5
  num_rows: 131
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 3211

The next step is to look at the column-level metadata. You can check that the on-disk type of each column matches your expectation of the data. Note that for some integer types, the on-disk type may be a smaller int than originally set (e.g. bitWidth=8 or 16). This is part of parquet’s multi-part compression strategy.

[5]:
print(parquet_file.schema)
<pyarrow._parquet.ParquetSchema object at 0x7f5544254100>
required group field_id=-1 schema {
  optional int64 field_id=-1 id;
  optional double field_id=-1 ra;
  optional double field_id=-1 dec;
  optional int64 field_id=-1 ra_error;
  optional int64 field_id=-1 dec_error;
}

Parquet will also perform some column-level compression, so not all columns with the same type will take up the same space on disk.

Below, we inspect the row group and column metadata to show the compressed size of the fields on disk. The last column, percent, shows the percent of the total size taken up by each column.

You can use this to inform which columns you keep when importing a catalog into hipscat format. For example, if some columns are less useful for your science and take up a lot of space, maybe leave them out! A sketch after the table below shows one way to fold that choice back into the threshold estimate.

[6]:
import numpy as np
import pandas as pd

num_cols = parquet_file.metadata.num_columns
num_row_groups = parquet_file.metadata.num_row_groups
sizes = np.zeros(num_cols)

for rg in range(num_row_groups):
    for col in range(num_cols):
        sizes[col] += parquet_file.metadata.row_group(rg).column(col).total_compressed_size

## This is just an attempt at pretty formatting
percents = [f"{s/sizes.sum()*100:.1f}" for s in sizes]
pd.DataFrame({"name": parquet_file.schema.names, "size": sizes.astype(int), "percent": percents}).sort_values(
    "size", ascending=False
)
[6]:
        name  size percent
0         id   769    42.7
1         ra   442    24.5
2        dec   393    21.8
3   ra_error    99     5.5
4  dec_error    99     5.5
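
If you do decide to leave some columns out, you can fold that choice back into the earlier estimate. The sketch below just reuses the same arithmetic, scaling the sample file size by the fraction of compressed bytes you plan to keep. The dropped column names are hypothetical, and the result is approximate because it ignores parquet’s file-level metadata overhead:

## Hypothetical example: suppose we plan to drop the error columns.
dropped_columns = ["ra_error", "dec_error"]

kept_bytes = sum(
    size for name, size in zip(parquet_file.schema.names, sizes) if name not in dropped_columns
)
kept_fraction = kept_bytes / sizes.sum()

## Re-run the earlier estimate against the smaller per-row footprint.
adjusted_small = ideal_file_small / (sample_file_size * kept_fraction) * num_rows
adjusted_large = ideal_file_large / (sample_file_size * kept_fraction) * num_rows
print(f"adjusted threshold between {int(adjusted_small):_} and {int(adjusted_large):_}")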