Catalog import: Option to run a single stage #361
For static datasets, a hipscat import job may only need to be run once, so the total runtime of the job is not a major concern. However, some datasets receive regular updates, so we may need to run hipscat import on either the full dataset or just the new data (if possible) on a regular schedule. For frequently updated catalog data, it becomes very important to reduce the runtime of a hipscat job by increasing parallelism and defining the ideal worker configuration for each stage, and to automate these recurring import jobs to reduce manual effort.
Different stages of the pipeline could benefit from different amounts of memory, numbers of CPU cores, and numbers of Dask workers. For example, the optimum worker configuration for the mapping stage of an import job with 32 input files might be 32 workers with one core each and X GB of memory, because only 32 tasks would be created. Later stages may produce many more tasks and could benefit from a much larger number of Dask workers; in general, each stage may need different amounts of CPU and memory.
If hipscat import had an option to choose which stage(s) of the import pipeline to run, we could use Dask Gateway on an autoscaling Kubernetes cluster to dynamically spin up different types and numbers of Dask workers (up to a set limit) for each stage of the hipscat pipeline. This could dramatically decrease the total runtime of each import job, which would allow us to automate these jobs and run them on a regular schedule.
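As a rough illustration only, the kind of per-stage orchestration this would enable might look like the sketch below. It assumes a Dask Gateway deployment that exposes worker_cores and worker_memory as cluster options, uses a hypothetical make_import_args helper, and relies on the run_stages argument proposed in this issue (not an existing hipscat-import option); the per-stage worker counts are placeholders, not recommendations.
from dask_gateway import Gateway
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client
def make_import_args(run_stages):
    # Hypothetical helper; argument names follow recent hipscat-import
    # releases and vary slightly between versions. run_stages is the
    # option proposed in this issue, not an existing argument.
    return ImportArguments(
        input_path="/data/my_survey",
        file_reader="parquet",
        output_path="/catalogs",
        output_artifact_name="my_survey",
        run_stages=run_stages,
    )
# Placeholder per-stage worker shapes; the right numbers depend on the
# dataset and the filesystem, so treat these as examples only.
STAGE_WORKERS = {
    "mapping": {"workers": 32, "cores": 1, "memory": 8},
    "splitting": {"workers": 64, "cores": 1, "memory": 16},
    "reducing": {"workers": 128, "cores": 1, "memory": 8},
}
gateway = Gateway()
for stage, shape in STAGE_WORKERS.items():
    # The exposed option names (worker_cores, worker_memory) and their
    # units are defined by the Gateway deployment; memory is assumed to
    # be GiB here.
    options = gateway.cluster_options()
    options.worker_cores = shape["cores"]
    options.worker_memory = shape["memory"]
    cluster = gateway.new_cluster(options)
    cluster.scale(shape["workers"])
    client = cluster.get_client()
    try:
        # Relies on the pipeline's resume plan to pick up where the
        # previous stage left off.
        pipeline_with_client(make_import_args(run_stages=[stage]), client)
    finally:
        client.close()
        cluster.shutdown()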
@troyraen I think this would have also helped you in the past with performing manual verification in between stages of the import pipeline.
Yes, +1 for this. I'm less concerned about manual verification between steps now with your recent updates @delucchi-cmu (thank you!) and my growing familiarity with the pipeline (trusting it to fail appropriately). The big benefit for me would be the ability to use different cluster configurations for different stages, like @johnny-gossick said. In some cases, certain stages like "mapping" will overload our filesystem if there are too many workers, but other steps can use a lot more workers. I end up copying the main
Feature request
Add a new argument to specify the stages of the pipeline to run. If not specified, just run the whole thing.
e.g.
run_stages=['mapping', 'splitting']
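A minimal sketch of how this could look from the caller's side, assuming the current ImportArguments / pipeline_with_client entry points (exact argument names vary between hipscat-import versions, and run_stages itself is the proposed addition):
from dask.distributed import Client
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client
args = ImportArguments(
    input_path="/data/my_survey",
    file_reader="parquet",
    output_path="/catalogs",
    output_artifact_name="my_survey",
    # Proposed: run only the named stages, in pipeline order; omitting
    # the argument would keep today's behavior of running everything.
    run_stages=["mapping", "splitting"],
)
with Client(n_workers=32, threads_per_worker=1) as client:
    pipeline_with_client(args, client)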