Catalog import: Option to run a single stage #361

Closed · delucchi-cmu opened this issue Jul 31, 2024 · 3 comments · Fixed by #377
Labels
enhancement New feature or request

Comments

@delucchi-cmu
Contributor

Feature request

Add a new argument to specify which stages of the pipeline to run. If not specified, just run the whole pipeline.

e.g. run_stages=['mapping', 'splitting']
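
From the caller's side, a minimal sketch of how this might look, assuming the existing `ImportArguments`/`pipeline` entry points (field values are illustrative); `run_stages` is the proposed addition and does not exist yet:

```python
# Sketch only: ImportArguments and pipeline() are existing hipscat-import
# entry points; run_stages is the proposed new argument (not yet implemented).
# Paths and other field values are hypothetical.
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline

args = ImportArguments(
    input_path="/data/my_survey",
    file_reader="parquet",
    output_path="/catalogs",
    output_artifact_name="my_survey",
    run_stages=["mapping", "splitting"],  # proposed: run only these stages
)
pipeline(args)  # with run_stages unset, the whole pipeline runs as today
```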

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
@delucchi-cmu delucchi-cmu added the enhancement New feature or request label Jul 31, 2024
@johnny-gossick
Contributor

johnny-gossick commented Aug 7, 2024

For static datasets, a hipscat import job may only need to run once, so the total runtime of the job is not a major concern. However, some datasets receive regular updates, so we may need to run hipscat import on either the full dataset or just the new data (if possible) on a regular schedule. For frequently updated catalog data, it becomes very important to reduce the runtime of a hipscat job by increasing parallelism and defining the ideal worker configuration for each stage of the job. It will also be important to automate these recurring hipscat import jobs to reduce manual effort.

Different stages of the pipeline could benefit from different amounts of memory, different CPU core counts, and different numbers of Dask workers. For example, the optimal worker configuration for the mapping stage of an import job with 32 input files might be 32 workers with one core each and X GB of memory, because only 32 tasks would be created. Later stages may produce far more tasks and could benefit from a much larger number of Dask workers. In general, some stages may require different amounts of CPU cores or memory.

If hipscat import has the option to choose which stage(s) of the import pipeline to run, then we may be able to use Dask Gateway in an autoscaling Kubernetes cluster to dynamically spin up different types and numbers of Dask workers (up to a set limit) for each stage of the hipscat pipeline. This could dramatically decrease the total runtime of each import job, which would allow us to automate these jobs and run them on a regular schedule.
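
As a rough sketch of what that could enable, assuming a Dask Gateway deployment that exposes `worker_cores`/`worker_memory` as cluster options, and a hypothetical `run_stage()` entry point for running one pipeline stage (the per-stage profiles below are made-up numbers):

```python
# Sketch only: assumes a Dask Gateway server that exposes worker_cores and
# worker_memory (GB) as cluster options, and a hypothetical run_stage()
# entry point that runs a single stage of the hipscat import pipeline.
from dask_gateway import Gateway

# Hypothetical per-stage worker profiles, sized to each stage's task count.
STAGE_PROFILES = {
    "mapping":   {"workers": 32,  "worker_cores": 1, "worker_memory": 8},
    "splitting": {"workers": 128, "worker_cores": 2, "worker_memory": 16},
    "reducing":  {"workers": 128, "worker_cores": 2, "worker_memory": 16},
}

gateway = Gateway()
for stage, profile in STAGE_PROFILES.items():
    # Spin up a cluster shaped for this stage; the Kubernetes autoscaler
    # provisions worker pods up to the configured limit.
    cluster = gateway.new_cluster(
        worker_cores=profile["worker_cores"],
        worker_memory=profile["worker_memory"],
    )
    cluster.scale(profile["workers"])
    with cluster.get_client() as client:
        run_stage(stage, client)  # hypothetical single-stage entry point
    cluster.shutdown()
```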

@delucchi-cmu
Contributor Author

@troyraen I think this would also have helped you in the past with performing manual verification between stages of the import pipeline.

@troyraen
Collaborator

troyraen commented Aug 9, 2024

Yes, ➕1 for this. I'm less concerned about manual verification between steps now, with your recent updates @delucchi-cmu (thank you!) and my growing familiarity with the pipeline (trusting it to fail appropriately). The big benefit for me would be the ability to use different cluster configurations for different stages, like @johnny-gossick said. In some cases, stages like "mapping" will overload our filesystem if there are too many workers, while other steps can use far more. I end up copying the main run function into a separate file so that I can scale the cluster between steps, roughly as sketched below.
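
For reference, that workaround looks roughly like this (a sketch only; `map_stage`/`split_stage` and `args` are hypothetical stand-ins for the stage bodies and arguments copied out of the run function):

```python
# Sketch of the current workaround: the pipeline's run function is copied
# apart so the cluster can be rescaled between stages. map_stage and
# split_stage are hypothetical stand-ins for the copied stage bodies.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8)  # few workers: keep "mapping" filesystem-friendly
client = Client(cluster)

map_stage(args, client)    # hypothetical copied stage body

cluster.scale(64)          # later stages tolerate many more workers
split_stage(args, client)  # hypothetical copied stage body

client.close()
cluster.close()
```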
