Catalog import: Option to run a single stage #361
For static datasets, a hipscat import job may only need to be run once, so the total runtime of the job is not a major concern. However, some datasets receive regular updates, so we may need to run hipscat import on either the full dataset or just the new data (if possible) on a regular schedule. For frequently updated catalog data, it becomes very important to reduce the runtime of a hipscat job by increasing parallelism and defining the ideal worker configuration for each stage, and to automate these recurring import jobs to reduce manual effort.
Different stages of the pipeline could benefit from different amounts of memory, numbers of CPU cores, and numbers of Dask workers. For example, the optimum worker configuration for the mapping stage of an import job with 32 input files might be 32 workers with one core each and X GB of memory, because only 32 tasks would be created. Later stages may produce many more tasks and could benefit from a much larger number of Dask workers; in general, each stage may need different amounts of CPU and memory.
If hipscat import had an option to choose which stage(s) of the import pipeline to run, we could use Dask Gateway on an autoscaling Kubernetes cluster to dynamically spin up different types and numbers of Dask workers (up to a set limit) for each stage of the hipscat pipeline. This could dramatically decrease the total runtime of each import job, which would allow us to automate these jobs and run them on a regular schedule.
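As a rough illustration only, the kind of per-stage orchestration this would enable might look like the sketch below. It assumes a Dask Gateway deployment that exposes worker_cores and worker_memory as cluster options, uses a hypothetical make_import_args helper, and relies on the run_stages argument proposed in this issue (not an existing hipscat-import option); the per-stage worker counts are placeholders, not recommendations.
from dask_gateway import Gateway
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client
def make_import_args(run_stages):
    # Hypothetical helper; argument names follow recent hipscat-import
    # releases and vary slightly between versions. run_stages is the
    # option proposed in this issue, not an existing argument.
    return ImportArguments(
        input_path="/data/my_survey",
        file_reader="parquet",
        output_path="/catalogs",
        output_artifact_name="my_survey",
        run_stages=run_stages,
    )
# Placeholder per-stage worker shapes; the right numbers depend on the
# dataset and the filesystem, so treat these as examples only.
STAGE_WORKERS = {
    "mapping": {"workers": 32, "cores": 1, "memory": 8},
    "splitting": {"workers": 64, "cores": 1, "memory": 16},
    "reducing": {"workers": 128, "cores": 1, "memory": 8},
}
gateway = Gateway()
for stage, shape in STAGE_WORKERS.items():
    # The exposed option names (worker_cores, worker_memory) and their
    # units are defined by the Gateway deployment; memory is assumed to
    # be GiB here.
    options = gateway.cluster_options()
    options.worker_cores = shape["cores"]
    options.worker_memory = shape["memory"]
    cluster = gateway.new_cluster(options)
    cluster.scale(shape["workers"])
    client = cluster.get_client()
    try:
        # Relies on the pipeline's resume plan to pick up where the
        # previous stage left off.
        pipeline_with_client(make_import_args(run_stages=[stage]), client)
    finally:
        client.close()
        cluster.shutdown()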
@troyraen I think this would have also helped you in the past with performing manual verification in between stages of the import pipeline.
Yes, +1 for this. I'm less concerned about manual verification between steps now with your recent updates @delucchi-cmu (thank you!) and my growing familiarity with the pipeline (trusting it to fail appropriately). The big benefit for me would be the ability to use different cluster configurations for different stages, like @johnny-gossick said. In some cases, certain stages like "mapping" will overload our filesystem if there are too many workers, but other steps can use a lot more workers. I end up copying the main
Feature request
Add a new argument to specify the stages of the pipeline to run. If not specified, just run the whole thing.
e.g.
run_stages=['mapping', 'splitting']
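A minimal sketch of how this could look from the caller's side, assuming the current ImportArguments / pipeline_with_client entry points (exact argument names vary between hipscat-import versions, and run_stages itself is the proposed addition):
from dask.distributed import Client
from hipscat_import.catalog.arguments import ImportArguments
from hipscat_import.pipeline import pipeline_with_client
args = ImportArguments(
    input_path="/data/my_survey",
    file_reader="parquet",
    output_path="/catalogs",
    output_artifact_name="my_survey",
    # Proposed: run only the named stages, in pipeline order; omitting
    # the argument would keep today's behavior of running everything.
    run_stages=["mapping", "splitting"],
)
with Client(n_workers=32, threads_per_worker=1) as client:
    pipeline_with_client(args, client)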