Support uploading more files from the target directory to remote_target_path #1293

Open
pankajkoti opened this issue Oct 30, 2024 · 3 comments
Labels
- area:config: Related to configuration, like YAML files, environment variables, or executor configuration
- area:execution: Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc.
- stale: Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed
- triage-needed: Items need to be reviewed / assigned to milestone

Comments

@pankajkoti (Contributor) commented Oct 30, 2024

Currently, the remote_target_path configuration, added in PR #1224, only uploads files from the compiled directory within the dbt project's target directory, and only when ExecutionMode.AIRFLOW_ASYNC is enabled. However, the target directory contains several other files and folders that would benefit users if they were also uploaded to remote_target_path.

Beyond the compiled directory, the target directory typically includes:

  1. run/ folder
  2. graph.gpickle
  3. graph_summary.json
  4. manifest.json
  5. partial_parse.msgpack
  6. run_results.json
  7. semantic_manifest.json

A specific request was made in a Slack conversation to have run_results.json uploaded and accessible in remote_target_path, highlighting its value to users.

We should evaluate the potential benefits of supporting uploads for these additional files and folders and explore enabling this feature across all execution modes, not just ExecutionMode.AIRFLOW_ASYNC. Additionally, it may be worthwhile to consider uploading files from the compiled directory in other execution modes if it proves beneficial.

We could potentially create sub-tasks for each of these files and folders to evaluate the benefits of uploading them and to add support for uploading each to the remote_target_path. A configuration sketch is included below for reference.
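For reference, a minimal sketch of how the existing remote_target_path settings (named in PR #1224 and the commit below) can be configured. This assumes, per the usual Airflow convention, that Cosmos reads them from a `[cosmos]` config section; the bucket path and connection id are placeholders:

```python
import os

# Placeholder values: Cosmos settings are assumed to live in the [cosmos]
# section of the Airflow config, so they can also be set in airflow.cfg
# instead of via environment variables.
os.environ["AIRFLOW__COSMOS__REMOTE_TARGET_PATH"] = "s3://my-bucket/dbt/target"
os.environ["AIRFLOW__COSMOS__REMOTE_TARGET_PATH_CONN_ID"] = "aws_default"
```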

@dosubot dosubot bot added the area:config and area:execution labels Oct 30, 2024
@pankajkoti pankajkoti added the triage-needed label Oct 30, 2024
@joppedh commented Oct 31, 2024

@pankajkoti Uploading partial_parse.msgpack would speed up the runtime. Currently, each task run still needs to do a full parse, even when a manifest.json is provided:

[2024-10-31, 06:03:48 UTC] {logging_mixin.py:188} INFO - 06:03:48  Unable to do partial parsing because saved manifest not found. Starting full parse.

@tatiana (Collaborator) commented Nov 6, 2024

@joppedh, on the partial parsing side, have you been able to leverage https://astronomer.github.io/astronomer-cosmos/configuration/partial-parsing.html#partial-parsing? Cosmos should be caching it, but you'd need to use render_config=RenderConfig(enable_mock_profile=False).
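For illustration, a minimal sketch of that suggestion; the dag_id, project path, and profile details are placeholders, and only `enable_mock_profile=False` comes from the comment above:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig, RenderConfig

# Sketch only: paths, profile names, and the dag_id are placeholders.
# Disabling the mock profile is what lets Cosmos cache and reuse
# partial_parse.msgpack across runs, per the linked docs.
dag = DbtDag(
    dag_id="partial_parsing_example",
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/my_project/profiles.yml",
    ),
    render_config=RenderConfig(enable_mock_profile=False),
    schedule=None,
    start_date=datetime(2024, 1, 1),
)
```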


github-actions bot commented Dec 7, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Dec 7, 2024
tatiana pushed a commit that referenced this issue Dec 17, 2024
…te cloud storages (#1389)

This PR introduces helper functions that can be passed as callable callbacks to Cosmos tasks and executed after task completion. These helper functions enable uploading artifacts (from the project's target directory) to various cloud storage providers, including AWS S3, Google Cloud Storage (GCS), Azure WASB, and general remote object stores using Airflow's ObjectStoragePath.

## Key Changes

Adds a `cosmos/io.py` module that includes the following helper functions:

1. `upload_artifacts_to_aws_s3`
   - Uploads artifact files from a task's local target directory to an AWS S3 bucket.
   - Supports dynamically appending DAG metadata (e.g., dag_id, task_id, run_id, and try number) to the uploaded file paths.
   - Utilizes S3Hook from the airflow.providers.amazon.aws module.

2. `upload_artifacts_to_gcp_gs`
   - Uploads artifact files from a task's local target directory to a Google Cloud Storage (GCS) bucket.
   - Appends DAG-related context to the GCS object names for better traceability.
   - Leverages GCSHook from airflow.providers.google.cloud.

3. `upload_artifacts_to_azure_wasb`
   - Uploads artifact files from a task's local target directory to an Azure Blob Storage container.
   - Automatically structures blob names with metadata, including dag_id, task_id, and execution details.
   - Utilizes WasbHook from the airflow.providers.microsoft.azure module.

4. `upload_artifacts_to_cloud_storage`
   - A generic helper function that uploads artifacts from a task's local target directory to remote object stores configured via Airflow's ObjectStoragePath (an Airflow 2.8+ feature).
   - Supports custom remote storage configurations such as `remote_target_path` and `remote_target_path_conn_id`.
   - Dynamically constructs file paths that include DAG metadata for clear organization.
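To make the pattern concrete, here is a rough sketch of the shape such a helper could take. The function name, signature, and upload logic below are illustrative assumptions, not the exact code in `cosmos/io.py`; it assumes the callback receives the path to the task's local copy of the dbt project:

```python
import os

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def upload_target_dir_to_aws_s3(
    project_dir: str,
    bucket_name: str,
    aws_conn_id: str = "aws_default",
    **kwargs,
) -> None:
    """Walk the task's local target directory and upload every file to S3.

    Hypothetical helper: parameter names and behavior are assumptions
    modeled on the PR description, not the shipped implementation.
    """
    target_dir = os.path.join(project_dir, "target")
    hook = S3Hook(aws_conn_id=aws_conn_id)
    for root, _, files in os.walk(target_dir):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            # Preserve the target directory's relative layout in the bucket key.
            key = os.path.relpath(local_path, project_dir)
            hook.load_file(
                filename=local_path,
                key=key,
                bucket_name=bucket_name,
                replace=True,
            )
```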
     
These helper functions can be passed as the `callback` argument to `DbtDag`, or to your `DAG` instance, as demonstrated in the example DAGs `dev/dags/cosmos_callback_dag.py` and `dev/dags/example_operators.py`, respectively. You can also pass `callback_args`, as shown in the example DAGs. These helpers are merely examples of how callback functions can be written and passed to your operators/DAGs to be executed after task completion. Using them as a reference, you can write your own callback functions and pass those in.
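As a usage sketch, modeled loosely on `dev/dags/cosmos_callback_dag.py`: the dag_id, paths, connection id, bucket name, and the specific `callback_args` keys below are assumptions, not values from the PR.

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.io import upload_artifacts_to_aws_s3

dag = DbtDag(
    dag_id="cosmos_callback_example",  # placeholder dag_id
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_profile",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/my_project/profiles.yml",
    ),
    # Invoked after each task finishes; the kwargs below are assumed names
    # forwarded to the callback via callback_args.
    callback=upload_artifacts_to_aws_s3,
    callback_args={"aws_conn_id": "aws_default", "bucket_name": "my-artifact-bucket"},
    schedule=None,
    start_date=datetime(2024, 1, 1),
)
```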


## Limitations

1. This PR has been tested and is currently supported only in `ExecutionMode.LOCAL`. We encourage the community to contribute by adding callback support for other execution modes as needed, using the implementation for `ExecutionMode.LOCAL` as a reference.

closes: #1350
closes: #976
closes: #867
closes: #801
closes: #1292
closes: #851
closes: #1351 
related: #1293
related: #1349