"Failed to parse the container spec json payload to requested prototype" within CustomTrainingJobOp #419

Closed
TheWispy opened this issue Mar 28, 2022 · 6 comments

Comments

@TheWispy

Expected Behavior

  1. Submit a CustomTrainingJobOp that uses VPC peering and the associated reserved IP ranges, with pipeline params passed as args.
  2. The component runs successfully.

Actual Behavior

  1. The compiler fails, complaining that the PipelineParam is not JSON serializable:
    TypeError: Object of type PipelineParam is not JSON serializable
    as seen here. For a parameter being passed to a training operation, this really doesn't make any sense.

If all pipeline params are removed before compilation, the Vertex component fails with the following error.
image

The full redacted object dump is here.

{
    "display_name": "SOMEOP",
    "job_spec": {
        "worker_pool_specs": [
            {
                "containerSpec": {
                    "args": [
                        "-A",
                        "AAAA",
                        "-B",
                        "BBBB",
                        "-C",
                        "CCCCC",
                        "-D",
                        "SOME_GS_URL"
                    ],
                    "env": [
                        {
                            "name": "AIP_MODEL_DIR",
                            "value": "SOME_GS_URL"
                        }
                    ],
                    "imageUri": "SOME_CONTAINER_IMAGE"
                },
                "replicaCount": "1",
                "machineSpec": {
                    "machineType": "n1-standard-8"
                }
            }
        ],
        "scheduling": {
            "timeout": "15m",
            "restart_job_on_worker_restart": "false"
        },
        "service_account": "A@cB.iam.gserviceaccount.com",
        "tensorboard": "TENSORBOARD_ID",
        "enable_web_access": "false",
        "network": "NETWORK_ID",
        "reserved_ip_ranges": [
            "google-reserved-range"
        ],
        "base_output_directory": {
            "output_uri_prefix": "SOME_GS_URL"
        }
    },
    "labels": {},
    "encryption_spec": {
        "kms_key_name": ""
    }
}

This is despite me following the guide as described here, which seems a little outdated in places? Any help would be greatly appreciated. Cheers!

Steps to Reproduce the Problem

  1. google-cloud-pipeline-components = ^1.0.1
  2. kfp ^1.8.11
  3. Compile pipeline and upload to vertex
  4. Training component fails

Specifications

  • Version: 1.0.1
  • Platform: Vertex AI on GCP
@andrewferlitsch
Contributor

@TheWispy Hello, could you paste in your pipeline definition?

As for PipelineParams not being JSON serializable: that in itself is correct. While a pipeline definition looks like Python code, it is not actually Python code. The parameters you pass into the pipeline definition are of type PipelineParam, which are placeholders whose values are not known at compile time, and they have some limitations. For example, their values cannot be JSON serialized.
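
To illustrate the point above, here is a minimal sketch in plain Python. `PlaceholderParam` is a hypothetical stand-in for a compile-time placeholder, not the real PipelineParam class, and `try_serialize` is a helper made up for this example. It only shows why `json.dumps` rejects an object whose value is not known until run time, while the same spec with a concrete string serializes fine.

```python
import json


class PlaceholderParam:
    """Hypothetical stand-in for a compile-time pipeline parameter."""

    def __init__(self, name):
        self.name = name


def try_serialize(spec):
    """Return the JSON string, or the TypeError message if serialization fails."""
    try:
        return json.dumps(spec)
    except TypeError as err:
        return f"TypeError: {err}"


# A container spec whose args contain a placeholder object instead of a string.
spec_with_param = {"containerSpec": {"args": ["-A", PlaceholderParam("a_value")]}}
# The same spec with a concrete string in place of the placeholder.
spec_with_string = {"containerSpec": {"args": ["-A", "AAAA"]}}

print(try_serialize(spec_with_param))
print(try_serialize(spec_with_string))
```

The first call reports the same kind of TypeError as in the original report; the second returns a JSON string.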

@TheWispy
Author

TheWispy commented Apr 6, 2022

Hi there! Oddly, I managed to get this working by removing the "timeout" definition, but the JSON serialisation issue remained. With regard to the pipeline parameters, I originally wanted the string from the parameter to be passed to my training container as an argument, but no matter how I reconfigured it, it wouldn't compile.
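
One possible explanation for why removing "timeout" helped, offered as an assumption rather than something confirmed in this thread: the protobuf JSON mapping writes a Duration field as seconds with an "s" suffix (for example "900s"), so a value such as "15m" may be what the parser rejected. The helper below is a hypothetical sketch, not part of google-cloud-pipeline-components, that converts shorthand like "15m" into that form.

```python
import re

# Seconds per supported unit suffix; the shorthand format itself is an assumption.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_duration_json(value):
    """Convert shorthand such as "15m" or "2h" to protobuf-style seconds, e.g. "900s"."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return f"{int(amount) * _UNIT_SECONDS[unit]}s"


# Build the scheduling fragment with the timeout expressed in seconds.
scheduling = {"timeout": to_duration_json("15m")}
print(scheduling)
```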

@andrewferlitsch
Contributor

@TheWispy Thanks for the update. Can you paste in the pipeline code which won't compile?

@andrewferlitsch
Contributor

It's been over a month with no response. Closing the issue per policy.

@RansSelected

RansSelected commented Dec 20, 2022

Hi @andrewferlitsch !

I have the same issue with my pipeline. It appears when I construct a worker_pool_spec in a separate component and then pass its output to the CustomTrainingJobOp. Here is the code sample:

from the pipeline:

    get_worker_dict_task = (
        get_worker_dict(
            split_data_task.outputs["df_train"], BUCKET_URI
        )
        .set_caching_options(False)
        .set_display_name("get worker dict task")
    )

    worker_pool_specs_list = get_worker_dict_task.outputs["worker_dict"]

    custom_job_task = CustomTrainingJobOp(
        project=project,
        display_name="model-training",
        worker_pool_specs=worker_pool_specs_list,
        base_output_directory=MODEL_DIR,
        location=REGION,
    ).after(get_worker_dict_task)

the component to construct the dict:

@component(
    base_image="actual_image_hidden_due_to_privacy",
    output_component_file="create_worker_pool_spec_dict.yaml",
)
def get_worker_dict(
    df_train: Input[Dataset], BUCKET_URI: str
) -> NamedTuple("Outputs", [("worker_dict", list)],):

    import json
    import pandas as pd
    from collections import namedtuple

    # Convert the local mount path (/gcs/<bucket>/...) back into a gs:// URI.
    data_uri = "gs://" + df_train.path.rsplit("/gcs/")[1].replace("//", "/")
    data_path = data_uri + ".csv"
    worker_pool_spec_dict = [
        {
            "machineSpec": {
                "machineType": "e2-standard-4",
                "acceleratorType": "ACCELERATOR_TYPE_UNSPECIFIED",
                "acceleratorCount": 0,
            },
            "replicaCount": "1",
            "containerSpec": {
                "imageUri": "my_image_actuall_imahe_hiden_due_to_privacy_in_this_code_sample",
                "args": ["--dataset_url", data_path],
                "env": [{"name": "AIP_MODEL_DIR", "value": BUCKET_URI}],
            },
        }
    ]
    example_output = namedtuple("Outputs", ["worker_dict"])
    return example_output(worker_pool_spec_dict)
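
For comparison, here is a sketch of building the same structure as a plain Python list at pipeline-definition time, so that it can be checked with `json.dumps` before it is handed to the training op. `build_worker_pool_specs` is a hypothetical helper, and the image name and bucket paths are placeholders. Whether passing a literal list instead of a component output avoids the parse error is an assumption that this thread does not confirm.

```python
import json


def build_worker_pool_specs(image_uri, dataset_url, model_dir):
    """Return a one-element worker pool spec list made only of JSON-friendly types."""
    return [
        {
            "machineSpec": {"machineType": "e2-standard-4"},
            "replicaCount": "1",
            "containerSpec": {
                "imageUri": image_uri,
                "args": ["--dataset_url", dataset_url],
                "env": [{"name": "AIP_MODEL_DIR", "value": model_dir}],
            },
        }
    ]


# Placeholder values; substitute the real image and bucket paths.
specs = build_worker_pool_specs(
    "SOME_CONTAINER_IMAGE", "gs://SOME_BUCKET/train.csv", "gs://SOME_BUCKET/model"
)
# Round-trip through JSON to confirm nothing non-serializable slipped in.
round_tripped = json.loads(json.dumps(specs))
print(round_tripped == specs)
```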
