Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker #6911

Open
adammkerr opened this issue Sep 4, 2024 · 10 comments
Assignees
Labels

Comments

@adammkerr
Copy link

adammkerr commented Sep 4, 2024

System information

  • Have I specified the code to reproduce the issue (Yes, No): Yes
  • Environment in which the code is executed: Vertex AI
  • TensorFlow version: 2.15.1
  • TFX Version: 1.15.1
  • Python version: 3.10.14
  • Python dependencies (from pip freeze output):
absl-py==1.4.0
annotated-types==0.7.0
anyio==4.4.0
apache-beam==2.58.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
astunparse==1.6.3
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
babel==2.16.0
backcall==0.2.0
beautifulsoup4==4.12.3
bleach==6.1.0
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
comm==0.2.2
crcmod==1.7
debugpy==1.8.5
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
dnspython==2.6.1
docker==4.4.4
docopt==0.6.2
docstring_parser==0.16
exceptiongroup==1.2.2
fastavro==1.9.5
fasteners==0.19
fastjsonschema==2.20.0
fire==0.6.0
flatbuffers==24.3.25
fqdn==1.5.1
gast==0.6.0
google-api-core==2.19.2
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.34.0
google-auth-httplib2==0.2.0
google-auth-oauthlib==1.2.1
google-cloud-aiplatform==1.64.0
google-cloud-bigquery==3.25.0
google-cloud-bigquery-storage==2.25.0
google-cloud-bigtable==2.26.0
google-cloud-core==2.4.1
google-cloud-datastore==2.20.1
google-cloud-dlp==3.22.0
google-cloud-language==2.14.0
google-cloud-pubsub==2.23.0
google-cloud-pubsublite==1.11.1
google-cloud-recommendations-ai==0.10.12
google-cloud-resource-manager==1.12.5
google-cloud-spanner==3.48.0
google-cloud-storage==2.18.2
google-cloud-videointelligence==2.13.5
google-cloud-vision==3.7.4
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
grpc-google-iam-v1==0.13.1
grpc-interceptor==0.15.4
grpcio==1.66.0
grpcio-status==1.48.2
h11==0.14.0
h5py==3.11.0
hdfs==2.7.3
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.2
idna==3.8
ipykernel==6.29.5
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==7.8.3
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
joblib==1.4.2
Js2Py==0.74
json5==0.9.25
jsonpickle==3.2.2
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==1.1.9
keras==2.15.0
keras-tuner==1.4.7
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kt-legacy==1.0.5
kubernetes==12.0.1
libclang==18.1.1
lxml==5.3.0
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
ml-dtypes==0.3.2
ml-metadata==1.15.0
ml-pipelines-sdk==1.15.1
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
nltk==3.9.1
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.26.4
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.7.0
opt-einsum==3.3.0
orjson==3.10.7
overrides==7.7.0
packaging==24.1
pandas==1.5.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.4.0
platformdirs==4.2.2
portalocker==2.10.1
portpicker==1.6.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==3.20.3
psutil==6.0.0
ptyprocess==0.7.0
pyarrow==10.0.1
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==1.10.18
pydantic_core==2.20.1
pydot==1.4.2
pyfarmhash==0.3.2
Pygments==2.18.0
pyjsparser==2.7.1
pymongo==4.8.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
pytz==2024.1
PyYAML==6.0.2
pyzmq==26.2.0
redis==5.0.8
referencing==0.35.1
regex==2024.7.24
requests==2.31.0
requests-oauthlib==2.0.0
requests-toolbelt==0.10.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.8.0
rouge-score==0.1.2
rpds-py==0.20.0
rsa==4.9
sacrebleu==2.4.3
scipy==1.12.0
Send2Trash==1.8.3
shapely==2.0.6
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
soupsieve==2.6
sqlparse==0.5.1
strip-hints==0.1.10
tabulate==0.9.0
tensorboard==2.15.2
tensorboard-data-server==0.7.2
tensorflow==2.15.1
tensorflow-data-validation==1.15.1
tensorflow-estimator==2.15.0
tensorflow-hub==0.15.0
tensorflow-io-gcs-filesystem==0.37.1
tensorflow-metadata==1.15.0
tensorflow-serving-api==2.15.1
tensorflow-transform==1.15.0
tensorflow_model_analysis==0.46.0
termcolor==2.4.0
terminado==0.18.1
tfx==1.15.1
tfx-bsl==1.15.1
tinycss2==1.3.0
tomli==2.0.1
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
typer==0.12.5
types-python-dateutil==2.9.0.20240821
typing_extensions==4.12.2
tzlocal==5.2
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.19
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.0.4
widgetsnbextension==3.6.8
wrapt==1.14.1
zstandard==0.23.0

Describe the current behavior
As per the following links and issues:

I am using a custom docker container, extended from the corresponding version of the official tensorflow/tfx image, as my Pipeline default image and Dataflow sdk_container_image.

I have:

  1. Created a custom image using the methods defined in the links above
  2. Specified that image as my default_image in KubeflowV2DagRunnerConfig
  3. Added the following params to my beam_args for each component which leverages Beam:
    '--experiments=use_runner_v2', 
    '--sdk_container_image=' + PIPELINE_IMAGE,
    '--sdk_location=container',

The worker will start, however the worker container hangs with the following error:

image

I am not sure why Apache Beam is not installed in the runtime environment, I can investigate the built image using:

docker run --rm -it \
		--name ${DOCKER_IMAGE_NAME} \
		--entrypoint=/bin/bash \
		${DOCKER_IMAGE_URI}

and I can confirm that apache-beam exists within the container:

root@dd7dd38913f6:/pipeline# pip list --format=freeze
absl-py==1.4.0
aiofiles==22.1.0
aiohttp==3.9.3
aiohttp-cors==0.7.0
aiosignal==1.3.1
aiosqlite==0.20.0
anyio==4.3.0
apache-beam==2.56.0
archspec==0.2.3
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
array_record==0.5.1
arrow==1.3.0
asttokens==2.4.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
Babel==2.14.0
backcall==0.2.0
backports.tarfile==1.1.0
beatrix_jupyterlab==2024.49.170251
beautifulsoup4==4.12.3
bleach==6.1.0
blessed==1.20.0
boltons==24.0.0
Brotli==1.1.0
brotlipy==0.7.0
cached-property==1.5.2
cachetools==4.2.4
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloud-tpu-client==0.10
cloud-tpu-profiler==2.4.0
cloudpickle==2.2.1
colorama==0.4.6
colorful==0.5.6
comm==0.2.2
conda==24.3.0
conda-libmamba-solver==24.1.0
conda-package-handling==2.2.0
conda_package_streaming==0.9.0
contourpy==1.2.1
crcmod==1.7
cryptography==42.0.5
cycler==0.12.1
Cython==3.0.10
dacite==1.8.1
dataproc_jupyter_plugin==0.1.74
db-dtypes==1.2.0
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.1.1
distlib==0.3.8
distro==1.9.0
dm-tree==0.1.8
docker==4.4.4
docopt==0.6.2
docstring_parser==0.16
entrypoints==0.4
etils==1.7.0
exceptiongroup==1.2.0
executing==2.0.1
explainable-ai-sdk==1.3.3
Farama-Notifications==0.0.4
fastapi==0.110.2
fastavro==1.9.4
fasteners==0.19
fastjsonschema==2.19.1
filelock==3.13.4
fire==0.6.0
flatbuffers==24.3.25
fonttools==4.51.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.3.1
gast==0.5.4
gcsfs==2024.3.1
gitdb==4.0.11
GitPython==3.1.43
google-api-core==2.11.1
google-api-python-client==1.12.11
google-apitools==0.5.31
google-auth==2.29.0
google-auth-httplib2==0.1.1
google-auth-oauthlib==1.2.0
google-cloud-aiplatform==1.48.0
google-cloud-artifact-registry==1.11.3
google-cloud-bigquery==3.21.0
google-cloud-bigquery-storage==2.16.2
google-cloud-bigtable==2.23.1
google-cloud-core==2.4.1
WARNING: No metadata found in /opt/conda/lib/python3.10/site-packages
google-cloud-datastore==1.15.5
google-cloud-dlp==3.16.0
google-cloud-jupyter-config==0.0.5
google-cloud-language==2.13.3
google-cloud-monitoring==2.21.0
google-cloud-pubsub==2.21.1
google-cloud-pubsublite==1.10.0
google-cloud-recommendations-ai==0.7.1
google-cloud-resource-manager==1.12.3
google-cloud-spanner==3.45.0
google-cloud-storage==2.14.0
google-cloud-videointelligence==2.13.3
google-cloud-vision==3.7.2
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
gpustat==1.0.0
greenlet==3.0.3
grpc-google-iam-v1==0.12.7
grpc-interceptor==0.15.4
grpcio==1.62.2
grpcio-status==1.48.1
gviz-api==1.10.0
gymnasium==0.28.1
h11==0.14.0
h5py==3.11.0
hdfs==2.7.3
htmlmin==0.1.12
httplib2==0.21.0
httptools==0.6.1
idna==3.7
ImageHash==4.3.1
imageio==2.34.0
importlib-metadata==7.0.0
importlib_resources==6.4.0
ipykernel==6.29.3
ipython==7.34.0
ipython-genutils==0.2.0
ipython-sql==0.5.0
ipywidgets==7.8.1
isoduration==20.11.0
jaraco.classes==3.4.0
jaraco.context==5.3.0
jaraco.functools==4.0.1
jax-jumpy==1.0.0
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.3
joblib==1.4.0
Js2Py==0.74
json5==0.9.25
jsonpatch==1.33
jsonpickle==3.0.4
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter_client==7.4.9
jupyter_core==5.7.2
jupyter-events==0.10.0
jupyter-http-over-ws==0.0.8
jupyter_server==2.14.0
jupyter_server_fileid==0.9.2
jupyter-server-mathjax==0.2.6
jupyter_server_proxy==4.1.2
jupyter_server_terminals==0.5.3
jupyter_server_ydoc==0.8.0
jupyter-ydoc==0.2.5
jupyterlab==3.6.6
jupyterlab_git==0.44.0
jupyterlab_pygments==0.3.0
jupyterlab_server==2.26.0
jupyterlab-widgets==1.1.7
jupytext==1.16.1
keras==2.15.0
keras-tuner==1.4.7
kernels-mixer==0.0.10
keyring==25.1.0
keyrings.google-artifactregistry-auth==1.1.2
kfp==1.8.22
kfp-pipeline-spec==0.1.16
kfp-server-api==1.8.5
kiwisolver==1.4.5
kt-legacy==1.0.5
kubernetes==12.0.1
lazy_loader==0.4
libclang==18.1.1
libmambapy==1.5.8
linkify-it-py==2.0.3
llvmlite==0.41.1
lxml==5.2.2
lz4==4.3.3
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.0.1
matplotlib==3.7.3
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.0
mdurl==0.1.2
memray==1.12.0
menuinst==2.0.2
mistune==3.0.2
ml-dtypes==0.3.2
ml-metadata==1.15.0
ml-pipelines-sdk==1.15.1
mmh==2.2
more-itertools==10.2.0
msgpack==1.0.8
multidict==6.0.5
multimethod==1.11.2
nb_conda==2.2.1
nb_conda_kernels==2.3.1
nbclassic==1.0.0
nbclient==0.10.0
nbconvert==7.16.3
nbdime==3.2.0
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
nltk==3.8.1
notebook==6.5.4
notebook_executor==0.2
notebook_shim==0.2.4
numba==0.58.1
numpy==1.24.4
nvidia-ml-py==11.495.46
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opencensus==0.11.4
opencensus-context==0.1.3
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
opt-einsum==3.3.0
orjson==3.10.1
overrides==7.7.0
packaging==24.0
pandas==1.5.3
pandas-profiling==3.6.6
pandocfilters==1.5.0
papermill==2.5.0
parso==0.8.4
patsy==0.5.6
pendulum==3.0.0
pexpect==4.9.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.3.0
pip==24.0
pkgutil_resolve_name==1.3.10
platformdirs==3.11.0
plotly==5.21.0
pluggy==1.5.0
portalocker==2.8.2
portpicker==1.6.0
prettytable==3.10.0
prometheus_client==0.20.0
promise==2.3
prompt-toolkit==3.0.42
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-spy==0.3.14
pyarrow==10.0.1
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycosat==0.6.6
pycparser==2.22
pydantic==1.10.15
pydot==1.4.2
pyfarmhash==0.3.2
Pygments==2.17.2
pyjsparser==2.7.1
PyJWT==2.8.0
pymongo==3.13.0
pyOpenSSL==24.0.0
pyparsing==3.1.2
PySocks==1.7.1
python-dateutil==2.9.0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-snappy==0.5.4
pytz==2024.1
pyu2f==0.1.5
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==24.0.1
ray==2.11.0
ray-cpp==2.11.0
redis==5.0.4
referencing==0.34.0
regex==2024.4.16
requests==2.31.0
requests-oauthlib==2.0.0
requests-toolbelt==0.10.1
retrying==1.3.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rouge_score==0.1.2
rpds-py==0.18.0
rsa==4.9
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
ruamel-yaml-conda==0.15.100
sacrebleu==2.4.2
scikit-image==0.23.2
scikit-learn==1.4.2
scipy==1.11.4
seaborn==0.12.2
SecretStorage==3.3.3
Send2Trash==1.8.3
setuptools==69.5.1
shapely==2.0.4
shellingham==1.5.4
simpervisor==1.0.0
six==1.16.0
smart-open==7.0.4
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
SQLAlchemy==2.0.29
sqlparse==0.5.0
stack-data==0.6.2
starlette==0.37.2
statsmodels==0.14.2
strip-hints==0.1.10
tabulate==0.9.0
tangled-up-in-unicode==0.2.0
tenacity==8.2.3
tensorboard==2.15.2
tensorboard-data-server==0.7.2
tensorboard_plugin_profile==2.15.1
tensorboardX==2.6.2.2
tensorflow==2.15.1
tensorflow-cloud==0.1.16
tensorflow-data-validation==1.15.1
tensorflow-datasets==4.9.4
tensorflow-estimator==2.15.0
tensorflow-hub==0.15.0
tensorflow-io==0.24.0
tensorflow-io-gcs-filesystem==0.24.0
tensorflow-metadata==1.15.0
tensorflow_model_analysis==0.46.0
tensorflow-probability==0.24.0
tensorflow-serving-api==2.15.1
tensorflow-transform==1.15.0
termcolor==2.4.0
terminado==0.18.1
textual==0.57.1
tf_keras==2.15.1
tfx==1.15.1
tfx-bsl==1.15.1
threadpoolctl==3.4.0
tifffile==2024.4.18
time-machine==2.14.1
tinycss2==1.2.1
toml==0.10.2
tomli==2.0.1
tornado==6.4
tqdm==4.66.2
traitlets==5.9.0
truststore==0.8.0
typeguard==4.2.1
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.11.0
typing-utils==0.1.0
tzdata==2024.1
tzlocal==5.2
uc-micro-py==1.0.3
uri-template==1.3.0
uritemplate==3.0.1
urllib3==1.26.18
uvicorn==0.29.0
uvloop==0.19.0
virtualenv==20.21.0
visions==0.7.5
watchfiles==0.21.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
websockets==12.0
Werkzeug==2.1.2
wheel==0.43.0
widgetsnbextension==3.6.6
witwidget==1.8.1
wordcloud==1.9.3
wrapt==1.14.1
y-py==0.6.2
yarl==1.9.4
ydata-profiling==4.6.0
ypy-websocket==0.8.4
zipp==3.17.0
zstandard==0.22.0

Describe the expected behavior

The worker stages should start successfully, write worker logs, and complete. For example:

image

Note the image above was achieved using the following beam args:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--prebuild_sdk_container_engine=cloud_build',
    '--docker_registry_push_url=northamerica-northeast1-docker.pkg.dev/prj-cxbi-dev-nane1-dsc-ttep/ml',
]

However this is not feasible because I need to get my custom code into the image for components which reference it (Transform, Trainer, Evalutaor). For example I get the following error with the Transform component (as I should):

Error processing instruction process_bundle-789893474543504259-101. Original traceback is Traceback (most recent call last): File "/usr/local/lib/python3.10/site-packages/apache_beam/internal/dill_pickler.py", line 418, in loads return dill.loads(s) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 275, in loads return load(file, ignore, **kwds) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 270, in load return Unpickler(file, ignore=ignore, **kwds).load() File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 472, in load obj = StockUnpickler.load(self) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class return StockUnpickler.find_class(self, module, name) ModuleNotFoundError: No module named 'models'

Standalone code to reproduce the issue

My files are arranged as such:

- my_pipeline (the name of my pipeline)
     - data
     - models
          - model
              -  __init__.py
              - constants.py
              - model.py
              - predict_extractor.py
              - tuner.py
          - __init__.py
          - features.py
          - features.py
          - preprocessing.py
          - query.sql
     - pipeline
          - __init__.py
          - configs.py
          - pipeline.py
     - __init__.py
     - Dockerfile
     - kubeflow_v2_runner.py
     - Makefile
     - requirements.txt 

Dockerfile:

FROM tensorflow/tfx:1.15.1
WORKDIR /pipeline
COPY ./ ./
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"

Apache Beam Args:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--sdk_container_image=' + PIPELINE_IMAGE,
    '--sdk_location=container',
]

KubeFlow Runner & Config:

runner_config = kubeflow_v2_dag_runner.KubeflowV2DagRunnerConfig(
        default_image=configs.PIPELINE_IMAGE)

dsl_pipeline = pipeline.create_pipeline(**args)

runner = kubeflow_v2_dag_runner.KubeflowV2DagRunner(config=runner_config)
runner.run(pipeline=dsl_pipeline)

Component init in pipeline.py:

components = []

example_gen = tfx.components.CsvExampleGen(input_base=data_path)

if example_gen_beam_args is not None:
    example_gen.with_beam_pipeline_args(example_gen_beam_args)


return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        # Change this value to control caching of execution results. Default value is `False`.
        enable_cache=True,
        metadata_connection_config=metadata_connection_config
)

@adammkerr adammkerr changed the title Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker. Extending the corresponding version of the official tensorflow/tfx image causes hanging Dataflow Worker Sep 5, 2024
@lego0901 lego0901 self-assigned this Sep 25, 2024
@pritamdodeja
Copy link
Contributor

I have a fully functional tfx pipeline that was running in vertex a couple of months ago. The same pipeline no longer works, does not even get past examplegen. I am also running into dataflow related issues. Is there a canonical pipeline that can be run to verify vertex/dataflow is working as expected? I am able to get that same pipeline to locally via DirectRunner. How should one go about porting local tfx pipelines to vertex/dataflow runner?

@IzakMaraisTAL
Copy link
Contributor

stockUnpickler.load(self) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class return StockUnpickler.find_class(self, module, name) ModuleNotFoundError: No module named 'models'

It sounds like there is some path related problem. The Dockerfile looks reasonable.

Have you tried debugging this by running the container locally and then confirming that there is a module named models?

docker run --rm -it --entrypoint /bin/bash <your image>

@adammkerr
Copy link
Author

adammkerr commented Oct 1, 2024

stockUnpickler.load(self) File "/usr/local/lib/python3.10/site-packages/dill/_dill.py", line 462, in find_class return StockUnpickler.find_class(self, module, name) ModuleNotFoundError: No module named 'models'

It sounds like there is some path related problem. The Dockerfile looks reasonable.

Have you tried debugging this by running the container locally and then confirming that there is a module named models?

docker run --rm -it --entrypoint /bin/bash <your image>

Sorry, my description above was a bit obfuscated, let me clarify.

Attempt 1: Extend official Docker Image and add my custom code
If I extend the official image and pre-build a container containing my custom code (as noted by my dockerfile) the Dataflow worker hangs.

So if I use the following .with_beam_pipeline_args(example_gen_beam_args) to launch a dataflow job, it hangs:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--sdk_container_image=' + PIPELINE_IMAGE,
    '--sdk_location=container',
]

Attempt 2: use --prebuild_sdk_container_engine

So if I use the following .with_beam_pipeline_args(example_gen_beam_args) to launch a dataflow job, the container does not hang:

EXAMPLE_GEN_BEAM_ARGS = [
    '--job_name={}'.format(PIPELINE_NAME.replace("_","-") + "-examplegen"),
    '--runner=DataflowRunner',
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--region=' + GOOGLE_CLOUD_REGION,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--staging_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'staging'),
    '--service_account_email=<redacted sa>',
    '--no_use_public_ips',
    '--subnetwork=<redacted subnet>',
    '--machine_type=e2-standard-8',
    '--experiments=use_runner_v2',
    '--max_num_workers=1',
    '--num_workers=1',
    '--disk_size_gb=100',
    '--prebuild_sdk_container_engine=cloud_build',
    '--docker_registry_push_url=northamerica-northeast1-docker.pkg.dev/prj-cxbi-dev-nane1-dsc-ttep/ml',
]

However this isn't a valid solution, because these flags are pre-building the SDK using cloud build, and does not contain any of my custom code required for downstream components (transform, trainer - hence the No module named 'models' error you referenced).

Attempt 3: use custom container, without Dataflow
What even more puzzling, is that if I use my custom container from attempt 1, remove .with_beam_pipeline_args(example_gen_beam_args) from each component which I am trying to leverage dataflow, the pipeline completes.

Which tells there isn't a bug with how the TFX lib is leveraging apache beam, but leveraging apache beam with Dataflow.

@IzakMaraisTAL
Copy link
Contributor

IzakMaraisTAL commented Oct 1, 2024

OK, that narrows it down a bit.

We also extend the TFX base image and successfully use the custom image with Dataflow. I don't have the time to try to reproduce your ticket (I am a user of the project, not a dev), but I can share our working config.

Here is what we pass into .with_beam_pipeline_args:

DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--disk_size_gb=50",
    "--machine_type=n2-standard-2",
    "--experiments=use_runner_v2",
    f"--subnetwork={SUBNETWORK}",
    f"--sdk_container_image={PIPELINE_IMAGE}",
    f"--labels=group={GROUP}",
    f"--labels=team={TEAM}",
    f"--labels=project={PROJECT}",
]

Hope that helps.

@adammkerr
Copy link
Author

adammkerr commented Oct 1, 2024

OK, that narrows it down a bit.

We also extend the TFX base image and successfully use the custom image with Dataflow. I don't have the time to try to reproduce your ticket (I am a user of the project, not a dev), but I can share our working config.

Here is what we pass into .with_beam_pipeline_args:

DATAFLOW_BEAM_PIPELINE_ARGS = [
    f"--project={GOOGLE_CLOUD_PROJECT}",
    "--runner=DataflowRunner",
    f"--temp_location={TEMP_LOCATION}",
    f"--region={GOOGLE_CLOUD_REGION}",
    "--disk_size_gb=50",
    "--machine_type=n2-standard-2",
    "--experiments=use_runner_v2",
    f"--subnetwork={SUBNETWORK}",
    f"--sdk_container_image={PIPELINE_IMAGE}",
    f"--labels=group={GROUP}",
    f"--labels=team={TEAM}",
    f"--labels=project={PROJECT}",
]

Hope that helps.

Unfortunately no luck @IzakMaraisTAL, my DF worker still gives the following logs / error on endless loop:

image

Thank though for the suggestions! What TFX Version are you using? If I roll back to 1.12.0, my exact same code works for dataflow. Any version greater than 1.12.0 seems to hang with the same code / setup.

@IzakMaraisTAL
Copy link
Contributor

Thank though for the suggestions! What TFX Version are you using? If I roll back to 1.12.0, my exact same code works for dataflow. Any version greater than 1.12.0 seems to hang with the same code / setup.

Aah, another clue.

This might be related to one of these tickets #6368, #6386? We currently use TFX 1.14 with the following Dockerfile

FROM tensorflow/tfx:1.14.0
WORKDIR /pipeline

COPY requirements.txt requirements.txt
# Fix pip in the tfx:1.14.0 base image to use the correct python environment (3.8, not 3.10)
RUN sed -i 's/python3/python/g' /usr/bin/pip

RUN pip install  -r requirements.txt

ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
# Fix dataflow job sometimes not starting 
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY version \
    vertex_runner.py \
    config.py ./
COPY ml_pipeline ./ml_pipeline

@adammkerr
Copy link
Author

adammkerr commented Oct 2, 2024

Thank though for the suggestions! What TFX Version are you using? If I roll back to 1.12.0, my exact same code works for dataflow. Any version greater than 1.12.0 seems to hang with the same code / setup.

Aah, another clue.

This might be related to one of these tickets #6368, #6386? We currently use TFX 1.14 with the following Dockerfile

FROM tensorflow/tfx:1.14.0
WORKDIR /pipeline

COPY requirements.txt requirements.txt
# Fix pip in the tfx:1.14.0 base image to use the correct python environment (3.8, not 3.10)
RUN sed -i 's/python3/python/g' /usr/bin/pip

RUN pip install  -r requirements.txt

ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
# Fix dataflow job sometimes not starting 
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
COPY version \
    vertex_runner.py \
    config.py ./
COPY ml_pipeline ./ml_pipeline

If I roll back to 1.14.0 this worked for me!

However using the same Dockerfile and extending from 1.15.1 instead still gives me this endless loop in my Dataflow worker log:
image

I think I am going to roll back to 1.14.0 for now since it seems stable, however going to leave this issue open because there still seems to be a bug in 1.15.1

@IzakMaraisTAL thank you for all your help! Really really appreciated :)

@pritamdodeja
Copy link
Contributor

How would one go about debugging something like this? Would the following be possible/right approach?

  1. Get it to work with direct runner, verify correctness
  2. Execute with portable runner, see if vanilla beam container works (use gcs as filesystem)
  3. If 2 works, then it's something with dataflow, if it doesn't, inject python debugger into container. In this scenario, how do we enter the runtime? I understand containers have a go binary as entrypoint.

Thanks for sharing your knowledge!

@adammkerr
Copy link
Author

How would one go about debugging something like this? Would the following be possible/right approach?

  1. Get it to work with direct runner, verify correctness
  2. Execute with portable runner, see if vanilla beam container works (use gcs as filesystem)
  3. If 2 works, then it's something with dataflow, if it doesn't, inject python debugger into container. In this scenario, how do we enter the runtime? I understand containers have a go binary as entrypoint.

Thanks for sharing your knowledge!

That is a pretty solid approach in my opinion - my approach was:

  1. Develop the pipeline logic in an notebook using InteractiveContext - ensure that I can call each component with specified UDFs.
  2. Test your code - write pytests for each UDF that you create (any module_path / module_file provided to any component)
  3. Launch the job with no beam args - remove .with_beam_pipeline_args from each component which leverages beam. Default behavior will execute the beam jobs in your specified container using DirectRunner.
  4. If a beam-backed component fails - follow the error.
  5. If all beam-backed component are successful - add .with_beam_pipeline_args back to each component which leverages beam. Ensure that you specify beam args to leverage Dataflow (ex. --runner="DataflowRunner")
  6. Launch the job , watch the logs, if a dataflow job fails - follow the error.

In this case, since the logs indicate we are stuck in an endless loop upon worker start up, the idea would be to hop into the code and see how the respective component is launching the Dataflow job and what is going on - something I am currently struggling with (hence the bug report here).

@pritamdodeja
Copy link
Contributor

I have a fully functional tfx pipeline that was running in vertex a couple of months ago. The same pipeline no longer works, does not even get past examplegen. I am also running into dataflow related issues. Is there a canonical pipeline that can be run to verify vertex/dataflow is working as expected? I am able to get that same pipeline to locally via DirectRunner. How should one go about porting local tfx pipelines to vertex/dataflow runner?

My pipeline issue is fixed. I solved it by doing what the error you're seeing is suggesting. The beam sdk version (what we see in python pip output) should match the "boot" version of apache beam (i.e. the entry point of that docker container, that go binary). I solved it by porting my pipeline to a local version that uses PortableRunner, debugging there, and then porting back to GCP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants