
[AIR, callbacks] _MLflowLoggerUtil incompatible with DB MLflow backend #29749

Closed
tbukic opened this issue Oct 27, 2022 · 5 comments · Fixed by #29794
Labels
bug Something that is supposed to be working; but isn't observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling P1 Issue that should be fixed within a few weeks

Comments

@tbukic
Contributor

tbukic commented Oct 27, 2022

What happened + What you expected to happen

Line 199 in the MLflow integration internals seems to cause problems when using MLflow in scenario 5.
The call to the MLflow API fails with:

mlflow.exceptions.RestException: BAD_REQUEST: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "tag_pk"
DETAIL: Key (key, run_uuid)=(mlflow.runName, <RUN_NUMBER>) already exists.

Skipping assignment of the mlflow.runName tag seems to allow creation of the experiment. Without delving into the MLflow code, my best guess (based on the MLflow documentation) is that MLflow automatically creates an mlflow.runName entry in the tags table, and the duplicate insert then fails the unique key constraint.
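That guess can be illustrated with plain sqlite3. This is a minimal sketch: the table below is a hypothetical stand-in mirroring the (key, run_uuid) primary key named in the "tag_pk" error, not MLflow's actual schema:

```python
import sqlite3

# Hypothetical stand-in for MLflow's `tags` table, with a composite
# primary key on (key, run_uuid) like the "tag_pk" constraint above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (key TEXT, value TEXT, run_uuid TEXT, "
    "PRIMARY KEY (key, run_uuid))"
)
conn.execute("INSERT INTO tags VALUES ('mlflow.runName', 'run-a', 'uuid-1')")

err = None
try:
    # A second mlflow.runName tag for the same run violates the key,
    # which is exactly what a duplicate insert of that tag triggers.
    conn.execute("INSERT INTO tags VALUES ('mlflow.runName', 'run-b', 'uuid-1')")
except sqlite3.IntegrityError as exc:
    err = str(exc)

print(err)  # UNIQUE constraint failed: tags.key, tags.run_uuid
```

A filesystem-backed store has no such constraint, which matches the observation that non-database setups work fine.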

Commenting out line 199 fixes the issue locally. In production it could be worked around by changing the Docker image, but ideally this is a one-line fix to make and release.

Versions / Dependencies

The runtime environment for the example is a local Ray instance on Ubuntu 22.04 under WSL2.
The MLflow tracking server is deployed on a private cloud, running MLflow 1.30.0, backed by a PostgreSQL database and using S3 object storage for artifacts.

[tool.poetry.dependencies]
python = "~3.10"
mlflow = "^1.29.0"
ray = "^2.0.0"
gym = "~0.25"
dm-tree = "^0.1.7"
opencv-python = "^4.6.0.66"
lz4 = "^4.0.2"
torch = "^1.12.1"
tensorboard = "^2.10.1"
boto3 = "^1.25.1"

Which on my PC resolves to:

  • Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]
  • Ray 2.0.1
  • mlflow 1.30.0

Reproduction script

Any code can be used once you have a proper MLflow setup. Unfortunately mine is deployed on the company's VPN and I can't share it. We're using setup #5, but I'm pretty sure any setup with a database back-end will fail. Setups that use the local filesystem instead of a database work fine because there is no primary key constraint on the tags table.

Below is a simplified version of the code I used for debugging purposes, but other examples fail as well, e.g. this one.

import ray
from ray import air, tune
from ray.air.callbacks.mlflow import MLflowLoggerCallback

TRACKING_URI = ...

def main():
    ray.init()
    tuner = tune.Tuner(
        "PPO",
        run_config=air.RunConfig(
            stop={"episode_reward_mean": 200},
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri=TRACKING_URI,
                    registry_uri=TRACKING_URI,
                    experiment_name="DeBug",
                    tags={
                        'Author': 'My Name',
                        'Type': 'Testing MLflow'
                    },
                    save_artifact=False
                )
            ]
        ),
        tune_config=tune.TuneConfig(
            metric="episode_reward_mean",
            mode="max",
        ),
        param_space={
            "env": "CartPole-v1",
            "framework": "torch",
            "num_gpus": 0,
            "num_workers": 1,
        },
    )
    tuner.fit()


if __name__ == '__main__':
    main()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@tbukic tbukic added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 27, 2022
@scottsun94
Contributor

cc: @xwjiang2010 @krfricke

@bveeramani
Member

Was able to reproduce with a sqlite3 database:

❯ python python/ray/tune/examples/mlflow_example.py --tracking-uri sqlite:///example.db

...

Traceback (most recent call last):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/execution/trial_runner.py", line 833, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/execution/trial_runner.py", line 923, in _on_pg_ready
    if not _start_trial(next_trial) and next_trial.status != Trial.ERROR:
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/execution/trial_runner.py", line 915, in _start_trial
    self._callbacks.on_trial_start(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/callback.py", line 317, in on_trial_start
    callback.on_trial_start(**info)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/logger/logger.py", line 135, in on_trial_start
    self.log_trial_start(trial)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/callbacks/mlflow.py", line 111, in log_trial_start
    run = self.mlflow_util.start_run(tags=tags, run_name=str(trial))
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/_internal/mlflow.py", line 200, in start_run
    run = client.create_run(experiment_id=self.experiment_id, tags=tags)
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/tracking/client.py", line 270, in create_run
    return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 108, in create_run
    return self.store.create_run(
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/store/tracking/sqlalchemy_store.py", line 539, in create_run
    with self.ManagedSessionMaker() as session:
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/store/db/utils.py", line 105, in make_managed_session
    raise MlflowException(message=e, error_code=BAD_REQUEST)
mlflow.exceptions.MlflowException: (sqlite3.IntegrityError) UNIQUE constraint failed: tags.key, tags.run_uuid
[SQL: INSERT INTO tags ("key", value, run_uuid) VALUES (?, ?, ?)]
[parameters: (('trial_name', 'easy_objective_6f569_00000', 'edbaafdd94d8437a87d59648523ea0d5'), ('mlflow.runName', 'easy_objective_6f569_00000', 'edbaafdd94d8437a87d59648523ea0d5'), ('mlflow.runName', 'honorable-hog-154', 'edbaafdd94d8437a87d59648523ea0d5'))]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

...

ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/Users/balaji/ray_results/mlflow")`.

Debugging now.

@bveeramani
Member

bveeramani commented Oct 28, 2022

Figured it out -- it's a bug with MLflow 1.30. See mlflow/mlflow#7138 and mlflow/mlflow#7133.

@tbukic As a temporary workaround, installing the MLflow nightly should fix your issue.

@amogkam should we add a fix to our code like below?

from packaging import version
from mlflow.utils.mlflow_tags import MLFLOW_RUN_NAME

tags = tags or {}
# MLflow >= 1.30 sets the mlflow.runName tag itself when run_name is passed,
# so don't also put it in tags (that is what caused the duplicate key insert).
if version.parse(mlflow.__version__) >= version.parse("1.30.0"):
    run = client.create_run(run_name=run_name, experiment_id=self.experiment_id, tags=tags)
else:
    tags[MLFLOW_RUN_NAME] = run_name
    run = client.create_run(experiment_id=self.experiment_id, tags=tags)
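As an aside, the `version.parse` comparison in the proposal (from the `packaging` library) matters because plain string comparison misorders multi-digit version components. A stdlib-only sketch, where `parse_version` is a hypothetical simplified stand-in for `packaging.version.parse`:

```python
def parse_version(v: str) -> tuple:
    """Simplified numeric version parse (stand-in for packaging.version.parse)."""
    return tuple(int(part) for part in v.split("."))

# Lexicographic string comparison gets multi-digit components wrong:
print("1.9.0" >= "1.30.0")                                # True (wrong: '9' > '3')
print(parse_version("1.9.0") >= parse_version("1.30.0"))  # False (correct: 9 < 30)
```

This is why a naive `mlflow.__version__ >= "1.30.0"` string check would misfire on releases like 1.9.x.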

@amogkam
Contributor

amogkam commented Oct 28, 2022

thanks for looking into this @bveeramani. yes that change sgtm

@tbukic
Contributor Author

tbukic commented Oct 28, 2022

Great job, @bveeramani , thank you!

@tbukic tbukic closed this as completed Oct 28, 2022
@xwjiang2010 xwjiang2010 added observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
See ray-project#29749.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>