
[AIR, callbacks] _MLflowLoggerUtil incompatible with DB MLflow backend #29749

Closed
tbukic opened this issue Oct 27, 2022 · 5 comments · Fixed by #29794
Labels
bug Something that is supposed to be working; but isn't observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling P1 Issue that should be fixed within a few weeks

Comments

@tbukic
Contributor

tbukic commented Oct 27, 2022

What happened + What you expected to happen

Line 199 in the MLflow integration internals seems to cause problems when using MLflow in scenario 5.
The call to the MLflow API fails with:

mlflow.exceptions.RestException: BAD_REQUEST: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "tag_pk"
DETAIL: Key (key, run_uuid)=(mlflow.runName, <RUN_NUMBER>) already exists.

Skipping assignment of the mlflow.runName tag seems to allow creation of the experiment. Without delving into the MLflow code, my best guess (based on the MLflow documentation) is that MLflow automatically creates an mlflow.runName entry in the tags table, and the duplicate insert then fails the unique key constraint.
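That guess can be illustrated with plain sqlite3. This is a minimal sketch: the table below is a hypothetical stand-in mirroring the (key, run_uuid) primary key named in the "tag_pk" error, not MLflow's actual schema:

```python
import sqlite3

# Hypothetical stand-in for MLflow's `tags` table, with a composite
# primary key on (key, run_uuid) like the "tag_pk" constraint above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (key TEXT, value TEXT, run_uuid TEXT, "
    "PRIMARY KEY (key, run_uuid))"
)
conn.execute("INSERT INTO tags VALUES ('mlflow.runName', 'run-a', 'uuid-1')")

err = None
try:
    # A second mlflow.runName tag for the same run violates the key,
    # which is exactly what a duplicate insert of that tag triggers.
    conn.execute("INSERT INTO tags VALUES ('mlflow.runName', 'run-b', 'uuid-1')")
except sqlite3.IntegrityError as exc:
    err = str(exc)

print(err)  # UNIQUE constraint failed: tags.key, tags.run_uuid
```

A filesystem-backed store has no such constraint, which matches the observation that non-database setups work fine.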

Commenting out line 199 fixes the issue locally. In production it could be worked around by changing the Docker image, but ideally this is a one-line fix to make and release.

Versions / Dependencies

The runtime environment for the example is a local Ray instance on Ubuntu 22.04 under WSL2.
The MLflow tracking server is deployed on a private cloud, running MLflow 1.30.0, backed by a PostgreSQL database and using S3 object storage for artifacts.

[tool.poetry.dependencies]
python = "~3.10"
mlflow = "^1.29.0"
ray = "^2.0.0"
gym = "~0.25"
dm-tree = "^0.1.7"
opencv-python = "^4.6.0.66"
lz4 = "^4.0.2"
torch = "^1.12.1"
tensorboard = "^2.10.1"
boto3 = "^1.25.1"

Which on my PC resolves to:

  • Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0]
  • Ray 2.0.1
  • mlflow 1.30.0

Reproduction script

Any code can be used once you have a proper MLflow setup. Unfortunately mine is deployed on the company's VPN and I can't share it. We're using setup #5, but I'm pretty sure any setup with a database back-end will fail. Setups that use the local filesystem instead of a database work fine because there is no primary key constraint on the tags table.

Below is a simplified version of the code I used for debugging purposes, but other examples fail as well, e.g. this one.

import ray
from ray import air, tune
from ray.air.callbacks.mlflow import MLflowLoggerCallback

TRACKING_URI = ...

def main():
    ray.init()
    tuner = tune.Tuner(
        "PPO",
        run_config=air.RunConfig(
            stop={"episode_reward_mean": 200},
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri=TRACKING_URI,
                    registry_uri=TRACKING_URI,
                    experiment_name="DeBug",
                    tags={
                        'Author': 'My Name',
                        'Type': 'Testing MLflow'
                    },
                    save_artifact=False
                )
            ]
        ),
        tune_config=tune.TuneConfig(
            metric="episode_reward_mean",
            mode="max",
        ),
        param_space={
            "env": "CartPole-v1",
            "framework": "torch",
            "num_gpus": 0,
            "num_workers": 1,
        },
    )
    tuner.fit()


if __name__ == '__main__':
    main()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@tbukic tbukic added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 27, 2022
@scottsun94
Contributor

cc: @xwjiang2010 @krfricke

@bveeramani
Member

Was able to reproduce with a sqlite3 database:

❯ python python/ray/tune/examples/mlflow_example.py --tracking-uri sqlite:///example.db

...

Traceback (most recent call last):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/execution/trial_runner.py", line 833, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/execution/trial_runner.py", line 923, in _on_pg_ready
    if not _start_trial(next_trial) and next_trial.status != Trial.ERROR:
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/execution/trial_runner.py", line 915, in _start_trial
    self._callbacks.on_trial_start(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/callback.py", line 317, in on_trial_start
    callback.on_trial_start(**info)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/tune/logger/logger.py", line 135, in on_trial_start
    self.log_trial_start(trial)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/callbacks/mlflow.py", line 111, in log_trial_start
    run = self.mlflow_util.start_run(tags=tags, run_name=str(trial))
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/_internal/mlflow.py", line 200, in start_run
    run = client.create_run(experiment_id=self.experiment_id, tags=tags)
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/tracking/client.py", line 270, in create_run
    return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 108, in create_run
    return self.store.create_run(
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/store/tracking/sqlalchemy_store.py", line 539, in create_run
    with self.ManagedSessionMaker() as session:
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/mlflow/store/db/utils.py", line 105, in make_managed_session
    raise MlflowException(message=e, error_code=BAD_REQUEST)
mlflow.exceptions.MlflowException: (sqlite3.IntegrityError) UNIQUE constraint failed: tags.key, tags.run_uuid
[SQL: INSERT INTO tags ("key", value, run_uuid) VALUES (?, ?, ?)]
[parameters: (('trial_name', 'easy_objective_6f569_00000', 'edbaafdd94d8437a87d59648523ea0d5'), ('mlflow.runName', 'easy_objective_6f569_00000', 'edbaafdd94d8437a87d59648523ea0d5'), ('mlflow.runName', 'honorable-hog-154', 'edbaafdd94d8437a87d59648523ea0d5'))]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

...

ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/Users/balaji/ray_results/mlflow")`.

Debugging now.

@bveeramani
Member

bveeramani commented Oct 28, 2022

Figured it out -- it's a bug with MLflow 1.30. See mlflow/mlflow#7138 and mlflow/mlflow#7133.

@tbukic As a temporary workaround, installing the MLflow nightly should fix your issue.

@amogkam should we add a fix to our code like below?

from packaging import version
from mlflow.utils.mlflow_tags import MLFLOW_RUN_NAME

tags = tags or {}
# MLflow >= 1.30 sets the mlflow.runName tag itself when run_name is passed,
# so don't also put it in tags (that is what caused the duplicate key insert).
if version.parse(mlflow.__version__) >= version.parse("1.30.0"):
    run = client.create_run(run_name=run_name, experiment_id=self.experiment_id, tags=tags)
else:
    tags[MLFLOW_RUN_NAME] = run_name
    run = client.create_run(experiment_id=self.experiment_id, tags=tags)
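As an aside, the `version.parse` comparison in the proposal (from the `packaging` library) matters because plain string comparison misorders multi-digit version components. A stdlib-only sketch, where `parse_version` is a hypothetical simplified stand-in for `packaging.version.parse`:

```python
def parse_version(v: str) -> tuple:
    """Simplified numeric version parse (stand-in for packaging.version.parse)."""
    return tuple(int(part) for part in v.split("."))

# Lexicographic string comparison gets multi-digit components wrong:
print("1.9.0" >= "1.30.0")                                # True (wrong: '9' > '3')
print(parse_version("1.9.0") >= parse_version("1.30.0"))  # False (correct: 9 < 30)
```

This is why a naive `mlflow.__version__ >= "1.30.0"` string check would misfire on releases like 1.9.x.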

@amogkam
Contributor

amogkam commented Oct 28, 2022

thanks for looking into this @bveeramani. yes that change sgtm

@tbukic
Contributor Author

tbukic commented Oct 28, 2022

Great job, @bveeramani , thank you!

@tbukic tbukic closed this as completed Oct 28, 2022
@xwjiang2010 xwjiang2010 added observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
See ray-project#29749.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>