[Bug] Commands sent during finally of a cache eviction may cross workflow contexts #494

Closed
cretz opened this issue Mar 26, 2024 · 1 comment · Fixed by #499
Labels
bug Something isn't working

Comments

cretz (Member) commented Mar 26, 2024

Describe the bug

It appears that during GC, finally clauses in Python can end up running on a different workflow's event loop. This is very bad. Here is a reproduction:

import asyncio
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta
from uuid import uuid4

from temporalio import activity, workflow
from temporalio.client import Client, WorkflowHandle
from temporalio.worker import Worker


@activity.defn
def waiting_activity() -> str:
    time.sleep(1)
    return "done"


@activity.defn
def unrelated_activity() -> str:
    return "I should only be in FinallyWorkflow"


@workflow.defn
class DummyWorkflow:
    @workflow.run
    async def run(self) -> None:
        await workflow.start_activity(
            waiting_activity, start_to_close_timeout=timedelta(seconds=10)
        )
        await workflow.start_activity(
            waiting_activity, start_to_close_timeout=timedelta(seconds=10)
        )


@workflow.defn
class FinallyWorkflow:
    @workflow.run
    async def run(self) -> None:
        try:
            await workflow.start_activity(
                waiting_activity, start_to_close_timeout=timedelta(seconds=10)
            )
            await workflow.start_activity(
                waiting_activity, start_to_close_timeout=timedelta(seconds=10)
            )
        finally:
            await workflow.start_activity(
                unrelated_activity, start_to_close_timeout=timedelta(seconds=10)
            )


async def main():
    logging.basicConfig(level=logging.INFO)

    task_queue = f"tq-{uuid4()}"
    logging.info(f"Starting on task queue {task_queue}")

    # Connect
    client = await Client.connect("localhost:7233")

    # Run with worker
    with ThreadPoolExecutor(100) as activity_executor:
        async with Worker(
            client=client,
            task_queue=task_queue,
            workflows=[FinallyWorkflow, DummyWorkflow],
            activities=[waiting_activity, unrelated_activity],
            activity_executor=activity_executor,
            max_concurrent_workflow_tasks=5,
            max_cached_workflows=10,
        ):

            # Create 1000 dummy and finally workflows
            dummy_handles: list[WorkflowHandle] = []
            logging.info("Starting dummy and finally workflows")
            for _ in range(1000):
                dummy_handles.append(
                    await client.start_workflow(
                        DummyWorkflow.run, id=f"dummy-{uuid4()}", task_queue=task_queue
                    )
                )
                await client.start_workflow(
                    FinallyWorkflow.run, id=f"finally-{uuid4()}", task_queue=task_queue
                )

            # Wait on every dummy handle, which basically waits forever if/when
            # one hits non-determinism
            logging.info("Checking dummy and finally workflows")
            for handle in dummy_handles:
                logging.info(f"Checking dummy result for {handle.id}")
                await handle.result()
            logging.info("No dummy workflows had finally activities")


if __name__ == "__main__":
    asyncio.run(main())

Running that against a localhost server will start to spit out non-determinism errors. And after a few seconds, if you look at the still-running DummyWorkflows, you'll see they run unrelated_activity, which is not even in their code. It is suspected that this is caused by GeneratorExit being raised into the coroutine (and therefore its finally clause running) upon cache eviction, and that this is being interleaved on the same thread as another workflow that has asyncio._set_running_loop applied to the thread.
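
To make the suspected mechanism concrete outside of the SDK, here is a minimal standalone sketch (no Temporal code involved; _Suspend, evicted_workflow, and other_workflow are purely illustrative names): when a suspended coroutine is garbage collected, CPython throws GeneratorExit into it, so its finally block runs at collection time, on whatever thread and under whatever running event loop happens to be current there.

import asyncio
import gc


class _Suspend:
    # Awaitable that suspends a coroutine without needing a running loop
    def __await__(self):
        yield


async def evicted_workflow() -> None:
    try:
        await _Suspend()
    finally:
        # At GC time this sees whichever loop is running on the collecting
        # thread -- in the bug above, a different workflow's event loop
        print("finally ran under loop:", asyncio.get_running_loop())


async def other_workflow() -> None:
    print("other workflow's loop:  ", asyncio.get_running_loop())
    # Force a collection while this unrelated loop is running; the abandoned
    # coroutine's finally block executes right here, inside this loop's context
    gc.collect()


# Start the "evicted" coroutine, then abandon it at its suspension point inside
# a reference cycle so only the cyclic collector (not refcounting) reclaims it
cycle = [evicted_workflow()]
cycle.append(cycle)
cycle[0].send(None)
del cycle

asyncio.run(other_workflow())

Both print statements show the same loop object: the orphaned coroutine's finally executed inside the other coroutine's running loop.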

We may need to implement a special async generator capture/finalizer; see https://peps.python.org/pep-0525/#finalization.
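
For context, that PEP 525 mechanism works roughly like the sketch below, modeled on what asyncio's own event loop does with sys.set_asyncgen_hooks (the AsyncGenTracker class and its names are hypothetical): async generators are captured on first iteration, and when one is garbage collected the finalizer hook is invoked so cleanup can be routed back to the owning loop instead of running wherever GC happened. Note that these hooks only apply to async generators, not to plain coroutines.

import asyncio
import sys
import weakref


class AsyncGenTracker:
    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self._agens = weakref.WeakSet()

    def install(self) -> None:
        # The hooks are per-thread interpreter state
        sys.set_asyncgen_hooks(firstiter=self._firstiter, finalizer=self._finalizer)

    def _firstiter(self, agen) -> None:
        # Called the first time an async generator is iterated on this thread
        self._agens.add(agen)

    def _finalizer(self, agen) -> None:
        # Called when the async generator is garbage collected, instead of the
        # interpreter throwing GeneratorExit wherever GC happens to run; route
        # the aclose() back onto the loop that owns the generator
        self._loop.call_soon_threadsafe(self._loop.create_task, agen.aclose())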

cretz added the bug label on Mar 26, 2024
cretz (Member, Author) commented Mar 29, 2024

An update: there is basically no way for us to intercept GC to make sure it only runs on a certain thread. There are ways if we wanted to "stop the world" (i.e. disable GC entirely), but nothing specific to a workflow. We can't use any form of weakref ref/finalize before/after on the higher-level event loop to confirm GC of transitive coroutines. We can't disable the GeneratorExit-on-GC behavior of coroutines in any way we can find in Python (the coroutine wrapper was long since deprecated/removed, the event loop is not told about every coroutine, etc.). And the async generator capture/finalizer does nothing for general coroutines.

So what is clear is that we need to force tasks to complete before we let GC run. This is similar to what we do in other SDKs on eviction, where we tell coroutines to complete and simply ignore any commands or side effects they produce. The first attempt at this is promising on this commit, but it is a scary change: if for some reason a user's code can't have its tasks torn down, the workflow will never get evicted, which means the memory is never reclaimed, a slot on that worker is used forever, and the worker can't run anything for that run ID again either. That is technically preferable to running code on the wrong workflow, of course, but we may need a better approach here (e.g. moving the workflow somewhere else).
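
For illustration, here is a rough generic-asyncio sketch of that teardown idea (not the actual change that landed; tear_down_before_eviction and its loop parameter are hypothetical stand-ins for the SDK's internals): cancel every outstanding task and drive the workflow's own loop until they have all unwound, so finally blocks run deterministically in the correct workflow context and nothing is left for the garbage collector to finalize later.

import asyncio


def tear_down_before_eviction(loop: asyncio.AbstractEventLoop) -> None:
    # Assumes the workflow's loop is not currently running, i.e. we are
    # between workflow activations
    tasks = asyncio.all_tasks(loop)
    if not tasks:
        return
    for task in tasks:
        task.cancel()
    # Drive the workflow's own loop until every task has unwound; anything the
    # tasks do during teardown (commands, exceptions) is deliberately ignored
    loop.run_until_complete(asyncio.gather(*tasks, return_exceptions=True))
    # Caveat from above: if user code swallows cancellation and never finishes,
    # this never returns and the cache slot stays occupied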

We are still researching here.

cretz mentioned this issue on Apr 2, 2024
cretz closed this as completed in #499 on Apr 5, 2024