[Bug] Commands sent during finally of a cache eviction may cross workflow contexts #494

Closed
cretz opened this issue Mar 26, 2024 · 1 comment · Fixed by #499
Labels
bug Something isn't working

Comments

cretz (Member) commented Mar 26, 2024

Describe the bug

It appears that during GC, finally clauses in Python can end up running on a different workflow's event loop. This is very bad. Here is a reproduction:

import asyncio
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta
from uuid import uuid4

from temporalio import activity, workflow
from temporalio.client import Client, WorkflowHandle
from temporalio.worker import Worker


@activity.defn
def waiting_activity() -> str:
    time.sleep(1)
    return "done"


@activity.defn
def unrelated_activity() -> str:
    return "I should only be in FinallyWorkflow"


@workflow.defn
class DummyWorkflow:
    @workflow.run
    async def run(self) -> None:
        await workflow.start_activity(
            waiting_activity, start_to_close_timeout=timedelta(seconds=10)
        )
        await workflow.start_activity(
            waiting_activity, start_to_close_timeout=timedelta(seconds=10)
        )


@workflow.defn
class FinallyWorkflow:
    @workflow.run
    async def run(self) -> None:
        try:
            await workflow.start_activity(
                waiting_activity, start_to_close_timeout=timedelta(seconds=10)
            )
            await workflow.start_activity(
                waiting_activity, start_to_close_timeout=timedelta(seconds=10)
            )
        finally:
            await workflow.start_activity(
                unrelated_activity, start_to_close_timeout=timedelta(seconds=10)
            )


async def main():
    logging.basicConfig(level=logging.INFO)

    task_queue = f"tq-{uuid4()}"
    logging.info(f"Starting on task queue {task_queue}")

    # Connect
    client = await Client.connect("localhost:7233")

    # Run with worker
    with ThreadPoolExecutor(100) as activity_executor:
        async with Worker(
            client=client,
            task_queue=task_queue,
            workflows=[FinallyWorkflow, DummyWorkflow],
            activities=[waiting_activity, unrelated_activity],
            activity_executor=activity_executor,
            max_concurrent_workflow_tasks=5,
            max_cached_workflows=10,
        ):

            # Create 1000 dummy and finally workflows
            dummy_handles: list[WorkflowHandle] = []
            logging.info("Starting dummy and finally workflows")
            for _ in range(1000):
                dummy_handles.append(
                    await client.start_workflow(
                        DummyWorkflow.run, id=f"dummy-{uuid4()}", task_queue=task_queue
                    )
                )
                await client.start_workflow(
                    FinallyWorkflow.run, id=f"finally-{uuid4()}", task_queue=task_queue
                )

            # Wait on every dummy handle, which basically waits forever if/when
            # one hits non-determinism
            logging.info("Checking dummy and finally workflows")
            for handle in dummy_handles:
                logging.info(f"Checking dummy result for {handle.id}")
                await handle.result()
            logging.info("No dummy workflows had finally activities")


if __name__ == "__main__":
    asyncio.run(main())

Running that against a localhost server will start to spit out non-determinism errors. And after a few seconds, if you look at the still-running DummyWorkflows, you'll see they run unrelated_activity, which is not even in their code. It is suspected that this is caused by GeneratorExit being raised into the coroutine (and therefore its finally clause running) upon cache eviction, and that this is being interleaved on the same thread as another workflow that has asyncio._set_running_loop applied to the thread.
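
To make the suspected mechanism concrete outside of the SDK, here is a minimal standalone sketch (no Temporal code involved; _Suspend, evicted_workflow, and other_workflow are purely illustrative names): when a suspended coroutine is garbage collected, CPython throws GeneratorExit into it, so its finally block runs at collection time, on whatever thread and under whatever running event loop happens to be current there.

import asyncio
import gc


class _Suspend:
    # Awaitable that suspends a coroutine without needing a running loop
    def __await__(self):
        yield


async def evicted_workflow() -> None:
    try:
        await _Suspend()
    finally:
        # At GC time this sees whichever loop is running on the collecting
        # thread -- in the bug above, a different workflow's event loop
        print("finally ran under loop:", asyncio.get_running_loop())


async def other_workflow() -> None:
    print("other workflow's loop:  ", asyncio.get_running_loop())
    # Force a collection while this unrelated loop is running; the abandoned
    # coroutine's finally block executes right here, inside this loop's context
    gc.collect()


# Start the "evicted" coroutine, then abandon it at its suspension point inside
# a reference cycle so only the cyclic collector (not refcounting) reclaims it
cycle = [evicted_workflow()]
cycle.append(cycle)
cycle[0].send(None)
del cycle

asyncio.run(other_workflow())

Both print statements show the same loop object: the orphaned coroutine's finally executed inside the other coroutine's running loop.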

We may need to implement a special async generator capture/finalizer; see https://peps.python.org/pep-0525/#finalization.
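
For context, that PEP 525 mechanism works roughly like the sketch below, modeled on what asyncio's own event loop does with sys.set_asyncgen_hooks (the AsyncGenTracker class and its names are hypothetical): async generators are captured on first iteration, and when one is garbage collected the finalizer hook is invoked so cleanup can be routed back to the owning loop instead of running wherever GC happened. Note that these hooks only apply to async generators, not to plain coroutines.

import asyncio
import sys
import weakref


class AsyncGenTracker:
    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self._agens = weakref.WeakSet()

    def install(self) -> None:
        # The hooks are per-thread interpreter state
        sys.set_asyncgen_hooks(firstiter=self._firstiter, finalizer=self._finalizer)

    def _firstiter(self, agen) -> None:
        # Called the first time an async generator is iterated on this thread
        self._agens.add(agen)

    def _finalizer(self, agen) -> None:
        # Called when the async generator is garbage collected, instead of the
        # interpreter throwing GeneratorExit wherever GC happens to run; route
        # the aclose() back onto the loop that owns the generator
        self._loop.call_soon_threadsafe(self._loop.create_task, agen.aclose())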

cretz added the bug label on Mar 26, 2024
cretz (Member, Author) commented Mar 29, 2024

An update: there is basically no way for us to intercept GC to make sure it only runs on a certain thread. There are ways if we wanted to "stop the world" (i.e. disable GC entirely), but nothing specific to a workflow. We can't use any form of weakref ref/finalize before/after on the higher-level event loop to confirm GC of transitive coroutines. We can't disable the GeneratorExit-on-GC behavior of coroutines in any way we can find in Python (the coroutine wrapper was long since deprecated/removed, the event loop is not told about every coroutine, etc.). And the async generator capture/finalizer does nothing for general coroutines.

So what is clear is that we need to force tasks to complete before we let GC run. This is similar to what we do in other SDKs on eviction, where we tell coroutines to complete and simply ignore any commands or side effects they produce. The first attempt at this is promising on this commit, but it is a scary change: if for some reason a user's code can't have its tasks torn down, the workflow will never get evicted, which means the memory is never reclaimed, a slot on that worker is used forever, and the worker can't run anything for that run ID again either. That is technically preferable to running code on the wrong workflow, of course, but we may need a better approach here (e.g. moving the workflow somewhere else).
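
For illustration, here is a rough generic-asyncio sketch of that teardown idea (not the actual change that landed; tear_down_before_eviction and its loop parameter are hypothetical stand-ins for the SDK's internals): cancel every outstanding task and drive the workflow's own loop until they have all unwound, so finally blocks run deterministically in the correct workflow context and nothing is left for the garbage collector to finalize later.

import asyncio


def tear_down_before_eviction(loop: asyncio.AbstractEventLoop) -> None:
    # Assumes the workflow's loop is not currently running, i.e. we are
    # between workflow activations
    tasks = asyncio.all_tasks(loop)
    if not tasks:
        return
    for task in tasks:
        task.cancel()
    # Drive the workflow's own loop until every task has unwound; anything the
    # tasks do during teardown (commands, exceptions) is deliberately ignored
    loop.run_until_complete(asyncio.gather(*tasks, return_exceptions=True))
    # Caveat from above: if user code swallows cancellation and never finishes,
    # this never returns and the cache slot stays occupied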

We are still researching here.

cretz mentioned this issue on Apr 2, 2024
cretz closed this as completed in #499 on Apr 5, 2024