You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It appears during GC, finally clauses in Python will end up on a different workflow's even loop. This is very bad. Here is a replication:
importasyncioimportloggingimporttimefromconcurrent.futuresimportThreadPoolExecutorfromdatetimeimporttimedeltafromuuidimportuuid4fromtemporalioimportactivity, workflowfromtemporalio.clientimportClient, WorkflowHandlefromtemporalio.workerimportWorker@activity.defndefwaiting_activity() ->str:
time.sleep(1)
return"done"@activity.defndefunrelated_activity() ->str:
return"I should only be in FinallyWorkflow"@workflow.defnclassDummyWorkflow:
@workflow.runasyncdefrun(self) ->None:
awaitworkflow.start_activity(
waiting_activity, start_to_close_timeout=timedelta(seconds=10)
)
awaitworkflow.start_activity(
waiting_activity, start_to_close_timeout=timedelta(seconds=10)
)
@workflow.defnclassFinallyWorkflow:
@workflow.runasyncdefrun(self) ->None:
try:
awaitworkflow.start_activity(
waiting_activity, start_to_close_timeout=timedelta(seconds=10)
)
awaitworkflow.start_activity(
waiting_activity, start_to_close_timeout=timedelta(seconds=10)
)
finally:
awaitworkflow.start_activity(
unrelated_activity, start_to_close_timeout=timedelta(seconds=10)
)
asyncdefmain():
logging.basicConfig(level=logging.INFO)
task_queue=f"tq-{uuid4()}"logging.info(f"Starting on task queue {task_queue}")
# Connectclient=awaitClient.connect("localhost:7233")
# Run with workerwithThreadPoolExecutor(100) asactivity_executor:
asyncwithWorker(
client=client,
task_queue=task_queue,
workflows=[FinallyWorkflow, DummyWorkflow],
activities=[waiting_activity, unrelated_activity],
activity_executor=activity_executor,
max_concurrent_workflow_tasks=5,
max_cached_workflows=10,
):
# Create 1000 dummy and finally workflowsdummy_handles: list[WorkflowHandle] = []
logging.info("Starting dummy and finally workflows")
foriinrange(1000):
dummy_handles.append(
awaitclient.start_workflow(
DummyWorkflow.run, id=f"dummy-{uuid4()}", task_queue=task_queue
)
)
awaitclient.start_workflow(
FinallyWorkflow.run, id=f"finally-{uuid4()}", task_queue=task_queue
)
# Wait on every dummy handle, which basically waits forever if/when# one hits non-determinismlogging.info("Checking dummy and finally workflows")
for_, handleinenumerate(dummy_handles):
logging.info(f"Checking dummy result for {handle.id}")
awaithandle.result()
logging.info("No dummy workflows had finally activities")
if__name__=="__main__":
asyncio.run(main())
Running that against a localhost server will start to spit out non-determinism errors. And after a few seconds, if you look at the still-running DummyWorkflows, you'll see they run unrelated_activity which is not even in their code. It is suspected this is caused by GeneratorExit happening with finally upon cache eviction, and that is being interleaved in the same thread as another workflow that has asyncio._set_running_loop on the thread.
An update - so basically there's no way for us to intercept GC to make sure it only runs on a certain thread. There are ways if we wanted to "stop the world" (i.e. disable GC and such), but nothing specific to a workflow. Can't use any form of weakref ref/finalize before/after on the higher level event loop to confirm GC of transitive coroutines. Can't disable the GeneratorExit behavior on GC of Coroutine in any way in Python that we can find (coroutine wrapper long since deprecated/removed, event loop not told about every coroutine, etc). Async generator capture/finalizer does nothing for general coroutines.
So what is clear is that we need to force tasks to complete before we try to GC. This is similar to what we do in other SDKs on eviction where we tell coroutines to complete and just ignore any commands or side effects they may have. The first attempt at this is promising on this commit, but it is a scary change, because if for some reason a user's code can't have its tasks torn down, it will never get evicted which means the memory never gets reclaimed and a slot is used on that worker forever and it can't run anything for that run ID again either. This is technically preferable to running code on the wrong workflow of course, but we may need a better approach here (e.g. moving the workflow to a different place).
Describe the bug
It appears during GC, finally clauses in Python will end up on a different workflow's even loop. This is very bad. Here is a replication:
Running that against a localhost server will start to spit out non-determinism errors. And after a few seconds, if you look at the still-running
DummyWorkflow
s, you'll see they rununrelated_activity
which is not even in their code. It is suspected this is caused byGeneratorExit
happening with finally upon cache eviction, and that is being interleaved in the same thread as another workflow that hasasyncio._set_running_loop
on the thread.We may need to implement a special async gen capture/finalizer, see https://peps.python.org/pep-0525/#finalization.
The text was updated successfully, but these errors were encountered: