Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.1.0] Fix deferred trx processing #1150

Merged
merged 6 commits into from
Feb 10, 2025
Merged

Conversation

heifner
Copy link
Member

@heifner heifner commented Feb 6, 2025

Currently with #1027 a long running trx is interrupted when the oc compile completes. This causes an issue with deferred trx processing in old blocks where deferred trxs exist. Disable interrupt of transaction when oc compile completes for implicit transactions. Implicit transactions include deferred onerror and onblock. Not interrupting onblock seems like an ok thing here as onblock is part of the main system contract.

Also stop() timer on oc interrupt so start() is always called on a stopped timer.
Also call checktime() after setup of cleanup in eos-vm-oc so that set_expiration_callback(nullptr, nullptr); is called on checktime exception.

chicken-dance mismatch still on this branch which is being tracked by #1152

Resolves #1145

@heifner heifner added the OCI Work exclusive to OCI team label Feb 6, 2025
@heifner heifner changed the base branch from main to release/1.1 February 6, 2025 21:10
@heifner heifner changed the title Fix deferred trx processing [1.1.0] Fix deferred trx processing Feb 6, 2025
@heifner heifner marked this pull request as draft February 6, 2025 21:42
@heifner heifner marked this pull request as ready for review February 7, 2025 14:44
@spoonincode
Copy link
Member

Can you describe a call path that yielded the failure a little more? Because,

Also call checktime() after setup of cleanup in eos-vm-oc so that set_expiration_callback(nullptr, nullptr); is called on checktime exception.

Couldn't this on its own have been a problem in prior releases?

@heifner
Copy link
Member Author

heifner commented Feb 7, 2025

Can you describe a call path that yielded the failure a little more? Because,

Also call checktime() after setup of cleanup in eos-vm-oc so that set_expiration_callback(nullptr, nullptr); is called on checktime exception.

Couldn't this on its own have been a problem in prior releases?

This has been this way since 2.0.0. I noticed it during visual inspection of all calls to set_expiration_callback. Certainly seems like this could have been an issue at any point. Maybe there was something that prevent(ed) checktime from ever being triggered there.

@spoonincode
Copy link
Member

Sorry I should have written those as 2 separate questions more clearly. Can you describe the (or a) flow that was causing the problem?

@heifner
Copy link
Member Author

heifner commented Feb 7, 2025

Sorry I should have written those as 2 separate questions more clearly. Can you describe the (or a) flow that was causing the problem?

onerror case

While applying a block a deferred transaction is processed. During the processing of the deferred transaction the action was interrupted via oc interrupt. This throws the interrupt_oc_exception transaction out of apply (line 191):

} catch (const interrupt_exception& e) {
if (allow_oc_interrupt && eos_vm_oc_compile_interrupt && main_thread_timer.timer_state() == platform_timer::state_t::interrupted) {
++eos_vm_oc_compile_interrupt_count;
dlog("EOS VM OC compile complete interrupt of ${r} <= ${a}::${act} code ${h}, interrupt #${c}",
("r", context.get_receiver())("a", context.get_action().account)
("act", context.get_action().name)("h", code_hash)("c", eos_vm_oc_compile_interrupt_count));
EOS_THROW(interrupt_oc_exception, "EOS VM OC compile complete interrupt of ${r} <= ${a}::${act} code ${h}, interrupt #${c}",
("r", context.get_receiver())("a", context.get_action().account)
("act", context.get_action().name)("h", code_hash)("c", eos_vm_oc_compile_interrupt_count));
}
throw;

This interrupt_oc_exception is expected to be caught in transaction_context:

} catch ( const fc::exception& e ) {
if (i == 0 && e.code() == interrupt_oc_exception::code_value) {
reset();
continue;
}

However, onerror calls trx_context.execute_action directly from the controller:

trx_context.execute_action( trx_context.schedule_action( trx.get_transaction().actions.back(), gtrx.sender, false, 0, 0 ), 0 );

Which means the interrupt_oc_exception would be caught here:

} catch( const fc::exception& e ) {

This interrupt_oc_exception would cause the onerror to fail. But the failure is captured into a trace of the onerror (not thrown on out). This results in a hard_fail instead of a soft_fail. Also the trx_context is undone instead of squashed.

deferred trx case

I'm not really sure why interrupting deferred trxs exec() would be a problem. I was thinking it was because it created its own undo_session. But it shouldn't when validating a block. I now don't see why interrupting a deferred trx would cause any issue. I does, because one of the runs had the not interrupt for implicit but do interrupt for scheduled and that chicken-dance failed with an invalid soft_fail.

@@ -59,6 +59,7 @@ namespace eosio::chain {
undo();
*trace = transaction_trace{}; // reset trace
initialize();
transaction_timer.stop();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not needed. The undo() call above calls transaction_timer.stop(). I can remove it, but having it here makes it more obvious that it is being called.

@ericpassmore ericpassmore added the bug The product is not working as was intended. label Feb 9, 2025
@ericpassmore
Copy link
Contributor

Note:start
category: Other
component: Internal
summary: Fix deferred trx process to support replay-ability.
Note:end

@heifner heifner merged commit a291a42 into release/1.1 Feb 10, 2025
36 checks passed
@heifner heifner deleted the GH-1145-replay-errors branch February 10, 2025 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The product is not working as was intended. OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spring 1.1 Misc Errors During Replay
4 participants