Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

execution errors with attached engine.Batch (for EndTransaction) #1989

Closed
tbg opened this issue Aug 6, 2015 · 8 comments
Closed

execution errors with attached engine.Batch (for EndTransaction) #1989

tbg opened this issue Aug 6, 2015 · 8 comments

Comments

@tbg
Copy link
Member

tbg commented Aug 6, 2015

all replica commands (EndTransaction, Put, ...) operate on an engine.Batch which is discarded by the caller if an error is returned.
That worked well so far, but some commands actually fail and need something written to the engine: In #1916, EndTransaction resolves local intents in the same batch, but on an error, there's little it can do locally: it needs to return them as skipped intents so that they can be resolved indirectly.

Even worse is when the error is related to the Txn itself: in case of an epoch or timestamp regression, you'd reasonably expect to have the transaction aborted - but you can't, because it's returning an error. This means that any open intents anywhere in the system cannot resolve until they abort the transaction or it times out.

A carefully set up system to return something like a

type ErrorWithBatch{
  error
  ms *engine.MVCCStats
  batch engine.Batch
}

could be the best way to deal with the above. In case of errors, we create a new batch anyways to update the response cache: it's possible that the change will be relatively easy.

Once that (or a related mechanism - after all, the above is only my initial suggestion) is in place, we can clean up in EndTransaction on error.

@bdarnell
Copy link
Contributor

bdarnell commented Aug 6, 2015

There's no reason that all the methods in replica_command.go have the same signature; they already differ because only some of them return intents. Rather than trying to combine an error and other data in the same object, we could make EndTransaction return an engine.Batch in addition to its other return values. Replica.executeCmd could return this batch to Replica.applyRaftCommandInBatch, since the latter method is already prepared to commit a batch whether there was an error or not.

@tbg
Copy link
Member Author

tbg commented Aug 6, 2015

but Replica.applyRaftCommandInBatch discards the batch on error, which means r.executeCmd needs to get two more return parameters (stats and batch). My suggestion doesn't look so bad, I think. I'm impartial though. Whatever gets the job done and is most legible.

    // Execute the command.
    reply, intents, rErr := r.executeCmd(batch, ms, args)
    if proto.IsWrite(args) {
    if rErr == nil {
        // ...
    } else {
        batch.Close()
        if bErr, ok := rErr.(*errWithBatch); ok {
            batch = // ...
            // call MergeMVCCStats, set ms = ...
        } else {
            batch = r.rm.Engine().NewBatch()
        }
    }
    // ...
}

@spencerkimball
Copy link
Member

I'm unclear on the specific use cases where this is necessary. Could you elaborate a little?

@tbg
Copy link
Member Author

tbg commented Aug 6, 2015

EndTransaction which fails with, for example, timestamp regression still needs to abort the transaction.

@spencerkimball
Copy link
Member

That's not a use case which justifies this kind of a change. That's a sanity check--not something that is expected to ever occur (certainly not in the normal course of events).

@tbg
Copy link
Member Author

tbg commented Aug 6, 2015

Ok, but then we should panic there.

Another case is EndTransaction which tries to commit but has been aborted by someone else (who obviously didn't know anything about the intents). Will still be good to resolve those intents which are local right away.

@spencerkimball
Copy link
Member

I'm not sure it's panic-worthy. It's not going to affect the correctness of the system or lead to corruption.

For the second case, doesn't seem worthwhile to introduce the complexity of returning a batch just so we end up cleaning up local intents synchronously.

@tbg
Copy link
Member Author

tbg commented Aug 11, 2015

Going to close for now and address some of the above issues in upcoming work towards completing #1821.

@tbg tbg closed this as completed Aug 11, 2015
kkaneda added a commit that referenced this issue Apr 6, 2016
(I was curious if we can fix #3037 (and #1989). The change introduces
some complexity and I'm not totally sure we should fix these issues.)

This commit changes EndTransaction to return nil error on txn abort so
that the we can commit a batch and update the transaction record.

Change txnSender.Send so that it convert a nil error to TransactionAbortedError
so that txn.Exec will initiate a restart. This change is hacky, but I couldn't
come up with a better approach.

We still return TransactionAbortedError in other places and keep the
code path for handling TransactionAbortedError.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants