txview: run status and age checks on incoming transactions #4506

Open
wants to merge 9 commits into master from view_receiving_age_checks

Conversation

apfitzge

Problem

  • The transaction-view parsing option in banking stage does not run status/age checks before inserting transactions into the buffer
  • This means that aged-out, high-priority transactions may push out lower-priority but still-valid transactions

Summary of Changes

  • Insert all valid transactions into the map on receive (map only, not the priority queue)
  • Check ages in batches (batch size equals the extra capacity reserved in the map)
  • Insert only valid transactions' ids into the priority queue, dropping the lowest-priority entry if at capacity (a rough sketch of the flow follows below)
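
The intended flow can be modeled roughly as in the sketch below. This is a minimal, self-contained illustration with hypothetical names (`Buffer`, `Tx`, `insert_map_only`, `check_and_queue`); the real container uses a slab and a min-max heap rather than a `HashMap` and `BinaryHeap`, so treat this only as a model of the map-only insert, batched checks, and capacity-bounded queue insertion.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

const EXTRA_CAPACITY: usize = 64;

struct Tx {
    priority: u64,
    passes_checks: bool, // stand-in for status/age checks passing
}

struct Buffer {
    capacity: usize,                          // desired priority-queue capacity
    map: HashMap<usize, Tx>,                  // "slab": id -> transaction state
    queue: BinaryHeap<Reverse<(u64, usize)>>, // min-heap of (priority, id)
}

impl Buffer {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            map: HashMap::with_capacity(capacity + EXTRA_CAPACITY),
            queue: BinaryHeap::with_capacity(capacity + EXTRA_CAPACITY),
        }
    }

    /// Step 1: on receive, insert into the map only; queue insertion is deferred.
    fn insert_map_only(&mut self, id: usize, tx: Tx) {
        self.map.insert(id, tx);
    }

    /// Steps 2-3: run checks on a pending batch of ids, push survivors into the
    /// queue, then shed the lowest-priority entries until back at capacity.
    fn check_and_queue(&mut self, pending: &[usize]) {
        for &id in pending {
            // Copy out what we need so the map borrow ends before any mutation.
            let checked = self.map.get(&id).map(|tx| (tx.passes_checks, tx.priority));
            match checked {
                Some((true, priority)) => self.queue.push(Reverse((priority, id))),
                _ => {
                    // Failed checks (or missing): drop from the map entirely.
                    self.map.remove(&id);
                }
            }
        }
        // Evict lowest-priority entries until the queue is back at capacity.
        while self.queue.len() > self.capacity {
            if let Some(Reverse((_, dropped_id))) = self.queue.pop() {
                self.map.remove(&dropped_id);
            }
        }
    }
}
```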

Fixes #

@apfitzge apfitzge self-assigned this Jan 16, 2025
let mut run_status_age_checks =
    |container: &mut TransactionViewStateContainer,
     transaction_ids: &mut ArrayVec<usize, 64>| {
        // Temporary scope so that transaction references are immediately
apfitzge (Author)

The complexity here can go away once we have Bytes-backed transactions coming from upstream, since we will no longer need the awkward "insert to map only" pattern.

vacant_entry.insert(state);

// Push the transaction into the queue.
self.inner
apfitzge (Author)

No longer push into the queue here. We just let up to 64 additional transactions (see the EXTRA_CAPACITY const) live in the map. Once we reach the end of the incoming tx stream or hit 64 packets, we run age checks, and only THEN do we insert into the priority queue.
This means that txs that are already processed or too old will not push out transactions that can be processed.
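
As a rough illustration of that batching (not the PR's code): `Container`, `Packet`, `insert_map_only`, and `run_status_age_checks` are hypothetical stand-ins, and a plain `Vec` is used here where the real code uses an `ArrayVec<usize, 64>`.

```rust
const EXTRA_CAPACITY: usize = 64;

struct Packet;
struct Container;

impl Container {
    // Map-only insertion; returns the assigned id if the packet parsed.
    fn insert_map_only(&mut self, _packet: Packet) -> Option<usize> {
        Some(0)
    }
}

// Run status/age checks on the batched ids, push survivors into the
// priority queue, and clear the batch.
fn run_status_age_checks(_container: &mut Container, ids: &mut Vec<usize>) {
    ids.clear();
}

fn buffer_incoming(container: &mut Container, packets: impl Iterator<Item = Packet>) {
    let mut transaction_ids = Vec::with_capacity(EXTRA_CAPACITY);
    for packet in packets {
        if let Some(id) = container.insert_map_only(packet) {
            transaction_ids.push(id);
        }
        // Hitting EXTRA_CAPACITY buffered ids triggers checks mid-stream.
        if transaction_ids.len() == EXTRA_CAPACITY {
            run_status_age_checks(container, &mut transaction_ids);
        }
    }
    // End of stream: any remaining packets undergo checks as well.
    if !transaction_ids.is_empty() {
        run_status_age_checks(container, &mut transaction_ids);
    }
}
```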

jstarry

I think fn remaining_capacity needs to be reworked because with this change we will often have 0 remaining capacity and always be popping from the priority queue even if the priority queue length was actually less than its capacity.

If the priority capacity was 100 and we only had 36 elements in the transaction slab, then a new batch of 64 transactions would get added to the slab and remaining_capacity would return 0 when inserting into the priority queue. This means we would always pop an element and end up with only 36 elements in the queue when we should have 100.

If I'm reading this change correctly, we used to get the remaining capacity before inserting but no longer do that, causing this new issue to emerge.

apfitzge (Author)

Yeah 100% right; I need to add some tests for this and fix the issue with remaining capacity.

apfitzge (Author)

I don't think we'd end up with 36 packets though? We'd end up with 99 (obviously still not ideal).

  • Initial capacity = 100, extra capacity = 64 => slab capacity = 164
  • 36 packets in queue/slab => remaining_capacity of 64 at the start
  • Receive 64 packets, all entered into the slab. Slab length is now 100.
  • We run checks.
  • When we go to insert into the queue in a loop: on the first tx insert, remaining_capacity = 0, so we drop a packet. Slab length is now 99.
  • For the rest of the loop the slab length stays at 99, so the remaining packets are inserted.

But we've also possibly dropped the wrong packet here: everything already in the queue and the first tx in the received batch could be high priority, with the remaining 63 being low priority.

apfitzge (Author)

I think we could solve this with the following:

  • store the desired capacity separately (or derive it by subtracting EXTRA_CAPACITY)
  • insert all into the slab
  • insert all into the queue
  • pop the min from the queue until we're back at the desired size

That way we are always popping the lowest known item(s) instead of strictly keeping at capacity, because even after fixing the off-by-one issue we'd still potentially be dropping non-optimally for the received batch.
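
A minimal sketch of that approach, under assumed names (`PriorityQueue`, `push_batch_and_trim`) and with a sorted `Vec` standing in for the real min-max heap:

```rust
struct PriorityQueue {
    desired_capacity: usize,
    // (priority, id) pairs.
    entries: Vec<(u64, usize)>,
}

impl PriorityQueue {
    /// Push a whole batch, then pop the lowest-priority entries until the
    /// queue is back at its desired capacity. Returns the dropped ids so the
    /// caller can also remove them from the slab.
    fn push_batch_and_trim(
        &mut self,
        batch: impl IntoIterator<Item = (u64, usize)>,
    ) -> Vec<usize> {
        self.entries.extend(batch);
        // Lowest priority sits at the front after sorting.
        self.entries.sort_unstable_by_key(|&(priority, _)| priority);
        let excess = self.entries.len().saturating_sub(self.desired_capacity);
        self.entries.drain(..excess).map(|(_, id)| id).collect()
    }
}
```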

jstarry

I don't think we'd end up with 36 packets though? We'd end up with 99 (obviously still not ideal).

Yeah this looks correct for the example I gave, my mistake. I should have given this example:

If the priority capacity was 100 and we had 68 elements in the transaction slab, after inserting a new batch of 64 pending transactions into the slab, the remaining_capacity would return 0 when inserting into the priority queue. This is an issue because we actually have capacity to add 32 new transactions to the priority queue. But for the first 32 pending transactions that we try to insert into the priority queue, we will pop the min value of (next-pending-tx, set of priority queue txs). But that's not great because it could be that the first 32 transactions in the batch of 64 pending transactions are all high priority and we would end up dropping 31 of them.

That said, I think your proposed solution makes sense. We could have a new method on TransactionStateContainer for pushing a batch of transactions into the priority queue and then draining from the container until it is back within its max capacity.

@apfitzge apfitzge marked this pull request as ready for review January 16, 2025 23:11
@apfitzge apfitzge requested a review from jstarry January 16, 2025 23:12
@apfitzge

Follow-up to #3820

@jstarry left a comment

Can you add some tests as well?

}
}

// Any remaining packets undergo status/age checks
run_status_age_checks(container, &mut transaction_ids);
jstarry

Let's check if transaction_ids is empty first.

}));
working_bank.check_transactions::<RuntimeTransaction<_>>(
    &transactions,
    &lock_results,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently fine to call check_transactions with lock_results that are longer than the transactions slice but I think it would be better to pass a slice that is exactly the same length as transactions. What do you think?
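
For illustration only (a hypothetical helper, not the PR's code), the trim is just a sub-slice:

```rust
// Trim a pre-allocated results buffer to exactly the batch's length before
// passing it along; panics if `results` is shorter than the batch.
fn results_for_batch<'a, T, R>(transactions: &[T], results: &'a [R]) -> &'a [R] {
    &results[..transactions.len()]
}
```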

@apfitzge apfitzge force-pushed the view_receiving_age_checks branch 2 times, most recently from 1f4a99d to 0d74aa4 on February 6, 2025 23:18
@apfitzge apfitzge requested a review from jstarry February 10, 2025 22:32
for id in priority_ids {
    self.priority_queue.push(id);
}
let num_dropped = self.priority_queue.len().saturating_sub(self.capacity);
jstarry

I think we also want to enforce that the slab (id_to_transaction_state) length is not more than self.capacity after pushing ids into the queue, right? If we use the priority_queue length, we might actually already have some transactions that are in the slab but aren't currently in the priority queue because they're scheduled. So after pushing 64 txs into the slab, if all checks pass and at least 64 transactions are scheduled, there is space for all of those transactions in the priority queue, and ReceiveAndBuffer would run another pass and could fill an extra 64 txs into the slab again.

apfitzge (Author)

Yeah, you're right about the scheduled txs not being in the queue. We should be using the slab's len here because it should always be >= the priority_queue len, i.e. we never have something in the queue that is not in the slab.
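
A minimal sketch of that fix, with assumed names:

```rust
// Measure overflow against the slab (id_to_transaction_state), which also
// counts transactions that are currently scheduled and therefore absent from
// the priority queue; the slab length is always >= the queue length.
fn num_to_drop(slab_len: usize, capacity: usize) -> usize {
    slab_len.saturating_sub(capacity)
}
```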

apfitzge (Author)

788f144

Added a test for this as well, with all txs in the queue popped but not removed from the map.
Realistically that won't happen due to our CU throttling; we don't have everything scheduled.

Comment on lines +88 to +90
/// To avoid allocating, the caller should not push more than
/// [`EXTRA_CAPACITY`] ids in a call.
/// Returns the number of dropped transactions.
jstarry

Is it enough to enforce that you cannot push any ids that aren't already in the tx container slab? As long as the tx container never goes over self.capacity + EXTRA_CAPACITY, we can never get into a situation where we over-allocate here, I think?

apfitzge (Author)

We shouldn't be pushing anything into the priority queue that's not in the map already.

jstarry

Cool, then we don't really need this comment, right? We can push as many ids as we want (even more than EXTRA_CAPACITY) in a call because we know the resulting length will never be more than the tx container map size.

I was just concerned that if EXTRA_CAPACITY was truly a hard limit, we're not enforcing it anywhere, and if the batch size were ever higher than EXTRA_CAPACITY, we could start pushing more than EXTRA_CAPACITY ids into the priority queue when a full batch of transactions fails and needs to be requeued.

@apfitzge apfitzge force-pushed the view_receiving_age_checks branch from 788f144 to c5fc7f6 on February 11, 2025 22:22
Comment on lines 450 to 464
let transaction = &container
    .get_transaction_ttl(priority_id.id)
    .expect("transaction must exist")
    .transaction;
*result = Consumer::check_fee_payer_unlocked(
    &working_bank,
    transaction,
    &mut error_counters,
);
if result.is_err() {
    num_dropped_on_status_age_checks += 1;
    container.remove_by_id(priority_id.id);
}
}
// Push non-errored transaction into queue.
apfitzge (Author)

@jstarry, functional change here applying a similar change to #4865.
In commit: c5fc7f6

jstarry

Great, looks like the error type is mismatched FYI (CI failure). And we should update the name of num_dropped_on_status_age_checks or add a separate counter for txs dropped due to invalid fee payer

apfitzge (Author)

Yeah, it should be renamed; keeping it in the same counter for now since that's what the other PR does. Will separate the metric in a follow-up PR, changing both ingest paths.

apfitzge (Author)

Great, looks like the error type is mismatched FYI (CI failure). And we should update the name of num_dropped_on_status_age_checks or add a separate counter for txs dropped due to invalid fee payer

Ugh, yeah. I pushed to CI for testing since the rocksdb update was causing a bunch of issues in my dev environment yesterday with not finding libclang. I can actually build now. Should have just been patient and waited; sorry for the wasted review time!

@jstarry commented Feb 12, 2025

Implementation looks correct now; will approve after the last few things are cleaned up.

@apfitzge apfitzge force-pushed the view_receiving_age_checks branch from c5fc7f6 to 95bb4fd on February 12, 2025 15:36