Shared memory between calls #673
Conversation
…per memory allocation
cc @DaniPopes who may also have thoughts
In general I'm very supportive of this; tbh I would have expected the same or better performance. I'll play with it a little to check it out. I would like to see if we can remove the `Rc<RefCell<>>`; maybe the overhead we see is related to the dynamic borrow checks inside `RefCell`. Snailtracer should be a good benchmark to check, as EVM init is small in comparison to the work the interpreter is doing.
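For context on that last point, here is a minimal sketch (hypothetical names, not revm code) of why `RefCell`'s dynamic borrow checks can cost something in a hot loop: every `borrow_mut()` performs a runtime flag check, while a plain `&mut` is verified at compile time:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for the interpreter's memory buffer.
struct Memory {
    data: Vec<u8>,
}

// Shared ownership: each access pays a runtime borrow-flag check
// (and panics if a conflicting borrow is outstanding).
fn write_shared(mem: &Rc<RefCell<Memory>>, byte: u8) {
    mem.borrow_mut().data.push(byte);
}

// Exclusive reference: borrow rules are checked statically, no runtime cost.
fn write_direct(mem: &mut Memory, byte: u8) {
    mem.data.push(byte);
}
```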
I'm very happy you like it! After this PR I can try to do a
I've found a way to remove the
Same as #660 (comment)
#582 was merged, can you rebase this PR? @lorenzofero
I can't really perform a rebase because I have conflicts in multiple commits and it's really cumbersome to work through. I'm doing a merge with conflict resolution instead 👍. I hope that's not a problem! It will take some time though; I'd like to stay as close as possible to how you handled some of the memory functions.
Hey @rakita @DaniPopes @gakonst, the tests are passing now if you want to check it out again. I managed to keep all of Dani's updates to the memory-related functions inside. Lastly, is there a new way to run benches now? I've seen in the readme that something has changed, but it seems out of date. Thanks!
You can run
cachegrind:

- main (8206193): 431,577,511
- this PR #673 (c7945eb): 445,967,393
This is great btw; I think we can get the perf regression down to neutral or positive.
Thank you for the very detailed feedback; I'll try to resolve all the comments as soon as possible.
I think this is what's causing tests to fail
Today, if I have time, I'll take a look at the failing tests with:

```rust
fn set_ptr(&mut self, checkpoint: usize) {
    if self.data.len() > 0 {
        assume!(checkpoint < self.data.len());
        self.current_ptr = unsafe { self.data.as_mut_ptr().add(checkpoint) };
    }
}
```
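For readers unfamiliar with the macro: an `assume!` along these lines (a sketch of the usual pattern, not necessarily revm's exact definition) tells the optimizer a condition holds so bounds checks can be elided, while still failing loudly in debug builds:

```rust
// Sketch of an assume!-style macro; illustrative, not revm's actual code.
macro_rules! assume {
    ($cond:expr) => {
        if !$cond {
            if cfg!(debug_assertions) {
                // In debug builds, surface violated assumptions immediately.
                panic!("assumption failed: {}", stringify!($cond));
            } else {
                // SAFETY: the caller promises $cond always holds here; this
                // lets the optimizer drop the corresponding runtime checks.
                unsafe { core::hint::unreachable_unchecked() }
            }
        }
    };
}
```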
Hi @lorenzofero, how is the performance on this? In essence, I expected this to be more impactful, but if this is not the case there is unfortunately no good reason to include it.
Hey @rakita, I tried running Cachegrind again after Dani's first measurements:

which is around -3.5% in performance. After the current changes, this is what I get:

which is around -2.7%. @DaniPopes, maybe I can try with this now to see if we can squeeze out a little bit more.
Yeah, I get that the more complex the transaction, the better the performance. For simple stuff I expect it to be somewhat worse on average, as the benches suggest.
Hey, I hope you don't find all of these comments pedantic. I tried a different model for the shared memory, similar to ethereum/evmone#481 (comment), which you can see on my fork: thedevbirb#2. This model has some benefits imo:

Here is the result of running

If you like it, I can bring the changes to this branch or open a new PR.
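For illustration, here is a condensed sketch of that evmone-like model (simplified; names and details may differ from the fork): one flat buffer plus a stack of per-call checkpoints, so entering a call records an offset and returning truncates back to it:

```rust
/// Simplified sketch of checkpoint-based shared memory; illustrative only.
pub struct SharedMemory {
    data: Vec<u8>,
    /// Start offset of each call context's region.
    checkpoints: Vec<usize>,
}

impl SharedMemory {
    pub fn new() -> Self {
        Self { data: Vec::new(), checkpoints: Vec::new() }
    }

    /// Entering a call: the new context's memory begins at the current end.
    pub fn new_context(&mut self) {
        self.checkpoints.push(self.data.len());
    }

    /// Leaving a call: everything the inner context allocated is discarded.
    pub fn free_context(&mut self) {
        if let Some(checkpoint) = self.checkpoints.pop() {
            self.data.truncate(checkpoint);
        }
    }

    /// Bytes used by the current (innermost) context.
    pub fn context_len(&self) -> usize {
        self.data.len() - self.checkpoints.last().copied().unwrap_or(0)
    }
}
```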
It is fine; this is a tough decision, as you invested a lot of time into this and the benefits are unfortunately small. I have a few examples where an idea didn't pan out as expected (gas block, I am looking at you: we had a performance boost of 5-7% and the code was merged, but it was very hard to use, so after a few months it was reverted), so in the end it was more like a research effort to get data.

This seems better: not all transactions are going to be 1024 calls deep or use 30M gas, so this is more reasonable. But with this,

And spilling context (in this case memory) from one Interpreter to another would be fine if we gained something significant; on the other hand, having just one place for memory opens things up for new ideas. This is probably going to look better if we switch from recursive calls to loop-based calls (if the loop approach turns out okay). I am on the fence here, but let's include it; I will review it in detail in the next few days (@DaniPopes already did an amazing job there).
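To illustrate the loop idea mentioned above (entirely hypothetical, not revm's code): instead of each sub-call recursing into a new `Interpreter`, a loop keeps an explicit frame stack next to the single shared buffer, so entering and leaving a context is just a push/pop plus a checkpoint update:

```rust
// Hypothetical sketch of loop-based call handling; no names here are revm's
// actual API.
struct Frame; // stand-in for one call's interpreter state

enum Step {
    Call(Frame), // the current frame performs a sub-call
    Return,      // the current frame finishes
}

fn run(mut step: impl FnMut(&mut Frame, &mut Vec<u8>) -> Step) {
    let mut memory: Vec<u8> = Vec::new();         // the single shared buffer
    let mut checkpoints: Vec<usize> = Vec::new(); // one entry per active sub-call
    let mut frames = vec![Frame];
    while let Some(frame) = frames.last_mut() {
        match step(frame, &mut memory) {
            Step::Call(new_frame) => {
                checkpoints.push(memory.len()); // record where the child starts
                frames.push(new_frame);
            }
            Step::Return => {
                if let Some(cp) = checkpoints.pop() {
                    memory.truncate(cp); // discard the child's allocations
                }
                frames.pop();
            }
        }
    }
}
```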
Ok, I brought the changes here for review. However, I was wondering whether it would be worth opening a new PR that supersedes this one with the new model, since both the commit history and the GitHub conversation here are becoming a little messy imo. Let me know what you think!
It is messy, but it is fine for it to be in one place so people can follow what is happening.
We should reintroduce the memory limit.
Other parts look good!
```rust
($interp:expr, $offset:expr, $len:expr) => {
    if let Some(new_size) =
        crate::interpreter::next_multiple_of_32($offset.saturating_add($len))
    {
        #[cfg(feature = "memory_limit")]
        if new_size > ($interp.memory_limit as usize) {
```
Should we reintroduce the memory limit?
```rust
/// Memory checkpoints for each depth
checkpoints: Vec<usize>,
/// How much memory has been used in the current context
current_len: usize,
```
We can probably put the memory limit here; it feels like a better place.
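A sketch of what that could look like (hypothetical placement, gated the same way as the existing check):

```rust
pub struct SharedMemory {
    data: Vec<u8>,
    /// Memory checkpoints for each depth.
    checkpoints: Vec<usize>,
    /// How much memory has been used in the current context.
    current_len: usize,
    /// Upper bound on memory growth, enforced on resize.
    #[cfg(feature = "memory_limit")]
    memory_limit: u64,
}
```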
```rust
/// Get the last memory checkpoint
#[inline(always)]
fn last_checkpoint(&self) -> usize {
    *self.checkpoints.last().unwrap_or(&0)
}
```
```diff
- *self.checkpoints.last().unwrap_or(&0)
+ self.checkpoints.last().cloned().unwrap_or_default()
```

`cloned` would remove the reference from the option.
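To spell out the types (an illustrative snippet, not from the PR): `.last()` yields an `Option<&usize>`, and `.cloned()` turns it into an `Option<usize>`, so neither the manual deref nor the `&0` default is needed:

```rust
fn main() {
    let checkpoints: Vec<usize> = vec![0, 32, 96];
    // Manual deref of the Option<&usize>:
    let a: usize = *checkpoints.last().unwrap_or(&0);
    // `.cloned()` converts Option<&usize> into Option<usize> first:
    let b: usize = checkpoints.last().cloned().unwrap_or_default();
    assert_eq!(a, b);
}
```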
It should be good now!
lgtm! Amazing work @lorenzofero!
I'm very happy we've found the right approach for this and got some good performance improvements. Thanks a lot @rakita and @DaniPopes for all the support over the last month; it has been a pleasure working with you.
I tried to give #445 a shot, just for fun and as a challenge, with this PR. This introduces a new struct called `SharedMemory`, which is indeed shared between calls.

About the implementation

- It is based on the current `Memory` implementation
- It has a `current_slice` raw pointer which is internally used to refer to the portion of data reserved for the current context. This requires two `unsafe` methods, `get_current_slice` and `get_current_slice_mut`, which dereference the raw pointer
- The `current_slice` pointer is updated when entering a new context (which is when a new `Interpreter` instance is created) using the `new_context_memory` method, and when exiting it, which happens when the `return_value` method of `Interpreter` is called
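A condensed sketch of that pointer scheme (simplified from the description above; the PR's actual signatures may differ):

```rust
/// Illustrative sketch of the `current_slice` model; not the PR's exact code.
pub struct SharedMemory {
    data: Vec<u8>,
    /// Raw pointer to the start of the current context's region within `data`.
    current_slice: *mut u8,
    /// Length of the current context's region.
    current_len: usize,
}

impl SharedMemory {
    /// # Safety
    /// `current_slice` must point into `data`, and `current_len` bytes
    /// starting there must be initialized.
    pub unsafe fn get_current_slice(&self) -> &[u8] {
        std::slice::from_raw_parts(self.current_slice, self.current_len)
    }

    /// # Safety
    /// Same invariants as `get_current_slice`.
    pub unsafe fn get_current_slice_mut(&mut self) -> &mut [u8] {
        std::slice::from_raw_parts_mut(self.current_slice, self.current_len)
    }
}
```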
Performance

I'd like feedback on this because:
Anyway, this is the result of running `cargo bench --all` on `main` and then on my branch:

I also tried to run cachegrind as explained in #582, and it is indeed slower, by about 10% or less. The gas limit required is between 2^22 and 2^23. Maybe here too I'm allocating memory which is not used very much; it depends on the bytecode of the snailtracer.
Thanks in advance for any feedback!