Parallelize local memory arbitration #9649

xiaoxmeng · 2024-04-28T20:39:03Z

Parallelize local memory arbitration execution. The memory arbitration workflow changes as follows:
First, a request waits in the arbitration queue, which serializes the processing of arbitration requests
from the same query pool. This PR adds the ArbitrationQueue data structure for this purpose; it holds the
wait promises of the arbitration requests from the same query pool. The arbitration queue is protected by
the arbitrator lock.

Second, the request calls runLocalArbitration to run local arbitration, which acquires the reader lock of
the arbitration lock (added by this PR to control local/global arbitration execution). It then ensures the
requesting memory pool is within its capacity limit (this might trigger a spill to reclaim used memory from
the requesting pool itself), and tries to allocate free capacity from the arbitrator or reclaim free memory
from the other queries, under the protection of the arbitrator lock.

Third, if runLocalArbitration can't reclaim sufficient memory, the request proceeds with runGlobalArbitration,
which acquires the writer lock of the arbitration lock. It then reclaims free capacity from the arbitrator or
the other queries, and finally reclaims used memory from the other queries by spilling.
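
The reader/writer split in steps two and three can be illustrated with a minimal sketch. This is not the SharedArbitrator code: `ToyArbitrator`, its members, and the "reclaimable by others" counter are hypothetical stand-ins, spilling is reduced to moving bytes between two counters, and the per-query arbitration queue from step one is left out. It only shows that local arbitrations can share the arbitration lock while a global arbitration holds it exclusively.

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>

// Hypothetical, simplified stand-in for the arbitrator described above.
class ToyArbitrator {
 public:
  ToyArbitrator(uint64_t freeCapacity, uint64_t reclaimableByOthers)
      : freeCapacity_(freeCapacity),
        reclaimableByOthers_(reclaimableByOthers) {}

  // Returns true if 'bytes' of capacity was granted to the caller.
  // 'globalArbitrationEnabled' models the option to run local arbitration only.
  bool grow(uint64_t bytes, bool globalArbitrationEnabled = true) {
    if (runLocalArbitration(bytes)) {
      return true;
    }
    if (!globalArbitrationEnabled) {
      return false;
    }
    return runGlobalArbitration(bytes);
  }

 private:
  // Local arbitration: takes the arbitration lock in shared (reader) mode, so
  // many local arbitrations can run at once. The short critical section that
  // touches the free capacity is guarded by a separate mutex, standing in for
  // the arbitrator lock.
  bool runLocalArbitration(uint64_t bytes) {
    std::shared_lock<std::shared_mutex> readLock(arbitrationLock_);
    std::lock_guard<std::mutex> stateGuard(stateLock_);
    return tryAllocateLocked(bytes);
  }

  // Global arbitration: takes the arbitration lock in exclusive (writer) mode,
  // so it excludes all local arbitrations while it reclaims from other queries.
  bool runGlobalArbitration(uint64_t bytes) {
    std::unique_lock<std::shared_mutex> writeLock(arbitrationLock_);
    std::lock_guard<std::mutex> stateGuard(stateLock_);
    // Stand-in for reclaiming used memory from other queries by spilling.
    freeCapacity_ += reclaimableByOthers_;
    reclaimableByOthers_ = 0;
    return tryAllocateLocked(bytes);
  }

  // Caller must hold stateLock_.
  bool tryAllocateLocked(uint64_t bytes) {
    if (freeCapacity_ < bytes) {
      return false;
    }
    freeCapacity_ -= bytes;
    return true;
  }

  std::shared_mutex arbitrationLock_; // reader: local, writer: global
  std::mutex stateLock_;              // stand-in for the arbitrator lock
  uint64_t freeCapacity_;
  uint64_t reclaimableByOthers_;
};
```

In this sketch a caller would invoke `grow(bytes)` for the normal path, or `grow(bytes, false)` to model a configuration where only local arbitration runs.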

ArbitrationOperation is added to contain the state of arbitration request processing, which simplifies the
code.
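
As a rough illustration of why bundling the request state helps, the sketch below passes one operation object to a capacity check instead of separate pool and byte-count arguments. All names and fields here (`ToyPool`, `ToyArbitrationOperation`, the `nodeCapacity` parameter) are hypothetical, and the convention that the check returns true when the growth fits is an assumption; the real ArbitrationOperation and checkCapacityGrowth may differ.

```cpp
#include <cstdint>

// Hypothetical view of a query memory pool's capacity.
struct ToyPool {
  uint64_t capacity;    // capacity currently granted to the pool
  uint64_t maxCapacity; // upper bound the pool may grow to
};

// Hypothetical bundle of the state of one arbitration request.
struct ToyArbitrationOperation {
  ToyPool* requestPool; // pool that asked for more capacity
  uint64_t targetBytes; // number of bytes requested
};

// Assumed convention: returns true if growing the requesting pool by
// 'targetBytes' stays within both the pool's max capacity and the node
// capacity. Integer overflow is ignored for brevity.
inline bool checkCapacityGrowth(
    const ToyArbitrationOperation& op,
    uint64_t nodeCapacity) {
  const uint64_t newCapacity = op.requestPool->capacity + op.targetBytes;
  return newCapacity <= op.requestPool->maxCapacity &&
      newCapacity <= nodeCapacity;
}
```

Helpers written this way take a single argument, so adding per-request state later (timing, a reclaim budget, and so on) does not change every signature.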

This PR also adds an option to disable global arbitration and only run local arbitration, which can always
run in parallel. With this option, in a 2-hour stress test (1000 query concurrency at the coordinator)
replaying Meta internal production workloads, the average query execution time dropped by ~30% while CPU
time stayed the same. The memory arbitration wall time for queries that did not trigger spilling was only
9 minutes in total.

netlify · 2024-04-28T20:39:23Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`65ce31d`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/6639a98df0ce1300089312e5

facebook-github-bot · 2024-04-28T20:39:58Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

tanjialiang · 2024-04-30T19:53:46Z

velox/common/memory/SharedArbitrator.h

  };

  // Invoked to check if the memory growth will exceed the memory pool's max
  // capacity limit or the arbitrator's node capacity limit.
-  bool checkCapacityGrowth(const MemoryPool& pool, uint64_t targetBytes) const;
+  bool checkCapacityGrowth(ArbitrationOperation* op) const;


Instead of passing the same object around everywhere, shall we just pass this operation object to runLocalArbitration() and keep track of the instance of the currently running global operation? It seems unnecessary to pass this stateful item around for global arbitration.

We can set the currently running arbitration operation as a member of the shared arbitrator object, but there are quite a few utilities shared between local and global arbitration operations.

Looks like the only shared one is checkFreeCapacity()?

facebook-github-bot · 2024-04-30T23:12:59Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

tanjialiang

Thanks. Left some comments. Also since this changed the arbitration workflow of SharedArbitrator, shall we also put detailed documentation on top of the SharedArbitrator class?

facebook-github-bot · 2024-05-02T23:58:07Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-04T03:00:02Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-04T04:35:20Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-04T04:52:51Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-04T07:52:16Z

This pull request was exported from Phabricator. Differential Revision: D56685200

facebook-github-bot · 2024-05-06T21:54:13Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-07T00:56:52Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-07T03:04:05Z

This pull request was exported from Phabricator. Differential Revision: D56685200

facebook-github-bot · 2024-05-07T04:09:41Z

This pull request was exported from Phabricator. Differential Revision: D56685200

facebook-github-bot · 2024-05-07T06:38:27Z

@xiaoxmeng merged this pull request in 5911129.

conbench-facebook · 2024-05-07T07:19:45Z

Conbench analyzed the 1 benchmark run on commit 59111293.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Summary:
Parallelize local memory arbitration execution. The memory arbitration workflow changes as follows:
First, a request waits in the arbitration queue, which serializes the processing of arbitration requests from the same query pool. Adds the ArbitrationQueue data structure for this purpose; it holds the wait promises of the arbitration requests from the same query pool. The arbitration queue is protected by the arbitrator lock.

Second, the request calls runLocalArbitration to run local arbitration, which acquires the reader lock of the arbitration lock (added by this PR to control local/global arbitration execution). It then ensures the requesting memory pool is within its capacity limit (this might trigger a spill to reclaim used memory from the requesting pool itself), and tries to allocate free capacity from the arbitrator or reclaim free memory from the other queries, under the protection of the arbitrator lock.

Third, if runLocalArbitration can't reclaim sufficient memory, the request proceeds with runGlobalArbitration, which acquires the writer lock of the arbitration lock. It then reclaims free capacity from the arbitrator or the other queries, and finally reclaims used memory from the other queries by spilling.

ArbitrationOperation is added to contain the state of arbitration request processing, which simplifies the code.

This PR also adds an option to disable global arbitration and only run local arbitration, which can always run in parallel. With this option, in a 2-hour stress test (1000 query concurrency at the coordinator) replaying Meta internal production workloads, the average query execution time dropped by ~30% while CPU time stayed the same. The memory arbitration wall time for queries that did not trigger spilling was only 9 minutes in total.

Pull Request resolved: facebookincubator#9649
Reviewed By: tanjialiang, oerling
Differential Revision: D56685200
Pulled By: xiaoxmeng
fbshipit-source-id: f8db71e4cae05f24f913464c37c5bc1e6528083b

facebook-github-bot added the CLA Signed label Apr 28, 2024

xiaoxmeng force-pushed the par branch from 143910b to 51ff8ae April 29, 2024 00:29