[PECO-953] Optimize CloudFetchResultHandler memory consumption #204

kravets-levko · 2023-11-21T12:57:02Z

Initially, we implemented the CloudFetch result handler on top of the Arrow result handler. The problem with this approach is that the Arrow result handler operates on batches returned directly in TRowSets, which are small (tens of kilobytes at most) and contain a few hundred records each, so the Arrow handler simply unpacked and processed the whole batch immediately.

CloudFetch results, on the other hand, are stored in files of 10MB and more, and attempting to unpack a whole file led to quite intensive memory usage (in some cases up to 1GB Node.js RSS / 700MB heap).

The solution is to not unpack the whole received file. Instead, we read and process batches one by one. Each of them is small (roughly the size of the batches returned in a TRowSet), and usually by the time the next batch is requested the previous one is no longer needed, so Node can collect and reuse that memory easily.
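
To illustrate the idea of reading batches one at a time, here is a minimal sketch. It is not the code from this PR and does not use the Arrow IPC format: the length-prefixed framing, `packBatches`, `LazyBatchReader` and `Row` are all hypothetical names made up for the example. It only shows how a reader can decode one batch per call instead of materializing every record in the file at once.

```typescript
// Hypothetical framing (NOT the Arrow IPC format): every batch is stored as
// a 4-byte big-endian length followed by that many payload bytes.
type Row = { value: number };

// Build a "file" out of several batches, for demonstration only.
// Values must fit in a byte (0..255).
function packBatches(batches: number[][]): Uint8Array {
  let total = 0;
  for (const batch of batches) total += 4 + batch.length;
  const file = new Uint8Array(total);
  const view = new DataView(file.buffer);
  let offset = 0;
  for (const batch of batches) {
    view.setUint32(offset, batch.length);
    file.set(batch, offset + 4);
    offset += 4 + batch.length;
  }
  return file;
}

// Decodes one batch per call, so only the rows of the current batch are
// alive at any moment (assuming the caller drops the previous result).
class LazyBatchReader {
  private file: Uint8Array;
  private view: DataView;
  private offset = 0;

  constructor(file: Uint8Array) {
    this.file = file;
    this.view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  }

  hasMore(): boolean {
    return this.offset < this.file.byteLength;
  }

  next(): Row[] {
    if (!this.hasMore()) return [];
    const length = this.view.getUint32(this.offset);
    const start = this.offset + 4;
    const rows: Row[] = [];
    for (let i = 0; i < length; i += 1) {
      rows.push({ value: this.file[start + i] });
    }
    this.offset = start + length;
    return rows;
  }
}

// Usage: process batches one by one, keeping only running totals.
const demoReader = new LazyBatchReader(packBatches([[1, 2], [3], [4, 5, 6]]));
let demoTotal = 0;
let demoBatchCount = 0;
while (demoReader.hasMore()) {
  for (const row of demoReader.next()) demoTotal += row.value;
  demoBatchCount += 1;
}
```

The eager alternative would loop until `hasMore()` is false and push every row into a single array, which is roughly what unpacking a whole file amounts to.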

Optimize CloudFetch result handler
Add/update tests

Profiler reports before and after the changes (10,000,000 records with two fields each, 1 concurrent download for better visibility)

Before

After

Notes for reviewers

Previously, ArrowResultHandler collected Arrow batches and converted them to objects. CloudFetchResultHandler inherited from ArrowResultHandler and overrode the batch-collecting method.

Now ArrowResultHandler and CloudFetchResultHandler are separate. Both just collect raw (binary) Arrow batches, each in its own way, and pass them to ArrowResultConverter. ArrowResultConverter contains the data conversion code that was previously in ArrowResultHandler, but uses a new mechanism to unpack binary batches (the old one read all records at once, the new one reads them one by one).
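
The separation described above could look roughly like the following sketch. All names here (`RawBatchProvider`, `InlineBatchProvider`, `FileBatchProvider`, `BatchConverter`, `ResultRow`) are hypothetical stand-ins, not the actual classes from this PR, and the sketch is synchronous for brevity while the real handlers may be asynchronous. The "decoding" step is also a placeholder (one byte becomes one row), not Arrow decoding.

```typescript
type ResultRow = { value: number };

// Hypothetical common shape: a source that hands out raw (binary) batches.
interface RawBatchProvider {
  hasMore(): boolean;
  fetchNext(): Uint8Array[];
}

// Stand-in for the inline path: each response already carries its batches.
class InlineBatchProvider implements RawBatchProvider {
  private responses: Uint8Array[][];
  constructor(responses: Uint8Array[][]) {
    this.responses = responses.slice();
  }
  hasMore(): boolean {
    return this.responses.length > 0;
  }
  fetchNext(): Uint8Array[] {
    return this.responses.shift() || [];
  }
}

// Stand-in for the download path: batches come from files fetched by link.
class FileBatchProvider implements RawBatchProvider {
  private links: string[];
  private download: (link: string) => Uint8Array[];
  constructor(links: string[], download: (link: string) => Uint8Array[]) {
    this.links = links.slice();
    this.download = download;
  }
  hasMore(): boolean {
    return this.links.length > 0;
  }
  fetchNext(): Uint8Array[] {
    const link = this.links.shift();
    return link === undefined ? [] : this.download(link);
  }
}

// Shared conversion step: keeps raw batches and decodes one per call.
class BatchConverter {
  private source: RawBatchProvider;
  private pending: Uint8Array[] = [];
  constructor(source: RawBatchProvider) {
    this.source = source;
  }
  hasMore(): boolean {
    return this.pending.length > 0 || this.source.hasMore();
  }
  fetchNext(): ResultRow[] {
    while (this.pending.length === 0 && this.source.hasMore()) {
      this.pending = this.source.fetchNext();
    }
    const batch = this.pending.shift();
    if (batch === undefined) return [];
    const rows: ResultRow[] = [];
    for (let i = 0; i < batch.length; i += 1) rows.push({ value: batch[i] });
    return rows;
  }
}

// Helper for the example: drain a converter into a flat list of values.
function collectValues(converter: BatchConverter): number[] {
  const values: number[] = [];
  while (converter.hasMore()) {
    for (const row of converter.fetchNext()) values.push(row.value);
  }
  return values;
}
```

With this shape neither source knows anything about conversion, and the converter does not care where the raw batches came from.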

Tests were mostly updated to reflect these changes; not much new code was added there.


nithinkdb

LGTM

kravets-levko added 13 commits October 7, 2023 19:41

Refactoring: Introduce concept of results provider; convert FetchResu…

d292824

…ltsHelper into provider of TRowSet Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Convert Json/Arrow/CloudFetch result handlers to implement result pro…

3da3e4a

…vider interface Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Refine the code and update tests

6ada0db

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Make sure that DBSQLOperation.fetchChunk returns chunks of requested …

f8ca56d

…size Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Add option to disable result buffering & slicing

ec96ec6

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Update existing tests

44168c4

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Add tests for ResultSlicer

68c2225

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Merge branch 'main' into fix-max-rows-behavior

acd511f

Merge branch 'main' into fix-max-rows-behavior

9646b38

Refine code

b6936f2

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Add more tests

29e86e3

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Optimize CloudFetchResultHandler memory consumption

05691d0

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

Add and update tests

55b7ed5

Signed-off-by: Levko Kravets <levko.ne@gmail.com>

kravets-levko marked this pull request as ready for review November 22, 2023 10:56

kravets-levko requested review from arikfr, superdupershant, yunbodeng-db, susodapop, nithinkdb and andrefurlan-db as code owners November 22, 2023 10:56

yunbodeng-db requested a review from rcypher-databricks November 22, 2023 15:26

Base automatically changed from fix-max-rows-behavior to main November 28, 2023 11:53

Merge branch 'main' into optimize-cloudfetch-handler

0ee7186

kravets-levko temporarily deployed to azure-prod November 28, 2023 12:18 — with GitHub Actions Inactive

databricks deleted a comment from codecov-commenter Nov 28, 2023

rcypher-databricks approved these changes Nov 28, 2023

kravets-levko mentioned this pull request Nov 29, 2023

Performance fixes #207

Merged

Merge branch 'main' into optimize-cloudfetch-handler

6ad5046

kravets-levko temporarily deployed to azure-prod November 30, 2023 20:57 — with GitHub Actions Inactive

databricks deleted a comment from codecov-commenter Nov 30, 2023

kravets-levko changed the title ~~Optimize CloudFetchResultHandler memory consumption~~ [PECO-953] Optimize CloudFetchResultHandler memory consumption Nov 30, 2023

nithinkdb approved these changes Dec 4, 2023

kravets-levko merged commit 5c5b87f into main Dec 4, 2023
5 checks passed

kravets-levko deleted the optimize-cloudfetch-handler branch December 4, 2023 20:14
