
Implement cursor population in batches, only when batchSize is passed #9366

Merged

Conversation

biomorgoth
Contributor

@biomorgoth biomorgoth commented Aug 27, 2020

Summary
This adds a new feature that optimizes QueryCursor usage when populate is passed to the query. It requires a batchSize to be set on the query; otherwise the cursor behaves as before.

Motivation
Currently, population on cursors iterates one doc at a time, running N populate queries per doc, which is not ideal for large datasets. By specifying a custom batchSize, the Mongo driver already fetches a batch of documents into memory, which can safely be passed into the Mongoose QueryCursor without triggering a new getMore command to the server. Those documents can then be populated together, reducing the number of extra queries to the DB.

Examples

const cursor = Model.find().populate('someRef anotherRef').batchSize(5).cursor();

// The first .next() call on the iterator populates the 5 documents that were
// already in memory, while still passing them one at a time to the iterator.
// Later, on the 6th, 11th, 16th, ... .next() call, the Mongo driver issues a
// getMore command to the server and the received docs are again populated per
// batch, until the cursor is exhausted.
for await (const doc of cursor) {
  console.log(doc);
}
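The buffering behavior described above can be sketched in plain JavaScript. This is a minimal illustration, not Mongoose's actual implementation: `makeBatchedCursor` and `populateBatch` are hypothetical stand-ins for the QueryCursor internals and `Model.populate`, with an in-memory array standing in for the server.

```javascript
// Minimal sketch of batched cursor population. An in-memory array stands in
// for the server; `populateBatch` stands in for Model.populate. None of
// these names are real Mongoose internals.
function makeBatchedCursor(allDocs, batchSize, populateBatch) {
  let buffer = [];
  let offset = 0;
  let populateCalls = 0;

  return {
    // Each next() hands out one doc, but populate runs once per batch.
    async next() {
      if (buffer.length === 0) {
        if (offset >= allDocs.length) return null; // cursor exhausted
        buffer = allDocs.slice(offset, offset + batchSize); // simulated getMore
        offset += batchSize;
        populateBatch(buffer); // one populate pass for the whole batch
        populateCalls++;
      }
      return buffer.shift();
    },
    get populateCalls() { return populateCalls; }
  };
}
```

With 6 docs and a batchSize of 2, `populateBatch` runs 3 times (once per simulated getMore) instead of 6 times (once per document), which is the saving the PR is after.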

@biomorgoth
Contributor Author

I'm willing to hear comments on this feature. I noticed that the current cursor population queries all populate fields per document, which is very slow when multiple populate fields are requested and the main query result set is large.

I turned on logging for the Mongo driver while testing the new feature, and the number of queries indeed drops drastically when specifying batchSize with this approach. Populate queries are issued at the same rate as getMore commands.

@biomorgoth biomorgoth changed the title Implement cursor population in batches, only when batchSize passed Implement cursor population in batches, only when batchSize is passed Aug 27, 2020
Collaborator

@vkarpov15 vkarpov15 left a comment


Overall I think this PR is great! Made a couple comments.

Could we put this in the 5.11 release? This change might be a bit too big for a patch release.

return callback(null, null);
} else {
// Request as many docs as batchSize, to populate them also in batch
return ctx.cursor.next(_onNext.bind({ ctx, callback, batchDocs: [] }));
Collaborator


I'd prefer we not use bind(), especially with a custom object as this.

Contributor Author


Yeah, that was a quick approach to make ESLint happy (you can see the previous commit). Keep in mind that callback needs to be preserved through the internal cursor iteration. There's probably a cleaner way to achieve this. Any suggestions?

I also didn't want to expose onNext, onPopulate and nextDoc in the QueryCursor API, as they are purely internal methods. That forces me to pass the ctx object around somehow, which is how I ended up with the bind() approach.
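One bind()-free alternative is to thread an explicit state object through the callbacks instead of attaching it as `this`. The sketch below is illustrative only, not the PR's actual code: `_onNext`, `nextBatch`, and `fetchNext` are hypothetical names, and the "populate" step is omitted to keep the batching logic visible.

```javascript
// Instead of _onNext.bind({ ctx, callback, batchDocs: [] }), pass the state
// as a regular first argument. The arrow-function closure carries `state`,
// so no custom `this` is needed.
function _onNext(state, err, doc) {
  if (err != null) return state.callback(err);
  if (doc == null) return state.callback(null, state.batchDocs); // source drained
  state.batchDocs.push(doc);
  if (state.batchDocs.length < state.batchSize) {
    // Keep pulling docs until a full batch is buffered.
    return state.fetchNext((e, d) => _onNext(state, e, d));
  }
  return state.callback(null, state.batchDocs); // full batch collected
}

// Entry point: builds the per-batch state and starts the pull loop.
function nextBatch(fetchNext, batchSize, callback) {
  const state = { fetchNext, batchSize, callback, batchDocs: [] };
  fetchNext((e, d) => _onNext(state, e, d));
}
```

This keeps the internal helpers off the QueryCursor API while still preserving `callback` across the iteration, at the cost of one extra parameter on each helper.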

return _this.callback(err);
}

_this.ctx._batchDocs = docs;
Collaborator


Is there a reason to have both .batchDocs and ._batchDocs? Model.populate(docs) modifies docs in place so they point to the same object.

Contributor Author

@biomorgoth biomorgoth Aug 27, 2020


Note that this.batchDocs belongs to the custom object initially bound to _onNext, while this.ctx._batchDocs belongs to the QueryCursor instance. So this assignment is basically passing the Array from one object to the other.

Sorry, I hadn't researched deeply enough to know that Model.populate(docs) modifies the Array in place, so maybe I could just reset ctx._batchDocs before line 312 and always keep one Array. Let me see if that works as expected.
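The in-place behavior being discussed can be shown with a tiny sketch. `populateInPlace` is a hypothetical stand-in for Model.populate, which the reviewer notes mutates the passed documents rather than returning copies:

```javascript
// Why one array suffices: populate mutates the documents in place, so every
// reference to the same array observes the populated results.
// `populateInPlace` is an illustrative stand-in for Model.populate.
function populateInPlace(docs) {
  docs.forEach(doc => { doc.author = { name: 'resolved' }; });
  return docs;
}

const batchDocs = [{ authorId: 1 }, { authorId: 2 }];
const ctx = { _batchDocs: batchDocs }; // second reference to the SAME array
populateInPlace(batchDocs);
// ctx._batchDocs now sees the populated docs too, without any copying.
```

Since both references point at the same array, keeping separate .batchDocs and ._batchDocs copies in sync buys nothing, which is what motivated collapsing them into one.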

@biomorgoth
Contributor Author

No problem with adding this to the 5.11 release. When is it expected to be released?

@biomorgoth
Contributor Author

Done with the single batchDocs array @vkarpov15. Still not sure how to remove the .bind() and at the same time pass the necessary data to _onNext/_populateBatch (especially the callback function).

My first version ( 3ec17eb ) didn't use .bind(), but inner functions that relied on closures for access to ctx/callback. The project's ESLint rules didn't like that style ☹️

@biomorgoth biomorgoth requested a review from vkarpov15 August 28, 2020 18:34
@vkarpov15 vkarpov15 changed the base branch from master to 5.11 September 19, 2020 20:16
@vkarpov15
Collaborator

We're planning on releasing 5.11 in October or November 2020. Does that work for you or do you need this feature out sooner?

@biomorgoth
Contributor Author

Great! I can use my fork as a temporary dependency for my projects until the 5.11 release. I don't intend to rely on the fork as a long-term solution.

@vkarpov15 vkarpov15 added this to the 5.11 milestone Sep 25, 2020
@vkarpov15 vkarpov15 merged commit 6684621 into Automattic:5.11 Sep 25, 2020