
Implement cursor population in batches, only when batchSize is passed #9366

Merged

Conversation

biomorgoth
Contributor

@biomorgoth biomorgoth commented Aug 27, 2020

Summary
This adds a new feature that optimizes QueryCursor usage when populate is passed to the query. It requires a batchSize to be set on the query; otherwise the cursor behaves as before.

Motivation
Currently, population on cursors iterates one doc at a time, running N populate queries per doc, which is not ideal for large datasets. By specifying a custom batchSize, the Mongo driver already fetches a batch of documents into memory, which can safely be passed into the Mongoose QueryCursor without triggering a new getMore command to the server. Those documents can then be populated together, reducing the number of extra queries to the DB.

Examples

const cursor = Model.find().populate('someRef anotherRef').batchSize(5).cursor();

// The first .next() call on the iterator populates the 5 documents that were
// already in memory, while still passing them one at a time to the iterator.
// Later, on the 6th, 11th, 16th, ... .next() call, the Mongo driver issues a
// getMore command to the server and the received docs are again populated per
// batch, until the cursor is exhausted.
for await (const doc of cursor) {
  console.log(doc);
}
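The buffering behavior described above can be sketched in plain JavaScript. This is a minimal illustration, not Mongoose's actual implementation: `makeBatchedCursor` and `populateBatch` are hypothetical stand-ins for the QueryCursor internals and `Model.populate`, with an in-memory array standing in for the server.

```javascript
// Minimal sketch of batched cursor population. An in-memory array stands in
// for the server; `populateBatch` stands in for Model.populate. None of
// these names are real Mongoose internals.
function makeBatchedCursor(allDocs, batchSize, populateBatch) {
  let buffer = [];
  let offset = 0;
  let populateCalls = 0;

  return {
    // Each next() hands out one doc, but populate runs once per batch.
    async next() {
      if (buffer.length === 0) {
        if (offset >= allDocs.length) return null; // cursor exhausted
        buffer = allDocs.slice(offset, offset + batchSize); // simulated getMore
        offset += batchSize;
        populateBatch(buffer); // one populate pass for the whole batch
        populateCalls++;
      }
      return buffer.shift();
    },
    get populateCalls() { return populateCalls; }
  };
}
```

With 6 docs and a batchSize of 2, `populateBatch` runs 3 times (once per simulated getMore) instead of 6 times (once per document), which is the saving the PR is after.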

@biomorgoth
Contributor Author

I'm willing to hear comments on this feature. I noticed that the current cursor population queries all populate fields per document, which is very slow when multiple populate fields are requested and the main query result set is large.

I turned on logging for the Mongo driver while testing the new feature, and the number of queries indeed drops drastically when specifying batchSize with this approach. Populate queries are issued at the same rate as getMore commands.

@biomorgoth biomorgoth changed the title Implement cursor population in batches, only when batchSize passed Implement cursor population in batches, only when batchSize is passed Aug 27, 2020
Collaborator

@vkarpov15 vkarpov15 left a comment


Overall I think this PR is great! Made a couple comments.

Could we put this in the 5.11 release? This change might be a bit too big for a patch release.

return callback(null, null);
} else {
// Request as many docs as batchSize, to populate them also in batch
return ctx.cursor.next(_onNext.bind({ ctx, callback, batchDocs: [] }));
Collaborator


I'd prefer we not use bind(), especially with a custom object as this.

Contributor Author


Yeah, that was a quick approach to make ESLint happy (you can see the previous commit). Keep in mind that callback needs to be preserved through the internal cursor iteration. There's probably a cleaner way to achieve this. Any suggestions?

I also didn't want to expose onNext, onPopulate and nextDoc in the QueryCursor API, as they are purely internal methods. That forces me to pass the ctx object around somehow, which is how I ended up with the bind() approach.
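One bind()-free alternative is to thread an explicit state object through the callbacks instead of attaching it as `this`. The sketch below is illustrative only, not the PR's actual code: `_onNext`, `nextBatch`, and `fetchNext` are hypothetical names, and the "populate" step is omitted to keep the batching logic visible.

```javascript
// Instead of _onNext.bind({ ctx, callback, batchDocs: [] }), pass the state
// as a regular first argument. The arrow-function closure carries `state`,
// so no custom `this` is needed.
function _onNext(state, err, doc) {
  if (err != null) return state.callback(err);
  if (doc == null) return state.callback(null, state.batchDocs); // source drained
  state.batchDocs.push(doc);
  if (state.batchDocs.length < state.batchSize) {
    // Keep pulling docs until a full batch is buffered.
    return state.fetchNext((e, d) => _onNext(state, e, d));
  }
  return state.callback(null, state.batchDocs); // full batch collected
}

// Entry point: builds the per-batch state and starts the pull loop.
function nextBatch(fetchNext, batchSize, callback) {
  const state = { fetchNext, batchSize, callback, batchDocs: [] };
  fetchNext((e, d) => _onNext(state, e, d));
}
```

This keeps the internal helpers off the QueryCursor API while still preserving `callback` across the iteration, at the cost of one extra parameter on each helper.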

return _this.callback(err);
}

_this.ctx._batchDocs = docs;
Collaborator


Is there a reason to have both .batchDocs and ._batchDocs? Model.populate(docs) modifies docs in place so they point to the same object.

Contributor Author

@biomorgoth biomorgoth Aug 27, 2020


Note that this.batchDocs belongs to the custom object initially bound to _onNext, while this.ctx._batchDocs belongs to the QueryCursor instance. So this assignment is basically passing the Array from one object to the other.

Sorry, I hadn't researched deeply enough to know that Model.populate(docs) modifies the Array in place, so maybe I could just reset ctx._batchDocs before line 312 and always keep one Array. Let me see if that works as expected.
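The in-place behavior being discussed can be shown with a tiny sketch. `populateInPlace` is a hypothetical stand-in for Model.populate, which the reviewer notes mutates the passed documents rather than returning copies:

```javascript
// Why one array suffices: populate mutates the documents in place, so every
// reference to the same array observes the populated results.
// `populateInPlace` is an illustrative stand-in for Model.populate.
function populateInPlace(docs) {
  docs.forEach(doc => { doc.author = { name: 'resolved' }; });
  return docs;
}

const batchDocs = [{ authorId: 1 }, { authorId: 2 }];
const ctx = { _batchDocs: batchDocs }; // second reference to the SAME array
populateInPlace(batchDocs);
// ctx._batchDocs now sees the populated docs too, without any copying.
```

Since both references point at the same array, keeping separate .batchDocs and ._batchDocs copies in sync buys nothing, which is what motivated collapsing them into one.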

@biomorgoth
Contributor Author

No problem with adding this to the 5.11 release. When is it expected to be released?

@biomorgoth
Contributor Author

Done with the single batchDocs array @vkarpov15. Still not sure how to remove the .bind() and at the same time pass the necessary data to _onNext/_populateBatch (especially the callback function).

My first version ( 3ec17eb ) didn't use .bind(), but inner functions that relied on closures for access to ctx/callback. The project's ESLint rules didn't like that style ☹️

@biomorgoth biomorgoth requested a review from vkarpov15 August 28, 2020 18:34
@vkarpov15 vkarpov15 changed the base branch from master to 5.11 September 19, 2020 20:16
@vkarpov15
Collaborator

We're planning on releasing 5.11 in October or November 2020. Does that work for you or do you need this feature out sooner?

@biomorgoth
Contributor Author

Great! I can use my fork as a temporary dependency for my projects until the 5.11 release. I don't intend to rely on the fork as a long-term solution.

@vkarpov15 vkarpov15 added this to the 5.11 milestone Sep 25, 2020
@vkarpov15 vkarpov15 merged commit 6684621 into Automattic:5.11 Sep 25, 2020