This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Ensure GeometryTile::pending state is only false in the last placement attempt #9842

Merged
merged 4 commits into from
Aug 25, 2017

brunoabinader
Member

This PR:

This causes API.RecycleMapUpdateImages to pass without cd8eb13.

@brunoabinader brunoabinader self-assigned this Aug 23, 2017
@brunoabinader brunoabinader added Core The cross-platform C++ core, aka mbgl bug labels Aug 23, 2017
@ChrisLoer
Contributor

I think there's a race condition here (this timeline is imagined, not taken from an actual trace):

| Main | Worker |
| --- | --- |
| setData | ... |
| ... | redoLayout, send getImages |
| Receives getImages, start downloading | ... |
| setLayers | ... |
| ... | redoLayout, send getImages |
| Finishes downloading, hasn't received getImages yet, so thinks it has all dependencies, sends onImagesAvailable | ... |
| Receives getImages, starts downloading again | Receives onImagesAvailable, erroneously starts placement |

@brunoabinader
Member Author

I haven't found a way to reproduce the workflow mentioned by @ChrisLoer ( #9842 (comment) ) yet, but it looks possible. @kkaefer's suggestion of bookkeeping these via correlation ID sounds like a good solution - I'll add a patch for that.

@brunoabinader
Member Author

Reverting cd8eb13 fixes #9829.

@ChrisLoer
Contributor

The correlation ID change looks good. Doesn't it also make c69fbee unnecessary? If that change is still necessary, we should update the comment in ImageManager::getImages to match the behavior of the code below.

@ivovandongen ivovandongen mentioned this pull request Aug 23, 2017
6 tasks
@brunoabinader
Member Author

Doesn't it also make c69fbee unnecessary?

Not really - c69fbee prevents the first onImagesAvailable from being called (caused by setData in API.RecycleMapUpdateImages). Having two or more onImagesAvailable calls prevents attemptPlacement from re-preparing the symbol layout features. These two onImagesAvailable calls cannot be caught only by correlation because the time span between the getImage calls is long enough to coalesce.

@ChrisLoer
Contributor

These two onImagesAvailable calls cannot be caught only by correlation because the time span between the getImage calls is long enough to coalesce.

I didn't follow this -- how does coalescing affect the correlation IDs? Can you put together a main/worker timeline to show the condition this is trying to prevent? I would expect GeometryTile to be able to ignore any obsolete onImagesAvailable messages.

I just looked more closely at the imageCorrelationID logic and I think there's still a race condition there. 😞 We can't increment the correlation ID on receiving getImages from the worker -- we'd still be vulnerable to the same race condition I hypothesized in #9842 (comment). We have to do something like the other correlationID, where the counter is incremented before we send a message to the worker that will potentially make it require new images.

@jfirebaugh jfirebaugh mentioned this pull request Aug 23, 2017
@brunoabinader
Member Author

We have to do something like the other correlationID, where the counter is incremented before we send a message to the worker that will potentially make it require new images.

Hmm, the worker requests new images whenever it finishes (re)doing layout, which is only triggered by setData and setLayers. These setters are already covered by the original correlation ID scheme, so in theory all of these cases should already be covered.

However, I've noticed that we call attemptPlacement synchronously in all these cases. One thing we could try, then, is to make all calls to attemptPlacement async, tagged with the correlation ID that is current on the worker at the time of the call. When attemptPlacement finally runs, we would re-check whether that correlation ID still matches the worker's current one. If not, we could discard that placement attempt.
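To make the idea concrete, here is a minimal sketch of a deferred placement attempt that re-checks its correlation ID when it finally runs. `DeferredPlacementWorker`, its hand-rolled task queue, and the two counters are all hypothetical stand-ins for illustration; this is not mbgl's actor or mailbox code.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical worker: a plain queue stands in for the asynchronous mailbox.
class DeferredPlacementWorker {
public:
    void setData() { bumpAndSchedule(); }
    void setLayers() { bumpAndSchedule(); }

    // Run everything that was queued, as the run loop eventually would.
    void drain() {
        while (!tasks.empty()) {
            std::function<void()> task = std::move(tasks.front());
            tasks.pop();
            task();
        }
    }

    int placementsRun = 0;
    int placementsDiscarded = 0;

private:
    void bumpAndSchedule() {
        ++correlationID;
        // Capture the ID that is current when the attempt is scheduled.
        const std::uint64_t scheduledFor = correlationID;
        tasks.push([this, scheduledFor] {
            // Re-check at run time: a newer request makes this attempt obsolete.
            if (scheduledFor != correlationID) {
                ++placementsDiscarded;
                return;
            }
            ++placementsRun;
        });
    }

    std::uint64_t correlationID = 0;
    std::queue<std::function<void()>> tasks;
};
```

With a setData followed by a setLayers before the queue drains, the attempt scheduled by setData is discarded and only the one scheduled by setLayers runs.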

@brunoabinader
Member Author

I didn't follow this -- how does coalescing affect the correlation IDs? Can you put together a main/worker timeline to show the condition this is trying to prevent? I would expect GeometryTile to be able to ignore any obsolete onImagesAvailable messages.

Sorry for the confusion - by coalescing I meant the two onImagesAvailable calls would only generate a single placement attempt (unrelated to the worker thread coalescing mechanism).

In API.RecycleMapUpdateImages we force a scenario where most certainly the first placement attempt caused by the first onImagesAvailable is finished before the second onImagesAvailable is called. Because symbolLayoutsNeedPreparation is now false, the updated images won't be used to re-prepare the symbol layout:

| Step | Main | Worker |
| --- | --- | --- |
| 1 | Sends setData | ... |
| 2 | Sends setLayers | Receives setData ▶️ redoLayout, sends getImages |
| 3 | Receives empty getImages, sends onImagesAvailable | Receives setLayers ▶️ redoLayout, send getImages |
| 4 | Receives final getImages, sends onImagesAvailable | Receives onImagesAvailable, starts placement with empty image map (symbolLayoutsNeedPreparation is set to false) |
| 5 | ... | Receives onImagesAvailable, but because the previous one set symbolLayoutsNeedPreparation to false, does not attempt placement anymore 💥 |

So even correlation for placement attempts (as I described in the previous comment) is not enough to prevent this (though it saves lots of redundant placement attempts according to my tests!).

In this case, what we really need is to flag symbolLayoutsNeedPreparation again in step 5 to force the symbol layout to update its buckets. However, this is not enough on its own: when preparing the symbol layout in step 4 we clear the feature geometries, so when SymbolLayout::prepare is called again there is no geometry data left to prepare.

We could then either force layout to be redone (costly) or simply not clear the feature geometries (costly in memory?). I prefer the latter.

In sum: re-setting symbolLayoutsNeedPreparation back to true whenever receiving onImagesAvailable + not clearing feature geometries when preparing symbol layouts causes API.RecycleMapUpdateImages to pass again. Adding correlation for placement attempts also helps by early aborting outdated placement attempts 🎉
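The interaction between the flag and the cleared geometries can be sketched as below. This is a heavily simplified model under stated assumptions: `LayoutWorkerSketch`, `finalImagesUsed`, the integer "geometries", and the `int`-valued `ImageMap` are hypothetical, and only `symbolLayoutsNeedPreparation` is a name taken from the discussion.

```cpp
#include <map>
#include <string>
#include <vector>

using ImageMap = std::map<std::string, int>;  // image name -> stand-in for pixel data

// Hypothetical worker; only the flag and geometry handling are modelled.
struct LayoutWorkerSketch {
    LayoutWorkerSketch(bool resetFlag, bool keepGeometries)
        : resetFlagOnImages(resetFlag), keepFeatureGeometries(keepGeometries) {}

    void redoLayout() {
        featureGeometries = {1, 2, 3};
        symbolLayoutsNeedPreparation = true;
    }

    void onImagesAvailable(const ImageMap& images) {
        if (resetFlagOnImages) symbolLayoutsNeedPreparation = true;
        if (!symbolLayoutsNeedPreparation) return;  // step 5: update is dropped
        if (featureGeometries.empty()) return;      // nothing left to prepare
        preparedWith = images;
        if (!keepFeatureGeometries) featureGeometries.clear();
        symbolLayoutsNeedPreparation = false;
    }

    bool resetFlagOnImages;
    bool keepFeatureGeometries;
    bool symbolLayoutsNeedPreparation = false;
    std::vector<int> featureGeometries;
    ImageMap preparedWith;
};

// Replays steps 2-5: two layouts, an empty image reply, then the final one.
inline bool finalImagesUsed(bool resetFlag, bool keepGeometries) {
    LayoutWorkerSketch worker(resetFlag, keepGeometries);
    worker.redoLayout();
    worker.redoLayout();
    worker.onImagesAvailable(ImageMap{});
    worker.onImagesAvailable(ImageMap{{"marker", 1}});
    return worker.preparedWith.count("marker") == 1;
}
```

In this model the final images are only picked up when both changes are applied together, which matches the summary above: resetting the flag alone still finds no geometry to prepare.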

Do you agree with the proposed approach?

@brunoabinader brunoabinader changed the title Prevent ImageManager from notifying image requestors when there are dependencies left Re-prepare symbol layout when new images are available Aug 24, 2017
@kkaefer
Member

kkaefer commented Aug 24, 2017

Sorry, the correlation ID was meant to go to the main thread in the parent.invoke(&GeometryTile::getImages, pendingImageDependencies); call so that we can be sure that we got the response to the latest request.

@brunoabinader
Member Author

brunoabinader commented Aug 24, 2017

Sorry, the correlation ID was meant to go to the main thread in the parent.invoke(&GeometryTile::getImages, pendingImageDependencies); call so that we can be sure that we got the response to the latest request.

Sounds reasonable (makes RecycleMapUpdateImages pass without 4f8b754). However, we'd then need to separate the correlation of layout requests from that of placement requests.

By simply adding correlation from the worker thread into main when calling getImages (like the diff below):

diff --git a/src/mbgl/tile/geometry_tile.cpp b/src/mbgl/tile/geometry_tile.cpp
index 5aa27fb..a377e95 100644
--- a/src/mbgl/tile/geometry_tile.cpp
+++ b/src/mbgl/tile/geometry_tile.cpp
@@ -175,8 +175,11 @@ void GeometryTile::onImagesAvailable(ImageMap images) {
     worker.invoke(&GeometryTileWorker::onImagesAvailable, std::move(images), correlationID);
 }
 
-void GeometryTile::getImages(ImageDependencies imageDependencies) {
-    imageManager.getImages(*this, std::move(imageDependencies));
+void GeometryTile::getImages(ImageDependencies imageDependencies, uint64_t correlationID_) {
+    // Ignore `getImages` requests from previous requests.
+    if (correlationID_ == correlationID) {
+        imageManager.getImages(*this, std::move(imageDependencies));
+    }
 }
 
 void GeometryTile::upload(gl::Context& context) {
diff --git a/src/mbgl/tile/geometry_tile.hpp b/src/mbgl/tile/geometry_tile.hpp
index 2f4f68d..1da16f5 100644
--- a/src/mbgl/tile/geometry_tile.hpp
+++ b/src/mbgl/tile/geometry_tile.hpp
@@ -44,7 +44,7 @@ public:
     void onImagesAvailable(ImageMap) override;
     
     void getGlyphs(GlyphDependencies);
-    void getImages(ImageDependencies);
+    void getImages(ImageDependencies, uint64_t correlationID);
 
     void upload(gl::Context&) override;
     Bucket* getBucket(const style::Layer::Impl&) const override;
diff --git a/src/mbgl/tile/geometry_tile_worker.cpp b/src/mbgl/tile/geometry_tile_worker.cpp
index 6b19920..3480c45 100644
--- a/src/mbgl/tile/geometry_tile_worker.cpp
+++ b/src/mbgl/tile/geometry_tile_worker.cpp
@@ -240,7 +240,7 @@ void GeometryTileWorker::requestNewGlyphs(const GlyphDependencies& glyphDependen
 void GeometryTileWorker::requestNewImages(const ImageDependencies& imageDependencies) {
     pendingImageDependencies = imageDependencies;
     if (!pendingImageDependencies.empty()) {
-        parent.invoke(&GeometryTile::getImages, pendingImageDependencies);
+        parent.invoke(&GeometryTile::getImages, pendingImageDependencies, correlationID);
     }
 }

We get the following scenario when running Annotations.QueryFractionalZoomLevels (the number in brackets is the current correlation ID):

[ RUN      ] Annotations.QueryFractionalZoomLevels
[0x1cc2720] setData (1)
[0x1cc2720] setLayers (2)
[0x1cc2720] setLayers (3)
[0x1ccb490] setData (1)
[0x1ccb490] setLayers (2)
[0x1ccb490] setLayers (3)
[0x1cccfd0] setData (1)
[0x1cccfd0] setLayers (2)
[0x1cccfd0] setLayers (3)
[0x1cceaf0] setData (1)
[0x1cceaf0] setLayers (2)
[0x1cceaf0] setLayers (3)
[0x1ccb490] setPlacementConfig (4)
[0x1cceaf0] setPlacementConfig (4)
[0x1cc2720] setPlacementConfig (4)
[0x1cccfd0] setPlacementConfig (4)
[0x1ccb490] getImages (4) (2)
 *** current correlation id doesn't match the one passed by worker thread (4 vs 2)
[0x1cccfd0] getImages (4) (2)
 *** current correlation id doesn't match the one passed by worker thread (4 vs 2)
[0x1cc2720] getImages (4) (3)
 *** current correlation id doesn't match the one passed by worker thread (4 vs 3)
[0x1cceaf0] getImages (4) (3)
 *** current correlation id doesn't match the one passed by worker thread (4 vs 3)
[0x1ccb490] getImages (4) (3)
 *** current correlation id doesn't match the one passed by worker thread (4 vs 3)
[0x1cccfd0] getImages (4) (3)
 *** current correlation id doesn't match the one passed by worker thread (4 vs 3)

Because setPlacementConfig also increments the correlation ID, getImages would never actually reach ImageManager, and thus the render never finishes because the tiles will be kept in pending state.
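The log above can be replayed with a small model of a single shared counter. `SharedCounterTile` and `replayLog` are hypothetical names for illustration, and the model only mirrors the equality check from the experimental diff, not the real GeometryTile.

```cpp
#include <cstdint>

// Hypothetical main-thread tile with one counter shared by every setter.
struct SharedCounterTile {
    std::uint64_t correlationID = 0;
    int forwardedToImageManager = 0;

    std::uint64_t setData() { return ++correlationID; }
    std::uint64_t setLayers() { return ++correlationID; }
    std::uint64_t setPlacementConfig() { return ++correlationID; }

    // workerID is the ID the worker saw when it ran the layout.
    void getImages(std::uint64_t workerID) {
        if (workerID == correlationID) {
            ++forwardedToImageManager;
        }
    }
};

// Replays one tile from the log: IDs 1..4 are issued before any getImages arrives.
inline int replayLog() {
    SharedCounterTile tile;
    tile.setData();                                   // (1)
    const std::uint64_t second = tile.setLayers();    // (2)
    const std::uint64_t third = tile.setLayers();     // (3)
    tile.setPlacementConfig();                        // (4)
    tile.getImages(second);  // 4 vs 2: dropped
    tile.getImages(third);   // 4 vs 3: dropped
    return tile.forwardedToImageManager;
}
```

No request is ever forwarded in this replay, which is the stall described above: the placement request has already moved the shared counter past every layout ID the worker can report.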

@brunoabinader brunoabinader force-pushed the revert-9739 branch 3 times, most recently from 2d56721 to b5a97c4 Compare August 24, 2017 15:44
@brunoabinader
Member Author

brunoabinader commented Aug 24, 2017

@ChrisLoer and I just had a talk about the current state of this PR, and we agreed on the following changes:

  • Remove the asynchronous placement attempt change from this PR: this is a micro perf optimization for still mode that adds too many state changes to GeometryTileWorker (we can eventually revisit that in a separate PR)
  • Instead of splitting layout and placement correlations, focus on having a "special" image correlation ID (e.g. that would increment in the worker thread every time getImage is invoked).

I'm going to test the proposed changes above and update this PR accordingly.

@brunoabinader
Member Author

Instead of splitting layout and placement correlations, focus on having a "special" image correlation ID (e.g. that would increment in the worker thread every time getImage is invoked).

There is one drawback for this approach, see example below:

| GeometryTileWorker | GeometryTile | ImageManager |
| --- | --- | --- |
| GeometryTile::getImages | ... | ... |
| ... | ImageManager::getImages | ... |
| ... | ... | ImageManager::notify |
| ... | GeometryTileWorker::onImagesAvailable | ... |
| onImagesAvailable | ... | ... |

The image correlation would need to propagate through this entire chain. This is fine, however ImageManager::notify is also fired when we set the ImageManager to be in loaded state. This doesn't happen inside the worker thread, so there is no easy way for the worker thread to keep track of image correlation requests in this case.

@brunoabinader
Member Author

Given the issue with the special image correlation described above, I'll stick with splitting the correlation of layout and placement requests, as it solves the issues pointed out by the unit tests, doesn't impact continuous mode operation, and guarantees that Tile::pending is only set to false on the very last placement attempt made by the worker thread.

@brunoabinader brunoabinader changed the title Re-prepare symbol layout when new images are available Ensure GeometryTile::pending is only set to false in the last placement attempt Aug 24, 2017
@brunoabinader brunoabinader changed the title Ensure GeometryTile::pending is only set to false in the last placement attempt Ensure GeometryTile::pending state is only false in the last placement attempt Aug 24, 2017
@brunoabinader
Member Author

More context on the correlation changes:

  • We need to track GeometryTile::onImagesAvailable because we want to make sure pending is false (signaling the tile is ready to be rendered in still mode) only when the last placement (that could also be caused by onImagesAvailable) happens.
  • We need to track GeometryTile::getImages so we can skip image requests from older layout requests.
  • We track the last completed layout so we can be sure that there will be no more ongoing placement attempts caused by layout changes.
  • Finally, we split layout and placement correlations so we can prevent cases where onImagesAvailable won't end up attempting placement (e.g. when it's called twice in API.RecycleMapUpdateImages), thus never updating the correlation check in GeometryTile::onPlacement that makes pending go to false.

Contributor

@ChrisLoer ChrisLoer left a comment

From #9842 (comment):

The image correlation would need to propagate through this entire chain.

Yes, I think that's true. But it also doesn't seem to me that the layout/placement ID approach solves this problem (I know the tests pass, but I think that may just mean we should build some unit tests designed to go specifically at some of these edge cases).

This is fine, however ImageManager::notify is also fired when we set the ImageManager to be in loaded state. This doesn't happen inside the worker thread, so there is no easy way for the worker thread to keep track of image correlation requests in this case.

I don't understand why this is a problem -- in the setLoaded function, we're still calling notify for individual requestors, and each request would come along with a correlation ID that we could pass back to onImagesAvailable. Each call to getImages (with a unique ID) would still trigger exactly one call to onImagesAvailable (with the same ID): either it would happen immediately, or it would happen later in setLoaded
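A minimal sketch of that argument, assuming a hypothetical `ImageManagerSketch` that parks requests together with their correlation ID until the sprite is loaded. The class, its `parked` list, and the `notified` record are illustrative only and are not mbgl's ImageManager.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in: every getImages call carries an ID, and that same ID
// is handed back exactly once, either right away or later from setLoaded.
class ImageManagerSketch {
public:
    void getImages(std::uint64_t correlationID) {
        if (loaded) {
            notify(correlationID);            // answered immediately
        } else {
            parked.push_back(correlationID);  // answered later in setLoaded
        }
    }

    void setLoaded() {
        loaded = true;
        for (std::uint64_t id : parked) {
            notify(id);
        }
        parked.clear();
    }

    // IDs passed back to the requestor, in order.
    std::vector<std::uint64_t> notified;

private:
    void notify(std::uint64_t id) { notified.push_back(id); }

    bool loaded = false;
    std::vector<std::uint64_t> parked;
};
```

Because the ID travels with the parked request, the code path that calls setLoaded does not need to know anything about the worker's counter.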

Finally, we split layout and placement correlations so we can prevent cases where onImagesAvailable won't end up attempting placement (e.g. when it's called twice in API.RecycleMapUpdateImages), thus never updating the correlation check in GeometryTile::onPlacement that makes pending go to false.

This is the case I'm worried about. It seems like it's still possible with the current logic? (see my comments on GeometryTile::getImages).

Man, this stuff is tricky to think about!

if (mode == MapMode::Continuous) {
placementThrottler.invoke();
} else {
worker.invoke(&GeometryTileWorker::setPlacementConfig, *requestedConfig, placementCorrelationID);
Contributor

Replacing this with a call to invokePlacement() here would have the same effect and make it a little more clear it was following the same path.

@@ -133,7 +141,9 @@ void GeometryTile::onLayout(LayoutResult result) {
 void GeometryTile::onPlacement(PlacementResult result) {
     loaded = true;
     renderable = true;
-    if (result.correlationID == correlationID) {
+    if (result.layoutCorrelationID == lastLayoutCorrelationID
Contributor

Here's how I read this. If:

  • This placement result is using the same layout as the last layout result
  • The last layout result was for the most recent layout request
  • This placement result is for the most recent placement request

Then mark this tile no longer pending. Is that right?

The logic before this change was, "if this placement is the first placement to happen in response to the last call to any of setData/setLayers/setPlacementConfig, then mark this tile as no longer pending". In order for that logic to work, the critical assumption was that while the worker might receive onImagesAvailable or onGlyphsAvailable, it would only do placement if it had everything it needed to proceed to completion.
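The three conditions listed above can be written out as a single predicate. This is a sketch, not the reviewed code: the snippet under review is truncated after its first comparison, so apart from `result.layoutCorrelationID` and `lastLayoutCorrelationID` every field name here (`TileStateSketch`, `PlacementResultSketch`, and the remaining counters) is an assumption.

```cpp
#include <cstdint>

// Hypothetical placement result carrying both IDs.
struct PlacementResultSketch {
    std::uint64_t layoutCorrelationID;     // layout this placement was built from
    std::uint64_t placementCorrelationID;  // placement request it answers
};

// Hypothetical main-thread bookkeeping.
struct TileStateSketch {
    std::uint64_t layoutCorrelationID = 0;      // most recent layout request
    std::uint64_t lastLayoutCorrelationID = 0;  // most recent layout result
    std::uint64_t placementCorrelationID = 0;   // most recent placement request
    bool pending = true;

    void onPlacement(const PlacementResultSketch& result) {
        const bool usesLastLayout =
            result.layoutCorrelationID == lastLayoutCorrelationID;
        const bool layoutIsCurrent =
            lastLayoutCorrelationID == layoutCorrelationID;
        const bool placementIsCurrent =
            result.placementCorrelationID == placementCorrelationID;
        if (usesLastLayout && layoutIsCurrent && placementIsCurrent) {
            pending = false;
        }
    }
};
```

Any one of the three comparisons failing keeps the tile pending, so a placement built from an older layout, or answering an older placement request, cannot release it.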

Member Author

This placement result is using the same layout as the last layout result
The last layout result was for the most recent layout request
This placement result is for the most recent placement request
Then mark this tile no longer pending. Is that right?

💯

The logic before this change was, "if this placement is the first placement to happen in response to the last call to any of setData/setLayers/setPlacementConfig, then mark this tile as no longer pending". In order for that logic to work, the critical assumption was that while the worker might receive onImagesAvailable or onGlyphsAvailable, it would only do placement if it had everything it needed to proceed to completion.

Precisely - and as exemplified in API.RecycleMapUpdateImages, that is an assumption we cannot rely on.

@@ -84,14 +84,18 @@ void GeometryTile::setPlacementConfig(const PlacementConfig& desiredConfig) {
     // state despite pending parse operations.
     pending = true;

-    ++correlationID;
+    ++placementCorrelationID;
Contributor

If I understand right here, incrementing the placementCorrelationID on every call here is important even if we're going to throttle the placement request, so that the tile stays in the "pending" state until the final placement?

Member Author

so that the tile stays in the "pending" state until the final placement?

👍

@@ -166,11 +176,14 @@ void GeometryTile::getGlyphs(GlyphDependencies glyphDependencies) {
 }

 void GeometryTile::onImagesAvailable(ImageMap images) {
-    worker.invoke(&GeometryTileWorker::onImagesAvailable, std::move(images));
+    worker.invoke(&GeometryTileWorker::onImagesAvailable, std::move(images), ++placementCorrelationID);
Contributor

This part of the logic makes me nervous. If onImagesAvailable doesn't cause a placement result to come back (i.e. for whatever reason symbolLayoutsNeedPreparation is false), you'll end up with the tile stalled in the "pending" state (because symbolDependenciesChanged won't trigger a placement).

For this to work, it has to be the case that symbolLayoutsNeedPreparation will always be true whenever onImagesAvailable gets called. To guarantee that, you'd need the cooperation of any code that might call onImagesAvailable (so ImageManager would have to be able to do something like cancel an in-progress request for a tile if the tile received a new setData call).

Member Author

You are right - I believe I had code in my prior attempts that would set symbolLayoutsNeedPreparation to true, but now that we have a broader understanding of its implications I agree this is an error-prone approach.

-    imageManager.getImages(*this, std::move(imageDependencies));
+void GeometryTile::getImages(ImageDependencies imageDependencies, uint64_t layoutCorrelationID_) {
+    // Ignore image requests from previous layout requests.
+    if (layoutCorrelationID == layoutCorrelationID_) {
Contributor

It seems like either (1) we don't need this check, or (2) this check isn't enough. The reason I say it isn't enough is that something like a setData or setLayers could be called after getImages, but before the ImageManager sent a result back via onImagesAvailable.

On the other hand, if it's always safe to call onImagesAvailable, we don't need the check at all (except maybe as what I think would be a tiny/edge-case performance optimization).

Member Author

On the other hand, if it's always safe to call onImagesAvailable, we don't need the check at all (except maybe as what I think would be a tiny/edge-case performance optimization).

Agreed - and with your insight in #9842 (review) I have to apologize, as I didn't properly notice that in the image-correlation-ID-coming-from-the-worker approach we would already have populated ImageManager::requestors with that image correlation ID info in getImages.

@brunoabinader
Member Author

Thank you so much for the thoughtful review @ChrisLoer ❤️ I am now confident that your image correlation ID approach is our best chance to tackle this problem without adding too much extra state.

I just want to point out that with that approach we'll have two correlation schemes: one controlled by the main thread for checking when to mark the tile as non-pending, and the other in the worker thread for deciding whether to ignore or proceed with image request replies coming from the main thread. I'll add this information in a short comment in GeometryTileWorker.
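The worker-side half of that scheme might look roughly like the sketch below. `imageCorrelationID` is the name used in this thread; `ImageCorrelationWorker`, `replayRace`, and the placement counter are hypothetical, and the replay follows the race hypothesized earlier in this conversation rather than an actual trace.

```cpp
#include <cstdint>

// Hypothetical worker: it owns the image counter and bumps it per request.
class ImageCorrelationWorker {
public:
    // Called at the end of each layout pass; the returned ID travels with
    // the getImages request to the main thread.
    std::uint64_t requestNewImages() { return ++imageCorrelationID; }

    // Reply relayed by the main thread, tagged with the ID of its request.
    bool onImagesAvailable(std::uint64_t replyID) {
        if (replyID != imageCorrelationID) {
            return false;  // reply to an older request: ignore it
        }
        ++placementsStarted;
        return true;
    }

    int placementsStarted = 0;

private:
    std::uint64_t imageCorrelationID = 0;
};

// Two layouts are requested before the first reply arrives.
inline int replayRace() {
    ImageCorrelationWorker worker;
    const std::uint64_t first = worker.requestNewImages();   // after setData
    const std::uint64_t second = worker.requestNewImages();  // after setLayers
    worker.onImagesAvailable(first);   // stale: must not start placement
    worker.onImagesAvailable(second);  // current: placement proceeds
    return worker.placementsStarted;
}
```

Only the reply that answers the latest request starts placement in this model, so the premature placement from the hypothesized timeline does not occur.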

@ChrisLoer
Contributor

Great! I don't see a problem with the current approach. We might be able to come up with a better name for what we're doing than imageCorrelationID, but I don't have a great suggestion...

I would still like to get @jfirebaugh 's eyes on the changes before merging.

Contributor

@jfirebaugh jfirebaugh left a comment

Seems like a solid approach to me. I haven't reasoned through it in detail, but I trust your joint analysis.

Is the glyph request process susceptible to a similar issue? Do we need a glyphCorrelationID?

@ChrisLoer
Contributor

Is the glyph request process susceptible to a similar issue? Do we need a glyphCorrelationID?

I think we're OK there because there aren't any runtime operations that could cause the result for a given fontstack/glyph ID to change once it was loaded the first time.

@brunoabinader
Member Author

Thank you @ChrisLoer and @jfirebaugh - and apologies for the amount of rework to find the best approach for this 🙇‍♂️

@brunoabinader brunoabinader merged commit fe8cbc7 into master Aug 25, 2017
@brunoabinader brunoabinader deleted the revert-9739 branch August 25, 2017 20:42
@ChrisLoer
Contributor

🎉 💥

Labels
bug Core The cross-platform C++ core, aka mbgl
4 participants