Clear external tileset skeletons from tile tree to save memory usage #1107

Merged
merged 40 commits into from
Feb 27, 2025

azrogers
Contributor

Closes #739. Currently, though the content of tiles is unloaded when no longer used, the "skeletons" - the Tile objects - created by loading external tilesets are never unloaded. This can cause memory usage to steadily increase. This change implements a _doNotUnloadCount number on each tile that tracks situations where the tile's pointers are still in use. When a tile is in a situation where its pointer is being used - the tile is being loaded, for example - it increments this count on the tile and each of its parent tiles, and when the pointer is no longer needed, this counter is decremented up the tree as well. This means we can clear the children of external tilesets when their _doNotUnloadCount number is 0. This implementation also includes a TileDoNotUnloadCountTracker class, enabled with the CESIUM_DEBUG_TILE_UNLOADING switch, that tracks the source of every modification to a tile's _doNotUnloadCount.
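
To make the counting scheme concrete, here is a minimal sketch of a count that is incremented and decremented on a tile and all of its ancestors. `SketchTile` and its member names are hypothetical stand-ins invented for illustration; they are not the actual `Tile` API, and the real implementation may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tile; only the fields needed to illustrate
// the counting scheme are included.
struct SketchTile {
  SketchTile* parent = nullptr;
  int32_t doNotUnloadCount = 0;

  // Called when something starts holding a pointer to this tile.
  // The count is raised on this tile and on every ancestor.
  void incrementDoNotUnloadCount() {
    for (SketchTile* t = this; t != nullptr; t = t->parent) {
      ++t->doNotUnloadCount;
    }
  }

  // Called when that pointer is no longer needed.
  void decrementDoNotUnloadCount() {
    for (SketchTile* t = this; t != nullptr; t = t->parent) {
      assert(t->doNotUnloadCount > 0);
      --t->doNotUnloadCount;
    }
  }

  // Children may only be cleared when nothing below still needs them.
  bool canClearChildren() const { return doNotUnloadCount == 0; }
};
```

With this shape, a pointer held anywhere in a subtree keeps every ancestor's count above zero, so checking a single tile's count is enough to know whether its children are safe to clear.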

Member

@kring kring left a comment

Thanks @azrogers, this is so good! I tried flying around with Google Photorealistic 3D Tiles in Cesium for Unreal, and memory usage stays extremely steady. Previously I believe it would have gone up quite quickly. I didn't see any crashes or other dodgy behavior either. This will be a major improvement for our users!

Mostly small comments here, but I did notice a couple of cases where I think there's potential for (rare) bad behavior.

I think it's also worth taking a bit of time to think through whether there could be any other gotchas like this, and generally do everything we can to thoroughly test everything. Perhaps bring back the soak test from #1415?

@azrogers
Contributor Author

@kring Looks like reversing the direction of iteration, as well as unloading empty tiles, worked great! I'll take a look at that soak test and think of some other ways we could verify correct behavior here.

Member

@kring kring left a comment

A few more small things here. Also please update CHANGES.md.

@azrogers
Contributor Author

Bringing the soak test up-to-date (now in CesiumGS/cesium-unreal#1615 for reasons detailed in that PR) did identify some crashes, including issues with sampleHeightMostDetailed. I'll try to fix those now.

@azrogers
Contributor Author

I realize now that sampleHeightMostDetailed never showed up on your list of Tile pointer references in the previous PR because the functionality hadn't been implemented at that point 😅

@azrogers
Contributor Author

It's now tracking most of the tile pointer usages in TilesetHeightQuery, though it still crashes when running the test in Unreal. However, it also crashes when running the soak test from CesiumGS/cesium-unreal#1615, so I might in fact be correctly tracking all the usages in TilesetHeightQuery and it's crashing from an unrelated pointer use that I haven't found yet. Still looking into it, but might need to finish up solving this on Monday.

@azrogers
Contributor Author

Ok, I believe I've finally solved it! Turns out, there are two reasons that we do in fact need to remove children from _loadedTiles recursively:

  1. It's possible for a tile to be created Unloaded (by Tile::createChildTiles), get visited and added to _loadedTiles, but not get loaded by the time the parent external tileset is unloaded and has its children cleared. Because the tile never had to go through unloadTileContent to become Unloaded, it doesn't count towards the count of tiles still yet to be unloaded, but it nevertheless is in _loadedTiles.
  2. The root tiles of loaded external tilesets, which are created as empty tiles with children, will be considered unconditionally refined unloaded external tilesets if we set them to Unloaded and remove their content. This is what was responsible for the "Children already created" issues. The fix for this is simply not letting empty tiles count towards the "still not unloaded" count of their parents, and cleaning them up as we're clearing their external tileset parent's children. You can read my full reasoning here.

I ran some quick comparisons with the tile loading soak test (CesiumGS/cesium-unreal#1615) to show that this does indeed clean up external tilesets:

[Chart: Total Unloaded Tiles Over Time (main)]

[Chart: Total Unloaded Tiles Over Time (unload-external-tilesets-2)]

@kring kring added this to the March 2025 Release milestone Feb 21, 2025
@kring
Member

kring commented Feb 24, 2025

For my own reference, I created an inventory of how the two counts in a Tile are used:

_doNotUnloadCount

  • Increment
    • When the value in a child tile is incremented, the parent tile is, too.
    • When a tile is added to the tilesFadingOut list.
    • When a tile starts async loading.
    • When a tile is added to candidateTiles / additiveCandidateTiles lists for height queries.
    • When a tile is added to _heightQueryLoadQueue
  • Decrement
    • When the value in a child tile is decremented, the parent tile is, too.
    • When a tile is removed from the tilesFadingOut list.
    • When a tile finishes async loading.
    • When a tile is removed from candidateTiles / additiveCandidateTiles lists for height queries.
    • When a tile is removed from _heightQueryLoadQueue
  • Usage
    • In TilesetContentManager::unloadTileContent, if the tile has external content, it is not unloaded if this count is greater than 0.

Summary:
This counter tracks the number of outstanding pointers to this Tile or its children. When there are any outstanding pointers at all, we can't destroy the tile.
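
As an illustration of how one increment/decrement pair from the inventory above fits together, here is a sketch of a list that adjusts the count whenever a tile is added or removed. `CountedTile`, `adjustCount`, and `FadingOutList` are hypothetical names made up for this example; the real tilesFadingOut list is managed differently in the codebase.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-in for a tile with a parent link and a count.
struct CountedTile {
  CountedTile* parent = nullptr;
  int32_t doNotUnloadCount = 0;
};

// Applies delta to the tile and to every ancestor.
inline void adjustCount(CountedTile& tile, int32_t delta) {
  for (CountedTile* t = &tile; t != nullptr; t = t->parent) {
    t->doNotUnloadCount += delta;
    assert(t->doNotUnloadCount >= 0);
  }
}

// Illustrative list that holds tile pointers. Each stored pointer is
// matched by exactly one increment, released on removal.
class FadingOutList {
public:
  void add(CountedTile& tile) {
    // Only count the tile once, even if it is added repeatedly.
    if (_tiles.insert(&tile).second) {
      adjustCount(tile, +1);
    }
  }

  void remove(CountedTile& tile) {
    // Only decrement if the tile was actually in the list.
    if (_tiles.erase(&tile) > 0) {
      adjustCount(tile, -1);
    }
  }

private:
  std::unordered_set<CountedTile*> _tiles;
};
```

Tying the count change to the success of the insert or erase keeps increments and decrements balanced, which is the property the assertions in this PR depend on.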

_tilesStillNotUnloadedCount

  • Increment:
    • When the value in a child tile is incremented, the parent tile is, too.
    • In createChildTiles, tilesStillNotUnloadedCount of new children is added to current tile and propagated up the tree.
    • In Tile constructor if the tile starts in a state other than Unloaded, and it's not UnknownContent, and it's not EmptyContent.
    • When TilesetContentManager::setTileContent transitions the tile to the ContentLoaded state and it's not UnknownContent and it's not EmptyContent.
  • Decrement:
    • When the value in a child tile is decremented, the parent tile is, too.
    • In TilesetContentManager::unloadTileContent when it transitions the tile back to the Unloaded state.
  • Usage:
    • In TilesetContentManager::unloadTileContent, if the tile has external content, it is not unloaded if this count is greater than 1.

Summary:
This counter tracks the number of tiles in this subtree (counting itself) that have content that needs to be unloaded. All content in the subtree must be unloaded before the external tileset can be unloaded and the children can be cleared.

@kring
Member

kring commented Feb 24, 2025

Based on the above, I think it's possible to collapse this back down to a single count, _doNotUnloadCount. Basically, everywhere we would increment/decrement _tilesStillNotUnloadedCount, we instead increment/decrement _doNotUnloadCount but starting with the parent tile instead of the current one. The logic is that when a tile has loaded content, the parent tile's subtree cannot yet be unloaded.

Also, let's rename _doNotUnloadCount to _doNotUnloadSubtreeCount. A tile's renderable content can be unloaded no matter the value of this count; it's only the subtree that cannot be unloaded.
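
A rough sketch of the single-count idea, under the assumption that pointer references start counting at the tile itself while loaded content starts counting at the parent. `SubtreeTile` and the `on...` helper names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tile; not the real Tile class.
struct SubtreeTile {
  SubtreeTile* parent = nullptr;
  bool contentLoaded = false;
  int32_t doNotUnloadSubtreeCount = 0;
};

// Applies delta to `start` and every ancestor. `start` may be null,
// which happens when a root tile's content is loaded.
inline void adjustFrom(SubtreeTile* start, int32_t delta) {
  for (SubtreeTile* t = start; t != nullptr; t = t->parent) {
    t->doNotUnloadSubtreeCount += delta;
    assert(t->doNotUnloadSubtreeCount >= 0);
  }
}

// An outstanding pointer protects the tile itself and all ancestors.
inline void onPointerAcquired(SubtreeTile& tile) { adjustFrom(&tile, +1); }
inline void onPointerReleased(SubtreeTile& tile) { adjustFrom(&tile, -1); }

// Loaded content protects the parent's subtree, so counting starts
// one level up rather than at the tile that holds the content.
inline void onContentLoaded(SubtreeTile& tile) {
  assert(!tile.contentLoaded);
  tile.contentLoaded = true;
  adjustFrom(tile.parent, +1);
}

inline void onContentUnloaded(SubtreeTile& tile) {
  assert(tile.contentLoaded);
  tile.contentLoaded = false;
  adjustFrom(tile.parent, -1);
}
```

Because content counts start at the parent, a tile's own count can be zero while its content is loaded, which matches the point that the count only governs whether the subtree can be unloaded.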

Member

@kring kring left a comment

This is looking good! A few suggestions that might help clean things up a bit.

@@ -122,9 +163,19 @@ void Tile::createChildTiles(std::vector<Tile>&& children) {
    throw std::runtime_error("Children already created.");
  }

  int32_t prevLoadedContentsCount = this->_tilesStillNotUnloadedCount;
Member

I don't think it's possible for any of these children to have loaded content at this point. I can't think of how they would, at least. Am I mistaken?

Contributor Author

It turns out it is possible - in TilesetJsonLoader::createLoader when we call createChildTiles with the root tile from parseTilesetJson. When parseTilesetJson is called with an implicit tileset, that root tile is created with external content which gives it the ContentLoaded state. Checking in createChildTiles if any children have the ContentLoaded state and ticking up the _doNotUnloadSubtreeCount for each one solves the issue.

Member

Ah, interesting! Implicit tiling has been a bit of a blind spot for me in reviewing this so far. We should definitely make sure implicit tilesets are still working well in this PR, if you haven't already.

It doesn't need to be part of this PR, but we should eventually support unloading implicit tilesets, too. The rules there are a bit different, though. With an external tileset, once we're sure the entire thing is unused, we can unload all of it. It's all or nothing. However, with implicit tiling, we can recreate explicit tiles for any part of the implicit tree, so it's valid to unload any unused subtree.

@azrogers
Contributor Author

Removing _tilesStillNotUnloadedCount and just incrementing the _doNotUnloadSubtreeCount on the tile's parent worked great! Unfortunately I have yet to figure out exactly how to resolve the empty tile content issue in TilesetJsonLoader as you described - will experiment further.

@azrogers
Contributor Author

@kring Successfully got the fix for empty tile handling implemented. I believe that's everything!

Member

@kring kring left a comment

This is looking better and better, but I still have a few comments. Thanks for your patience and ongoing work on this @azrogers!

Comment on lines +169 to +171
  if (tile.getState() == TileLoadState::ContentLoaded) {
    ++this->_doNotUnloadSubtreeCount;
  }
Member

This is still to handle the root tile of implicit tilesets, which are constructed with TileExternalContent, right?

If so, I don't have high confidence it will work correctly in all cases. For one thing, an implicit root need not be at the root tile of the tileset.json (though it often is). Second, this code is only incrementing the count on this tile, not propagating it up the tree.

Contributor Author

Messed around with the behavior here. We're now adding the _doNotUnloadSubtreeCount of the children to the current tile's count. I believe this is OK as there shouldn't be any other pointers to the child tiles yet if they're being passed to this method, so the count should just represent whether any of the children are loaded. I've also changed it to propagate the counter up through the tree. So I believe the way it's written now, an implicit root buried in the tree of a tileset.json would load like:

  1. parseTileJsonRecursively returns the implicit root tile with the external content and the ContentLoaded state.
  2. The parent tile in the tree passes that implicit root to createChildTiles, which updates the parent tile's _doNotUnloadSubtreeCount as the implicit root tile is ContentLoaded.
  3. The parent tile's parent tile passes the parent tile to createChildTiles, which adds the parent tile's _doNotUnloadSubtreeCount to its own counter.
  4. Repeat up the tree until we reach the root.

I think this covers all our bases?
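
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: `LoaderTile` is a hypothetical stand-in for Tile, `contentLoaded` stands in for the ContentLoaded state, and the re-pointing of grandchildren is a shortcut this sketch needs because it has no parent-fixing move constructor.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for Tile; names and layout are assumptions.
struct LoaderTile {
  LoaderTile* parent = nullptr;
  std::vector<LoaderTile> children;
  bool contentLoaded = false; // stands in for the ContentLoaded state
  int32_t doNotUnloadSubtreeCount = 0;

  void createChildTiles(std::vector<LoaderTile>&& newChildren) {
    if (!children.empty()) {
      throw std::runtime_error("Children already created.");
    }
    children = std::move(newChildren);

    int32_t added = 0;
    for (LoaderTile& child : children) {
      child.parent = this;
      // The child may have been moved since its own children were
      // attached, so re-point them at the child's final address.
      for (LoaderTile& grandchild : child.children) {
        grandchild.parent = &child;
      }
      // Inherit what the child's subtree already accounts for, plus one
      // if the child itself arrives with loaded content.
      added += child.doNotUnloadSubtreeCount + (child.contentLoaded ? 1 : 0);
    }

    // Apply to this tile and propagate to any ancestors already attached.
    for (LoaderTile* t = this; t != nullptr; t = t->parent) {
      t->doNotUnloadSubtreeCount += added;
    }
  }
};
```

When the tree is built bottom-up, each call folds the counts of the level below into the current tile, so a loaded implicit root buried several levels down is still reflected in the count at the top.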

Member

Almost! I tried out the S2 globe, and saw a crash (_doNotUnloadSubtreeCount assertion failure) when entering Play in Editor mode and then exiting it. The problem was here:

On that line, thanks to your changes to createChildTiles, the _doNotUnloadSubtreeCount of tile is 6, as it should be. But then it's returning that tile instance from a function whose return type is std::optional<Tile>. If it were just Tile, we'd get named return value optimization, and there would be no issue. But since it's a std::optional, the Tile gets moved into the optional. And remember how, at my suggestion, the move constructor doesn't copy _doNotUnloadSubtreeCount? So the function ends up returning a Tile with a _doNotUnloadSubtreeCount of zero! 😱

My suggestion was reasonable (I think) when the _doNotUnloadSubtreeCount was only used to track pointers to the Tile instance. But now it has that dual purpose where it also tracks loaded content in this Tile and its subtree. That's what the 6 represents in this case, and it's essential that those counts move into the new Tile, or else we'll have a bug.

With the count having a dual purpose, there's no way to tell during the move operation which counts should move and which shouldn't. So one solution is to go back to two different counts. But as a practical matter, we take great pains to avoid moving out of Tile instances with outstanding pointers to them, because this is the sort of thing that would quickly lead to bugs. So while it feels a little bit wrong, I think we can safely say that any Tile that is the source of a move only has counts as a result of content, not pointers. And thus we should move them all.

So I made that change, and it fixed the S2 bug.
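
For anyone following along, the effect can be reproduced in isolation. The two structs below are hypothetical and contain nothing but the count; they only contrast a move constructor that leaves the count behind with one that carries it along, when a local is returned from a function whose return type is std::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// Move constructor that deliberately does not transfer the count,
// mirroring the earlier behavior described above.
struct TileDroppingCount {
  int32_t doNotUnloadSubtreeCount = 0;
  TileDroppingCount() = default;
  TileDroppingCount(TileDroppingCount&&) noexcept
      : doNotUnloadSubtreeCount(0) {}
};

// Move constructor that transfers the count and zeroes the source.
struct TileMovingCount {
  int32_t doNotUnloadSubtreeCount = 0;
  TileMovingCount() = default;
  TileMovingCount(TileMovingCount&& rhs) noexcept
      : doNotUnloadSubtreeCount(
            std::exchange(rhs.doNotUnloadSubtreeCount, 0)) {}
};

// Returns a local T through std::optional<T>. The local cannot be
// constructed in place in the return value, so T's move constructor
// runs to build the optional's contained object.
template <typename T> std::optional<T> makeTile() {
  T tile;
  tile.doNotUnloadSubtreeCount = 6;
  return tile;
}
```

Here `makeTile<TileDroppingCount>()` yields a tile whose count is 0, while `makeTile<TileMovingCount>()` yields one whose count is 6.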

I'm going to make sure CI is happy and do a bit more testing, and then merge this!

@kring
Member

kring commented Feb 27, 2025

Found another bug. This one might be tricky to reproduce, but fortunately I was running in the debugger at the time and the cause is clear enough.

It was triggered by a strange connection error (not sure of the cause) and generated this log before the crash:

[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: request failed, libcurl error: 55 (Failed sending data to the peer)
[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 0 (Found bundle for host: 0x853468b18c0 [serially])
[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 1 (Connection #261 isn't open enough, can't reuse)
[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 2 (Connection #262 isn't open enough, can't reuse)
[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 3 (Hostname tile.googleapis.com was found in DNS cache)
[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 4 (  Trying 172.217.167.106:443...)
[[2025.02.27-04.28.54:309][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 5 (Connected to tile.googleapis.com (172.217.167.106) port 443)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 6 (ALPN: curl offers http/1.1)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 7 (SSL reusing session ID)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 8 (TLSv1.3 (OUT), TLS handshake, Client hello (1):)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 9 (TLSv1.3 (IN), TLS handshake, Server hello (2):)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 10 (TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 11 (TLSv1.3 (IN), TLS handshake, Finished (20):)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 12 (TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 13 (TLSv1.3 (OUT), TLS handshake, Finished (20):)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 14 (SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 15 (ALPN: server accepted http/1.1)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 16 (Server certificate:)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 17 ( subject: CN=upload.video.google.com)
[[2025.02.27-04.28.54:310][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 18 ( start date: Feb  3 08:37:09 2025 GMT)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 19 ( expire date: Apr 28 08:37:08 2025 GMT)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 20 ( subjectAltName: host "tile.googleapis.com" matched cert's "*.googleapis.com")
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 21 ( issuer: C=US; O=Google Trust Services; CN=WR2)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 22 ( SSL certificate verify ok.)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 23 (using HTTP/1.1)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 24 (Send failure: Connection was aborted)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 25 (OpenSSL SSL_write: Connection was aborted, errno 10053)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 26 (Failed sending HTTP request)
[[2025.02.27-04.28.54:311][123]]LogHttp: Warning: 000008538E8E4B00: libcurl info message cache 27 (Connection #263 to host tile.googleapis.com left intact)
Exception thrown at 0x00007FFA85BABB0A in UnrealEditor-Win64-DebugGame.exe: Microsoft C++ exception: std::runtime_error at memory location 0x0000002E4FD76370.
Exception thrown at 0x00007FFA85BABB0A in UnrealEditor-Win64-DebugGame.exe: Microsoft C++ exception: std::runtime_error at memory location 0x0000002E4FD73900.
[[2025.02.27-04.28.54:318][123]]LogCesium: Error: [2025-02-27 15:28:54.318] [error] [TilesetContentManager.cpp:1127] An unexpected error occurs when loading tile: Request failed.

The assertion is in _clearChildrenRecursively. The child tile's _doNotUnloadSubtreeCount is zero (yay!) but it's in the ContentLoading state (boo!).

I believe the problem is in TilesetContentManager::loadTileContent. In the catchInMainThread block (which we definitely hit, as evidenced by the log), it decrements the _doNotUnloadSubtreeCount, but it does not set the tile's load state to Failed. So the fact that it's ContentLoading is actually incorrect; it was just never reset after the failure.

This is not a new problem, but previously the tile would have just gotten stuck in that ContentLoading state with no other major ill effects. Now it causes an assertion failure.

The most likely way to hit this is a connection failure.
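
A simplified, synchronous sketch of the invariant involved: whenever the count taken for loading is released, the tile should also leave the ContentLoading state, including on the error path. Everything here (`SketchLoadState`, `LoadingTile`, `loadTileContentSketch`, the `fetch` callback) is hypothetical; the real loadTileContent is asynchronous and structured differently.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical, reduced set of load states for illustration.
enum class SketchLoadState { Unloaded, ContentLoading, ContentLoaded, Failed };

struct LoadingTile {
  SketchLoadState state = SketchLoadState::Unloaded;
  int32_t doNotUnloadSubtreeCount = 0;
};

// `fetch` stands in for the asynchronous request; it throws on failure.
inline void loadTileContentSketch(
    LoadingTile& tile,
    const std::function<void()>& fetch) {
  tile.state = SketchLoadState::ContentLoading;
  ++tile.doNotUnloadSubtreeCount; // the in-flight load references the tile

  try {
    fetch();
    tile.state = SketchLoadState::ContentLoaded;
  } catch (const std::exception&) {
    // Without this assignment the tile would stay in ContentLoading
    // even though its count is about to drop back to zero.
    tile.state = SketchLoadState::Failed;
  }

  assert(tile.doNotUnloadSubtreeCount > 0);
  --tile.doNotUnloadSubtreeCount; // the load no longer references the tile
}
```

After a failed fetch, the tile ends up in the Failed state with a count of zero, so code that clears children never sees a tile that still claims to be loading.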
@kring
Member

kring commented Feb 27, 2025

I was actually able to reproduce the assertion failure quite easily by disconnecting my network temporarily, and submitted a fix.

@kring
Member

kring commented Feb 27, 2025

Thanks for the major effort here, @azrogers! This is going to be a wonderful improvement for our users. Merging!

@kring kring merged commit d3f3e02 into main Feb 27, 2025
22 checks passed
@kring kring deleted the unload-external-tilesets-2 branch February 27, 2025 06:41
Development

Successfully merging this pull request may close these issues.

Memory leak when using external tilesets.
2 participants