Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental Static Regeneration (ISR) pages not re-rendering after revalidate expiry #24806

Closed
tommarshall opened this issue May 5, 2021 · 2 comments · Fixed by #27335
Closed
Labels
bug Issue was opened via the bug report template.

Comments

@tommarshall
Copy link
Contributor

What version of Next.js are you using?

10.0.5

What version of Node.js are you using?

12.20.0

What browser are you using?

Chrome & cURL

What operating system are you using?

macOS

How are you deploying your application?

next start on Azure App Service

Describe the Bug

Under certain circumstances the Incremental Static Regeneration (ISR) logic is incorrectly prolonging the life of rendered page content beyond the revalidation expiry, potentially indefinitely.

Those circumstances:

  • The site has sufficient pages (and page size) to exhaust the 50MB LRUCache size.
  • The site has sufficient traffic that a given page may well be removed from the LRUCache within the revalidate period.

For context, we're experiencing this on a site with:

  • Around 2,000 pages, total.
  • A mean average LRUCache item size (page size) of 420KB.
  • So around enough space for 120 pages (6%) to fit in the LRUCache at one time.
  • Around 75k* daily page views.
  • A revalidate duration of 300 (5 minutes).

*This is the figure from analytics. We have quite a few pages listing large quantities of internal links at a time, which with Next/Link's prefetch behaviour amplifies this figure somewhat as far at the ISR cache is concerned.

Next's incremental cache logic makes use of an in-memory LRUCache (least recently used cache) with a max size of 50MB, with a fallback to the filesystem for anything not found in the LRUCache.

I believe there's a bug in the incremental cache logic where if a page is not present in the LRUCache, then when falling back to the filesystem cache the (potentially) stale file content is added back to the LRUCache for a full fresh revalidation period, which is a process that can repeat itself if the page keeps being pushed out of the LRUCache before the next request comes in after the revalidation period expires.

This creates the potential for stale content to be served in perpetuity for certain pages.

Expected Behavior

Under the stale-while-revalidate (SWR) caching model for the pages which make use of ISR, any request for a page which was last generated more than the defined revalidate seconds ago should trigger the page to be regenerated in the background.

To Reproduce

Due to its nature, this is a tricky one to reproduce. You need a site with enough ISR pages (and a big enough page size) to exhaust the 50MB LRUCache space, and ideally some visibility into the cache state so that you can follow whether the LRUCache is being hit, whether it's falling back to the filesystem, and when the cache is determining that a page is considered stale (should be regenerated by next-server).

To that end I've created a branch which reduces the cache size to 1MB to make this easier to reproduce, and adds some logging to make this easier to follow, but it is possible to reproduce this without that branch, just harder. Setting the revalidate to a shorter period also makes life easier here. I found 60 (1 min) a good balance, but I believe this bug is present for any revalidate duration.

  1. Ensure there are no cached pages - rm -rf .next/
  2. Rebuild the site - yarn build
  3. Serve the site - yarn start

At this point the LRUCache is empty, and the filesystem cache is populated for any pages that are prebuilt during the yarn build.

  1. Fill up the LRUCache, by requesting enough pages to exhaust the 50MB (Or 1MB if you're using the investigate-isr-pages-not-re-rendering-after-revalidate-expiry branch) capacity.
  2. Make a note of the current time. Select a test page (e.g. /posts/foo-post), and request it to bring it to the front of the LRUCache.
  3. Request enough other pages to push that page out of the cache, making sure not to request the test page you selected in step 5.

Next/Link's prefetch behaviour can make life tricky here if you're using a browser, as it might sneakily request the test page if it's included as a link on one of the other pages. curl is likely a safer way to go here.

(See the cacheKeys log key for visibility of which pages are in the LRUCache if using the investigate-isr-pages-not-re-rendering-after-revalidate-expiry branch logging).

  1. Wait for the revalidate window to expire for the test page request (step 5), if it hasn't already.
  2. Make a note of the last modified timestamp from the test page's .html and .json files within the filesystem cache directory, for example:
$ ls -lt .next/server/pages/posts/foo-post.*
-rw-r--r--  1 user  staff   79293  4 May 13:25 .next/server/pages/posts/foo-post.json
-rw-r--r--  1 user  staff  308864  4 May 13:25 .next/server/pages/posts/foo-post.html
  1. Request the test page.
  2. Review the last modified timestamps for the test page's .html and .json files again, e.g.
$ ls -lt .next/server/pages/posts/foo-post.*
-rw-r--r--  1 user  staff   79293  4 May 13:25 .next/server/pages/posts/foo-post.json
-rw-r--r--  1 user  staff  308864  4 May 13:25 .next/server/pages/posts/foo-post.html

Note that they have not changed. The page has not been regenerated, despite the revalidation window having expired.

  1. (optional) Repeat steps 6-10 again to demonstrate that this bug can potentially result in an infinite revalidation window.
  2. Request the test page to bring it back to the front of the LRUCache again.
  3. This time, do not request other pages in order to push the test page out of the LRUCache.
  4. Wait for the revalidate window to expire for the most recent test page request (if it hasn't already).
  5. Make a note of the last modified timestamp from the test page's .html and .json files within the Next filesystem again, e.g.
$ ls -lt .next/server/pages/posts/foo-post.*
-rw-r--r--  1 user  staff   79293  4 May 13:25 .next/server/pages/posts/foo-post.json
-rw-r--r--  1 user  staff  308864  4 May 13:25 .next/server/pages/posts/foo-post.html
  1. Request the test page again.
  2. Review the last modified timestamps for the test page's .html and .json files again.
$ ls -lt .next/server/pages/posts/foo-post.*
-rw-r--r--  1 user  staff   79293  4 May 13:31 .next/server/pages/posts/foo-post.json
-rw-r--r--  1 user  staff  308864  4 May 13:31 .next/server/pages/posts/foo-post.html

Note that this time the last modified timestamps have updated. The page was regenerated because it was still in the LRUCache when the request came in following the expiry of the revalidation window.

@tommarshall tommarshall added the bug Issue was opened via the bug report template. label May 5, 2021
tommarshall added a commit to kyan/next.js that referenced this issue May 5, 2021
* As outlined in vercel#24806 I there's currently a bug with the incremental
  static regeneration (ISR) logic incorrectly prolonging the life of
  rendered page content when it's the `LRUCache` is refreshed with
  potentially stale content from the filesystem cache.
* This because if a page is not present in the `LRUCache`, but is
  present in the filesystem cache then it's re-added to the `LRUCache`
  with a fresh cache expiry calculated from the current time using the
  `revalidate` window.
* If that page continues to be pushed out of the LRUCache before the
  next request comes in after the revalidation window expires then this
  creates an endless loop of the same stale content from the filesystem
  being served, without it being regenerated.
* This fix addresses that issue by adding a `fromTime` argument to the
  revalidation period calculation, so that when we're falling back to
  the filesystem we can use the filesystem's `lastModfied` timestamp
  for that file to anchor the revalidation calculation, rather than
  restarting from the current time in every case.
* That means that if page content from the filesystem is stale it will
  be marked as so when returned from the incremental cache `get` method
  (`isStale: true`), which tells next server to regenerate the page in
  the background.
* If however the the page content from the filesystem is still within
  the revalidate window (i.e. is not yet stale) then it will be added
  back to the LRUCache with the correct remaining duration for the
  `revalidateAfter`.

Fixes vercel#24806
tommarshall added a commit to kyan/next.js that referenced this issue May 5, 2021
* As outlined in vercel#24806 I there's currently a bug with the incremental
  static regeneration (ISR) logic incorrectly prolonging the life of
  rendered page content when it's the `LRUCache` is refreshed with
  potentially stale content from the filesystem cache.
* This because if a page is not present in the `LRUCache`, but is
  present in the filesystem cache then it's re-added to the `LRUCache`
  with a fresh cache expiry calculated from the current time using the
  `revalidate` window.
* If that page continues to be pushed out of the LRUCache before the
  next request comes in after the revalidation window expires then this
  creates an endless loop of the same stale content from the filesystem
  being served, without it being regenerated.
* This fix addresses that issue by adding a `fromTime` argument to the
  revalidation period calculation, so that when we're falling back to
  the filesystem we can use the filesystem's `lastModfied` timestamp
  for that file to anchor the revalidation calculation, rather than
  restarting from the current time in every case.
* That means that if page content from the filesystem is stale it will
  be marked as so when returned from the incremental cache `get` method
  (`isStale: true`), which tells next server to regenerate the page in
  the background.
* If however the the page content from the filesystem is still within
  the revalidate window (i.e. is not yet stale) then it will be added
  back to the LRUCache with the correct remaining duration for the
  `revalidateAfter`.

Fixes vercel#24806
tommarshall added a commit to kyan/next.js that referenced this issue May 5, 2021
* As outlined in vercel#24806 I there's currently a bug with the incremental
  static regeneration (ISR) logic incorrectly prolonging the life of
  rendered page content when it's the `LRUCache` is refreshed with
  potentially stale content from the filesystem cache.
* This because if a page is not present in the `LRUCache`, but is
  present in the filesystem cache then it's re-added to the `LRUCache`
  with a fresh cache expiry calculated from the current time using the
  `revalidate` window.
* If that page continues to be pushed out of the LRUCache before the
  next request comes in after the revalidation window expires then this
  creates an endless loop of the same stale content from the filesystem
  being served, without it being regenerated.
* This fix addresses that issue by adding a `fromTime` argument to the
  revalidation period calculation, so that when we're falling back to
  the filesystem we can use the filesystem's `lastModfied` timestamp
  for that file to anchor the revalidation calculation, rather than
  restarting from the current time in every case.
* That means that if page content from the filesystem is stale it will
  be marked as so when returned from the incremental cache `get` method
  (`isStale: true`), which tells next server to regenerate the page in
  the background.
* If however the the page content from the filesystem is still within
  the revalidate window (i.e. is not yet stale) then it will be added
  back to the LRUCache with the correct remaining duration for the
  `revalidateAfter`.

Fixes vercel#24806
tommarshall added a commit to kyan/next.js that referenced this issue May 5, 2021
* As outlined in vercel#24806 I there's currently a bug with the incremental
  static regeneration (ISR) logic incorrectly prolonging the life of
  rendered page content when it's the `LRUCache` is refreshed with
  potentially stale content from the filesystem cache.
* This because if a page is not present in the `LRUCache`, but is
  present in the filesystem cache then it's re-added to the `LRUCache`
  with a fresh cache expiry calculated from the current time using the
  `revalidate` window.
* If that page continues to be pushed out of the LRUCache before the
  next request comes in after the revalidation window expires then this
  creates an endless loop of the same stale content from the filesystem
  being served, without it being regenerated.
* This fix addresses that issue by adding a `fromTime` argument to the
  revalidation period calculation, so that when we're falling back to
  the filesystem we can use the filesystem's `lastModfied` timestamp
  for that file to anchor the revalidation calculation, rather than
  restarting from the current time in every case.
* That means that if page content from the filesystem is stale it will
  be marked as so when returned from the incremental cache `get` method
  (`isStale: true`), which tells next server to regenerate the page in
  the background.
* If however the the page content from the filesystem is still within
  the revalidate window (i.e. is not yet stale) then it will be added
  back to the LRUCache with the correct remaining duration for the
  `revalidateAfter`.

Fixes vercel#24806
tommarshall added a commit to kyan/next.js that referenced this issue May 5, 2021
* As outlined in vercel#24806 I there's currently a bug with the incremental
  static regeneration (ISR) logic incorrectly prolonging the life of
  rendered page content when it's the `LRUCache` is refreshed with
  potentially stale content from the filesystem cache.
* This because if a page is not present in the `LRUCache`, but is
  present in the filesystem cache then it's re-added to the `LRUCache`
  with a fresh cache expiry calculated from the current time using the
  `revalidate` window.
* If that page continues to be pushed out of the LRUCache before the
  next request comes in after the revalidation window expires then this
  creates an endless loop of the same stale content from the filesystem
  being served, without it being regenerated.
* This fix addresses that issue by adding a `fromTime` argument to the
  revalidation period calculation, so that when we're falling back to
  the filesystem we can use the filesystem's `lastModfied` timestamp
  for that file to anchor the revalidation calculation, rather than
  restarting from the current time in every case.
* That means that if page content from the filesystem is stale it will
  be marked as so when returned from the incremental cache `get` method
  (`isStale: true`), which tells next server to regenerate the page in
  the background.
* If however the the page content from the filesystem is still within
  the revalidate window (i.e. is not yet stale) then it will be added
  back to the LRUCache with the correct remaining duration for the
  `revalidateAfter`.

Fixes vercel#24806
jamsinclair pushed a commit to jamsinclair/next.js that referenced this issue Jul 20, 2021
* As outlined in vercel#24806 I there's currently a bug with the incremental
  static regeneration (ISR) logic incorrectly prolonging the life of
  rendered page content when it's the `LRUCache` is refreshed with
  potentially stale content from the filesystem cache.
* This because if a page is not present in the `LRUCache`, but is
  present in the filesystem cache then it's re-added to the `LRUCache`
  with a fresh cache expiry calculated from the current time using the
  `revalidate` window.
* If that page continues to be pushed out of the LRUCache before the
  next request comes in after the revalidation window expires then this
  creates an endless loop of the same stale content from the filesystem
  being served, without it being regenerated.
* This fix addresses that issue by adding a `fromTime` argument to the
  revalidation period calculation, so that when we're falling back to
  the filesystem we can use the filesystem's `lastModfied` timestamp
  for that file to anchor the revalidation calculation, rather than
  restarting from the current time in every case.
* That means that if page content from the filesystem is stale it will
  be marked as so when returned from the incremental cache `get` method
  (`isStale: true`), which tells next server to regenerate the page in
  the background.
* If however the the page content from the filesystem is still within
  the revalidate window (i.e. is not yet stale) then it will be added
  back to the LRUCache with the correct remaining duration for the
  `revalidateAfter`.

Fixes vercel#24806
@kodiakhq kodiakhq bot closed this as completed in #27335 Jul 20, 2021
kodiakhq bot pushed a commit that referenced this issue Jul 20, 2021
…27335)

Fixes: #24806 
Fixes: #26689
Fixes: #27325
Closes: #24807

@tommarshall has done us a huge favor with his PR #24807 which outlines exactly the issue and a pragmatic solution.

I'm not trying to "steal their work", however, the PR seems to have been stuck for some months. I think there's huge value in this for myself and others as essentially **ISR is broken** for people running Next.js at scale 😱 

This PR has cherry-picked @tommarshall's fine fix and added some integrations tests around page revalidation and the edge case when the cache size is exhausted.

✏️ Edits are enabled, so feel free great Vercel staff and other maintainers to improve my bad tests or surrounding code 🙇 

## Bug

- [x] Related issues linked using `fixes #number`
- [x] Integration tests added
@ijjk
Copy link
Member

ijjk commented Jul 20, 2021

Hi, this has been updated in v11.0.2-canary.18 of Next.js, please update and give it a try!

flybayer pushed a commit to blitz-js/next.js that referenced this issue Aug 19, 2021
…) (vercel#27335)

Fixes: vercel#24806 
Fixes: vercel#26689
Fixes: vercel#27325
Closes: vercel#24807

@tommarshall has done us a huge favor with his PR vercel#24807 which outlines exactly the issue and a pragmatic solution.

I'm not trying to "steal their work", however, the PR seems to have been stuck for some months. I think there's huge value in this for myself and others as essentially **ISR is broken** for people running Next.js at scale 😱 

This PR has cherry-picked @tommarshall's fine fix and added some integrations tests around page revalidation and the edge case when the cache size is exhausted.

✏️ Edits are enabled, so feel free great Vercel staff and other maintainers to improve my bad tests or surrounding code 🙇 

## Bug

- [x] Related issues linked using `fixes #number`
- [x] Integration tests added
@balazsorban44
Copy link
Member

This issue has been automatically locked due to no recent activity. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

@vercel vercel locked as resolved and limited conversation to collaborators Jan 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Issue was opened via the bug report template.
Projects
None yet
3 participants