Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZnSeasideGemServer crashes due to socket becoming nil after a Seaside error handler caught an exception #1297

Closed
jbrichau opened this issue Dec 27, 2021 · 16 comments
Assignees

Comments

@jbrichau
Copy link
Member

jbrichau commented Dec 27, 2021

Running the tests in WAWebDriverFunctionalTestCase including the testCORSFilterFunctionalTest on GemStone 3.6.x is crashing the ZnServer with a stack overflow when handling the exception that the socket is nil.

Running all the tests with the testCORSFilterFunctionalTest removed, does not crash the ZnServer.

Since this test restarts all gems when it starts and stops, the cause can probably be found there. However, I fail to understand why the gems restarts in that test cause the socket to be nil at some later point, why it does not crash on other versions of GemStone, and why that also leads to a loop in exception handling. A lot of questions...

@dalehenrich dalehenrich self-assigned this Dec 30, 2021
@dalehenrich
Copy link
Member

first observation ... I reproduced the bug and in my case the stack overflow was "immediately" preceded by a Transcript show of an MNU:

-----------
--transcript--2021-12-30 16:47:36.535079 109939 E 
1 [] in ZnMultiThreadedServer >> executeOneRequestResponseOn: @23 line 19  [GsNMethod 173924353]
2 MessageNotUnderstood (AbstractException) >> _executeHandler: @7 line 11  [GsNMethod 3778049]
3 MessageNotUnderstood (AbstractException) >> _signalWith: @1 line 2  [GsNMethod 3878145]
4 MessageNotUnderstood (AbstractException) >> signal @2 line 47  [GsNMethod 3564289]
5 UndefinedObject (Object) >> doesNotUnderstand: @9 line 10  [GsNMethod 6212865]
6 UndefinedObject (Object) >> _doesNotUnderstand:args:envId:reason: @8 line 14  [GsNMethod 6168833]
7 SocketStream >> flush @5 line 5  [GsNMethod 160911105]
8 ZnManagingMultiThreadedServer (ZnSingleThreadedServer) >> writeResponse:on: @3 line 5  [GsNMethod 162365185]
9 [] in ZnMultiThreadedServer >> writeResponseSafely:on: @7 line 4  [GsNMethod 173923585]
10 ExecBlock0 (ExecBlock) >> on:do: @3 line 44  [GsNMethod 5659905]
11 ZnManagingMultiThreadedServer (ZnMultiThreadedServer) >> writeResponseSafely:on: @3 line 5  [GsNMethod 160791041]

The odd thing is that a continuation was not recorded so I would have expected the gem to fail with such an "unhandled exception" ... it is also likely that error handlers on the stack would dispose of sockets to prevent random stuff going out over the wire ... so at this point I'm going to say that the root cause of the problem is the "unhandled MNU" and that is what I think we need to understand ...

The good news is that it is reproducing for on my local machine, so we have a chance to track this puppy down ...

@dalehenrich
Copy link
Member

dalehenrich commented Dec 31, 2021

... the SocketStream >> flush is the source of the MNU and a niled out socket would be the likely source ... sot the stack overflow is just part of a series of errors ...

Prior to the MNU, there was this:

----------- Continuation saved to object log LOG ENTRY: anArray-----------
----------- Continuation saved to object log ERROR Encountered: 2021-12-30T16:47:34.775838-08:00
WAGsInvalidCallbackContext: You can only #call: and #answer: from within a callback or a Task.
1 GRGemStonePlatform >> logError:title:shouldCommit: @2 line 4  [GsNMethod 137232129]
2 GRGemStonePlatform >> logError:title: @2 line 3  [GsNMethod 137232641]
3 WAGemStoneProductionErrorHandler (WAErrorHandler) >> saveExceptionContinuation: @8 line 6  [GsNMethod 249487105]
4 WAGemStoneProductionErrorHandler >> handleDefault: @2 line 3  [GsNMethod 252143617]
5 WAGemStoneProductionErrorHandler (WAErrorHandler) >> handleError: @2 line 2  [GsNMethod 214702337]
6 WAGemStoneProductionErrorHandler (WAErrorHandler) >> handleGemStoneException: @4 line 4  [GsNMethod 249487361]
7 WAGemStoneProductionErrorHandler (WAHtmlHaltAndErrorHandler) >> handleException: @2 line 2  [GsNMethod 252424193]
8 [] in WAExceptionHandler >> handleExceptionsDuring: @10 line 5  [GsNMethod 232993793]
9 ExecBlock0 (ExecBlock) >> on:do: @3 line 44  [GsNMethod 5659905]

Now I don't think the WAGemStoneProductionErrorHandler is expected to shut down the system, but I'm not sure whether the server is expected to survive a WAGsInvalidCallbackContext ...

Okay I debugged this continuation (using ol view in tODE) and looking at frame 208, the ifCurtailed: block:

206 [] in ExecBlock >> ifCurtailed: @7 line 6  [GsNMethod 138175489]
207 ExecBlock0 (ExecBlock) >> ensure: @2 line 12  [GsNMethod 5650689]
208 ExecBlock0 (ExecBlock) >> ifCurtailed: @3 line 8  [GsNMethod 120973057]
209 [] in ZnMultiThreadedServer >> serveConnectionsOn: @24 line 23  [GsNMethod 173925121]
210 ExecBlock >> valueWithArguments: @1 line 7  [GsNMethod 5652481]
211 GsProcess >> _start @7 line 16  [GsNMethod 6453505]
212 <Reenter marker>

The code that get's executed in [] in ZnManagingMultiThreadedServer(ZnMultiThreadedServer)>>serveConnectionsOn: @24 line 23 ends up nuking the socket:

    ifCurtailed: [ 
      self log debug: 'Destroying socket'.
      socket destroy ] ]

Which spells doom from the server, because this didn't cause the server to immediately die ... Sooooo, now it seems that the WAGsInvalidCallbackContext: You can only #call: and #answer: from within a callback or a Task. is the root cause of the crash ...

Hmmm, it looks like WAAllFunctionalTests>>renderContentOn: @8 line 19 is on the stack and poking around it seems the WAFlowRenderingCallFunctionalTest is in one of the WAValueHolders...

Headed to dinner now, but there's a bit more meat to find on this bone ...

@dalehenrich
Copy link
Member

I think that catastrophic failure in this case is probably expected presumably there's something in 3.6.1 (and beyond?) that is choking and resulting in the WAGsInvalidCallbackContext ... debugging that particular error is next up ... If you can determine which test is leading to this ... Presumably the functional tests can be run interactively, so I'll head down that route and once we've isolated the particular test, we should be able to get to the bottom of the problem ...

@jbrichau
Copy link
Member Author

jbrichau commented Jan 2, 2022

Well... testFlowRenderingCallFunctionalTest actually tests that this error is being produced and it should not tear down the server. It is strange that this error is being caught by the Zinc server instead of Seaside's WAErrorHandler (or whatever subclass that was set as application exception handler).

But indeed! If I run the test interactively, after about a minute or so, the server dies.
The inclusion of the WACORSFunctionalTest probably makes the tests run longer and therefore this problem now shows up. Now I need to figure out why (and if) this only happens on GS3.6...

Since the problem occurred with the inclusion of the WACORSFunctionalTest, and that this test is starting and stopping gems to perform the test, I was focused that it had something to do with that... thanks for spotting that and redirecting the search effort. I'll continue down this path and see what's going on.

@jbrichau
Copy link
Member Author

jbrichau commented Jan 2, 2022

The error does get handled by Seaside (WAGemStoneProductionErrorHandler). Interestingly, when I run the testFlowRenderingCallFunctionalTest in GS3.4.5, a second error pops up which probably explains why the socket is not destroyed in versions earlier than GS3.6:
Could not remove SocketStream is thrown in closeSocketStream:

1. DebuggerLogEntry class>>createContinuationFor: @3 line 5
2. DebuggerLogEntry class>>createContinuationLabeled: @3 line 4
3. [] in UndefinedObject(ZnGemServerLogSupport)>>createContinuation: @12 line 5
4. [] in GRGemStonePlatform>>doTransaction: @17 line 21
5. UndefinedObject(ExecBlock)>>ensure: @2 line 12
6. [] in GRGemStonePlatform>>doTransaction: @12 line 22
7. GRGemStonePlatform(ExecBlock)>>ensure: @2 line 12
8. TransientRecursionLock>>critical: @12 line 12
9. GRGemStonePlatform>>doTransaction: @4 line 10
10. ZnGemServerLogSupport>>createContinuation: @8 line 5
11. ZnGemServerLogSupport>>error: @3 line 2
12. [] in ZnManagingMultiThreadedServer>>closeSocketStream: @15 line 7
13. OrderedCollection(SequenceableCollection)>>remove:ifAbsent: @7 line 14
14. [] in ZnManagingMultiThreadedServer>>closeSocketStream: @10 line 6
15. ZnManagingMultiThreadedServer(ExecBlock)>>ensure: @2 line 12
16. TransientMutex(TransientSemaphore)>>critical: @5 line 7
17. ZnManagingMultiThreadedServer>>closeSocketStream: @5 line 4
18. [] in ZnManagingMultiThreadedServer(ZnMultiThreadedServer)>>serveConnectionsOn: @47 line 22
19. ZnManagingMultiThreadedServer(ExecBlock)>>ensure: @6 line 13
20. [] in ZnManagingMultiThreadedServer(ZnMultiThreadedServer)>>serveConnectionsOn: @29 line 20
21. [] in ExecBlock>>ifCurtailed: @8 line 6
22. ExecBlock>>ensure: @2 line 12
23. ZnManagingMultiThreadedServer(ExecBlock)>>ifCurtailed: @4 line 8
24. [] in ZnManagingMultiThreadedServer(ZnMultiThreadedServer)>>serveConnectionsOn: @25 line 23
25. ZnManagingMultiThreadedServer(ExecBlock)>>valueWithArguments: @1 line 1
26. GsProcess>>_start @8 line 16
27. GsNMethod class>>_gsReturnToC @1 line 1

So, this last error went unnoticed and prevents the socket from being destroyed after handling the WAGsInvalidCallbackContext error. As a result, the socket stays alive and the problem of the socket being nil does not show up. However, the WAGsInvalidCallbackContext is handled by the Seaside exception handler and correctly displayed in the interface. The code behind the ifCurtailed: should therefore never have been executed unless an error was produced in the error handling... but that is not what I'm seeing happing.

Edit: the Could not remove SocketStream error also only happens some time after the GsInvalidCallbackContext occurred, as this print of the objectlog shows:

error  Continuation saved to object log                       11273  01/02/2022 04:06:56:400
error  -- continuation -- (WAGsInvalidCallbackContext: Yo...  11273  01/02/2022 04:06:56:403
error  -- continuation -- (Could not remove SocketStream[...  11273  01/02/2022 04:07:13:124
error  -- continuation -- (Could not remove SocketStream[...  11273  01/02/2022 04:08:04:014
error  -- continuation -- (Could not remove SocketStream[...  11273  01/02/2022 04:08:04:015

@jbrichau
Copy link
Member Author

jbrichau commented Jan 2, 2022

I have no automated test for it yet but WAFlowErrorFunctionalTest also eventually results in a crashed ZnServer, even on Gemstone 3.4.5. That's not good....

@dalehenrich
Copy link
Member

Taking a slightly different tack ... I see that the code in ZnGemServerManagingMultiThreadedServer>>serveConnectionsOn: does not have an ifCurtailed: so I'm wondering if using ZnGemServerManagingMultiThreadedServer would be a better choice? That would entail using ZnNewGemServer or ZnSeasideNewGemServer instead of ZnGemServer/ZnSeasideGemServer ... but sets of GemServers are tested, so the New variants might just behave when errors are occurring ...

In general, a crashing server (backed by a server restart) is a "good thing" since the transactional nature of the system means that the user is expected to retry their request when a truly fatal error occurs ...

So for tests that are supposed to create errors, the new ZnNewGemServer may behave better from a testing perspective ...

@jbrichau
Copy link
Member Author

jbrichau commented Jan 2, 2022

My problem is that I fail to see why the ifCurtailed block is executed. It should not because the error has been captured by the Seaside error handler and an appropriate response was produced.
The gem should really not crash, especially not some time later when processing another request.

@jbrichau
Copy link
Member Author

jbrichau commented Jan 2, 2022

And with the following scenario, I start thinking something is still wrong with the workaround code made in the context of #1198

In the Seaside WAFlowErrorFunctionalTest:

  • Click on 'Raise error in called component'
  • Click through on 'Ok'
  • Hit the browser back button twice (going back to the test start page)
  • Hit the 'Restart' button
    -> socket is cleared

If you do not click on the 'Restart' button again (i.e. not triggering a restore of that continuation on the server), everything is fine. For example, if you go to the Seaside start page again instead of going 'back' after getting the error response: all is well.

@dalehenrich
Copy link
Member

Hmmm, perhaps the socket instance variable is part of the continuation? if so the restored continuation would include the niled socket var?

It is possible that the persistent behavior of Sockets has changed in 3.6.1 making them commitable ...

Hmm, hmmm, I think that if server state IS being persisted then the ZnTransactionSafeManagingMultiThreadedServer should be used ... it takes great care to wrap all objects on the stack with TransientValue wrappers so the stack can be persisted (as part of the continuation) ... although this was done to be able to simply save a continuation safely ...

Restoring a continuation on the stack that has references to a Socket will never work ... so that depends upon where the continuation boundaries are located ...

When I look at continuations in ol view I see somw with server instances that have been stored ... so if the restore is trying to restore a gem server instance, then the continuation boundaries need to be managed?

@jbrichau
Copy link
Member Author

jbrichau commented Jan 22, 2022

Another scenario that does not involve the WAGsInvalidCallbackContext error nor the partial continuation restore that leads to the same crash:

  • go to WAExceptionFunctionalTest in the functional test cases
  • select the WAGemStoneProductionErrorHandler as the handler (click 'set handler')
  • click 'raise error' link
  • go back (using browser back button)
  • click 'restart' button
    Repeat this a couple of times... at a certain point, the server will crash

@jbrichau
Copy link
Member Author

jbrichau commented Jan 22, 2022

@dalehenrich I am unable to reconstruct the issue when using ZnSeasideNewGemServer instead of ZnSeasideGemServer

Unfortunately, I don't feel at ease just changing to that version for the tests because I fail to understand why the ifCurtailed: is executed in the version with ZnSeasideGemServer

@jbrichau
Copy link
Member Author

Confirmed that its working but I keep this issue open for a little more investigation.
green build: https://github.com/SeasideSt/Seaside/actions/runs/1733599025

@jbrichau jbrichau changed the title ZnServer crashes due to socket being nil ZnSeasideGemServer crashes due to socket being nil Jan 23, 2022
@jbrichau jbrichau changed the title ZnSeasideGemServer crashes due to socket being nil ZnSeasideGemServer crashes due to socket becoming nil after a Seaside error handler caught an exception Jan 23, 2022
@dalehenrich
Copy link
Member

@jbrichau ...

Unfortunately, I don't feel at ease just changing to that version for the tests because I fail to understand why the ifCurtailed: is executed in the version with ZnSeasideGemServer

I think that the ifCurtailed: can probably be fixed ... don't know if there are other potential issues lurking in ZnSeasideGemServer :) I created ZnSeasideNewGemServer for a reason, but without looking at the diffs between the two implementations I don't recall what those reasons were ... this commit comment is interesting though: add ZnNewGsServerTests for testing ZnNewGemServer ... these tests can be run against an interactive gem server running in second tODE environmnent ... 19 run, 9 passes, 0 expected defects, 10 failures, 0 errors, 0 unexpected passes .... so blatant server-side errors have been cleaned up, as are the commit comments on this issue create GsApplicationTools v1.0.0 compatible gem server #69 ...

ZnGemServer was a year old at that time (2015), so I was probably reluctant to wholesale replace ZnGemServer with ZnNewGemServer, since folks may have been using the class for awhile and been happy with it ...

@dalehenrich
Copy link
Member

... and I guess people have been using ZnGemServer for 6 years now and it took that long for this particular issue to pop up???

@jbrichau
Copy link
Member Author

@dalehenrich I'm quite surprised by that yes. I am not using the ZnGemServer myself (using FastCGI).
Anyway, switched to the new version which is holding up in the tests and which I also did some more elaborate manual pounding on so I think we're good (or at least: better! ;-) )

Keeping this open to investigate the issue a bit more later on since I really want to try to fully grasp what's going on, but I will tend to more urgent matters first.

Thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants