Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug with ThreadLocal.asContextElement #985

Closed
alamothe opened this issue Feb 15, 2019 · 34 comments
Closed

Possible bug with ThreadLocal.asContextElement #985

alamothe opened this issue Feb 15, 2019 · 34 comments
Labels

Comments

@alamothe
Copy link

alamothe commented Feb 15, 2019

Hello,
I'm observing a strange behavior with the following code:

	val t: ThreadLocal<String> = ThreadLocal.withInitial { "a" }

	@Test
	fun testContext1() = runBlocking {
		println(t.get())
		withContext(coroutineContext.plus(t.asContextElement("b"))) {
			println(t.get())
		}
		println(t.get())
	}

	@Test
	fun testContext2() = runBlocking {
		println(t.get())
		withContext(coroutineContext.plus(t.asContextElement("b"))) {
			println(t.get())
			delay(Random.nextLong(100))
			println(t.get())
		}
		println(t.get())
	}

	@Test
	fun testContext3() = runBlocking {
		println(t.get())
		async(coroutineContext.plus(t.asContextElement("b"))) {
			println(t.get())
			delay(Random.nextLong(100))
			println(t.get())
		}.await()
		println(t.get())
	}

The result is:

  • testContext1 prints aba - expected
  • testContext2 prints abbb - It looks like the value is not restored for some reason. Is this a bug? What's the explanation?
  • testContext3 prints abba - expected

Please help!

I'm using org.jetbrains.kotlinx:kotlinx-coroutines-core:1.1.1

@elizarov
Copy link
Contributor

It is not exactly a bug, but a side-effect of the way asContextElement is implemented. See, your main coroutine (in runBlocking) does not have t's context element, so this is interpreted as "main coroutine do not care about this thread local". So then, depending on specific ways you get back to the main tread, the value of t can be either restored or not restored for the main coroutine.

If you write runBlocking(t.asContextElement("a")) { ... } in all your tests, then they all are going to run as expected.

Also note, then when you use withContext function you don't have to write withContext(coroutineContext.plus(other)) { ... }, because that merging of current context with the other context is exactly what withContext does (see its docs), so you can simply write withContext(other) { ... }.

Does it help?

@alamothe
Copy link
Author

alamothe commented Feb 16, 2019

Hey @elizarov thank you for your reply & the tip for withContext.

I changed the test like this:

	@Test
	fun testContext2() {
		println(t.get())
		runBlocking {
			println(t.get())
			withContext(t.asContextElement("b")) {
				println(t.get())
				delay(Random.nextLong(100))
				println(t.get())
			}
			println(t.get())
		}
		println(t.get())
	}

It prints a a b b b a.

Are you basically saying that, between the outermost asContextElement and the end of the coroutine, the value of ThreadLocal is undefined? I find that very troubling. In my opinion, such undefined behavior could lead to disastrous bugs.

How about after the coroutine? Is it guaranteed to be restored?

@alamothe
Copy link
Author

alamothe commented Feb 16, 2019

Consider also this case where the value depends on previous value (not uncommon):

	val tInt: ThreadLocal<Int> = ThreadLocal.withInitial { 0 }

	@Test
	fun testContext4() {
		println(tInt.get())
		runBlocking {
			println(tInt.get())
			withContext(tInt.asContextElement(tInt.get() + 1)) {
				println(tInt.get())
				// Comment for other behavior
				delay(Random.nextLong(100))
			}
			println(tInt.get())
			withContext(tInt.asContextElement(tInt.get() + 1)) {
				println(tInt.get())
				// Comment for other behavior
				delay(Random.nextLong(100))
			}
			println(tInt.get())
		}
		println(tInt.get())
	}

So now in the second withContext, the value could be both 1 or 2. In practice, withContext can be deep in the stack and have no idea if it is outermost asContextElement or not.

@alamothe
Copy link
Author

If I understand correctly, the only workaround is to remember to set as a context element every time a coroutine is started. But what if that's happening at 100 places (not uncommon, consider the codebase which is migrating to coroutines)

@elizarov
Copy link
Contributor

The design principle of t.asContextElement(value) is this -- a coroutine with this element in its context will have value set in the corresponding thread local. Indeed, there are no guarantees for coroutines that do have this element in their context, as they are not supposed to ever care about it. If they do care about this thread-local, they should have it in their context.

You don't need to remember to always set it if you use the structured concurrency approach and always launch your coroutines as children of other coroutines. Once you set the corresponding element in the context of your top-level scope, it is going to be inherited by all your coroutines.

@alamothe
Copy link
Author

You don't need to remember to always set it if you use the structured concurrency approach and always launch your coroutines as children of other coroutines. Once you set the corresponding element in the context of your top-level scope, it is going to be inherited by all your coroutines.

What I was referring to is usage in a servlet container. In this case, each request will start outside of a coroutine (as a separate thread), and then a coroutine will be started at some point via runBlocking. Code outside of a coroutine will care about a thread-local value. To be able to access it both inside and outside of a coroutine - isn't that precisely the idea behind ThreadLocal.asContextElement?

So in this scenario, does the implementation guarantee that the value will be restored? In my example, outside of runBlocking? (my tests do show the value being restored after runBlocking)

I hope the answer is yes. But even so, you should realize the inconsistency here:

  1. Value of a thread-local is well-defined inside a coroutine which has it in its context.
  2. Value of a thread-local is well-defined outside of coroutine.
  3. Value of a thread-local is undefined inside a coroutine which doesn't have it in its context.

Isn't it well reasoned to expect 2 and 3 to act the same? Outside of a coroutine there is no context at all. It's troublesome that the behavior is different. I am sure that someone will get bitten by this.

@elizarov
Copy link
Contributor

@alamothe The value of thread-local is guaranteed to be restored when you go outside of the scope of any of the coroutines. If you write the following code:

// not in a coroutine here
runBlocking(t.asContextElement()) {

}
// the value of t is guaranteed to be restored here

@alamothe
Copy link
Author

Got it. The way I solved it is to explicitly add to context every time I start a coroutine like that, and even implement as a helper function.

I still consider this behavior unintuitive, and suggest adding a warning to the docs, stating that "the value of a thread-local is undefined inside a coroutine which doesn't have it in its context", with some examples.

@qwwdfsad qwwdfsad added the docs KDoc and API reference label Feb 27, 2019
qwwdfsad added a commit that referenced this issue Feb 27, 2019
qwwdfsad added a commit that referenced this issue Mar 6, 2019
@frost13it
Copy link

It would also be nice to add the warning to the documentation of ThreadContextElement too.
At the moment it does not mention that restoreThreadContext() may not be called when it would be expected by users implementing ThreadContextElement themselves.

And it would be much better to have thread state restored outside withContext { }, because a library may not have control over initial context of the coroutine, so this feature becomes unusable there.

@frost13it
Copy link

The kotlinx-coroutines-slf4j module seems to be a bit broken because of this problem.
Running the following snippet:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.slf4j.MDCContext
import kotlinx.coroutines.withContext
import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("MainKt")

suspend fun main() {
    logger.info("Before operation")
    withContext(MDCContext(mapOf("operationId" to "123"))) {
        withContext(Dispatchers.IO) {
            logger.info("Inside operation")
        }
    }
    logger.info("After operation")
}

Procuces the following output:

[] INFO  MainKt - Before operation
[operationId=123] INFO  MainKt - Inside operation
[operationId=123] INFO  MainKt - After operation

Here the expression in square brackets is Logback's %mdc.

As you can see, MDC is not cleaned up after exiting from withContext {}. And there is no top-level scope where MDCContext can be injected.

Even when this is possible, it is not convenient at all to put "empty" value of thread-local state in all root coroutines to ensure that it will not left undefined after running some code. Could you please fix it?

Should I create a separate issue and is there a chance that this design of ThreadContextElement will be changed in future and such usages will work?

@elizarov elizarov reopened this Sep 25, 2019
@elizarov
Copy link
Contributor

@frost13it Thanks for a use-case. I've reopened this way. Let's see if it would be somehow possible to fix it (ideally, in an efficient way).

@elizarov elizarov removed the question label Sep 25, 2019
@elizarov
Copy link
Contributor

@frost13it I've looked at this problem again. Can you, please, elaborate on your actual use-case? In particular, you are giving an example where thread local value leaks into suspend fun main. Is it your real use-case?

Note, that you can wrap your top-level code into withContext(MDCContext(emptyMap())) { ... }. If you do this, you should see that your operationId is cleared. Would it be a solution for you?

@frost13it
Copy link

Unfortunately it is not a good solution for me.
I'm writing a logging library. What am I trying to achieve is to propagate contextual data (called LoggingContext) in a call graph consisting of both suspend and non-suspend functions without passing it explicitly.
Coroutine context allows to achieve that in suspend code, and ThreadLocal allows to achieve that in non-suspend single-threaded code. Provided that suspend functions can only be called from other suspend functions, I see the following solutions for this:

  1. Take LoggingContext from the context of the coroutine currently running on the current thread. It would be possible if there was a ThreadLocal containing CoroutineContext of that coroutine. But as far as I know, no such ThreadLocal exists at the moment.

  2. Make thread local value of LoggingContext match that in the coroutine context.
    It seemed that ThreadContextElement fits perfectly here. But it is hard to guarantee that context of top-level coroutines will contain an initial value of LoggingContext. Forcing users of the library to add it manually is inconvenient and error-prone. Especially in the case with suspend fun main().
    That is why I would love to see this issue fixed.

Don't you think that kotlinx-coroutines-slf4j (which works the same way) is pretty difficult to use correctly because of this issue? I'm afraid that for now the same is true for any similar library.

@elizarov
Copy link
Contributor

elizarov commented Sep 26, 2019

We assume that kotlinx-coroutines-slf4j is used in the following way. Consider the application that is using SLF4J for logging, uses coroutines, and uses MDC. This application adds kotlinx-coroutines-slf4j to its dependencies and adds MDCContext(emptyMap()) to its top-level scope to give a "signal" to coroutine machinery that it is going to use MDCContext. After that, its MDC will be always correctly set/restored and reset and switching between different coroutines.

Unfortunately, this does not currently affect GlobalScope and requires some boiler-plate code from the application writer. E.g., if my application has code like val myScope = CoroutineScope(someContext) then I, as an application author, have to remember to add MDCContext(emptyMap()) to this scope's context.

What we can do to improve this situation is to give integration modules like kotlinx-coroutines-slf4j and ability to globally inject context elements to all scopes (including GlobalScope). This should considerably improve usability. All you'll need to do in this case is to add kotlinx-coroutines-slf4j to dependencies and that's it. What do you think?

P.S. The downside of this approach to consider is that anyone who adds kotlinx-coroutines-slf4j to their dependencies will get a performance hit even if they don't use MDC (as an afterthought -- there's no other reason to add kotlinx-coroutines-slf4j, but for MDC support).

@frost13it
Copy link

I think it would be sufficient in my case. It would be great to have such an ability in a library.

elizarov added a commit that referenced this issue Sep 27, 2019
Note, that this fix has potentially severe performance impact since
it always walk up the coroutine completion stack in search for parent
UndispatchedCoroutine.

:todo: figure out how to optimize it

Fixes #985
@elizarov elizarov added bug and removed docs KDoc and API reference labels Sep 28, 2019
elizarov added a commit that referenced this issue Sep 30, 2019
Note, that this fix has potentially severe performance impact since
it always walk up the coroutine completion stack in search for parent
UndispatchedCoroutine.

:todo: figure out how to optimize it

Fixes #985
@frost13it
Copy link

Are there any updates on this issue?

@alamothe
Copy link
Author

alamothe commented Jun 8, 2020

Thanks @elizarov for your work and everyone else in the discussion.

We have solved this by making sure every time we start a coroutine, we put all thread locals that we care about in its context.

But I totally understand how this doesn't work for libraries.

@akalini
Copy link

akalini commented Jul 28, 2020

Hello,

I am experiencing the same issue even with the code base that is not huge and is mostly under our control. Our services use MDC logging, opencensus tracing, thread locals, and GRPC contexts. It has been a huge challenge to wire everything up correctly to work with coroutine context propagation.

For the code under our control, we need to remember to seed all the different context elements every time somebody starts a coroutine. This leads to tight coupling, every piece of code that wants to start a coroutine needs to be aware of all the thread context elements in the entire codebase. And of course, it is very error-prone.

The problem is exacerbated by GRPC launching coroutines without giving everyone the opportunity to override the behavior. I imagine other libraries wanting to do the same.

The described behavior is surprising, to say the least. I don't totally understand why global injection is needed to fix this. In the examples above, there are clear scope/context entry and exit points. Could you explain why the restoreThreadContext is not invoked at the block's end? It's especially surprising behavior because if you remove delay invocation from the examples above, withContext works as expected!

@qwwdfsad
Copy link
Collaborator

One more request: https://youtrack.jetbrains.com/issue/KT-42582

qwwdfsad pushed a commit that referenced this issue Dec 1, 2020
This fix solves the problem of restoring thread-context when
returning to another context in undispatched way. It impacts
suspend/resume performance of coroutines that use ThreadContextElement
since we have to walk up the coroutine completion stack in search for
parent UndispatchedCoroutine. However, there is a fast-path to ensure
that there is no performance impact in cases when ThreadContextElement
is not used by a coroutine.

Fixes #985
@frost13it
Copy link

Is there a decision on the #1577?

@qwwdfsad
Copy link
Collaborator

qwwdfsad commented Feb 1, 2021

Yes, we are going to release the fix with the next release

@frost13it
Copy link

Magnificently!

qwwdfsad added a commit that referenced this issue Feb 1, 2021
This fix solves the problem of restoring thread-context when
returning to another context in undispatched way.

It impacts suspend/resume performance of coroutines that use ThreadContextElement and undispatched coroutines.

The kotlinx.coroutines code poisons the context with special 'UndispatchedMarker' element and linear lookup is performed only when the marker is present. The code also contains description of an alternative approach in order to save a linear lookup in complex coroutines hierarchies.

Fast-path of coroutine resumption is slowed down by a single context lookup.

Fixes #985

Co-authored-by: Roman Elizarov <elizarov@gmail.com
qwwdfsad added a commit that referenced this issue Feb 1, 2021
This fix solves the problem of restoring thread-context when
returning to another context in undispatched way.

It impacts suspend/resume performance of coroutines that use ThreadContextElement and undispatched coroutines.

The kotlinx.coroutines code poisons the context with special 'UndispatchedMarker' element and linear lookup is performed only when the marker is present. The code also contains description of an alternative approach in order to save a linear lookup in complex coroutines hierarchies.

Fast-path of coroutine resumption is slowed down by a single context lookup.

Fixes #985

Co-authored-by: Roman Elizarov <elizarov@gmail.com>
@luispollo
Copy link

luispollo commented Feb 28, 2021

Thanks for fixing this! I've been pulling my hair trying to understand why my MDC values were screwed up for months now without coming across this thread. Is there an ETA on the next release?

@luispollo
Copy link

We assume that kotlinx-coroutines-slf4j is used in the following way. Consider the application that is using SLF4J for logging, uses coroutines, and uses MDC. This application adds kotlinx-coroutines-slf4j to its dependencies and adds MDCContext(emptyMap()) to its top-level scope to give a "signal" to coroutine machinery that it is going to use MDCContext. After that, its MDC will be always correctly set/restored and reset and switching between different coroutines.

Until there's a new release, it would be great to have a clear example of what this "expected usage" looks like. Specifically, the documentation for kotlinx-coroutines-slf4j is extremely short and could benefit from a warning about the caveat caused by this bug and an example of a workaround. Might be best tracked as a separate issue, not sure.

@cyberdelia
Copy link

I'm seeing a strange side-effect which I believe is the result of this change. Using MDCContext will remove keys from the MDC if you call a suspendable function, which is a bit surprising, when in this simple example case you just call delay:

Without MDCContext:

runBlocking {
    MDC.put("key", "value")
    println(MDC.copyOfContextMap)
    delay(50)
    println(MDC.copyOfContextMap)
}

# {key=value}
# {key=value}

With MDCContext:

runBlocking(MDCContext()) {
    MDC.put("key", "value")
    println(MDC.copyOfContextMap)
    delay(50)
    println(MDC.copyOfContextMap)
}

# {key=value}
# null

@qwwdfsad
Copy link
Collaborator

Yes, this is the existing limitation of MDCContext exposed more strictly.

Quoting our KDoc:

Note that you cannot update MDC context from inside of the coroutine simply using MDC.put. These updates are going to be lost on the next suspension and reinstalled to the MDC context that was captured or explicitly specified in contextMap when this object was created on the next resumption.

This is a by-design limitation, that become much more clearer when more than one coroutine is launched.

E.g.:

runBlocking((MDCContext()) {
    launch(someDispatcher) {
        // mutate MDC
    }.join()
    // use MDC here as well
}

do you expect mutated MDC to be available in runBlocking coroutine. What if someDispatcher is Dispatchers.Unconfined and what if it's concurrent? And what if it's concurrent, but it happened to be dispatched on the same thread as runBlocking coroutine?

@michail-nikolaev
Copy link

It seems like there are some strange issues related to it - https://youtrack.jetbrains.com/issue/KT-46552

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants