Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parent-Child discover error in examples/simple when built as release and shared library. #1014

Open
Pupilsch opened this issue Oct 15, 2021 · 16 comments
Assignees
Labels
bug Something isn't working do-not-stale help wanted Good for taking. Extra help will be provided by maintainers

Comments

@Pupilsch
Copy link

Describe your environment
My system is Ubuntu 18.04.5 LTS. And the compiler is gcc 7.5.0

Steps to reproduce
I am trying to use opentelemetry as a shared library for my project. But I found that when I try to build it as release and shared library, there is some error in Parent-Child discover. The problem can be reproduced by just running the examples/simple/example_simple. Build in Debug(default) in cmake or static library wont show this error.

I use the following code to build.
cmake .. -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
cmake --build . --target all

What is the expected behavior?
In example/simple the parent_span_id of f1 should be span_id of f2. And parent_span_id of f2 should be span_id of library. But both showing 0000000000000000.

What is the actual behavior?

{
  name          : f1
  trace_id      : cbbb2e3813a548c408de4b484dc723a1
  span_id       : ce90a9eaf3214e19
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826620364
  duration      : 7665
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
	service.name: unknown_service
	telemetry.sdk.version: 1.0.0
	telemetry.sdk.language: cpp
	telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}
{
  name          : f1
  trace_id      : 3dc7f1b4da6e2eedff960b136ab6e305
  span_id       : a145e9e7fca8f666
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826785278
  duration      : 1634
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
	service.name: unknown_service
	telemetry.sdk.version: 1.0.0
	telemetry.sdk.language: cpp
	telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}
{
  name          : f2
  trace_id      : 94d5abc1a0839e21a97ab9bb3564cbea
  span_id       : 8c0d473664eeeb48
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826616048
  duration      : 296248
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
	service.name: unknown_service
	telemetry.sdk.version: 1.0.0
	telemetry.sdk.language: cpp
	telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}
{
  name          : library
  trace_id      : 0f0bc2eebeb389b448502ac1f015ed1e
  span_id       : 72fdbadc8e352b5d
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826597245
  duration      : 460802
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
	service.name: unknown_service
	telemetry.sdk.version: 1.0.0
	telemetry.sdk.language: cpp
	telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}

Also the there is a test failed passing.

/home/xxx/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:130: Failure
Value of: span2.at(0)->GetParentSpanId().IsValid()
  Actual: false
Expected: true
/home/xxx/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:142: Failure
Expected equality of these values:
  span1.at(0)->GetTraceId()
    Which is: 16-byte object <11-E7 89-39 C6-F1 C7-94 00-2A 0C-55 77-86 CC-F1>
  span2.at(0)->GetTraceId()
    Which is: 16-byte object <7A-F8 0A-45 00-73 E5-4C 09-3E D9-58 8D-D0 60-F3>
/home/chunhao.shen/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:143: Failure
Expected equality of these values:
  span2.at(0)->GetParentSpanId()
    Which is: 8-byte object <00-00 00-00 00-00 00-00>
  span1.at(0)->GetSpanId()
    Which is: 8-byte object <4D-6D D3-1B 08-D4 0C-9F>
@Pupilsch Pupilsch added the bug Something isn't working label Oct 15, 2021
@lalitb
Copy link
Member

lalitb commented Oct 15, 2021

This is weird. It works fine with ubuntu 18.04 + gcc-4.8 and ubuntu 20.04 + gcc-9.3 in our github CI environment. I did a quick test on ubuntu 18.04 and got the same behavior as yours. Need to debug it further. Thanks for reporting the issue.

@Pupilsch
Copy link
Author

Pupilsch commented Oct 15, 2021

Note that the traceid is also different. Some guesses when compiler doing optimization, the thread local "Stack" in runtime_context behave funny. Just a thought.

@cqhan777
Copy link

cqhan777 commented Oct 16, 2021

@lalitb This is our patch:

--- a/api/include/opentelemetry/context/runtime_context.h
+++ b/api/include/opentelemetry/context/runtime_context.h
@@ -300,7 +300,13 @@ class ThreadLocalContextStorage : public RuntimeContextStorage
     Context *base_;
   };
 
-  Stack &GetStack()
+  static Stack &GetStack() __attribute__((noinline, noclone))
   {
     static thread_local Stack stack_ = Stack();
     return stack_;

It seems to be a bug in gcc that somehow lost track of the static storage property for stack_ when it does the ISRA_CLONE. Disable cloning seems work-around this, although it pays some performance cost. The additional static is not a must, but since it does not need this, it's better to be a static to save a little parameter passing. I understand direct use of attribute is not recommended here, just for your reference. Hope this can help a little.

@lalitb
Copy link
Member

lalitb commented Oct 16, 2021

Thanks, @cqhan777, @Pupilsch, this is helpful.

@lalitb lalitb mentioned this issue Oct 16, 2021
3 tasks
@lalitb lalitb added the help wanted Good for taking. Extra help will be provided by maintainers label Nov 2, 2021
@lalitb lalitb self-assigned this Nov 3, 2021
@github-actions
Copy link

github-actions bot commented Jan 3, 2022

This issue was marked as stale due to lack of activity. It will be closed in 7 days if no furthur activity occurs.

@github-actions
Copy link

Closed as inactive. Feel free to reopen if this is still an issue.

@lalitb
Copy link
Member

lalitb commented Jan 26, 2022

@cqhan777 @Pupilsch - Is this issue still relevant, and reproducible?

@cqhan777
Copy link

Problem still exists but no longer relevant to us since we have the workaround.

@github-actions
Copy link

This issue was marked as stale due to lack of activity. It will be closed in 7 days if no furthur activity occurs.

@github-actions github-actions bot added the Stale label Mar 28, 2022
@github-actions
Copy link

github-actions bot commented Apr 4, 2022

Closed as inactive. Feel free to reopen if this is still an issue.

@github-actions github-actions bot closed this as completed Apr 4, 2022
@lalitb lalitb reopened this Apr 4, 2022
@lalitb lalitb removed the Stale label Apr 4, 2022
@sunyata0
Copy link

sunyata0 commented Apr 7, 2022

@Pupilsch Did you figure out how to solve the issue ?
I have exactly the same problem on my environment :

  • SUSE Linux Enterprise Server 12 SP5
  • gcc 8.5.0
  • C++ 17

But it's working fine on my Ubuntu 20.04 with gcc 9.4.0 (and also 8.4.0) and C++ 17

@lalitb
Copy link
Member

lalitb commented Apr 7, 2022

@Briac-nt If it is the same problem, the patch from @cqhan777 should work for you:

#1014 (comment)

Please confirm.

@sunyata0
Copy link

sunyata0 commented Apr 8, 2022

Hi @lalitb , unfortunately this patch doesn't work for me, the trick in the 1223 issue neither.
I changed the Stack &GetStack() by static Stack &GetStack() __attribute__((noinline, noclone)) and added a console log to be sure the modification was live. The problem is still here, very strange...

@Falmarri
Copy link

I think I have a similar issue, but more complex setup. I'm compiling otel as a static library, but then building a shared library out of that static lib. And it doesn't seem to always lose the stack, only sometimes. Not sure how to provide more info, it's super complex and I only see this happening in my production system

@lalitb
Copy link
Member

lalitb commented Jan 31, 2024

@Falmarri Thanks, can you also add the platform details here - (linux-distro, gcc version, build flags etc)

@Falmarri
Copy link

Falmarri commented Jan 31, 2024

@lalitb This is built on Ubuntu 18.04.6 LTS using

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/10.5.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --disable-multilib --enable-default-pie
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.5.0 (GCC)

I'm not entirely sure this is the same issue now that I've looked a bit furhter. What I'm seeing is that the trace is actually being propagated correctly. That is, I do this

    auto current_ctx = opentelemetry::context::RuntimeContext::GetCurrent();
    propagator->Inject(carrier, current_ctx);

and the remote service correctly gets the right trace/span. So I actually do have the right context. However, the span in question is never actually exported, so that my span never shows up in my collector and the traces are pretty broken.

The reason I think it's related to this though is because it's always this 1 particular span, and this span is in a separate compilation unit from the rest of the shared lib. So how this works is that I have the following:

  • statically built otel libraries
  • libA.a <-- This includes the otel headers but does NOT link against the libs, and does not use the sdk
  • libB.so <-- Links statically against libA.a AND also statically to otel libs. These traces work fine, but the trace defined in libA.a does get propagated, but not reported

I realize this isn't a lot of information but like I said this only happens in our production system and I can't reproduce it on demand or figure out if there's a pattern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working do-not-stale help wanted Good for taking. Extra help will be provided by maintainers
Projects
None yet
Development

No branches or pull requests

5 participants