Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Fix core dump issue in OTEL_INTERNAL_LOG_DISPATCH macro #2394

Closed
wants to merge 1 commit into from

Conversation

ninghejun
Copy link
Contributor

Fixes #2393

Changes

Fix core dump issue in OTEL_INTERNAL_LOG_DISPATCH macro

@ninghejun ninghejun requested a review from a team November 6, 2023 06:22
@ninghejun ninghejun changed the title Log handler dev [SDK] Fix core dump issue in OTEL_INTERNAL_LOG_DISPATCH macro Nov 6, 2023
@ninghejun ninghejun closed this Nov 6, 2023
@ninghejun ninghejun reopened this Nov 6, 2023
Copy link

codecov bot commented Nov 6, 2023

Codecov Report

Merging #2394 (42187e6) into main (ca08c5a) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2394   +/-   ##
=======================================
  Coverage   87.43%   87.43%           
=======================================
  Files         199      199           
  Lines        6030     6030           
=======================================
  Hits         5272     5272           
  Misses        758      758           
Files Coverage Δ
...lude/opentelemetry/sdk/common/global_log_handler.h 72.23% <ø> (ø)

@lalitb
Copy link
Member

lalitb commented Nov 6, 2023

Thanks for the PR. Ideally log_handler->Handle is supposed to be synchronous function and it's default implementation doesn't store or use the pointer after the function returns.
If custom handler requires that the lifetime of the string data outlives the use of the pointer, it could be managed by the Handle function itself, if it immediately copies the string data into its own storage before returning. This can be documented better. In general changes look good, but just want to prevent creating a temporary variable in the hot path if it is avoidable.

std::stringstream tmp_stream; \
tmp_stream << message; \
std::string tmp_string = tmp_stream.str(); \
log_handler->Handle(level, __FILE__, __LINE__, tmp_string.c_str(), attributes); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably there is something I missed. What's is different behavior with this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also didn't get the point, could you please explain this change?
To my understanding, C++ specifications will ensure the temporary object should be freed only after this code block(Handle(...)).

https://en.cppreference.com/w/cpp/language/lifetime#Temporary_object_lifetime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct.

We've previously encountered a core dump caused by c_str(), similar to the situation described in https://stackoverflow.com/questions/27627413/why-does-calling-c-str-on-a-function-that-returns-a-string-not-work.
The code was as follows:

    auto cstr = tmp_stream.str().c_str();
    f(cstr);

This caused cstr to become an invalid pointer passed into function f.

However, in this case, Handle(level, FILE, LINE, tmp_stream.str().c_str(), attributes) is within a single expression, so this issue doesn't exist.

When I saw this core dump, my habitual aversion to the c_str() function led to a misjudgment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the example you shared and aslo the stackoverflow link, the temporary object returned from function reaches the end of its lifetime at the end of the including its expression, which means the temporary object is no longer valid after the ";", but the lifetime of local variable is different and will reach the end of current block, like @owent mentioend.

Copy link
Member

@marcalff marcalff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the issue report and PR.

I do not agree with the root cause analysis, and as a result the code change proposed does not resolve or improve anything, in my understanding.

Details below.

"In this macro, tmp_stream.str().c_str() returns a pointer to a temporary string. This temporary string is the return value of tmp_stream.str(), which will be destructed at the end of the current statement."

Agreed.

"Therefore, it is unsafe to continue using this pointer after the statement ends."

Agreed.

"If the log_handler->Handle function needs to access this string after the current statement, this macro may cause issues."

Do not agree.

If the handle function needs to access the string later, it first must make a copy of it, during the call to Handle().

The root cause of the crash seen is most likely that a raw pointer is held instead.

As @lalitb pointed out, this could be documented better to set expectations, but is not a bug caused by a temporary variable.

The "current" statement is:

log_handler->Handle(level, __FILE__, __LINE__, tmp_stream.str().c_str(), attributes);

The temporary used by tmp_stream.str() will be destroyed after the current statement ends, that is, it still exists during the call to Handler().

The code change modifies this slightly, but the problem still remains that accessing the message after the Handle() method returns is not expected.

@ninghejun
Copy link
Contributor Author

The previous contributors mentioned the issue of asynchrony.
When I inherit from ::opentelemetry::sdk::common::internal_log::LogHandler, I realize that the input parameter of the function
void Handle(::opentelemetry::sdk::common::internal_log::LogLevel level, const char* file, int line, const char* msg, const ::opentelemetry::sdk::common::AttributeMap& attributes) noexcept override;
is const char* msg. This means I can't manage the lifecycle of the parameter, so it's certain that I can't hold its pointer for asynchronous processing.
Therefore, I used a completely synchronous method. However, my service still has a very small chance of core dumping at this point. Now, I need to reproduce this core dump with the smallest example, and then I can analyze its cause.

I'll study it further and report back to everyone once I make progress.

@owent
Copy link
Member

owent commented Nov 8, 2023

The previous contributors mentioned the issue of asynchrony. When I inherit from ::opentelemetry::sdk::common::internal_log::LogHandler, I realize that the input parameter of the function void Handle(::opentelemetry::sdk::common::internal_log::LogLevel level, const char* file, int line, const char* msg, const ::opentelemetry::sdk::common::AttributeMap& attributes) noexcept override; is const char* msg. This means I can't manage the lifecycle of the parameter, so it's certain that I can't hold its pointer for asynchronous processing. Therefore, I used a completely synchronous method. However, my service still has a very small chance of core dumping at this point. Now, I need to reproduce this core dump with the smallest example, and then I can analyze its cause.

I'll study it further and report back to everyone once I make progress.

As metioned by @marcalff , you should copy the strings and move them if you want to handle this log asynchrony. Move the temporary tmp_stream.str() to a named local variable will not solve lifetime problem.

@ninghejun
Copy link
Contributor Author

ninghejun commented Nov 8, 2023

The previous contributors mentioned the issue of asynchrony. When I inherit from ::opentelemetry::sdk::common::internal_log::LogHandler, I realize that the input parameter of the function void Handle(::opentelemetry::sdk::common::internal_log::LogLevel level, const char* file, int line, const char* msg, const ::opentelemetry::sdk::common::AttributeMap& attributes) noexcept override; is const char* msg. This means I can't manage the lifecycle of the parameter, so it's certain that I can't hold its pointer for asynchronous processing. Therefore, I used a completely synchronous method. However, my service still has a very small chance of core dumping at this point. Now, I need to reproduce this core dump with the smallest example, and then I can analyze its cause.
I'll study it further and report back to everyone once I make progress.

As metioned by @marcalff , you should copy the strings and move them if you want to handle this log asynchrony. Move the temporary tmp_stream.str() to a named local variable will not solve lifetime problem.

Thank you all for your help.

I tried not calling GlobalLogHandler::SetLogHandler, so it will use the default DefaultLogHandler.
However, the core dump issue still persists, and the service still crashes at the same place: std::ostream::sentry::~sentry().

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6a063ff5e8 in std::ostream::sentry::~sentry() () from /lib64/libstdc++.so.6
[Current thread is 1 (Thread 0x7f69fb5ff700 (LWP 654))]
(gdb) bt
#0  0x00007f6a063ff5e8 in std::ostream::sentry::~sentry() () from /lib64/libstdc++.so.6
#1  0x00007f6a063ffd32 in std::basic_ostream >& std::__ostream_insert >(std::basic_ostream >&, char const*, long) ()
   from /lib64/libstdc++.so.6
#2  0x0000000001d69d37 in opentelemetry::v1::exporter::otlp::OtlpHttpLogRecordExporter::Export(opentelemetry::v1::nostd::span >, 18446744073709551615ul> const&) ()
#3  0x0000000001db0870 in opentelemetry::v1::sdk::logs::BatchLogRecordProcessor::Export() ()
#4  0x0000000001dafd4f in opentelemetry::v1::sdk::logs::BatchLogRecordProcessor::DoBackgroundWork() ()
#5  0x00007f6a063a6b73 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#6  0x00007f6a04d1e2de in start_thread () from /lib64/libpthread.so.0
#7  0x00007f6a04837e83 in clone () from /lib64/libc.so.6
(gdb) 

The corresponding code is :

  opentelemetry::sdk::common::ExportResult result = http_client_->Export(service_request);
  if (result != opentelemetry::sdk::common::ExportResult::kSuccess)
  {
    OTEL_INTERNAL_LOG_ERROR("[OTLP HTTP Client] ERROR: Export "
                            << log_count << " log(s) error: " << static_cast<int>(result));
  }

After macro expansion, it becomes

std::stringstream tmp_stream; 
tmp_stream << "[OTLP HTTP Client] ERROR: Export " << log_count << " log(s) error: " << static_cast<int>(result));. 

I found through gdb that the argument for std::__ostream_insert at that time was "[OTLP HTTP Client] ERROR: Export ".
In fact, it crashes at tmp_stream << "[OTLP HTTP Client] ERROR: Export ", during the first call to the operator <<.
and after it calls std::__ostream_insert, it crashes in the std::ostream::sentry::~sentry() destructor.

So it had a core dump before entering log_handler->Handle(), during the execution of tmp_stream << message.

This is puzzling to me. I believe that it is absolutely impossible for it to crash at this location. Therefore, there might be some hidden issues elsewhere. It seems like there might be a wild pointer issue in the service, causing memory corruption. But why would it always crash in this function? It seems too coincidental.

I tried using the -DOTEL_INTERNAL_LOG_LEVEL=-1 compilation option to disable the OTEL_INTERNAL_LOG_ERROR log. After doing this, the core dump issue seems to be resolved. However, I don't understand where the problem was.

@marcalff
Copy link
Member

marcalff commented Nov 8, 2023

@ninghejun Thanks for the details.

If std::stringstream itself does not work and crashes, this looks like the first item to investigate and resolve, and is independent of opentelemetry-cpp.

Typically in these cases, I would start to audit compiling and link options, to make sure the build itself is sane.

Next, I would add some sample code in the application that writes to a std::stringstream and reads the data back, and see if that alone works.

Let's continue this investigation and discussion in issue #2393, because this does not concern the PR at hand.

As for the PR, we all agree now that the change is not needed, and does not fix the (yet to be determined) root cause of the crash seen in your application, so the PR needs to be closed.

@marcalff marcalff closed this Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OTEL_INTERNAL_LOG_DISPATCH Macro's Temporary String's c_str Method Causes Core Dump
5 participants