The key problem we encounter while trying to create bug-free code is reproducing bugs.
Our users report bugs in sufficient number and with sufficient detail. However, reproducing bugs reported in the wild is now becoming the key problem. Triggering them has proven extremely hard, given our dependencies on thread scheduling, OS version, library versions, incoming network packets, obscure firewall or blocking software, and database content.
One idea is to ask users to record the full state of their Python interpreter in a ring buffer of many GBytes. When users are able to trigger a bug, they press "stop recording" and submit a bug report to our system. Using the source code, we should then be able to piece together what happened.
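To make the ring-buffer idea concrete, here is a minimal sketch, assuming we only record cheap per-event facts (thread id, event type, file, function, line) rather than the full interpreter state. The `RingTracer` class and the `buggy` function are hypothetical names used for illustration. It relies only on the standard `sys.settrace` hook and a bounded `collections.deque`.

```python
import collections
import sys
import threading


class RingTracer:
    """Keep only the most recent `capacity` trace events (hypothetical helper)."""

    def __init__(self, capacity=10_000):
        # A deque with maxlen drops the oldest entry once it is full,
        # which gives ring-buffer behaviour without manual index handling.
        self.events = collections.deque(maxlen=capacity)

    def _trace(self, frame, event, arg):
        code = frame.f_code
        self.events.append(
            (threading.get_ident(), event, code.co_filename,
             code.co_name, frame.f_lineno)
        )
        # Returning the tracer keeps 'line' and 'return' events coming
        # for this frame.
        return self._trace

    def start(self):
        # sys.settrace only affects the calling thread; threading.settrace
        # would be needed to cover threads started later.
        sys.settrace(self._trace)

    def stop(self):
        sys.settrace(None)
        return list(self.events)


def buggy(n):
    total = 0
    for i in range(n):
        total += i
    return total


tracer = RingTracer(capacity=5)
tracer.start()
buggy(100)
snapshot = tracer.stop()  # the last 5 events before "stop recording"
```

A real recorder would also need to capture variable values and would have to keep the tracing overhead acceptable; `sys.settrace` slows execution considerably, so this only illustrates the buffering, not a deployable design.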
Related work: an OSDI 2018 paper on debugging. Microsoft uses "lightweight hardware tracing" to address this problem: REPT: Reverse Debugging of Failures in Deployed Software
In this paper, we present REPT, a practical system that enables reverse debugging of software failures in deployed systems. REPT reconstructs the execution history with high fidelity by combining online lightweight hardware tracing of a program’s control flow with offline binary analysis that recovers its data flow. It is seemingly impossible to recover data values thousands of instructions before the failure due to information loss and concurrent execution. REPT tackles these challenges by constructing a partial execution order based on timestamps logged by hardware and iteratively performing forward and backward execution with error correction.
REPT leverages Intel Processor Trace (PT) to log control-flow and timing information of a program’s execution. Intel PT became available when the Broadwell architecture was released in 2014. Intel PT supports various program tracing modes, and REPT currently uses the per-thread circular buffer mode to trace user-space execution of all threads within a process.
https://github.com/01org/satt
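As a rough software analogue of per-thread circular buffers with timestamps, the sketch below keeps one bounded buffer per thread and merges them by timestamp afterwards. This is a toy illustration in pure Python, not Intel PT and not REPT's algorithm. The `PerThreadRings` class and `worker` function are hypothetical. Unlike REPT, which only derives a partial order from coarse hardware timestamps, this sketch simply sorts by a software clock.

```python
import collections
import heapq
import threading
import time


class PerThreadRings:
    """One bounded event buffer per thread (hypothetical helper)."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._rings = {}
        self._lock = threading.Lock()

    def log(self, message):
        tid = threading.get_ident()
        ring = self._rings.get(tid)
        if ring is None:
            # Only ring creation takes the lock; appends stay thread-local.
            with self._lock:
                ring = self._rings.setdefault(
                    tid, collections.deque(maxlen=self.capacity)
                )
        ring.append((time.monotonic_ns(), tid, message))

    def merged(self):
        # Each ring is already ordered by its own thread's clock reads,
        # so a k-way merge on the timestamp is enough.
        with self._lock:
            snapshots = [list(ring) for ring in self._rings.values()]
        return list(heapq.merge(*snapshots, key=lambda e: e[0]))


rings = PerThreadRings(capacity=16)


def worker(name):
    for step in range(3):
        rings.log(f"{name} step {step}")


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

events = rings.merged()
```

Keeping the buffers per thread avoids contention on the hot logging path, at the cost of needing a merge step when the trace is collected.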
towards fault-free software
There has been prior work in this area. We have struggled with debugging in the past; see the ticket on Linux OS catching debug info.