Enables duration queries by promoting Span.duration, timestamp #807

Merged
merged 1 commit into from
Nov 6, 2015

Conversation

codefromthecrypt
Member

A commonly requested feature is querying by duration, for example showing traces longer than 1 minute.

While the UI supports ordering by duration, it is a game of chance whether the traces returned happen to be long or not. Even if you put a limit of 1000, you still may not find the longest trace (as your storage system may have a million traces in it).

The proposed path is to make Span.timestamp and Span.duration top-level fields. Both formerly existed in the scala and mustache models, but not in the thrift.

We could implement duration queries without changing thrifts. For example, this could be an implementation detail of each storage system. However, this limits testing to side-effects and keeps essential details in scala code, making portability harder.

Also, we have other reasons to top-level these fields:

Here's a convenience paste of the definitions of Span.timestamp and Span.duration

  /**
   * Microseconds from epoch of the creation of this span.
   *
   * This value should be set directly by instrumentation, using the most
   * precise value possible. For example, gettimeofday or syncing nanoTime
   * against a tick of currentTimeMillis.
   *
   * For compatibility with instrumentation that precedes this field, collectors
   * or span stores can derive this via Annotation.timestamp.
   * For example, SERVER_RECV.timestamp or CLIENT_SEND.timestamp.
   *
   * This field is optional for compatibility with old data: first-party span
   * stores are expected to support this at time of introduction.
   */
  10: optional i64 timestamp,
  /**
   * Measurement of duration in microseconds, used to support queries.
   *
   * This value should be set directly, where possible. Doing so encourages
   * precise measurement decoupled from problems of clocks, such as skew or NTP
   * updates causing time to move backwards.
   *
   * For compatibility with instrumentation that precedes this field, collectors
   * or span stores can derive this by subtracting Annotation.timestamp.
   * For example, SERVER_SEND.timestamp - SERVER_RECV.timestamp.
   *
   * If this field is persisted as unset, zipkin will continue to work, except
   * duration query support will be implementation-specific. Similarly, setting
   * this field non-atomically is implementation-specific.
   *
   * This field is i64 vs i32 to support spans longer than 35 minutes.
   */
  11: optional i64 duration
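
The 35-minute figure in the comment above follows from counting microseconds in a signed 32-bit integer. A quick check of that arithmetic, written in Python purely for illustration (the constant and function names are assumptions, not part of the Zipkin code base):

```python
# Largest values representable in Thrift's signed i32 and i64 types.
I32_MAX = 2**31 - 1
I64_MAX = 2**63 - 1

MICROS_PER_SECOND = 1_000_000


def max_minutes(max_micros: int) -> float:
    """Longest duration, in minutes, that fits when counting microseconds."""
    return max_micros / MICROS_PER_SECOND / 60


# Signed i32: a little under 36 minutes.
i32_minutes = max_minutes(I32_MAX)

# The same 32 bits read as unsigned: a little under 72 minutes.
u32_minutes = max_minutes(2**32 - 1)

# Signed i64: far beyond any realistic span, expressed here in 365-day years.
i64_years = max_minutes(I64_MAX) / 60 / 24 / 365
```

Even read as unsigned, 32 bits would top out well below the hour-long spans raised in review, which is consistent with the choice of i64.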

* example, SERVER_SEND.timestamp - SERVER_RECV.timestamp.
*
* Note that this should be treated unsigned. i32 implies trace durations are
* not longer than 35.79 minutes.
Contributor

I don't think that's a reasonable limitation. Just the other day I had a discussion with the data team where they sometimes have spans measured in hours. I would go for i64.

Member Author

fair enough.

Member Author

done

@yurishkuro
Contributor

In order to support a query by duration, we need to have an index of (service name, duration) -> traceId limited by startTs. We've always had implicit Span.duration and Span.startTs fields. To ensure we can support duration queries, we need to make these explicit.

The last sentence does not follow from the requirements. Yes, we do need an index, but we don't need to capture start/duration in the collected spans, we just need a way to calculate it in the collector. We could use cs/cr for that, provided they reach the collector in a single submission from instrumentation.
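
To make the index requirement concrete, here is a minimal in-memory sketch of a (service name, duration) -> traceId index bounded by a time window. It is an illustration under stated assumptions, not how any Zipkin span store implements it: the class `DurationIndex`, the `IndexEntry` tuple, and the `end_ts`/`lookback` parameters are all hypothetical names.

```python
from bisect import insort
from collections import defaultdict
from typing import NamedTuple


class IndexEntry(NamedTuple):
    duration: int   # microseconds
    timestamp: int  # span start, microseconds from epoch
    trace_id: int


class DurationIndex:
    """Hypothetical per-service index, kept sorted by duration."""

    def __init__(self):
        self._by_service = defaultdict(list)

    def add(self, service_name, duration, timestamp, trace_id):
        # insort keeps each service's entries ordered by (duration, timestamp, trace_id).
        insort(self._by_service[service_name],
               IndexEntry(duration, timestamp, trace_id))

    def query(self, service_name, min_duration, end_ts, lookback, limit=10):
        """Return trace ids, longest first, that started within the window."""
        start_ts = end_ts - lookback
        matches = []
        for entry in reversed(self._by_service.get(service_name, [])):
            if entry.duration < min_duration:
                break  # remaining entries are shorter still
            in_window = start_ts <= entry.timestamp <= end_ts
            if in_window and entry.trace_id not in matches:
                matches.append(entry.trace_id)
                if len(matches) == limit:
                    break
        return matches


index = DurationIndex()
index.add("web", 90_000_000, 1_000, trace_id=1)   # 90 seconds
index.add("web", 5_000, 1_500, trace_id=2)        # 5 milliseconds
index.add("web", 120_000_000, 2_000, trace_id=3)  # 2 minutes
index.add("db", 70_000_000, 1_200, trace_id=4)

# Traces for "web" longer than one minute, longest first.
longer_than_a_minute = index.query("web", 60_000_000, end_ts=3_000, lookback=3_000)
```

Whether the duration fed to `add` arrives in the span or is computed in the collector is exactly the question under discussion; the index itself is indifferent to that.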

@yurishkuro
Contributor

These are my concerns with adding the fields to thrift:

  • it suggests to the instrumentation libs to send redundant data, since they are already sending the start/end annotations from which startTs/duration can be derived. The Zipkin story about what and how the instrumentation libs have to send is already super convoluted.
  • In the RPC case, the start/duration can be derived either from server annotations or from client annotations, depending on where the instrumentation exists. There isn't really such a thing as a "span" in the ingestion pipeline, more like multiple slices of a span. A slice + annotations is not ambiguous (although "span name" is), but slice + start/duration definitely is. So someone needs to resolve that somewhere, and I think it's better resolved using the raw data of annotations than using potentially incorrectly calculated start/duration.
  • In cases when instrumentation cannot send the complete span, some other mechanism, like a Spark pipeline, can do post-processing and calculate start/duration having seen the complete span in the storage (multiple slices stored independently). It would certainly be storage-implementation-dependent how such a post-calculated result would be stored. In Cassandra we'd add a record to the index by duration. If other stores have a record representing a complete span, they would have to make an update to that record. Forcing Cassandra to store an update of post-calculation would be pretty bad for performance.
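
The ambiguity described in the second point can be sketched as follows. This is only an illustration: slices are modelled as plain dictionaries, the helper names `merge_slices` and `start_and_duration` are hypothetical, and preferring the client's annotations (cs/cr) over the server's (sr/ss) is an assumption made for the example, not a rule stated in this thread.

```python
from collections import defaultdict


def merge_slices(slices):
    """Combine annotations of slices that share the same (trace_id, span_id)."""
    merged = defaultdict(dict)
    for s in slices:
        merged[(s["trace_id"], s["span_id"])].update(s["annotations"])
    return dict(merged)


def start_and_duration(annotations):
    """Derive (start, duration) in microseconds from annotation timestamps.

    Assumption: the client's view (cs/cr) wins when both views are present.
    Returns None when neither pair is complete.
    """
    for begin, end in (("cs", "cr"), ("sr", "ss")):
        if begin in annotations and end in annotations:
            return annotations[begin], annotations[end] - annotations[begin]
    return None


# Two slices of one RPC span, reported separately by client and server.
client_slice = {"trace_id": 1, "span_id": 7, "annotations": {"cs": 100, "cr": 400}}
server_slice = {"trace_id": 1, "span_id": 7, "annotations": {"sr": 150, "ss": 350}}

# Each slice alone yields a different answer.
client_view = start_and_duration(client_slice["annotations"])  # (100, 300)
server_view = start_and_duration(server_slice["annotations"])  # (150, 200)

# Resolving from the merged raw annotations gives a single answer.
merged = merge_slices([client_slice, server_slice])
resolved = start_and_duration(merged[(1, 7)])                  # (100, 300)
```

If each slice instead carried its own precomputed start/duration, the two values (300 versus 200 here) would conflict, with no raw data left to decide between them.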

@yurishkuro
Contributor

btw, what is your concern with having these two fields be implementation-specific? The span API that the query service deals with can have them as def's, and storage implementations fill them in. In many cases the actual calculation should be the same, e.g. if it's from annotations, so that logic can be refactored to a base class or composition. I thought right now it's already in the shared class.

@yurishkuro
Contributor

ok, I'll shut up, go for it. I believe in my system we may be able to calculate those fields in the collector.

@codefromthecrypt
Member Author

Glad it might work. Just added some more detail around how I see this fitting in with local spans. PTAL, as it may make more sense within this context #808

@codefromthecrypt
Member Author

Re-summarized in the thrift definition and WIP description. I removed my related redundant comments.

@codefromthecrypt changed the title from "Enables duration queries by promoting Span.duration, startTs" to "Enables duration queries by promoting Span.duration, timestamp" Nov 2, 2015
codefromthecrypt pushed a commit that referenced this pull request Nov 4, 2015
This moves existing code around the notion of Span.timestamp,
Span.duration discussed in #807.

The impact from a user POV is minimal. `endTs`, used by the query and UI
formerly looked for the last timestamp in a trace. What this meant is
that if someone clicked search, waited, then clicked search again with
the same `endTs`, an in-flight trace may "disappear" if it has new
activity. Since `endTs` is now based on a stable point (the start), a
trace wouldn't disappear anymore. This impact is so subtle that it is
barely worth discussing.

The primary motivation for this change is to simplify the commodity task
of timestamping and duration stamping spans. This is discussed in #807, and
directly supports a new minimal design of local spans (#808).
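
The `endTs` behaviour described in this commit message can be illustrated with a small sketch. It assumes a trace is just a list of annotation timestamps, and the two filter functions are hypothetical stand-ins for the old and new query behaviour rather than actual Zipkin code:

```python
def visible_by_last_activity(traces, end_ts):
    """Former behaviour: compare endTs against the last timestamp in a trace."""
    return [tid for tid, stamps in traces.items() if max(stamps) <= end_ts]


def visible_by_start(traces, end_ts):
    """New behaviour: compare endTs against the stable start of the trace."""
    return [tid for tid, stamps in traces.items() if min(stamps) <= end_ts]


traces = {"a": [100, 200]}
end_ts = 250

# First search: the in-flight trace is returned either way.
first_old = visible_by_last_activity(traces, end_ts)
first_new = visible_by_start(traces, end_ts)

# The trace records new activity after endTs, then the same search is repeated.
traces["a"].append(300)
second_old = visible_by_last_activity(traces, end_ts)  # trace "disappears"
second_new = visible_by_start(traces, end_ts)          # trace is still returned
```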
@codefromthecrypt
Member Author

Note I plan to start on this today, as soon as the supporting changes are released as 1.21.1

@codefromthecrypt
Member Author

ETA tomorrow!

@codefromthecrypt
Member Author

expect an update tomorrow pacific AM. Just polishing tests

codefromthecrypt pushed a commit that referenced this pull request Nov 6, 2015
Before this change, Span.timestamp and duration were always derived
lazily from annotations. This decouples that logic by converting the
scala methods to vals and populating them with
`ApplyTimestampAndDuration`.

This serves us in at least two ways. First, the implicit association
between annotations and timestamp or duration was tested in various
components. Making these explicit centralizes the responsibility, and
lowers the test burden on other components. Also, we want to formalize
these fields in persisted models in support of duration queries and
local spans (#807). Organizing logic ahead of this work makes the change
to persistence simpler.
With this in place, instrumentation can send timestamp and duration
explicitly, which facilitates local spans or other spans that don't log
RPC annotations.
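
As a rough sketch of the idea behind `ApplyTimestampAndDuration`: explicit values win, and annotations are used only to backfill what is missing. The real implementation is Scala and may differ in its details. The `Span` dataclass and the function below are simplified, hypothetical stand-ins, and leaving duration unset when only one annotation exists is an assumption of this example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Span:
    name: str
    # Each annotation is (timestamp in microseconds, value such as "sr" or "ss").
    annotations: Tuple[Tuple[int, str], ...] = ()
    timestamp: Optional[int] = None
    duration: Optional[int] = None


def apply_timestamp_and_duration(span: Span) -> Span:
    """Backfill timestamp/duration from annotations when not set explicitly."""
    complete = span.timestamp is not None and span.duration is not None
    if complete or not span.annotations:
        return span
    times = sorted(ts for ts, _ in span.annotations)
    first, last = times[0], times[-1]
    timestamp = span.timestamp if span.timestamp is not None else first
    duration = span.duration
    if duration is None and last > first:
        duration = last - first
    return replace(span, timestamp=timestamp, duration=duration)


# An RPC span from older instrumentation: values are derived from sr/ss.
rpc = apply_timestamp_and_duration(
    Span("get", annotations=((100, "sr"), (350, "ss"))))

# A local span with no RPC annotations: explicit values pass through untouched.
local = apply_timestamp_and_duration(
    Span("parse-config", timestamp=10, duration=5))
```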
@codefromthecrypt
Member Author

ok. this is finally complete. Once merged, I'll raise a pull request for duration query.

codefromthecrypt pushed a commit that referenced this pull request Nov 6, 2015
Enables duration queries by promoting Span.duration, timestamp
@codefromthecrypt merged commit b175e24 into master Nov 6, 2015
@codefromthecrypt deleted the duration branch November 6, 2015 20:32
@codefromthecrypt added the enhancement and model (Modeling of traces) labels Oct 26, 2018