
KSQL-1787: First draft of Time and Windows topic #2134

Merged
merged 13 commits, Nov 28, 2018

Conversation

@JimGalasyn (Member) commented Nov 7, 2018:

No description provided.

handle late-arriving records.

.. image:: ../img/ksql-window.png

Member:

Not sure how precise the graphic should be, but note that window start timestamp is inclusive while window end timestamp is exclusive. This property is important for non-overlapping windows, for which each record should be contained in exactly one window.

Member Author:

I make a note of this later in the topic.

which occur at particular times. Three important times in a record's
life cycle are *Event-time*, *Ingestion-time*, and *Processing-time*.

Representing time consistently enables aggregation operations o streams,

Member:

nit: "o streams"?

Representing time consistently enables aggregation operations o streams,
like SUM, that have time boundaries.

KSQL and Kafka Streams support the following notions of time:

Member:

remove "and Kafka Streams" (who cares?)

Contributor:

Not me :)

real-time pipelines based on Apache Kafka and Kafka Streams) or it might be
hours, like for batch pipelines based on Apache Hadoop or Apache Spark.

Kafka Streams assigns a timestamp to every data record by using

Member:

Kafka Streams -> KSQL ?

progress of a stream over time. Timestamps are used by time-dependent
operations, like joins.

We call it the *application event-time* to differentiate

Member:

This term is new to me... Where does it come from? Just curious.

Member Author:

It was "event-time of the application" in the original -- not sure if it's even meaningful.

Member:

cc @miguno Can you help out here? IMHO, it's important that we standardize our language.


Developers can thus enforce different notions/semantics of time depending on their business needs.

Finally, whenever a Kafka Streams application writes records to Kafka,

Member:

Kafka Streams -> KSQL ?

aspects of time such as time zones and calendars are correctly
synchronized – or at least understood and traced – throughout your stream data pipelines.
It often helps, for example, to agree on specifying time information in UTC or in Unix time
(such as seconds since the epoch). You should also not mix topics with different time semantics.

Member:

This actually has a larger impact. In fact, Kafka Streams is based on Unix epoch time in the UTC timezone. This affects time windows. For example, if you define a 24h tumbling time window, it will be in the UTC timezone... This might not be appropriate if you want daily windows in your timezone.
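
For illustration, a sketch of the behavior described above (stream and column names are hypothetical): a daily tumbling window in KSQL is aligned to Unix-epoch UTC boundaries, so its buckets roll over at 00:00 UTC rather than at local midnight.

-- Sketch: window boundaries fall at 00:00 UTC, not at local midnight,
-- because windows are aligned to the Unix epoch.
SELECT regionid, COUNT(*)
  FROM pageviews
  WINDOW TUMBLING (SIZE 24 HOURS)
  GROUP BY regionid;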

operations such as aggregations or joins into so-called windows. Windows are tracked
per record key.

Windowing operations are available in the Kafka Streams DSL. When working with windows,

Member:

I thought this is KSQL docs? Why do we talk so much about Kafka Streams?


Windowing operations are available in the Kafka Streams DSL. When working with windows,
you can specify a retention period for the window. This retention period controls how
long Kafka Streams will wait for out-of-order or late-arriving data records for a given window.

Member:

This is not true any longer in 5.1 (I know that this PR is against 5.0, and I'm also not sure how KSQL exposes both anyway -- just want to point it out to make you aware of this) -- we added a new configuration that we call "grace period" that determines how long we wait before we close a window. Retention time is still a valid parameter that defines how long we store the (potentially) closed window -- we do this to allow accessing the window via Interactive Queries even if it's already closed.

.. image:: ../img/ksql-window.png


.. image:: ../img/ksql-window-aggregation.png

@mjsax (Member), Nov 8, 2018:

With regard to my comment above: Session Window start and end time are both inclusive (in contrast to time-windows), and there is always a record in the session window with the start and end timestamps (because the timestamps of the first and last record in the window define the window start and end time).

@big-andy-coates requested a review from a team, November 8, 2018 17:40

@big-andy-coates (Contributor) left a comment:

Hey @JimGalasyn - ran out of time, got to dash.

Only looked at the first parts - I'm with Matthias - this is currently reading like KStreams docs, not KSQL!

Sorry, got to dash...

Time and Windows
################

During the life cycle of a record, it experiences a number of events

Contributor:

I think this is a little confusing. The record is the event, it doesn't experience events. Nor does it have a lifecycle in this context.

Maybe just have:

When talking about time it is important to be clear about which time is being discussed. The time an event occurred may be different to the time it was ingested into Kafka, or the time at which the processing of the event is occurring.

???

Member Author:

I'd like to get very crisp about this -- my thinking is that in the KSQL world, "events" and "messages" are abstracted away, and we have instead "records". But maybe this isn't the way we want to think about it...

Member:

I agree that there is nothing like a life cycle.

I also agree that we should abstract over events/messages and use the term records -- we use records in the KS docs, too, for the same reason. "Message" is the standard term for core Kafka, ie, pub/sub.

@big-andy-coates (Contributor), Nov 9, 2018:

I'm fine with 'records'. The reason I mentioned 'events' is because of the use of 'events' here:

it experiences a number of events

The record does not experience anything. It is an immutable bit of data / information.

From a KSQL point of view, a record has a timestamp, (i.e. the timestamp of the record in Kafka). This is the default timestamp that KSQL will use. This may be an 'event-time' or just the 'ingestion-time' depending on what the record's producer did when producing the record to Kafka. (The producer can choose to set the timestamp to any time they like, but most often chooses either some sensible 'event-time' or the current time, which would be equivalent to 'ingestion-time'. If they don't set a timestamp, it's set to 'ingestion-time' by Kafka.)

When importing the topic into KSQL it will, by default, use this timestamp for the record. Alternatively, a user can use the WITH(TIMESTAMP='some-field') to tell KSQL to use an alternative timestamp from the record's value, (with an optional TIMESTAMP_FORMAT). The field the user specifies may be an 'event-time' or an 'ingestion-time'.

I think I'm right in saying KSQL currently doesn't support any 'processing-time' operations, (though they're probably possible with UDFs to access the current time).

So.. from KSQL perspective, I don't think the type of the timestamp, i.e. event vs ingestion vs processing time, really matters that much. KSQL just uses the timestamp that you tell it to.

That said, a little background explaining that different concepts of time exist and that it's important to know which one you're discussing is probably worthwhile as a preamble, just don't talk about records having a lifecycle or experiencing events ;)

One very important thing to call out is that the KSQL rows' timestamp, (be that from the record's timestamp or a field within the value), must use a consistent concept of time. For example, things will go awry if some records were written to Kafka with the field containing some event-time, while others got the default ingest-time set by the broker. It is important that all records within a topic use a consistent notion of time. What that notion is doesn't really matter to KSQL.
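
To make the WITH(TIMESTAMP='some-field') idea above concrete, here is a hedged sketch (stream and column names are hypothetical; as noted further down in this thread, the field must hold a Unix epoch time in milliseconds unless a TIMESTAMP_FORMAT is given):

-- Sketch: KSQL uses the event_ts column (Unix epoch milliseconds)
-- as each row's timestamp, instead of the Kafka record's own timestamp.
CREATE STREAM pageviews (
    viewtime BIGINT,
    userid VARCHAR,
    pageid VARCHAR,
    event_ts BIGINT
  ) WITH (
    KAFKA_TOPIC='pageviews',
    VALUE_FORMAT='JSON',
    TIMESTAMP='event_ts'
  );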

@mjsax (Member), Nov 9, 2018:

This may be an 'event-time' or just the 'ingestion-time' depending on what the record's producer did when producing the record to Kafka. (The producer can choose to set the timestamp to any time they like, but most often chooses either some sensible 'event-time' or the current time, which would be equivalent to 'ingestion-time'. If they don't set a timestamp, it's set to 'ingestion-time' by Kafka.)

IMHO, this is not 100% correct (with respect to our current terminology). (1) Brokers allow configuring a topic with either CreateTime (default) or LogAppendTime -- for the first config, the timestamp will be whatever the producer sets. For the second config, the broker always sets the timestamp. The second case implies "ingestion time" semantics. Thus, if LogAppendTime is configured, the producer has no control whatsoever over the timestamp.

If CreateTime is set, the following holds for the producer:

  • if a ProducerRecord is created, by default it contains no timestamp
  • users can set the timestamp on ProducerRecord explicitly though
  • if the timestamp is not set, on producer.send() the current wall-clock time will be set automatically

Note that we consider all three cases as "event-time" and don't distinguish them in our current terminology. I agree that semantically we could follow Andy's suggestion, but I believe this would make it more complex for users to understand the subtle differences. Thus, I would prefer to stay with our current (maybe simplified) terminology.

For example, things will go awry if some records were written to Kafka with the field containing some event-time, while others got the default ingest-time set by the broker.

I guess this is not possible.

However, another important thing to point out: if a WITH(timestamp=...) clause is used, this timestamp must be passed to Kafka Streams as "unix epoch time in milliseconds" -- not sure how much control the user has over this, and what part KSQL handles. If the user could mess this up (ie, KSQL cannot ensure this timestamp format/semantic), we need to tell users about it.

Member Author:

This is all great stuff, please keep it coming! We'll get a KSQL-focused topic out of this yet.

real-time pipelines based on Apache Kafka and Kafka Streams) or it might be
hours, like for batch pipelines based on Apache Hadoop or Apache Spark.

Kafka Streams assigns a timestamp to every data record by using

@big-andy-coates (Contributor), Nov 8, 2018:

This is a KStreams implementation detail. What does KSQL support? It doesn't care what type of time a timestamp may refer to.

You can also switch which field KSQL gets its timestamp from. Default is the record's timestamp, but it can also grab a timestamp out of a field in the value.

Does KSQL support anything else when it comes to timestamps? cc @dguy

@JimGalasyn (Member Author):

@mjsax @big-andy-coates Thanks for the early feedback! Just so you know, this topic isn't even half-baked yet -- I only posted the PR to keep track of it. :) The "Time" section has been copy-edited from the original KStreams Concepts topic, which is why it reads like KStreams content (I had to start somewhere). The Windows section is literally just copy-pasted from KStreams, I haven't had a chance to edit it at all yet.

So expect this topic to change completely as I work on it -- but do keep the feedback coming!

@big-andy-coates (Contributor) left a comment:

Hey @JimGalasyn - it's great to see such docs being added.

At the moment the docs seem more focused on KStreams functionality, or how KSQL does things under the hood. I think it would benefit from reworking to be more 'KSQL user' orientated, i.e. from the point of view of someone using KSQL. It should focus on KSQL concepts and terminology and try to avoid muddying the water with implementation details.

Representing time consistently enables aggregation operations on streams,
like SUM, that have time boundaries.

KSQL supports the following notions of time:

Contributor:

As above, KSQL doesn't care what notion of time a timestamp has. I think it's beneficial to explain the main categories of event time, it just needs this bit reworking.

I would also separate this discussion of notions of time from anything Kafka specific. What type of timestamp is stored in a Kafka record's timestamp field, or some field in the record's value, is completely arbitrary, as producing apps can put whatever they like in there. It can be the event-time, some concept of ingestion-time (i.e. ingested into the platform at time x, which would be different to ingested into Kafka at time y), or something random.

Member:

It can be the event-time, some concept of ingestion-time (i.e. ingested into the platform at time x, which would be different to ingested into Kafka at time y), or something random.

That is correct. However, we should consider all of those as "event-time". From a KSQL point of view, it's unknown how the timestamp is set -- thus, for KSQL it is just event-time IMHO. If a user sets event-time in a crazy manner, it's the user's fault.

Ingestion-time would be an approximation of event-time, but it's technically still event-time.

Processing-time would only hold if KSQL used the current wall-clock time when reading the record -- I think this is not supported in KSQL, as users cannot set a custom TimestampExtractor, from my understanding?

I think it overall boils down to: if data is reprocessed, do we use the same timestamps? For event-time this is true (also for ingestion-time) -- for processing-time, it is not.


KSQL supports the following notions of time:

Event-time

Contributor:

The time when a record is created by the data source

Sorry for being pedantic, but this is not strictly correct. 'Event-time' is the time at which the event occurred. End of.

It's possible the data-source, or the producing app, or whatever, is producing a record to Kafka for an event that happened at some point in the past. The 'event-time' in this context would be the time the event occurred, i.e. the historical point in time.

I would change the first paragraph to something like:

"The time an event occurred. For example, if the event is a geo-location change reported by a GPS sensor in a car, the associated event-time is the time when the GPS sensor captured the location change."

As this is clean and explicit. Maybe followed by something like:

"A record may have more than one type of event-time. For example, a 'completed-transaction' record may have a field representing the time a transaction was started and another for when it was completed. Both can be considered event-times".

Member:

'Event-time' is the time at which the event occurred.

That is semantically correct. Note that we assume users would set this timestamp in ProducerRecord explicitly.

It's possible the data-source, or the producing app, or whatever, is producing a record to Kafka for an event that happened at some point in the past. The 'event-time' in this context would be the time the event occurred, i.e. the historical point in time.

Agreed. Again, the producer application would need to set the timestamp accordingly when creating ProducerRecords.

Bottom line: I think it makes sense to discuss semantics on the producer application side and the KSQL side separately.

(1) For the producer side, we need to explain how event-time semantics can be achieved. Btw: often "the time at which the event occurred" is the same as "the wall-clock time when the ProducerRecord is created" (ie, for any live/online application) -- only if you write historical data would the two differ.

(2) For KSQL, the assumption is always event-time. If the producer does it differently, the end-to-end pipeline might be incorrect (but it might also be ok, depending on the application). However, this part is not a KSQL concern any longer IMHO.

in a car, the associated event-time is the time when the GPS sensor captured
the location change.

Ingestion-time

Contributor:

Avoid being Kafka specific. An ingestion-time could equally be the time some data was received by some platform as a whole, e.g. the timestamp at which the data was received by some gateway service. This too is a totally valid use of 'ingestion-time'.

As mentioned already, the 'timestamp' on the record may have been set by the Kafka broker or may have been set by the producing app. In the former case it is one type of ingestion-time, in the latter it may be some sort of ingestion-time, or maybe event-time, or maybe complete rubbish.

Member:

I agree with your basic assessment though. However, it does not make too much difference, because any form of "ingestion time" is an approximation of "event-time", and the main classes we need to consider are event- vs. processing-time.

I guess it depends on which angle you look at the problem from: from a broker point of view, if the topic is configured with "CreateTime", the producer sets the timestamp -- for "LogAppendTime" it's ingestion time. If the producer does not set a proper "event-time" but some approximation, it's still event-time from a broker point of view.

Hence, I tend to disagree: we use the term "ingestion-time" for the case that the topic is configured with "LogAppendTime". It's fair to question this terminology. However, I think it makes sense, because this ensures that there is no out-of-order data. If you have a form of "external ingestion time" (as you described above), there can still be out-of-order data... And I hope to make KS/KSQL smarter at some point, to exploit this property of ingestion time.


For example: imagine an analytics application that reads and processes the
geo-location data reported from car sensors, and presents it to a
fleet-management dashboard. In this case, processing-time in the analytics

Contributor:

Consider changing:

In this case, processing-time in the analytics application might be milliseconds or seconds after event-time

To,

In this case, processing-time in the analytics application might be many minutes or hours after the event-time as cars can move out of mobile reception for periods of time and have to locally buffer records.

I think a larger discrepancy better highlights the difference and the importance of using the right one.

real-time pipelines based on Apache Kafka and Kafka Streams) or it might be
hours, like for batch pipelines based on Apache Hadoop or Apache Spark.

KSQL assigns a timestamp to every data record by using *timestamp extractors*,

Contributor:

This is talking about KSQL internal implementation. It would benefit from reworking using more KSQL terminology, e.g.

"By default, KSQL uses a record's timestamp field. However, a user chose to use a field from within the record's value using the TIMESTAMP directive with the WITH clause. "

???

then it will also assign timestamps to these new records.
The way the timestamps are assigned depends on the context:

* When new output records are generated via directly processing some input record,

Contributor:

Output records in KSQL are only ever generated via directly processing some input record (AFAIK).

(such as seconds since the epoch). You should also not mix topics with different time semantics.


Windowing

Contributor:

I think this too needs reworking to be more orientated to someone using KSQL, rather than how KSQL / KStreams does things under the hood.

e.g. talking more in terms of syntax such as WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS)
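
For example, a complete query using that syntax might look like the following sketch (stream and column names are hypothetical):

-- Sketch: hopping-window aggregation with 20-second windows that
-- advance every 5 seconds, so each record falls into four windows.
SELECT regionid, COUNT(*)
  FROM pageviews
  WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS)
  GROUP BY regionid;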


Session Window start and end time are both inclusive (in contrast to time-windows),
and there is always a record in the session window with the start and end timestamps
(because the timestamp of the first and last record in the window define window start and end time).

Contributor:

Would be great to add a new section around windowed joins, using the WITHIN clause, e.g.

SELECT O.ORDER_ID, O.TOTAL_AMOUNT, O.CUSTOMER_NAME,
       S.SHIPMENT_ID, S.WAREHOUSE
  FROM NEW_ORDERS O
  INNER JOIN SHIPMENTS S
    WITHIN 1 HOURS
    ON O.ORDER_ID = S.ORDER_ID;

@big-andy-coates requested review from a team and removed request for a team, November 9, 2018 12:04

@big-andy-coates (Contributor):

Hey @JimGalasyn - just seen your comment about this being WIP! No worries. Just add me back in as a reviewer again when you want me to have a look at this. :D

@mjsax (Member) commented Nov 9, 2018:

@big-andy-coates @JimGalasyn @miguno

The discussion about time is quite important IMHO, and we should try to find an agreement on our terminology and how we explain it to the user. I think there are different angles from which you can look at this (conceptually, producer, broker, KSQL). I would suggest the following (which is slightly different from the current terminology -- this might even contradict some of my comments from above...)

(1) Conceptually, there is event-time, ingestion-time, and processing-time. Event-time is when an event occurred, processing-time is when an event is processed, and ingestion-time is something in-between. Note that I would not consider ingestion-time an approximation of event-time. Some examples:

• An application captures a user click and records the timestamp of the click, sets this timestamp on the ProducerRecord, and writes it to a CreateTime-configured topic. Event-time semantics.

• A sensor captures an event with a timestamp (the sensor has its own clock), and sends it to an application that sets the sensor-captured timestamp on the ProducerRecord and writes it to a CreateTime-configured topic. Event-time semantics.

• A sensor captures an event with no timestamp (the sensor does not have any clock), and sends it to an application that sets the current wall-clock time on the ProducerRecord and writes it to a CreateTime-configured topic. I would consider this event-time semantics. Even if the sensor does not capture the timestamp, and the event-time is approximated, it's still "close enough" to be considered event-time IMHO.

• A sensor captures an event with no timestamp (the sensor does not have any clock), and sends it to an application that writes it to a LogAppendTime-configured topic. I would still consider this event-time semantics, even if the approximation is worse than in the case above.

• An application reads historic data and uses the current wall-clock time to set a timestamp. This is a clear case of ingestion time. The timestamp used is not related to (nor an approximation of) the event-time.

IMHO, the first 4 cases should be considered event-time, even if some event-timestamps are approximated. The reality is (pun intended) that every recorded timestamp is approximated anyway. All clocks have skew, and we should not over-index on this. It's an imprecise clock-time measurement, but it does not put it into the ingestion-time semantics category IMHO -- even if the approximation is done broker-side.

For the topic config LogAppendTime, I think it is independent of the semantics. The broker can approximate event-time as laid out above, but `LogAppendTime` can also be used to set timestamps for historical data as in case 5. Side remark: approximating event-time via LogAppendTime has the disadvantage that it's less exact compared to producer-side approximation (or sensor-based capturing). However, it has the nice property that out-of-order data is avoided. This also holds for ingestion time. Note that this does not imply that ingestion time guarantees no out-of-order data.

Thus, I think it might be good to distinguish between event-time and ingestion-time differently than we discussed before. Event-time is the time when an event occurs, and if this timestamp is captured "close enough", it's an event timestamp (even if it's approximated producer- or broker-side -- the later it is assigned, the worse the approximation, but that does not put it into the ingestion-time category). In contrast, ingestion time does not try to approximate event-time -- there is no knowledge of when an event occurred in the real world, and thus ingestion time has actually no semantic meaning at all with respect to the data. The only property that event- and ingestion-time semantics share is that they give deterministic processing guarantees, because a timestamp is stored with the data.

Processing-time semantics implies that there is no timestamp stored within the data, and thus deterministic re-processing is not possible.

Thoughts?

@JimGalasyn changed the title from "KSQL-1787: First draft of Time and Windows topic" to "[WIP] KSQL-1787: First draft of Time and Windows topic", Nov 21, 2018
@JimGalasyn requested review from joel-hamill and a team, November 21, 2018 20:23

@joel-hamill (Contributor) left a comment:

Suggested grammar fixes

record, but the ingestion timestamp is generated when the Kafka broker appends
the record to the target topic.

Ingestion-time may approximate event-time reasonably well if the time

Contributor:

Not sure how to interpret the subjective wording here (e.g. "reasonably well", "sufficiently small"). If we don't have concrete guidance on what these terms mean, I think we should leave it out. E.g.:

Ingestion-time can approximate event-time if the time difference between 
the creation of the record and its ingestion into Kafka is small.

is sufficiently small. This means that ingestion-time may be an alternative
for use cases where event-time semantics aren't possible.

You may face this situation when data producers don't embed timestamps in

Contributor:

What is this paragraph referring to? The need to use ingestion time rather than event time? I'd revise and/or remove it entirely.


Processing-time
The time when the record is consumed by a stream processing application.
The processing-time may happen immediately after ingestion-time, or it may

Contributor:

suggest:

The processing-time can occur immediately after ingestion-time, or it may be delayed 
by milliseconds, hours, or days.

Hopping Window
--------------

All time windows are of the same size, but they might overlap, depending

Contributor:

All time windows are the same size...?

@miguno (Contributor) left a comment:

Great draft, @JimGalasyn. Have lots of feedback for you, please don't despair! ;-) This section is one of the trickiest to get right so that users can understand what we need to cover here in the easiest and quickest way.

Additional comments on the new pictures:

ksql-window-aggregation.png:

  • "delta t > session timeout": session timeout might get confused with the session.timeout.ms for KStreams, which is quite a different thing. We typically say "inactivity gap", so sth like "delta t > inactivity gap"?

ksql-window.png:

• Should we include, in the pic, that you can access the start and end time via WINDOWSTART() and WINDOWEND(), respectively? I know this is a conceptual picture, but...?
  • In ksql-window-aggregation.png we don't say "duration" but "size". Can we be consistent?

as a Unix epoch time in milliseconds, which is the number of milliseconds
that have elapsed since 1 January 1970 at midnight UTC/GMT.

When working with time you should also make sure that additional

Contributor:

FYI (not to be included here): There's an open ticket #1968 to allow users to specify a time zone other than today's hardcoded UTC for windowed aggregations.

.. image:: ../img/ksql-stream-records.png
:alt: Diagram showing records in a KSQL stream


Contributor:

Nit: Unnecessary newline(s), not sure whether these will be rendered or not.

helps to agree on specifying time information in UTC or in Unix time,
like seconds since the Unix epoch, everywhere in your system.

Don't mix streams or tables that have different time semantics.

Contributor:

This should be highlighted separately. I think it would get missed easily at the bottom of this longer IMPORTANT box.


.. code:: sql

SELECT regionid, count(*) FROM pageviews

Contributor:

Nit: uppercase COUNT(*)


.. code:: sql

SELECT card_number, count(*) FROM authorization_attempts

Contributor:

Nit: uppercase COUNT

GROUP BY regionid;

The start and end times for a session window are both inclusive, in contrast to
time windows. There is always a record in the session window with both the start

Contributor:

Sounds confusing: "There is always a record in the session window with both the start and end timestamps, [...]"

What do you want to say here?

I think it's about:

  • A session window contains at least one record. It's not possible for a session window to have zero records.
  • If a session window contains exactly 1 record, then that record's timestamp (via ROWTIME) is identical to the window's own start and end times (which the user can access via WINDOWSTART() and WINDOWEND(), respectively).
  • If a session window contains 2 or more records, then the "earliest/oldest" record's timestamp (ROWTIME) is identical to the window's start time (WINDOWSTART()), and the "latest/newest" record's timestamp (ROWTIME) is identical to the window's end time (WINDOWEND()).

Is the above correct?
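
A sketch of a query that surfaces these values (stream, column, and gap are hypothetical; WINDOWSTART() and WINDOWEND() are the accessors named above):

-- Sketch: session-window aggregation; WINDOWSTART() and WINDOWEND()
-- expose each session's (inclusive) start and end times.
SELECT userid,
       WINDOWSTART() AS window_start,
       WINDOWEND() AS window_end,
       COUNT(*)
  FROM pageviews
  WINDOW SESSION (60 SECONDS)
  GROUP BY userid;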

Session Window
--------------

A new window starts if the last event that arrived is further back in time

Contributor:

I actually like the (high-level) explanation of session windows in the KStreams docs (https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#session-windows). Don't we want to re-use some of that here, including the diagrams we created to help users understand session windows more easily?

This is important for non-overlapping windows, in which each record must be
contained in exactly one window.

Hopping Window

Contributor:

I actually like the (high-level) explanation of hopping windows in the KStreams docs (https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows). Don't we want to re-use some of that here, including the diagrams we created to help users understand hopping windows more easily?

.. image:: ../img/ksql-window-aggregation.png
:alt: Diagram showing three types of time windows in KSQL streams: tumbling, hopping, and session

Tumbling Window

Contributor:

We have a diagram for tumbling windows in the KStreams docs (https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#tumbling-time-windows). Don't we want to re-use that here to help users understand tumbling windows more easily?

Windowed Joins
--------------

KSQL supports using windows in JOIN queries.

Contributor:

Nit: "KSQL supports using windows in JOIN queries via a WITHIN clause."

Contributor:

Also:

  • Support for WITHIN: Any restrictions for WITHIN, or can we use it for any join we support?
• A stream-stream join must always be windowed, right? We should also call this out in the new docs on Joins.
  • Do we want to cover windowed joins under joins, or here? Or in both places?

--------------

A new window starts if the last event that arrived is further back in time
than a specified session timeout time.

Contributor:

IMHO we should not use the words "session timeout time", as there is a very similar config setting of the same name. Thus far we say "activity gap" / "inactivity gap".

be in the UTC timezone, which may not be appropriate if you want to have daily
windows in your timezone.

Window Types

@miguno (Contributor), Nov 23, 2018:

For each of the three window types (tumbling, hopping, session), we should not just cover how to create the respective queries/tables, but also how to subsequently use the created tables, because this latter step is where a lot of users struggle.

• What does the data in a windowed table look like (notably the key)?
  • How can I write a SELECT query (well, any query) on a windowed table?
  • How can I access window-related information such as start/end time of the window? (and e.g. ROWTIME vs. WINDOWSTART() and WINDOWEND() times)
  • How can I query a specific window, say, the window from today 9:00AM (absolute time)? Or the window from 2 hours ago (relative time)?

Member Author:

I opened KSQL-1931 to track this.

@JimGalasyn merged commit 91bae9a into confluentinc:5.0.0-post, Nov 28, 2018
@JimGalasyn deleted the ksql-1787 branch, November 28, 2018 00:16