
KSQL-1787: First draft of Time and Windows topic #2134

Merged
merged 13 commits, Nov 28, 2018

Conversation

@JimGalasyn (Member) commented Nov 7, 2018:

No description provided.

handle late-arriving records.

.. image:: ../img/ksql-window.png

Member:

Not sure how precise the graphic should be, but note that window start timestamp is inclusive while window end timestamp is exclusive. This property is important for non-overlapping windows, for which each record should be contained in exactly one window.

Member Author:

I make a note of this later in the topic.

which occur at particular times. Three important times in a record's
life cycle are *Event-time*, *Ingestion-time*, and *Processing-time*.

Representing time consistently enables aggregation operations o streams,

Member:

nit: "o streams"?

Representing time consistently enables aggregation operations o streams,
like SUM, that have time boundaries.

KSQL and Kafka Streams support the following notions of time:

Member:

remove "and Kafka Streams" (who cares?)

Contributor:

Not me :)

real-time pipelines based on Apache Kafka and Kafka Streams) or it might be
hours, like for batch pipelines based on Apache Hadoop or Apache Spark.

Kafka Streams assigns a timestamp to every data record by using

Member:

Kafka Streams -> KSQL ?

progress of a stream over time. Timestamps are used by time-dependent
operations, like joins.

We call it the *application event-time* to differentiate

Member:

This term is new to me... Where does it come from? Just curious.

Member Author:

It was "event-time of the application" in the original -- not sure if it's even meaningful.

Member:

cc @miguno Can you help out here? IMHO, it's important that we standardize our language.


Developers can thus enforce different notions/semantics of time depending on their business needs.

Finally, whenever a Kafka Streams application writes records to Kafka,

Member:

Kafka Streams -> KSQL ?

aspects of time such as time zones and calendars are correctly
synchronized – or at least understood and traced – throughout your stream data pipelines.
It often helps, for example, to agree on specifying time information in UTC or in Unix time
(such as seconds since the epoch). You should also not mix topics with different time semantics.

Member:

This actually has a larger impact. In fact, Kafka Streams is based on Unix epoch time in the UTC timezone. This affects time windows. For example, if you define a 24h tumbling time window, it will be in the UTC timezone... This might not be appropriate if you want daily windows in your timezone.
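
For illustration, a sketch of the behavior described above (stream and column names are hypothetical): a daily tumbling window in KSQL is aligned to Unix-epoch UTC boundaries, so its buckets roll over at 00:00 UTC rather than at local midnight.

-- Sketch: window boundaries fall at 00:00 UTC, not at local midnight,
-- because windows are aligned to the Unix epoch.
SELECT regionid, COUNT(*)
  FROM pageviews
  WINDOW TUMBLING (SIZE 24 HOURS)
  GROUP BY regionid;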

operations such as aggregations or joins into so-called windows. Windows are tracked
per record key.

Windowing operations are available in the Kafka Streams DSL. When working with windows,

Member:

I thought this is KSQL docs? Why do we talk so much about Kafka Streams?


Windowing operations are available in the Kafka Streams DSL. When working with windows,
you can specify a retention period for the window. This retention period controls how
long Kafka Streams will wait for out-of-order or late-arriving data records for a given window.

Member:

This is not true any longer in 5.1 (I know that this PR is against 5.0, and I'm also not sure how KSQL exposes both anyway -- just want to point it out to make you aware of this) -- we added a new configuration that we call "grace period" that determines how long we wait before we close a window. Retention time is still a valid parameter that defines how long we store the (potentially) closed window -- we do this to allow accessing the window via Interactive Queries even if it's already closed.

.. image:: ../img/ksql-window.png


.. image:: ../img/ksql-window-aggregation.png

@mjsax (Member), Nov 8, 2018:

With regard to my comment above: Session Window start and end time are both inclusive (in contrast to time-windows), and there is always a record in the session window with the start and end timestamps (because the timestamps of the first and last record in the window define the window start and end time).

@big-andy-coates requested a review from a team, November 8, 2018 17:40

@big-andy-coates (Contributor) left a comment:

Hey @JimGalasyn - ran out of time, got to dash.

Only looked at the first parts - I'm with Matthias - this is currently reading like KStreams docs, not KSQL!

Sorry, got to dash...

Time and Windows
################

During the life cycle of a record, it experiences a number of events

Contributor:

I think this is a little confusing. The record is the event, it doesn't experience events. Nor does it have a lifecycle in this context.

Maybe just have:

When talking about time it is important to be clear about which time is being discussed. The time an event occurred may be different to the time it was ingested into Kafka, or the time at which the processing of the event is occurring.

???

Member Author:

I'd like to get very crisp about this -- my thinking is that in the KSQL world, "events" and "messages" are abstracted away, and we have instead "records". But maybe this isn't the way we want to think about it...

Member:

I agree that there is nothing like a life cycle.

I also agree that we should abstract over events/messages and use the term records -- we use records in the KS docs, too, for the same reason. "Message" is the standard term for core Kafka, ie, pub/sub.

@big-andy-coates (Contributor), Nov 9, 2018:

I'm fine with 'records'. The reason I mentioned 'events' is because of the use of 'events' here:

it experiences a number of events

The record does not experience anything. It is an immutable bit of data / information.

From a KSQL point of view, a record has a timestamp, (i.e. the timestamp of the record in Kafka). This is the default timestamp that KSQL will use. This may be an 'event-time' or just the 'ingestion-time' depending on what the record's producer did when producing the record to Kafka. (The producer can choose to set the timestamp to any time they like, but most often chooses either some sensible 'event-time' or the current time, which would be equivalent to 'ingestion-time'. If they don't set a timestamp, it's set to 'ingestion-time' by Kafka.)

When importing the topic into KSQL it will, by default, use this timestamp for the record. Alternatively, a user can use the WITH(TIMESTAMP='some-field') to tell KSQL to use an alternative timestamp from the record's value, (with an optional TIMESTAMP_FORMAT). The field the user specifies may be an 'event-time' or an 'ingestion-time'.

I think I'm right in saying KSQL currently doesn't support any 'processing-time' operations, (though they're probably possible with UDFs to access the current time).

So.. from KSQL perspective, I don't think the type of the timestamp, i.e. event vs ingestion vs processing time, really matters that much. KSQL just uses the timestamp that you tell it to.

That said, a little background explaining that different concepts of time exist and that it's important to know which one you're discussing is probably worthwhile as a preamble, just don't talk about records having a lifecycle or experiencing events ;)

One very important thing to call out is that the KSQL rows' timestamp, (be that from the record's timestamp or a field within the value), must use a consistent concept of time. For example, things will go awry if some records were written to Kafka with the field containing some event-time, while others got the default ingest-time set by the broker. It is important that all records within a topic use a consistent notion of time. What that notion is doesn't really matter to KSQL.
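
To make the WITH(TIMESTAMP='some-field') idea above concrete, here is a hedged sketch (stream and column names are hypothetical; as noted further down in this thread, the field must hold a Unix epoch time in milliseconds unless a TIMESTAMP_FORMAT is given):

-- Sketch: KSQL uses the event_ts column (Unix epoch milliseconds)
-- as each row's timestamp, instead of the Kafka record's own timestamp.
CREATE STREAM pageviews (
    viewtime BIGINT,
    userid VARCHAR,
    pageid VARCHAR,
    event_ts BIGINT
  ) WITH (
    KAFKA_TOPIC='pageviews',
    VALUE_FORMAT='JSON',
    TIMESTAMP='event_ts'
  );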

@mjsax (Member), Nov 9, 2018:

This may be an 'event-time' or just the 'ingestion-time' depending on what the record's producer did when producing the record to Kafka. (The producer can choose to set the timestamp to any time they like, but most often chooses either some sensible 'event-time' or the current time, which would be equivalent to 'ingestion-time'. If they don't set a timestamp, it's set to 'ingestion-time' by Kafka.)

IMHO, this is not 100% correct (with respect to our current terminology). (1) Brokers allow configuring a topic with either CreateTime (default) or LogAppendTime -- for the first config, the timestamp will be whatever the producer sets. For the second config, the broker always sets the timestamp. The second case implies "ingestion time" semantics. Thus, if LogAppendTime is configured, the producer has no control whatsoever over the timestamp.

If CreateTime is set, the following holds for the producer:

  • if a ProducerRecord is created, by default it contains no timestamp
  • users can set the timestamp on ProducerRecord explicitly though
  • if the timestamp is not set, on producer.send() the current wall-clock time will be set automatically

Note that we consider all three cases as "event-time" and don't distinguish them in our current terminology. I agree that semantically we could follow Andy's suggestion, but I believe this would make it more complex for users to understand the subtle differences. Thus, I would prefer to stay with our current (maybe simplified) terminology.

For example, things will go awry if some records were written to Kafka with the field containing some event-time, while others got the default ingest-time set by the broker.

I guess this is not possible.

However, another important thing to point out: if a WITH(timestamp=...) clause is used, this timestamp must be passed to Kafka Streams as "unix epoch time in milliseconds" -- not sure how much control the user has over this, and what part KSQL handles. If the user could mess this up (ie, KSQL cannot ensure this timestamp format/semantic), we need to tell users about it.

Member Author:

This is all great stuff, please keep it coming! We'll get a KSQL-focused topic out of this yet.

real-time pipelines based on Apache Kafka and Kafka Streams) or it might be
hours, like for batch pipelines based on Apache Hadoop or Apache Spark.

Kafka Streams assigns a timestamp to every data record by using

@big-andy-coates (Contributor), Nov 8, 2018:

This is a KStreams implementation detail. What does KSQL support? It doesn't care what type of time a timestamp may refer to.

You can also switch which field KSQL gets its timestamp from. Default is the record's timestamp, but it can also grab a timestamp out of a field in the value.

Does KSQL support anything else when it comes to timestamps? cc @dguy

@JimGalasyn (Member Author):

@mjsax @big-andy-coates Thanks for the early feedback! Just so you know, this topic isn't even half-baked yet -- I only posted the PR to keep track of it. :) The "Time" section has been copy-edited from the original KStreams Concepts topic, which is why it reads like KStreams content (I had to start somewhere). The Windows section is literally just copy-pasted from KStreams, I haven't had a chance to edit it at all yet.

So expect this topic to change completely as I work on it -- but do keep the feedback coming!

@big-andy-coates (Contributor) left a comment:

Hey @JimGalasyn - it's great to see such docs being added.

At the moment the docs seem more focused on KStreams functionality, or how KSQL does things under the hood. I think it would benefit from reworking to be more 'KSQL user' orientated, i.e. from the point of view of someone using KSQL. It should focus on KSQL concepts and terminology and try to avoid muddying the water with implementation details.

Representing time consistently enables aggregation operations on streams,
like SUM, that have time boundaries.

KSQL supports the following notions of time:

Contributor:

As above, KSQL doesn't care what notion of time a timestamp has. I think it's beneficial to explain the main categories of event time, it just needs this bit reworking.

I would also separate this discussion of notions of time from anything Kafka specific. What type of timestamp is stored in a Kafka record's timestamp field, or some field in the record's value, is completely arbitrary, as producing apps can put whatever they like in there. It can be the event-time, some concept of ingestion-time (i.e. ingested into the platform at time x, which would be different to ingested into Kafka at time y), or something random.

Member:

It can be the event-time, some concept of ingestion-time (i.e. ingested into the platform at time x, which would be different to ingested into Kafka at time y), or something random.

That is correct. However, we should consider all of those as "event-time". From a KSQL point of view, it's unknown how the timestamp is set -- thus, for KSQL it is just event-time IMHO. If a user sets event-time in a crazy manner, it's the user's fault.

Ingestion-time would be an approximation of event-time, but it's technically still event-time.

Processing-time would only hold if KSQL used the current wall-clock time when reading the record -- I think this is not supported in KSQL, as users cannot set a custom TimestampExtractor, from my understanding?

I think it overall boils down to: if data is reprocessed, do we use the same timestamps? For event-time this is true (also for ingestion-time) -- for processing-time, it is not.


KSQL supports the following notions of time:

Event-time

Contributor:

The time when a record is created by the data source

Sorry for being pedantic, but this is not strictly correct. 'Event-time' is the time at which the event occurred. End of.

It's possible the data-source, or the producing app, or whatever, is producing a record to Kafka for an event that happened at some point in the past. The 'event-time' in this context would be the time the event occurred, i.e. the historical point in time.

I would change the first paragraph to something like:

"The time an event occurred. For example, if the event is a geo-location change reported by a GPS sensor in a car, the associated event-time is the time when the GPS sensor captured the location change."

As this is clean and explicit. Maybe followed by something like:

"A record may have more than one type of event-time. For example, a 'completed-transaction' record may have a field representing the time a transaction was started and another for when it was completed. Both can be considered event-times".

Member:

'Event-time' is the time at which the event occurred.

That is semantically correct. Note that we assume users would set this timestamp in ProducerRecord explicitly.

It's possible the data-source, or the producing app, or whatever, is producing a record to Kafka for an event that happened at some point in the past. The 'event-time' in this context would be the time the event occurred, i.e. the historical point in time.

Agreed. Again, the producer application would need to set the timestamp accordingly when creating ProducerRecords.

Bottom line: I think it makes sense to discuss semantics on the producer application side and the KSQL side separately.

(1) For the producer side, we need to explain how event-time semantics can be achieved. Btw: often "the time at which the event occurred" is the same as "the wall-clock time when the ProducerRecord is created" (ie, for any live/online application) -- only if you write historical data would the two differ.

(2) For KSQL, the assumption is always event-time. If the producer does it differently, the end-to-end pipeline might be incorrect (but it might also be ok, depending on the application). However, this part is not a KSQL concern any longer IMHO.

in a car, the associated event-time is the time when the GPS sensor captured
the location change.

Ingestion-time

Contributor:

Avoid being Kafka specific. An ingestion-time could equally be the time some data was received by some platform as a whole, e.g. the timestamp at which the data was received by some gateway service. This too is a totally valid use of 'ingestion-time'.

As mentioned already, the 'timestamp' on the record may have been set by the Kafka broker or may have been set by the producing app. In the former case it is one type of ingestion-time, in the latter it may be some sort of ingestion-time, or maybe event-time, or maybe complete rubbish.

Member:

I agree with your basic assessment though. However, it does not make too much difference, because any form of "ingestion time" is an approximation of "event-time", and the main classes we need to consider are event- vs. processing-time.

I guess it depends on which angle you look at the problem from: from a broker point of view, if the topic is configured with "CreateTime", the producer sets the timestamp -- for "LogAppendTime" it's ingestion time. If the producer does not set a proper "event-time" but some approximation, it's still event-time from a broker point of view.

Hence, I tend to disagree: we use the term "ingestion-time" for the case that the topic is configured with "LogAppendTime". It's fair to question this terminology. However, I think it makes sense, because this ensures that there is no out-of-order data. If you have a form of "external ingestion time" (as you described above), there can still be out-of-order data... And I hope to make KS/KSQL smarter at some point, to exploit this property of ingestion time.


For example: imagine an analytics application that reads and processes the
geo-location data reported from car sensors, and presents it to a
fleet-management dashboard. In this case, processing-time in the analytics

Contributor:

Consider changing:

In this case, processing-time in the analytics application might be milliseconds or seconds after event-time

To,

In this case, processing-time in the analytics application might be many minutes or hours after the event-time as cars can move out of mobile reception for periods of time and have to locally buffer records.

I think a larger discrepancy better highlights the difference and the importance of using the right one.

real-time pipelines based on Apache Kafka and Kafka Streams) or it might be
hours, like for batch pipelines based on Apache Hadoop or Apache Spark.

KSQL assigns a timestamp to every data record by using *timestamp extractors*,

Contributor:

This is talking about KSQL internal implementation. It would benefit from reworking using more KSQL terminology, e.g.

"By default, KSQL uses a record's timestamp field. However, a user chose to use a field from within the record's value using the TIMESTAMP directive with the WITH clause. "

???

then it will also assign timestamps to these new records.
The way the timestamps are assigned depends on the context:

* When new output records are generated via directly processing some input record,

Contributor:

Output records in KSQL are only ever generated via directly processing some input record (AFAIK).

(such as seconds since the epoch). You should also not mix topics with different time semantics.


Windowing

Contributor:

I think this too needs reworking to be more orientated to someone using KSQL, rather than how KSQL / KStreams does things under the hood.

e.g. talking more in terms of syntax such as WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS)
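
For example, a complete query using that syntax might look like the following sketch (stream and column names are hypothetical):

-- Sketch: hopping-window aggregation with 20-second windows that
-- advance every 5 seconds, so each record falls into four windows.
SELECT regionid, COUNT(*)
  FROM pageviews
  WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS)
  GROUP BY regionid;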


Session Window start and end time are both inclusive (in contrast to time-windows),
and there is always a record in the session window with the start and end timestamps
(because the timestamp of the first and last record in the window define window start and end time).

Contributor:

Would be great to add a new section around windowed joins, using the WITHIN clause, e.g.

SELECT O.ORDER_ID, O.TOTAL_AMOUNT, O.CUSTOMER_NAME,
       S.SHIPMENT_ID, S.WAREHOUSE
  FROM NEW_ORDERS O
  INNER JOIN SHIPMENTS S
    WITHIN 1 HOURS
    ON O.ORDER_ID = S.ORDER_ID;

@big-andy-coates requested review from a team and removed request for a team, November 9, 2018 12:04

@big-andy-coates (Contributor):

Hey @JimGalasyn - just seen your comment about this being WIP! No worries. Just add me back in as a reviewer again when you want me to have a look at this. :D

@mjsax (Member) commented Nov 9, 2018:

@big-andy-coates @JimGalasyn @miguno

The discussion about time is quite important IMHO, and we should try to find an agreement on our terminology and how we explain it to the user. I think there are different angles from which you can look at this (conceptually, producer, broker, KSQL). I would suggest the following (which is slightly different from the current terminology -- this might even contradict some of my comments from above...)

(1) Conceptually, there is event-time, ingestion-time, and processing-time. Event-time is when an event occurred, processing-time is when an event is processed, and ingestion-time is something in-between. Note that I would not consider ingestion-time an approximation of event-time. Some examples:

• An application captures a user click and records the timestamp of the click, sets this timestamp on the ProducerRecord, and writes it to a CreateTime-configured topic. Event-time semantics.

• A sensor captures an event with a timestamp (the sensor has its own clock), and sends it to an application that sets the sensor-captured timestamp on the ProducerRecord and writes it to a CreateTime-configured topic. Event-time semantics.

• A sensor captures an event with no timestamp (the sensor does not have any clock), and sends it to an application that sets the current wall-clock time on the ProducerRecord and writes it to a CreateTime-configured topic. I would consider this event-time semantics. Even if the sensor does not capture the timestamp, and the event-time is approximated, it's still "close enough" to be considered event-time IMHO.

• A sensor captures an event with no timestamp (the sensor does not have any clock), and sends it to an application that writes it to a LogAppendTime-configured topic. I would still consider this event-time semantics, even if the approximation is worse than in the case above.

• An application reads historic data and uses the current wall-clock time to set a timestamp. This is a clear case of ingestion time. The timestamp used is not related to (nor an approximation of) the event-time.

IMHO, the first 4 cases should be considered event-time, even if some event-timestamps are approximated. The reality is (pun intended) that every recorded timestamp is approximated anyway. All clocks have skew, and we should not over-index on this. It's an imprecise clock-time measurement, but it does not put it into the ingestion-time semantics category IMHO -- even if the approximation is done broker-side.

For the topic config LogAppendTime, I think it is independent of the semantics. The broker can approximate event-time as laid out above, but `LogAppendTime` can also be used to set timestamps for historical data as in case 5. Side remark: approximating event-time via LogAppendTime has the disadvantage that it's less exact compared to producer-side approximation (or sensor-based capturing). However, it has the nice property that out-of-order data is avoided. This also holds for ingestion time. Note that this does not imply that ingestion time guarantees no out-of-order data.

Thus, I think it might be good to distinguish between event-time and ingestion-time differently than we discussed before. Event-time is the time when an event occurs, and if this timestamp is captured "close enough", it's an event timestamp (even if it's approximated producer- or broker-side -- the later it is assigned, the worse the approximation, but that does not put it into the ingestion-time category). In contrast, ingestion time does not try to approximate event-time -- there is no knowledge of when an event occurred in the real world, and thus ingestion time has actually no semantic meaning at all with respect to the data. The only property that event- and ingestion-time semantics share is that they give deterministic processing guarantees, because a timestamp is stored with the data.

Processing-time semantics implies that there is no timestamp stored within the data, and thus deterministic re-processing is not possible.

Thoughts?

@JimGalasyn changed the title from "KSQL-1787: First draft of Time and Windows topic" to "[WIP] KSQL-1787: First draft of Time and Windows topic", Nov 21, 2018
@JimGalasyn requested review from joel-hamill and a team, November 21, 2018 20:23

@joel-hamill (Contributor) left a comment:

Suggested grammar fixes

record, but the ingestion timestamp is generated when the Kafka broker appends
the record to the target topic.

Ingestion-time may approximate event-time reasonably well if the time

Contributor:

Not sure how to interpret the subjective wording here (e.g. "reasonably well", "sufficiently small"). If we don't have concrete guidance on what these terms mean, I think we should leave it out. E.g.:

Ingestion-time can approximate event-time if the time difference between 
the creation of the record and its ingestion into Kafka is small.

is sufficiently small. This means that ingestion-time may be an alternative
for use cases where event-time semantics aren't possible.

You may face this situation when data producers don't embed timestamps in

Contributor:

What is this paragraph referring to? The need to use ingestion time rather than event time? I'd revise and/or remove it entirely.


Processing-time
The time when the record is consumed by a stream processing application.
The processing-time may happen immediately after ingestion-time, or it may

Contributor:

suggest:

The processing-time can occur immediately after ingestion-time, or it may be delayed 
by milliseconds, hours, or days.

Hopping Window
--------------

All time windows are of the same size, but they might overlap, depending

Contributor:

All time windows are the same size...?

@miguno (Contributor) left a comment:

Great draft, @JimGalasyn. Have lots of feedback for you, please don't despair! ;-) This section is one of the trickiest to get right so that users can understand what we need to cover here in the easiest and quickest way.

Additional comments on the new pictures:

ksql-window-aggregation.png:

  • "delta t > session timeout": session timeout might get confused with the session.timeout.ms for KStreams, which is quite a different thing. We typically say "inactivity gap", so sth like "delta t > inactivity gap"?

ksql-window.png:

• Should we include, in the pic, that you can access the start and end time via WINDOWSTART() and WINDOWEND(), respectively? I know this is a conceptual picture, but...?
  • In ksql-window-aggregation.png we don't say "duration" but "size". Can we be consistent?

as a Unix epoch time in milliseconds, which is the number of milliseconds
that have elapsed since 1 January 1970 at midnight UTC/GMT.

When working with time you should also make sure that additional

Contributor:

FYI (not to be included here): There's an open ticket #1968 to allow users to specify a time zone other than today's hardcoded UTC for windowed aggregations.

.. image:: ../img/ksql-stream-records.png
:alt: Diagram showing records in a KSQL stream


Contributor:

Nit: Unnecessary newline(s), not sure whether these will be rendered or not.

helps to agree on specifying time information in UTC or in Unix time,
like seconds since the Unix epoch, everywhere in your system.

Don't mix streams or tables that have different time semantics.

Contributor:

This should be highlighted separately. I think it would get missed easily at the bottom of this longer IMPORTANT box.


.. code:: sql

SELECT regionid, count(*) FROM pageviews

Contributor:

Nit: uppercase COUNT(*)


.. code:: sql

SELECT card_number, count(*) FROM authorization_attempts

Contributor:

Nit: uppercase COUNT

GROUP BY regionid;

The start and end times for a session window are both inclusive, in contrast to
time windows. There is always a record in the session window with both the start

Contributor:

Sounds confusing: "There is always a record in the session window with both the start and end timestamps, [...]"

What do you want to say here?

I think it's about:

  • A session window contains at least one record. It's not possible for a session window to have zero records.
  • If a session window contains exactly 1 record, then that record's timestamp (via ROWTIME) is identical to the window's own start and end times (which the user can access via WINDOWSTART() and WINDOWEND(), respectively).
  • If a session window contains 2 or more records, then the "earliest/oldest" record's timestamp (ROWTIME) is identical to the window's start time (WINDOWSTART()), and the "latest/newest" record's timestamp (ROWTIME) is identical to the window's end time (WINDOWEND()).

Is the above correct?
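
A sketch of a query that surfaces these values (stream, column, and gap are hypothetical; WINDOWSTART() and WINDOWEND() are the accessors named above):

-- Sketch: session-window aggregation; WINDOWSTART() and WINDOWEND()
-- expose each session's (inclusive) start and end times.
SELECT userid,
       WINDOWSTART() AS window_start,
       WINDOWEND() AS window_end,
       COUNT(*)
  FROM pageviews
  WINDOW SESSION (60 SECONDS)
  GROUP BY userid;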

Session Window
--------------

A new window starts if the last event that arrived is further back in time

Contributor:

I actually like the (high-level) explanation of session windows in the KStreams docs (https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#session-windows). Don't we want to re-use some of that here, including the diagrams we created to help users understand session windows more easily?

This is important for non-overlapping windows, in which each record must be
contained in exactly one window.

Hopping Window

Contributor:

I actually like the (high-level) explanation of hopping windows in the KStreams docs (https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows). Don't we want to re-use some of that here, including the diagrams we created to help users understand hopping windows more easily?

.. image:: ../img/ksql-window-aggregation.png
:alt: Diagram showing three types of time windows in KSQL streams: tumbling, hopping, and session

Tumbling Window

Contributor:

We have a diagram for tumbling windows in the KStreams docs (https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#tumbling-time-windows). Don't we want to re-use that here to help users understand tumbling windows more easily?

Windowed Joins
--------------

KSQL supports using windows in JOIN queries.

Contributor:

Nit: "KSQL supports using windows in JOIN queries via a WITHIN clause."

Contributor:

Also:

  • Support for WITHIN: Any restrictions for WITHIN, or can we use it for any join we support?
• A stream-stream join must always be windowed, right? We should also call this out in the new docs on Joins.
  • Do we want to cover windowed joins under joins, or here? Or in both places?

--------------

A new window starts if the last event that arrived is further back in time
than a specified session timeout time.

Contributor:

IMHO we should not use the words "session timeout time", as there is a very similar config setting of the same name. Thus far we say "activity gap" / "inactivity gap".

be in the UTC timezone, which may not be appropriate if you want to have daily
windows in your timezone.

Window Types

@miguno (Contributor), Nov 23, 2018:

For each of the three window types (tumbling, hopping, session), we should not just cover how to create the respective queries/tables, but also how to subsequently use the created tables, because this latter step is where a lot of users struggle.

• What does the data in a windowed table look like (notably the key)?
  • How can I write a SELECT query (well, any query) on a windowed table?
  • How can I access window-related information such as start/end time of the window? (and e.g. ROWTIME vs. WINDOWSTART() and WINDOWEND() times)
  • How can I query a specific window, say, the window from today 9:00AM (absolute time)? Or the window from 2 hours ago (relative time)?

Member Author:

I opened KSQL-1931 to track this.

@JimGalasyn merged commit 91bae9a into confluentinc:5.0.0-post, Nov 28, 2018
@JimGalasyn deleted the ksql-1787 branch, November 28, 2018 00:16