Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

breaking,time: rewrite parse_rfc3339/1 to improve performance, reject partial timestamps, that miss date info like 22:47:08Z #22585

Merged
merged 12 commits into from
Oct 21, 2024

Conversation

enghitalo
Copy link
Contributor

@enghitalo enghitalo commented Oct 19, 2024

Before

 SPENT   738.004 ms in decoder2.decode[time.Time]('2022-03-11T13:54:25')!

Now

 SPENT    41.814 ms in decoder2.decode[time.Time]('2022-03-11T13:54:25')!

NOTE: I will base myself in this code to create a Parser struct then it can be useful in others standard like above ones
https://ijmacd.github.io/rfc3339-iso8601/

@esquerbatua
Copy link
Contributor

Wow, huge increase in performance, great job!

['2015-01-06T15:47:32+01:00', '2015-01-06 16:47:32.000000'],
['2015-01-06T15:47:32-01:00', '2015-01-06 14:47:32.000000'],
['2015-01-06T15:47:32+01:10', '2015-01-06 16:57:32.000000'],
['2015-01-06T15:47:32-01:10', '2015-01-06 14:37:32.000000'],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include test

['2015-01-06T22:59:59+00:10', '2015-01-06 23:00:09.000000'],

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that still needed?

Copy link
Member

@spytheman spytheman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work.

@spytheman
Copy link
Member

The failure in TOML seems to happen for values like this:

import time

t := time.parse_rfc3339('1979-05-27T07:32:00-08:00')!
dump(t)

On master, that prints:
[a.v:4] t: 1979-05-27 15:32:00

On this PR, that prints:
[a.v:4] t: 1979-05-26 23:32:00

@spytheman
Copy link
Member

spytheman commented Oct 19, 2024

Other examples, that work on master, but do not here:

import time

dump(time.parse_rfc3339('1979-05-27T07:32:00-08:00')!)
dump(time.parse_rfc3339('22:47:08Z')!)
dump(time.parse_rfc3339('01:47:08.981+03:00')!)
dump(time.parse_rfc3339('2024-10-19T22:47:08-00:00')!)
dump(time.parse_rfc3339('2024-10-19T22:47:08.9+00:00')!)
dump(time.parse_rfc3339('2024-10-20T01:47:08+03:00')!)
dump(time.parse_rfc3339('2024-10-20T01:47:08.981+03:00')!)

On master:

[a.v:3] time.parse_rfc3339('1979-05-27T07:32:00-08:00')!: 1979-05-27 15:32:00
[a.v:4] time.parse_rfc3339('22:47:08Z')!: 0000-11-30 22:47:08
[a.v:5] time.parse_rfc3339('01:47:08.981+03:00')!: 0000-11-29 22:47:08
[a.v:6] time.parse_rfc3339('2024-10-19T22:47:08-00:00')!: 2024-10-19 22:47:08
[a.v:7] time.parse_rfc3339('2024-10-19T22:47:08.9+00:00')!: 2024-10-19 22:47:08
[a.v:8] time.parse_rfc3339('2024-10-20T01:47:08+03:00')!: 2024-10-19 22:47:08
[a.v:9] time.parse_rfc3339('2024-10-20T01:47:08.981+03:00')!: 2024-10-19 22:47:08

On the PR:

[a.v:3] time.parse_rfc3339('1979-05-27T07:32:00-08:00')!: 1979-05-26 23:32:00
================ V panic ================
   module: main
 function: main()
  message: Invalid time format code: 1, error: datetime string is too short
     file: a.v:4

@spytheman
Copy link
Member

I will keep parsing with is_local: false for performance reason.

I think that code, should work as a first priority, and then it can be optimised.

Breaking users of that parser should not happen, unless the parser has a bug, that has to be fixed.

@spytheman
Copy link
Member

You can get more time examples from: https://ijmacd.github.io/rfc3339-iso8601/ .

@enghitalo
Copy link
Contributor Author

enghitalo commented Oct 19, 2024

@spytheman I think master is broken for 22:47:08Z

  • code
import time

fn main() {
	dump(time.parse_rfc3339('22:47:08Z')!)
}
  • result
time.parse_rfc3339('22:47:08Z')!: 0000-11-30 22:47:08

What do you think should be the result? 0000-00-00 22:47:08 looks great

@spytheman
Copy link
Member

Only the time part should be set in this case, since the date is not specified.

Note however, that on master, it does not return an error for this case, while it does here, which later leads to problems for the toml tests.

@JalonSolov
Copy link
Contributor

If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.

@spytheman spytheman changed the title perf: parse_rfc3339 time: improve performance of parse_rfc3339/1 Oct 20, 2024
@spytheman
Copy link
Member

spytheman commented Oct 20, 2024

If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.

No, it is not, since that is ambiguous and time dependent.
If a system sends you a timestamp, or records it in a log, you can not know what the time was, when that timestamp was generated by the system.

Consider also the case, where you have just a date: 2024-10-20 . If you receive such a time, you will not set the time part of it to the arbitrary current hour:minute:second.

In contrast, setting the unknown fields to 0, is predictable, simple and unambiguous.

@enghitalo
Copy link
Contributor Author

enghitalo commented Oct 20, 2024

toml will need some fix

  • If date and time are given but offset is None, Datetime corresponds to a Local Date-Time.

from https://docs.rs/toml/latest/toml/value/struct.Datetime.html
image

from https://toml.io/en/
image

@spytheman
Copy link
Member

spytheman commented Oct 20, 2024

The toml module has very extensive tests, including external ones (see .github/workflows/download_full_toml_test_suites.sh and .github/workflows/toml_ci.yml that runs it).

I suspect (but can be wrong of course), that the bug is more likely to be in the new parser here, since that is the most recent thing that is changed.

@spytheman
Copy link
Member

Here is the actual spec: https://toml.io/en/v1.0.0#local-date-time
image

@JalonSolov
Copy link
Contributor

JalonSolov commented Oct 20, 2024

If only a time is specified, and a date is needed, then use the current date. This seems like a logical, reasonable alternative to trying to decide if 0000-00-00 is good or not.

No, it is not, since that is ambiguous and time dependent. If a system sends you a timestamp, or records it in a log, you can not know what the time was, when that timestamp was generated by the system.

Consider also the case, where you have just a date: 2024-10-20 . If you receive such a time, you will not set the time part of it to the arbitrary current hour:minute:second.

In contrast, setting the unknown fields to 0, is predictable, simple and unambiguous.

So you're saying the time would be set to 00:00:00... which is exactly midnight.

00:00:00 is a valid time. 0000-00-00 is an invalid date (at least in any computer system I've ever seen. I suppose you could argue it was the exact instant of the Big Bang, but...).

It doesn't make sense to set one to a valid value, and the other to an invalid value.

@spytheman
Copy link
Member

Is 0000-01-01 better for you?
It has valid year, valid month, and valid day.
For that particular format, any fixed date part will work. We only have information about the time part.

@enghitalo
Copy link
Contributor Author

00:00:00 is a valid time. 0000-00-00 is an invalid date (at least in any computer system I've ever seen. I suppose you could argue it was the exact instant of the Big Bang, but...).

To me, 00:00:00 and 0000-00-00 make sense, once Time not specifically is date, time or data-time. There is the possibility to implement a field with a enum signing if it is .date, .time or .date_time; or change fields hour, minute, second [...] to option, if do you prefer

@spytheman
Copy link
Member

spytheman commented Oct 20, 2024

No, the result should be a valid date, @JalonSolov has a point. time.Time{} is an invalid one (having all fields be 0s).
Having a separate field to make it modal will lead to a lot more edge cases and bugs.

@spytheman
Copy link
Member

On master, the fixed date is 0000-11-30, which is also valid, it is just not intuitive.

@spytheman
Copy link
Member

It is interesting, that Go for example, refuses to parse 22:47:08Z:

package main

import (
	"fmt"
	"time"
)

func main() {
	t, err := time.Parse(time.RFC3339, "22:47:08Z")
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(t)
}

produces:

parsing time "22:47:08Z" as "2006-01-02T15:04:05Z07:00": cannot parse "22:47:08Z" as "2006"
0001-01-01 00:00:00 +0000 UTC

Rust also errors out for it:

use chrono::format::ParseError;
use chrono::{DateTime};

fn main() -> Result<(), ParseError> {
    let rfc3339 = DateTime::parse_from_rfc3339("2016-06-20T12:41:45.14Z")?;
    println!("{}", rfc3339);
    let rfc3339 = DateTime::parse_from_rfc3339("22:47:08Z")?;
    println!("{}", rfc3339);
    Ok(())
}

produces:

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/smalltime`
2016-06-20 12:41:45.140 +00:00
Error: ParseError(Invalid)

@spytheman
Copy link
Member

I could not find a good way to test for python (its standard library does not appear to have a rfc3339 parser), or for JS.

@spytheman
Copy link
Member

This https://it-tools.tech/date-converter could handle it, but not as RFC3339, but as something called "Mongo ObjectID", whatever it is 🤔 .
For that format, it produces:
image

For RFC3339:
image

@spytheman spytheman changed the title time: improve performance of parse_rfc3339/1 time: rewrite parse_rfc3339/1to improve its performance Oct 20, 2024
@spytheman spytheman changed the title time: rewrite parse_rfc3339/1to improve its performance time: rewrite parse_rfc3339/1 to improve its performance Oct 20, 2024
@enghitalo
Copy link
Contributor Author

enghitalo commented Oct 20, 2024

from https://www.rfc-editor.org/rfc/rfc3339#section-4.3

4. Local Time

4.1. Coordinated Universal Time (UTC)

   Because the daylight saving rules for local time zones are so
   convoluted and can change based on local law at unpredictable times,
   true interoperability is best achieved by using Coordinated Universal
   Time (UTC).  This specification does not cater to local time zone
   rules.

image

@spytheman
Copy link
Member

I do not see, how that affects what the result of 22:47:08Z should be?

@JalonSolov
Copy link
Contributor

JalonSolov commented Oct 21, 2024

In reading rfc 3339 over (and over and over and...), there is no clear statement that a date needs to be included. HOWEVER, if you look at the ABNF for the format, you see

   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-time       = full-date "T" full-time

which at least implies that if you don't have ALL the digits, plus the T plus the time offset, then it is not a valid rfc 3339 format date. This would explain why Go and Rust fail with an error when asked to parse a time only.

This also means that TOML does not support rfc 3339, whether it says so or not. rfc 3339 if very explicit that the time offset must be supplied, as well as the date, there the last 2 example copied from the TOML docs in the screenshot above show just times, and without offsets.

@spytheman
Copy link
Member

@JalonSolov from https://toml.io/en/v1.0.0#offset-date-time

[Offset Date-Time](https://toml.io/en/v1.0.0#offset-date-time)

To unambiguously represent a specific instant in time, 
you *may* use an [RFC 3339](https://tools.ietf.org/html/rfc3339) 
formatted date-time with offset.

...
[Local Date-Time](https://toml.io/en/v1.0.0#local-date-time)

If you omit the offset from an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent the given date-time *without any relation
to an offset or timezone*. It cannot be converted to an instant in time without
additional information. Conversion to an instant, if required, is *implementation-specific*.

...
[Local Date](https://toml.io/en/v1.0.0#local-date)

If you include only the date portion of an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent that entire day without any relation to an offset
or timezone.

...
[Local Time](https://toml.io/en/v1.0.0#local-time)

If you include *only the time portion* of an [RFC 3339](https://tools.ietf.org/html/rfc3339)
formatted date-time, it will represent that time of day *without any relation to a 
specific day* or any offset or timezone.

i.e. in my interpretation, the TOML spec does not require a full RFC3339 date-time
but it allows it. It also allows for time from that spec too.

The shown examples in the linked spec, and the test files in the official
suits also confirm it:
image

I.e. TOML is very liberal in what it allows, while RFC3339 on the other hand, requires the time zone.

@spytheman
Copy link
Member

I also found this specialized Python parser module: https://pypi.org/project/pyRFC3339/ .
It also rejects 22:47:08Z :
image

@spytheman
Copy link
Member

I've just changed it, so that time.parse_rfc3339 now returns an error, for incomplete time (missing date), like all of the other existing rfc3339 parsers do.

The TOML module now supplies the missing date part, when needed, so that it can continue to be conforming to its own relaxed spec.

@spytheman spytheman changed the title time: rewrite parse_rfc3339/1 to improve its performance breaking,time: rewrite parse_rfc3339/1 to improve its performance, reject partial timestamps, that miss date info like 22:47:08Z Oct 21, 2024
@spytheman spytheman changed the title breaking,time: rewrite parse_rfc3339/1 to improve its performance, reject partial timestamps, that miss date info like 22:47:08Z breaking,time: rewrite parse_rfc3339/1 to improve performance, reject partial timestamps, that miss date info like 22:47:08Z Oct 21, 2024
@spytheman
Copy link
Member

This C implementation, also errors on 22:47:08Z:
https://github.com/Bnz-0/rfc3339timestamp/blob/master/rfc3339.h#L52

@spytheman spytheman merged commit c55a75f into vlang:master Oct 21, 2024
67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants