Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datagen: Consume iana field in timezone.json #4218

Merged
merged 7 commits into from
Oct 27, 2023
Merged

Conversation

sffc
Copy link
Member

@sffc sffc commented Oct 25, 2023

First part of #4044

The data changed because we're selecting a different value from the aliases list than we were before. However, upon upgrading to CLDR 44, this is a well defined algorithm.

@sffc sffc requested review from a team, robertbastian and Manishearth as code owners October 25, 2023 01:15
Comment on lines +71 to +78
} else if let Some(iana) = &bcp47_tzid_data.iana {
canonical_tzids.insert(*bcp47_tzid, iana.clone());
} else if let Some(iana) = &bcp47_tzid_data
.alias
.as_ref()
.and_then(|s| s.split(' ').next())
{
canonical_tzids.insert(*bcp47_tzid, String::from(*iana));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srl295: can you confirm that this algorithm is correct? First check for the iana attribute, and if it's not there, then choose the first entry in the aliases list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that familiar with the algorithm… i can take a look

Copy link
Member

@srl295 srl295 Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want IANA's preferred ID, your description above sounds right.

But then you should get Asia/Kolkata because that's what the data has.

<type name="inccu" description="Kolkata, India" alias="Asia/Calcutta Asia/Kolkata" iana="Asia/Kolkata"/> (just as in the spec).

I'd go with CLDR's "canonical" id which is simply the first entry in the alias list. IANA tends to rename stuff, which is why we stablized it to Asia/Calcutta.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm that this algorithm is correct? First check for the iana attribute, and if it's not there, then choose the first entry in the aliases list.

Yes, this is correct.

@sffc sffc requested review from srl295 and removed request for a team and robertbastian October 25, 2023 01:17
@sffc sffc requested a review from nordzilla as a code owner October 26, 2023 00:28
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This significantly changes data with CLDR 43:

  • Asia/Kolkata becomes Asia/Calcutta, which is a regression
  • W-SU becomes Europe/Moscow, which is definitely an improvement
  • There are many more improvements, like anything prefixed with a country is being replaced by the correct continent/ocean prefix

I hadn't looked at the current data, but it seems very wrong and I'd like to make a 1.3 patch once this is landed. However, Kolkata is a regression, and there might be others. Can we identify them and special case them in datagen to generate the correct data on 43?

Cursorary scan:

  • Pacific/Kanton regresses to Pacific/Enderbury
  • Asia/Yangon regresses to Asia/Rangoon
  • Europe/Kyiv regresses to Europe/Kiev

@sffc
Copy link
Member Author

sffc commented Oct 26, 2023

Yes I'm aware that there are regressions but there are also improvements against CLDR 43 data. We already weren't consistently producing the correct results, and with this change we are just changing which results are right and which ones are wrong. This is because CLDR 43 did not define the canonical ID, so as far as ICU4X is concerned, we can return any alias and it is correct according to CLDR 43.

@robertbastian
Copy link
Member

There's an argument to be made that once we have returned an ID as canonical, we shouldn't change it. I don't want to make that argument, because I really want to clean this up asap, but it's out there.

@sffc
Copy link
Member Author

sffc commented Oct 27, 2023

We have spent months debating what codes to return as canonical in CLDR and ECMA, and CLDR 44 finally defines them. People using CLDR 43 data cannot have any expectations in this area. It's really that simple. If they want forever stable codes, use the BCP-47 ones.

Aside: this API is doc-hidden in ICU4X 1.3 if that makes stability problems moot.

https://docs.rs/icu/latest/icu/timezone/struct.IanaBcp47RoundTripMapper.html

@sffc
Copy link
Member Author

sffc commented Oct 27, 2023

@yumaoka Can you review the algorithm in this PR?

/// let iana_id = mapper_borrowed.bcp47_to_iana(bcp47_id.unwrap());
/// assert_eq!(iana_id, Some("Asia/Kolkata"))
/// assert_eq!(iana_id, Some("Asia/Calcutta"))
Copy link
Member

@srl295 srl295 Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems backwards from the algorithm you described - but i might not know the syntax here. iana_idwould be Kolkata (for now!)

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code and comment seems to match the algorithm for getting IANA's preferred id, whatever that happens to be at a moment. So that would be Asia/Kolkata

@sffc sffc merged commit 2f1a6c1 into unicode-org:main Oct 27, 2023
@sffc sffc deleted the tz-iana-field branch October 27, 2023 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants