-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datagen: Consume iana field in timezone.json #4218
Conversation
} else if let Some(iana) = &bcp47_tzid_data.iana { | ||
canonical_tzids.insert(*bcp47_tzid, iana.clone()); | ||
} else if let Some(iana) = &bcp47_tzid_data | ||
.alias | ||
.as_ref() | ||
.and_then(|s| s.split(' ').next()) | ||
{ | ||
canonical_tzids.insert(*bcp47_tzid, String::from(*iana)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srl295: can you confirm that this algorithm is correct? First check for the iana attribute, and if it's not there, then choose the first entry in the aliases list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not that familiar with the algorithm… i can take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want IANA's preferred ID, your description above sounds right.
But then you should get Asia/Kolkata
because that's what the data has.
<type name="inccu" description="Kolkata, India" alias="Asia/Calcutta Asia/Kolkata" iana="Asia/Kolkata"/>
(just as in the spec).
I'd go with CLDR's "canonical" id which is simply the first entry in the alias list. IANA tends to rename stuff, which is why we stablized it to Asia/Calcutta
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm that this algorithm is correct? First check for the iana attribute, and if it's not there, then choose the first entry in the aliases list.
Yes, this is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This significantly changes data with CLDR 43:
Asia/Kolkata
becomesAsia/Calcutta
, which is a regressionW-SU
becomesEurope/Moscow
, which is definitely an improvement- There are many more improvements, like anything prefixed with a country is being replaced by the correct continent/ocean prefix
I hadn't looked at the current data, but it seems very wrong and I'd like to make a 1.3 patch once this is landed. However, Kolkata is a regression, and there might be others. Can we identify them and special case them in datagen to generate the correct data on 43?
Cursorary scan:
Pacific/Kanton
regresses toPacific/Enderbury
Asia/Yangon
regresses toAsia/Rangoon
Europe/Kyiv
regresses toEurope/Kiev
Yes I'm aware that there are regressions but there are also improvements against CLDR 43 data. We already weren't consistently producing the correct results, and with this change we are just changing which results are right and which ones are wrong. This is because CLDR 43 did not define the canonical ID, so as far as ICU4X is concerned, we can return any alias and it is correct according to CLDR 43. |
There's an argument to be made that once we have returned an ID as canonical, we shouldn't change it. I don't want to make that argument, because I really want to clean this up asap, but it's out there. |
We have spent months debating what codes to return as canonical in CLDR and ECMA, and CLDR 44 finally defines them. People using CLDR 43 data cannot have any expectations in this area. It's really that simple. If they want forever stable codes, use the BCP-47 ones. Aside: this API is doc-hidden in ICU4X 1.3 if that makes stability problems moot. https://docs.rs/icu/latest/icu/timezone/struct.IanaBcp47RoundTripMapper.html |
@yumaoka Can you review the algorithm in this PR? |
/// let iana_id = mapper_borrowed.bcp47_to_iana(bcp47_id.unwrap()); | ||
/// assert_eq!(iana_id, Some("Asia/Kolkata")) | ||
/// assert_eq!(iana_id, Some("Asia/Calcutta")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems backwards from the algorithm you described - but i might not know the syntax here. iana_idwould be Kolkata (for now!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code and comment seems to match the algorithm for getting IANA's preferred id, whatever that happens to be at a moment. So that would be Asia/Kolkata
First part of #4044
The data changed because we're selecting a different value from the aliases list than we were before. However, upon upgrading to CLDR 44, this is a well defined algorithm.