I18N and Metadata Translations for Data Package #42

Closed
relet opened this issue Apr 24, 2013 · 22 comments

@relet

relet commented Apr 24, 2013

How should the standard support titles, descriptions and data fields in languages other than English?

Proposal (Nov 2016)

An internationalised field:

# i18n
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

Summary:

Each localizable string in datapackage.json could take two forms:

  • A simple string (for backward compatibility)
  • An object mapping ISO locale codes (with or without the region specification, e.g. 'en' or 'es-ES') to their representations. In this object, an empty key "" denotes the 'default' representation.

Not all properties would be localizable for now. For the sake of simplicity, we limit this to only the following properties:

  • title (at package and resource level)
  • description (at package and resource level)

Default Language

You can define the default language for a data package using a lang attribute:

"lang": "en"

If none is specified, the default language is English (?).
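
A minimal resolution sketch (Python; the resolve helper is hypothetical, not part of the spec) showing how a reader could handle both forms, preferring an exact locale match and falling back to the "" default:

# Sketch only: a hypothetical helper for resolving a localizable property.

def resolve(value, locale):
    """Return the best available representation of a localizable value."""
    if isinstance(value, str):          # simple string: backward-compatible form
        return value
    # object form: exact locale, then bare language code, then the "" default
    if locale in value:
        return value[locale]
    base = locale.split("-")[0]
    if base in value:
        return value[base]
    return value.get("", next(iter(value.values())))

descriptor = {
    "lang": "en",
    "title": {
        "": "Israel's 2015 budget",
        "he-IL": "תקציב לשנת 2015",
        "es": "Presupuestos Generales del Estado para el año 2015 de Israel",
    },
}

print(resolve(descriptor["title"], "he-IL"))  # -> תקציב לשנת 2015
print(resolve(descriptor["title"], "fr"))     # -> Israel's 2015 budget (default)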

@rufuspollock
Contributor

I like the JSON-LD approach of @{lang-code}. I actually had this in the original version of Simple Data Format (but it got removed in the quest for simplicity).

While i18n seems good, I do wonder whether the Occam's razor for standards should also be applied here: "how essential is this, and how many potential users will care about this feature?"

@relet
Author

relet commented Apr 24, 2013

I agree that it could be omitted, but that decision should then be mentioned in the standard or a FAQ:

  • How should I mark data that is not (described) in English?
  • How should I handle data that is presented in several languages?
  • Can I provide these fields in several languages if I want to?

@rufuspollock
Contributor

I'm starting to think we could at least mention the idea of using @-style stuff ...

@trickvi

trickvi commented Jan 22, 2014

I actually quite like this but I would focus more on l10n than i18n especially since we're very likely to add foreign keys soon (issue #23). That would mean everybody could point to the same dataset which could include many locales (translations).

What I'm thinking is something like a new optional field for the datapackage specification: alternativeResources (since we've all of a sudden decided to go for lowerCamelCase instead of the previous underscore_keywords even if that means we have to break backwards compatibility/consistency -- me not like but that's a different issue).

The form I'm thinking is something like:

{
    "name": "dataset-identifier",
    "...": "...",
    "resources": [
        {
            "name": "resource-identifier",
            "schema" : { "..." : "..." },
            "..." : "..."
        }
    ],
    "..." : "...",
    "alternativeResources" : {
        "resource-identifier": {
            "is-IS" : {
                "path": "/data/LC_messages/is_IS.csv",
                "format": "csv",
                "mediatype": "text/csv",
                "encoding": "<default utf8>",
                "bytes": 10000000,
                "hash": "<md5 hash of file>",
                "modified": "<iso8601 date>"
                "sources": "<source for this file>",
                "licenses": "<inherits from resource or datapackage>"
            },
            "de-DE" : { "..." : "..." },
            "..." : "..."
        }
    },
    "..." : "..." 

At the moment I'm thinking the translations would be files with the exact same schema (so things are duplicated) because that makes it easier to do both translations (copy this file and translate the values you want) and implementation (want to get the Romanian version just fetch this resource instead).

I'm reluctant to call alternativeResources something like l10n, translations or locales (even though that's what I'm using to identify the alternative resources) because I would like to be able to have other identifiers, for example "en-GB-simple" or something like that. There I'm thinking of datasets that would, for example, carry COFOG classifications. This way the data package for COFOG classifications could provide the official names for the COFOG categories, but also the simple jargonless versions (which are used on WhereDoesMyMoneyGo) and the translations of those simple classifications, like the ones budzeti.ba or hvertferskatturinn.is use.

However that just opens up a new problem: How to standardise "locales/alternativeResources" identifiers? So maybe it's enough to just stick with locales as identifiers and stick to BCP 47. If people decide to create a jargonless version of a dataset then that would be a different dataset (with its own l10n). So we could just call it translations and live happily ever after.
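
A rough sketch (Python; it assumes the hypothetical alternativeResources layout above, and the helper name is mine) of how a consumer could pick the right file for a requested locale and fall back to the base resource:

# Sketch only: assumes the proposed "alternativeResources" layout above.

def pick_resource(package, resource_name, locale):
    """Return the resource descriptor to read for the requested locale."""
    base = next(r for r in package["resources"] if r["name"] == resource_name)
    alternatives = package.get("alternativeResources", {}).get(resource_name, {})
    # Fall back to the base (default-language) resource if no translation exists.
    return alternatives.get(locale, base)

package = {
    "name": "dataset-identifier",
    "resources": [{"name": "resource-identifier", "path": "data/base.csv"}],
    "alternativeResources": {
        "resource-identifier": {
            "is-IS": {"path": "data/LC_messages/is_IS.csv", "format": "csv"}
        }
    },
}

print(pick_resource(package, "resource-identifier", "is-IS")["path"])  # is_IS.csv
print(pick_resource(package, "resource-identifier", "fr-FR")["path"])  # base.csv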

@rufuspollock
Contributor

@tryggvib How often do people actually translate an entire dataset? Is it quite common?

@trickvi

trickvi commented Jan 22, 2014

I think this applies to perhaps smaller datasets used with foreign keys. This could be datasets with the names of all countries in the world, so you can point to them instead of having them only in English, classification datasets like the ones I mention, etc. (I think this is the biggest use case).

I also think this is beneficial for datasets created in one non-English speaking country, that you want to make comparable to other datasets, for example as part of some global data initiative, so you would translate it into English and make that available. That way you can make the dataset available in two languages.

As a side note, it might be interesting to start some project to make dataset translations simpler ;)

@pvgenuchten

Hi @tryggvib @rgrp, I found this thread while searching for i18n in datapackage.json. The most common use case is probably that people will want to describe their dataset in more than a single language. However, we've also found some cases where a full dataset is translated into multiple languages.

Looking at JSON-LD's @language attribute, it seems there are three options available (http://www.w3.org/TR/json-ld/#string-internationalization):

{
  "@context": {
    ...
    "ex": "http://example.com/vocab/",
    "@language": "ja",
    "name": { "@id": "ex:name", "@language": null },
    "occupation": { "@id": "ex:occupation" },
    "occupation_en": { "@id": "ex:occupation", "@language": "en" },
    "occupation_cs": { "@id": "ex:occupation", "@language": "cs" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation": "忍者",
  "occupation_en": "Ninja",
  "occupation_cs": "Nindža",
  ...
}

or

{
  "@context":
  {
    ...
    "occupation": { "@id": "ex:occupation", "@container": "@language" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation":
  {
    "ja": "忍者",
    "en": "Ninja",
    "cs": "Nindža"
  }
  ...
}

or

{
  "@context": {
    ...
    "@language": "ja"
  },
  "name": "花澄",
  "occupation": {
    "@value": "Scientist",
    "@language": "en"
  }
}

The first seems to have the best backwards compatibility.

@Stiivi
Contributor

Stiivi commented Feb 4, 2014

To summarize my experience with translations: translation happens on two levels, metadata translation and data translation.

The metadata translation is simpler:

  1. define keys which are localizable, such as labels, descriptions and comments
  2. have a way to specify the localized values

Having the localization in the main file might be handy for the package reader, but it has a disadvantage when providing additional translations: one has to edit the file, or have a tool that combines multiple metadata specifications into one file. A much better solution is to keep metadata translations as separate objects/files, for example datapackage-locale-XXXX.json, or a folder of LOCALE.json files, or something like that. That makes it much easier to move translations around. With multiple datasets sharing the same structure, translation is just a matter of copying a file.
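
A rough sketch of that idea (Python; the datapackage-locale-XX.json file name and the flat overlay structure are my assumptions, nothing specified anywhere):

import json
from pathlib import Path

def load_localized_descriptor(base_dir, locale):
    """Merge an optional per-locale metadata overlay onto the main descriptor."""
    descriptor = json.loads((Path(base_dir) / "datapackage.json").read_text())
    overlay_path = Path(base_dir) / f"datapackage-locale-{locale}.json"
    if overlay_path.exists():
        overlay = json.loads(overlay_path.read_text())
        # Only localizable keys (e.g. title, description) are expected in the overlay.
        descriptor.update(overlay)
    return descriptor

# e.g. load_localized_descriptor("my-package", "sk") would apply datapackage-locale-sk.json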

Data translation is slightly different. The localized data can be provided in multiple formats:

  • whole dataset copy per language
  • denormalized translation with one column per language; not all columns need to be localized (for example, the European CPV was provided in this form for all languages)
  • normalized translation with a column specifying the language (common for data in localized apps)

The question is: which cases would we like to handle? All of them? Only certain ones?

How the translation is handled technically during the data analysis process depends on the case:

The most relevant tables to localize are the dimension tables, so I'm going to use them as an example.

  • whole dataset: JOIN table based on desired language
  • denormalized translation: switch columns based on language
  • normalized: use additional WHERE condition on the language column

As for specification requirements:

  • whole dataset: we just need to point to another resource with the SAME structure as the original one and assign a language to it
  • denormalized translation: specify which columns are localized; assign column names to their respective locales
  • normalized: specify which column contains the language code

As for the denormalized translation: do we want to provide the "logical" column name or the original name? For example, the columns might be name_de, name_en, name_sk - do we want to provide only the relevant name_XX to the user based on the user's language choice, or rename it to just name?

In the Cubes framework we use the denormalized translation and hide the original column names (stripping the locale column suffix) – therefore the reports work regardless of the language used. The reports even work when a localized column is added to a non-localized dataset later. But Cubes is a metadata-heavy framework.
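
A small sketch of the denormalized case (Python; the column names follow the name_de/name_en example above, and the suffix-stripping only roughly mirrors what Cubes does):

# Sketch: pick the localized column (name_de, name_en, ...) and expose it as "name".

LOCALIZED_COLUMNS = {"name"}          # columns declared localizable in the metadata

def localize_row(row, locale, default_locale="en"):
    """Return a row with logical column names, resolved for one locale."""
    out = {}
    for column, value in row.items():
        base, _, suffix = column.rpartition("_")
        if base in LOCALIZED_COLUMNS:
            if suffix == locale:
                out[base] = value            # localized value wins
            elif suffix == default_locale:
                out.setdefault(base, value)  # fall back to the default language
        else:
            out[column] = value
    return out

row = {"code": "01.1", "name_en": "Executive organs", "name_de": "Exekutivorgane"}
print(localize_row(row, "de"))  # {'code': '01.1', 'name': 'Exekutivorgane'}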

@rufuspollock
Contributor

@pwalsh @danfowler this is one to look at again.

@pwalsh
Member

pwalsh commented Nov 23, 2015

@rgrp related: my long-standing pull request, which deals with i18n in the resources themselves: #190

@rufuspollock
Contributor

@pwalsh I know - I still feel we should do metadata first then data.

@akariv
Member

akariv commented Dec 4, 2015

I agree that starting with meta-data is a good idea.

My humble suggestion is that each localizable string in datapackage.json could take two forms:

  • A simple string (for backward compatibility)
  • An object, mapping from ISO Locale codes (with or without the region specification, e.g. 'en', or 'es-ES') to their representations.
    In this object, you could have an empty key "" which denotes the 'default' representation

(For the sake of simplicity, I also think that we could limit this to only apply for the title and description fields)

For example:

...
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

@pwalsh
Member

pwalsh commented Dec 4, 2015

Since we do lots of "string or object" type patterns in the Data Package specs generally, I'm partial to the suggestion made by @akariv. However, it could get complicated real quick if someone tries to apply this liberally to any string located anywhere on the datapackage.json descriptor (think: custom data structures of heavily nested objects).

One way to counter that is to limit translatable fields explicitly, but that kind of goes against the flexibility of the family of Data Package specifications in general.

I'd suggest something that follows on from the pattern I suggest for data localisation here

Where:

  • @ becomes a special symbol in keys, denoting a translated field
  • What follows @ is a language code
  • What precedes @ is a property name, expected to match another property of the data package.

I also think that the distinction between localisation and translation is important, and would again suggest the same concept as I suggest for data, here. Note that this is not some invention: the pattern I'm suggesting is heavily influenced by my work with translation and localisation using Django, and probably is quite consistent with other web frameworks.

Example:

{
  "name": "School of Rock",
  "description": "A school, for Rock.",
  "name@he": "בית הספר לרוק",
  "description@he": "בית ספר, לרוק"
}
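
A small sketch (Python; the localized_view helper is hypothetical) of how a reader could interpret such keys, splitting on @ into a property name and a language code:

# Sketch: split "property@lang" keys into (property, lang) and resolve per locale.

def localized_view(descriptor, locale):
    """Return the descriptor with any "prop@locale" values folded into "prop"."""
    view = {k: v for k, v in descriptor.items() if "@" not in k}
    for key, value in descriptor.items():
        prop, _, lang = key.partition("@")
        if lang == locale:
            view[prop] = value
    return view

dp = {
    "name": "School of Rock",
    "description": "A school, for Rock.",
    "name@he": "בית הספר לרוק",
    "description@he": "בית ספר, לרוק",
}

print(localized_view(dp, "he")["name"])  # בית הספר לרוק
print(localized_view(dp, "fr")["name"])  # School of Rock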

@akariv
Member

akariv commented Dec 4, 2015

@pwalsh, two comments:

  • The reason I suggested we use this pattern only for the title and description fields is that having multiple translations for other fields is probably pointless (and tbh user-supplied fields can use whichever scheme they want).
  • I really like your suggestion, but don't you think that your scheme might result in a lot of clutter? For example, imagine translating a few fields to 20+ languages? JSON doesn't have any inherent ordering of object keys, which could make things quite messy...

@pwalsh
Member

pwalsh commented Dec 5, 2015

@akariv

On the first point, user-specified fields on Data Package are part of the design of the spec, and with the way the family of specs works, I do think it would be unusual to explicitly say only specific fields are translatable.

On the second point: yes, it would result in a lot of clutter. I guess we have to decide if we are optimising for human reading of the spec too. An alternative approach would be to group everything by language, which would at least be an ordered type of clutter :).

{
  "translations": {
    "he": { ..  all translated properties ...},
    ... etc ...
  }
}

@akariv
Member

akariv commented Dec 5, 2015

(What I meant was not that only these two fields are translatable, but that only for them does the spec specify a method for translating - other user-specified fields may use a different scheme - although on second thought that may not be the best practice.)

As for readability - I think that is definitely a factor (as someone said: "JSON is readable as simple text making it amenable to management and processing using simple text tools")

And your suggestion does improve things in terms of clutter, but it somehow doesn't feel right to me to separate the original value from the translation.

@pwalsh
Member

pwalsh commented Dec 5, 2015

@akariv yes, it is not a simple problem to solve. Maybe we should be optimising for cases with a handful of translations - say, 2-5 languages - and acknowledging that we can likewise expect, say, 2-5 translatable properties on a given package?

@rufuspollock
Contributor

rufuspollock commented Dec 1, 2016

So, I've thought quite a bit about this and I generally agree with @akariv's approach:

"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

I've updated the main description of the issue with a relatively full spec based on this.

Welcome comments from @frictionlessdata/specs-working-group


@rufuspollock rufuspollock added this to the Version-1 milestone Dec 1, 2016
@rufuspollock rufuspollock changed the title from "I18N for Data Package" to "I18N and Metadata Translations for Data Package" Dec 11, 2016
@pwalsh
Member

pwalsh commented Dec 11, 2016

@rufuspollock agreed.

In my opinion, we do need lang or languages as well as the actual handling of translations for properties. See the pattern described here

I prefer the array and the special treatment of the first element in the array, as per my pattern. Another approach, like in Django for example, is LANGUAGE_CODE for the default lang and an additional LANGUAGES array for the supported translations. But I'm not convinced of the need for two different properties.
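
For illustration only, a strawman sketch of that single-property variant, where the first element of a languages array is treated as the default (the property name and the rule are assumptions here, not spec):

# Strawman sketch: one "languages" array, first element is the default language.

descriptor = {
    "name": "israel-budget-2015",
    "languages": ["en", "he", "es"],   # "en" is the default/source language
    "title": "Israel's 2015 budget",
    "title@he": "תקציב לשנת 2015",
}

default_language = descriptor["languages"][0]
supported = set(descriptor["languages"])
print(default_language, sorted(supported))  # en ['en', 'es', 'he']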

@pwalsh
Member

pwalsh commented Feb 5, 2017

@rufuspollock let's schedule this for v1.1 - there are lots of changes for v1 and they should settle before we introduce translations, esp. as the proposal here uses the dynamic type pattern we moved away from in v1.

@pwalsh pwalsh modified the milestones: v1.1, v1.0 Feb 5, 2017
@rufuspollock
Contributor

@pwalsh agreed.

@ppKrauss

Hi, no news here (only later, in v1.1)?


If "real life example" is useful to this discussion ... My approach (while no v1.1) at datasets-br/state-codes's datapackage.json, was to add lang descriptor and lang-suffix differentiator. The lang at source level, as default for all fields.

Hum... the interpretation was "language of the descriptions (and also of the CSV textual contents)".

If some field or descriptor needs to use another language, I use a -{lang} suffix. In the example we used title as the default (en) and title-pt for the Portuguese title.
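
A tiny sketch of that resolution (Python; the values and the get_property helper are illustrative only, not copied from the real datapackage.json):

# Sketch: "lang" gives the default; "title-pt" style suffixes override per language.

descriptor = {
    "lang": "en",
    "title": "Brazilian state codes",            # illustrative default-language value
    "title-pt": "Códigos dos estados do Brasil", # illustrative Portuguese value
}

def get_property(descriptor, prop, locale):
    """Prefer "prop-{locale}", fall back to the plain (default-language) property."""
    return descriptor.get(f"{prop}-{locale}", descriptor[prop])

print(get_property(descriptor, "title", "pt"))  # Códigos dos estados do Brasil
print(get_property(descriptor, "title", "en"))  # Brazilian state codes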

@roll roll removed this from the v1.1 milestone Apr 14, 2023
roll added a commit that referenced this issue Jun 26, 2024
@roll roll added this to the v2.1 milestone Jun 26, 2024
@frictionlessdata frictionlessdata locked and limited conversation to collaborators Oct 21, 2024
@roll roll converted this issue into discussion #991 Oct 21, 2024
@roll roll removed this from the v2.1 milestone Oct 22, 2024
