I18N and Metadata Translations for Data Package #42

Closed
relet opened this issue Apr 24, 2013 · 22 comments

@relet

relet commented Apr 24, 2013

How should the standard support titles, descriptions and data fields in languages other than English?

Proposal (Nov 2016)

An internationalised field:

# i18n
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

Summary:

Each localizable string in datapackage.json could take two forms:

  • A simple string (for backward compatibility)
  • An object mapping ISO locale codes (with or without the region specification, e.g. 'en' or 'es-ES') to their representations. In this object, an empty key "" denotes the 'default' representation.

Not all properties would be localizable for now. For the sake of simplicity, we limit this to only the following properties:

  • title (at package and resource level)
  • description (at package and resource level)

Default Language

You can define the default language for a data package using a lang attribute:

"lang": "en"

If none is specified, the default language is English (?).
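
A minimal resolution sketch (Python; the resolve helper is hypothetical, not part of the spec) showing how a reader could handle both forms, preferring an exact locale match and falling back to the "" default:

# Sketch only: a hypothetical helper for resolving a localizable property.

def resolve(value, locale):
    """Return the best available representation of a localizable value."""
    if isinstance(value, str):          # simple string: backward-compatible form
        return value
    # object form: exact locale, then bare language code, then the "" default
    if locale in value:
        return value[locale]
    base = locale.split("-")[0]
    if base in value:
        return value[base]
    return value.get("", next(iter(value.values())))

descriptor = {
    "lang": "en",
    "title": {
        "": "Israel's 2015 budget",
        "he-IL": "תקציב לשנת 2015",
        "es": "Presupuestos Generales del Estado para el año 2015 de Israel",
    },
}

print(resolve(descriptor["title"], "he-IL"))  # -> תקציב לשנת 2015
print(resolve(descriptor["title"], "fr"))     # -> Israel's 2015 budget (default)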

@rufuspollock
Contributor

I like the JSON-LD approach of @{lang-code}. I actually had this in the original version of Simple Data Format (but it got removed in the quest for simplicity).

While i18n seems good, I do wonder whether the Occam's razor for standards should also be applied here: "how essential is this, and how many potential users will care about this feature?"

@relet
Author

relet commented Apr 24, 2013

I agree that it could be omitted, but that decision should then be mentioned in the standard or a FAQ:

  • How should I mark data that is not (described) in English?
  • How should I handle data that is presented in several languages?
  • Can I provide these fields in several languages if I want to?

@rufuspollock
Contributor

I'm starting to think we could at least mention the idea of using @-style stuff ...

@trickvi

trickvi commented Jan 22, 2014

I actually quite like this but I would focus more on l10n than i18n especially since we're very likely to add foreign keys soon (issue #23). That would mean everybody could point to the same dataset which could include many locales (translations).

What I'm thinking is something like a new optional field for the datapackage specification: alternativeResources (since we've all of a sudden decided to go for lowerCamelCase instead of the previous underscore_keywords even if that means we have to break backwards compatibility/consistency -- me not like but that's a different issue).

The form I'm thinking is something like:

{
    "name": "dataset-identifier",
    "...": "...",
    "resources": [
        {
            "name": "resource-identifier",
            "schema" : { "..." : "..." },
            "..." : "..."
        }
    ],
    "..." : "...",
    "alternativeResources" : {
        "resource-identifier": {
            "is-IS" : {
                "path": "/data/LC_messages/is_IS.csv",
                "format": "csv",
                "mediatype": "text/csv",
                "encoding": "<default utf8>",
                "bytes": 10000000,
                "hash": "<md5 hash of file>",
                "modified": "<iso8601 date>"
                "sources": "<source for this file>",
                "licenses": "<inherits from resource or datapackage>"
            },
            "de-DE" : { "..." : "..." },
            "..." : "..."
        }
    },
    "..." : "..." 

At the moment I'm thinking the translations would be files with the exact same schema (so things are duplicated) because that makes it easier to do both translations (copy this file and translate the values you want) and implementation (want to get the Romanian version just fetch this resource instead).

I'm reluctant to call alternativeResources something like l10n, translations or locales (even though that's what I'm using to identify the alternative resources) because I would like to be able to have other identifiers, for example "en-GB-simple" or something like that. There I'm thinking of datasets that would, for example, carry COFOG classifications. This way the data package for COFOG classifications could provide the official names for the COFOG categories, but also the simple jargonless versions (which are used on WhereDoesMyMoneyGo) and the translations of those simple classifications, like the ones budzeti.ba or hvertferskatturinn.is use.

However that just opens up a new problem: How to standardise "locales/alternativeResources" identifiers? So maybe it's enough to just stick with locales as identifiers and stick to BCP 47. If people decide to create a jargonless version of a dataset then that would be a different dataset (with its own l10n). So we could just call it translations and live happily ever after.
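
A rough sketch (Python; it assumes the hypothetical alternativeResources layout above, and the helper name is mine) of how a consumer could pick the right file for a requested locale and fall back to the base resource:

# Sketch only: assumes the proposed "alternativeResources" layout above.

def pick_resource(package, resource_name, locale):
    """Return the resource descriptor to read for the requested locale."""
    base = next(r for r in package["resources"] if r["name"] == resource_name)
    alternatives = package.get("alternativeResources", {}).get(resource_name, {})
    # Fall back to the base (default-language) resource if no translation exists.
    return alternatives.get(locale, base)

package = {
    "name": "dataset-identifier",
    "resources": [{"name": "resource-identifier", "path": "data/base.csv"}],
    "alternativeResources": {
        "resource-identifier": {
            "is-IS": {"path": "data/LC_messages/is_IS.csv", "format": "csv"}
        }
    },
}

print(pick_resource(package, "resource-identifier", "is-IS")["path"])  # is_IS.csv
print(pick_resource(package, "resource-identifier", "fr-FR")["path"])  # base.csv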

@rufuspollock
Contributor

@tryggvib How often do people actually translate an entire dataset? Is it quite common?

@trickvi

trickvi commented Jan 22, 2014

I think this applies to perhaps smaller datasets used with foreign keys. This could be datasets with the names of all countries in the world, so you can point to them instead of having them only in English, classification datasets like the ones I mention, etc. (I think this is the biggest use case).

I also think this is beneficial for datasets created in one non-English speaking country, that you want to make comparable to other datasets, for example as part of some global data initiative, so you would translate it into English and make that available. That way you can make the dataset available in two languages.

As a side note, it might be interesting to start some project to make dataset translations simpler ;)

@pvgenuchten

Hi @tryggvib @rgrp, I found this thread while searching for i18n in datapackage.json. The most common use case is probably that people will want to describe their dataset in more than a single language. However, we've also found some cases where a full dataset is translated into multiple languages.

Looking at JSON-LD's @language attribute, it seems there are three options available (http://www.w3.org/TR/json-ld/#string-internationalization):

{
  "@context": {
    ...
    "ex": "http://example.com/vocab/",
    "@language": "ja",
    "name": { "@id": "ex:name", "@language": null },
    "occupation": { "@id": "ex:occupation" },
    "occupation_en": { "@id": "ex:occupation", "@language": "en" },
    "occupation_cs": { "@id": "ex:occupation", "@language": "cs" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation": "忍者",
  "occupation_en": "Ninja",
  "occupation_cs": "Nindža",
  ...
}

or

{
  "@context":
  {
    ...
    "occupation": { "@id": "ex:occupation", "@container": "@language" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation":
  {
    "ja": "忍者",
    "en": "Ninja",
    "cs": "Nindža"
  }
  ...
}

or

{
  "@context": {
    ...
    "@language": "ja"
  },
  "name": "花澄",
  "occupation": {
    "@value": "Scientist",
    "@language": "en"
  }
}

The first seems to have the best backwards compatibility.

@Stiivi
Contributor

Stiivi commented Feb 4, 2014

To summarize my experience with translations: translation happens on two levels, metadata translation and data translation.

The metadata translation is simpler:

  1. define keys which are localizable, such as labels, descriptions and comments
  2. have a way to specify the localized values

Having the localization in the main file might be handy for the package reader, but it has a disadvantage when providing additional translations: one has to edit the file, or have a tool that combines multiple metadata specifications into one file. A much better solution is to keep metadata translations as separate objects/files, for example datapackage-locale-XXXX.json, or a folder of LOCALE.json files, or something like that. That makes it much easier to move translations around. With multiple datasets sharing the same structure, translation is just a matter of copying a file.
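
A rough sketch of that idea (Python; the datapackage-locale-XX.json file name and the flat overlay structure are my assumptions, nothing specified anywhere):

import json
from pathlib import Path

def load_localized_descriptor(base_dir, locale):
    """Merge an optional per-locale metadata overlay onto the main descriptor."""
    descriptor = json.loads((Path(base_dir) / "datapackage.json").read_text())
    overlay_path = Path(base_dir) / f"datapackage-locale-{locale}.json"
    if overlay_path.exists():
        overlay = json.loads(overlay_path.read_text())
        # Only localizable keys (e.g. title, description) are expected in the overlay.
        descriptor.update(overlay)
    return descriptor

# e.g. load_localized_descriptor("my-package", "sk") would apply datapackage-locale-sk.json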

Data translation is slightly different. The localized data can be provided in multiple formats:

  • whole dataset copy per language
  • denormalized translation with one column per language; not all columns need to be localized (for example, the European CPV was provided in this form for all languages)
  • normalized translation with a column specifying the language (common for data in localized apps)

The question is: which cases would we like to handle? All of them? Only certain ones?

How the translation is handled technically during the data analysis process depends on the case:

The most relevant tables to localize are the dimension tables, so I'm going to use them as an example.

  • whole dataset: JOIN table based on desired language
  • denormalized translation: switch columns based on language
  • normalized: use additional WHERE condition on the language column

As for specification requirements:

  • whole dataset: we just need to point to another resource with the SAME structure as the original one and assign a language to it
  • denormalized translation: specify which columns are localized; assign column names to their respective locales
  • normalized: specify which column contains the language code

As for the denormalized translation: do we want to provide the "logical" column name or the original name? For example, the columns might be name_de, name_en, name_sk - do we want to provide only the relevant name_XX to the user based on the user's language choice, or rename it to just name?

In the Cubes framework we use the denormalized translation and hide the original column names (stripping the locale column suffix) – therefore the reports work regardless of the language used. The reports even work when a localized column is added to a non-localized dataset later. But Cubes is a metadata-heavy framework.
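
A small sketch of the denormalized case (Python; the column names follow the name_de/name_en example above, and the suffix-stripping only roughly mirrors what Cubes does):

# Sketch: pick the localized column (name_de, name_en, ...) and expose it as "name".

LOCALIZED_COLUMNS = {"name"}          # columns declared localizable in the metadata

def localize_row(row, locale, default_locale="en"):
    """Return a row with logical column names, resolved for one locale."""
    out = {}
    for column, value in row.items():
        base, _, suffix = column.rpartition("_")
        if base in LOCALIZED_COLUMNS:
            if suffix == locale:
                out[base] = value            # localized value wins
            elif suffix == default_locale:
                out.setdefault(base, value)  # fall back to the default language
        else:
            out[column] = value
    return out

row = {"code": "01.1", "name_en": "Executive organs", "name_de": "Exekutivorgane"}
print(localize_row(row, "de"))  # {'code': '01.1', 'name': 'Exekutivorgane'}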

@rufuspollock
Contributor

@pwalsh @danfowler this is one to look at again.

@pwalsh
Member

pwalsh commented Nov 23, 2015

@rgrp related: my long-standing pull request, which deals with i18n in the resources themselves: #190

@rufuspollock
Contributor

@pwalsh I know - I still feel we should do metadata first then data.

@akariv
Member

akariv commented Dec 4, 2015

I agree that starting with meta-data is a good idea.

My humble suggestion is that each localizable string in datapackage.json could take two forms:

  • A simple string (for backward compatibility)
  • An object, mapping from ISO Locale codes (with or without the region specification, e.g. 'en', or 'es-ES') to their representations.
    In this object, you could have an empty key "" which denotes the 'default' representation

(For the sake of simplicity, I also think that we could limit this to only apply for the title and description fields)

For example:

...
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

@pwalsh
Member

pwalsh commented Dec 4, 2015

Since we do lots of "string or object" type patterns in the Data Package specs generally, I'm partial to the suggestion made by @akariv. However, it could get complicated real quick if someone tries to apply this liberally to any string located anywhere on the datapackage.json descriptor (think: custom data structures of heavily nested objects).

One way to counter that is to limit translatable fields explicitly, but that kind of goes against the flexibility of the family of Data Package specifications in general.

I'd suggest something that follows on from the pattern I suggest for data localisation here

Where:

  • @ becomes a special symbol in keys, denoting a translated field
  • What follows @ is a language code
  • What precedes @ is a property name, expected to match another property of the data package.

I also think that the distinction between localisation and translation is important, and would again suggest the same concept as I suggest for data, here. Note that this is not some invention: the pattern I'm suggesting is heavily influenced by my work with translation and localisation using Django, and probably is quite consistent with other web frameworks.

Example:

{
  "name": "School of Rock",
  "description": "A school, for Rock.",
  "name@he": "בית הספר לרוק",
  "description@he": "בית ספר, לרוק"
}
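
A small sketch (Python; the localized_view helper is hypothetical) of how a reader could interpret such keys, splitting on @ into a property name and a language code:

# Sketch: split "property@lang" keys into (property, lang) and resolve per locale.

def localized_view(descriptor, locale):
    """Return the descriptor with any "prop@locale" values folded into "prop"."""
    view = {k: v for k, v in descriptor.items() if "@" not in k}
    for key, value in descriptor.items():
        prop, _, lang = key.partition("@")
        if lang == locale:
            view[prop] = value
    return view

dp = {
    "name": "School of Rock",
    "description": "A school, for Rock.",
    "name@he": "בית הספר לרוק",
    "description@he": "בית ספר, לרוק",
}

print(localized_view(dp, "he")["name"])  # בית הספר לרוק
print(localized_view(dp, "fr")["name"])  # School of Rock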

@akariv
Member

akariv commented Dec 4, 2015

@pwalsh, two comments:

  • The reason I suggested we use this pattern only for the title and description fields is that having multiple translations for other fields is probably pointless (and tbh user-supplied fields can use whichever scheme they want).
  • I really like your suggestion, but don't you think that your scheme might result in a lot of clutter? For example, imagine translating a few fields to 20+ languages? JSON doesn't have any inherent ordering of object keys, which could make things quite messy...

@pwalsh
Member

pwalsh commented Dec 5, 2015

@akariv

On the first point, user-specified fields on Data Package are part of the design of the spec, and with the way the family of specs works, I do think it would be unusual to explicitly say only specific fields are translatable.

On the second point: yes, it would result in a lot of clutter. I guess we have to decide if we are optimising for human reading of the spec too. An alternative approach would be to group everything by language, which would at least be an ordered type of clutter :).

{
  "translations": {
    "he": { ..  all translated properties ...},
    ... etc ...
  }
}

@akariv
Member

akariv commented Dec 5, 2015

(What I meant was not that only these two fields are translatable, but that only for them does the spec specify a method for translating - other user-specified fields may use a different scheme - although on second thought that may not be the best practice.)

As for readability - I think that is definitely a factor (as someone said: "JSON is readable as simple text making it amenable to management and processing using simple text tools")

And your suggestion does improve things in terms of clutter, but it somehow doesn't feel right to me to separate the original value from the translation.

@pwalsh
Member

pwalsh commented Dec 5, 2015

@akariv yes, it is not a simple problem to solve. Maybe we should be optimising for cases with a handful of translations - say, 2-5 languages - and acknowledging that we can likewise expect, say, 2-5 translatable properties on a given package?

@rufuspollock
Contributor

rufuspollock commented Dec 1, 2016

So, I've thought quite a bit about this and I generally agree with @akariv's approach:

"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

I've updated the main description of the issue with a relatively full spec based on this.

Welcome comments from @frictionlessdata/specs-working-group


@rufuspollock rufuspollock added this to the Version-1 milestone Dec 1, 2016
@rufuspollock rufuspollock changed the title from "I18N for Data Package" to "I18N and Metadata Translations for Data Package" Dec 11, 2016
@pwalsh
Member

pwalsh commented Dec 11, 2016

@rufuspollock agreed.

In my opinion, we do need lang or languages as well as the actual handling of translations for properties. See the pattern described here

I prefer the array and the special treatment of the first element in the array, as per my pattern. Another approach, like in Django for example, is LANGUAGE_CODE for the default lang and an additional LANGUAGES array for the supported translations. But I'm not convinced of the need for two different properties.
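
For illustration only, a strawman sketch of that single-property variant, where the first element of a languages array is treated as the default (the property name and the rule are assumptions here, not spec):

# Strawman sketch: one "languages" array, first element is the default language.

descriptor = {
    "name": "israel-budget-2015",
    "languages": ["en", "he", "es"],   # "en" is the default/source language
    "title": "Israel's 2015 budget",
    "title@he": "תקציב לשנת 2015",
}

default_language = descriptor["languages"][0]
supported = set(descriptor["languages"])
print(default_language, sorted(supported))  # en ['en', 'es', 'he']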

@pwalsh
Member

pwalsh commented Feb 5, 2017

@rufuspollock let's schedule this for v1.1 - there are lots of changes for v1 and they should settle before we introduce translations, esp. as the proposal here uses the dynamic type pattern we moved away from in v1.

@pwalsh pwalsh modified the milestones: v1.1, v1.0 Feb 5, 2017
@rufuspollock
Contributor

@pwalsh agreed.

@ppKrauss

Hi, no news here (only later, in v1.1)?


If "real life example" is useful to this discussion ... My approach (while no v1.1) at datasets-br/state-codes's datapackage.json, was to add lang descriptor and lang-suffix differentiator. The lang at source level, as default for all fields.

Hum... the interpretation was "language of the descriptions (and also of the CSV textual contents)".

If some field or descriptor needs to use another language, I use a -{lang} suffix. In the example we used title as the default (en) and title-pt for the Portuguese title.
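
A tiny sketch of that resolution (Python; the values and the get_property helper are illustrative only, not copied from the real datapackage.json):

# Sketch: "lang" gives the default; "title-pt" style suffixes override per language.

descriptor = {
    "lang": "en",
    "title": "Brazilian state codes",            # illustrative default-language value
    "title-pt": "Códigos dos estados do Brasil", # illustrative Portuguese value
}

def get_property(descriptor, prop, locale):
    """Prefer "prop-{locale}", fall back to the plain (default-language) property."""
    return descriptor.get(f"{prop}-{locale}", descriptor[prop])

print(get_property(descriptor, "title", "pt"))  # Códigos dos estados do Brasil
print(get_property(descriptor, "title", "en"))  # Brazilian state codes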

@roll roll removed this from the v1.1 milestone Apr 14, 2023
roll added a commit that referenced this issue Jun 26, 2024
@roll roll added this to the v2.1 milestone Jun 26, 2024
@frictionlessdata frictionlessdata locked and limited conversation to collaborators Oct 21, 2024
@roll roll converted this issue into discussion #991 Oct 21, 2024
@roll roll removed this from the v2.1 milestone Oct 22, 2024
