Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improvement: Context #1281

Closed
wants to merge 8 commits into from
Closed

improvement: Context #1281

wants to merge 8 commits into from

Conversation

nmorel
Copy link

@nmorel nmorel commented Sep 14, 2022

Hello 👋

@JSteunou and I needed the context to work for our company so we tried to fix the different issues we encountered with the current implementation.

We tried to follow the same "schema" for the internal structure and limit the changes.
It works (closes #1198) but it has some flaws.

For an example of the structure, for this file, it gives the following on extraction :

{
  "Component A": {
    "extractedComments": [],
    "origin": [["collect/componentA/componentA.js", 1]]
  },
  "Hello World": {
    "extractedComments": ["Comment A", "Comment A again"],
    "origin": [
      ["collect/componentA/componentA.js", 2],
      ["collect/componentA/componentA.js", 3]
    ]
  },
  "world ctxt": {
    "Hello World": {
      "extractedComments": [],
      "origin": [["collect/componentA/componentA.js", 4]]
    }
  }
}

With this, we no longer have a flat structure and context add another level.
We now have at the same level a key that can represent a message id or a context.
At various location, I added a test to know which one I'm manipulating.
If a user needs to translate the words origin or translation with a context, it breaks.
This structure also add lots of complexity.

Before we fix the remaining issues, what do you think ?
Should we change something ?

Some solutions we thought about but wanted your impressions first :

  • change the key to add the context like described here
  • add some default context in the structure :
{
  "_": {
    "Component A": {
      "extractedComments": [],
      "origin": [["collect/componentA/componentA.js", 1]]
    },
    "Hello World": {
      "extractedComments": ["Comment A", "Comment A again"],
      "origin": [
        ["collect/componentA/componentA.js", 2],
        ["collect/componentA/componentA.js", 3]
      ]
    }
  },
  "world ctxt": {
    "Hello World": {
      "extractedComments": [],
      "origin": [["collect/componentA/componentA.js", 4]]
    }
  }
}

For a complete example, I updated one of the repository example (the ^3.14.1 version references a local version I built).

@vercel
Copy link

vercel bot commented Sep 14, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
js-lingui ✅ Ready (Inspect) Visit Preview Sep 14, 2022 at 4:39PM (UTC)

@nmorel nmorel changed the title Implement context improvement : Context Sep 14, 2022
@nmorel nmorel changed the title improvement : Context improvement: Context Sep 14, 2022
@JSteunou
Copy link
Contributor

As @nmorel said, it fixes the primary issue without changing the logic & structure already in place.

I think it would be better to focus on this here, and if some changes should be made, tailor this in another PR.

@nmorel
Copy link
Author

nmorel commented Sep 14, 2022

One thing I forgot to add, I'm not used to ramda 😅
So there is a few places where I used vanilla JS instead.

@semoal
Copy link
Contributor

semoal commented Sep 20, 2022

Will review this afternoon guys, thanks for the hard work!

@timofei-iatsenko
Copy link
Collaborator

@semoal is something in that PR what stops it from being merged?

@andrii-bodnar
Copy link
Contributor

@thekip could you please review this PR? 🙂

@nmorel It would be great if you could rebase the branch as it currently has conflicts that need to be resolved, sorry for the inconvenience

@nmorel
Copy link
Author

nmorel commented Jan 6, 2023

@andrii-bodnar I prefer to get feedback on the PR first before putting any more work on current branch.
The fix isn't perfect and I still have some unanswered questions.

@timofei-iatsenko
Copy link
Collaborator

@andrii-bodnar @nmorel I reviewed the code in general it looks good to me.

Here are my thoughts (doesn't mean you have to do this, just want to rise an opinion):

  1. It's probably would be better to add in the docs more real-life examples. Right now, docs for context would be good only for people how already know what context is and would look for that in the docs.

Example from other i18n project docs which I've found helpful:

The same text elements with different meanings are extracted with different IDs. For example, if the word "right" uses the following two definitions in two different locations, the word is translated differently and merged back into the application as different translation entries.

correct as in "you are right"
direction as in "turn right"
If the same text elements meet the following conditions, the text elements are extracted only once and use the same ID.

  1. Relaying on the field origin and translation feels a bit dirty and fragile. I like option with default context in internal structures. So lookup would always be as catalog[context || DEFAULT_CONTEXT][messageId]. This also would be nicer in the code as you will not have to copying these checks everywhere (message.hasOwnProperty('origin') || message.hasOwnProperty('translation') || message.context)

But actually i want to rise a bigger discussion here, read at the end of the comment.

  1. How context would be serialized in formats other than .po? I didn't check the code in action, unfortunately. Would it bring breaking changes in already existing catalogs?

Discussion:

Implementation offered in this PR is ok just as quick workaround for such essential feature as context. And this is fine.
However, I think better implementation is needed here, but it would require bigger changes in sourcecode and how the lib would be used.

Part of this was already discussed here, I want to summarize.

Current lingui implementation has a flaw: you have your original message in the bundle at least 2 times + 1 translation.
For a line "Hello world" you would have one in the source code as id in i18n call. Another one as key in the message catalog and then a translation itself. Strings could be very long, not just a couple lines, so this may bring more kb to the bundle.

This PR would also add even more kilobytes because for each msg which should have a context you would have context explicitly written in the bundle + in the msg catalog.

A much better option would be generating a "stable" ID based on the msg + context as hash with fixed length.
So internal structure would look like

{
 [<hash>]: {..}
}

And lookups in runtime would be much simpler.

Hash would be calculated at build time by macros + compile time for catalogs. So macros instead of

const message = t({
   context: 'My context',
   message: `Hello`
})

// ↓ ↓ ↓ ↓ ↓ ↓

import { i18n } from "@lingui/core"
const message = i18n._(/*i18n*/{
   context: 'My context',
   id: `Hello`
})

Would generate:

import { i18n } from "@lingui/core"
const message = i18n._(/*i18n*/{
   id: "<hash(message + context)>",
   message: `Hello`, // only for development purpose, would be dropped in production
})

Compile catalog command would also calculate hashes from msg + context on the fly.
So context would affect only how msg id generates and will not have it's place in the bundle.

However, there are amounts of users who use a lib without a macros. I don't know who they are and what were the reasons behind this decision. I think if you don't use a macro, you lose everything what making lingui awesome. Without a macro there are better alternatives with more stars and bigger community. So I think we should concentrate on one thing and do it well instead of supporting multiple different usages.

This optional status of the macro and necessity of implementing all existing features in 2 modes adds additional complexity to the project and prevents the possibility of implementing some really cool features because we always should think about "runtime" users.

Anyway, it would be great to know how many of users don't use a macro and how painful would be dropping support of the non-macro usage.

@andrii-bodnar what do you think? We can discuss dropping non-macro usage separately somewhere. Depending on this, we can later decide a what to do with this PR.

PS this proposal with build-time id generation still can be used without dropping non-macro users, but it would require 2 runtimes, 2 sets of tests and confusing setup options, that's why I want to avoid it. And avoid generating hashes at runtime as well.

@timofei-iatsenko
Copy link
Collaborator

I read test snapshots carefully one more time and the following came up. If we bring "a default" context as was proposed, this brings "nested" structures to these catalog formats:

Lingui (Raw) and Simple JSON catalog formats
https://lingui.js.org/ref/catalog-formats.html#json.

If we do that, we also need to provide an upgrade path for the users. This should be relatively simple.
However, if nested structure would be consumed by some translation tools, such as crowdin, it's probably would be treated as one entry, so it may break a flow for some.

So I would say context feature not really compatible with supported JSON formats.

Also generating a "hashed" id as proposed by me would definitely break simple JSON format flow, because key would no longer be an original string.

That's another questions to discuss.

@JSteunou
Copy link
Contributor

JSteunou commented Jan 9, 2023

Very well summarized on the structure. This PR was careful to not bring any breaking changes and only use what was previously attempted (but not finished) on context support. That's why there is no default context that would change the all structure to add one level in depth for everyone, not only context users.

I totally agree that a better way to handle the internal structure (and json format) would be with key as hash. So instead of having 2 breaking changes in a row, I think it would be better to skip the default context improvement and immediately jump on all flat with key as hash.

@nmorel
Copy link
Author

nmorel commented Jan 9, 2023

For the default context, it's possible I think to introduce it without breaking change.
It will only be present in the internal code. When we write to a file (json/js/po..), the default context can be removed.

@timofei-iatsenko
Copy link
Collaborator

@nmorel could you elaborate on this more?

@nmorel
Copy link
Author

nmorel commented Jan 9, 2023

If I take examples from here, we will have this with the current PR :

  • po
#: src/App.js:3
#  Comment for translators
msgid "messageId"
msgstr "Translated Message"

#: src/App.js:4
#  Comment for translators
msgctxt "messageCtxt"
msgid "messageId"
msgstr "Translated Message with ctxt"
  • json
{
  "messageId": "Translated Message",
  "messageCtxt": {
    "messageId": "Translated Message with ctxt"
  }
}
  • raw
{
  "messageId": {
    "translation": "Translated message",
    "defaults": "Default message",
    "description": "Comment for translators",
    "origin": [["src/App.js", 3]]
  },
  "messageCtxt": {
    "messageId": {
      "translation": "Translated Message with ctxt",
      "defaults": "Default message",
      "description": "Comment for translators",
      "origin": [["src/App.js", 4]]
    },
  }
}

We can keep this format and only introduce the default context internally with a structure like this :

{
  "<user context or default context>": {
    "<message id>": {
      "translation": "<message translation>",
      "defaults": "<default message>",
      "description": "<comments>",
      "origin": [<origin1>, <origin2>]
    }
  }
}

When parsing po files, if there is no context, we set the default context to the message.
When parsing json files, if the value is a string, we set the default context to the message.
Raw files is the hardest to parse. We could test origin/translation fields like the PR to assign the default context or introduce a special property to context to recognize them easily :

{
  "messageCtxt": {
    "__isContext__": true,
    "messageId": {
      "translation": "Translated Message with ctxt",
      "defaults": "Default message",
      "description": "Comment for translators",
      "origin": [["src/App.js", 4]]
    },
  }
}

When converting the internal structure to po/json/raw files, it won't be too hard to remove the default context.

With this, we keep the format intact (except for message with context) and in lingui code, the structure is the same.
We no longer need to test if the object we manipulate is a message or a context containing messages.

But if you want to introduce the hash solution soon, is it really necessary ? 🙂

@andrii-bodnar
Copy link
Contributor

@nmorel @thekip thanks a lot for your involvement here!

I'm still learning the Lingui and it's a bit hard to make decisions about the concerns above. @Martin005 maybe you will have some thoughts?

Anyway, we'd like to introduce this feature in the nearest release. It's important to keep it backward compatible and not to provide any breaking changes for now.

Could you please guys briefly summarize what is currently blocking this PR?

And a small general question - what's the difference between comment and context here? For me, it has quite a similar purpose - providing some additional information for translation about the string

@timofei-iatsenko
Copy link
Collaborator

timofei-iatsenko commented Jan 10, 2023

@andrii-bodnar the problem here if we release current implementation we would be forced to maintain it as it is right now. But the current approach has many flaws and will limit us in the implementing of new features.

So instead of making a release right now and returning to this question again and again in future we propose to implement it properly from the begining and probably ship it as version 4.

And a small general question - what's the difference between comment and context here? For me, it has quite a similar purpose - providing some additional information for translation about the string

This is a common for i18n process to have both. They might be called differently, f ex. meaning / description or context / comment as in Lingui. The purpose of context is provide few different translation for one source message:

correct as in "you are right"
direction as in "turn right"

Unlike context, comment don't affect message id generation, and used just as an additional information for the translator.
For example context could be "direction" where comment could be "used on the header in metablock, please keep it 20 symbols length max"

@andrii-bodnar andrii-bodnar marked this pull request as draft January 27, 2023 10:23
@andrii-bodnar andrii-bodnar linked an issue Jan 27, 2023 that may be closed by this pull request
@JSteunou
Copy link
Contributor

JSteunou commented Feb 6, 2023

Just to throw another stone here: it seems better to do things right, with a v4, if the v4 is really coming soon. But if it is coming soon, then there is no worries about supporting the v3, so the argument about being forced to maintain it turn void. So ship it quick and iterate would be better and non blocking.

On the contrary if the v4 is not coming soon, then it is also better to ship it now in v3 (maybe with the changes @nmorel explain above).

TLDR; in both cases I think it is better for users to ship it now.

@timofei-iatsenko
Copy link
Collaborator

Suppressed by #1440

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: Context + hash based message id Message context (msgctxt) not working
5 participants