Normalize Markdown in .pot files #19

Closed
mgeisler opened this issue Mar 10, 2023 · 20 comments

@mgeisler
Collaborator

When mdbook-xgettext extracts translatable text, it would be great if it could normalize the strings. This would make it possible for us to reformat the entire course without fearing that the translations get destroyed while doing so.

The normalization would take Markdown like this

# This is a heading

This is another heading
=======================

A _little_
paragraph.

```rust,editable
fn main() {
    println!("Hello world!");
}
```

* First
* Second

and turn it into these messages in the .pot file:

  • "This is a heading" (atx heading is stripped)
  • "This is another heading" (setext heading is stripped)
  • "A _little_ paragraph." (soft-wrapped lines are unfolded)
  • "fn main() {\n    println!("Hello world!");\n}" (the info string is stripped; we should instead record it in a #, flag)
  • "First" (bullet point extracted individually)
  • "Second"

Like in google/comprehensive-rust#318, we should do this in a step-by-step fashion and make sure to apply the transformations to the existing translations. It would also be good if we have a way to let translators update their not-yet-submitted translations.
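The normalization described above can be sketched with a few lines of Rust. This is only an illustration, not how mdbook-xgettext is implemented: the function names `extract_messages` and `flush` are hypothetical, and the sketch handles only atx headings, single-line `*`/`-` bullets, and plain paragraphs. Setext headings, code blocks, and multi-line list items are deliberately left out.

```rust
/// Extract translatable messages from a small subset of Markdown.
/// Assumption: only atx headings, single-line bullets, and paragraphs occur.
fn extract_messages(markdown: &str) -> Vec<String> {
    let mut messages = Vec::new();
    let mut paragraph: Vec<&str> = Vec::new();

    for line in markdown.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            // A blank line ends the current paragraph.
            flush(&mut paragraph, &mut messages);
        } else if let Some(heading) = trimmed.strip_prefix('#') {
            // Strip the remaining `#` markers and keep only the heading text.
            flush(&mut paragraph, &mut messages);
            messages.push(heading.trim_start_matches('#').trim().to_string());
        } else if let Some(item) = trimmed
            .strip_prefix("* ")
            .or_else(|| trimmed.strip_prefix("- "))
        {
            // Each bullet point becomes its own message.
            flush(&mut paragraph, &mut messages);
            messages.push(item.trim().to_string());
        } else {
            // Soft-wrapped paragraph line: collect it for unfolding.
            paragraph.push(trimmed);
        }
    }
    flush(&mut paragraph, &mut messages);
    messages
}

/// Join the collected lines with single spaces and emit them as one message.
fn flush(paragraph: &mut Vec<&str>, messages: &mut Vec<String>) {
    if !paragraph.is_empty() {
        messages.push(paragraph.join(" "));
        paragraph.clear();
    }
}
```

Calling `extract_messages` on the atx heading, the wrapped paragraph, and the two bullets from the example yields four messages: "This is a heading", "A _little_ paragraph.", "First", and "Second".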

@mgeisler
Collaborator Author

When writing this, it occurred to me that we can get to the end-state of nicely formatted Markdown by running a formatter on

  • The .md files
  • The msgid fields in all .po files
  • The msgstr fields in all .po files

If we do this in lock-step, we ought to end up with a lossless process. I was experimenting with this over the weekend: the msgfilter program can process the msgstr fields, but we need our own little helper to do the same with the msgid fields.

I was using https://dprint.dev/ for the formatting since it seems fast and extensible. It can format code blocks inside Markdown which is something I really want 😄

@jooyunghan

Hey Martin,
While updating Korean last week, I found a few missing statements in msgstr entries. Is it from any recent updates of xgettext?

@mgeisler
Collaborator Author

While updating Korean last week, I found a few missing statements in msgstr entries. Is it from any recent updates of xgettext?

A few entries were lost and changed as part of google/comprehensive-rust#449. @djmitche would know the details 😄

@djmitche
Collaborator

Yeah, that was a somewhat lossy process. Hopefully those appear in fuzzy or old messages?

@djmitche
Collaborator

@mgeisler I'm not sure what you mean about the formatter. Can you describe that in more detail?

@mgeisler
Collaborator Author

Can you describe that in more detail?

Sure! My thinking is that we can safely format the Markdown files if we know that it won't create more/fewer entries in the PO files when mdbook-xgettext is executed on the formatted files.

To avoid losing translations, we'll then run the same formatting on all msgid fields.

We can already safely run the formatter on the msgstr entries today with msgfilter and that might be something we should encourage translators to do to get smaller diffs.
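The lock-step idea can be sketched as follows. This is a hedged illustration only: the `Entry` struct, `format_in_lockstep`, and the stand-in formatter `strip_trailing_whitespace` are hypothetical names, and a real setup would call an actual Markdown formatter instead. The point is that the same formatter is applied to both fields of every entry, so the number of entries cannot change.

```rust
/// A simplified PO entry (assumption: only the two fields discussed here).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    msgid: String,
    msgstr: String,
}

/// Apply the same formatter to the msgid and msgstr of every entry.
/// The mapping is one-to-one, so no entries are created or lost.
fn format_in_lockstep(entries: &[Entry], formatter: impl Fn(&str) -> String) -> Vec<Entry> {
    entries
        .iter()
        .map(|entry| Entry {
            msgid: formatter(&entry.msgid),
            msgstr: formatter(&entry.msgstr),
        })
        .collect()
}

/// Stand-in formatter: removes trailing whitespace from each line.
fn strip_trailing_whitespace(text: &str) -> String {
    text.lines().map(str::trim_end).collect::<Vec<_>>().join("\n")
}
```

As long as the formatter applied to the .md files produces the same text as the one applied to the msgid fields, the existing msgstr translations still match after reformatting.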

mgeisler transferred this issue from google/comprehensive-rust Apr 5, 2023
@djmitche
Collaborator

djmitche commented Apr 5, 2023

I don't think I'm going to get a chance to work on this issue soon.

@simonsan

simonsan commented Apr 6, 2023

I was using https://dprint.dev/ for the formatting since it seems fast and extensible. It can format code blocks inside Markdown which is something I really want 😄

https://rust-lang.github.io/mdBook/format/mdbook.html?highlight=code#inserting-runnable-rust-files

I'm still wondering whether it would be better to use rustfmt in that case and refactor the Markdown files to embed the code from .rs files. This would make it possible to have a crate next to the documentation/tutorial and to check, fix, format, and test it. I need to think about further implications, but it would at least separate the code from the text, with its own advantages and disadvantages. That could make the text easier to translate, because right now I see the code blocks showing up in the translations as well.

@djmitche
Collaborator

djmitche commented Apr 6, 2023

I see the appeal, but I'm also wary of making it too easy to add too much code to a slide. I think a wide-open .rs file in a text editor invites adding extra lines.

@simonsan

simonsan commented Apr 9, 2023

Yesterday, I started translating rust-unofficial/patterns to German with cloud-translate, and I think the initial state could be much better with Markdown normalization. I often end up manually going over the entries because \r and \n mess up the English grammar for auto-translation. I think this feature would be really nice to have and would make adoption much easier! Especially if you have set the max-line-length to 80 ... 😅

mgeisler self-assigned this Apr 9, 2023
@mgeisler
Collaborator Author

mgeisler commented Apr 9, 2023

I don't think I'm going to get a chance to work on this issue soon.

That's alright, I'll look at it in the background and see what I can come up with.

@simonsan

@mgeisler I'm even hesitant, do you think I should wait a bit with rust-unofficial/patterns#359 and reinitialize the de.po-file when this feature is implemented?

So far I have translated 30% of the book, which has already taken me a few hours. It feels a bit inefficient and super fiddly to just move two words in a sentence into the right place (due to \r \n), because the rest is somewhat fine.

@mgeisler
Collaborator Author

@mgeisler I'm even hesitant, do you think I should wait a bit with rust-unofficial/patterns#359 and reinitialize the de.po-file when this feature is implemented?

I would not wait for this feature. My plan is to make it easy to move to a new version while ensuring that the existing translations continue to work. People (like yourself!) have put a lot of work into translations and we need to preserve this.

Concretely, I'm planning on providing a small program which people can run on the existing translations to normalize the Markdown found in them. So if the messages.pot file or a xx.po file contains

#: src/welcome.md:10
msgid ""
"* Give you a comprehensive understanding of the Rust syntax and language.\n"
"* Enable you to modify existing programs and write new programs in Rust.\n"
"* Show you common Rust idioms."
msgstr ""

Then the tool will turn the msgid field into three, one for each bullet point:

#: src/welcome.md:10
msgid "Give you a comprehensive understanding of the Rust syntax and language."
msgstr ""

#: src/welcome.md:11
msgid "Enable you to modify existing programs and write new programs in Rust."
msgstr ""

#: src/welcome.md:12
msgid "Show you common Rust idioms."
msgstr ""

The msgstr field will be split the same way, but in this case it was empty. You should upgrade to a new version of mdbook-i18n-helpers and run this program on the existing translations in a single step. Translators can run the same program on their in-progress work.
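A minimal sketch of this splitting step, under stated assumptions: `split_bullets` and `split_entry` are hypothetical helpers, not the actual mdbook-i18n-helpers code, and they handle only single-line `*` bullets. Returning `None` when the translation has a different number of bullets is my own assumption; how the real tool treats such mismatches is not described here.

```rust
/// Split a `*`-bulleted message into one string per bullet.
/// Lines that do not start with "* " are ignored in this sketch.
fn split_bullets(text: &str) -> Vec<String> {
    text.lines()
        .filter_map(|line| line.trim().strip_prefix("* "))
        .map(|item| item.trim().to_string())
        .collect()
}

/// Split a (msgid, msgstr) pair into one pair per bullet point.
/// An empty msgstr stays empty for every new entry. Returns `None` if the
/// translation does not have the same number of bullets (an assumption).
fn split_entry(msgid: &str, msgstr: &str) -> Option<Vec<(String, String)>> {
    let ids = split_bullets(msgid);
    if msgstr.is_empty() {
        return Some(ids.into_iter().map(|id| (id, String::new())).collect());
    }
    let translations = split_bullets(msgstr);
    if ids.len() != translations.len() {
        return None;
    }
    Some(ids.into_iter().zip(translations).collect())
}
```

For the three-bullet msgid from src/welcome.md with an empty msgstr, `split_entry` returns three pairs, each with an empty translation.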

it feels a bit inefficient and super fiddly to just move two words in a sentence in the right place (due to \r \n), because the rest is somewhat fine.

I'm not sure why you have to fiddle with individual words like this? Markdown doesn't care about single newlines, so you can break your paragraphs any way you like in the msgstr fields. So your de.po file can look like this if you want:

msgid "Hallo, I am a little text."
msgstr ""
"Hallo,\n"
"ich bin\n"
"ein kleiner\n"
"Text."

This will result in intermediate Markdown looking like

Hallo,
ich bin
ein kleiner
Text.

and this in turn renders into HTML exactly the same way as if the Markdown had been

Hallo, ich bin ein kleiner Text.

A different way to put this is that the translators are building up a full Markdown document, but doing it paragraph by paragraph. This implies that the translators must obey the Markdown formatting rules. If you mis-translate the above to

msgid "Hallo, I am a little text."
msgstr ""
"Hallo,\n\n"
"ich bin\n"
"ein kleiner\n"
"Text."

Then mdbook sees this Markdown

Hallo,

ich bin
ein kleiner
Text.

Now you have two paragraphs in the final book because of how Markdown uses empty lines to separate paragraphs:

<p>Hallo,</p>

<p>ich bin
ein kleiner
Text.</p>

Does that help to explain things?
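The paragraph rule can be illustrated with a tiny Rust sketch. It is not a Markdown renderer: the hypothetical `paragraphs` function only splits on completely empty lines (it would miss a line containing just spaces) and wraps each chunk in a `<p>` element, which is enough to show why `\n` is harmless in a msgstr while `\n\n` is not.

```rust
/// Split text into paragraphs at empty lines and wrap each one in <p>...</p>.
/// Single newlines stay inside the paragraph, just as in Markdown.
fn paragraphs(markdown: &str) -> Vec<String> {
    markdown
        .split("\n\n")
        .map(str::trim)
        .filter(|chunk| !chunk.is_empty())
        .map(|chunk| format!("<p>{}</p>", chunk))
        .collect()
}
```

With the msgstr "Hallo,\nich bin\nein kleiner\nText." this gives one paragraph, while "Hallo,\n\nich bin\nein kleiner\nText." gives two.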

@simonsan

simonsan commented Apr 10, 2023

I'm not sure why you have to fiddle with individual words like this? Markdown doesn't care about single newlines, so you can break your paragraphs any way you like in the msgstr fields.

Ah, maybe I wrote it in a confusing way. Because of the line breaks, the text was sent to cloud-translate exactly as it was, and it didn't translate well because the line breaks broke the grammar of the sentence. Maybe it could be an issue with cloud-translate then?

This sentence is a good one as it usually\n
fits into a single line.

This would get translated by cloud-translate to something like this (roughly):

Dieser Satz ist ein guter, weil er normalerweise\n
passt in eine einzige Zeile.

It wouldn't factor in the second line of the translation as it only translates line by line, it seems?

So my assumption is, that if I would restart and regenerate the translation process, and send everything again via cloud-translate after formatting it would translate it better than now, ending up in less work overall.

@mgeisler
Collaborator Author

It wouldn't factor in the second line of the translation as it only translates line by line, it seems?

Aha... thanks for explaining, now I understand what you mean!

I don't actually know how or if the line break influences the translations done with cloud-translate... in any case, I think removing the newlines would be something best left for that project.

So my assumption is, that if I would restart and regenerate the translation process, and send everything again via cloud-translate after formatting it would translate it better than now, ending up in less work overall.

From speaking to translators of the Rust course, I think people have had very mixed experiences with using cloud-translate: it gets a lot of things wrong because of the specialized context. Perhaps it works better for larger books, I'm not sure.

Feel free to open an issue about this over in https://github.com/mgeisler/cloud-translate — we could perhaps have the tool strip out things like \n and other formatting characters.

mgeisler added a commit that referenced this issue May 1, 2023
Before, we would extract text based on the byte offsets in the
original document. As a consequence of this, the extracted text would
look precisely like the original: the Markdown was copied directly
from the original. In particular, text from a block quote would
contain the leading ‘>’ characters and paragraphs in list items would
contain leading whitespace.

Now, we instead extract text by grouping the Markdown parse events
into those which should be translated and those which should be skipped.
We use this in two ways:

- When extracting messages in ‘mdbook-xgettext’, we turn the
  translatable events back into Markdown. The structure of the
  document (headings, lists, block quotes, …) is no longer present in
  the extracted messages: only the text content itself is extracted.

- When translating, we replace the sequence of translatable events
  with the events from the translation. We do this while leaving the
  structure of the document unchanged.

The result of this is a much more robust system: editing one list item
no longer impacts adjacent list items, moving a paragraph into a block
quote no longer changes the paragraph.

As a side effect of how we turn events into messages, links are now
all expanded. This makes the messages larger, but it removes a common
source of errors where ‘[foo][1]’ would end up pointing to the wrong
location if the reference link was updated.

Part of #19.
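The grouping of parse events described in this commit message can be sketched roughly as below. The `Event` and `Group` enums are simplified, hypothetical stand-ins for the real pulldown-cmark events, and treating only `Text` as translatable is an assumption made for illustration; the actual classification in mdbook-i18n-helpers is not spelled out here.

```rust
/// Simplified stand-in for Markdown parse events (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartParagraph,
    EndParagraph,
    StartItem,
    EndItem,
    Text(String),
}

/// A run of consecutive events that is either translated or left alone.
#[derive(Debug, PartialEq)]
enum Group {
    Translate(Vec<Event>),
    Skip(Vec<Event>),
}

/// Group consecutive events by whether they carry translatable text.
fn group_events(events: &[Event]) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    for event in events {
        // Assumption: only text is translatable; structure is skipped.
        let translatable = matches!(event, Event::Text(_));
        let extends_last = match groups.last() {
            Some(Group::Translate(_)) => translatable,
            Some(Group::Skip(_)) => !translatable,
            None => false,
        };
        if !extends_last {
            // Start a new run of the right kind.
            groups.push(if translatable {
                Group::Translate(Vec::new())
            } else {
                Group::Skip(Vec::new())
            });
        }
        if let Some(Group::Translate(run)) | Some(Group::Skip(run)) = groups.last_mut() {
            run.push(event.clone());
        }
    }
    groups
}
```

For a list item containing one paragraph, this produces a `Skip` group for the opening structure, a `Translate` group for the text, and a `Skip` group for the closing structure. Replacing only the `Translate` groups during translation leaves the document structure unchanged.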
mgeisler added a commit that referenced this issue May 1, 2023
This change makes the extracted messages ignore any wrapping done for
readability of the Markdown source. So

    This is a
    paragraph.

and

    This is a paragraph.

now becomes the same message in the PO file. This makes it possible
for people to freely reformat the source files, without having to
worry about invalidating existing translations.

Part of #19.
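A minimal sketch of the unfolding, assuming a hypothetical helper named `unfold`: it collapses every run of whitespace, including newlines, into a single space. That is only appropriate for paragraph text; code blocks would need their whitespace preserved.

```rust
/// Collapse soft line breaks (and any other whitespace runs) into single spaces.
/// Only meant for paragraph text, not for code blocks.
fn unfold(paragraph: &str) -> String {
    paragraph.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Both "This is a\nparagraph." and "This is a paragraph." unfold to the same string, so they map to the same message.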
mgeisler added a commit to google/comprehensive-rust that referenced this issue May 27, 2023
The dprint formatter is a flexible system which will use sandboxed
WebAssembly formatters to format our code (mostly: it calls out to
`rustfmt` for Rust code).

A particularly interesting feature is that dprint can format Rust code
blocks in the Markdown files. However, before we turn that on, we need
to have a way to normalize the Markdown text as it is extracted[1].
That is so that the word put into the translations is kept after the
reformatting.

[1]: google/mdbook-i18n-helpers#19
@mgeisler
Collaborator Author

I wanted to add a fuzz test to ensure that #25 doesn't "invent" new Markdown events. However, this is proving more difficult than I thought since the underlying pulldown-cmark-to-cmark library isn't completely round-tripping the Markdown input. See Byron/pulldown-cmark-to-cmark#55 for the discussion.

mgeisler added a commit to google/comprehensive-rust that referenced this issue May 30, 2023
mgeisler added the enhancement (New feature or request) label Jul 13, 2023
NoahDragon pushed a commit to wnghl/comprehensive-rust that referenced this issue Jul 19, 2023
@mgeisler
Collaborator Author

This was fixed with the 0.2.0 release! 🚀 Please remember to run mdbook-i18n-normalize on your PO files to get the benefits of this.

@simonsan

This was fixed with the 0.2.0 release! 🚀 Please remember to run mdbook-i18n-normalize on your PO files to get the benefits of this.

Great! Thank you for the continuous work on mdbook-i18n-helpers! <3

@mgeisler
Collaborator Author

@simonsan, thanks, you're very welcome! If you have a project which uses it, please add it to the README!

@simonsan

simonsan commented Aug 24, 2023

@simonsan, thanks, you're very welcome! If you have a project which uses it, please add it to the README!

Yes, will do. Currently, I have rust-unofficial/patterns set up to use an older version and am working more actively on another project. When I have time to set up the German translation completely, I will for sure add it to the README!
