Normalize Markdown in .pot files #19
When writing this, it occurred to me that we can get to the end-state of nicely formatted Markdown by running a formatter on
If we do this in lock-step, we ought to end up with a lossless process. I was experimenting with this over the weekend. I was using https://dprint.dev/ for the formatting since it seems fast and extensible. It can format code blocks inside Markdown, which is something I really want 😄
Hey Martin,
A few entries were lost and changed as part of google/comprehensive-rust#449; @djmitche would know the details 😄
Yeah, that was a somewhat lossy process. Hopefully those appear in fuzzy or old messages?
@mgeisler I'm not sure what you mean about the formatter. Can you describe that in more detail?
Sure! My thinking is that we can safely format the Markdown files if we know that it won't create more/fewer entries in the PO files when To avoid losing translations, we'll then run the same formatting on all We can already safely run the formatter on the
I don't think I'm going to get a chance to work on this issue soon.
https://rust-lang.github.io/mdBook/format/mdbook.html?highlight=code#inserting-runnable-rust-files I'm still thinking if it wouldn't be better to use
I see the appeal, but I'm also wary of making it too easy to add too much code to a slide. I think a wide-open
Yesterday, I started the translation of
That's alright, I'll look at it in the background and see what I can come up with.
@mgeisler I'm even hesitant, do you think I should wait a bit with rust-unofficial/patterns#359 and reinitialize the Because so far I translated 30% of the book, it took me a few hours already. And it feels a bit inefficient and super fiddly to just move two words in a sentence in the right place (due to \r \n), because the rest is somewhat fine.
I would not wait for this feature. My plan is to make it easy to move to a new version while making sure the existing translations continue to work. People (like yourself!) have put a lot of work into translations and we need to preserve this. Concretely, I'm planning on providing a small program which people can run on the existing translations to normalize the Markdown found in them. So if the existing PO file contains an entry like

    #: src/welcome.md:10
    msgid ""
    "* Give you a comprehensive understanding of the Rust syntax and language.\n"
    "* Enable you to modify existing programs and write new programs in Rust.\n"
    "* Show you common Rust idioms."
    msgstr ""

then the tool will turn it into

    #: src/welcome.md:10
    msgid "Give you a comprehensive understanding of the Rust syntax and language."
    msgstr ""
    #: src/welcome.md:11
    msgid "Enable you to modify existing programs and write new programs in Rust."
    msgstr ""
    #: src/welcome.md:12
    msgid "Show you common Rust idioms."
    msgstr ""
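The splitting step described above could be sketched roughly as follows. This is a minimal Python illustration and not the actual tool: the name `split_list_message` is hypothetical, it only handles flat `*`/`-` bullet lists, and it works on the already-unescaped msgid text rather than on a real PO file.

```python
import re

# Hypothetical helper (not part of mdbook-i18n-helpers): split an old-style
# msgid holding a whole bullet list into one message per list item.
BULLET = re.compile(r"^[*-]\s+")

def split_list_message(msgid: str) -> list[str]:
    items: list[str] = []
    for line in msgid.splitlines():
        if BULLET.match(line):
            # A new bullet starts a new message; drop the list marker.
            items.append(BULLET.sub("", line).strip())
        elif items and line.strip():
            # Continuation of a soft-wrapped item: unfold it onto the item.
            items[-1] += " " + line.strip()
        elif line.strip():
            # Text before any bullet: keep it as its own message.
            items.append(line.strip())
    return items

old_msgid = (
    "* Give you a comprehensive understanding of the Rust syntax and language.\n"
    "* Enable you to modify existing programs and write new programs in Rust.\n"
    "* Show you common Rust idioms."
)
for message in split_list_message(old_msgid):
    print(message)
```

For a migration to preserve translations, the same split would presumably have to be applied to the matching msgstr, so that each translated item stays paired with its source item.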
I'm not sure why you have to fiddle with individual words like this? Markdown doesn't care about single newlines, so you can break your paragraphs any way you like in the translation:

    msgid "Hallo, I am a little text."
    msgstr ""
    "Hallo,\n"
    "ich bin\n"
    "ein kleiner\n"
    "Text."

This will result in intermediary Markdown looking like

    Hallo,
    ich bin
    ein kleiner
    Text.

and this in turn renders into HTML exactly the same way as if the Markdown had been

    Hallo, ich bin ein kleiner Text.

A different way to put this is that the translators are building up a full Markdown document, but doing it paragraph by paragraph. This implies that the translators must obey the Markdown formatting rules. If you mis-translate the above to

    msgid "Hallo, I am a little text."
    msgstr ""
    "Hallo,\n\n"
    "ich bin\n"
    "ein kleiner\n"
    "Text."

then the intermediary Markdown becomes

    Hallo,

    ich bin
    ein kleiner
    Text.

Now you have two paragraphs in the final book because of how Markdown uses empty lines to separate paragraphs:

    <p>Hallo,</p>
    <p>ich bin
    ein kleiner
    Text.</p>

Does that help to explain things?
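The difference between a single newline and an empty line can be illustrated with a toy model. The sketch below is not a Markdown renderer: the function `paragraphs` is a hypothetical helper that only splits on blank lines and unfolds the remaining newlines, ignoring every other Markdown construct.

```python
import re

def paragraphs(markdown: str) -> list[str]:
    """Toy model: blank lines separate paragraphs, single newlines do not."""
    blocks = re.split(r"\n\s*\n", markdown.strip())
    # Inside a paragraph, a single newline behaves like a space.
    return [" ".join(block.split()) for block in blocks if block.strip()]

good = "Hallo,\nich bin\nein kleiner\nText."
bad = "Hallo,\n\nich bin\nein kleiner\nText."

print(paragraphs(good))  # one paragraph
print(paragraphs(bad))   # two paragraphs
```

Under this simplified model the first translation stays a single paragraph, while the one containing `\n\n` is split in two.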
Ah, maybe I wrote it in a confusing way. Because of the line breaks, the text was sent to cloud-translate exactly as it was wrapped, and it didn't translate well because the line breaks broke the grammar of the sentence. Maybe it could be an issue with

A sentence like

    This sentence is a good one as it usually\n
    fits into a single line.

would get translated by cloud-translate to something like this (abstract):

    Dieser Satz ist ein guter, weil er normalerweise\n
    passt in eine einzige Zeile.

It wouldn't factor in the second line of the translation, as it only translates line by line, it seems? So my assumption is that if I restart and regenerate the translation process, and send everything through cloud-translate again after formatting, it would translate better than now, ending up in less work overall.
Aha... thanks for explaining, now I understand what you mean! I don't actually know how or if the line break influences the translations done with
From speaking to translators of the Rust course, I think people have had very mixed experiences with using Feel free to open an issue about this over in https://github.com/mgeisler/cloud-translate. We could perhaps have the tool strip out things like
Before, we would extract text based on the byte offsets in the original document. As a consequence of this, the extracted text would look precisely like the original: the Markdown was copied directly from the original. In particular, text from a block quote would contain the leading ‘>’ characters and paragraphs in list items would contain leading whitespace.

Now, we instead extract text by grouping the Markdown parse events into those which should be translated and those which should be skipped. We use this in two ways:

- When extracting messages in ‘mdbook-xgettext’, we turn the translatable events back into Markdown. The structure of the document (headings, lists, block quotes, …) is no longer present in the extracted messages: only the text content itself is extracted.
- When translating, we replace the sequence of translatable events with the events from the translation. We do this while leaving the structure of the document unchanged.

The result of this is a much more robust system: editing one list item no longer impacts adjacent list items, and moving a paragraph into a block quote no longer changes the paragraph.

As a side effect of how we turn events into messages, links are now all expanded. This makes the messages larger, but it removes a common source of errors where ‘[foo][1]’ would end up pointing to the wrong location if the reference link was updated.

Part of #19.
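The event-grouping idea can be sketched with a toy model. Everything below is an assumption made for illustration: the real tool works on the events of a Markdown parser, whereas here an event is just a made-up `(kind, payload)` tuple and only `text` and `soft_break` count as translatable.

```python
from itertools import groupby

# Toy event model (an assumption for illustration): each event is a
# (kind, payload) tuple. Only these kinds carry translatable content.
TRANSLATABLE_KINDS = {"text", "soft_break"}

def is_translatable(event):
    kind, _payload = event
    return kind in TRANSLATABLE_KINDS

def group_events(events):
    """Group consecutive events into (translatable?, [events]) runs."""
    return [(flag, list(run)) for flag, run in groupby(events, key=is_translatable)]

def render(run):
    """Turn a run of translatable events back into message text."""
    return "".join(" " if kind == "soft_break" else payload for kind, payload in run)

def extract_messages(events):
    """Collect one message per translatable run; structure is skipped."""
    return [render(run) for flag, run in group_events(events) if flag]

def translate(events, catalog):
    """Swap translatable runs for their translation, keep structure as-is."""
    out = []
    for flag, run in group_events(events):
        if flag:
            msgid = render(run)
            # Fall back to the source text when no translation exists.
            out.append(("text", catalog.get(msgid, msgid)))
        else:
            out.extend(run)
    return out

events = [
    ("start_item", None), ("text", "First"), ("end_item", None),
    ("start_item", None), ("text", "Second"), ("soft_break", None),
    ("text", "item"), ("end_item", None),
]
print(extract_messages(events))
print(translate(events, {"First": "Erstens"}))
```

Because the structural events are passed through untouched, translating one list item in this model cannot affect its neighbours.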
This change makes the extracted messages ignore any wrapping done for readability of the Markdown source. So the paragraph "This is a paragraph.", soft-wrapped across lines in two different ways, now becomes the same message in the PO file. This makes it possible for people to freely reformat the source files without having to worry about invalidating existing translations. Part of #19.
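A minimal sketch of the unfolding step, assuming plain paragraph text only: the helper name `normalize_wrapping` is hypothetical, and a real implementation would have to leave code blocks and hard line breaks alone.

```python
def normalize_wrapping(paragraph: str) -> str:
    """Collapse soft-wrapped lines (and runs of whitespace) into single spaces."""
    return " ".join(paragraph.split())

# The same paragraph, wrapped at two different places in the source.
first = "This is a\nparagraph."
second = "This is\na paragraph."

print(normalize_wrapping(first))
print(normalize_wrapping(first) == normalize_wrapping(second))
```

Both wrappings reduce to the same string, so they would map to the same msgid.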
The dprint formatter is a flexible system which will use sandboxed WebAssembly formatters to format our code (mostly: it calls out to `rustfmt` for Rust code). A particularly interesting feature is that dprint can format Rust code blocks in the Markdown files.

However, before we turn that on, we need to have a way to normalize the Markdown text as it is extracted[1]. That is so that the work put into the translations is kept after the reformatting.

[1]: google/mdbook-i18n-helpers#19
I wanted to add a fuzz test to ensure that #25 doesn't "invent" new Markdown events. However, this is proving more difficult than I thought since the underlying pulldown-cmark-to-cmark library isn't completely round-tripping the Markdown input. See Byron/pulldown-cmark-to-cmark#55 for the discussion.
This was fixed with the 0.2.0 release! 🚀 Please remember to run
Great! Thank you for the continuous work on mdbook-i18n-helpers! <3
@simonsan, thanks, you're very welcome! If you have a project which uses it, please add it to the README!
Yes, will do. Currently, I set up
When mdbook-xgettext extracts translatable text, it would be great if it could normalize the strings. This would make it possible for us to reformat the entire course without fearing that the translations get destroyed while doing so.

The normalization would take Markdown like this and turn it into these messages in the .pot file:

- "This is a heading" (atx heading is stripped)
- "This is another heading" (setext heading is stripped)
- "A _little_ paragraph." (soft-wrapped lines are unfolded)
- "fn main() {\n println!("Hello world!");\n}" (info string is stripped, we should instead use a #, flag)
- "First" (bullet point extracted individually)
- "Second"

Like in google/comprehensive-rust#318, we should do this in a step-by-step fashion and make sure to apply the transformations to the existing translations. It would also be good if we have a way to let translators update their not-yet-submitted translations.
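As an illustration of the heading part of this normalization, here is a rough Python sketch. It is not how mdbook-xgettext is implemented: `heading_messages` is a hypothetical helper that uses regular expressions instead of a Markdown parser, so it ignores edge cases such as headings inside code blocks.

```python
import re

# Simplified patterns (assumptions, not a full CommonMark implementation).
ATX = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")
SETEXT_UNDERLINE = re.compile(r"^(=+|-+)\s*$")

def heading_messages(markdown: str) -> list[str]:
    """Return heading texts with the atx/setext markup stripped."""
    lines = markdown.splitlines()
    messages = []
    for i, line in enumerate(lines):
        atx = ATX.match(line)
        if atx:
            # "# Title" or "## Title ##": keep only the title text.
            messages.append(atx.group(1))
        elif SETEXT_UNDERLINE.match(line) and i > 0 and lines[i - 1].strip():
            # An underline of "=" or "-" turns the previous line into a heading.
            messages.append(lines[i - 1].strip())
    return messages

source = "# This is a heading\n\nThis is another heading\n=======================\n"
print(heading_messages(source))
```

With the markup stripped, changing a heading from setext to atx style in the source would leave the extracted message unchanged in this model.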