hreflang should support ISO 639-1, ISO 3166-1 Alpha 2, ISO 15924 formats #668
Comments
Good point, seems legit for a plugin dealing with localizations! I won't work on this for now and i don't know how much should be changed but it should definitely be feasible. I'll keep it open but anyone feel free to send a PR. |
Sorry that I am not good at programming, but I found that WP Multilang supports it. Although it officially uses 2 characters (e.g. en), I can still fill in en-us in order to follow what Google suggests. I still love qtranslate-xt, and I hope that it can meet Google's standard. Screenshot: https://ps.w.org/wp-multilang/assets/screenshot-1.png?rev=1760406 |
It should be feasible but it requires a bit of work and a lot of testing. The main part is handled via regex format such as For the new format, a very lazy way would just be to handle something like |
Hi, Just wanna know if there is any update about this enhancement, if you need any testing user, I am here :) |
Yes, i still have this in mind, i think it's an important feature to have. I often see in the code many places where the 2 character format is hard-coded so i know more or less what to change, but this is quite a deep change. I don't think it's reasonable to do it in the current codebase because it's way too much of a spaghetti code. This would happen after migrating to a new repo and refactoring most of the core part. |
Understood, hope to see it soon :) |
Sorry for my English: |
Devoleksiy, Your information is completely not related to this enhancement.
|
Yes it's a different topic. |
We also have to take into account @mikoet's remark from #836 mentioning ISO 639-2 and ISO 639-3.
See https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes For us the distinction ISO 639-2 and ISO 639-3 does not really matter because the 639-3 variant adds simply the notion of macro-language which is just one hierarchical level (we may come back to this concept later). The format for qTranslate can be summarized to 2 or 3 letter code. |
This topic is quite an interesting question, very important for qTranslate i'd say! Here we don't really care about the values but only the format that will define the pattern for the regex (regular expression).
ISO 639-1 - the only format currently supported in qTranslate-XT (!)
ISO 3166-1 alpha-2
ISO 3166-2
ISO 639-2 and 639-3
ISO 15924
hreflang according to Google
Uh!! 🤯 |
Google does not seem to encourage ISO-639-2 (3 letters language) for hreflang. But i think this is wrong. I don't know exactly how they handle this but if we follow the RFC 8288 that describes hreflang, it says:
If we look for the definition of the Language-Tag, i found this: https://tools.ietf.org/html/rfc5646#section-2.1 Which is quite complex but if we keep just the simplest part...
The concept of "shortest ISO 639 code" might be a bit difficult to grasp but this other document can explain better what it means: https://www.w3.org/International/articles/language-tags/
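To make the "simplest part" of the Language-Tag concrete, here is a minimal sketch in Python (not the plugin's PHP code). It assumes only the language, script and region subtags matter and deliberately ignores variants, extensions and private-use subtags from RFC 5646. The names `LANGUAGE_TAG` and `parse_language_tag` are hypothetical.

```python
import re

# Simplified subset of the RFC 5646 Language-Tag grammar.
# Assumption: variants, extensions and private-use subtags are ignored.
LANGUAGE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"             # shortest ISO 639 code
    r"(?:-(?P<script>[A-Za-z]{4}))?"           # optional ISO 15924 script
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"  # optional ISO 3166-1 or UN M.49 region
)


def parse_language_tag(tag):
    """Split a tag into (language, script, region); None if it does not fit."""
    match = LANGUAGE_TAG.fullmatch(tag)
    if match is None:
        return None
    return match.group("language"), match.group("script"), match.group("region")
```

With this pattern, `zh-Hant-TW` splits into language `zh`, script `Hant` and region `TW`, while a plain `en` has no script and no region.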
I feel this is the right way to specify the languages properly. But this raises some questions for qTranslate-XT.
|
These functions are not used and don't seem useful: - qtranxf_stripSlashesIfNecessary (admin_utils) - qtranxf_get_domain_language - qtranxf_isAvailableIn
An update here, this is the feature i'm working on currently. I have done quite some big steps forward to handle the edition of the new language format on the admin side, which seems to work pretty well. I refactored the language code checks entirely and generalized to a unique regex defined on the server.
This is available in this new branch: The admin part should be quite fine. |
Front-end redirections fixed. This seems to work now! |
I have a little problem with the case-sensitive checks. In the current version, the 2-letter code check is entirely case-insensitive. This means you could enter
My view is that the internal language code should be case-sensitive to match exactly the official language specifications. When we check against the URL, this particular check can remain case-insensitive, but internally the language code should be strictly case-sensitive to ensure perfect consistency between the definitions of the enabled languages and the embedded tags stored in DB. I would like to enable a strict case-sensitive check when the language codes are edited on the admin side. The problem is that some users may have 2-letter codes with upper case e.g. with |
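The split between a strict internal check and a lenient URL check could look like the sketch below. It is written in Python for illustration only and is not the plugin's code; `resolve_url_language` and `is_enabled` are hypothetical helpers, and the list of enabled codes is assumed to be already validated.

```python
from typing import List, Optional


def resolve_url_language(segment: str, enabled_codes: List[str]) -> Optional[str]:
    """Match a URL segment case-insensitively, return the canonical internal code."""
    lookup = {code.lower(): code for code in enabled_codes}
    return lookup.get(segment.lower())


def is_enabled(code: str, enabled_codes: List[str]) -> bool:
    """Internal check: strictly case-sensitive, no normalization."""
    return code in enabled_codes
```

The URL lookup always hands back the code exactly as it is stored, so everything downstream (tags in DB, language options) only ever sees one spelling.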
I also realize that adding the country code and the script raises a new need, having fallback mechanisms to a more generic language. Let's say you have now:
With the new feature, the two language codes would exist in DB. When looking for a translation, we would look for that language strictly, either tags Same would be for We can think the languages codes as a hierarchy depending on which level they are defined and what we have in database with the post content. That would require a few more changes but it's an idea for future features. |
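The hierarchy idea can be sketched as a fallback chain from the most specific code to the most generic one. This is a minimal Python illustration under the assumption that subtags are separated by `-` and that translations are available as a plain mapping from language code to text. Both function names are hypothetical.

```python
def fallback_chain(code):
    """List the code followed by its more generic parents, most specific first."""
    parts = code.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def find_translation(code, translations):
    """Return the first translation found along the fallback chain, else None."""
    for candidate in fallback_chain(code):
        if candidate in translations:
            return translations[candidate]
    return None
```

For `en-US` the chain is `en-US`, then `en`, so content stored only under `en` would still be found when `en-US` is requested.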
Mmmm what we need maybe is rather embedded sub-tags... For example we would get this with standard tags (extended):
If we handle this with two separate language codes, all the post content would be duplicated. But we could think of a smarter format like this:
This would save a lot of space in the database because the common parts are shared. And the fallback propagation here would be quite intrinsic to the format. But this is definitely more complex and requires more work to handle. I don't want to block this new feature for the extension of the language formats too long but... the problem is if i release the new full extended format it may become very hard to migrate to such format in the future... I need to think about it a bit more. For sure the 3-letter format ISO-639-2 can be released. Maybe i will do a first release for this. The current regex allows much more flexibility for the main format. |
Or maybe it should be handled like this:
Though this looks more verbose than the embedded version, it could be much easier to handle in the code. It is important to realize that these specific variations of words are a very small subset compared to the common part. The example is just to show what we really need for the regional, but in reality the main language matters more for 99% of the content. Another way to see it would be to look for those specific words explicitly. We could have a database of regional translations for specific words translated on the fly for the front-end only. In database we would just keep one language of reference. @nethubonline is that the same for "zh-TW" / "zh-CN" and "zh-Hant" / "zh-Hans"? Are these variations changing only small part relatively or does it change drastically the whole content? |
Fixes #836. Fixes partially #668. Major refactoring: language code format now handled with a unique regex. The new format allows 2 or 3-letter (ISO 639-2 and 639-3), lower case. Upper case values are only allowed for legacy codes but not for new entries. A migration of DB will be required before enforcing to lower case. URL checks remain case-insensitive (unchanged).
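As an illustration of the rule described in this commit message, here is a minimal Python sketch (the plugin itself is PHP, and its actual regex is not shown in this thread). It assumes legacy codes are the old 2-letter entries in any case. The pattern names and `is_valid_code` are hypothetical.

```python
import re

# New entries: 2 or 3 letters, lower case only.
NEW_CODE = re.compile(r"[a-z]{2,3}")
# Assumption: legacy entries are 2-letter codes that may contain upper case.
LEGACY_CODE = re.compile(r"[A-Za-z]{2}")


def is_valid_code(code, legacy=False):
    """Validate a language code, tolerating upper case only for legacy entries."""
    pattern = LEGACY_CODE if legacy else NEW_CODE
    return pattern.fullmatch(code) is not None
```

Once the DB has been migrated to lower case, the legacy branch could simply be dropped.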
So i released a 3.9.0 for the initial 3-letter support with all the related refactoring. For support of regions and scripts we can continue the discussion here or create new specific topics, i need to clarify a bit the needs before deciding for the right format in the database for the sub-parts related to regions and scripts. What i want to avoid is opting for a solution that would require a DB migration later. |
Hi herrvigg, Glad to see this big step for qtranslate-xt, thank you very much for your efforts. Hm... They can change the whole content drastically, depending on the content. Below is an example (I bolded the differences)
I agree that internally the language code should be strictly case-sensitive to ensure perfect consistency, because we just need to show the correct hreflang for SEO. FYI, not sure if it helps: please check the source of https://www.apple.com/hk/ , Apple shows quite a lot of hreflang tags.
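For reference, the kind of hreflang output discussed here can be sketched as follows. This is a Python illustration only, not how qTranslate-XT renders its head tags. `hreflang_links` is a hypothetical helper and the `example.com` URLs are placeholders.

```python
from html import escape


def hreflang_links(alternates):
    """Build one <link rel="alternate"> element per (language code, URL) pair."""
    return [
        '<link rel="alternate" hreflang="{}" href="{}" />'.format(
            escape(code, quote=True), escape(url, quote=True)
        )
        for code, url in alternates.items()
    ]


# Placeholder data: the codes keep their exact case, as stored internally.
links = hreflang_links({
    "zh-TW": "https://example.com/zh-tw/",
    "zh-CN": "https://example.com/zh-cn/",
})
```

The language code is emitted exactly as stored, which is why a consistent internal case matters for the resulting tags.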
|
Thank you, yes the hreflang example looks good, that's what we are aiming for! The country (region) or script suffix are meant to be optional but they are relevant and qTranslate should support this. In the example you gave between zh-TW and zh-CN it is actually similar to en-US vs en-GB in the sense most of the content is common and ideally should be shared. The longer the content, the more shared parts they will have for sure. If i just enable the regions and scripts as new language entries, we would have a lot of content duplication and i believe there is a better way to handle it. I could release the new format but if we change later, this may create problems of migration if the internal format changes. The question is how to find a good user interface to edit those and how to store it in DB. I still need a bit of time to reflect on this to find a good plan. |
herrvigg, Hm.....with all respect, I can confirm that no Chinese people will use the shared common content between zh-TW & zh-CN 😄 because editing the content will kill us 🤣 , and in most cases of Chinese characters, it will also take up more database space. for the same example above, we need to rewrite the original code: After: However, it may help for other languages 😄 |
Ahah fair point! Some regional variations should definitely not be merged 🤣😅 Indeed for Chinese the differences are actually more frequent and it would perhaps be more efficient to keep the contents separate. For other languages we may also give the same possibility. But in general there are two questions:
These are the main reasons we still need to think how to combine language with or without regions/scripts. |
As a complement to my point 2 and your example:
For the 3 first items, here there is an additional mapping between the URL path and hreflang. In qTranslate, i think the internal language should be the one with the region (en-GB, zh-HK, zh-TW) as shown in point 2 earlier. But we may add a new level of mapping just for the URL as a form of alias, to say "map the path For the last item this is not supported by qTranslate yet. What we have is the QTX_URL_DOMAIN or QTX_URL_DOMAINS option but it is not possible to mix different URL methods (it cannot be combined at the same time with QTX_URL_PATH). You may add some redirection rules in your web server configuration (HTTP level - not the web application) but that can create some complications. |
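The alias idea (a URL path segment mapped to an internal language code that carries the region) could be sketched like this. It is a Python illustration with made-up data: the `URL_ALIASES` entries and the `language_from_path` helper are hypothetical and are not qTranslate-XT options.

```python
# Hypothetical alias table: URL path segment -> internal language code.
URL_ALIASES = {"uk": "en-GB", "hk": "zh-HK", "tw": "zh-TW"}


def language_from_path(path, aliases, default):
    """Resolve the first path segment through the alias table (case-insensitive)."""
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return default
    return aliases.get(segments[0].lower(), default)
```

The internal code stays the regional one, and only the URL layer knows about the shorter alias.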
Yeah, it would be great feature to fallback for missing content. For point 2 URL redirections, I am not sure if it is useful for other users. If I have a site with URLs below, actually I don't need
Again, thank you for your great effort on this. |
Hybrid sequences - too hard!
I've made a few experiments and my idea of mixing hybrid sequences of tags (shared and specific parts) is going to be way too complicated! Internally, qTranslate decomposes a post content into blocks in memory and assigns each block to one language, given in the order set in the language options. Therefore, the order and the position of the original tags are lost. This is already true now if you edit the raw content and enable the LSB later. There can only be a unique part per language. Also, the classic editor in the admin section only allows editing one language at a time, even if you have the LSB to switch quickly in the same session. In Gutenberg, maybe we could have other alternatives with the Gutenberg blocks, but this would require a whole different approach and we can't really do this for now.
New feature - maybe soon!
So, for a pragmatic solution regarding the regional codes, we have to treat them as a normal language. This applies both to the storage and to the editing. In other words, i think i will enable this feature soon just by extending the regex with the one i showed previously. My main concern was the storage format, but it will remain the same. The feature will allow more flexibility but the users will have to manage the new possibilities on their side. One need that may be raised is the possibility to rename a language code internally in the whole database. For example, if you created long ago a language as
I'm still doing a few experiments and i'd like to solve the fallback question a bit better:
|
According to Google: https://support.google.com/webmasters/answer/189077
Some languages have different types, such as Chinese Traditional and Chinese Simplified. The ISO 639-1 format only describes one language code, "zh", for both Chinese languages, but we cannot fill in "zh" for both Chinese Traditional and Chinese Simplified at the same time.
According to Google's answer, we should use "zh-TW" / "zh-CN" or "zh-Hant" / "zh-Hans" instead, but the "Language Code" field in the qtranslate-XT settings is limited to 2 characters.
Could qtranslate-XT lift the 2-character limit so that we can follow what Google suggests?