[fr] Dictionary additions #10952
Walkthrough
This pull request introduces multiple updates to the French language module of the language tool. It includes new entries in the part-of-speech dictionary.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (1)
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt (1)
Line range hint 1-4826: Consider expanding coverage of emerging terms and expressions
The multiword resource is already quite comprehensive, but language is constantly evolving. Consider periodically reviewing and adding:
- New technological terms and platform names (e.g., emerging social media platforms)
- Current events and newly prominent public figures
- Evolving cultural expressions and internet slang
- Industry-specific multiword terms in growing fields (e.g., AI, renewable energy)
This suggestion is not urgent but could help maintain the resource's relevance over time. You might want to establish a process for regular reviews and updates to capture new multiword expressions as they enter common usage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/added.txt (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/ignore.txt (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt (1 hunks)
🔥 Files not summarized due to errors (1)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt: Error: Server error: no LLM provider could handle the message
🧰 Additional context used
🔇 Additional comments (16)
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/added.txt (7)
376-376: LGTM: New entry for 'Pausilippe' added correctly
The entry follows the correct format and uses appropriate tags for a proper noun.
378-378: LGTM: New entry for 'Méhémet' added correctly
The entry follows the correct format and uses appropriate tags for a proper noun.
379-379: LGTM: New entry for 'Zarafa' added correctly
The entry follows the correct format and uses appropriate tags for a proper noun.
376-383: Summary of additions to the French dictionary
Overall, the new entries enhance the French language module. However, there are a few points to address:
- Verify the noun type (N vs Z) for 'Türkiye' and 'Tanganil'.
- Double-check the spelling of 'acetylleucine'.
- Complete the entries for 'wesh' and 'Opella' by adding the missing baseforms and part-of-speech tags.
Once these issues are resolved, the additions will significantly improve the language tool's French vocabulary.
377-377: LGTM: New entry for 'Türkiye' added, but verify noun type
The entry follows the correct format. However, as this is a country name, consider verifying if it should be tagged as a proper noun (Z) instead of a common noun (N).
#!/bin/bash
# Verify if other country names in the file use 'N' or 'Z' tag
grep -E ';[NZ] [mf] s' languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/added.txt | grep -i 'country\|nation'
381-381: Consider changing noun type for 'Tanganil'
The entry follows the correct format, but Tanganil is a brand name. Consider changing the tag from common noun (N) to proper noun (Z) for consistency with other brand names in the dictionary.
#!/bin/bash
# Check how other brand names are tagged in the file
grep -E ';[NZ] [mf] s' languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/added.txt | grep -i 'brand'
380-380: LGTM: New entry for 'acetylleucine' added, but verify spelling
The entry follows the correct format. However, as this is a technical term, it's advisable to double-check the spelling. Some sources might use 'acétylleucine' in French.
✅ Verification successful
To accurately verify the spelling of 'acetylleucine', please run the following script:
Verified: Spelling of 'acetylleucine' is correct
The entry is consistent and no alternative spellings were found in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for alternative spellings of 'acetylleucine' in the codebase
rg -i 'ac[ée]tylleucine' --type txt
Length of output: 222
Script:
#!/bin/bash
# Corrected script to search for alternative spellings of 'acetylleucine' in .txt files
rg -i 'ac[ée]tylleucine' --type txt
Length of output: 279
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/ignore.txt (1)
1550-1551: Additions look good and maintain alphabetical order.
The new entries ESMO and CNGE have been added to the ignore list. These additions are correctly placed in alphabetical order within the file.
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt (3)
4823-4823: LGTM: Correct entry for Benyamin Nétanyahou
The entry for "Benyamin Nétanyahou" is correctly formatted and categorized as a proper noun (Z), masculine (m), and singular (s).
Line range hint 1-4826: LGTM: Well-structured and comprehensive multiword resource
The file maintains a consistent structure throughout, with proper formatting for each entry. The inclusion of a wide range of multiword expressions, including contemporary terms, names, and cultural references, enhances the resource's utility for modern French language processing tasks.
The overall organization and content of the file are commendable. Keep up the good work in maintaining and expanding this valuable resource.
Line range hint 1-4826: Excellent update to the French multiword resource
This update to the multiword resource file is a substantial improvement:
- Consistent formatting maintained throughout the file
- Significant expansion with relevant new entries
- Inclusion of contemporary terms, names, and cultural references
- Enhanced coverage of various types of multiword expressions
These changes will greatly benefit French language processing tasks by providing a more comprehensive and up-to-date resource.
Great job on this update! The expanded coverage and maintained consistency make this a valuable contribution to the project.
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt (5)
34723-34723: Verify the usage of "Türkiye" in French context.
While "Türkiye" is the new official name for Turkey in English, it's important to verify if this exact spelling is used in French contexts. In French, it might be more appropriate to use "Turquie" or a different adaptation.
Could you please confirm the official French usage for the country name? You may want to check official French government sources or reputable French news outlets for the most current usage.
34727-34727: Verify the spelling and capitalization of "acetylleucine" in French.
While "acetylleucine" appears to be a valid medical or chemical term, it's important to verify its exact spelling and capitalization in French scientific contexts. In some cases, such terms might be hyphenated or capitalized differently.
Could you please confirm the correct French spelling and capitalization for this term? You may want to check French medical dictionaries or pharmacological resources for the standard usage.
34729-34731: Verify the usage of "Gen Z", "Gen Y", and "Gen X" in French context.
While these generational terms are widely recognized in English, it's important to verify if they are commonly used as-is in French or if French equivalents are preferred.
Could you please confirm if these English terms are the standard usage in French contexts? Alternatively, consider using French equivalents if they exist and are more commonly used (e.g., "Génération Z", "Génération Y", "Génération X").
34732-34732: Confirm the inclusion of colloquial terms like "wesh" in the dictionary.
The term "wesh" is a colloquial French interjection, commonly used in informal or youth speech. While its inclusion reflects contemporary language use, it's important to ensure that including such informal terms aligns with the overall goals and standards of the LanguageTool French module.
Could you please confirm if the inclusion of colloquial and slang terms is intentional and consistent with the dictionary's purpose? This decision may impact the tool's behavior in formal writing contexts.
34722-34733: Review the diverse range of additions and their impact on the dictionary.
The additions to the French spelling dictionary include a wide variety of terms, ranging from historical names and medical terms to contemporary slang and generational labels. While this diversity enhances the dictionary's coverage of different linguistic domains, it also raises some considerations:
Consistency: Ensure that the inclusion criteria for different types of terms (technical, colloquial, proper nouns) are consistently applied across the dictionary.
Currency: The inclusion of contemporary terms like "wesh" and generational labels reflects an effort to keep the dictionary up-to-date. Consider implementing a regular review process to maintain this currency.
Scope: The addition of specialized terms (e.g., medical terms) and proper nouns expands the dictionary's scope. Verify that this aligns with the intended use cases of the LanguageTool French module.
Internationalization: Some terms added are not strictly French (e.g., "Türkiye", "Gen Z"). Consider clarifying the policy on including non-French terms that are used in French contexts.
Could you please provide more information about the guidelines or criteria used for selecting new additions to the French spelling dictionary? This would help ensure that these diverse additions align with the overall goals of the LanguageTool project.
acetylleucine;acetylleucine;N f s
Tanganil;Tanganil;N m s
wesh
Entry for 'wesh' is incomplete
The entry for 'wesh' is missing the baseform and part-of-speech tags. Please complete the entry following the format: fullform;baseform;postags. For example: wesh;wesh;I (assuming it's classified as an interjection).
Tanganil;Tanganil;N m s
wesh
Opella
Entry for 'Opella' is incomplete
The entry for 'Opella' is missing the baseform and part-of-speech tags. Please complete the entry following the format: fullform;baseform;postags. For example: Opella;Opella;Z f s (assuming it's a feminine singular proper noun).
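Both incomplete entries flagged above lack the second and third fields of the fullform;baseform;postags format. A check for this kind of omission could be sketched as follows. This is only an illustration: `check_entries` is a hypothetical helper name (not part of LanguageTool), and treating blank lines and lines starting with `#` as non-entries is an assumption about the file layout.

```shell
#!/bin/bash
# Hypothetical helper (not part of LanguageTool): print "line number: content"
# for every entry that does not have exactly three non-empty ';'-separated fields.
# Assumption: blank lines and lines starting with '#' are not dictionary entries.
check_entries() {
  awk -F';' '
    /^#/ || /^[[:space:]]*$/ { next }   # skip comments and blank lines
    NF != 3 || $1 == "" || $2 == "" || $3 == "" { print NR ": " $0 }
  ' "$1"
}
```

Calling `check_entries` on added.txt would list any line, such as a bare `wesh`, that still needs its baseform and tags; an empty result means every entry has three fields.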
Gen Z;N f s
Gen Y;N f s
Gen X;N f s
🛠️ Refactor suggestion
Consider revising the gender and number for generational terms
The entries for "Gen Z", "Gen Y", and "Gen X" are correctly added, but their categorization as feminine singular nouns (N f s) might be too restrictive. In French, these terms are often used without a specific gender and can be both singular and plural. Consider using a more flexible categorization, such as "N e sp" (noun, epicene, singular and plural) to allow for broader usage.
You might want to update these entries as follows:
-Gen Z;N f s
-Gen Y;N f s
-Gen X;N f s
+Gen Z;N e sp
+Gen Y;N e sp
+Gen X;N e sp
This change would make the entries more versatile and accurate for various contexts in French texts.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt (1)
34729-34732: Generational labels: Approved with suggestion
The addition of "Gen Z", "Gen Y", and "Gen X" is valuable for recognizing modern demographic references. These terms are widely used, even in French contexts. However, consider also including French equivalents or variations if they exist (e.g., "Génération Z", "Génération Y", "Génération X") to provide comprehensive coverage for both borrowed English terms and native French expressions.
Would you like me to research and suggest French equivalents for these generational terms?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/added.txt (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt (1 hunks)
🔥 Files not summarized due to errors (1)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt: Error: Server error: no LLM provider could handle the message
🚧 Files skipped from review as they are similar to previous changes (1)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/added.txt
🧰 Additional context used
🔇 Additional comments (8)
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt (4)
4823-4826: New generational terms added
The following generational terms have been added:
- Gen Z
- Gen Y
- Gen X
These are correctly formatted and tagged as "N e sp" (noun, epicene, singular/plural). This addition will improve the recognition of contemporary generational references.
4823-4829: New person names added
Several new person names have been added:
- Benyamin Nétanyahou
- Yahya Sinouar
- Yahya Sinwar
These entries are correctly formatted and tagged as "Z m s" (proper noun, masculine, singular). The inclusion of variant spellings for "Yahya Sinwar" is helpful for improved name recognition.
Line range hint 1-4829: Overall assessment of multiwords.txt changes
The additions to the multiwords.txt file are well-formatted and will enhance the French language processing capabilities of the system. The new entries include contemporary generational terms, person names, and a military term, all of which are correctly tagged and structured.
Key points:
- All new entries follow the correct format: multiword expression;category tag.
- The added terms are relevant and will improve recognition of modern references.
- Variant spellings (e.g., for "Yahya Sinwar") are included, which is helpful.
The only suggestion for improvement is to consider sorting all entries alphabetically for better maintainability and ease of lookup. Overall, these changes are approved and will positively impact the French language module.
4829-4829: New military term added
A new military term has been added:
- brigade Bislamach
This entry is correctly formatted and tagged as "N f s" (noun, feminine, singular).
To ensure the accuracy of this term, please run the following script:
This will help confirm if the term is used consistently throughout the codebase and if there are any alternate spellings or related terms we should consider including.
✅ Verification successful
To ensure the new term "brigade Bislamach" is used consistently and without duplicates, please run the following scripts:
Verification Successful
The new military term "brigade Bislamach;N f s" has been added correctly and is used consistently without duplicates in the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the spelling and usage of "brigade Bislamach"
rg -i "brigade\s+bislamach"
rg -i "bislamach"
Length of output: 287
Script:
#!/bin/bash
# Search for any additional occurrences of "bislamach" in the codebase
rg -i "bislamach"
# Confirm that "brigade Bislamach" appears only once
rg -i "brigade\s+Bislamach" | wc -l
Length of output: 177
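The `wc -l` check above counts matches across the whole repository, so a term that legitimately appears in several resource files yields a count above one. A narrower sketch, assuming the goal is only to detect exact duplicate lines within a single dictionary file, could look like this; `find_duplicates` is a hypothetical helper name, not an existing project script.

```shell
#!/bin/bash
# Hypothetical helper: print each line that occurs more than once in the given file.
# LC_ALL=C makes sort and uniq compare raw bytes, so the result does not
# depend on the collation rules of the current locale.
find_duplicates() {
  LC_ALL=C sort "$1" | LC_ALL=C uniq -d
}
```

An empty result from `find_duplicates` on multiwords.txt would indicate that no entry, including the new ones, is repeated verbatim.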
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/hunspell/spelling.txt (4)
34722-34723: Proper nouns and place names: Approved
The additions of proper nouns and place names (Héraclée, Asculum, Pausilippe, Türkiye, Méhémet, Zarafa, Nétanyahou, Rafah) are correct and valuable. They include historical places, people, and current geographical names, which will enhance the tool's ability to process texts containing these references. The inclusion of "Türkiye" shows commendable attention to recent geopolitical name changes.
Also applies to: 34724-34727, 34734-34735
34727-34728: Medical terms: Approved
The addition of "acetylleucine" and "Tanganil" enhances the medical vocabulary of the tool. "Acetylleucine" is correctly spelled, and "Tanganil" is its corresponding brand name. Including both generic and brand names is a thorough approach that will improve the tool's ability to process medical texts.
34721-34735: Overall assessment: Mostly approved with minor suggestions
The additions to the French spelling dictionary are generally valuable and correct. They include a good mix of proper nouns, medical terms, generational labels, and colloquial expressions that will enhance the language tool's vocabulary. However, there are a few points that require attention:
- Consider adding French equivalents for the generational labels (Gen X, Gen Y, Gen Z) if they exist.
- Ensure that the language tool can properly handle the slang term "wesh" in different formality contexts.
- Clarification is needed for the term "Opella" to verify its relevance and correctness.
Addressing these points will further improve the quality and comprehensiveness of the dictionary additions.
34721-34722: Other terms: Partially approved, clarification needed
- "yogique" is correctly added as the French adjective related to yoga.
- "Shabak" is appropriate, likely referring to the Israeli internal security service.
However, clarification is needed for "Opella". Without context, it's difficult to verify its correctness or relevance to the French dictionary. Could you please provide more information about this term?
To gather more context about "Opella", please run the following script:
#!/bin/bash
# Description: Search for context or usage of "Opella" in the project
# Search for "Opella" in all text files
rg -i "Opella" -t txt -C 3
# Check if "Opella" appears in any other dictionary files
fd . -e txt -e dic | xargs rg -i "Opella"
Also applies to: 34733-34734, 34735-34736
Benyamin Nétanyahou;Z m s
Gen Z;N e sp
Gen Y;N e sp
Gen X;N e sp
Yahya Sinouar;Z m s
Yahya Sinwar;Z m s
brigade Bislamach;N f s
🛠️ Refactor suggestion
Suggestion for alphabetical ordering
While the new entries are correctly formatted and tagged, they are not in alphabetical order with respect to the rest of the file. Consider sorting all entries alphabetically to improve maintainability and ease of lookup.
To sort the entries alphabetically, you can use the following command:
sort -o languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/multiwords.txt
This sorts the file in place, overwriting the original with the sorted content. The resulting order depends on the locale in effect when sort runs.
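Before rewriting the file, it may be worth checking whether it is sorted at all, since sorting a file that was never kept in order would reorder many unrelated entries. A minimal sketch, assuming byte-order comparison (`LC_ALL=C`) is acceptable; the ordering the project actually intends is not stated in this thread, and `is_sorted` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) only if the file is already
# in byte order. 'sort -c' checks the order without writing any output file.
is_sorted() {
  LC_ALL=C sort -c "$1" 2>/dev/null
}
```

For example, `is_sorted multiwords.txt && echo "already sorted"` reports the current state without modifying the file.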
Opella
💡 Codebase verification
Informal Language Handling: Issues Found
The addition of "wesh" is appropriate for recognizing contemporary French slang. However, the verification indicates that there are no existing rules or metadata for handling informal language within the tool. To ensure accurate differentiation between informal slang usage and potential misspellings in formal contexts, please implement metadata or usage flags that categorize text based on formality levels.
🔗 Analysis chain
Colloquial expression: Approved with usage note
The addition of "wesh" is appropriate for recognizing contemporary French slang. This term is commonly used in informal contexts, particularly among younger speakers. However, ensure that the language tool can differentiate between intentional usage of this slang term and potential misspellings in formal contexts. Consider adding metadata or usage flags to help the tool provide accurate suggestions based on the text's formality level.
To ensure proper handling of this slang term, please run the following script to check if there are any existing rules or metadata for handling informal language:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for existing rules or metadata for handling informal language
# Search for files that might contain rules for informal language
fd -e xml -e txt . | xargs rg -i "informal|colloquial|slang"
# Check if there's a specific file for managing formality levels
fd "formality" -t f
Length of output: 7460
In the context of https://app.asana.com/0/1205948788428908/1208527842981360/f