Cb changes 20241021 #10960

Merged
merged 6 commits into from
Oct 21, 2024

@tiff (Member) commented Oct 21, 2024

Summary by CodeRabbit

  • New Features

    • Expanded scientific names list for various species.
    • Enhanced German spelling rules with new patterns for gender-neutral language and compound words.
    • Updated compound words list to reflect contemporary usage in German.
    • Added new entries to the ignore list for specific terms in the German language module.
    • Significant additions to the Austrian German spell checker dictionary.
    • Updated multi-token suggestions and ignore lists to include contemporary terms and names.
    • New antipatterns introduced to improve grammar checking involving modal verbs and negations.
  • Bug Fixes

    • Corrections made to spelling rules to align with the 2024 German spelling reform.
  • Documentation

    • Updates to the documentation reflecting changes in spelling and grammar rules.

coderabbitai bot (Contributor) commented Oct 21, 2024

Walkthrough

The pull request introduces multiple updates across various files within the LanguageTool project, primarily focusing on the German language module. Key changes include the addition of scientific names in spelling_global.txt, enhancements to the German spelling rules in GermanSpellerRule.java, and updates to compound words, ignored terms, and spell-checking dictionaries. These modifications aim to improve the language processing capabilities and the accuracy of spelling suggestions.

Changes

| File Path | Change Summary |
| --- | --- |
| languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt | Added scientific names for various species, including plants, animals, and bacteria. |
| languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanSpellerRule.java | Enhanced German spelling rules with new regex patterns, updated match logic, refined filtering, and added helper methods for gender-neutral language and compound words. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt | Expanded the list of compound words with new entries and specific formatting symbols to guide suggestions. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt | Added new entries to the ignore list, expanding the terms that the spell checker will overlook. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling-de-AT.txt | Introduced numerous new entries relevant to Austrian German, including nouns, adjectives, and colloquial expressions. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt | Added various new words and phrases, reflecting contemporary usage and enhancing the spell-checking lexicon. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-ignore.txt | Added a new entry "Happy Path/S" to the multitoken ignore list. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt | Added new entries and modified existing ones to enhance suggestion capabilities for contemporary terms and names. |
| languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt | Updated replacement rules to align with the 2024 German spelling reform, correcting various names and terms. |
| languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt | Introduced several new entries to the English part-of-speech dictionary, including proper nouns and adjectives. |
| languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt | Added a substantial number of words and phrases to the ignore list for the English spell checker. |
| languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt | Expanded the dictionary with new words, including scientific terms and modern slang. |
| languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml | Added new antipatterns for modal verbs and negations, and modified existing patterns to enhance grammar checking capabilities. |
| languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml | No changes reported in this file. |

Suggested reviewers

  • SteVio89
  • GillouLT

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (24)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)

64662-64668: LGTM! Consider adding a note for "homemade".

The new entries are correctly spelled and annotated. They enhance the dictionary with contemporary terms and compound words.

For "homemade", consider adding a note to indicate it's an anglicism:

-homemade
+homemade #anglicism

64906-64938: LGTM! Consider adding notes for clarity.

The new entries significantly enhance the dictionary with contemporary terms, compound words, and anglicisms. They are correctly spelled and annotated.

Consider the following minor improvements:

  1. Add a note for "Scripted-Reality-" compounds to indicate they're media-related terms.
  2. Add a brief explanation for "WEEE-Reg.-Nr." (e.g., #environmental_regulation).
  3. Mark "Dropshipper" terms as anglicisms.

Example:

-Scripted-Reality-Star/S
+Scripted-Reality-Star/S #media
-WEEE-Reg.-Nr.
+WEEE-Reg.-Nr. #environmental_regulation
-Dropshipper/SN
+Dropshipper/SN #anglicism
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (2)

86448-86453: LGTM! Consider adding 'must' to the modal verb list.

The new antipattern effectively captures common constructions with modal verbs followed by negations and personal pronouns. This will help prevent false positives in the grammar checker.

Consider adding 'must' to the list of modal verbs in the first token for completeness:

-                <token regexp="yes" case_sensitive="yes">can|ca|have|should|could|will|wo|may|might</token>
+                <token regexp="yes" case_sensitive="yes">can|ca|have|should|could|will|wo|may|might|must</token>

86454-86459: LGTM! Consider refining the second token matcher.

The new antipattern effectively handles cases involving mentions (e.g., @username) followed by modal verbs and negations. This is a good addition for modern text analysis.

Consider refining the second token matcher to be more specific:

-                <token regexp="yes">.+</token>
+                <token regexp="yes">\w+</token>

This change would ensure that the token following the '@' mention consists of word characters only, which is more typical for usernames or mentions.
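
To make the difference between the two matchers concrete, here is a minimal Python sketch. It uses Python's `re` module as a stand-in for the Java regex engine behind grammar.xml and assumes a token must match the pattern as a whole; the sample tokens are hypothetical, and whether strings such as `alice-smith` reach the rule as a single token depends on the tokenizer.

```python
import re

ANY = r".+"    # current matcher: any non-empty token
WORD = r"\w+"  # suggested matcher: word characters only

# Hypothetical tokens that might follow an '@' in a mention.
TOKENS = ["alice", "alice_92", "alice-smith", "!!!"]

def matches(pattern, token):
    """Return True if the pattern matches the entire token."""
    return re.fullmatch(pattern, token) is not None

# Map each token to (matched by '.+', matched by '\w+').
results = {t: (matches(ANY, t), matches(WORD, t)) for t in TOKENS}

for token, (any_ok, word_ok) in results.items():
    print(token, any_ok, word_ok)
```

Under these assumptions, the narrower pattern would stop covering tokens that contain hyphens or consist only of punctuation, so the antipattern would no longer suppress matches for them.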

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (3)

1087-1087: Consider adding Choctaw as a proper noun.

The entry for "Choctaw" as an adjective (JJ) is correct. However, consider adding an additional entry for Choctaw as a proper noun (NNP) to cover its usage when referring to the Choctaw people or language.

Suggested addition:

Choctaw	Choctaw	NNP

Line range hint 1089-1089: Consider adding Seminole as a singular proper noun and adjective.

The entry for "Seminole" as a plural proper noun (NNPS) is correct. However, consider adding additional entries for Seminole as a singular proper noun (NNP) and as an adjective (JJ) to cover all common usages.

Suggested additions:

Seminole	Seminole	NNP
Seminole	Seminole	JJ

Line range hint 1090-1090: Consider adding counsel as a singular noun and verb.

The entry for "counsel" as a plural noun (NNS) is correct in some contexts. However, consider adding additional entries for counsel as a singular noun (NN) and as a verb (VB) to cover all common usages.

Suggested additions:

counsel	counsel	NN
counsel	counsel	VB
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt (4)

Line range hint 763-768: LGTM: Comprehensive addition of "EU-feindlich" entries

The additions for various forms of "EU-feindlich" are correct and cover all grammatical cases and genders. This is a good improvement for consistency in hyphenated compound adjectives.

Consider using a more compact representation if the system supports it, such as:

EUfeindlich(e|em|en|er|es)?=EU-feindlich$1

This could potentially reduce the number of entries while achieving the same result.
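
As a rough illustration of what the compact form would cover, the following Python sketch applies an equivalent pattern. Whether replace.txt accepts regular expressions at all is not established in this review, so this only demonstrates the pattern itself; the helper name `hyphenate` is hypothetical.

```python
import re

# Python equivalent of the suggested pattern; group 1 plays the role of $1.
PATTERN = re.compile(r"EUfeindlich(e|em|en|er|es)?")

def hyphenate(word):
    """Return the hyphenated form if the whole word matches, else the word unchanged."""
    match = PATTERN.fullmatch(word)
    if match is None:
        return word
    # group(1) is None when the optional ending is absent.
    return "EU-feindlich" + (match.group(1) or "")

for form in ["EUfeindlich", "EUfeindliche", "EUfeindlichen", "EUfeindlichkeit"]:
    print(form, "->", hyphenate(form))
```

Words with other endings, such as the hypothetical "EUfeindlichkeit", are left untouched because the whole word has to match.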


Line range hint 770-772: LGTM: Correct handling of "Crème" variations

The entries for "Crème de la Crème" and variations of "Crème" are handled correctly. Offering both "Creme" and "Crème" as alternatives is appropriate, considering their common usage in German.

Consider adding a comment explaining the acceptance of both spellings to provide context for future maintainers.


Line range hint 773-789: LGTM: Comprehensive updates for 2024 German spelling reform

The additions reflecting the 2024 German spelling reform are accurate and well-documented. The changes cover important words like "Spaghetti", "Thunfisch", "Exposé", and "Joghurt", including their various grammatical forms.

The explanatory comments for each change are particularly helpful. Consider adding a general comment at the beginning of this section to indicate that these changes are part of the 2024 reform, which could help with future maintenance and updates.


Line range hint 791-816: LGTM: Thorough updates for English loanword plurals

The changes to English loanword plurals ending in -y are comprehensive and align with the 2024 German spelling reform. The consistent explanatory comments are valuable for understanding the rule.

To optimize this section, consider using a regex pattern if the system supports it. For example:

([A-Za-z]+)ies=$1ys	Nach den offiziellen Rechtschreibanpassungen von 2024 wird der Plural von Substantiven aus dem Englischen, die auf -y enden, im Deutschen auf "-ys" gebildet (Ausnahme: Eigennamen)

This could potentially cover all cases with a single rule, reducing maintenance overhead. However, ensure that this doesn't conflict with any exceptions or special cases before implementing.
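
The caveat about exceptions matters in practice. The Python sketch below applies an equivalent catch-all pattern to a few sample words (chosen here for illustration, not taken from replace.txt) and shows that native German words ending in "-ies" would be rewritten as well. It assumes whole-word matching and says nothing about whether replace.txt supports regular expressions.

```python
import re

# Python equivalent of the suggested catch-all rule.
PLURAL_IES = re.compile(r"([A-Za-z]+)ies")

def rewrite(word):
    """Rewrite a trailing '-ies' to '-ys' if the whole word matches, else return it unchanged."""
    match = PLURAL_IES.fullmatch(word)
    return match.group(1) + "ys" if match else word

# "Hobbies" and "Parties" are intended targets; "Paradies" and "Kies" are
# ordinary German words that the pattern would rewrite by mistake.
for word in ["Hobbies", "Parties", "Paradies", "Kies"]:
    print(word, "->", rewrite(word))
```

A single rule of this shape would therefore need an explicit exception list, which may offset much of the maintenance saving.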

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (2)

3081-3088: New entries added successfully.

The new entries have been added correctly, maintaining alphabetical order and following the existing format. Most entries use the /S suffix, likely indicating plural forms.

Consider adding a suffix to "Scripted Reality"

Unlike the other new entries, "Scripted Reality" doesn't have a suffix. If applicable, consider adding /S or another appropriate suffix for consistency.


Line range hint 1-3088: File structure and consistency are generally good

The file maintains a consistent format with entries listed one per line, generally in alphabetical order. The mix of languages and the inconsistent use of suffixes appear to be intentional, likely based on the specific needs of each entry in the context of the German language module.

Consider documenting suffix usage

To improve maintainability, consider adding a comment at the beginning of the file explaining the meaning and usage of different suffixes (e.g., /S, /E, /N). This would help future contributors understand when and how to apply these suffixes to new entries.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (2)

Line range hint 1-10960: Diverse name additions improve international text handling

The inclusion of numerous proper nouns and names from various cultures and languages greatly enhances the spellchecker's ability to handle international content. However, the "#name" annotation is used inconsistently throughout the list.

Consider applying the "#name" annotation consistently to all entries that represent personal names or surnames to improve maintainability and potential future processing of this list.


Line range hint 1-10960: Extensive technical term and acronym additions improve specialized text handling

The inclusion of numerous technical terms and acronyms from various fields significantly enhances the spellchecker's capability to handle specialized text. This is particularly beneficial for technical and scientific writing.

Consider implementing a process for regular review and update of the acronym list, especially for very specific or potentially temporary acronyms. This will help maintain the list's relevance over time and prevent it from becoming outdated or overly cluttered.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (5)

Line range hint 21-100: Remove redundant repeated entries

There are multiple identical entries for terms like "a priori", "a posteriori", "a fortiori", etc. These repetitions are unnecessary and should be removed to maintain a clean and efficient list. For example:

a priori	NNP
-a fortiori	NNP
-a posteriori	NNP
-a priori	NNP
-a fortiori	NNP
-a posteriori	NNP
-a priori	NNP
-a fortiori	NNP
-a posteriori	NNP
-a priori	NNP

Please review the entire file and remove all such redundant entries, keeping only one instance of each unique term-POS tag combination.
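
One way to carry out such a cleanup is sketched below in Python. The function name and the handling of comment lines are assumptions made for illustration; the sketch keeps the first occurrence of each term/POS line and preserves the original order.

```python
def dedupe_entries(lines):
    """Keep the first occurrence of each entry line, preserving order.

    Blank lines and comment lines (starting with '#') are always kept.
    """
    seen = set()
    result = []
    for line in lines:
        key = line.rstrip("\n")
        if key.strip() == "" or key.lstrip().startswith("#"):
            result.append(line)
            continue
        if key in seen:
            continue  # exact duplicate of an earlier term/POS line
        seen.add(key)
        result.append(line)
    return result

sample = [
    "a priori\tNNP",
    "a fortiori\tNNP",
    "a posteriori\tNNP",
    "a priori\tNNP",
    "a fortiori\tNNP",
]
print(dedupe_entries(sample))
```

Because the comparison is on the full line, the same term with a different POS tag is treated as a distinct entry and kept.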


Line range hint 101-500: Consider adding alternative POS tags for versatile terms

Some entries in the list could benefit from additional POS tags to cover different usage contexts. For example:

-Banco National Park	NNP
+Banco National Park	NNP:NN
-Côte d'Azur	NNP
+Côte d'Azur	NNP:NN

Adding these alternative tags would allow for more flexible recognition of these terms in various grammatical contexts. Please review the list and identify other entries that could benefit from multiple POS tags.


Line range hint 501-1000: Excellent coverage of diverse terms; consider a systematic approach for product versions

The list demonstrates excellent coverage of terms from various domains, including different languages, scripts, historical periods, and product names. This diversity will greatly enhance the tool's ability to process a wide range of texts.

For product names with version numbers (e.g., "PlayStation 4", "iPhone 11"), consider implementing a more systematic approach to cover future versions without needing to add each new entry manually. For example:

+PlayStation	NNP
+iPhone	NNP

Then, implement a rule in the language processing logic to recognize these product names followed by a number as a single entity. This approach would automatically cover future versions like "PlayStation 5" or "iPhone 15" without requiring updates to this list.
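
A minimal sketch of that idea in Python, assuming a hypothetical list of base product names and a plain regex in place of LanguageTool's own tokenization and tagging:

```python
import re

# Hypothetical base entries; in the proposal these would come from multiwords.txt.
PRODUCT_NAMES = ["PlayStation", "iPhone"]

# A base name followed by a single space and a version number.
VERSIONED = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in PRODUCT_NAMES) + r") \d+\b"
)

def find_versioned_products(text):
    """Return every 'name number' span that should be treated as one entity."""
    return VERSIONED.findall(text)

print(find_versioned_products("She traded her PlayStation 5 for an iPhone 15."))
```

This only covers purely numeric versions; names such as "iPhone SE" would still need explicit entries.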


Line range hint 1001-8024: Consider adding categories or domains to entries

The list includes a wide variety of terms, including names of people, fictional characters, places, currencies, and units of measurement. To enhance the usability and maintainability of this resource, consider adding category or domain information to certain types of entries. For example:

-Albert Einstein	NNP
+Albert Einstein	NNP	# category:person:scientist
-Hogwarts School	NNP
+Hogwarts School	NNP	# category:fictional:place
-Euro	NNP
+Euro	NNP	# category:currency
-kilowatt-hour	NN
+kilowatt-hour	NN	# category:unit:energy

This additional information could be used to improve context-aware processing and make it easier to update or filter specific types of entries in the future.
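
If annotations of this shape were adopted, they could be read back with a few lines of code. The Python sketch below assumes the comment syntax shown in the example above (`# category:...`), which is a proposal in this review rather than an existing convention of the file; the function name is hypothetical.

```python
def parse_entry(line):
    """Split 'term<TAB>POS<TAB># category:a:b' into (term, pos, categories)."""
    body, _, comment = line.partition("#")
    fields = body.rstrip().split("\t")
    term, pos = fields[0], fields[1]
    comment = comment.strip()
    prefix = "category:"
    categories = comment[len(prefix):].split(":") if comment.startswith(prefix) else []
    return term, pos, categories

print(parse_entry("Albert Einstein\tNNP\t# category:person:scientist"))
print(parse_entry("kilowatt-hour\tNN"))
```

Entries without an annotation simply yield an empty category list, so existing lines would not need to change.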


Line range hint 1-8024: Comprehensive and well-structured resource with room for optimization

This file provides a valuable and extensive list of multi-word terms with their corresponding POS tags, covering a wide range of domains, languages, and entity types. The consistent formatting and diverse coverage make it an excellent resource for English language processing in LanguageTool.

To further improve this file, consider implementing the following suggestions:

  1. Remove redundant repeated entries, especially for terms like "a priori", "a posteriori", etc.
  2. Add alternative POS tags for versatile terms that can function in different grammatical roles.
  3. Implement a systematic approach for handling product names with version numbers.
  4. Add category or domain information to entries to enhance context-aware processing and maintainability.

These optimizations will make the resource more efficient, easier to maintain, and potentially more powerful in its language processing capabilities.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (3)

Line range hint 25-11595: Comprehensive list of compound words with correct formatting.

The list covers a wide range of topics and includes many modern terms, which is beneficial for language processing applications. The use of symbols and regular expressions appears to be consistent with the stated rules.

Consider reviewing the list for:

  1. Potential inconsistencies in hyphenation or capitalization (e.g., "Cyber-Deal$" vs "Cyber-deal$").
  2. Possible redundancies or very similar entries that could be combined.
  3. Opportunities to add context or explanations for less common or specialized terms.

11596-11602: New entries for "Scripted-Reality" terms are appropriate and consistently formatted.

The added entries related to "Scripted-Reality" are relevant to modern media and follow the established formatting rules. The use of the asterisk (*) symbol is correct for these terms.

Consider expanding this category with additional related terms, such as:

  • Scripted-Reality-Format*
  • Scripted-Reality-Produktion*
  • Scripted-Reality-Drehbuch*

This would provide more comprehensive coverage of the "Scripted-Reality" concept.

Line range hint 11603-11634: Diverse and relevant new entries with correct formatting.

The additional entries cover various topics and use the appropriate symbols as per the established rules. They contribute to the comprehensiveness of the compound words list.

Consider reviewing the consistency of hyphenation for similar terms across the entire file. For example:

  • "Inhouse-Seminar$" is hyphenated, but there might be other "Inhouse" compounds that could be treated similarly for consistency.
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (1)

3394-3398: LGTM: Consistent application of "first and foremost" antipattern.

The addition of this antipattern to another rule is consistent with the previous change and will help prevent false positives across different contexts.

Consider creating a shared list of common antipatterns that can be referenced across multiple rules. This could reduce duplication and make it easier to maintain these exceptions in the future.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d59ad5c and 138201e.

📒 Files selected for processing (15)
  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2 hunks)
  • languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanSpellerRule.java (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling-de-AT.txt (0 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (4 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-ignore.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (2 hunks)
🔥 Files not summarized due to errors (1)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml: Error: Server error: no LLM provider could handle the message
💤 Files with no reviewable changes (1)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling-de-AT.txt
✅ Files skipped from review due to trivial changes (2)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-ignore.txt
🧰 Additional context used
🔇 Additional comments (28)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)

64828-64829: LGTM! Good addition of colloquial variant.

The entries "herum_verhandeln" and "rum_verhandeln" are correctly formatted and provide both the standard and colloquial variants of the verb.


64865-64865: LGTM! Correct addition of compound adjective.

The entry "anbringbar/A" is correctly spelled and annotated. It's a valid compound adjective in German.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)

86461-86464: LGTM! Good enhancement for handling modern text styles.

The addition of the exception for tokens starting with '@' aligns well with the new antipattern and makes the grammar checker more adaptable to text styles commonly used in social media and other online platforms.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1)

478-500: Excellent additions to the spelling dictionary!

The new entries are well-chosen and comprehensive:

  1. Pyrography-related terms (lines 478-484) cover various forms (noun, verb, adjective) of this artistic technique.
  2. Misandry-related terms (lines 485-492) include different forms and even the more specific "transmisandry".
  3. Nationality terms (lines 493-500) add correct demonyms for various countries/regions, including less common terms like "Monegasque".

These additions will improve the spell checker's coverage of specialized terms and nationalities.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/added.txt (10)

1084-1084: LGTM: Kochi entry is correct.

The entry for "Kochi" is properly formatted and correctly tagged as a proper noun (NNP).


1085-1085: LGTM: Kerala entry is correct.

The entry for "Kerala" is properly formatted and correctly tagged as a proper noun (NNP).


1086-1086: LGTM: nanoplankton entry is correct.

The entry for "nanoplankton" is properly formatted and correctly tagged as a plural noun (NNS).


1088-1088: LGTM: Brangelina entry is correct.

The entry for "Brangelina" is properly formatted and correctly tagged as a proper noun (NNP).


Line range hint 1091-1091: LGTM: Hydrapple entry is correct.

The entry for "Hydrapple" is properly formatted and correctly tagged as a proper noun (NNP).


Line range hint 1092-1092: LGTM: Terapagos entry is correct.

The entry for "Terapagos" is properly formatted and correctly tagged as a proper noun (NNP).


Line range hint 1093-1093: LGTM: NewsNation entry is correct.

The entry for "NewsNation" is properly formatted and correctly tagged as a proper noun (NNP).


Line range hint 1094-1094: LGTM: Pecharunt entry is correct.

The entry for "Pecharunt" is properly formatted and correctly tagged as a proper noun (NNP).


Line range hint 1095-1095: LGTM: phygital entry is correct.

The entry for "phygital" is properly formatted and correctly tagged as an adjective (JJ).


Line range hint 1084-1095: Overall, the additions to the dictionary are well-formatted and appropriate.

The new entries maintain the consistent format of the file and expand the dictionary with relevant terms. Some minor suggestions have been made for a few entries to improve coverage of different word forms or parts of speech.

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt (4)

761-762: LGTM: Correct addition of "Taşan-Funke" entry

The addition of "Taşan-Funke" and its genitive form with the correct special character "ş" is appropriate and follows the file's format.


Line range hint 769-769: LGTM: Correct addition of "Place Saint-Lambert"

The addition of the entry for "Place Saint-Lambert" with proper hyphenation is correct and improves the handling of French place names.


Line range hint 817-818: LGTM: Correct additions for commonly misspelled phrases

The entries for "Fettarmekost" and "Gutelaune" are appropriate additions. They correctly separate these commonly misspelled compound words into their proper adjective-noun combinations: "fettarme Kost" and "gute Laune". This will help improve the accuracy of the language tool for these phrases.


Line range hint 1-818: Overall assessment: Excellent updates to German language rules

The changes in this file significantly improve the LanguageTool's ability to handle recent German spelling reforms, English loanwords, and common misspellings. The additions are well-structured, consistently formatted, and include helpful explanatory comments. The comprehensive coverage of grammatical forms and the attention to special characters and hyphenation rules are commendable.

Consider the suggested optimizations for more compact rule representations where applicable, and the addition of section comments for better organization. These minor improvements could further enhance the maintainability of the file.

Great work on keeping the language tool up-to-date with the latest language standards!

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

Line range hint 1-3088: Changes successfully enhance multitoken suggestions

The additions to this file improve the coverage of multitoken suggestions for the German language module in LanguageTool. The new entries are relevant, well-formatted, and maintain the existing structure of the file. These changes should positively impact the tool's ability to provide accurate suggestions for complex terms and phrases.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (4)

Line range hint 1-10960: Comprehensive update to the ignore list enhances spellchecker accuracy

The substantial additions to the ignore list cover a wide range of contemporary language usage, including common words, specialized terms, and proper nouns. This comprehensive update will significantly improve the spellchecker's accuracy and reduce false positives, especially for modern English text.


Line range hint 1-10960: Inclusion of contractions and informal terms improves handling of casual text

The addition of various contractions and informal language terms enhances the spellchecker's ability to handle casual writing and social media content. This update reflects the evolving nature of written communication and will reduce false positives in less formal contexts.


Line range hint 1-10960: Substantial improvements with recommendations for refinement

The extensive updates to the ignore.txt file significantly enhance the spellchecker's functionality across various types of text, including technical, informal, and international content. The additions reflect contemporary language usage and will greatly reduce false positives in spell-checking.

To further improve and maintain the effectiveness of this list, consider the following recommendations:

  1. Implement a regular review process to validate and update entries, especially for acronyms and specialized terms.
  2. Standardize the use of annotations (e.g., "#name") and formatting throughout the list.
  3. Develop guidelines for adding new entries to ensure consistency and relevance.
  4. Periodically assess the impact of the ignore list on overall spell-checking accuracy to balance comprehensiveness with precision.

These steps will help ensure that the ignore list remains a valuable and accurate resource for the spellchecker while adapting to evolving language usage.


Line range hint 1-10960: Review suggested for potential issues in the ignore list

While the majority of additions enhance the spellchecker's functionality, there are some potential issues to consider:

  1. Some entries appear to be misspellings or very uncommon terms, which might negatively impact spell-checking accuracy.
  2. There are inconsistencies in formatting, such as capitalization and use of hyphens, which could affect maintainability.
  3. Some entries might be too specific or localized, potentially leading to over-ignoring in general contexts.

To address these concerns, consider running a script over the list to identify potential issues.

Based on the results, consider implementing a review process to:

  1. Validate uncommon terms and potential misspellings.
  2. Standardize formatting and capitalization.
  3. Evaluate the inclusion of highly specific or localized terms.

This will help maintain the list's accuracy and effectiveness over time.
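
One of the checks described here, spotting entries that differ only in capitalization or hyphenation, could be approximated with a short program. The following is a minimal sketch, not a script from this review. It assumes one entry per line and treats everything after a `#` as a comment or annotation. The class name `IgnoreListAudit` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class IgnoreListAudit {

  // Assumption: "#" starts a comment or annotation (e.g. "#name"); drop it and trim.
  static String entryOf(String line) {
    int hash = line.indexOf('#');
    String entry = hash >= 0 ? line.substring(0, hash) : line;
    return entry.trim();
  }

  // Group entries whose spelling differs only in letter case or hyphens.
  // Only groups with at least two distinct spellings are returned.
  static Map<String, List<String>> nearDuplicates(List<String> lines) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (String line : lines) {
      String entry = entryOf(line);
      if (entry.isEmpty()) {
        continue; // blank line or pure comment
      }
      String key = entry.toLowerCase(Locale.ROOT).replace("-", "");
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(entry);
    }
    groups.values().removeIf(variants -> variants.stream().distinct().count() < 2);
    return groups;
  }

  public static void main(String[] args) {
    // Illustrative sample data, not taken from the real ignore.txt.
    List<String> sample = List.of("# comment", "e-mail", "Email", "email #name", "foo");
    nearDuplicates(sample).forEach(
        (key, variants) -> System.out.println(key + " -> " + variants));
  }
}
```

In practice the lines could be read with `java.nio.file.Files.readAllLines` and the reported groups reviewed by hand, since some case variants (for example a proper noun next to a common noun) may be intentional.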

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1)

Line range hint 1-20: Well-structured and consistently formatted entries

The file begins with a clear explanation of its purpose and usage guidelines. The entries are consistently formatted with the term followed by a tab and then the POS tag. Additional comments are appropriately placed after the # symbol. The variety of POS tags (NNP, NN, UN, NNS, RB, JJ) indicates comprehensive coverage of different types of multi-word terms.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/compounds.txt (2)

Line range hint 1-24: Introductory comments and rules are clear and informative.

The file header provides a comprehensive explanation of the purpose, encoding, and formatting rules for the compound words list. The instructions for removing duplicates are also helpful for maintaining the file.


Line range hint 1-11634: Overall, the compounds.txt file is well-maintained and serves its purpose effectively.

The file provides a comprehensive list of German compound words with clear formatting rules and consistent use of symbols. The recent additions, including the "Scripted-Reality" terms and other diverse entries, enhance the list's relevance to modern language use.

To further improve the file:

  1. Consider conducting a thorough review for any inconsistencies in hyphenation or capitalization across similar terms.
  2. Evaluate the list for potential redundancies or very similar entries that could be combined.
  3. Where appropriate, add context or explanations for specialized or less common terms.
  4. Continue to expand categories (like the "Scripted-Reality" terms) with related words to provide more comprehensive coverage.

These steps will help maintain the high quality and usefulness of the compound words list for language processing applications.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/style.xml (2)

3353-3357: LGTM: Antipattern for "first and foremost" added.

The addition of this antipattern is a good improvement. It will prevent the rule from flagging the common idiomatic expression "first and foremost" as a style issue, reducing false positives in the language tool.


3353-3357: Summary: Improved handling of "first and foremost" expression.

These changes consistently add antipatterns for the phrase "first and foremost" across multiple rules. This will enhance the accuracy of the language tool by preventing false positives for this common idiomatic expression. The implementation is correct and well-placed within the rule structures.

For future improvements, consider exploring ways to centralize the definition of such common antipatterns to enhance maintainability and consistency across the configuration.

Also applies to: 3394-3398

@@ -3704,6 +3704,7 @@ protected List<SuggestedReplacement> getOnlySuggestions(String word) {
case "Dnake": return topMatch("Danke");
case "Muehe": return topMatch("Mühe");
case "Muehen": return topMatch("Mühen");
case "Kuhdamm": return topMatch("Ku’damm");

🛠️ Refactor suggestion

Consider using a straight apostrophe or including multiple suggestions for "Ku’damm".

The suggestion "Ku’damm" uses a typographic apostrophe (U+2019). For consistency and to avoid potential encoding issues, you might use the standard apostrophe (U+0027) as in "Ku'damm". Additionally, since "Kudamm" without an apostrophe is also a common spelling, consider including it as an alternative suggestion.

Apply the following diff to update the suggestion:

-          case "Kuhdamm": return topMatch("Ku’damm");
+          case "Kuhdamm": return topMatch("Ku'damm", "Kudamm");

Committable suggestion was skipped due to low confidence.
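
To illustrate why the choice of apostrophe matters, the sketch below shows that the two spellings are different strings in Java, so a lookup or comparison written with one character will not match the other. The class name `ApostropheCheck` and the helper `toStraight` are hypothetical and not part of LanguageTool.

```java
public class ApostropheCheck {

  static final char TYPOGRAPHIC = '\u2019'; // RIGHT SINGLE QUOTATION MARK
  static final char STRAIGHT = '\'';        // APOSTROPHE (U+0027)

  // Replace every typographic apostrophe with the straight one.
  static String toStraight(String s) {
    return s.replace(TYPOGRAPHIC, STRAIGHT);
  }

  public static void main(String[] args) {
    String curly = "Ku\u2019damm";
    String straight = "Ku'damm";
    System.out.println(curly.equals(straight));             // false
    System.out.println(toStraight(curly).equals(straight)); // true
  }
}
```

Whether to normalize at all is a design decision: typographic apostrophes are standard in typeset German text, so a project may prefer to keep U+2019 in suggestions and normalize only when comparing.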

@tiff tiff merged commit c382af0 into master Oct 21, 2024
5 checks passed
@tiff tiff deleted the cb-changes-20241021 branch October 21, 2024 20:18