Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for escaping characters in KQL key names. #560

Merged
merged 7 commits into from
Nov 13, 2024

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Oct 21, 2024

Description

This PR adds support for using backslash to escape characters in KQL key names. The change is mostly contained in the StringUtiles::tokenize_column_descriptor function, but does involve a small change to the kql.g4 grammar to allow the '.' character to be escaped.

For example the key consisting of the tokens {"com.bnn", "uuid"} can now be specified as "com\.bnn.uuid" in KQL.

This PR also adds error handling for invalid KQL key names. Invalid key names currently include keys ending ending in a trailing unescaped '' or '.'.

Validation performed

  • Added unit tests for parsing keys with escaped characters
  • Confirmed that sensible errors are shown on the command line when entering invalid KQL queries

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Enhanced error handling in JSON parsing and KQL expression parsing.
    • Added support for escape sequences in column names within KQL.
    • Updated KQL grammar to recognize periods as special characters.
  • Bug Fixes

    • Improved robustness of tokenization methods to prevent failures from propagating.
  • Tests

    • Introduced new test cases for validating the parsing of queries with escaped characters in column names.
    • Added tests for handling illegal escape sequences in column names.
    • Expanded coverage for cases with empty tokens in column names.

Copy link
Contributor

coderabbitai bot commented Oct 21, 2024

Walkthrough

The changes across multiple files enhance error handling and validation processes in the parsing and processing of JSON and KQL data. Key modifications include improved error logging and exception handling in the JsonParser and TimestampDictionaryReader. The tokenize_column_descriptor method's return type was changed to bool in Utils, allowing for better error feedback. Additionally, the KQL grammar was updated to recognize periods as special characters, and new tests were added to validate parsing with escape sequences. Overall, the changes focus on robustness and error management.

Changes

File Path Change Summary
components/core/src/clp_s/JsonParser.cpp Enhanced error handling in constructor and parse method; added logging and exception handling.
components/core/src/clp_s/TimestampDictionaryReader.cpp Added error handling in read_new_entries method for tokenization failures.
components/core/src/clp_s/Utils.cpp Changed return type of tokenize_column_descriptor from void to bool and improved functionality.
components/core/src/clp_s/Utils.hpp Updated tokenize_column_descriptor method signature to return bool with [[nodiscard]] attribute.
components/core/src/clp_s/clp-s.cpp Improved error handling in search_archive function; checks tokenization result and logs errors.
components/core/src/clp_s/search/kql/Kql.g4 Modified SPECIAL_CHARACTER fragment to include a period (.) as a valid character.
components/core/src/clp_s/search/kql/kql.cpp Enhanced error handling in visitColumn and parse_kql_expression methods; added exception handling.
components/core/tests/test-kql.cpp Added new test cases for handling escape sequences in column names in KQL parsing.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant KQLParser
    participant StringUtils
    participant Logger

    User->>KQLParser: Parse KQL Expression
    KQLParser->>StringUtils: tokenize_column_descriptor(column)
    alt Tokenization Success
        StringUtils-->>KQLParser: Tokens
        KQLParser->>User: Return Parsed Expression
    else Tokenization Failure
        StringUtils-->>KQLParser: false
        KQLParser->>Logger: Log Error
        KQLParser-->>User: Return nullptr
    end
Loading

Possibly related PRs

Suggested reviewers

  • kirkrodrigues

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@gibber9809 gibber9809 linked an issue Oct 21, 2024 that may be closed by this pull request
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (7)
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

47-49: Improved error handling for tokenization failure.

The addition of this error check enhances the robustness of the read_new_entries method by properly handling tokenization failures. This aligns well with the PR objectives and improves the overall error handling in the system.

Consider adding more context to the error message to aid in debugging:

-            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__);
+            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__,
+                "Failed to tokenize column descriptor: " + entry.get_key_name());

This change would provide more specific information about which key name caused the tokenization failure.

components/core/src/clp_s/search/kql/kql.cpp (2)

115-118: Improved error handling for invalid column descriptors. Consider adding more context to the error message.

The changes enhance error handling by checking the return value of tokenize_column_descriptor and logging an error message when tokenization fails. This aligns well with the PR objective of improving error handling for invalid KQL key names.

Consider adding more context to the error message to aid in debugging:

-SPDLOG_ERROR("Can not tokenize invalid column: \"{}\"", column);
+SPDLOG_ERROR("Failed to tokenize column in visitColumn: \"{}\"", column);

254-258: Enhanced exception handling in parse_kql_expression. Consider logging the caught exception.

The addition of a try-catch block improves the robustness of the parse_kql_expression function by handling potential exceptions during parsing. This change aligns with the PR objective of enhancing error handling for invalid KQL queries.

Consider logging the caught exception to aid in debugging:

 try {
     return std::any_cast<std::shared_ptr<Expression>>(visitor.visitStart(tree));
 } catch (std::exception& e) {
+    SPDLOG_ERROR("Exception caught in parse_kql_expression: {}", e.what());
     return {};
 }
components/core/tests/test-kql.cpp (2)

191-213: Well-structured test section for escaped sequences. Consider expanding test coverage.

The new test section "Escape sequences in column name" is well-structured and effectively tests the parsing of queries with escaped characters in column names. The use of the GENERATE macro for creating various test cases is a good practice.

However, consider the following suggestions to enhance the test coverage:

  1. Add test cases for error scenarios, such as improperly escaped sequences or unmatched escape characters.
  2. Include edge cases, like multiple consecutive escape characters or escaped characters at the beginning or end of the column name.

Would you like assistance in generating additional test cases for error scenarios and edge cases?


209-209: Consider removing or conditionally enabling logging statements.

The SPDLOG_INFO statements are useful for debugging but may clutter the test output in normal runs. Consider either removing these logging statements or wrapping them in a conditional compile directive or a debug flag.

Here's a suggestion to conditionally enable the logging:

-        SPDLOG_INFO("{}", it->get_token());
+        #ifdef DEBUG_KQL_TESTS
+        SPDLOG_DEBUG("Token: {}", it->get_token());
+        #endif

Apply this change to both logging statements.

Also applies to: 211-211

components/core/src/clp_s/Utils.hpp (1)

215-217: Excellent improvement to error handling. Consider enhancing the method description.

The change from void to [[nodiscard]] static bool is a great improvement for error handling. The [[nodiscard]] attribute ensures that callers handle the return value, which is crucial for proper error management.

To further improve clarity, consider updating the method description to explicitly mention the return value's meaning. For example:

 /**
  * Converts a string column descriptor delimited by '.' into a list of tokens
  * @param descriptor
  * @param tokens
- * @return the list of tokens pushed into the 'tokens' parameter
- * @return true if the descriptor was tokenized successfully, false otherwise
+ * @param tokens Output parameter to store the list of tokens
+ * @return true if the descriptor was tokenized successfully, false if an error occurred (e.g., invalid escape sequence)
  */
components/core/src/clp_s/clp-s.cpp (1)

193-196: Improved error handling for column tokenization. Consider enhancing the error message.

The changes effectively handle tokenization failures and prevent further processing with invalid data. This aligns well with the PR objectives of enhancing error handling for invalid KQL key names.

Consider formatting the error message to make it more readable in log files:

-                SPDLOG_ERROR("Can not tokenize invalid column: \"{}\"", column);
+                SPDLOG_ERROR("Cannot tokenize invalid column: '{}'", column);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e21672b and 33d13f0.

📒 Files selected for processing (8)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryReader.cpp (1 hunks)
  • components/core/src/clp_s/Utils.cpp (1 hunks)
  • components/core/src/clp_s/Utils.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/src/clp_s/search/kql/Kql.g4 (1 hunks)
  • components/core/src/clp_s/search/kql/kql.cpp (2 hunks)
  • components/core/tests/test-kql.cpp (1 hunks)
🧰 Additional context used
📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (2)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
🔇 Additional comments (3)
components/core/src/clp_s/search/kql/Kql.g4 (1)

99-99: LGTM! Consider verifying impact and updating docs.

The addition of the period (.) to the set of special characters aligns well with the PR objective to support escaping characters in KQL key names. This change enables the escaping of periods in key names, which is crucial for the desired functionality.

To ensure this change doesn't inadvertently affect existing queries, please run the following verification:

Consider updating the KQL documentation to reflect this change, especially any sections that discuss special characters or escaping in key names.

✅ Verification successful

Change Verified: No Impact on Existing Queries

The addition of the period (.) to the set of special characters has been verified and does not affect existing KQL, SQL, or text queries.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for existing KQL queries that might be affected by the new special character treatment.

# Test: Look for unescaped periods in column names or query structures
rg -g '*.kql' -g '*.sql' -g '*.txt' '(?<![\\])\.[a-zA-Z0-9_]+'

# Note: Review any matches to ensure they don't represent valid use cases that might now require escaping.

Length of output: 293


Script:

#!/bin/bash
# Description: Search for existing KQL queries that might be affected by the new special character treatment using PCRE2.

# Test: Look for unescaped periods in column names or query structures
rg --pcre2 -g '*.kql' -g '*.sql' -g '*.txt' '(?<![\\])\.[a-zA-Z0-9_]+'

# Note: Review any matches to ensure they don't represent valid use cases that might now require escaping.

Length of output: 94005

components/core/src/clp_s/search/kql/kql.cpp (1)

Line range hint 1-258: Summary: Improved error handling aligns well with PR objectives

The changes in this file significantly enhance error handling for invalid KQL key names and queries, which directly addresses the PR objectives. The modifications in both the visitColumn method and the parse_kql_expression function contribute to a more robust and error-resistant implementation.

The suggested improvements for more detailed error logging would further enhance debugging capabilities without affecting the core functionality.

components/core/src/clp_s/JsonParser.cpp (1)

24-29: Consider aligning error handling with existing practices in JsonParser.cpp.

While adding error handling for m_timestamp_key parsing enhances robustness, please ensure that throwing an exception here is consistent with the existing error handling strategy in JsonParser.cpp. Previously, we agreed to maintain a mix of try-catch blocks and if-statements to allow specific error messages to be returned appropriately.

Run the following script to verify the error handling practices in JsonParser.cpp:

✅ Verification successful

Error handling is consistent with existing practices in JsonParser.cpp.

The addition of the throw statement aligns with the file's established use of a mix of try-catch blocks and conditional error checks.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for 'throw' statements in 'JsonParser.cpp' to check consistency in error handling.

# Test: List all instances of 'throw' in 'JsonParser.cpp'. Expect: Error handling is consistent across the file.
rg 'throw' components/core/src/clp_s/JsonParser.cpp -n

Length of output: 236


Script:

#!/bin/bash
# Description: Search for try-catch blocks and alternative error handling in 'JsonParser.cpp' to evaluate consistency.

# Search for 'try' and 'catch' blocks
rg -e 'try' -e 'catch' components/core/src/clp_s/JsonParser.cpp -n

# Search for alternative error handling mechanisms, such as error codes or error objects
rg -e 'ErrorCode' -e 'error_' -e 'return' components/core/src/clp_s/JsonParser.cpp -n

Length of output: 1732

components/core/src/clp_s/Utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/Utils.cpp Show resolved Hide resolved
components/core/src/clp_s/Utils.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/Utils.cpp Show resolved Hide resolved
components/core/src/clp_s/Utils.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/Utils.cpp (1)

434-434: Address the TODO comment for Unicode support

The TODO comment indicates that Unicode sequence support is still pending. Consider creating a separate task or issue to track this enhancement.

Would you like me to create a GitHub issue to track the implementation of Unicode sequence support?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 33d13f0 and c84c771.

📒 Files selected for processing (1)
  • components/core/src/clp_s/Utils.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (2)
components/core/src/clp_s/Utils.cpp (2)

430-455: LGTM: Improved tokenization with escape handling

The new implementation of tokenize_column_descriptor correctly handles escaped characters and provides better error reporting. The change from void to bool return type is a good improvement for error handling.


450-451: Reconsider handling of empty final token

The current implementation returns false if the final token is empty. However, this might be too strict for some valid cases, such as when a descriptor ends with a period.

Consider if this behaviour aligns with the intended use cases. If empty final tokens should be allowed, you could modify the code as follows:

-if (cur_tok.empty()) {
-    return false;
-}
+if (!cur_tok.empty()) {
+    tokens.push_back(cur_tok);
+}
-tokens.push_back(cur_tok);
 return true;

To verify the impact of this change, we can search for usages of this function:

components/core/src/clp_s/Utils.cpp Show resolved Hide resolved
@gibber9809 gibber9809 changed the title Add support for escaping characters in KQL key names clp-s: Add support for escaping characters in KQL key names Oct 21, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
components/core/tests/test-kql.cpp (1)

209-209: Remove debug logging statements.

The SPDLOG_INFO statements appear to be debugging aids that were left in the test code. These should be removed as they add noise to the test output.

-        SPDLOG_INFO("{}", it->get_token());
         REQUIRE(DescriptorToken{"a.b"} == *it++);
-        SPDLOG_INFO("{}", it->get_token());

Also applies to: 211-211

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 723d49d and e8cb65c.

📒 Files selected for processing (1)
  • components/core/tests/test-kql.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (2)
components/core/tests/test-kql.cpp (2)

191-213: Well-structured test cases for escaped column names.

The test cases effectively validate the new escape sequence functionality:

  • Covers both quoted and unquoted column names
  • Verifies nested object notation
  • Thoroughly validates the parsed results
  • Ensures correct tokenization of escaped dots

215-229: Comprehensive coverage of invalid escape sequences.

Excellent test coverage for illegal escape sequences:

  • Tests backslash escaping of non-special characters
  • Verifies handling of trailing backslashes
  • Includes cases with unescaped dots
  • Tests both quoted and unquoted variants

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
components/core/tests/test-kql.cpp (3)

209-209: Remove debug logging statements.

The SPDLOG_INFO statements appear to be debugging artifacts and should be removed from the test code.

-        SPDLOG_INFO("{}", it->get_token());
         REQUIRE(DescriptorToken{"a.b"} == *it++);
-        SPDLOG_INFO("{}", it->get_token());
         REQUIRE(DescriptorToken{"c"} == *it++);

Also applies to: 211-211


192-197: Consider adding more test cases for escaped characters.

The current test cases focus on escaping dots, but we should also test:

  • Multiple escaped characters (e.g., "a\\.b\\.c")
  • Escaped characters at the start/end of the key
  • Mixed escaped and unescaped special characters
GENERATE(
        "a\\.b.c: *",
        "\"a\\.b.c\": *",
        "a\\.b: {c: *}",
        "\"a\\.b\": {\"c\": *}",
+       "a\\.b\\.c: *",           // Multiple escaped dots
+       "\\.abc.def: *",          // Escaped dot at start
+       "abc\\.def\\.: *",        // Escaped dot at end
+       "a\\.b\\:c.d: *"          // Mixed escaped characters
)

216-225: Enhance error case coverage and validation.

While the current test cases cover basic invalid scenarios, consider:

  1. Adding more error cases:
    • Trailing backslash at the end of key
    • Multiple consecutive backslashes
    • Invalid escape sequences with special characters
  2. Validating error messages to ensure proper user feedback
GENERATE(
        //"a\\:*", this case is technically legal since ':' gets escaped
        "\"a\\\":*",
        "a\\ :*",
        "\"a\\\" :*",
        "a.:*",
        "\"a.\":*",
        "a. :*",
        "\"a.\" :*",
+       "abc\\: *",              // Trailing backslash
+       "a\\\\\\b: *",           // Multiple consecutive backslashes
+       "a\\x.b: *"              // Invalid escape sequence
)

Consider adding error message validation:

stringstream illegal_escape{query};
auto filter = parse_kql_expression(illegal_escape);
REQUIRE(nullptr == filter);
+ // Verify error message
+ std::string error_msg = get_last_error_message();
+ REQUIRE(error_msg.find("Invalid escape sequence") != std::string::npos);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e8cb65c and 3714295.

📒 Files selected for processing (1)
  • components/core/tests/test-kql.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (1)
components/core/tests/test-kql.cpp (1)

191-229: LGTM! The test coverage for escape sequences is comprehensive.

The implementation effectively validates both valid and invalid escape sequences in KQL key names. The test structure follows the existing patterns and maintains good readability.

@kirkrodrigues kirkrodrigues requested a review from wraymo November 12, 2024 16:17
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

REQUIRE(FilterOperation::EQ == filter->get_operation());
REQUIRE(2 == filter->get_column()->get_descriptor_list().size());
auto it = filter->get_column()->descriptor_begin();
SPDLOG_INFO("{}", it->get_token());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to print logs for the test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Will remove this.

@@ -427,18 +427,31 @@ bool StringUtils::convert_string_to_double(std::string const& raw, double& conve
return true;
}

void StringUtils::tokenize_column_descriptor(
bool StringUtils::tokenize_column_descriptor(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think .a is an illegal case at this step, or leave it to populate_column_mapping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree that it is. Currently the code only complains if the last token is "", but I'll change it to complain if any token is "" and add some tests.

//"a\\:*", this case is technically legal since ':' gets escaped
"\"a\\\":*",
"a\\ :*",
"\"a\\\" :*",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between line 218 and 220 (or line 219 and 221)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming that the tokenizing done by ANTLR respects the space really. E.g. if the space were (unexpectedly) included in the column token then it would actually be legal, so we're confirming that this is not the case.

@@ -212,8 +212,9 @@ class StringUtils {
* @param descriptor
* @param tokens
* @return the list of tokens pushed into the 'tokens' parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use @return here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking that out.

@gibber9809 gibber9809 requested a review from wraymo November 12, 2024 21:04
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
components/core/tests/test-kql.cpp (3)

191-211: LGTM! Consider adding more escape sequence test cases.

The test cases effectively validate the parsing of escaped dots in column names. However, consider adding test cases for:

  • Multiple escaped dots (e.g., a\\.b\\.c)
  • Escaped dots at the start/end of tokens
  • Mixed escaped and unescaped dots

215-215: Document why this test case is commented out.

The comment suggests that escaping ':' is technically legal, but this important edge case should be documented more clearly.

Add a more detailed comment explaining:

-                //"a\\:*", this case is technically legal since ':' gets escaped
+                // Note: "a\\:*" is intentionally excluded as escaping ':' is a valid use case.
+                // This differs from other special characters as ':' serves as a key-value separator
+                // in KQL syntax.

217-222: Consolidate redundant test cases.

Several test cases differ only in whitespace or quotation marks. Consider consolidating them to improve test maintainability:

-                "a\\ :*",
-                "\"a\\\" :*",
-                "a.:*",
-                "\"a.\":*",
-                "a. :*",
-                "\"a.\" :*"
+                "a\\ :*",      // Trailing backslash
+                "a.:*",        // Unescaped dot
+                "\"a.\" :*"    // Quoted unescaped dot
components/core/src/clp_s/Utils.cpp (2)

434-434: Consider tracking Unicode support as a separate issue

The TODO comment indicates future support for Unicode escape sequences. While not critical for the current implementation, it would be beneficial to track this as a separate issue for future enhancement.

Would you like me to create a GitHub issue to track the Unicode support implementation?


442-449: Consider enhancing error feedback

While the implementation correctly rejects empty tokens as discussed in previous reviews, consider adding descriptive error messages to help users understand why their KQL key names are invalid.

 if (cur_tok.empty()) {
-    return false;
+    // Provide context about the error
+    SPDLOG_ERROR("Invalid KQL key name: empty segment at position {}", cur);
+    return false;
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3714295 and a62542b.

📒 Files selected for processing (5)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
  • components/core/src/clp_s/Utils.cpp (1 hunks)
  • components/core/src/clp_s/Utils.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/tests/test-kql.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/clp-s.cpp
🔇 Additional comments (3)
components/core/tests/test-kql.cpp (1)

229-234: LGTM! Empty token test cases are comprehensive.

The test cases effectively cover all scenarios where empty tokens could appear in column names, including leading, trailing, and consecutive dots.

components/core/src/clp_s/Utils.cpp (2)

437-441: LGTM: Escape sequence handling is robust

The implementation correctly handles:

  • Boundary checks for trailing backslashes
  • Character preservation after escape sequences
  • Error conditions for invalid escape sequences

450-458: LGTM: Token validation and accumulation

The implementation properly:

  • Validates the final token
  • Prevents trailing empty segments
  • Returns appropriate success/failure status

Comment on lines +430 to +458
bool StringUtils::tokenize_column_descriptor(
std::string const& descriptor,
std::vector<std::string>& tokens
) {
// TODO: handle escaped . correctly
auto start = 0U;
auto end = descriptor.find('.');
while (end != std::string::npos) {
tokens.push_back(descriptor.substr(start, end - start));
start = end + 1;
end = descriptor.find('.', start);
// TODO: add support for unicode sequences e.g. \u263A
std::string cur_tok;
for (size_t cur = 0; cur < descriptor.size(); ++cur) {
if ('\\' == descriptor[cur]) {
++cur;
if (cur >= descriptor.size()) {
return false;
}
} else if ('.' == descriptor[cur]) {
if (cur_tok.empty()) {
return false;
}
tokens.push_back(cur_tok);
cur_tok.clear();
continue;
}
cur_tok.push_back(descriptor[cur]);
}
tokens.push_back(descriptor.substr(start));

if (cur_tok.empty()) {
return false;
}

tokens.push_back(cur_tok);
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Add comprehensive tests for tokenize_column_descriptor

Ensure that various KQL key patterns, including those with escaped characters like periods, are thoroughly tested.

  • Implement test cases covering different escaped sequences and edge cases in the relevant test suite.
🔗 Analysis chain

Verify handling of various KQL key patterns

Let's verify the implementation handles various key patterns correctly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for test cases covering various KQL key patterns
# Expected: Find comprehensive test coverage for escaped characters

# Search for test cases
rg -A 5 "TEST.*tokenize_column_descriptor" 

# Search for actual usage patterns
rg -A 3 "tokenize_column_descriptor\(" --type cpp

Length of output: 2291

@kirkrodrigues kirkrodrigues changed the title clp-s: Add support for escaping characters in KQL key names clp-s: Add support for escaping characters in KQL key names. Nov 13, 2024
@gibber9809 gibber9809 changed the title clp-s: Add support for escaping characters in KQL key names. feat: Add support for escaping characters in KQL key names. Nov 13, 2024
@gibber9809 gibber9809 merged commit 53c4f52 into y-scope:main Nov 13, 2024
21 of 22 checks passed
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

can not serach on nesteld json
2 participants