feat: Add support for escaping characters in KQL key names. #560

gibber9809 · 2024-10-21T17:52:44Z

Description

This PR adds support for using backslash to escape characters in KQL key names. The change is mostly contained in the StringUtiles::tokenize_column_descriptor function, but does involve a small change to the kql.g4 grammar to allow the '.' character to be escaped.

For example the key consisting of the tokens {"com.bnn", "uuid"} can now be specified as "com\.bnn.uuid" in KQL.

This PR also adds error handling for invalid KQL key names. Invalid key names currently include keys ending ending in a trailing unescaped '' or '.'.

Validation performed

Added unit tests for parsing keys with escaped characters
Confirmed that sensible errors are shown on the command line when entering invalid KQL queries

Summary by CodeRabbit

New Features
- Enhanced error handling in JSON parsing and KQL expression parsing.
- Added support for escape sequences in column names within KQL.
- Updated KQL grammar to recognize periods as special characters.
Bug Fixes
- Improved robustness of tokenization methods to prevent failures from propagating.
Tests
- Introduced new test cases for validating the parsing of queries with escaped characters in column names.
- Added tests for handling illegal escape sequences in column names.
- Expanded coverage for cases with empty tokens in column names.

coderabbitai · 2024-10-21T17:52:51Z

Walkthrough

The changes across multiple files enhance error handling and validation processes in the parsing and processing of JSON and KQL data. Key modifications include improved error logging and exception handling in the JsonParser and TimestampDictionaryReader. The tokenize_column_descriptor method's return type was changed to bool in Utils, allowing for better error feedback. Additionally, the KQL grammar was updated to recognize periods as special characters, and new tests were added to validate parsing with escape sequences. Overall, the changes focus on robustness and error management.

Changes

File Path	Change Summary
`components/core/src/clp_s/JsonParser.cpp`	Enhanced error handling in constructor and `parse` method; added logging and exception handling.
`components/core/src/clp_s/TimestampDictionaryReader.cpp`	Added error handling in `read_new_entries` method for tokenization failures.
`components/core/src/clp_s/Utils.cpp`	Changed return type of `tokenize_column_descriptor` from `void` to `bool` and improved functionality.
`components/core/src/clp_s/Utils.hpp`	Updated `tokenize_column_descriptor` method signature to return `bool` with `[[nodiscard]]` attribute.
`components/core/src/clp_s/clp-s.cpp`	Improved error handling in `search_archive` function; checks tokenization result and logs errors.
`components/core/src/clp_s/search/kql/Kql.g4`	Modified `SPECIAL_CHARACTER` fragment to include a period (`.`) as a valid character.
`components/core/src/clp_s/search/kql/kql.cpp`	Enhanced error handling in `visitColumn` and `parse_kql_expression` methods; added exception handling.
`components/core/tests/test-kql.cpp`	Added new test cases for handling escape sequences in column names in KQL parsing.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant KQLParser
    participant StringUtils
    participant Logger

    User->>KQLParser: Parse KQL Expression
    KQLParser->>StringUtils: tokenize_column_descriptor(column)
    alt Tokenization Success
        StringUtils-->>KQLParser: Tokens
        KQLParser->>User: Return Parsed Expression
    else Tokenization Failure
        StringUtils-->>KQLParser: false
        KQLParser->>Logger: Log Error
        KQLParser-->>User: Return nullptr
    end

Possibly related PRs

feat(ffi): Update IR stream protocol version handling in preparation for releasing the kv-pair IR stream format: #573: The changes in this PR involve enhancing error handling and validation related to protocol version management, which aligns with the error handling improvements made in the main PR's JsonParser.cpp, particularly in how errors are logged and managed during parsing processes.

Suggested reviewers

kirkrodrigues

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (7)

components/core/src/clp_s/TimestampDictionaryReader.cpp (1)
47-49: Improved error handling for tokenization failure.

The addition of this error check enhances the robustness of the read_new_entries method by properly handling tokenization failures. This aligns well with the PR objectives and improves the overall error handling in the system.

Consider adding more context to the error message to aid in debugging:
-            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__);
+            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__,
+                "Failed to tokenize column descriptor: " + entry.get_key_name());
This change would provide more specific information about which key name caused the tokenization failure.
components/core/src/clp_s/search/kql/kql.cpp (2)
115-118: Improved error handling for invalid column descriptors. Consider adding more context to the error message.

The changes enhance error handling by checking the return value of tokenize_column_descriptor and logging an error message when tokenization fails. This aligns well with the PR objective of improving error handling for invalid KQL key names.

Consider adding more context to the error message to aid in debugging:
-SPDLOG_ERROR("Can not tokenize invalid column: \"{}\"", column);
+SPDLOG_ERROR("Failed to tokenize column in visitColumn: \"{}\"", column);
254-258: Enhanced exception handling in parse_kql_expression. Consider logging the caught exception.

The addition of a try-catch block improves the robustness of the parse_kql_expression function by handling potential exceptions during parsing. This change aligns with the PR objective of enhancing error handling for invalid KQL queries.

Consider logging the caught exception to aid in debugging:
 try {
     return std::any_cast<std::shared_ptr<Expression>>(visitor.visitStart(tree));
 } catch (std::exception& e) {
+    SPDLOG_ERROR("Exception caught in parse_kql_expression: {}", e.what());
     return {};
 }
components/core/tests/test-kql.cpp (2)
191-213: Well-structured test section for escaped sequences. Consider expanding test coverage.

The new test section "Escape sequences in column name" is well-structured and effectively tests the parsing of queries with escaped characters in column names. The use of the GENERATE macro for creating various test cases is a good practice.

However, consider the following suggestions to enhance the test coverage:

Add test cases for error scenarios, such as improperly escaped sequences or unmatched escape characters.

Include edge cases, like multiple consecutive escape characters or escaped characters at the beginning or end of the column name.

Would you like assistance in generating additional test cases for error scenarios and edge cases?

209-209: Consider removing or conditionally enabling logging statements.

The SPDLOG_INFO statements are useful for debugging but may clutter the test output in normal runs. Consider either removing these logging statements or wrapping them in a conditional compile directive or a debug flag.

Here's a suggestion to conditionally enable the logging:
-        SPDLOG_INFO("{}", it->get_token());
+        #ifdef DEBUG_KQL_TESTS
+        SPDLOG_DEBUG("Token: {}", it->get_token());
+        #endif
Apply this change to both logging statements.

Also applies to: 211-211
components/core/src/clp_s/Utils.hpp (1)
215-217: Excellent improvement to error handling. Consider enhancing the method description.

The change from void to [[nodiscard]] static bool is a great improvement for error handling. The [[nodiscard]] attribute ensures that callers handle the return value, which is crucial for proper error management.

To further improve clarity, consider updating the method description to explicitly mention the return value's meaning. For example:
 /**
  * Converts a string column descriptor delimited by '.' into a list of tokens
  * @param descriptor
  * @param tokens
- * @return the list of tokens pushed into the 'tokens' parameter
- * @return true if the descriptor was tokenized successfully, false otherwise
+ * @param tokens Output parameter to store the list of tokens
+ * @return true if the descriptor was tokenized successfully, false if an error occurred (e.g., invalid escape sequence)
  */
components/core/src/clp_s/clp-s.cpp (1)
193-196: Improved error handling for column tokenization. Consider enhancing the error message.

The changes effectively handle tokenization failures and prevent further processing with invalid data. This aligns well with the PR objectives of enhancing error handling for invalid KQL key names.

Consider formatting the error message to make it more readable in log files:
-                SPDLOG_ERROR("Can not tokenize invalid column: \"{}\"", column);
+                SPDLOG_ERROR("Cannot tokenize invalid column: '{}'", column);

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e21672b and 33d13f0.

📒 Files selected for processing (8)

components/core/src/clp_s/JsonParser.cpp (1 hunks)
components/core/src/clp_s/TimestampDictionaryReader.cpp (1 hunks)
components/core/src/clp_s/Utils.cpp (1 hunks)
components/core/src/clp_s/Utils.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)
components/core/src/clp_s/search/kql/Kql.g4 (1 hunks)
components/core/src/clp_s/search/kql/kql.cpp (2 hunks)
components/core/tests/test-kql.cpp (1 hunks)

🧰 Additional context used

📓 Learnings (1)

components/core/src/clp_s/JsonParser.cpp (2)

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

🔇 Additional comments (3)

components/core/src/clp_s/search/kql/Kql.g4 (1)
99-99: LGTM! Consider verifying impact and updating docs.

The addition of the period (.) to the set of special characters aligns well with the PR objective to support escaping characters in KQL key names. This change enables the escaping of periods in key names, which is crucial for the desired functionality.

To ensure this change doesn't inadvertently affect existing queries, please run the following verification:

Consider updating the KQL documentation to reflect this change, especially any sections that discuss special characters or escaping in key names.

✅ Verification successful

Change Verified: No Impact on Existing Queries

The addition of the period (.) to the set of special characters has been verified and does not affect existing KQL, SQL, or text queries.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Search for existing KQL queries that might be affected by the new special character treatment.

# Test: Look for unescaped periods in column names or query structures
rg -g '*.kql' -g '*.sql' -g '*.txt' '(?<![\\])\.[a-zA-Z0-9_]+'

# Note: Review any matches to ensure they don't represent valid use cases that might now require escaping.
Length of output: 293

Script:
#!/bin/bash
# Description: Search for existing KQL queries that might be affected by the new special character treatment using PCRE2.

# Test: Look for unescaped periods in column names or query structures
rg --pcre2 -g '*.kql' -g '*.sql' -g '*.txt' '(?<![\\])\.[a-zA-Z0-9_]+'

# Note: Review any matches to ensure they don't represent valid use cases that might now require escaping.
Length of output: 94005
components/core/src/clp_s/search/kql/kql.cpp (1)

Line range hint 1-258: Summary: Improved error handling aligns well with PR objectives

The changes in this file significantly enhance error handling for invalid KQL key names and queries, which directly addresses the PR objectives. The modifications in both the visitColumn method and the parse_kql_expression function contribute to a more robust and error-resistant implementation.

The suggested improvements for more detailed error logging would further enhance debugging capabilities without affecting the core functionality.
components/core/src/clp_s/JsonParser.cpp (1)
24-29: Consider aligning error handling with existing practices in JsonParser.cpp.

While adding error handling for m_timestamp_key parsing enhances robustness, please ensure that throwing an exception here is consistent with the existing error handling strategy in JsonParser.cpp. Previously, we agreed to maintain a mix of try-catch blocks and if-statements to allow specific error messages to be returned appropriately.

Run the following script to verify the error handling practices in JsonParser.cpp:

✅ Verification successful

Error handling is consistent with existing practices in JsonParser.cpp.

The addition of the throw statement aligns with the file's established use of a mix of try-catch blocks and conditional error checks.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Search for 'throw' statements in 'JsonParser.cpp' to check consistency in error handling.

# Test: List all instances of 'throw' in 'JsonParser.cpp'. Expect: Error handling is consistent across the file.
rg 'throw' components/core/src/clp_s/JsonParser.cpp -n
Length of output: 236

Script:
#!/bin/bash
# Description: Search for try-catch blocks and alternative error handling in 'JsonParser.cpp' to evaluate consistency.

# Search for 'try' and 'catch' blocks
rg -e 'try' -e 'catch' components/core/src/clp_s/JsonParser.cpp -n

# Search for alternative error handling mechanisms, such as error codes or error objects
rg -e 'ErrorCode' -e 'error_' -e 'return' components/core/src/clp_s/JsonParser.cpp -n
Length of output: 1732

components/core/src/clp_s/Utils.cpp

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

components/core/src/clp_s/Utils.cpp (1)

434-434: Address the TODO comment for Unicode support

The TODO comment indicates that Unicode sequence support is still pending. Consider creating a separate task or issue to track this enhancement.

Would you like me to create a GitHub issue to track the implementation of Unicode sequence support?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 33d13f0 and c84c771.

📒 Files selected for processing (1)

components/core/src/clp_s/Utils.cpp (1 hunks)

🧰 Additional context used

🔇 Additional comments (2)

components/core/src/clp_s/Utils.cpp (2)
430-455: LGTM: Improved tokenization with escape handling

The new implementation of tokenize_column_descriptor correctly handles escaped characters and provides better error reporting. The change from void to bool return type is a good improvement for error handling.

450-451: Reconsider handling of empty final token

The current implementation returns false if the final token is empty. However, this might be too strict for some valid cases, such as when a descriptor ends with a period.

Consider if this behaviour aligns with the intended use cases. If empty final tokens should be allowed, you could modify the code as follows:
-if (cur_tok.empty()) {
-    return false;
-}
+if (!cur_tok.empty()) {
+    tokens.push_back(cur_tok);
+}
-tokens.push_back(cur_tok);
 return true;
To verify the impact of this change, we can search for usages of this function:

components/core/src/clp_s/Utils.cpp

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

components/core/tests/test-kql.cpp (1)
209-209: Remove debug logging statements.

The SPDLOG_INFO statements appear to be debugging aids that were left in the test code. These should be removed as they add noise to the test output.
-        SPDLOG_INFO("{}", it->get_token());
         REQUIRE(DescriptorToken{"a.b"} == *it++);
-        SPDLOG_INFO("{}", it->get_token());
Also applies to: 211-211

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 723d49d and e8cb65c.

📒 Files selected for processing (1)

components/core/tests/test-kql.cpp (1 hunks)

🧰 Additional context used

🔇 Additional comments (2)

components/core/tests/test-kql.cpp (2)

191-213: Well-structured test cases for escaped column names.

The test cases effectively validate the new escape sequence functionality:

Covers both quoted and unquoted column names

Verifies nested object notation

Thoroughly validates the parsed results

Ensures correct tokenization of escaped dots

215-229: Comprehensive coverage of invalid escape sequences.

Excellent test coverage for illegal escape sequences:

Tests backslash escaping of non-special characters

Verifies handling of trailing backslashes

Includes cases with unescaped dots

Tests both quoted and unquoted variants

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

components/core/tests/test-kql.cpp (3)
209-209: Remove debug logging statements.

The SPDLOG_INFO statements appear to be debugging artifacts and should be removed from the test code.
-        SPDLOG_INFO("{}", it->get_token());
         REQUIRE(DescriptorToken{"a.b"} == *it++);
-        SPDLOG_INFO("{}", it->get_token());
         REQUIRE(DescriptorToken{"c"} == *it++);
Also applies to: 211-211

192-197: Consider adding more test cases for escaped characters.

The current test cases focus on escaping dots, but we should also test:

Multiple escaped characters (e.g., "a\\.b\\.c")

Escaped characters at the start/end of the key

Mixed escaped and unescaped special characters
GENERATE(
        "a\\.b.c: *",
        "\"a\\.b.c\": *",
        "a\\.b: {c: *}",
        "\"a\\.b\": {\"c\": *}",
+       "a\\.b\\.c: *",           // Multiple escaped dots
+       "\\.abc.def: *",          // Escaped dot at start
+       "abc\\.def\\.: *",        // Escaped dot at end
+       "a\\.b\\:c.d: *"          // Mixed escaped characters
)
216-225: Enhance error case coverage and validation.

While the current test cases cover basic invalid scenarios, consider:

Adding more error cases:

Trailing backslash at the end of key

Multiple consecutive backslashes

Invalid escape sequences with special characters

Validating error messages to ensure proper user feedback
GENERATE(
        //"a\\:*", this case is technically legal since ':' gets escaped
        "\"a\\\":*",
        "a\\ :*",
        "\"a\\\" :*",
        "a.:*",
        "\"a.\":*",
        "a. :*",
        "\"a.\" :*",
+       "abc\\: *",              // Trailing backslash
+       "a\\\\\\b: *",           // Multiple consecutive backslashes
+       "a\\x.b: *"              // Invalid escape sequence
)
Consider adding error message validation:
stringstream illegal_escape{query};
auto filter = parse_kql_expression(illegal_escape);
REQUIRE(nullptr == filter);
+ // Verify error message
+ std::string error_msg = get_last_error_message();
+ REQUIRE(error_msg.find("Invalid escape sequence") != std::string::npos);

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e8cb65c and 3714295.

📒 Files selected for processing (1)

components/core/tests/test-kql.cpp (1 hunks)

🧰 Additional context used

🔇 Additional comments (1)

components/core/tests/test-kql.cpp (1)

191-229: LGTM! The test coverage for escape sequences is comprehensive.

The implementation effectively validates both valid and invalid escape sequences in KQL key names. The test structure follows the existing patterns and maintains good readability.

wraymo

Good catch!

wraymo · 2024-11-12T16:33:36Z

components/core/tests/test-kql.cpp

+        REQUIRE(FilterOperation::EQ == filter->get_operation());
+        REQUIRE(2 == filter->get_column()->get_descriptor_list().size());
+        auto it = filter->get_column()->descriptor_begin();
+        SPDLOG_INFO("{}", it->get_token());


Do we need to print logs for the test case?

No. Will remove this.

wraymo · 2024-11-12T16:53:43Z

components/core/src/clp_s/Utils.cpp

@@ -427,18 +427,31 @@ bool StringUtils::convert_string_to_double(std::string const& raw, double& conve
    return true;
 }

-void StringUtils::tokenize_column_descriptor(
+bool StringUtils::tokenize_column_descriptor(


Do you think .a is an illegal case at this step, or leave it to populate_column_mapping?

I'd agree that it is. Currently the code only complains if the last token is "", but I'll change it to complain if any token is "" and add some tests.

wraymo · 2024-11-12T17:00:50Z

components/core/tests/test-kql.cpp

+                //"a\\:*", this case is technically legal since ':' gets escaped
+                "\"a\\\":*",
+                "a\\ :*",
+                "\"a\\\" :*",


What's the difference between line 218 and 220 (or line 219 and 221)?

Just confirming that the tokenizing done by ANTLR respects the space really. E.g. if the space were (unexpectedly) included in the column token then it would actually be legal, so we're confirming that this is not the case.

wraymo · 2024-11-12T20:05:16Z

components/core/src/clp_s/Utils.hpp

@@ -212,8 +212,9 @@ class StringUtils {
     * @param descriptor
     * @param tokens
     * @return the list of tokens pushed into the 'tokens' parameter


Should we use @return here?

Taking that out.

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)

components/core/tests/test-kql.cpp (3)
191-211: LGTM! Consider adding more escape sequence test cases.

The test cases effectively validate the parsing of escaped dots in column names. However, consider adding test cases for:

Multiple escaped dots (e.g., a\\.b\\.c)

Escaped dots at the start/end of tokens

Mixed escaped and unescaped dots

215-215: Document why this test case is commented out.

The comment suggests that escaping ':' is technically legal, but this important edge case should be documented more clearly.

Add a more detailed comment explaining:
-                //"a\\:*", this case is technically legal since ':' gets escaped
+                // Note: "a\\:*" is intentionally excluded as escaping ':' is a valid use case.
+                // This differs from other special characters as ':' serves as a key-value separator
+                // in KQL syntax.
217-222: Consolidate redundant test cases.

Several test cases differ only in whitespace or quotation marks. Consider consolidating them to improve test maintainability:
-                "a\\ :*",
-                "\"a\\\" :*",
-                "a.:*",
-                "\"a.\":*",
-                "a. :*",
-                "\"a.\" :*"
+                "a\\ :*",      // Trailing backslash
+                "a.:*",        // Unescaped dot
+                "\"a.\" :*"    // Quoted unescaped dot
components/core/src/clp_s/Utils.cpp (2)
434-434: Consider tracking Unicode support as a separate issue

The TODO comment indicates future support for Unicode escape sequences. While not critical for the current implementation, it would be beneficial to track this as a separate issue for future enhancement.

Would you like me to create a GitHub issue to track the Unicode support implementation?

442-449: Consider enhancing error feedback

While the implementation correctly rejects empty tokens as discussed in previous reviews, consider adding descriptive error messages to help users understand why their KQL key names are invalid.
 if (cur_tok.empty()) {
-    return false;
+    // Provide context about the error
+    SPDLOG_ERROR("Invalid KQL key name: empty segment at position {}", cur);
+    return false;
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3714295 and a62542b.

📒 Files selected for processing (5)

components/core/src/clp_s/JsonParser.cpp (1 hunks)
components/core/src/clp_s/Utils.cpp (1 hunks)
components/core/src/clp_s/Utils.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)
components/core/tests/test-kql.cpp (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

components/core/src/clp_s/JsonParser.cpp
components/core/src/clp_s/Utils.hpp
components/core/src/clp_s/clp-s.cpp

🔇 Additional comments (3)

components/core/tests/test-kql.cpp (1)

229-234: LGTM! Empty token test cases are comprehensive.

The test cases effectively cover all scenarios where empty tokens could appear in column names, including leading, trailing, and consecutive dots.

components/core/src/clp_s/Utils.cpp (2)

437-441: LGTM: Escape sequence handling is robust

The implementation correctly handles:

Boundary checks for trailing backslashes
Character preservation after escape sequences
Error conditions for invalid escape sequences

450-458: LGTM: Token validation and accumulation

The implementation properly:

Validates the final token
Prevents trailing empty segments
Returns appropriate success/failure status

coderabbitai · 2024-11-12T21:09:39Z

components/core/src/clp_s/Utils.cpp

+bool StringUtils::tokenize_column_descriptor(
        std::string const& descriptor,
        std::vector<std::string>& tokens
 ) {
-    // TODO: handle escaped . correctly
-    auto start = 0U;
-    auto end = descriptor.find('.');
-    while (end != std::string::npos) {
-        tokens.push_back(descriptor.substr(start, end - start));
-        start = end + 1;
-        end = descriptor.find('.', start);
+    // TODO: add support for unicode sequences e.g. \u263A
+    std::string cur_tok;
+    for (size_t cur = 0; cur < descriptor.size(); ++cur) {
+        if ('\\' == descriptor[cur]) {
+            ++cur;
+            if (cur >= descriptor.size()) {
+                return false;
+            }
+        } else if ('.' == descriptor[cur]) {
+            if (cur_tok.empty()) {
+                return false;
+            }
+            tokens.push_back(cur_tok);
+            cur_tok.clear();
+            continue;
+        }
+        cur_tok.push_back(descriptor[cur]);
    }
-    tokens.push_back(descriptor.substr(start));
+
+    if (cur_tok.empty()) {
+        return false;
+    }
+
+    tokens.push_back(cur_tok);
+    return true;


💡 Codebase verification

Add comprehensive tests for tokenize_column_descriptor

Ensure that various KQL key patterns, including those with escaped characters like periods, are thoroughly tested.

Implement test cases covering different escaped sequences and edge cases in the relevant test suite.

🔗 Analysis chain

Verify handling of various KQL key patterns

Let's verify the implementation handles various key patterns correctly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Search for test cases covering various KQL key patterns # Expected: Find comprehensive test coverage for escaped characters # Search for test cases rg -A 5 "TEST.*tokenize_column_descriptor" # Search for actual usage patterns rg -A 3 "tokenize_column_descriptor\(" --type cpp

Length of output: 2291

)

Add support for escaping characters in kql key names

33d13f0

gibber9809 linked an issue Oct 21, 2024 that may be closed by this pull request

can not serach on nesteld json #559

Closed

coderabbitai bot reviewed Oct 21, 2024

View reviewed changes

Minor cleanup

c84c771

coderabbitai bot reviewed Oct 21, 2024

View reviewed changes

components/core/src/clp_s/Utils.cpp Show resolved Hide resolved

gibber9809 changed the title ~~Add support for escaping characters in KQL key names~~ clp-s: Add support for escaping characters in KQL key names Oct 21, 2024

Add test for illegal cases

723d49d

gibber9809 assigned wraymo Oct 22, 2024

Fix test

e8cb65c

coderabbitai bot reviewed Oct 23, 2024

View reviewed changes

Fix unit test

3714295

coderabbitai bot reviewed Oct 23, 2024

View reviewed changes

kirkrodrigues requested a review from wraymo November 12, 2024 16:17

wraymo reviewed Nov 12, 2024

View reviewed changes

gibber9809 added 2 commits November 12, 2024 20:54

Address review comments

56b89c9

Merge remote-tracking branch 'upstream/main' into kql-escape-char-fix

a62542b

gibber9809 requested a review from wraymo November 12, 2024 21:04

coderabbitai bot reviewed Nov 12, 2024

View reviewed changes

wraymo approved these changes Nov 13, 2024

View reviewed changes

kirkrodrigues changed the title ~~clp-s: Add support for escaping characters in KQL key names~~ clp-s: Add support for escaping characters in KQL key names. Nov 13, 2024

gibber9809 changed the title ~~clp-s: Add support for escaping characters in KQL key names.~~ feat: Add support for escaping characters in KQL key names. Nov 13, 2024

gibber9809 merged commit 53c4f52 into y-scope:main Nov 13, 2024
21 of 22 checks passed

coderabbitai bot mentioned this pull request Nov 13, 2024

feat(clp-s): Add boilerplate for SQL parsing. #504

Merged

jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024

feat: Add support for escaping characters in KQL key names. (y-scope#560

2a81913

)

coderabbitai bot mentioned this pull request Dec 16, 2024

feat(clp-s): Unescape string values during ingestion and fix support for search using escape sequences. #622

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add support for escaping characters in KQL key names. #560

feat: Add support for escaping characters in KQL key names. #560

gibber9809 commented Oct 21, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 21, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

wraymo left a comment

wraymo Nov 12, 2024

gibber9809 Nov 12, 2024

wraymo Nov 12, 2024

gibber9809 Nov 12, 2024

wraymo Nov 12, 2024

gibber9809 Nov 12, 2024

wraymo Nov 12, 2024

gibber9809 Nov 12, 2024

coderabbitai bot left a comment

coderabbitai bot Nov 12, 2024

feat: Add support for escaping characters in KQL key names. #560

feat: Add support for escaping characters in KQL key names. #560

Conversation

gibber9809 commented Oct 21, 2024 • edited by coderabbitai bot Loading

Description

Validation performed

Summary by CodeRabbit

Summary by CodeRabbit

coderabbitai bot commented Oct 21, 2024 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

wraymo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Nov 12, 2024

Choose a reason for hiding this comment

gibber9809 commented Oct 21, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 21, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)