Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fillnull command introduced #723

Merged
merged 3 commits into from
Oct 8, 2024
Merged

Conversation

lukasz-soszynski-eliatra
Copy link
Contributor

@lukasz-soszynski-eliatra lukasz-soszynski-eliatra commented Oct 1, 2024

Description

Implementation of the fillnull command

Issues Resolved

#670

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@lukasz-soszynski-eliatra
Copy link
Contributor Author

Usage examples:

search source=users |  fillnull value = 'N/A' surname, nickname, first_name;
search source=users |  fillnull FIELDS surname='surname unknown'
search source=users |  fillnull FIELDS surname='surname unknown!', nickname='!nickname unknown!', first_name='?????';

@lukasz-soszynski-eliatra
Copy link
Contributor Author

I pushed the draft version of the fillnull command.

  • tests are lacking
  • syntax without providing column names is not supported
  • I changed a little bit a syntax because I was not able to force Antler to parse a query that uses syntax requested by the issue [FEATURE]Add fillnull command to PPL #670
  • the value that replaces nulls must currently be a literal expression. If support for expressions as a null replacement is needed, then I think that the command syntax should be changed from fillnull value = 'N/A' surname, nickname, first_name to e.g. fillnull with 'N/A' in surname, nickname. This should make the command more straightforward for the end user to understand.

@lukasz-soszynski-eliatra lukasz-soszynski-eliatra changed the title Fillnull command introducd Fillnull command introduced Oct 1, 2024
@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.6 labels Oct 1, 2024
@YANG-DB
Copy link
Member

YANG-DB commented Oct 1, 2024

I pushed the draft version of the fillnull command.

  • tests are lacking
  • syntax without providing column names is not supported
  • I changed a little bit a syntax because I was not able to force Antler to parse a query that uses syntax requested by the issue [FEATURE]Add fillnull command to PPL #670
  • the value that replaces nulls must currently be a literal expression. If support for expressions as a null replacement is needed, then I think that the command syntax should be changed from fillnull value = 'N/A' surname, nickname, first_name to e.g. fillnull with 'N/A' in surname, nickname. This should make the command more straightforward for the end user to understand.

@lukasz-soszynski-eliatra
Thanks for your feedback - I agree with the syntax changes - please add the proposed changes in a separate comment
In case column not specified an specific error should appear

@lukasz-soszynski-eliatra
Copy link
Contributor Author

fillnull syntax proposal

  1. Proposed syntax changes with null replacement with the same values in various fields

    • ... | fillnull with 0 in field1
    • ... | fillnull with 'N/A' in field1, field2, field3
    • ... | fillnull with 2*pi() + field1 in field2
    • ... | fillnull with concat(field1, field2) in field3, field4
    • ... | fillnull with 'N/A'
      • incorrect syntax
    • ... | fillnull with 'N/A' in
      • validation error related to missing columns
  2. Proposed syntax changes with null replacement with the various values in various fields

  • currently implemented, not conform to previous syntax proposal (fields addition)
    • ... | fillnull fields status_code=101
    • ... | fillnull fields request_path='/not_found', timestamp='*'
  • New syntax proposal
    • ... | fillnull using field1=101
    • ... | fillnull using field1=concat(field2, field3), field4=2*pi()*field5
    • ... | fillnull using field1=concat(field2, field3), field4=2*pi()*field5, field6 = 'N/A'
    • ... | fillnull using
      • validation error related to missing columns

New syntax definition in ANTLR

fillnullCommand
 : FILLNULL (fillNullWithTheSameValue
 | fillNullWithFieldVariousValues)
 ;

fillNullWithTheSameValue
: WITH nullReplacement IN nullableField (COMMA nullableField)*
;

fillNullWithFieldVariousValues
: USING nullableField EQUAL nullReplacement (COMMA nullableField EQUAL nullReplacement)*
;


 nullableField
 : fieldExpression
 ;

 nullReplacement
 : expression
 ;

@YANG-DB
Copy link
Member

YANG-DB commented Oct 3, 2024

@lukasz-soszynski-eliatra this looks great !
can u also plz resolve conflicts ?

@YANG-DB
Copy link
Member

YANG-DB commented Oct 3, 2024

@LantaoJin can you plz review the suggested syntax and add your comments ?
thanks

@lukasz-soszynski-eliatra lukasz-soszynski-eliatra force-pushed the fillnull branch 2 times, most recently from 1e711df to 6ade195 Compare October 3, 2024 10:45
@lukasz-soszynski-eliatra
Copy link
Contributor Author

@lukasz-soszynski-eliatra this looks great ! can u also plz resolve conflicts ?

Conflicts resolved.

@YANG-DB
Copy link
Member

YANG-DB commented Oct 4, 2024

@lukasz-soszynski-eliatra can u plz resolve latest conflicts ?
And plz review the new documentation page and validate if all the existing examples and specifications does fit to the actual code:

Signed-off-by: Lukasz Soszynski <lukasz.soszynski@eliatra.com>
…for the review.

Signed-off-by: Lukasz Soszynski <lukasz.soszynski@eliatra.com>
@lukasz-soszynski-eliatra
Copy link
Contributor Author

The conflicts have been resolved. Furthermore, the syntax of the rename command is correct.

Could you please clarify my doubt about the fillnull command?

  1. Has a decision been made related to syntax modification?
  2. The issue [FEATURE]Add fillnull command to PPL #670 contains the following requirement "It should support conditional filling based on other field values or expressions." Please elaborate a little more on this requirement.
  3. Another requirement from issue 670 is "Optimize performance for large datasets with many null values." Can you advise us on how to implement this requirement?
  4. The following requirement, "Consider type-checking or type-conversion for filled values" is also not implemented. As far as I am aware, integration between PPL and Spark is done on the UnresolvedPlans level, where type information is unavailable. There is a chance to implement this requirement with the usage of the following Spark expression assert_true(typeof (nullable_column)=typeof(null_replacement)) . However, an additional column will be created using the function assert_true. Should we try to implement the requirement by using the function assert_true?

@YANG-DB
Copy link
Member

YANG-DB commented Oct 7, 2024

@lukasz-soszynski-eliatra

  1. yes I support your suggestion - please continue
  2. lets postpone the conditional filling based on other field values ... requirement for now
  3. lets postpone the Optimize performance for large datasets with many null values... requirement for now
  4. for this issue lets reduce complexity as possible - I would also recommend to postpone this requirement

Signed-off-by: Lukasz Soszynski <lukasz.soszynski@eliatra.com>
@lukasz-soszynski-eliatra
Copy link
Contributor Author

@YANG-DB The new syntax has been introduced, and PR is ready for review.

@YANG-DB
Copy link
Member

YANG-DB commented Oct 8, 2024

@lukasz-soszynski-eliatra great job - Thanks !!

@lukasz-soszynski-eliatra
Copy link
Contributor Author

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants