-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SparkSql function to_pretty_string #10359
Add SparkSql function to_pretty_string #10359
Conversation
✅ Deploy Preview for meta-velox canceled.
|
SELECT toprettystring(null); -- "NULL" | ||
SELECT toprettystring(cast('abcd' as binary)); -- "[61 62 63 64 65 66]" | ||
|
||
.. spark:function:: toprettystring(x, timeZone) -> varchar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think timeZone
can be removed and let the above doc for toprettystring(x)
cover timestamp type. TimeZone is actually read from Velox config. And like other time/date function, Gluten will not directly pass it to native function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I change the implement but forget the document.
{prefix + "toprettystring"}); | ||
registerFunction<ToPrettyStringFunction, Varchar, float>( | ||
{prefix + "toprettystring"}); | ||
registerFunction<ToPrettyStringFunction, Varchar, double>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use a helper function for these numeric types. E.g., add registerUnaryNumeric
in RegistrationHelpers.h
, then use it here.
if constexpr (std::is_same_v<TInput, StringView>) { | ||
if (inputType_->isVarchar()) { | ||
result = input; | ||
} else if (inputType_->isVarbinary()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code for cast cannot be re-used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
template <typename TInput> | ||
void callNullable(out_type<Varchar>& result, const TInput* a) { | ||
if (a) { | ||
call(result, *a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As call
method co-exists, the non-null input will directly go into call
method and will not go here. Maybe, we can only keep callNullable
method with the impl. for call
covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. We can merge the implementation of call into callNullable.
template <typename TInput> | ||
void callNullable(out_type<Varchar>& result, const TInput* a) { | ||
if (a) { | ||
call(result, *a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
input, scale_, maxRowSize_, result.data()); | ||
if (view.isInline()) { | ||
result.setNoCopy(view); | ||
result.resize(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right. Why are we resizing to 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is right. The result is inline, so we don't need to allocate from buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we write some bytes to result.data()
indeed, so it looks not right to resize it to 0. Alternatively, we could write data to an inlined buffer and set it to result.
char inlined[StringView::kInlineSize];
result.setNoCopy(view); | ||
result.resize(0); | ||
} else { | ||
result.resize(view.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is not need for if-else block. You can just do this unconditionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for decimal, min long decimal value -1'000'000'000'000'000'000 + 1
with .
is not inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Added several comments.
@@ -283,6 +283,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa" | |||
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa" | |||
|
|||
.. spark:function:: toprettystring(x) -> varchar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function name suggestion: to_pretty_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is used in Spark plan, so I use this name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps follow Velox's style for the function name and add a mapping in compute engine.
|
||
- It prints binary values (either from column or struct field) using the hex format. :: | ||
|
||
SELECT toprettystring('spark'); -- "spark" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The queries fail in Spark as internal function cannot be called by user. Shall we remove from documentation if their tests are covered?
SELECT toprettystring('spark'); -- "spark" | ||
SELECT toprettystring(12); -- "12" | ||
SELECT toprettystring(null); -- "NULL" | ||
SELECT toprettystring(cast('abcd' as binary)); -- "[61 62 63 64 65 66]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the result [61 62 63 64]
?
SELECT toprettystring(12); -- "12" | ||
SELECT toprettystring(null); -- "NULL" | ||
SELECT toprettystring(cast('abcd' as binary)); -- "[61 62 63 64 65 66]" | ||
SELECT toprettystring('2000-01-01 12:21:56'); -- "2000-01-01 04:21:56" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps document how timezone is handled.
velox/expression/CastExpr-inl.h
Outdated
@@ -51,6 +51,9 @@ inline std::exception_ptr makeBadCastException( | |||
makeErrorMessage(input, row, resultType, errorDetails), | |||
false)); | |||
} | |||
} // namespace | |||
|
|||
namespace detail { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to extract convertToStringView out in a header to use it elsewhere.
APIs in 'detail' namespaces are not supposed to be used outside of the header/cpp files.
#10241 (comment)
input, scale_, maxRowSize_, result.data()); | ||
if (view.isInline()) { | ||
result.setNoCopy(view); | ||
result.resize(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we write some bytes to result.data()
indeed, so it looks not right to resize it to 0. Alternatively, we could write data to an inlined buffer and set it to result.
char inlined[StringView::kInlineSize];
setNoCopy will copy the data to StringView itself when it is inline. |
@jinchengchenghh I notice some bytes are written to |
@@ -283,6 +283,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa" | |||
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa" | |||
|
|||
.. spark:function:: toprettystring(x) -> varchar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps follow Velox's style for the function name and add a mapping in compute engine.
@@ -283,6 +283,16 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa" | |||
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa" | |||
|
|||
.. spark:function:: toprettystring(x) -> varchar | |||
|
|||
Returns pretty string for all scalar type. It has several differences with casting value to string: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document suggestion:
Returns pretty string for the input value ``x``. All scalar types are supported. There are several differences with the behavior of casting ``x`` as string.
Also notice Spark describes it to support all kinds of values, and are we choosing to support scalar types only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't support complex datatype now.
|
||
- It prints binary values (either from column or struct field) using the hex format. | ||
|
||
Considering the time zone if config ``session_timezone`` is set. :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to L288.
@@ -165,6 +166,29 @@ inline void registerArrayMinMaxFunctions(const std::string& prefix) { | |||
registerArrayMinMaxFunctions<Timestamp>(prefix); | |||
registerArrayMinMaxFunctions<Date>(prefix); | |||
} | |||
|
|||
void registerToPrettyStringFunctions(const std::string& prefix) { | |||
registerUnaryIntegralWithTReturn<ToPrettyStringFunction, Varchar>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: declare a const variable to avoid repeating "toprettystring".
} | ||
if constexpr (std::is_same_v<TInput, int32_t>) { | ||
if (inputType_->isDate()) { | ||
auto output = DATE()->toString(*input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conversion could also throw exception as it calls DateType::toIso8601
. We may need to handle the exception correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for your catch. I will change the return type to Status.
Spark will also throw exception when it cannot be valid Date, the LocalDate.ofEpochDay(days)
may throw exception. https://github.com/apache/spark/blob/branch-3.5/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L70
util::Converter<TypeKind::VARCHAR, void, util::SparkCastPolicy>:: | ||
tryCast(*input); | ||
if (castResult.hasError()) { | ||
result.setNoCopy(detail::kNull); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indicates for invaild input null is returned. Is that the case in Spark? Perhaps also document and test it.
/// toprettystring(x) -> varchar | ||
/// Returns pretty string for int8, int16, int32, int64, bool, Date, Varchar. It | ||
/// has one difference with casting value to string: | ||
/// - It prints null values (either from column or struct field) as "NULL". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Date -> date, Varchar -> varchar to align with others.
/// It has one difference with casting value to string:
/// - It prints null values (either from column or struct field) as "NULL".
Comment suggestion: The difference with casting input value as string is that "NULL" is returned for null inputs.
either from column or struct field
It seems complex type is not supported. Do we need this description?
/// Returns pretty string for int8, int16, int32, int64, bool, Date, Varchar. It | ||
/// has one difference with casting value to string: | ||
/// - It prints null values (either from column or struct field) as "NULL". | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove this empty line.
}; | ||
|
||
/// Returns pretty string for Varbinary. It has several differences with casting | ||
/// value to string: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Varbinary -> varbinary
There are several differences with cast(varbinary as string).
|
||
/// Returns pretty string for Varbinary. It has several differences with casting | ||
/// value to string: | ||
/// - It prints null values (either from column or struct field) as "NULL". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: -
> 1)
, perhaps remove either from column or struct field
as varbinary could not have struct field.
/// Returns pretty string for Varbinary. It has several differences with casting | ||
/// value to string: | ||
/// - It prints null values (either from column or struct field) as "NULL". | ||
/// - It prints binary values (either from column or struct field) using the hex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
/// - It prints null values (either from column or struct field) as "NULL". | ||
/// - It prints binary values (either from column or struct field) using the hex | ||
/// format. Returns a pretty string of the byte array which prints each byte as | ||
/// a hex digit and add spaces between them. For example, [1A C0]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns a pretty string of the byte array which prints each byte as a hex digit and add spaces between them. For example, [1A C0].
Comment suggestion:
/// The pretty string is composed of the hex digits of bytes and spaces between
/// them. E.g., the result of to_pretty_string("abc") is "[31 32 33]".
TypePtr inputType_; | ||
}; | ||
|
||
/// Returns pretty string for Varbinary. It has several differences with casting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the binary format is controlled by a config SQLConf.BINARY_OUTPUT_STYLE
, and shall we document that only hex format is supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Spark3.5 version, it use default hex format version. https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToPrettyString.scala#L52
char* pos = startPosition; | ||
*pos++ = '['; | ||
for (auto i = 0; i < input->size(); i++) { | ||
auto formated = fmt::format("{:X}", input->data()[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: formated -> formatted
d808d2b
to
ff52ca8
Compare
Address all the comments. |
@@ -342,6 +342,16 @@ String Functions | |||
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa" | |||
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa" | |||
|
|||
.. spark:function:: to_pretty_string(x) -> varchar | |||
|
|||
Returns pretty string for all scalar type. Considering the time zone if config ``session_timezone`` is set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document suggestion:
Returns pretty string for ``x``. All scalar types are supported. Adjusts the timestamp input to the given time zone if set through ``session_timezone`` config.
|
||
Returns pretty string for all scalar type. Considering the time zone if config ``session_timezone`` is set. | ||
|
||
It has several differences with casting value to string: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document suggestion:
The result is different from that of casting ``x`` as string in the following aspects.
|
||
It has several differences with casting value to string: | ||
|
||
- It prints null values as "NULL". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document suggestion:
It prints null input as "NULL" rather than producing null output.
/// to_pretty_string(x) -> varchar | ||
/// Returns pretty string for int8, int16, int32, int64, bool, Date, Varchar. It | ||
/// has one difference with casting value to string: | ||
/// 1) It prints null values as "NULL". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It prints null input as "NULL" rather than producing null output.
for (auto i = 0; i < input->size(); i++) { | ||
auto formatted = fmt::format("{:X}", input->data()[i]); | ||
*pos++ = formatted.data()[0]; | ||
*pos++ = formatted.data()[1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can use 'snprintf' to write to 'pos' directly and avoid creating intermediate string.
std::snprintf(pos, 2, "%02X", static_cast(input->data()[i]));
|
||
- It prints null values as "NULL". | ||
|
||
- It prints binary values using the hex format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It prints binary value as hex string representation rather than UTF-8.
2026b1a
to
77fd4a0
Compare
Timestamp::tsToStringView(inputValue, options_, result.data()); | ||
result.resize(stringView.size()); | ||
} catch (const std::exception& e) { | ||
return Status::Invalid(e.what()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try-catch is used for date and timestamp types. Wondering if it is preferred to make these APIs return status.
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
Spark implement: https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToPrettyString.scala
This is an internal spark function.