Implementing array_sum function #5

Open
wants to merge 1 commit into
base: main
from

jwyles-ahana

No description provided.

aditi-pandit pushed a commit that referenced this pull request May 11, 2022
…tor#1500)

Summary:
Enhance printExprWithStats to identify common-sub expressions.

For example, `c0 + c1` is a common sub-expression in
`"(c0 + c1) % 5", " (c0 + c1) % 3"` expression set. It is evaluated only once and
there is a single Expr object that represents it. That object appears in the
expression tree twice. printExprWithStats does not show the runtime stats for
the second instance of that expression and instead annotates it with `[CSE #2]`,
where CSE stands for common sub-expression and 2 refers to the first instance
of the expression.

```
mod [cpu time: 50.49us, rows: 1024] -> BIGINT [#1]
   cast(plus as BIGINT) [cpu time: 68.15us, rows: 1024] -> BIGINT [#2]
      plus [cpu time: 51.84us, rows: 1024] -> INTEGER [#3]
         c0 [cpu time: 0ns, rows: 0] -> INTEGER [#4]
         c1 [cpu time: 0ns, rows: 0] -> INTEGER [#5]
   5:BIGINT [cpu time: 0ns, rows: 0] -> BIGINT [#6]

mod [cpu time: 49.29us, rows: 1024] -> BIGINT [#7]
   cast((plus(c0, c1)) as BIGINT) -> BIGINT [CSE #2]
   3:BIGINT [cpu time: 0ns, rows: 0] -> BIGINT [#8]
```

Pull Request resolved: facebookincubator#1500

Reviewed By: Yuhta

Differential Revision: D35994836

Pulled By: mbasmanova

fbshipit-source-id: 6bacbbe61b68dad97ce2fd5f99610c4ad55897be
@jwyles-ahana changed the title from "[WIP] First try at implementing array_sum function" to "Implementing array_sum function" on May 13, 2022

@yingsu00 left a comment

Some initial comments

// Allocate new vector for the result
memory::MemoryPool* pool = context->pool();
auto resultVector = BaseVector::create(outputType, numRows, pool);

Remove extra empty line

Actually, you need to run Reformat Code (Code -> Reformat Code). Other lines would fail the format check too.

Author

Updated.

// Get access to raw values for the result
OT* resultValues = (OT*) resultVector->valuesAsVoid();

// Iterate over the input vector and find the sum of each array's values

This comment is not needed because it's very obvious. The comments need to be succinct.

Author

Removed.


// Iterate over the input vector and find the sum of each array's values
for (int i = 0; i < numRows; i++) {
// If the whole array is null then set the row null in the output

This comment is not needed

Author

Removed.

if (arrayVector->isNullAt(i)) {
resultVector->setNull(i, true);
}
// If the array is not null then sum the elements and set the result to the sum

This comment is not needed

Author

Removed.

}
}

// Set the value at i equal to the sum

This comment is not needed

Author

Removed.

for (int i = 0; i < numRows; i++) {
// If the whole array is null then set the row null in the output
if (arrayVector->isNullAt(i)) {
resultVector->setNull(i, true);

The Presto function description says "Returns the sum of all non-null elements of the array. If there is no non-null elements, returns 0. "

Author

The Presto function description does not specify what to do when the array itself is null (the case handled here), so I went with null in, null out.
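The semantics discussed in this thread can be sketched in plain C++. This is only an illustration, not the PR's code: `Element`, `ArrayRow`, and `arraySum` are hypothetical names, and `std::optional` stands in for Velox's null tracking. A null array yields null, null elements are skipped, and an empty or all-null array sums to 0, as in the quoted Presto description.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-ins for a nullable array row; these are not Velox types.
using Element = std::optional<int64_t>;
using ArrayRow = std::optional<std::vector<Element>>;

// Null array in, null out. Otherwise sum the non-null elements.
// An empty or all-null array sums to 0.
std::optional<int64_t> arraySum(const ArrayRow& row) {
  if (!row.has_value()) {
    return std::nullopt;
  }
  int64_t sum = 0;
  for (const auto& element : *row) {
    if (element.has_value()) {
      sum += *element;
    }
  }
  return sum;
}
```

The null-array branch is the one design choice the Presto wording leaves open; the sketch follows the author's "null in, null out" decision.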

if (kind == TypeKind::REAL || kind == TypeKind::DOUBLE) {
return std::make_shared<ArraySumFunction<IT, double>>();
}
VELOX_FAIL()

Add a message showing what kind of error it is.

Author

Done.

}

// Define function signature.
// array(T1) -> T2 where T must be coercible to bigint or double, and

T should be T1?

Author

Yes, it should. I will fix.

} // namespace

// Test integer arrays.
TEST_F(ArraySumTest, integer64Input) {

Can you add tests for some of the types that are not coercible to double, and assert that the query fails with the expected message?

Author

I have added tests for StringView and bool types which should (and do) fail with an exception.

// array(T1) -> T2 where T must be coercible to bigint or double, and
// T2 is bigint or double
std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {
return {

Just curious, does array_sum work with decimal types? If so, that could add complexity to these signatures and the implementation. We needn't work on it in this PR, but please let Karteek and the others know about it.

arrayType->kind(),
TypeKind::ARRAY,
"array_sum requires argument of type ARRAY");
}

Please add a validation here that the child of the array type should be coercible to double.

Author

I have added a validation.

}

template <>
void ArraySumFunction<Timestamp, int64_t>::apply(

Just curious, why are these needed?

Author

The main apply function will not compile for Timestamp, StringView, and Date. Since these specializations are never used (array_sum's accepted signatures exclude them), I have added no-op specializations for the types that don't compile, which lets everything build.

Can't we follow the applyTyped + VELOX_DYNAMIC_SCALAR_TEMPLATE_TYPE_DISPATCH approach as in ArrayMinMax.cpp? That would clean these up.
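A minimal sketch of the general idea behind that suggestion, in standalone C++: map a runtime type tag to a template instantiation, and reject unsupported tags before any template is instantiated for them, so no dummy specializations are needed. The names `Kind`, `describeSum`, and `dispatch` are hypothetical, and the actual Velox macro and the code in ArrayMinMax.cpp may work differently.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical, simplified type tags; Velox's TypeKind has many more values.
enum class Kind { kTinyint, kBigint, kDouble, kVarchar };

// Typed worker: only instantiated for the element types listed in dispatch().
template <typename T>
std::string describeSum() {
  static_assert(
      std::is_arithmetic_v<T>, "array_sum needs a numeric element type");
  return std::is_floating_point_v<T> ? "double" : "bigint";
}

// Runtime-to-compile-time dispatch. Unsupported kinds are rejected here, so
// the worker is never instantiated for them.
std::string dispatch(Kind kind) {
  switch (kind) {
    case Kind::kTinyint:
      return describeSum<int8_t>();
    case Kind::kBigint:
      return describeSum<int64_t>();
    case Kind::kDouble:
      return describeSum<double>();
    default:
      throw std::invalid_argument("array_sum: unsupported element type");
  }
}
```

Because `describeSum<T>` is never named for a non-numeric `T`, the compiler never tries to build the summing code for it.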


// Test floating point arrays
TEST_F(ArraySumTest, floatInput) {
auto input = makeNullableArrayVector<float>({{0, 1, 2},

Add tests for values std::numeric_limits::min(), max(), infinity() and quiet_NaN().

Author

I have added some tests for these.
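For reference, a small standalone sketch of how these edge values behave when float elements are accumulated into a double, which is the output type the PR uses for REAL and DOUBLE inputs. `sumAsDouble` is a hypothetical helper, not the PR's function, and the behavior described assumes IEEE 754 arithmetic without fast-math flags.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: sums float elements into a double accumulator.
double sumAsDouble(const std::vector<float>& values) {
  double sum = 0;
  for (float value : values) {
    sum += value;
  }
  return sum;
}
```

Under those assumptions, a single infinity makes the sum infinite, any NaN makes it NaN, opposite infinities cancel to NaN, and two `max()` floats stay finite because the accumulator is a double.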

// Define function signature.
// array(T1) -> T2 where T1 must be coercible to bigint or double, and
// T2 is bigint or double
std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {

The signatures() approach from ArrayMinMax.cpp can be followed here as well, using an unordered_map instead of the vector.

};
std::vector<std::shared_ptr<exec::FunctionSignature>> signatures;
signatures.reserve(s.size());
for (const auto& typeName : s) {

@majetideepak majetideepak May 27, 2022

nit: for (const auto& [returnType, argType] : s) is preferred.
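A short sketch of the structured-binding loop over a type table, assuming a map keyed by argument type. The table contents, the string-based signature format, and `buildSignatures` are hypothetical; the PR's actual container and its key/value order are not shown in this thread, and the binding names simply follow whatever order the container stores.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical argument-type -> return-type table. Each entry becomes one
// textual signature.
std::vector<std::string> buildSignatures(
    const std::unordered_map<std::string, std::string>& typeTable) {
  std::vector<std::string> signatures;
  signatures.reserve(typeTable.size());
  // Structured bindings name both halves of each entry, instead of
  // .first/.second.
  for (const auto& [argType, returnType] : typeTable) {
    signatures.push_back("array(" + argType + ") -> " + returnType);
  }
  return signatures;
}
```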

@majetideepak left a comment

@jwyles-ahana great to see the dummy definitions go away. Made some more comments.

createTyped, elementType->kind(), inputArgs);
switch (elementType->kind()) {
case TypeKind::TINYINT: {
return std::make_shared<ArraySumFunction<int8_t, int64_t>>();

Use TypeTraits<TypeKind::TINYINT>::NativeType instead of int8_t. Same below.

Author

Made change.

// Allocate new vector for the result
memory::MemoryPool* pool = context->pool();
auto resultVector = BaseVector::create(outputType, numRows, pool);
OT* resultValues = (OT*)resultVector->valuesAsVoid();

We can directly write to result and avoid a copy. Also, I feel the compiler cannot vectorize the loop, so we can use the API to set the values instead of dealing with raw values.

    BaseVector::ensureWritable(rows, outputType, context->pool(), result);
    auto resultValues = (*result)->asFlatVector<OT>();
    ...
    resultValues->setNull(i, true);
    ....
    resultValues->set(i, sum);

We have to use rows.applyToSelected to ensure only selected rows are set.

Author

Updated.
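The point about writing only selected rows can be illustrated with a standalone sketch. `Selection` below is a hypothetical stand-in for a row selection mask, not Velox's SelectivityVector, and `sumSelected` is an invented helper. The sketch only shows that unselected result slots must be left untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a row selection mask.
struct Selection {
  std::vector<bool> selected;

  // Invokes fn(row) for each selected row and skips the rest.
  template <typename Fn>
  void applyToSelected(Fn&& fn) const {
    for (std::size_t row = 0; row < selected.size(); ++row) {
      if (selected[row]) {
        fn(row);
      }
    }
  }
};

// Writes a sum for each selected row; other result slots keep their values.
void sumSelected(
    const Selection& rows,
    const std::vector<std::vector<int64_t>>& arrays,
    std::vector<int64_t>& result) {
  rows.applyToSelected([&](std::size_t row) {
    int64_t sum = 0;
    for (int64_t value : arrays[row]) {
      sum += value;
    }
    result[row] = sum;
  });
}
```

A plain `for (int i = 0; i < numRows; i++)` loop, as in the earlier revision, would overwrite rows that the caller did not select.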

*/

#include "velox/expression/EvalCtx.h"
#include "velox/expression/Expr.h"

You might be able to remove the first 2 includes. Just including "VectorFunction.h" is sufficient I think.

Author

Removed.

template <typename IT, typename OT>
class ArraySumFunction : public exec::VectorFunction {
public:
// Execute function.

This comment can be removed.

Author

Removed.

namespace facebook::velox::functions {
namespace {

// See documentation at https://prestodb.io/docs/current/functions/array.html

Move this line below "Implements the array_sum function"

Author

Done.

valueTypeKind == TypeKind::SMALLINT ||
valueTypeKind == TypeKind::INTEGER || valueTypeKind == TypeKind::BIGINT ||
valueTypeKind == TypeKind::REAL || valueTypeKind == TypeKind::DOUBLE;
VELOX_USER_CHECK_EQ(isCoercibleToDouble, true, "Invalid value type");

Please include the invalid valueTypeKind in the error message.

Author

Done.
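A sketch of what such a message could look like, written with standard exceptions rather than Velox's check macros. `checkElementType`, the string type names, and the message wording are all hypothetical; the point is only that the rejected type appears in the text.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical validation helper. It throws with the offending type name in
// the message so the failure is self-explanatory.
void checkElementType(const std::string& typeName) {
  static const char* const kSupported[] = {
      "TINYINT", "SMALLINT", "INTEGER", "BIGINT", "REAL", "DOUBLE"};
  for (const char* supported : kSupported) {
    if (typeName == supported) {
      return;
    }
  }
  throw std::invalid_argument(
      "array_sum: invalid element type " + typeName +
      "; expected a type coercible to bigint or double");
}
```

A test can then assert on the type name in the message instead of only checking that some exception was thrown.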

@jwyles-ahana force-pushed the array_sum branch 7 times, most recently from 08a1e49 to aa59d73, on July 27, 2022 20:54
@jwyles-ahana force-pushed the array_sum branch 11 times, most recently from c3dd658 to bec6db1, on August 2, 2022 18:22
4 participants