Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: facebookincubator/velox
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: 6010f956c3735fdadf6a4a0d24a2e4e87f5ea9c6
Choose a base ref
...
head repository: facebookincubator/velox
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: ce273fac228ca214386c9a577b11e755994ef611
Choose a head ref
  • 16 commits
  • 45 files changed
  • 13 contributors

Commits on Jan 26, 2025

  1. feat: Add IoStatistics in ReadFile to collect storage statistics (#12160

    )
    
    Summary:
    1.Add IoStatistics in ReadFile to collect storage statistics
    2.Collect WS stats in IoStatistics and add to runtimeStats
     {F1974667171}
    
    Pull Request resolved: #12160
    
    Test Plan:
    Deployed to a test cluster
    Java: 20250125_082712_05968_cwjy8
    S1-wsInRegionReadBytes	985.81 MB	1232	322.87 KB	1.00 MB
    
    CPP: 20250125_095038_00000_sgud6
    S1-TableScan.0.wsInRegionReadBytes	908.81 MB	154	5.80 MB	5.97 MB
    
    Reviewed By: yuandagits
    
    Differential Revision: D68598642
    
    Pulled By: kewang1024
    
    fbshipit-source-id: 84de8b31350b94cfc738a66fafa001cc20372e84
    kewang1024 authored and facebook-github-bot committed Jan 26, 2025
    Copy the full SHA
    cb3f8b6 View commit details

Commits on Jan 27, 2025

  1. fix: Fix config name for stats-based-filter-reorder-disabaled (#12175)

    Summary: Pull Request resolved: #12175
    
    Reviewed By: xiaoxmeng
    
    Differential Revision: D68679634
    
    fbshipit-source-id: 3b594976ab183461f1e725fdb63f02ad88622ee2
    Ke Wang authored and facebook-github-bot committed Jan 27, 2025
    Copy the full SHA
    028d9e4 View commit details
  2. fix(array functions): Add support to array_has_duplicates for input o…

    …f type json (#12158)
    
    Summary:
    Pull Request resolved: #12158
    
    Minor update to function array_has_duplicates which registers json as input in array registration file. No need to cast input as json is already treated as a varchar.
    
    Reviewed By: kgpai
    
    Differential Revision: D68580104
    
    fbshipit-source-id: cc01dff823befc367a3716350c79c0a955fc6762
    peterenescu authored and facebook-github-bot committed Jan 27, 2025
    Copy the full SHA
    9dede0e View commit details
  3. build(ci): Fix artifact handling in Conbench Upload job (#12183)

    Summary:
    The v4 artifact action handles paths different leading to the 'benchmark-results' dir getting dropped.
    
    Working run of the fixed workflow: https://github.com/facebookincubator/velox/actions/runs/12994805091/job/36240073932#step:8:26
    
    Pull Request resolved: #12183
    
    Reviewed By: kagamiori
    
    Differential Revision: D68721703
    
    Pulled By: kgpai
    
    fbshipit-source-id: b2f9f852d153442f94d89c6da5cc1cdfea126f5c
    assignUser authored and facebook-github-bot committed Jan 27, 2025
    Copy the full SHA
    a55e9a2 View commit details
  4. let VeloxPromise move-ctor be noexcept (#12182)

    Summary:
    Pull Request resolved: #12182
    
    It seems like the intention was for the `VeloxPromise` move-ctor to be `noexcept`.
    
    Reviewed By: pedroerp
    
    Differential Revision: D68715056
    
    fbshipit-source-id: 37938e434d519f0302a325f1b25ad84f83e5d8d3
    yfeldblum authored and facebook-github-bot committed Jan 27, 2025
    Copy the full SHA
    cc264cd View commit details
  5. build: Fix pyvelox build failing due to missing protobuf headers (#12185

    )
    
    Summary: Pull Request resolved: #12185
    
    Reviewed By: kagamiori
    
    Differential Revision: D68724760
    
    Pulled By: kgpai
    
    fbshipit-source-id: ff3562c7bec8c44efe1f3734944499a2680e08ce
    assignUser authored and facebook-github-bot committed Jan 27, 2025
    Copy the full SHA
    dfe7d1a View commit details
  6. Revert D68598642 (#12186)

    Summary:
    Pull Request resolved: #12186
    
    This diff reverts D68598642
    It's causing the task to crash due to race condition
    
    Reviewed By: zation99
    
    Differential Revision: D68727623
    
    fbshipit-source-id: 6cd847c1d6f954d254089d9c80aa038dd689aa9b
    generatedunixname89002005232357 authored and facebook-github-bot committed Jan 27, 2025
    Copy the full SHA
    e2ad89c View commit details

Commits on Jan 28, 2025

  1. feat: Add Global Config in place of gflags for expressions (#12127)

    Summary:
    This is continuing the previous work of removing GFlags in place of using global config
    
    Pull Request resolved: #12127
    
    Reviewed By: kgpai
    
    Differential Revision: D68610683
    
    Pulled By: xiaoxmeng
    
    fbshipit-source-id: 7915a4db79a1ceab975491b827c5e38678974334
    majetideepak authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    cb6e652 View commit details
  2. feat: Extend connector interface to support index lookup (#12187)

    Summary:
    Pull Request resolved: #12187
    
    Extend connector interface to support create index source for index join lookup
    Add IndexSource interface which provides the lookup interface between velox operator
    and backend index table
    Add lookup request and result data structures for lookup operation with index sourc
    
    Reviewed By: mbasmanova
    
    Differential Revision: D68612814
    
    fbshipit-source-id: 0f0900829dcafee8bf6c95d58627095b74be4757
    xiaoxmeng authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    805fe4b View commit details
  3. misc: Add comment for failing tests in FIPS environments (#12188)

    Summary:
    In FIPS enabled environments OpenSSL is restricting access to certain APIs due to them being insecure. Velox supports functions that calculate the results using these obsolete functions. The tests will fail if run.
    
    Pull Request resolved: #12188
    
    Reviewed By: kgpai
    
    Differential Revision: D68746609
    
    Pulled By: xiaoxmeng
    
    fbshipit-source-id: c428ea8cd7170436f784d9dbab96282150983f1f
    czentgr authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    68fcaa1 View commit details
  4. misc(build): Add variable to configure dependency build parallelism (#…

    …12156)
    
    Summary:
    This change adds the BUILD_THREADS environment variable. This controls the parallelism used when building dependencies. By default, the number of cores in the host machine is used but can result in OOM kills. For example, on machines with 8 cores and 16GB ram (plus 16GB swap) OOM kills can be observed. This change allows a user to configure a flexible build automation.
    
    Pull Request resolved: #12156
    
    Reviewed By: kgpai
    
    Differential Revision: D68746594
    
    Pulled By: xiaoxmeng
    
    fbshipit-source-id: 589ae16013d99ad541251f3c8ba32a1337c7a0f7
    czentgr authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    1692f5b View commit details
  5. build(cmake): Make monolithic libs be built inside Velox's own binary…

    … directory when being used as a subproject (#12144)
    
    Summary:
    Similar to #12128, currently when Velox is imported as a subproject of another project, the monolithic libs `libvelox.a` / `libvelox.so` will be generated in root project's binary directory which is sub-optimal.
    
    The patch fixes the issue so the libs will be generated inside Velox's binary directory.
    
    Pull Request resolved: #12144
    
    Reviewed By: kgpai
    
    Differential Revision: D68606631
    
    Pulled By: xiaoxmeng
    
    fbshipit-source-id: d0ef042c5ee935a598637b07549c9c48a06708fe
    zhztheplayer authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    9fd0b0f View commit details
  6. build(simdjson): Add CMake option to skip utf8 validation (#12165)

    Summary:
    Add VELOX_SIMDJSON_SKIPUTF8VALIDATION CMake option to do this.
    
    Discussed in #10639 (comment)
    
    Pull Request resolved: #12165
    
    Reviewed By: kgpai
    
    Differential Revision: D68754000
    
    Pulled By: xiaoxmeng
    
    fbshipit-source-id: 4a5e43dd1c14a17c52d896521af30699b1b541a6
    marin-ma authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    a837ebd View commit details
  7. fix: regexp_extract_all to not return a match in mismatched group (#1…

    …2143)
    
    Summary:
    Fix regexp_extract_all to avoid returning a match in mismatched group.
    
    For example,
    
    ```
    regexp_extract_all("rat cat\nbat dog", "ra(.)|blah(.)(.)", 2)
    ```
    
    is expected to return {NULL} because the first group `ra(.)` has a match while the second group `blah(.)` has no match. The current buggy implementation returns {""}.
    
    This test case and the expected result is from presto's existing UT, see https://github.com/prestodb/presto/blob/48f2fb1cd0244e8ae1230d27445fcd15d6520597/presto-main/src/test/java/com/facebook/presto/operator/scalar/AbstractTestRegexpFunctions.java#L214.
    
    Fixes #12119.
    
    Pull Request resolved: #12143
    
    Reviewed By: kagamiori
    
    Differential Revision: D68746581
    
    Pulled By: xiaoxmeng
    
    fbshipit-source-id: 10e6c919fde1c4419a404592b7e7c6941fe983a1
    HolyLow authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    2d24256 View commit details
  8. test: Disable O_DIRECT in cache fuzzer (#12154)

    Summary:
    Pull Request resolved: #12154
    
    O_DIRECT requires I/O size needs to be the same as a disk file block size
    which is not handled in SSD cache. Misalignment can lead to EINVAL in some
    filesystem and kernel version.
    
    Reviewed By: xiaoxmeng, kagamiori, yuandagits
    
    Differential Revision: D68562695
    
    fbshipit-source-id: 5a3dbdfed288dfb8cfa606f9d40fe0e2c0568f8a
    zacw7 authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    9999ae0 View commit details
  9. refactor: Consolidate subfield filter definitions (#12184)

    Summary:
    X-link: facebookexperimental/verax#2
    
    Pull Request resolved: #12184
    
    There are multiple definitions of subfield filters in velox and consolidate them in Filter.h
    
    Reviewed By: Yuhta
    
    Differential Revision: D68719054
    
    fbshipit-source-id: 7085c7ad803980d960e32ce4a981bd4a722558e9
    xiaoxmeng authored and facebook-github-bot committed Jan 28, 2025
    Copy the full SHA
    ce273fa View commit details
Showing with 327 additions and 115 deletions.
  1. +1 −1 .github/workflows/conbench_upload.yml
  2. +2 −2 CMake/VeloxUtils.cmake
  3. +4 −0 CMake/resolve_dependency_modules/simdjson.cmake
  4. +2 −0 CMakeLists.txt
  5. +6 −0 README.md
  6. +1 −1 scripts/setup-centos9.sh
  7. +2 −1 scripts/setup-helper-functions.sh
  8. +1 −1 scripts/setup-macos.sh
  9. +1 −1 scripts/setup-ubuntu.sh
  10. +1 −1 setup.py
  11. +4 −4 velox/common/caching/tests/AsyncDataCacheTest.cpp
  12. +24 −0 velox/common/config/GlobalConfig.h
  13. +2 −2 velox/common/file/tests/FaultyFileSystem.h
  14. +1 −1 velox/common/future/VeloxPromise.h
  15. +122 −0 velox/connectors/Connector.h
  16. +2 −2 velox/connectors/hive/HiveConfig.h
  17. +3 −3 velox/connectors/hive/HiveConnectorUtil.cpp
  18. +5 −3 velox/connectors/hive/HiveConnectorUtil.h
  19. +1 −4 velox/connectors/hive/HiveDataSource.h
  20. +2 −2 velox/connectors/hive/TableHandle.cpp
  21. +3 −6 velox/connectors/hive/TableHandle.h
  22. +0 −3 velox/connectors/hive/iceberg/PositionalDeleteFileReader.h
  23. +2 −2 velox/connectors/hive/tests/HiveConnectorUtilTest.cpp
  24. +1 −3 velox/core/PlanNode.h
  25. +2 −2 velox/dwio/common/tests/LoggedExceptionTest.cpp
  26. +0 −2 velox/dwio/common/tests/utils/FilterGenerator.h
  27. +6 −1 velox/exec/fuzzer/CacheFuzzer.cpp
  28. +8 −7 velox/exec/tests/TableScanTest.cpp
  29. +1 −1 velox/exec/tests/utils/HiveConnectorTestBase.h
  30. +1 −1 velox/exec/tests/utils/PlanBuilder.cpp
  31. +10 −27 velox/expression/Expr.cpp
  32. +0 −13 velox/expression/Expr.h
  33. +1 −1 velox/expression/tests/ExprEncodingsTest.cpp
  34. +5 −6 velox/expression/tests/ExprTest.cpp
  35. +26 −0 velox/flag_definitions/flags.cpp
  36. +7 −2 velox/functions/lib/Re2Functions.cpp
  37. +14 −0 velox/functions/lib/tests/Re2FunctionsTest.cpp
  38. +1 −0 velox/functions/prestosql/registration/ArrayFunctionsRegistration.cpp
  39. +40 −0 velox/functions/prestosql/tests/ArrayHasDuplicatesTest.cpp
  40. +2 −0 velox/functions/prestosql/tests/BinaryFunctionsTest.cpp
  41. +1 −0 velox/py/CMakeLists.txt
  42. +5 −5 velox/substrait/SubstraitToVeloxPlan.cpp
  43. +1 −1 velox/substrait/SubstraitToVeloxPlan.h
  44. +3 −0 velox/type/Filter.h
  45. +0 −3 velox/type/tests/SubfieldFiltersBuilder.h
2 changes: 1 addition & 1 deletion .github/workflows/conbench_upload.yml
Original file line number Diff line number Diff line change
@@ -108,7 +108,7 @@ jobs:
--run_id "GHA-${{ github.run_id }}-${{ github.run_attempt }}" \
--pr_number "${{ steps.extract.outputs.pr_number }}" \
--sha "${{ steps.download.outputs.contender_sha }}" \
--output_dir "/tmp/artifacts/benchmark-results/contender/"
--output_dir "/tmp/artifacts/contender/"
- name: "Check the status of the upload"
# Status functions like failure() only work in `if:`
4 changes: 2 additions & 2 deletions CMake/VeloxUtils.cmake
Original file line number Diff line number Diff line change
@@ -77,9 +77,9 @@ function(velox_add_library TARGET)
# Create the target if this is the first invocation.
add_library(velox ${_type} ${ARGN})
set_target_properties(velox PROPERTIES LIBRARY_OUTPUT_DIRECTORY
${CMAKE_BINARY_DIR}/lib)
${PROJECT_BINARY_DIR}/lib)
set_target_properties(velox PROPERTIES ARCHIVE_OUTPUT_DIRECTORY
${CMAKE_BINARY_DIR}/lib)
${PROJECT_BINARY_DIR}/lib)
install(TARGETS velox DESTINATION lib/velox)
endif()
# create alias for compatability
4 changes: 4 additions & 0 deletions CMake/resolve_dependency_modules/simdjson.cmake
Original file line number Diff line number Diff line change
@@ -29,4 +29,8 @@ FetchContent_Declare(
URL ${VELOX_SIMDJSON_SOURCE_URL}
URL_HASH ${VELOX_SIMDJSON_BUILD_SHA256_CHECKSUM})

if(${VELOX_SIMDJSON_SKIPUTF8VALIDATION})
set(SIMDJSON_SKIPUTF8VALIDATION ON)
endif()

FetchContent_MakeAvailable(simdjson)
2 changes: 2 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -149,6 +149,8 @@ option(
VELOX_ENABLE_INT64_BUILD_PARTITION_BOUND
"make buildPartitionBounds_ a vector int64 instead of int32 to avoid integer overflow when the hashtable has billions of records"
OFF)
option(VELOX_SIMDJSON_SKIPUTF8VALIDATION
"Skip simdjson utf8 validation in JSON parsing" OFF)

# Explicitly force compilers to generate colored output. Compilers usually do
# this by default if they detect the output is a terminal, but this assumption
6 changes: 6 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -131,6 +131,12 @@ subsequent Velox builds can use the installed packages.
*You can reuse `DEPENDENCY_INSTALL` and `INSTALL_PREFIX` for Velox clients such as Prestissimo
by specifying a common shared directory.`*

The build parallelism for dependencies can be controlled by the `BUILD_THREADS` environment
variable and overrides the default number of parallel processes used for compiling and linking.
The default value is the number of cores on your machine.
This is useful if your machine has lots of cores but no matching memory to process all
compile and link processes in parallel resulting in OOM kills by the kernel.

### Setting up on macOS

On a macOS machine (either Intel or Apple silicon) you can setup and then build like so:
2 changes: 1 addition & 1 deletion scripts/setup-centos9.sh
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@ set -efx -o pipefail
# so that some low level types are the same size. Also, disable warnings.
SCRIPTDIR=$(dirname "${BASH_SOURCE[0]}")
source $SCRIPTDIR/setup-helper-functions.sh
NPROC=$(getconf _NPROCESSORS_ONLN)
NPROC=${BUILD_THREADS:-$(getconf _NPROCESSORS_ONLN)}
export CXXFLAGS=$(get_cxx_flags) # Used by boost.
export CFLAGS=${CXXFLAGS//"-std=c++17"/} # Used by LZO.
CMAKE_BUILD_TYPE="${BUILD_TYPE:-Release}"
3 changes: 2 additions & 1 deletion scripts/setup-helper-functions.sh
Original file line number Diff line number Diff line change
@@ -18,6 +18,7 @@

DEPENDENCY_DIR=${DEPENDENCY_DIR:-$(pwd)/deps-download}
OS_CXXFLAGS=""
NPROC=${BUILD_THREADS:-$(getconf _NPROCESSORS_ONLN)}

function run_and_time {
time "$@" || (echo "Failed to run $* ." ; exit 1 )
@@ -214,7 +215,7 @@ function cmake_install {
-DBUILD_TESTING=OFF \
"$@"
# Exit if the build fails.
cmake --build "${BINARY_DIR}" || { echo 'build failed' ; exit 1; }
cmake --build "${BINARY_DIR}" "-j ${NPROC}" || { echo 'build failed' ; exit 1; }
${SUDO} cmake --install "${BINARY_DIR}"
}

2 changes: 1 addition & 1 deletion scripts/setup-macos.sh
Original file line number Diff line number Diff line change
@@ -36,7 +36,7 @@ PYTHON_VENV=${PYTHON_VENV:-"${SCRIPTDIR}/../.venv"}
# by tagging the brew packages to be system packages.
# This is used during package builds.
export OS_CXXFLAGS=" -isystem $(brew --prefix)/include "
NPROC=$(getconf _NPROCESSORS_ONLN)
NPROC=${BUILD_THREADS:-$(getconf _NPROCESSORS_ONLN)}

BUILD_DUCKDB="${BUILD_DUCKDB:-true}"
VELOX_BUILD_SHARED=${VELOX_BUILD_SHARED:-"OFF"} #Build folly shared for use in libvelox.so.
2 changes: 1 addition & 1 deletion scripts/setup-ubuntu.sh
Original file line number Diff line number Diff line change
@@ -34,7 +34,7 @@ source $SCRIPTDIR/setup-helper-functions.sh
# are the same size.
COMPILER_FLAGS=$(get_cxx_flags)
export COMPILER_FLAGS
NPROC=$(getconf _NPROCESSORS_ONLN)
NPROC=${BUILD_THREADS:-$(getconf _NPROCESSORS_ONLN)}
BUILD_DUCKDB="${BUILD_DUCKDB:-true}"
export CMAKE_BUILD_TYPE=Release
VELOX_BUILD_SHARED=${VELOX_BUILD_SHARED:-"OFF"} #Build folly shared for use in libvelox.so.
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
@@ -136,7 +136,7 @@ def build_extension(self, ext):
f"-DCMAKE_BUILD_TYPE={cfg}",
f"-DCMAKE_INSTALL_PREFIX={extdir}",
"-DCMAKE_VERBOSE_MAKEFILE=ON",
"-DVELOX_BUILD_MINIMAL=ON",
"-DVELOX_BUILD_MINIMAL_WITH_DWIO=ON",
"-DVELOX_BUILD_PYTHON_PACKAGE=ON",
f"-DPYTHON_EXECUTABLE={exec_path} ",
]
8 changes: 4 additions & 4 deletions velox/common/caching/tests/AsyncDataCacheTest.cpp
Original file line number Diff line number Diff line change
@@ -718,7 +718,7 @@ TEST_P(AsyncDataCacheTest, pin) {

TEST_P(AsyncDataCacheTest, replace) {
constexpr int64_t kMaxBytes = 64 << 20;
FLAGS_velox_exception_user_stacktrace_enabled = false;
config::globalConfig.exceptionUserStacktraceEnabled = false;
initializeCache(kMaxBytes);
// Load 10x the max size, inject an error every 21 batches.
loadLoop(0, kMaxBytes * 10, 21);
@@ -736,7 +736,7 @@ TEST_P(AsyncDataCacheTest, replace) {

TEST_P(AsyncDataCacheTest, evictAccounting) {
constexpr int64_t kMaxBytes = 64 << 20;
FLAGS_velox_exception_user_stacktrace_enabled = false;
config::globalConfig.exceptionUserStacktraceEnabled = false;
initializeCache(kMaxBytes);
auto pool = manager_->addLeafPool("test");

@@ -760,7 +760,7 @@ TEST_P(AsyncDataCacheTest, evictAccounting) {
TEST_P(AsyncDataCacheTest, largeEvict) {
constexpr int64_t kMaxBytes = 256 << 20;
constexpr int32_t kNumThreads = 24;
FLAGS_velox_exception_user_stacktrace_enabled = false;
config::globalConfig.exceptionUserStacktraceEnabled = false;
initializeCache(kMaxBytes);
// Load 10x the max size, inject an allocation of 1/8 the capacity every 4
// batches.
@@ -839,7 +839,7 @@ TEST_P(AsyncDataCacheTest, DISABLED_ssd) {
constexpr uint64_t kRamBytes = 32 << 20;
constexpr uint64_t kSsdBytes = 512UL << 20;
#endif
FLAGS_velox_exception_user_stacktrace_enabled = false;
config::globalConfig.exceptionUserStacktraceEnabled = false;
initializeCache(kRamBytes, kSsdBytes);
cache_->setVerifyHook(
[&](const AsyncDataCacheEntry& entry) { checkContents(entry); });
24 changes: 24 additions & 0 deletions velox/common/config/GlobalConfig.h
Original file line number Diff line number Diff line change
@@ -17,6 +17,7 @@
#pragma once

#include <cstdint>
#include <string>

namespace facebook::velox::config {

@@ -53,6 +54,29 @@ struct GlobalConfiguration {
/// Min time interval in milliseconds between stack traces captured in
/// system type of VeloxException; off when set to 0 (the default).
int32_t exceptionSystemStacktraceRateLimitMs{0};
/// Whether to overwrite queryCtx and force the use of simplified expression
/// evaluation path.
bool forceEvalSimplified{false};
/// This is an experimental flag only to be used for debugging purposes. If
/// set to true, serializes the input vector data and all the SQL expressions
/// in the ExprSet that is currently executing, whenever a fatal signal is
/// encountered. Enabling this flag makes the signal handler async signal
/// unsafe, so it should only be used for debugging purposes. The vector and
/// SQLs are serialized to files in directories specified by either
/// 'saveInputOnExpressionAnyFailurePath' or
/// 'saveInputOnExpressionSystemFailurePath'
bool experimentalSaveInputOnFatalSignal{false};
/// Used to enable saving input vector and expression SQL on disk in case
/// of any (user or system) error during expression evaluation. The value
/// specifies a path to a directory where the vectors will be saved. That
/// directory must exist and be writable.
std::string saveInputOnExpressionAnyFailurePath;
/// Used to enable saving input vector and expression SQL on disk in case
/// of a system error during expression evaluation. The value specifies a path
/// to a directory where the vectors will be saved. That directory must exist
/// and be writable. This flag is ignored if
/// saveInputOnExpressionAnyFailurePath flag is set.
std::string saveInputOnExpressionSystemFailurePath;
};

extern GlobalConfiguration globalConfig;
4 changes: 2 additions & 2 deletions velox/common/file/tests/FaultyFileSystem.h
Original file line number Diff line number Diff line change
@@ -47,8 +47,8 @@ class FaultyFileSystem : public FileSystem {
// Extracts the delegated real file path by removing the faulty file system
// scheme prefix.
inline std::string_view extractPath(std::string_view path) const override {
VELOX_CHECK_EQ(path.find(scheme()), 0, "");
const auto filePath = path.substr(scheme().length());
const auto filePath =
(path.find(scheme()) == 0) ? path.substr(scheme().length()) : path;
return getFileSystem(filePath, config_)->extractPath(filePath);
}

2 changes: 1 addition & 1 deletion velox/common/future/VeloxPromise.h
Original file line number Diff line number Diff line change
@@ -42,7 +42,7 @@ class VeloxPromise : public folly::Promise<T> {
}
}

explicit VeloxPromise(VeloxPromise<T>&& other)
explicit VeloxPromise(VeloxPromise<T>&& other) noexcept
: folly::Promise<T>(std::move(other)),
context_(std::move(other.context_)) {}

122 changes: 122 additions & 0 deletions velox/connectors/Connector.h
Original file line number Diff line number Diff line change
@@ -43,6 +43,10 @@ namespace facebook::velox::config {
class ConfigBase;
}

namespace facebook::velox::core {
class ITypedExpr;
}

namespace facebook::velox::connector {

class DataSource;
@@ -287,6 +291,73 @@ class DataSource {
}
};

class IndexSource {
public:
virtual ~IndexSource() = default;

/// Represents a lookup request for a given input.
struct LookupRequest {
/// Contains the input column vectors used by lookup join and range
/// conditions.
RowVectorPtr input;

explicit LookupRequest(RowVectorPtr input) : input(std::move(input)) {}
};

/// Represents the lookup result for a subset of input produced by the
/// 'LookupResultIterator'.
struct LookupResult {
/// Specifies the indices of input row in the lookup request that have
/// matches in 'output'. It contains the input indices in the order
/// of the input rows in the lookup request. Any gap in the indices means
/// the input rows that has no matches in output.
///
/// Example:
/// LookupRequest: input = [0, 1, 2, 3, 4]
/// LookupResult: inputHits = [0, 0, 2, 2, 3, 4, 4, 4]
/// output = [0, 1, 2, 3, 4, 5, 6, 7]
///
/// Here is match results for each input row:
/// input row #0: match with output rows #0 and #1.
/// input row #1: no matches
/// input row #2: match with output rows #2 and #3.
/// input row #3: match with output row #4.
/// input row #4: match with output rows #5, #6 and #7.
///
/// 'LookupResultIterator' must also produce the output result in order of
/// input rows.
BufferPtr inputHits;

/// Contains the lookup result rows.
RowVectorPtr output;

LookupResult(BufferPtr _inputHits, RowVectorPtr _output)
: inputHits(std::move(_inputHits)), output(std::move(_output)) {
VELOX_CHECK_EQ(inputHits->size() / sizeof(int32_t), output->size());
}
};

/// The lookup result iterator used to fetch the lookup result in batch for a
/// given lookup request.
class LookupResultIterator {
public:
virtual ~LookupResultIterator() = default;

/// Invoked to fetch update to 'size' number of output rows. Returns nullptr
/// if all the lookup results have been fetched. Returns std::nullopt and
/// sets the 'future' if started asynchronous work and needs to wait for it
/// to complete to continue processing. The caller will wait for the
/// 'future' to complete before calling 'next' again.
virtual std::optional<std::shared_ptr<LookupResult>> next(
vector_size_t size,
velox::ContinueFuture& future) = 0;
};

virtual LookupResultIterator lookup(const LookupRequest& request) = 0;

virtual std::unordered_map<std::string, RuntimeCounter> runtimeStats() = 0;
};

/// Collection of context data for use in a DataSource, IndexSource or DataSink.
/// One instance of this per DataSource and DataSink. This may be passed between
/// threads but methods must be invoked sequentially. Serializing use is the
@@ -479,6 +550,57 @@ class Connector {
return false;
}

/// Creates index source for index join lookup.
/// @param inputType The list of probe-side columns that either used in
/// equi-clauses or join conditions.
/// @param numJoinKeys The number of key columns used in join equi-clauses.
/// The first 'numJoinKeys' columns in 'inputType' form a prefix of the
/// index, and the rest of the columns in inputType are expected to be used in
/// 'joinCondition'.
/// @param joinConditions The join conditions. It expects inputs columns from
/// the 'tail' of 'inputType' and from 'columnHandles'.
/// @param outputType The lookup output type from index source.
/// @param tableHandle The index table handle.
/// @param columnHandles The column handles which maps from column name
/// used in 'outputType' and 'joinConditions' to the corresponding column
/// handles in the index table.
/// @param connectorQueryCtx The query context.
///
/// Here is an example that how the lookup join operator uses index source:
///
/// SELECT t.sid, t.day_ts, u.event_value
/// FROM t LEFT JOIN u
/// ON t.sid = u.sid
/// AND contains(t.event_list, u.event_type)
/// AND t.ds BETWEEN '2024-01-01' AND '2024-01-07'
///
/// Here,
/// - 'inputType' is ROW{t.sid, t.event_list}
/// - 'numJoinKeys' is 1 since only t.sid is used in join equi-clauses.
/// - 'joinConditions' is list of one expression: contains(t.event_list,
/// u.event_type)
/// - 'outputType' is ROW{u.event_value}
/// - 'tableHandle' specifies the metadata of the index table.
/// - 'columnHandles' is a map from 'u.event_type' (in 'joinConditions') and
/// 'u.event_value' (in 'outputType') to the actual column names in the
/// index table.
/// - 'connectorQueryCtx' provide the connector query execution context.
///
virtual std::unique_ptr<IndexSource> createIndexSource(
const RowTypePtr& inputType,
size_t numJoinKeys,
const std::vector<std::shared_ptr<const core::ITypedExpr>>&
joinConditions,
const RowTypePtr& outputType,
const std::shared_ptr<ConnectorTableHandle>& tableHandle,
const std::unordered_map<
std::string,
std::shared_ptr<connector::ColumnHandle>>& columnHandles,
ConnectorQueryCtx* connectorQueryCtx) {
VELOX_UNSUPPORTED(
"Connector {} does not support index source", connectorId());
}

virtual std::unique_ptr<DataSink> createDataSink(
RowTypePtr inputType,
std::shared_ptr<ConnectorInsertTableHandle> connectorInsertTableHandle,
4 changes: 2 additions & 2 deletions velox/connectors/hive/HiveConfig.h
Original file line number Diff line number Diff line change
@@ -165,9 +165,9 @@ class HiveConfig {
"hive.reader.timestamp_unit";

static constexpr const char* kReadStatsBasedFilterReorderDisabled =
"hive.reader.stats-based-filter-reorder-disabaled";
"hive.reader.stats-based-filter-reorder-disabled";
static constexpr const char* kReadStatsBasedFilterReorderDisabledSession =
"hive.reader.stats_based_filter_reorder_disabaled";
"hive.reader.stats_based_filter_reorder_disabled";

static constexpr const char* kLocalDataPath = "hive_local_data_path";
static constexpr const char* kLocalFileFormat = "hive_local_file_format";
6 changes: 3 additions & 3 deletions velox/connectors/hive/HiveConnectorUtil.cpp
Original file line number Diff line number Diff line change
@@ -288,7 +288,7 @@ void checkColumnNameLowerCase(const std::shared_ptr<const Type>& type) {
}

void checkColumnNameLowerCase(
const SubfieldFilters& filters,
const common::SubfieldFilters& filters,
const std::unordered_map<std::string, std::shared_ptr<HiveColumnHandle>>&
infoColumns) {
for (const auto& filterIt : filters) {
@@ -349,7 +349,7 @@ std::shared_ptr<common::ScanSpec> makeScanSpec(
const RowTypePtr& rowType,
const folly::F14FastMap<std::string, std::vector<const common::Subfield*>>&
outputSubfields,
const SubfieldFilters& filters,
const common::SubfieldFilters& filters,
const RowTypePtr& dataColumns,
const std::unordered_map<std::string, std::shared_ptr<HiveColumnHandle>>&
partitionKeys,
@@ -837,7 +837,7 @@ core::TypedExprPtr extractFiltersFromRemainingFilter(
const core::TypedExprPtr& expr,
core::ExpressionEvaluator* evaluator,
bool negated,
SubfieldFilters& filters,
common::SubfieldFilters& filters,
double& sampleRate) {
auto* call = dynamic_cast<const core::CallTypedExpr*>(expr.get());
if (call == nullptr) {
Loading