
v1.4.0

@HongW2019 released this on 07 Jul 06:50

Overview

OAP 1.4.0 is released with three major components: Gazelle, OAP MLlib, and CloudTik (a new addition). In this release, 59 features/improvements were committed to Gazelle, while OAP MLlib development is paused to focus more on oneDAL. CloudTik is a cloud-scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, and GCP. It enables users and enterprises to easily create and manage an analytics and AI platform on public clouds with out-of-the-box optimized functionality and performance, so they can focus on running their business workloads within minutes or hours instead of spending months building and tuning the platform.

Here are the major features/improvements in OAP 1.4.0.

Gazelle

  • Reach 1.6X overall performance vs. vanilla Spark on the 103 TPC-DS queries with a 5 TB dataset on ICX clusters
  • Reach 1.8X overall performance vs. vanilla Spark on the 22 TPC-H queries with a 5 TB dataset on ICX clusters
  • Optimize shuffle by splitting each column by reducer and allocating a large block of memory shared by all reducers
  • Optimize Columnar2Row and Row2Columnar performance
  • Add support for 10+ expressions
  • Pack all classes into one single jar (see the configuration sketch below)
  • Bug fixes for WholeStage Codegen on unsupported patterns, sort spill, etc.
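
As context for the single-jar packaging above, here is a minimal, illustrative Scala sketch of how the plugin is activated in an application. It assumes the packed Gazelle jar is already on the driver and executor classpath (for example, supplied at spark-submit time); the off-heap sizes are placeholders, and all configuration keys should be confirmed against the Gazelle user guide for your Spark version.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, not taken from this release's docs: activate the Gazelle
// plugin from application code. The single packed jar from this release is
// assumed to already be on the driver/executor classpath; the off-heap
// memory size below is illustrative only.
val spark = SparkSession.builder()
  .appName("gazelle-sketch")
  .config("spark.plugins", "com.intel.oap.GazellePlugin")
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  // Gazelle keeps columnar batches in Arrow off-heap memory.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "20g")
  .getOrCreate()

// Simple probe query; real workloads (e.g., TPC-DS/TPC-H) run unchanged.
spark.sql("SELECT 1 AS probe").show()
```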

CloudTik

  • A scalable, robust, powerful, and unified control plane for cloud cluster and runtime management
  • Support for the three major public cloud providers (AWS, GCP, and Azure) with managed cloud storage, plus an on-premise mode
  • Support for cloud workspaces to manage shared cloud resources, including VPCs and subnets, cloud storage, firewalls, identities/roles, and so on
  • Out-of-the-box runtimes including Spark, Presto/Trino, HDFS, Metastore, Kafka, ZooKeeper, and Ganglia
  • Integration with OAP (Gazelle and MLlib) and various tools for running benchmarks on CloudTik

OAP MLlib

  • Reach over 11X performance vs. vanilla Spark with PCA, Linear Regression, and Ridge Regression on ICX clusters
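
The numbers above are measured on standard Spark MLlib workloads; OAP MLlib is intended to accelerate them without application-code changes, with the acceleration enabled at deployment time by adding the OAP MLlib jars as described in its user guide. A minimal, plain Spark MLlib PCA job of the kind being measured looks like the following sketch (nothing in it is OAP-specific):

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Plain Spark MLlib PCA. With OAP MLlib on the classpath, the same code is
// expected to run on the accelerated backend; the data here is toy-sized.
val spark = SparkSession.builder().appName("pca-sketch").getOrCreate()

val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(6.0, 1.0, 3.0, 8.0, 9.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Fit a 3-component PCA model and project the input vectors.
val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

model.transform(df).select("pcaFeatures").show(truncate = false)
```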

Changelog

Gazelle

Features

#781 Add Spark eventlog analyzer for advanced analysis
#927 Column2Row further enhancement
#913 Add Hadoop 3.3 profile to pom.xml
#869 implement first agg function
#926 Support UDF URLDecoder
#856 [SHUFFLE] Manual split of variable-length buffers (String-like)
#886 Add pmod function support
#855 [SHUFFLE] HugePage support in shuffle
#872 implement replace function
#867 Add substring_index function support
#818 Support length, char_length, locate, regexp_extract
#864 Enable native parquet write by default
#828 CoalesceBatches native implementation
#800 Combine datasource and columnar core jar

Performance

#848 Optimize Columnar2Row performance
#943 Optimize Row2Columnar performance
#854 Enable skipping columnarWSCG for queries with small shuffle size
#857 [SHUFFLE] split by reducer by column

Bugs Fixed

#827 Github action is broken
#987 TPC-H q7, q8, q9 run failed when using String for Date
#892 Q47 and Q57 failed on Ubuntu 20.04 without OpenJDK
#784 Improve Sort Spill
#788 Spark UT of "randomSplit on reordered partitions" encountered "Invalid: Map array child array should have no nulls" issue
#821 Improve Wholestage Codegen check
#831 Support more expression types in getting attribute
#876 Write arrow hang with OutputWriter.path
#891 Spark executor lost while DatasetFileWriter failed with speculation
#909 "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y LIMIT 2" drains 4 rows into table x using Arrow write extension
#889 Failed to write with ParquetFileFormat while using ArrowWriteExtension
#910 TPCDS failed, segfault caused by PR903
#852 Unit test fix for NSE-843
#843 ArrowDataSource: Arrow dataset inspect() is called every time a file is read

PRs

#1005 [NSE-800] Fix an assembly warning
#1002 [NSE-800] Pack the classes into one single jar
#988 [NSE-987] fix string date
#977 [NSE-126] set default codegen opt to O1
#975 [NSE-927] Add macro AVX512BW check for different CPU architecture
#962 [NSE-359] disable unit tests on spark32 package
#966 [NSE-913] Add support for Hadoop 3.3.1 when packaging
#936 [NSE-943] Optimize IsNULL() function for Row2Columnar
#937 [NSE-927] Implement AVX512 optimization selection in Runtime and merge two C2R code files into one.
#951 [DNM] update sparklog
#938 [NSE-581] implement rlike/regexp_like
#946 [DNM] update on sparklog script
#939 [NSE-581] adding ShortType/FloatType in ColumnarLiteral
#934 [NSE-927] Extract and inline functions for native ColumnartoRow
#933 [NSE-581] Improve GetArrayItem(Split()) performance
#922 [NSE-912] Remove extra handleSafe costs
#925 [NSE-926] Support a UDF: URLDecoder
#924 [NSE-927] Enable AVX512 in Binary length calculation for native ColumnartoRow
#918 [NSE-856] Optimize of string/binary split
#908 [NSE-848] Optimize performance for Column2Row
#900 [NSE-869] Add 'first' agg function support
#917 [NSE-886] Add pmod expression support
#916 [NSE-909] fix slow test
#915 [NSE-857] Further optimizations of validity buffer split
#912 [NSE-909] "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y L…
#896 [NSE-889] Failed to write with ParquetFileFormat while using ArrowWriteExtension
#911 [NSE-910] fix bug of PR903
#901 [NSE-891] Spark executor lost while DatasetFileWriter failed with speculation
#907 [NSE-857] split validity buffer by reducer
#902 [NSE-892] Allow to use jar cmd not in PATH
#898 [NSE-867][FOLLOWUP] Add substring_index function support
#894 [NSE-855] allocate large block of memory for all reducer #881
#880 [NSE-857] Fill destination buffer by reducer
#839 [DNM] some optimizations to shuffle's split function
#879 [NSE-878] WIP get phyplan bugfix
#877 [NSE-876] Fix writing arrow hang with OutputWriter.path
#873 [NSE-872] implement replace function
#850 [NSE-854] Small Shuffle Size disable wholestagecodegen
#868 [NSE-867] Add substring_index function support
#847 [NSE-818] Support length, char_length, locate & regexp_extract
#865 [NSE-864] Enable native parquet write by default
#811 [NSE-810] disable codegen for SMJ with local limit
#860 remove sensitive info from physical plan
#853 [NSE-852] Unit test fix for NSE-843
#844 [NSE-843] ArrowDataSource: Arrow dataset inspect() is called every tim…
#842 fix in eventlog script
#841 fix bug of script
#829 [NSE-828] Add native CoalesceBatches implementation
#830 [NSE-831] Support more expression types in getting attribute
#815 [NSE-610] Shrink hashmap to use less memory
#822 [NSE-821] Fix Wholestage Codegen on unsupported pattern
#824 [NSE-823] Use SPARK_VERSION_SHORT instead of SPARK_VERSION to find SparkShims
#826 [NSE-827] fix GHA
#819 [DNM] complete sparklog script
#802 [NSE-794] Fix count() with decimal value
#801 [NSE-786] Adding docs for shim layers
#790 [NSE-781]Add eventlog analyzer tool
#789 [NSE-788] Quick fix for randomSplit on reordered partitions
#780 [NSE-784] fallback Sort after SortHashAgg

OAP MLlib

Performance

#204 Intel-MLlib requires more memory to run the Bayes algorithm

PRs

#208 [ML-204][NaiveBayes] Remove cache from NaiveBayes