
v1.4.0

@HongW2019 released this on 07 Jul 06:50

Overview

OAP 1.4.0 is released with three major components: Gazelle, OAP MLlib, and CloudTik (a new addition). In this release, 59 features/improvements were committed to Gazelle, while OAP MLlib development is paused to focus more on oneDAL. CloudTik is a cloud-scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, and GCP. It enables users and enterprises to easily create and manage an analytics and AI platform on public clouds with out-of-the-box optimized functionality and performance, so they can focus on running their business workloads within minutes or hours instead of spending months building and tuning the platform.

Here are the major features/improvements in OAP 1.4.0.

Gazelle

  • Reach 1.6X overall performance vs. vanilla Spark on the 103 TPC-DS queries with a 5 TB dataset on ICX clusters
  • Reach 1.8X overall performance vs. vanilla Spark on the 22 TPC-H queries with a 5 TB dataset on ICX clusters
  • Optimize shuffle by splitting each column by reducer and allocating a large block of memory shared by all reducers
  • Optimize Columnar2Row and Row2Columnar performance
  • Add support for 10+ expressions
  • Pack all classes into one single jar (see the configuration sketch below)
  • Bug fixes for WholeStage Codegen on unsupported patterns, sort spill, etc.
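
As context for the single-jar packaging above, here is a minimal, illustrative Scala sketch of how the plugin is activated in an application. It assumes the packed Gazelle jar is already on the driver and executor classpath (for example, supplied at spark-submit time); the off-heap sizes are placeholders, and all configuration keys should be confirmed against the Gazelle user guide for your Spark version.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, not taken from this release's docs: activate the Gazelle
// plugin from application code. The single packed jar from this release is
// assumed to already be on the driver/executor classpath; the off-heap
// memory size below is illustrative only.
val spark = SparkSession.builder()
  .appName("gazelle-sketch")
  .config("spark.plugins", "com.intel.oap.GazellePlugin")
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  // Gazelle keeps columnar batches in Arrow off-heap memory.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "20g")
  .getOrCreate()

// Simple probe query; real workloads (e.g., TPC-DS/TPC-H) run unchanged.
spark.sql("SELECT 1 AS probe").show()
```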

CloudTik

  • A scalable, robust, powerful, and unified control plane for cloud cluster and runtime management
  • Support for the three major public cloud providers (AWS, GCP, and Azure) with managed cloud storage, plus an on-premise mode
  • Support for cloud workspaces to manage shared cloud resources, including VPCs and subnets, cloud storage, firewalls, identities/roles, and so on
  • Out-of-the-box runtimes including Spark, Presto/Trino, HDFS, Metastore, Kafka, ZooKeeper, and Ganglia
  • Integration with OAP (Gazelle and MLlib) and various tools for running benchmarks on CloudTik

OAP MLlib

  • Reach over 11X performance vs. vanilla Spark with PCA, Linear Regression, and Ridge Regression on ICX clusters
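
The numbers above are measured on standard Spark MLlib workloads; OAP MLlib is intended to accelerate them without application-code changes, with the acceleration enabled at deployment time by adding the OAP MLlib jars as described in its user guide. A minimal, plain Spark MLlib PCA job of the kind being measured looks like the following sketch (nothing in it is OAP-specific):

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Plain Spark MLlib PCA. With OAP MLlib on the classpath, the same code is
// expected to run on the accelerated backend; the data here is toy-sized.
val spark = SparkSession.builder().appName("pca-sketch").getOrCreate()

val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(6.0, 1.0, 3.0, 8.0, 9.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Fit a 3-component PCA model and project the input vectors.
val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

model.transform(df).select("pcaFeatures").show(truncate = false)
```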

Changelog

Gazelle

Features

#781 Add Spark eventlog analyzer for advanced analysis
#927 Column2Row further enhancement
#913 Add Hadoop 3.3 profile to pom.xml
#869 implement first agg function
#926 Support UDF URLDecoder
#856 [SHUFFLE] Manual split of variable-length buffers (String-like)
#886 Add pmod function support
#855 [SHUFFLE] HugePage support in shuffle
#872 implement replace function
#867 Add substring_index function support
#818 Support length, char_length, locate, regexp_extract
#864 Enable native parquet write by default
#828 CoalesceBatches native implementation
#800 Combine datasource and columnar core jar

Performance

#848 Optimize Columnar2Row performance
#943 Optimize Row2Columnar performance
#854 Enable skipping columnarWSCG for queries with small shuffle size
#857 [SHUFFLE] split by reducer by column

Bugs Fixed

#827 Github action is broken
#987 TPC-H q7, q8, q9 run failed when using String for Date
#892 Q47 and Q57 failed on Ubuntu 20.04 without OpenJDK
#784 Improve Sort Spill
#788 Spark UT of "randomSplit on reordered partitions" encountered "Invalid: Map array child array should have no nulls" issue
#821 Improve Wholestage Codegen check
#831 Support more expression types in getting attribute
#876 Write arrow hang with OutputWriter.path
#891 Spark executor lost while DatasetFileWriter failed with speculation
#909 "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y LIMIT 2" drains 4 rows into table x using Arrow write extension
#889 Failed to write with ParquetFileFormat while using ArrowWriteExtension
#910 TPCDS failed, segfault caused by PR903
#852 Unit test fix for NSE-843
#843 ArrowDataSource: Arrow dataset inspect() is called every time a file is read

PRs

#1005 [NSE-800] Fix an assembly warning
#1002 [NSE-800] Pack the classes into one single jar
#988 [NSE-987] fix string date
#977 [NSE-126] set default codegen opt to O1
#975 [NSE-927] Add macro AVX512BW check for different CPU architecture
#962 [NSE-359] disable unit tests on spark32 package
#966 [NSE-913] Add support for Hadoop 3.3.1 when packaging
#936 [NSE-943] Optimize IsNULL() function for Row2Columnar
#937 [NSE-927] Implement AVX512 optimization selection in Runtime and merge two C2R code files into one.
#951 [DNM] update sparklog
#938 [NSE-581] implement rlike/regexp_like
#946 [DNM] update on sparklog script
#939 [NSE-581] adding ShortType/FloatType in ColumnarLiteral
#934 [NSE-927] Extract and inline functions for native ColumnartoRow
#933 [NSE-581] Improve GetArrayItem(Split()) performance
#922 [NSE-912] Remove extra handleSafe costs
#925 [NSE-926] Support a UDF: URLDecoder
#924 [NSE-927] Enable AVX512 in Binary length calculation for native ColumnartoRow
#918 [NSE-856] Optimize of string/binary split
#908 [NSE-848] Optimize performance for Column2Row
#900 [NSE-869] Add 'first' agg function support
#917 [NSE-886] Add pmod expression support
#916 [NSE-909] fix slow test
#915 [NSE-857] Further optimizations of validity buffer split
#912 [NSE-909] "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y L…
#896 [NSE-889] Failed to write with ParquetFileFormat while using ArrowWriteExtension
#911 [NSE-910] fix bug of PR903
#901 [NSE-891] Spark executor lost while DatasetFileWriter failed with speculation
#907 [NSE-857] split validity buffer by reducer
#902 [NSE-892] Allow to use jar cmd not in PATH
#898 [NSE-867][FOLLOWUP] Add substring_index function support
#894 [NSE-855] allocate large block of memory for all reducer #881
#880 [NSE-857] Fill destination buffer by reducer
#839 [DNM] some optimizations to shuffle's split function
#879 [NSE-878] WIP get phyplan bugfix
#877 [NSE-876] Fix writing arrow hang with OutputWriter.path
#873 [NSE-872] implement replace function
#850 [NSE-854] Small Shuffle Size disable wholestagecodegen
#868 [NSE-867] Add substring_index function support
#847 [NSE-818] Support length, char_length, locate & regexp_extract
#865 [NSE-864] Enable native parquet write by default
#811 [NSE-810] disable codegen for SMJ with local limit
#860 remove sensitive info from physical plan
#853 [NSE-852] Unit test fix for NSE-843
#844 [NSE-843] ArrowDataSource: Arrow dataset inspect() is called every tim…
#842 fix in eventlog script
#841 fix bug of script
#829 [NSE-828] Add native CoalesceBatches implementation
#830 [NSE-831] Support more expression types in getting attribute
#815 [NSE-610] Shrink hashmap to use less memory
#822 [NSE-821] Fix Wholestage Codegen on unsupported pattern
#824 [NSE-823] Use SPARK_VERSION_SHORT instead of SPARK_VERSION to find SparkShims
#826 [NSE-827] fix GHA
#819 [DNM] complete sparklog script
#802 [NSE-794] Fix count() with decimal value
#801 [NSE-786] Adding docs for shim layers
#790 [NSE-781]Add eventlog analyzer tool
#789 [NSE-788] Quick fix for randomSplit on reordered partitions
#780 [NSE-784] fallback Sort after SortHashAgg

OAP MLlib

Performance

#204 Intel-MLlib requires more memory to run the Bayes algorithm

PRs

#208 [ML-204][NaiveBayes] Remove cache from NaiveBayes