Release v1.2.0 · oap-project/oap-tools

OAP 1.2.0 is the second release since we reorganized the source code and moved it to the dedicated oap-project organization (https://github.com/oap-project). In this release, 36 issues/improvements were committed. We also completed the first round of Cloud integration evaluation for Native SQL Engine, OAP MLlib and SQL Data Source Cache on different Cloud platforms, including AWS EMR, GCP Dataproc and AWS EKS. The performance data is not promising, and 5 Cloud integration functionality and performance issues were postponed. In addition, we released two other dedicated Conda packages for EMR and Dataproc Cloud integration.

Here are the major features/improvements in OAP 1.2.0:

SQL Data Source Cache adds K8S support; a deployment limitation related to Unix domain sockets remains.
Native SQL Engine is further optimized, gaining 25% overall performance across the 103 TPC-DS queries; adds RDD Cache support; adds Spill and UDF support; completes the Column to Row optimization feature; and further improves performance stability.
OAP MLlib further improves the performance stability of KMeans, PCA and ALS; adds CPU optimizations for the Linear Regression and Naive Bayes algorithms, with over 10x training performance gain for Linear Regression; and completes KMeans and PCA algorithm support for GPU.
PMem Shuffle adds a Remote PMem pool feature (experimental).

Gazelle Plugin

Features


#394	Support ColumnarArrowEvalPython operator
#368	Encountered Hadoop version (3.2.1) conflict issue on AWS EMR-6.3.0
#375	Implement a series of datetime functions
#183	Add Date/Timestamp type support
#362	make arrow-unsafe allocator as the default
#343	configurable codegen opt level
#333	Arrow Data Source: CSV format support fix
#223	Add Parquet write support to Arrow data source
#320	Add build option to enable unsafe Arrow allocator
#337	UDF: Add test case for validating basic row-based udf
#326	Update Scala unit test to spark-3.1.1

Performance


#400	Optimize ColumnarToRow Operator in NSE.
#411	enable ccache on C++ code compiling

Bugs Fixed


#358	Running TPC DS all queries with native-sql-engine for 10 rounds will have performance degradation problems in the last few rounds
#481	JVM heap memory leak on memory leak tracker facilities
#436	Fix for Arrow Data Source test suite
#317	persistent memory cache issue
#382	Hadoop version conflict when supporting to use gazelle_plugin on Google Cloud Dataproc
#384	ColumnarBatchScanExec reading parquet failed on java.lang.IllegalArgumentException: not all nodes and buffers were consumed
#370	Failed to get time zone: NoSuchElementException: None.get
#360	Cannot compile master branch.
#341	build failed on v2 with -Phadoop-3.2

PRs


#489	[NSE-481] JVM heap memory leak on memory leak tracker facilities (Arrow Allocator)
#486	[NSE-475] restore coalescebatches operator before window
#482	[NSE-481] JVM heap memory leak on memory leak tracker facilities
#470	[NSE-469] Lazy Read: Iterator objects are not correctly released
#464	[NSE-460] fix decimal partial sum in 1.2 branch
#439	[NSE-433]Support pre-built Jemalloc
#453	[NSE-254] remove arrow-data-source-common from jar with dependency
#452	[NSE-254]Fix redundant arrow library issue.
#432	[NSE-429] TPC-DS Q14a/b get slowed down within setting spark.oap.sql.columnar.sortmergejoin.lazyread=true
#426	[NSE-207] Fix aggregate and refresh UT test script
#442	[NSE-254]Issue0410 jar size
#441	[NSE-254]Issue0410 jar size
#440	[NSE-254]Solve the redundant arrow library issue
#437	[NSE-436] Fix for Arrow Data Source test suite
#387	[NSE-383] Release SMJ input data immediately after being used
#423	[NSE-417] fix sort spill on inplace sort
#416	[NSE-207] fix left/right outer join in SMJ
#422	[NSE-421]Disable the wholestagecodegen feature for the ArrowColumnarToRow operator
#369	[NSE-417] Sort spill support framework
#401	[NSE-400] Optimize ColumnarToRow Operator in NSE.
#413	[NSE-411] adding ccache support
#393	[NSE-207] fix scala unit tests
#407	[NSE-403]Add Dataproc integration section to README
#406	[NSE-404]Modify repo name in documents
#402	[NSE-368]Update emr-6.3.0 support
#395	[NSE-394]Support ColumnarArrowEvalPython operator
#346	[NSE-317]fix columnar cache
#392	[NSE-382]Support GCP Dataproc 2.0
#388	[NSE-382]Fix Hadoop version issue
#385	[NSE-384] "Select count(*)" without group by results in error: java.lang.IllegalArgumentException: not all nodes and buffers were consumed
#374	[NSE-207] fix left anti join and support filter wo/ project
#376	[NSE-375] Implement a series of datetime functions
#373	[NSE-183] fix timestamp in native side
#356	[NSE-207] fix issues found in scala unit tests
#371	[NSE-370] Failed to get time zone: NoSuchElementException: None.get
#347	[NSE-183] Add Date/Timestamp type support
#363	[NSE-362] use arrow-unsafe allocator by default
#361	[NSE-273] Spark shim layer infrastructure
#364	[NSE-360] fix ut compile and travis test
#264	[NSE-207] fix issues found from join unit tests
#344	[NSE-343]allow to config codegen opt level
#342	[NSE-341] fix maven build failure
#324	[NSE-223] Add Parquet write support to Arrow data source
#321	[NSE-320] Add build option to enable unsafe Arrow allocator
#299	[NSE-207] fix unsupported types in aggregate
#338	[NSE-337] UDF: Add test case for validating basic row-based udf
#336	[NSE-333] Arrow Data Source: CSV format support fix
#327	[NSE-326] update scala unit tests to spark-3.1.1

OAP MLlib

Features


#110	Update isOAPEnabled for Kmeans, PCA & ALS
#108	Update PCA GPU, LiR CPU and Improve JAR packaging and libs loading
#93	[GPU] Add GPU support for PCA
#101	[Release] Add version update scripts and improve scripts for examples
#76	Reorganize Spark version specific code structure
#82	[Tests] Add NaiveBayes test and refactors

Bugs Fixed


#119	[SDLe][Klocwork] Security vulnerabilities found by static code scan
#121	Meeting freeing memory issue after the training stage when using Intel-MLlib to run PCA and K-means algorithms.
#122	Cannot run K-means and PCA algorithm with oap-mllib on Google Dataproc
#123	[Core] Improve locality handling for native lib loading
#116	Cannot run ALS algorithm with oap-mllib thanks to the commit "2883d3447d07feb55bf5d4fee8225d74b0b1e2b1"
#114	[Core] Improve native lib loading
#94	Failed to run KMeans workload with oap-mllib in JLSE
#95	Some shared libs are missing in 1.1.1 release
#105	[Core] crash when libfabric version conflict
#98	[SDLe][Klocwork] Security vulnerabilities found by static code scan
#88	[Test] Fix ALS Suite "ALS shuffle cleanup standalone"
#86	[NaiveBayes] Fix isOAPEnabled and add multi-version support

PRs


#124	[ML-123][Core] Improve locality handling for native lib loading
#118	[ML-116] use getOneCCLIPPort and fix lib loading
#115	[ML-114] [Core] Improve native lib loading
#113	[ML-110] Update isOAPEnabled for Kmeans, PCA & ALS
#112	[ML-105][Core] Fix crash when libfabric version conflict
#111	[ML-108] Update PCA GPU, LiR CPU and Improve JAR packaging and libs loading
#104	[ML-93][GPU] Add GPU support for PCA
#103	[ML-98] [Release] Clean Service.java code
#102	[ML-101] [Release] Add version update scripts and improve scripts for examples
#90	[ML-88][Test] Fix ALS Suite "ALS shuffle cleanup standalone"
#87	[ML-86][NaiveBayes] Fix isOAPEnabled and add multi-version support
#83	[ML-82] [Tests] Add NaiveBayes test and refactors
#75	[ML-53] [CPU] Add Linear & Ridge Regression
#77	[ML-76] Reorganize multiple Spark version support code structure
#68	[ML-55] [CPU] Add Naive Bayes
#64	[ML-42] [PIP] Misc improvements and refactor code
#62	[ML-30][Coding Style] Add code style rules & scripts for Scala, Java and C++

SQL DS Cache

Features


#155	reorg to support profile based multi spark version

Bugs Fixed


#190	The function of vmem-cache and guava-cache should not be associated with arrow.
#181	[SDLe]Vulnerabilities scanned by Snyk

PRs


#182	[SQL-DS-CACHE-181][SDLe]Fix Snyk code scan issues
#191	[SQL-DS-CACHE-190]put plasma detector in separate object to avoid unnecessary dependency of arrow
#189	[SQL-DS-CACHE-188][POAE7-1253] improvement of fallback from plasma cache to simple cache
#157	[SQL-DS-CACHE-155][POAE7-1187]reorg to support profile based multi spark version

PMem Shuffle

Bugs Fixed


#46	Cannot run Terasort with pmem-shuffle of branch-1.2
#43	Rpmp cannot be compiled due to the lack of boost header file.

PRs


#51	[PMEM-SHUFFLE-50] Remove description about download submodules manually since they can be downloaded automatically.
#49	[PMEM-SHUFFLE-48] Fix the bug about mapstatus tracking and add more connections for metastore.
#47	[PMEM-SHUFFLE-46] Fix the bug that off-heap memory is over used in shuffle reduce stage.
#40	[PMEM-SHUFFLE-39] Fix the bug that pmem-shuffle without RPMP fails to pass Terasort benchmark due to latest patch.
#38	[PMEM-SHUFFLE-37] Add start-rpmp.sh and stop-rpmp.sh
#33	[PMEM-SHUFFLE-28]Add RPMP with HA support and integrate it with Spark3.1.1
#27	[PMEM-SHUFFLE] Change artifact name to make it compatible with naming…

Remote Shuffle

Bugs Fixed


#24	Enhance executor memory release

PRs


#25	[REMOTE-SHUFFLE-24] Enhance executor memory release
