Download Spark 3.0.0 from the Apache archive:
wget http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xf ./spark-3.0.0-bin-hadoop2.7.tgz
export SPARK_HOME=`pwd`/spark-3.0.0-bin-hadoop2.7
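As a quick optional check, confirm that the Spark binaries unpacked correctly and that SPARK_HOME points at them:
$SPARK_HOME/bin/spark-submit --version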
We have provided a Conda package that automatically installs the dependencies needed by OAP; refer to OAP-Installation-Guide for more information. Once it finishes, the Arrow 0.17.0 dependencies are installed by Conda, and the compiled spark-columnar-core and spark-arrow-datasource jars are placed in $HOME/miniconda2/envs/oapenv/oap_jars/.
When you have finished OAP-Installation-Guide, you can jump straight to Spark Configurations for Native SQL Engine.
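If you installed through Conda, you can quickly confirm the jars are in place by listing the oap_jars directory (the jar names in the comments are taken from the classpath settings later in this guide and may differ between releases):
ls $HOME/miniconda2/envs/oapenv/oap_jars/
# expect something like spark-columnar-core-<version>-jar-with-dependencies.jar
# and spark-arrow-datasource-<version>-jar-with-dependencies.jar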
Step 1. Install Arrow 0.17.0 dependencies
git clone https://github.com/intel-bigdata/arrow && cd arrow && git checkout branch-0.17.0-oap-0.9
Edit ci/conda_env_gandiva.yml so that the compiler toolchain is pinned to version 7:
clangdev=7
llvmdev=7
conda create -y -n pyarrow-dev -c conda-forge \
--file ci/conda_env_unix.yml \
--file ci/conda_env_cpp.yml \
--file ci/conda_env_python.yml \
--file ci/conda_env_gandiva.yml \
compilers \
python=3.7
conda activate pyarrow-dev
Step 2. Install Arrow 0.17.0
Please refer to the Apache Arrow Installation doc to install Apache Arrow and Gandiva.
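The linked doc is the authoritative reference; as a rough sketch only, the C++ build from the Arrow source tree checked out in Step 1 typically looks like the following. The CMake flag set below is an assumption (it enables Gandiva, Parquet and HDFS support), so take the exact list from the Apache Arrow Installation doc:
cd cpp
mkdir -p release && cd release
# flag set is an assumption; follow the Apache Arrow Installation doc for the exact list
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DARROW_GANDIVA=ON \
         -DARROW_GANDIVA_JAVA=ON \
         -DARROW_PARQUET=ON \
         -DARROW_HDFS=ON \
         -DARROW_FILESYSTEM=ON \
         -DARROW_JNI=ON \
         -DARROW_BUILD_SHARED=ON
make -j
sudo make install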
yum install gtest-devel
yum install gmock
git clone https://github.com/Intel-bigdata/OAP.git
cd OAP && git checkout branch-0.17.0-oap-0.9
cd oap-native-sql
cd cpp/
mkdir build/
cd build/
cmake .. -DTESTS=ON
make -j
# when deploying on multiple nodes, make sure libhdfs.so and libprotobuf.so.13 are copied to all nodes
cd ../../core/
mvn clean package -DskipTests
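After the Maven build completes, the plugin jar should be under the core module's target directory. A quick check (the 0.9.0 version in the comment is an assumption, chosen to match the classpath entries below):
ls target/*jar-with-dependencies.jar
# e.g. target/spark-columnar-core-0.9.0-jar-with-dependencies.jar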
Add the configuration below to spark-defaults.conf:
##### Columnar Process Configuration
spark.sql.sources.useV1SourceList avro
spark.sql.join.preferSortMergeJoin false
spark.sql.extensions com.intel.oap.ColumnarPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
# note: Native SQL Engine depends on the Arrow data source
spark.driver.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-0.9.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-0.9.0-jar-with-dependencies.jar
spark.executor.extraClassPath $HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-0.9.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-0.9.0-jar-with-dependencies.jar
######
For more about spark-arrow-datasource.jar, you can refer to Unified Arrow Data Source.
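If you prefer not to edit spark-defaults.conf, the same settings can be passed when launching a Spark shell. A minimal sketch (the jar paths and 0.9.0 versions are assumptions, adjust them to your environment; the driver classpath is given via --driver-class-path because spark.driver.extraClassPath does not take effect once the driver JVM has already started in client mode):
$SPARK_HOME/bin/spark-shell \
  --driver-class-path "$HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-0.9.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-0.9.0-jar-with-dependencies.jar" \
  --conf spark.executor.extraClassPath="$HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-0.9.0-jar-with-dependencies.jar:$HOME/miniconda2/envs/oapenv/oap_jars/spark-arrow-datasource-0.9.0-jar-with-dependencies.jar" \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.sql.sources.useV1SourceList=avro \
  --conf spark.sql.join.preferSortMergeJoin=false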
Here is one example to verify that Native SQL Engine works; make sure you have the TPC-H dataset available. We do a simple projection on one Parquet table. For detailed testing scripts, please refer to the solution guide.
val orders = spark.read.format("arrow").load("hdfs:////user/root/date_tpch_10/orders")
orders.createOrReplaceTempView("orders")
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)
The result should show up on the Spark console, and a similar diagram should appear on the SQL page of the Spark history server:
For an initial microbenchmark of performance, we sum 10 fields with Spark over a 200 GB dataset.
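The benchmark query itself is not reproduced here; as an illustration only, a query of that shape looks roughly like the sketch below (the table path and the field names f1 through f10 are placeholders, not the actual benchmark schema):
// hypothetical 10-field sum; the path and column names f1..f10 are placeholders
val df = spark.read.format("arrow").load("hdfs:///path/to/benchmark_table")
df.createOrReplaceTempView("bench")
spark.sql("select f1 + f2 + f3 + f4 + f5 + f6 + f7 + f8 + f9 + f10 as total from bench").show(20, false)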
- For Java code, we use google-java-format
- For Scala code, we use the Spark Scala format; please use scalafmt or run ./scalafmt to format Scala code
- For C++ code, we use Clang-Format; see google-vim-codefmt for details.