Secco (Separate communication from computation) is a distributed SQL system for SQL queries, graph analytics, and subgraph queries.
You need to install Spark 2.4.5 and Hadoop 2.7.2 on your cluster.
/datasets - toy datasets, plus a folder template for the datasets of the synthetic workload experiment
/project - project-related configuration files
/script - scripts for running and testing Secco.
/src
src/main - source files
src/main/resources - configuration files for Secco
src/main/scala - scala source files
org/apache/spark/secco: main package
org/apache/spark/secco/analysis - analyzer
org/apache/spark/secco/benchmark - benchmark & testing
org/apache/spark/secco/catalog - catalog of database
org/apache/spark/secco/config - configurations
org/apache/spark/secco/execution - physical plans & planner
org/apache/spark/secco/expression - expressions
org/apache/spark/secco/optimization - logical plans & optimizer
org/apache/spark/secco/parsing - parser
org/apache/spark/secco/trees - tree structure used in the optimizer framework
org/apache/spark/secco/types - types
org/apache/spark/secco/utils - utility
src/test - unit test files
src/test/resource - configuration files for Secco in unit tests
src/test/scala - scala unit test files
src/test/scala/integration - integration test
src/test/scala/playground - playground for testing new functions
src/test/scala/unit - unit test
src/test/scala/util - utility for testing
Secco-assembly-0.1.jar - compiled jar package of Secco
You can import the source code of the Secco project using JetBrains IntelliJ IDEA.
The main object to manipulate in Secco is Dataset, which is just like the Dataset in SparkSQL. Dataset defines relational algebra operators (e.g., select, project, join) that transform the dataset.
The main entry point of Secco is SeccoSession, where you can create a Dataset, register a Dataset in the Catalog, get a Dataset from the Catalog, and issue SQL queries.
An example is shown below.
// Obtain SeccoSession via singleton.
val dlSession = SeccoSession.currentSession
// Create datasets.
val seq1 = Seq(Array(1.0, 2.0), Array(2.0, 2.0))
val tableName = "R1"
val schema = Seq("A", "B")
val ds1 =
  dlSession.createDatasetFromSeq(seq1, Some(tableName), Some(schema))
// Construct RA expression via relational algebra like API.
val ds2 = ds1.select("A < B")
// Explain the query execution of ds1 and ds2. It will show parsed plan, analyzed plan, optimized plan, execution plan.
ds1.explain()
ds2.explain()
For more usage, please check the classes org.apache.spark.secco.SeccoSession and org.apache.spark.secco.Dataset, which contain comments to guide you in using the system. We recommend using the Dataset API instead of the SQL API, as the SQL API currently has some bugs and is disabled for now.
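To make the meaning of ds1.select("A < B") from the example above concrete, here is a minimal sketch that models the same selection with plain Scala collections. It deliberately does not call Secco: SelectSketch and selectLessThan are hypothetical names introduced for illustration, and the sketch assumes that select follows the usual relational algebra semantics of keeping only the rows that satisfy the predicate.

```scala
object SelectSketch {
  // Column names, mirroring the schema Seq("A", "B") used in the example above.
  val schema: Seq[String] = Seq("A", "B")

  /** Keeps the rows whose `left` column is strictly less than their `right` column.
    * Hypothetical helper: it models the selection, it is not part of Secco.
    */
  def selectLessThan(
      rows: Seq[Array[Double]],
      left: String,
      right: String): Seq[Array[Double]] = {
    val l = schema.indexOf(left)
    val r = schema.indexOf(right)
    require(l >= 0 && r >= 0, s"unknown column: $left or $right")
    rows.filter(row => row(l) < row(r))
  }

  def main(args: Array[String]): Unit = {
    val seq1 = Seq(Array(1.0, 2.0), Array(2.0, 2.0))
    val kept = selectLessThan(seq1, "A", "B")
    // Only the first row satisfies A < B, so this prints: 1.0, 2.0
    kept.foreach(row => println(row.mkString(", ")))
  }
}
```

Under that assumption, ds2 in the example would contain only the row (1.0, 2.0), since the second row has A equal to B.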
To reproduce the experiments mentioned in the paper, we have prepared a compiled jar package and scripts. You can follow the guide below to reproduce the results.
To download the real datasets used in the paper:
- For WB, AS, LJ, OK, go to https://snap.stanford.edu/data/index.html
- For UK, go to http://law.di.unimi.it/datasets.php
- For TW, go to https://anlab-kaist.github.io/traces/WWW2010
- For IMDB, go to https://www.imdb.com
To generate the synthetic datasets needed for the workload experiment:
- Install SBT.
- Execute SBT.
- In the SBT shell, execute testOnly *SyntheticDatasetsSuite
- The generated synthetic datasets will be in ./datasets
We have prepared three demo datasets, debugData, imdb (demo version), and wiki, in ./datasets
You need to do some preprocessing on the raw datasets.
- For UK, you need to convert it from WebGraph format into edge list format first. Please follow the instructions at https://github.com/helgeho/HadoopWebGraph.
- For the edge lists of WB, AS, LJ, OK, UK, TW, you need to name the original file rawData and prepare an undirected version of the graph named undirected, which will be used in the subgraph query experiment.
- For IMDB, it needs to be preprocessed with the imdbpy3 package, which can be downloaded from https://bitbucket.org/alberanid/imdbpy/get/5.0.zip
- After you have prepared all datasets, put them all in HDFS.
- For all relations of IMDB, you need to put them under a folder named imdb
- For all relations (i.e., directed and undirected) of a graph dataset (e.g., WB), you need to put them under a folder (e.g., wb). Please name the folders of the graph datasets WB, AS, LJ, OK, UK, TW as wb, as, soc-lj, ok, uk, tw respectively.
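The undirected version of a graph mentioned above can be produced in several ways. The sketch below is one possibility, not the authors' preprocessing tool. It makes several assumptions that are not stated in this README: each input line holds two whitespace-separated integer vertex ids, lines starting with # are comments (as in SNAP downloads), "undirected" means that every edge is stored in both directions without duplicates, and the output is space-separated. UndirectedEdgeList, parseEdge, and toUndirected are hypothetical names. It also holds the whole graph in memory, so it only suits small graphs; the larger datasets would need a distributed job.

```scala
import java.io.PrintWriter
import scala.io.Source
import scala.util.Try

object UndirectedEdgeList {

  /** Parses one edge-list line; returns None for blank, '#' comment, or malformed lines. */
  def parseEdge(line: String): Option[(Long, Long)] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else
      trimmed.split("\\s+") match {
        case Array(src, dst, _*) => Try((src.toLong, dst.toLong)).toOption
        case _                   => None
      }
  }

  /** Emits (u, v) and (v, u) for every input edge, deduplicated and sorted. */
  def toUndirected(lines: Iterator[String]): Seq[(Long, Long)] =
    lines
      .flatMap(line => parseEdge(line).toList)
      .flatMap { case (u, v) => Seq((u, v), (v, u)) }
      .toVector
      .distinct
      .sorted

  /** Usage (paths are placeholders): UndirectedEdgeList <rawData file> <undirected file> */
  def main(args: Array[String]): Unit = {
    require(args.length == 2, "expected: <input edge list> <output edge list>")
    val source = Source.fromFile(args(0))
    val writer = new PrintWriter(args(1))
    try {
      // Assumption: one space-separated edge per output line.
      toUndirected(source.getLines()).foreach { case (u, v) => writer.println(s"$u $v") }
    } finally {
      writer.close()
      source.close()
    }
  }
}
```

Check the delimiter and the treatment of self-loops against what the Secco loaders expect before using the output in the subgraph query experiment.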
There are several scripts included in the "/script" folder to help you run Secco in a distributed environment.
runSpark-yarn.sh: script for submitting spark program to yarn
upload.sh: script for uploading relevant jar packages and datasets to the remote cluster
test.sh: script that contains the tests in the paper
To run the scripts correctly, you need to modify them based on your own computer's and cluster's settings.
- Put the files you want to upload to the cluster (e.g., datasets) under script/upload, and modify upload.sh by replacing Cluster with your own cluster's folder address. The compiled jar file (Secco-assembly-0.1.jar) will be uploaded by default.
- Modify test.sh by assigning DataLocation (e.g., XXX/dataset) the location where you stored the datasets in HDFS.
- Modify runSpark-yarn.sh by replacing $SPARK_HOME with your own Spark installation address.
We've prepared a compiled jar package, Secco-assembly-0.1.jar, so that you can test Secco without importing the project and compiling it.
To run the experiments in the paper:
- On local machine, cd to the folder that contains Secco
- On local machine, execute script/upload.sh to upload the jar and datasets
- In cluster, execute test.sh with selective commands uncommented.
  - For Subgraph Query, you need to uncomment SimpleSubgraphQueryJob and SimpleSubgraphQueryJob in test.sh
  - For SQL Query, you need to uncomment ComplexOLAPQueryJob in test.sh
  - For Graph Analytic Query, you need to uncomment SimpleGraphAnalyticJob and ComplexGraphAnalyticJob in test.sh
  - For Workload Experiment Query,
    - you need to uncomment WorkloadExpJob in test.sh
    - you need to modify the configuration file /Secco/src/main/resources/reference.conf, setting
      secco.optimizer.estimator = Histogram // select from "Exact", "Naive", "Histogram"
      secco.optimizer.exact_cardinality_mode = provided // select from "provided" and "computed"
      secco.optimizer.enable_only_decouple_optimization = true // if only enable decoupled related optimizations
      secco.optimizer.enable_early_aggregation_optimization = false // if enable early aggregate optimizations
    - recompile the Secco project by typing sbt assembly in the root folder of Secco
    - upload the recompiled jar to the cluster.