NaturalSym, proposed in "Natural Symbolic Execution-based Testing for Big Data Analytics" (FSE24), is a symbolic execution-based tool that generates natural tests for big data analytics. To replicate our evaluation, please go to https://zenodo.org/records/11090237.
Users can download NaturalSym from GitHub. Ubuntu >= 20.04 with Python >= 3.12 is recommended.
> apt-get install python3.12 python3-pip ant -y
> pip3 install numpy scipy
> git clone https://github.com/UCLA-SEAL/NaturalSym.git
> cd NaturalSym
NaturalSym is a test generator for Scala-based Apache Spark programs. To use NaturalSym, users need to provide a Scala program under test and an input annotation file, e.g. grades.scala and grades.config.
- grades.scala contains a method def execute(input1: RDD[String], input2: RDD[String]), which takes two input tables representing students' math and physics scores, respectively, and filters out students with a failing total score.
- grades.config contains user knowledge about the program input. Each line in grades.config corresponds to one input table in grades.scala, e.g. input1 := Discrete("alice","bob") | scipy.binom(100, 0.1). This annotation means that the user gives two examples for the first column (student name), "alice" and "bob", and that the second column (math score) should follow the binomial distribution binom(n=100, p=0.1).
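To make the second-column annotation concrete, the sketch below computes the mean and standard deviation implied by binom(n=100, p=0.1) using the standard closed-form formulas. It only illustrates what the annotation says about plausible math scores. It is not part of NaturalSym, and the helper name binomial_summary is hypothetical.

```python
import math


def binomial_summary(n: int, p: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of a Binomial(n, p) distribution."""
    mean = n * p                        # expected number of successes
    std = math.sqrt(n * p * (1.0 - p))  # square root of the variance n*p*(1-p)
    return mean, std


# Parameters taken from the example annotation scipy.binom(100, 0.1).
mean, std = binomial_summary(100, 0.1)
print(f"mean={mean:.1f}, std={std:.1f}")
```

Under this annotation, math scores cluster around 10 with a spread of about 3, so a small score is a natural value for that column.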
To run NaturalSym, try:
> ./naturalsym.sh grades.scala grades.config
You'll see the generated tests both in the console and in grades.tests. The output snippet below shows the test generated for the first condition: a student's math grade and physics grade can both be found in the two tables, and their sum is smaller than 60.
Generated tests for Path1 in Rundir/1.smt2
input1.csv
alice,8
input2.csv
alice,34
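The generated rows can be checked by hand against the condition described above. The sketch below is a plain-Python analogue, not the Scala program itself: it joins the two tables on the student name and tests whether the summed score is below 60. The threshold and the two rows come from the text above. The function name satisfies_path1 and the assumption that every row has the form "name,score" are illustrative.

```python
def satisfies_path1(input1_rows, input2_rows, threshold=60):
    """Return names found in both tables whose summed score is below threshold."""
    math_scores = {}
    for row in input1_rows:
        name, score = row.split(",")
        math_scores[name] = int(score)

    matches = []
    for row in input2_rows:
        name, score = row.split(",")
        # The student must appear in both tables and the total must be below the threshold.
        if name in math_scores and math_scores[name] + int(score) < threshold:
            matches.append(name)
    return matches


# Rows from the generated test: 8 + 34 = 42, which is below 60.
print(satisfies_path1(["alice,8"], ["alice,34"]))
```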
Under the root folder, please execute
> ./naturalsym <target.scala> <target.config>
You'll see the output both in the console and in <target.tests>.
- Limited by the back-end symbolic execution engine, the target method must be named execute and its input arguments must be of type RDD[String]. Please see our template in template.scala.
- The configuration file format is shown below. In general, each line of the configuration file declares user knowledge about the columns of one input table, with columns delimited by "|". User knowledge about a column can be none, a list of examples, a Gaussian distribution, a uniform distribution, or any distribution supported by the Python scipy library.
<config> := "" | <config> "\n" <tab>
<tab> := <tab-name> ":=" <cols>
<cols> := "" | <cols> "|" <col>
<col> := <none> | <examples> | <uniform> | <gaussian> | <trunc-gaussian> | <scipy-distr>
<none> := ""
<examples> := Discrete(<examples-delimited-by-comma>)
<uniform> := Uniform(<l-bound>,<r-bound>)
<gaussian> := Gaussian(<mu>,<sigma>)
<trunc-gaussian> := <l-bound><=Gaussian(<mu>,<sigma>)<=<r-bound>
<scipy-distr> := scipy.<distr-name>(<parameter-list>)
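As a reading aid for the grammar above, the following sketch splits one configuration line into its table name and classifies each column annotation by kind. It is a minimal illustration, not NaturalSym's own parser: the function names and labels are hypothetical, and it assumes that "|" never appears inside a quoted example.

```python
import re

# One pattern per <col> alternative in the grammar; labels are illustrative.
# The truncated Gaussian is checked first because it also contains "Gaussian(".
_COLUMN_PATTERNS = [
    ("trunc-gaussian", re.compile(r"^.+<=\s*Gaussian\(.*\)\s*<=.+$")),
    ("examples", re.compile(r"^Discrete\(.*\)$")),
    ("uniform", re.compile(r"^Uniform\(.*\)$")),
    ("gaussian", re.compile(r"^Gaussian\(.*\)$")),
    ("scipy-distr", re.compile(r"^scipy\.\w+\(.*\)$")),
]


def classify_column(col: str) -> str:
    """Map one column annotation to the grammar alternative it matches."""
    col = col.strip()
    if not col:
        return "none"  # an empty column means no user knowledge
    for label, pattern in _COLUMN_PATTERNS:
        if pattern.match(col):
            return label
    raise ValueError(f"unrecognized column annotation: {col!r}")


def parse_table_line(line: str):
    """Split '<tab-name> := <cols>' into (name, [kind of each column])."""
    name, sep, cols = line.partition(":=")
    if not sep:
        raise ValueError("expected ':=' between table name and columns")
    # Naive split: assumes '|' is never part of a quoted example value.
    return name.strip(), [classify_column(c) for c in cols.split("|")]


print(parse_table_line('input1 := Discrete("alice","bob") | scipy.binom(100, 0.1)'))
```

Applied to the grades.config example, the first column is classified as an example list and the second as a scipy distribution.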
To run a subject program from our benchmark suite, users can run
> ./NaturalSym/scripts/run1.sh <bench>
where <bench> should be one of airport, movie1, usedcars, transit, credit, Q1, Q3, Q6, Q7, Q12, Q15, Q19, Q20.
For example,
> ./NaturalSym/scripts/run1.sh airport
will run NaturalSym on NaturalSym/newbench/src/airport/airport.scala with the configuration file NaturalSym/newbench/config/airport.config. Generated tests are under NaturalSym/newbench/geninputs/airport.
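To run every subject in the suite, the benchmark names listed above can be turned into one command each. The sketch below only builds and prints the command lines, without executing anything. The script path and names are taken from this README, while the helper benchmark_commands is hypothetical.

```python
# Benchmark names as listed in this README.
BENCHMARKS = [
    "airport", "movie1", "usedcars", "transit", "credit",
    "Q1", "Q3", "Q6", "Q7", "Q12", "Q15", "Q19", "Q20",
]

RUN_SCRIPT = "./NaturalSym/scripts/run1.sh"


def benchmark_commands(benchmarks=BENCHMARKS):
    """Build one argv list per benchmark, without running anything."""
    return [[RUN_SCRIPT, bench] for bench in benchmarks]


if __name__ == "__main__":
    for argv in benchmark_commands():
        print(" ".join(argv))
```

To actually launch the runs, each argv list could be passed to subprocess.run(argv, check=True) from the directory that contains the NaturalSym checkout.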