Scatter Search is a tool for exploring large datasets by interacting with their scatter plots. The goal is to surface the categories that are most prominent in the regions of the plot a user selects.
This application is an implementation of ongoing research under the Zenvisage project at the University of Illinois at Urbana-Champaign, led by Prof. Aditya Parameswaran.
The in-progress research paper can be found here. (Note: Update link)
In a nutshell, one should be able to:
- Select the columns one wants to explore (the XAxis, the YAxis and the ZAxis)
- Draw the region(s) one wants to explore on the representative plot
- Select the Ranking Algorithm
- Get a ranked set of scatter plots
Notes:
- The ZAxis represents the category, or the class, based on which the candidates are ranked.
- The region(s) are 'drawn' on the representative plot in the form of polygons.
- The ranking aims to find the order of prominence of candidates in the regions specified.
1.0.0
Scatter Search uses a number of open source projects and modules to work properly:
- D3 - combines powerful visualization components and a data-driven approach to DOM manipulation
- Materialize - A modern responsive front-end framework based on Material Design
- Python 3.x - the language the backend is written in
- Flask - microframework for Python based on Werkzeug, Jinja 2 and good intentions
- NumPy - for computation
- Pickle - a Python module for object serialization
- matplotlib - for its Path.contains_points goodness
- hopscotch - An amazing framework for adding product tours to web pages.
- jQuery - duh
And of course Scatter Search itself is open source with a public repository on GitHub.
Scatter Search requires Python 3.x to run.
$ brew install python3
Download the repository and then run pip to install all the dependencies.
$ git clone https://github.com/zenvisage/scatter-search
$ cd scatter-search
$ pip install --upgrade pip
$ pip install --upgrade -r requirements.txt
Then run the Flask instance.
$ python scatter-search.py
To download and install the latest version of Python 3.x, refer to https://www.python.org/downloads/windows/.
Also make sure you have git installed and added to the PATH.
Then run the following commands on the command line:
$ git clone https://github.com/zenvisage/scatter-search
$ cd scatter-search
$ python -m pip install --upgrade pip
$ python -m pip install --upgrade -r requirements.txt
$ python scatter-search.py
Note: To test the app on a mobile device, look at ngrok.
To add a new dataset file, just add the file to the data/ folder. As of now, only .csv and .txt files are supported, but adding other formats would be an easy update.
Note: The included datasets (iris, test, and diabetic_data) represent three scenarios. test won't work because it has just one trivial row. diabetic_data is extremely slow and is not a good fit for the tool. iris is the small (150 rows only) and easy dataset that I used during development.
The app uses Pickle to load and save datasets (in the form of Python dictionaries). It's possible to compute and add more information to the dataset before saving the dictionary, which helps improve performance the next time the dataset is used. Unpickling instead of re-reading the .csv also loads datasets much faster.
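As a rough illustration of that load-or-cache flow, here is a minimal sketch. The helper name `load_dataset`, the `.pickle` file placed next to the CSV, and the plain `csv` parsing (values stay strings, first row is the header) are all assumptions made for this example, not the app's actual loader.

```python
import csv
import os
import pickle


def load_dataset(csv_path, cache_dir=None):
    """Return the dataset dictionary, preferring a pickled copy if one exists.

    Hypothetical helper: name, cache layout and parsing are assumptions.
    """
    cache_dir = cache_dir or os.path.dirname(csv_path)
    name = os.path.splitext(os.path.basename(csv_path))[0]
    pickle_path = os.path.join(cache_dir, name + '.pickle')

    # Fast path: a previously saved dictionary is loaded as-is.
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as f:
            data_dict = pickle.load(f)
        data_dict['loaded_from_pickle'] = True
        return data_dict

    # Slow path: parse the CSV, treating the first row as column names.
    with open(csv_path, newline='') as f:
        rows = list(csv.reader(f))
    column_names, data = rows[0], rows[1:]
    data_dict = {
        'dataset_name': name,
        'column_names': column_names,
        'data': data,
        'cols': len(column_names),
        'rows': len(data),
        'loaded_from_pickle': False,
    }

    # Extra precomputed information could be added to data_dict here,
    # before it is saved for the next run.
    with open(pickle_path, 'wb') as f:
        pickle.dump(data_dict, f)
    return data_dict
```

The first call parses the CSV and writes the pickle; later calls take the fast path. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.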
def index_data(data_dict):
    """
    This function might be used to index the data dictionary to be more useful
    in sophisticated algorithms.
    Note: In case the file is loaded as pickle, you shouldn't be calling this function.
    :param data_dict: dict, required
    :return: Indexed dictionary
    """
    return data_dict
Note: More information about this section soon
Adding better, more sophisticated algorithms to the application is easy.
#### Step 1: Define your algorithm function in utility/algorithms.py.
Here is an example of how the existing naive algorithm is implemented. You should be able to see the output as a console log when you click 'Get Results'.
import numpy as np
import matplotlib.path as mplPath


def naive_algorithm(polygons, candidates_info, dataset, options):
    """
    :param polygons: {points: [[x1,y1],[x2,y2],...], type:'green'}
    :param candidates_info: {'CandidateA':{
                                numOfPoints: 50,
                                data: [[row1_val1, row1_val2,...],...]
                                },
                             'CandidateB'...}
    :param dataset: As described in loadSaveDataset.py
    :param options: {algorithm: string, xAxis: index, yAxis: index, zAxis: index}
    :return: Must return full dictionaries (same format as dataset) for the top k candidates
    """
    x = options['xAxis']
    y = options['yAxis']
    z = options['zAxis']
    data = dataset['data']

    # One result entry per candidate, starting with zero points in polygons.
    result = {}
    for candidate in candidates_info:
        result_element = {
            'dataset_name': candidate,
            'data': candidates_info[candidate]['data'],
            'column_names': dataset['column_names'],
            'numOfPoints': candidates_info[candidate]['numOfPoints'],
            'numOfPointsInPolygons': 0,
            'score': 0
        }
        result[candidate] = result_element

    # Count, per candidate, the rows that fall inside each polygon.
    for polygon in polygons:
        points = np.array(polygon['points'])
        path = mplPath.Path(points)
        total_points = 0
        for row in data:
            point = [row[x], row[y]]
            if path.contains_point(point):
                total_points += 1
                result[row[z]]['numOfPointsInPolygons'] += 1

    # Score is the fraction of a candidate's points that lie inside the polygons.
    result_array = []
    for candidate in result:
        candidate = result[candidate]
        candidate['score'] = candidate['numOfPointsInPolygons'] / candidate['numOfPoints']
        result_array.append(candidate)

    result_array = sorted(result_array, key=lambda k: k['score'], reverse=True)
    return {'algorithm': 'Naive Algorithm', 'result': result_array, 'x': x, 'y': y}
def complex_algorithm(polygons, candidates_info, dataset, options):
    """
    :param polygons:
    :param candidates_info:
    :param dataset:
    :param options:
    :return:
    """
    return {'comment': 'Hello from the Complex Algorithm'}
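The dependency list mentions `Path.contains_points`, which tests many points in one call. As a possible starting point for a faster algorithm, here is a hedged sketch of the counting step done that way. The helper name `count_points_in_polygons` is hypothetical, and the sketch assumes the X and Y columns can be converted to floats.

```python
import numpy as np
import matplotlib.path as mplPath


def count_points_in_polygons(polygons, dataset, options):
    """Count, per candidate, the rows that fall inside each polygon.

    Hypothetical helper for illustration; it is not part of the app.
    """
    x, y, z = options['xAxis'], options['yAxis'], options['zAxis']
    rows = dataset['data']
    if not rows:
        return {}

    # Build an (n, 2) float array once instead of testing rows one by one.
    xy = np.array([[float(row[x]), float(row[y])] for row in rows])
    labels = [row[z] for row in rows]

    counts = {}
    for polygon in polygons:
        path = mplPath.Path(np.array(polygon['points']))
        inside = path.contains_points(xy)  # boolean array, one entry per row
        for label, hit in zip(labels, inside):
            if hit:
                counts[label] = counts.get(label, 0) + 1
    return counts
```

A point that lies in two overlapping polygons is counted once per polygon here, since each polygon is tested independently. Dividing each count by the candidate's `numOfPoints` would give a score of the same kind the naive algorithm computes.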
Any algorithm takes four parameters:
- polygons
- candidates_info
- dataset
- options
polygons are formatted so that each list of points can be passed directly to a matplotlib.path.Path object (the points form a closed loop).
:param polygons: {points: [[x1,y1],[x2,y2],...], type:'green'}
candidates_info is an exhaustive map from each candidate to its data points and its total number of points:
:param candidates_info: { 'CandidateA':{
numOfPoints: 50,
data: [[row1_val1, row1_val2,...],...]
},
'CandidateB'...
}
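To make that shape concrete, here is a minimal sketch of how such a map could be built by grouping the dataset rows on the ZAxis column. The helper name `build_candidates_info` is hypothetical; the app may construct this structure differently.

```python
def build_candidates_info(dataset, z_index):
    """Group dataset rows by the value in the ZAxis column.

    Hypothetical helper for illustration; it is not part of the app.
    """
    candidates_info = {}
    for row in dataset['data']:
        # Create the candidate's entry the first time its label is seen.
        entry = candidates_info.setdefault(
            row[z_index], {'numOfPoints': 0, 'data': []})
        entry['numOfPoints'] += 1
        entry['data'].append(row)
    return candidates_info


# Example with three made-up rows, where column 2 is the category.
dataset = {'data': [[5.1, 3.5, 'setosa'],
                    [7.0, 3.2, 'versicolor'],
                    [4.9, 3.0, 'setosa']]}
info = build_candidates_info(dataset, 2)
print(info['setosa']['numOfPoints'])
```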
dataset is formatted like this (as converted from .csv in the loadSaveDataset script):
def get_data_dict(dataset_name, skip_header=0):
    """
    Function to get a python dictionary representation of a dataset.
    Only CSVs are supported as of now.
    TODO: Add pickling/depickling.
    :param file_path: str, required
        Eg: 'data/iris.csv'
    :param skip_header: int, optional
        Defaults to 0, otherwise skips the given number of rows
        Using the last skipped row as column names
    :return: A python dictionary
        {
            dataset_name: str,
            column_names:[str],
            data: [[row1_val1, row1_val2,...],...],
            cols: int,
            rows: int,
            loaded_from_pickle: bool
        }
        or if an exception was thrown:
        {
            error: str
        }
    """
options is a dictionary of all the user preferences:
:param options: {
algorithm: string,
xAxis: index,
yAxis: index,
zAxis: index
}
#### Step 2: Add your algorithm name and function to the mapping dictionary.
EXISTING_ALGORITHMS = {
    'Naive Algorithm': naive_algorithm,
    'Complex Algorithm': complex_algorithm
}
That's it! You should now have access to the algorithm in the web application.
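For readers wondering how a name in the mapping could end up calling the right function, here is a small sketch of a lookup-and-call step. The `run_algorithm` dispatcher and the `{'error': ...}` return for unknown names are assumptions for illustration, and the two algorithm bodies are stand-ins so the example runs on its own.

```python
def naive_algorithm(polygons, candidates_info, dataset, options):
    # Stand-in body for illustration only.
    return {'algorithm': 'Naive Algorithm', 'result': []}


def complex_algorithm(polygons, candidates_info, dataset, options):
    # Stand-in body for illustration only.
    return {'comment': 'Hello from the Complex Algorithm'}


EXISTING_ALGORITHMS = {
    'Naive Algorithm': naive_algorithm,
    'Complex Algorithm': complex_algorithm
}


def run_algorithm(polygons, candidates_info, dataset, options):
    """Look up the algorithm named in options and call it.

    Hypothetical dispatcher; the app's own routing code may differ.
    """
    name = options['algorithm']
    try:
        algorithm = EXISTING_ALGORITHMS[name]
    except KeyError:
        return {'error': 'Unknown algorithm: {}'.format(name)}
    # Every algorithm receives the same four parameters.
    return algorithm(polygons, candidates_info, dataset, options)


options = {'algorithm': 'Complex Algorithm', 'xAxis': 0, 'yAxis': 1, 'zAxis': 2}
print(run_algorithm([], {}, {'data': []}, options))
```

Because every function in the mapping shares one signature, registering a new algorithm only requires adding one entry to the dictionary.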
- Implement naive algorithm
- Add Results section (also provide a way to control number of top ranks) (Added Results Page)
- Use Pickle to optimize load time
- Use hopscotch to provide page tour
- Update pending sections.
- Add Usage section.
- Add support for using multiple algorithms at once.
MIT