- route: `/new/search_data`
- methods: `POST`
- body: form-data, two files: one for the query JSON, the other for the supplied data (CSV, optional)
  - `query`: file, a JSON file following the query schema
  - `data`: file, a CSV file (optional)
- params:
  - `max_return_docs`: the maximum number of search results to return (default is 10)
  - `return_named_entity`: `true` or `false`, whether to return the `named_entity` field (default is `false`)
- example:
  ```
  curl -X POST \
    https://dsbox02.isi.edu:9000/new/search_data \
    -H 'content-type: multipart/form-data' \
    -F data=@datamart/example/fifa_example/fifa.csv \
    -F query=@datamart/example/fifa_example/fifa_query.json
  ```
- sample response:
  ```json
  {
    "code": "0000",
    "message": "Success",
    "data": [
      {
        "summary": "STRING SUMMARY FOR THE DATASET",
        "score": 84.735825,
        "metadata": {},
        "datamart_id": "127860000"
      },
      ...
    ]
  }
  ```
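The same search call can be built from Python with the third-party `requests` library. This is a sketch: it prepares the multipart request without sending it, and the query JSON content is a hypothetical placeholder (the real body must follow the query schema).

```python
import io
import json

import requests  # third-party: pip install requests

# In-memory stand-ins for the query JSON and the optional CSV file.
query = io.BytesIO(json.dumps({"dataset": {"about": "fifa"}}).encode())  # hypothetical query body
data = io.BytesIO(b"player,team\nMessi,Argentina\n")

req = requests.Request(
    "POST",
    "https://dsbox02.isi.edu:9000/new/search_data",
    params={"max_return_docs": 10, "return_named_entity": "false"},
    files={"query": ("query.json", query), "data": ("data.csv", data)},
).prepare()

print(req.url)                      # URL now carries the query-string params
print(req.headers["Content-Type"])  # multipart/form-data; boundary=...
# resp = requests.Session().send(req)  # uncomment to actually call the server
```

Sending the prepared request returns the JSON envelope shown in the sample response above.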
- route: `/new/materialize_data`
- methods: `GET`
- params: one required param for the `datamart_id`, and an optional param to return only the first several rows
  - `datamart_id`: the datamart_id of the dataset you would like to materialize
  - `first_n_rows`: int, set this to get only the first several rows of the dataset rather than all of them
- example:
  ```
  curl -X GET \
    'https://dsbox02.isi.edu:9000/new/materialize_data?datamart_id=127860000&first_n_rows=10'
  ```
- sample response:
  ```json
  {
    "code": "0000",
    "message": "Success",
    "data": "CSV RESULT HERE"
  }
  ```
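Because the materialized CSV comes back inside the JSON envelope's `data` field, it can be parsed directly with the standard library. A sketch, using a hard-coded sample response in place of a live call:

```python
import csv
import io

# Stand-in for the JSON body returned by /new/materialize_data.
response = {
    "code": "0000",
    "message": "Success",
    "data": "name,team\nMessi,Argentina\nKane,England\n",
}

# Parse the CSV payload into a list of dicts keyed by the header row.
rows = list(csv.DictReader(io.StringIO(response["data"])))
print(rows[0]["name"])  # -> Messi
print(len(rows))        # -> 2
```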
- route: `/new/join_data`
- methods: `POST`
- body: form-data: one file for the supplied dataset, one field for the datamart_id of the augmenting data, and two fields for the joining columns
  - `left_data`: file, a CSV file, which is the supplied data provided by the user
  - `right_data`: text, the datamart_id of the dataset you would like to use for augmentation
  - `left_columns`: text, the join features in the left dataset, specified by column indices
  - `right_columns`: text, the join features in the right dataset, specified by column indices
  - `left_meta`: text, JSON metadata for the supplied data, following the index schema; useful when there are `implicit_variables`, e.g.:
    ```json
    { "implicit_variables": [ { "name": "city", "value": "New York", "semantic_type": [] } ] }
    ```
  - `exact_match`: text, exact join or fuzzy join, either `true` or `false`; default is `false` (fuzzy matching)
- example:
  ```
  curl -X POST \
    https://dsbox02.isi.edu:9000/new/join_data \
    -H 'content-type: multipart/form-data' \
    -F left_data=@datamart/example/fifa_example/fifa.csv \
    -F right_data=127860000 \
    -F 'left_columns=[[3], [4]]' \
    -F 'right_columns=[[22], [24]]' \
    -F 'left_meta={"implicit_variables":[{"name":"city","value":"New York","semantic_type":[]}]}' \
    -F 'exact_match=true'
  ```
  - *`exact_match` uses a pandas left-merge, so it may return a table with more rows than the left dataset when multiple rows in the right dataset match one row in the left dataset.*
  - *Fuzzy match returns a dataset with exactly the same number of rows as the left dataset.*
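The row-multiplication behavior of an exact join can be reproduced with a plain pandas left merge (a sketch of the behavior, not Datamart's internal code):

```python
import pandas as pd

left = pd.DataFrame({"city": ["New York", "Boston"], "x": [1, 2]})
# Two right-hand rows carry the key "New York", so the merge result grows.
right = pd.DataFrame({"city": ["New York", "New York"], "y": [10, 20]})

merged = left.merge(right, on="city", how="left")
print(len(left), len(merged))  # 2 rows in, 3 rows out
```

"New York" matches twice (two output rows) and "Boston" matches nothing (one output row with a missing `y`), so a 2-row left input yields a 3-row result.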
- sample response:
  ```json
  {
    "code": "0000",
    "message": "Success",
    "data": "CSV RESULT HERE",
    "matched_rows": [1, 3, 2, 0, null],
    "cover_ratio": 0.8
  }
  ```
  - *`matched_rows` and `cover_ratio` are currently only available when `exact_match` is NOT used.*
  - `matched_rows`: which row (by index) in the right dataset is aligned to each row in the left dataset
    - e.g. `[1, 3, 2, 0, null]` here means:
      - left_rows[0] <-matched-> right_rows[1]
      - left_rows[1] <-matched-> right_rows[3]
      - left_rows[2] <-matched-> right_rows[2]
      - left_rows[3] <-matched-> right_rows[0]
      - left_rows[4] <--- nothing matched in the right dataset
  - `cover_ratio`: the fraction of rows in the left dataset that were augmented; in the example above it is 0.8 (4/5).
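The relationship between `matched_rows` and `cover_ratio` can be reproduced in a few lines:

```python
# matched_rows as returned by a fuzzy join (None marks a left row with no match).
matched_rows = [1, 3, 2, 0, None]

matched = sum(1 for r in matched_rows if r is not None)
cover_ratio = matched / len(matched_rows)
print(cover_ratio)  # -> 0.8
```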
If you would like to index a new dataset into ISI-datamart, there are two methods:

1. By a single file:
   - Find the URL for the data you would like to upload:
     - It can be a CSV file, an Excel file, an HTML page with tabular data, or a JSON file.
   - Construct a description JSON for the data, like:
     ```json
     {
       "title": "title for the dataset",
       "description": "the description for the dataset",
       ...
       "materialization_arguments": {
         "url": "http://example.com/sample_csv.csv",
         "file_type": "csv"
       }
     }
     ```
     - The only required field is `materialization_arguments.url`; all the others are optional. More available attributes can be found in the index schema.
   - Call the `/new/get_metadata_single_file` API with the description JSON, and check the returned metadata.
   - Send the confirmed metadata through `/new/upload_metadata_list` to finish indexing.
2. By an HTML page that includes many links to "single files":
   - Find the URL for the HTML page containing the links to the datasets you would like to upload.
     - ISI-datamart will extract the links on the page and recognize whether each one points to a data file; if so, it will try to materialize each file and generate metadata.
   - Call `/new/get_metadata_extract_links` with the URL in the body JSON and check the returned metadata.
   - Send the confirmed metadata through `/new/upload_metadata_list` to finish indexing.
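The description JSON for method 1 can be built programmatically. A minimal sketch, in which the URL and title are placeholders; only `materialization_arguments.url` is required:

```python
import json

# Minimal description JSON for indexing a single file.
description = {
    "title": "sample csv dataset",  # optional
    "materialization_arguments": {
        "url": "http://example.com/sample_csv.csv",  # placeholder URL
        "file_type": "csv",
    },
}

payload = json.dumps(description)
print(payload)
```

This payload is what gets POSTed to `/new/get_metadata_single_file` in the next step.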
- route: `/new/get_metadata_single_file`
- methods: `POST`
- body: json: the description JSON for the file, including the URL (see method 1 above)
  - `materialization_arguments.file_type` can be one of `csv`, `excel`, `html`, `table`
- params:
  - `enable_two_ravens_profiler`: if `true`, will try to run the TwoRavens profiler on the dataset and append its info to the metadata; disabled by default.
- example:
  ```
  curl -X POST \
    'https://dsbox02.isi.edu:9000/new/get_metadata_single_file?enable_two_ravens_profiler=false' \
    -H 'Content-Type: application/json' \
    -d '{ "materialization_arguments": { "url": "https://www.w3schools.com/html/html_tables.asp", "file_type": "html" } }'
  ```
- sample response:
  ```json
  {
    "code": "0000",
    "message": "Success",
    "data": [
      // a list of metadata objects; usually only one metadata object in the list
      {},
      // when there are many sheets in an Excel file, there can be multiple metadata objects
      ...
    ]
  }
  ```
- route: `/new/get_metadata_extract_links`
- methods: `POST`
- body: json: `{"url": "http://example.page.with.many.csv.links"}`
- example:
  ```
  curl -X POST \
    https://dsbox02.isi.edu:9000/new/get_metadata_extract_links \
    -H 'Content-Type: application/json' \
    -d '{ "url": "https://sample-videos.com/download-sample-xls.php" }'
  ```
- sample response:
  ```json
  {
    "code": "0000",
    "message": "Success",
    "data": [
      // each inner list is for one link:
      [
        // a list of metadata objects; usually only one metadata object in the list
        {},
        // when there are many sheets in an Excel file, there can be multiple metadata objects
        ...
      ],
      [],
      ...
    ]
  }
  ```
- route: `/new/upload_metadata_list`
- methods: `POST`
- body: json
  - `metadata`: the metadata to index (a single metadata object, a list of metadata, or a list of lists of metadata)
- example:
  ```
  curl -X POST \
    https://dsbox02.isi.edu:9000/new/upload_metadata_list \
    -H 'Content-Type: application/json' \
    -d '{
      "metadata": {
        "datamart_id": 0,
        "title": "html tables",
        "url": "https://www.w3schools.com/html/html_tables.asp",
        "materialization": {
          "python_path": "general_materializer",
          "arguments": {
            "url": "https://www.w3schools.com/html/html_tables.asp",
            "file_type": "html",
            "index": 0
          }
        },
        "variables": [
          {
            "datamart_id": 1,
            "semantic_type": [],
            "name": "Company",
            "description": "column name: Company, dtype: object",
            "named_entity": [
              "Alfreds Futterkiste",
              "Centro comercial Moctezuma",
              "Ernst Handel",
              "Island Trading",
              "Laughing Bacchus Winecellars",
              "Magazzini Alimentari Riuniti"
            ]
          },
          {
            "datamart_id": 2,
            "semantic_type": [],
            "name": "Contact",
            "description": "column name: Contact, dtype: object",
            "named_entity": [
              "Maria Anders",
              "Francisco Chang",
              "Roland Mendel",
              "Helen Bennett",
              "Yoshi Tannamuri",
              "Giovanni Rovelli"
            ]
          },
          {
            "datamart_id": 3,
            "semantic_type": [],
            "name": "Country",
            "description": "column name: Country, dtype: object",
            "named_entity": [
              "Germany",
              "Mexico",
              "Austria",
              "UK",
              "Canada",
              "Italy"
            ]
          }
        ],
        "description": "Company : object, Contact : object, Country : object",
        "keywords": ["Company", "Contact", "Country"]
      }
    }'
  ```
- sample response:
  ```json
  {
    "code": "0000",
    "message": "Success",
    "data": [
      // the successfully indexed metadata, with valid datamart_id assigned
      {},
      ...
    ]
  }
  ```
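The two-step indexing flow (get the metadata, review it, then upload it) can be sketched with the third-party `requests` library. This only builds the two requests without sending them, and the reviewed metadata is a placeholder; in practice it would be the (possibly edited) output of step 1:

```python
import json

import requests  # third-party: pip install requests

BASE = "https://dsbox02.isi.edu:9000"

# Step 1: ask Datamart to profile the file and produce metadata.
description = {
    "materialization_arguments": {
        "url": "https://www.w3schools.com/html/html_tables.asp",
        "file_type": "html",
    }
}
step1 = requests.Request(
    "POST", f"{BASE}/new/get_metadata_single_file", json=description
).prepare()

# Step 2: after reviewing step 1's metadata, upload it to finish indexing.
reviewed_metadata = {"title": "html tables"}  # placeholder for step 1's output
step2 = requests.Request(
    "POST", f"{BASE}/new/upload_metadata_list", json={"metadata": reviewed_metadata}
).prepare()

print(step1.url)
print(step2.url)
# requests.Session().send(step1) / .send(step2)  # uncomment to actually call the server
```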
To run the Datamart web service:

```
conda activate datamart_env
python ../../datamart_web/webapp.py
```