Tom Conlin edited this page Feb 19, 2016 · 3 revisions

General

Always reuse an existing API where one exists; in most cases there is no need to write our own. At the same time, remain mindful of the dependencies we are committing to.

Formats

CSV/TSV

Sometimes a streaming approach is possible (i.e. emit the triple(s) for each line, then discard the line).

In some cases it may be necessary to load all lines into an in-memory model, but this should generally be avoided.
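
A minimal sketch of the streaming approach, using only the Python standard library. The function name `stream_triples`, the `http://example.org/` base IRI, and the column layout (first column is the subject identifier, remaining columns become literal-valued properties) are assumptions made for illustration, not an existing convention of this project.

```python
import csv
import io


def stream_triples(lines, base="http://example.org/"):
    """Yield N-Triples lines one input row at a time.

    `lines` is any iterable of text lines (an open file works), so only
    the current row is held in memory.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)  # first row names the columns
    for row in reader:
        subject = f"<{base}{row[0]}>"
        for column, value in zip(header[1:], row[1:]):
            if value == "":
                continue  # skip empty cells rather than emit empty literals
            # Minimal literal escaping (backslash and double quote only).
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            yield f'{subject} <{base}{column}> "{escaped}" .'


if __name__ == "__main__":
    sample = io.StringIO("id\tlabel\tsymbol\nG1\tgene one\tABC\nG2\tgene two\t\n")
    for triple in stream_triples(sample):
        print(triple)
```

Because the function is a generator, the caller can write each triple straight to an output file and nothing accumulates between rows.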

XML

TODO: investigate best lib to use

One possibility is to use XSLT, but in my experience this leads to scalability issues, and a programmatic approach is usually best.

Slurp all into memory vs SAX-type approach?

e.g.
http://pythonhosted.org/generateDS/
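
As one way to weigh the in-memory versus SAX-type question, here is a sketch of an incremental parse with the standard library's `xml.etree.ElementTree.iterparse`. This is not a recommendation of a library (that is still a TODO above); the `iter_records` helper and the `<gene>` element layout are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield (attributes, text) for each element named `tag`.

    `source` is a filename or a binary file object. Each element is
    cleared after it is yielded so its content can be freed.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy what we need before clearing the element.
            yield dict(elem.attrib), (elem.text or "").strip()
            elem.clear()


if __name__ == "__main__":
    xml_bytes = b"<genes><gene id='G1'>one</gene><gene id='G2'>two</gene></genes>"
    for attrs, text in iter_records(io.BytesIO(xml_bytes), "gene"):
        print(attrs["id"], text)
```

The same loop body could instead call `ET.parse(source)` and walk the full tree; that is simpler when records need to refer to each other, at the cost of holding the whole document in memory.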

JSON

It may be possible to 'convert' JSON simply by providing a JSON-LD context; it then translates naturally to RDF (e.g. via Apache Jena RIOT, and possibly an equivalent in Python's rdflib).
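
A sketch of what "providing a context" could look like, using only the standard library. The `http://example.org/vocab/` vocabulary, the `id` alias, and the `add_context` helper are assumptions for illustration; a real source would need a context that maps its keys to the intended ontology terms.

```python
import json

# Hypothetical context: unprefixed keys resolve against @vocab,
# and the source's "id" field is treated as the node identifier.
CONTEXT = {
    "@vocab": "http://example.org/vocab/",
    "id": "@id",
}


def add_context(record, context=CONTEXT):
    """Return a copy of `record` with a JSON-LD @context attached."""
    return {"@context": context, **record}


if __name__ == "__main__":
    source_json = '{"id": "http://example.org/gene/G1", "label": "gene one"}'
    doc = add_context(json.loads(source_json))
    print(json.dumps(doc, indent=2))
```

The original record is left unchanged apart from the added `@context` key; turning the result into triples is then the job of a JSON-LD-aware parser such as the ones mentioned above.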

SQL dbs

We want to avoid writing our own ORMs. We must first ask: if the SQL db is widely used (e.g. ENSEMBL), is there an existing API we can use?

Scraping

See Beautiful Soup. Also, if a source has already been ingested into DISCO, continue with the disco2turtle route for now.