From 230b3f38e596dcf51005d91727bc4dfbd60cd1cc Mon Sep 17 00:00:00 2001
From: Everaldo
Date: Tue, 27 Feb 2024 16:31:39 -0800
Subject: [PATCH] 0.12.x (#320)

* removed extra bullet point

* Remove line for switching branches now that docker-compose is in the main branch

* changed to main branch docker-compose url

* Revert "Merge branch 'master' into 0.12.x"

This reverts commit 209b8ab3cf207e3faf47de3f848b15e0bff4ff0b, reversing
changes made to 738af0a965ed8108602f2a532746170a14d1e487.

---------

Co-authored-by: Jason Lin <35415519+jal347@users.noreply.github.com>
---
 docs/tutorial/cli.rst             |  4 +-
 docs/tutorial/studio_tutorial.rst | 85 +++++++++++++------------------
 2 files changed, 37 insertions(+), 52 deletions(-)

diff --git a/docs/tutorial/cli.rst b/docs/tutorial/cli.rst
index a8ee986c3..062463868 100644
--- a/docs/tutorial/cli.rst
+++ b/docs/tutorial/cli.rst
@@ -96,8 +96,8 @@ command we will be using in this tutorial.
   into JSON documents and upload them to the source database
 * ``biothings-cli dataplugin serve``: *serve* command runs a simple API server for serving documents
   from the source database.
-* ``biothings-cli dataplugin clean``: Delete all dumped files and drop uploaded
-* sources tables
+* ``biothings-cli dataplugin clean``: Delete all dumped files and drop uploaded
+  sources tables
 
 If you have any further questions about what other options are available in our
 ``biothings-cli``, you can check out more using the ``--help`` or ``-h`` flag
diff --git a/docs/tutorial/studio_tutorial.rst b/docs/tutorial/studio_tutorial.rst
index 89d113ad8..3168e1553 100644
--- a/docs/tutorial/studio_tutorial.rst
+++ b/docs/tutorial/studio_tutorial.rst
@@ -7,8 +7,7 @@ to a fully operational BioThings API. In a second part, this API will enrich for
 
 .. note:: You may also want to read the `developer's guide `_ for more detailed information.
 
-.. note:: The following tutorial is only valid for **BioThings Studio** release **0.2b**. Check
-   all available `releases `_ for more.
+.. note:: The following tutorial uses a docker-compose file to run the **BioThings Studio** and **Hub**. This file is available `here `_.
 
 =================
 1. What you'll learn
 =================
 
 Through this guide, you'll learn:
 
-* how to obtain a Docker image to run your favorite API
+* how to use docker-compose to run your favorite API
 * how to run that image inside a Docker container and how to access the **BioThings Studio** application
 * how to integrate a new data source by defining a data plugin
 * how to define a build configuration and create data releases
@@ -29,14 +28,9 @@ Through this guide, you'll learn:
 =============
 
 Using **BioThings Studio** requires a Docker server up and running, some basic knowledge
-about commands to run and use containers. Images have been tested on Docker >=17. Using AWS cloud,
-you can use our public AMI **biothings_demo_docker** (``ami-44865e3c`` in Oregon region) with Docker pre-configured
-and ready for studio deployment. Instance type depends on the size of data you
-want to integrate and parsers' performances. For this tutorial, we recommend using instance type with at least
-4GiB RAM, such as ``t2.medium``. AMI comes with an extra 30GiB EBS volume, which is more than enough
-for the scope of this tutorial.
-
-Alternately, you can install your own Docker server (on recent Ubuntu systems, ``sudo apt-get install docker.io``
+about commands to run and use containers. Images have been tested on Docker >=17.
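+
+If you are unsure which Docker version is installed, a quick check looks like this (a hypothetical example; the exact output will differ on your machine):
+
+.. code:: bash
+
+    # prints something like "Docker version 24.x.y, build <hash>"; any version >= 17 is fine
+    $ docker --version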
+
+You can install your own Docker server (on recent Ubuntu systems, ``sudo apt-get install docker.io``
 is usually enough). You may need to point Docker images directory to a specific hard drive to get enough space,
 using ``-g`` option:
 
@@ -47,13 +41,16 @@ using ``-g`` option:
     # restart to make this change active
     sudo service docker restart
 
+Alternatively, if you are on Mac or Windows, you can install `Docker Desktop `_, which installs the Docker server for you.
+Once Docker Desktop is installed, go to Settings -> Resources -> Advanced and give Docker at least 80% of your resources in each category.
+This will prevent Docker from crashing when you run a large datasource or build.
 
 ============
 3. Installation
 ============
 
 **BioThings Studio** is available as a docker-compose file at our `github repository `_.
-Clone the repository and go to the ``docker-compose`` branch.
+Clone the repository to your local machine.
 
 A **BioThings Studio** instance exposes several services on different ports:
 
@@ -70,39 +67,21 @@ A **BioThings Studio** instance exposes several services on different ports:
 
 .. code:: bash
 
-    $ docker run --rm --name studio -p 8080:8080 -p 7022:7022 -p 7080:7080 -p 7081:7081 -p 9200:9200 \
-           -p 27017:27017 -p 8000:8000 -p 9000:9000 -p 60080:60080 -d biothings/biothings-studio:0.2b
-
-.. note:: we need to add the release number after the image name: biothings-studio:**0.2b**. Should you use another release (including unstable releases,
-   tagged as ``master``) you would need to adjust this parameter accordingly.
-
-.. note:: Biothings Studio and the Hub are not designed to be publicly accessible. Those ports should **not** be exposed. When
-   accessing the Studio and any of these ports, SSH tunneling can be used to safely access the services from outside.
-   Ex: ``ssh -L 7080:localhost:7080 -L 8080:localhost:8080 -L 7022:localhost:7022 -L 9000:localhost:9000 user@mydockerserver`` will expose the Hub REST API, the web application,
-   the Hub SSH, and Cerebro app ports to your computer, so you can access the webapp using http://localhost:8080, the Hub REST API using http://localhost:7080,
-   http://localhost:9000 for Cerebro, and directly type ``ssh -p 7022 biothings@localhost`` to access Hub's internals via the console.
-   See https://www.howtogeek.com/168145/how-to-use-ssh-tunneling for more details.
+    $ docker compose up -d --build
 
 We can follow the starting sequence using the ``docker logs`` command:
 
 .. code:: bash
 
-    $ docker logs -f studio
-    Waiting for mongo
-    tcp        0      0 127.0.0.1:27017         0.0.0.0:*               LISTEN      -
-
-    * Starting Elasticsearch Server
-    ...
-    Waiting for cerebro
-    ...
-    now run webapp
-    not interactive
+    $ docker logs -f biothings
+    ARG
+    SSH keys not yet created, creating
+    Generating SSH Keys for BioThings Hub...
+    SSH Key has been generated, Public Key:
 
 Please refer to `Filesystem overview `_ and `Services check `_ for more details about Studio's internals.
 
-By default, the studio will auto-update its source code to the latest version available and install all required dependencies. This behavior can be skipped
-by adding ``no-update`` at the end of the command line of ``docker run ...``.
-
 We can now access **BioThings Studio** using the dedicated web application (see `webapp overview `_).
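+
+If the webapp does not come up right away, a quick way to confirm the underlying services are responding is shown below (a hypothetical check, assuming the default port mapping described above: 8080 for the webapp, 7080 for the Hub REST API, 9200 for Elasticsearch):
+
+.. code:: bash
+
+    # Elasticsearch answers with a small JSON document describing the cluster
+    $ curl -s http://localhost:9200
+    # the Hub REST API should answer on port 7080
+    $ curl -s http://localhost:7080
+    # the web application itself is served at http://localhost:8080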
@@ -114,7 +93,7 @@ In this section we'll dive in more details on using the **BioThings Studio** and
 within the **Hub**, declare a build configuration using that datasource, create a build from that configuration, then a data release
 and finally instantiate a new API service and use it to query our data.
 
-The whole source code is available at https://github.com/sirloon/pharmgkb, each branch pointing to a specific step in this tutorial.
+The whole source code is available at https://github.com/biothings/tutorials/tree/master, each branch pointing to a specific step in this tutorial.
 
 4.1. Input data
 ^^^^^^^^^^^^^^^
 
The last two files will be used in the second part of this tutorial when we'll a
 
 .. _`occurrences.zip`: https://s3.pgkb.org/data/occurrences.zip
 
 These files will be downloaded by the **Hub** when we trigger the dumper. These files will go into a folder named ``data_folder`` by default.
-This will be explained in more detail in the `Data plugin `_ section.
+This will be explained in more detail in the `Data Plugin `_ section.
 
 
 4.2. Parser
 ^^^^^^^^^^^
 In order to ingest this data and make it available as an API, we first need to write a parser. Data is pretty simple, tab-separated files,
 and we'll make it even simpler by using the ``pandas`` Python library. The first version of this parser is available in branch ``pharmgkb_v1`` at
-https://github.com/sirloon/pharmgkb/blob/pharmgkb_v1/parser.py. After some boilerplate code at the beginning for dependencies and initialization,
+https://github.com/biothings/tutorials/blob/pharmgkb_v1/parser.py. After some boilerplate code at the beginning for dependencies and initialization,
 the main logic is the following:
 
 
@@ -177,6 +156,10 @@ containing the downloaded data. This path is automatically set by the Hub and po
 It is the responsibility of the parser to select, within that folder, the file(s) of interest. Here we need data from a file named
 ``var_drug_ann.tsv``. Following the motto "don't assume it, prove it", we make sure that file exists.
 
+.. note:: In this case, an assertion isn't strictly necessary, as the code will fail anyway if the file doesn't exist, but it's good practice to make sure
+   the file exists before trying to open it. It's also good practice to use ``os.path.join()`` to build the path to the file, as it will
+   automatically use the right path separator for the operating system.
+
 .. code:: python
 
     dat = pandas.read_csv(infile,sep="\t",squeeze=True,quoting=csv.QUOTE_NONE).to_dict(orient='records')
 
@@ -230,6 +213,9 @@ in a dictionary indexed by gene ID. The final documents are assembled in the las
 .. note:: In this specific example, we read the whole content of this input file in memory, then store annotations per gene. The data itself is
    small enough to do this, but memory usage always needs to be cautiously considered when we write a parser.
 
+.. note:: In this case, the final documents are assembled within a generator function, which is a good practice to save memory.
+   You may see within our `Biothings github organization `_ that we have plugins that return a dictionary or a list of documents.
+   This is also fine, but it is recommended to use a generator function when possible.
 
 4.3. Data plugin
 ^^^^^^^^^^^^^^^^
@@ -250,7 +236,7 @@ that contains everything useful for the datasource. This is what we'll do in the
 so we don't have to regularly update the plugin code (``git pull``) from the webapp, to fetch the latest code.
 That said, since the plugin is already defined in github in our case, we'll use the github repo registration method.
-The corresponding data plugin repository can be found at https://github.com/sirloon/pharmgkb/tree/pharmgkb_v1. The manifest file looks like this:
+The corresponding data plugin repository can be found at https://github.com/biothings/tutorials/tree/pharmgkb_v1. The manifest file looks like this:
 
 .. code:: bash
 
@@ -311,12 +297,9 @@ reconnect, which we'll do!
 .. image:: ../_static/hub_restarting.png
    :width: 250px
 
-The Hub shows an error though:
-
-.. image:: ../_static/nomanifest.png
-   :width: 250px
-
-Indeed, we fetch source code from branch ``master``, which doesn't contain any manifest file. We need to switch to another branch (this tutorial is organized using branches,
+Once you reconnect, you will have to do a hard refresh on your webpage, for example, ``cmd + shift + r`` on a Mac or ``ctrl + shift + r`` on Windows/Linux.
+
+Since we fetch source code from branch ``master``, which doesn't contain any manifest file, we need to switch to another branch (this tutorial is organized using branches,
 and also it's a perfect opportunity to learn how to use a specific branch/commit using **BioThings Studio**...)
 
 Let's click on ``tutorials`` link, then |plugin|. In the textbox on the right, enter ``pharmgkb_v1`` then click on ``Update``.
@@ -330,6 +313,8 @@ Let's click on ``tutorials`` link, then |plugin|. In the textbox on the right, e
 **BioThings Studio** will fetch the corresponding branch (we could also have specified a commit hash for instance), source code changes will be detected
 and the Hub will restart. The new code version is now visible in the plugin tab.
 
+.. note:: Remember to do a hard refresh again before continuing, as the hub will attempt to restart.
+
 .. image:: ../_static/branch.png
    :width: 400px
 
@@ -373,7 +358,7 @@ release number, the data folder, when the last download was, how long it tooks t
 .. image:: ../_static/dumptab.png
    :width: 450px
 
-Same for the `Uploader` tab, we now have 979 documents uploaded to MongoDB. Exact number may change depending on the source file that is downloaded.
+Same for the `Uploader` tab, we now have 979 documents uploaded to MongoDB. The exact number may change depending on when the source file is downloaded.
 
 .. image:: ../_static/uploadtab.png
    :width: 450px
 
@@ -496,10 +481,10 @@ tells the **Hub** which datasources should be merged together, and how. Click on
 
 * the `document type` represents the kind of documents stored in the merged collection. It gives its name to the annotate API endpoint (eg. /gene).
   This source is about gene annotations, so "gene" it is...
 * open the dropdown list and select the `sources` you want to be part of the merge. We only have one, "pharmgkb"
-* in `root sources`, we can declare which sources are allowed to create new documents in the merged collection, that is merge documents from a
-  datasource, but only if corresponding documents exist in the merged collection. It's useful if data from a specific source relates to data on
-  another source (it only makes sense to merge that relating data if the data itself is present). If root sources are declared, **Hub** will first
-  merge them, then the others. In our case, we can leave it empty (no root sources specified, all sources can create documents in the merged collection)
+* in `root sources`, we can declare which sources are allowed to create new documents in the merged collection.
+  If a root source is declared, data from other sources will only be merged if documents with the same IDs already exist (i.e. documents coming from root sources).
+  If not, the data is discarded. Finally, if no root source is declared, any data source can create a new document in the merged collection.
+  In our case, we can leave it empty (no root sources specified, all sources can create documents in the merged collection); a short sketch of this rule is shown right after this list.
 * selecting a builder is optional, but for the sake of this tutorial, we'll choose ``LinkDataBuilder``. This special builder will fetch documents directly
   from our datasource `pharmgkb` when indexing documents, instead of duplicating documents into another collection (called `target` or `merged` collection).
   We can do this (and save time and disk space) because we only have one datasource here.
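+
+To make the `root sources` behaviour more concrete, here is a minimal sketch of the merge rule described above. This is **not** the Hub's actual implementation, just an illustration of the semantics (function and variable names are made up for the example):
+
+.. code:: python
+
+    def merge(root_docs, other_docs):
+        # documents coming from root sources always make it into the merged collection
+        merged = {doc["_id"]: doc for doc in root_docs}
+        # documents from non-root sources can only enrich existing documents;
+        # anything without a matching _id is discarded
+        for doc in other_docs:
+            if doc["_id"] in merged:
+                merged[doc["_id"]].update(doc)
+        return merged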