catch up introduction lesson

Code4NP · Jan 8, 2025 · 0fea19c
1 parent 7cd8eb6
commit 0fea19c
Showing 3 changed files with 234 additions and 9 deletions.
diff --git a/posts/intro_to_python_and_jupyter/Introduction_to_jupyter_notebooks.ipynb b/posts/intro_to_python_and_jupyter/Introduction_to_jupyter_notebooks.ipynb
@@ -16,8 +16,8 @@
     "    affiliation: Ometa Labs LLC\n",
     "    orcid: \n",
     "categories: []\n",
-    "date: \"2024-09-26\"\n",
-    "description: \"A task-based introduction into coding with python in the jupyter notebook. This lession teaches how to interact with tsv files to retrieve data and build figures using matplotlib.\"\n",
+    "date: \"2025-01-08\"\n",
+    "description: \"A task-based introduction to coding with Python in the Jupyter notebook. This lesson teaches how to interact with TSV files to retrieve data, and introduces custom functions and API interaction.\"\n",
     "draft: true\n",
     "appendix-cite-as: display\n",
     "funding: \"The author(s) received no funding for this work.\"\n",
@@ -47,11 +47,10 @@
     "- How to read in your data \n",
     "- Python packages (and why you should use them)\n",
     "- Designing your own functions (and why you'll need to)\n",
-    "- Generate custom figures for presentations \n",
     "\n",
     "## Lesson Case Study: \n",
     "\n",
-    "We will search data from the NPAtlas to determine the number of reports of compounds per genus of bacteria. \n",
+    "We will query the NPAtlas API with a custom function to retrieve information about a set of compounds. \n",
     "\n",
     "## Why Python: \n",
     "\n",
@@ -265,7 +264,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 33,
+   "execution_count": 28,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -328,6 +327,36 @@
     "print(\"The last item in the bacteria_genera list is:\",bacteria_genera[-1])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you want to go through every item in a list, you can write a \"for loop\". This is very handy when you need to do the same thing to lots of data, and it is not limited to lists. In the example below, the placeholder variable 'genus' holds the current item each time we go through the loop, so it is overwritten on every pass. \"For loops\" are handy, but they can be slow on large datasets - we'll cover more advanced ways to go through lists in the future. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Escherichia\n",
+      "Salmonella\n",
+      "Bacillus\n",
+      "Staphylococcus\n",
+      "Streptococcus\n",
+      "Bhurkholderia\n"
+     ]
+    }
+   ],
+   "source": [
+    "for genus in bacteria_genera:\n",
+    "\tprint(genus)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -390,6 +419,13 @@
     "isolation_locations"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Dictionaries and lists can be nested as well, so you can have a list of lists, or a dictionary of dictionaries. JSON, a format we will work with by the end of this lesson, can be manipulated like a dictionary of dictionaries! "
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -547,22 +583,196 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 44,
+   "execution_count": 59,
    "metadata": {},
    "outputs": [],
    "source": [
-    "# %load ./exercise_solutions/exercise_2.py"
+    "# %load ./exercise_solutions/exercise_2.py\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": []
+   "source": [
+    "# Python Packages\n",
+    "\n",
+    "Python has quite a few built-in functions, but these are rarely all you need. Packages are created to tackle a specific problem or to manipulate data faster than code we could write ourselves. \n",
+    "\n",
+    "There are packages for all kinds of scientific programming and data including: \n",
+    "* Mass Spectrometry Data \n",
+    "* NMR Data\n",
+    "* Statistics and Bioinformatics \n",
+    "* Figure Generation\n",
+    "* Interaction with APIs\n",
+    "\n",
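+    "Some packages ship with Python as part of its standard library, while others are third-party packages that must be installed into your environment first. Either way, a package is not available in your code until you import it. *Requests* is a third-party package (often preinstalled in scientific Python distributions) that we'll use to retrieve data from a website. \n",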
+    "\n",
+    "To import a package, you call it by name using an import statement. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests"
+   ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": []
+   "source": [
+    "Sometimes, you don't want to type out the entire package name - and we'll see why later. For now, let's import requests under the shorter alias \"r\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests as r"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Many websites and databases document how to interact with their APIs, and we can use that documentation together with requests to find information VERY quickly. \n",
+    "\n",
+    "The NPAtlas documentation can be found [here](https://www.npatlas.org/api/v1/docs#/)\n",
+    "\n",
+    "For now, we are going to focus on simply searching for a compound by its NPAID (Natural Products Atlas ID) and using a \"GET\" request to get the information. \n",
+    "\n",
+    "When we construct the URL, we can simply join the base address and our variable with the + operator, as shown below. Alternatively, we could use an f-string (which we will not cover in detail, but which appears in the commented-out line below). f-strings can be very helpful if the part that changes sits in the middle of a URL, or if there are many variables - but for now, they are not required. \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "npaid = \"NPA024652\"\n",
+    "response = r.get(\"https://www.npatlas.org/api/v1/compound/\"+npaid)\n",
+    "# response = r.get(f\"https://www.npatlas.org/api/v1/compound/{npaid}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can view the data in several ways, including as a plain text string or parsed from JSON into Python dictionaries and lists, which makes it quick and easy to pull out any value it returns. Usually, the documentation will tell you whether to expect a single simple value or a laundry list of properties, often stored as JSON. Since the Atlas contains a wealth of information, it's easy to see the advantage of JSON - try flipping between the two by removing the #: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'{\"id\":24652,\"npaid\":\"NPA024652\",\"original_name\":\"Streptomycin\",\"mol_formula\":\"C21H39N7O12\",\"mol_weight\":\"581.5800\",\"exact_mass\":\"581.2657\",\"inchikey\":\"UCSJYZPVAKXKNQ-HZYVHMACSA-N\",\"smiles\":\"C[C@H]1[C@@]([C@H]([C@@H](O1)O[C@@H]2[C@H]([C@@H]([C@H]([C@@H]([C@H]2O)O)N=C(N)N)O)N=C(N)N)O[C@H]3[C@H]([C@@H]([C@H]([C@@H](O3)CO)O)O)NC)(C=O)O\",\"cluster_id\":592,\"node_id\":528,\"has_exclusions\":false,\"synonyms\":[],\"inchi\":\"InChI=1S/C21H39N7O12/c1-5-21(36,4-30)16(40-17-9(26-2)13(34)10(31)6(3-29)38-17)18(37-5)39-15-8(28-20(24)25)11(32)7(27-19(22)23)12(33)14(15)35/h4-18,26,29,31-36H,3H2,1-2H3,(H4,22,23,27)(H4,24,25,28)/t5-,6-,7+,8-,9-,10-,11+,12-,13-,14+,15+,16-,17-,18-,21+/m0/s1\",\"m_plus_h\":\"582.2730\",\"m_plus_na\":\"604.2549\",\"origin_reference\":{\"doi\":\"10.1021/ja01187a006\",\"pmid\":18875100,\"authors\":\"Kuehl, FA; Peck, RL; Hoffhine Jr, CE;.Folkers, K\",\"title\":\"Streptomyces antibiotics; structure of streptomycin.\",\"journal\":\"Journal of the American Chemical 
Society\",\"year\":1948,\"volume\":\"70\",\"issue\":\"7\",\"pages\":\"2325-2330\"},\"origin_organism\":{\"id\":670,\"type\":\"Bacterium\",\"genus\":\"Streptomyces\",\"species\":\"griseus\",\"taxon\":{\"id\":283,\"name\":\"Streptomyces\",\"rank\":\"genus\",\"taxon_db\":\"lpsn\",\"external_id\":\"517119\",\"ncbi_id\":1883,\"ancestors\":[{\"id\":1,\"name\":\"Bacteria\",\"rank\":\"domain\",\"taxon_db\":\"lpsn\",\"external_id\":\"0\",\"ncbi_id\":2},{\"id\":203,\"name\":\"Actinobacteria\",\"rank\":\"phylum\",\"taxon_db\":\"lpsn\",\"external_id\":\"0\",\"ncbi_id\":201174},{\"id\":204,\"name\":\"Actinobacteria\",\"rank\":\"class\",\"taxon_db\":\"lpsn\",\"external_id\":\"0\",\"ncbi_id\":null},{\"id\":275,\"name\":\"Streptomycetales\",\"rank\":\"order\",\"taxon_db\":\"lpsn\",\"external_id\":\"0\",\"ncbi_id\":85011},{\"id\":276,\"name\":\"Streptomycetaceae\",\"rank\":\"family\",\"taxon_db\":\"lpsn\",\"external_id\":\"0\",\"ncbi_id\":2062}]}},\"syntheses\":[\"10.7164/antibiotics.27.997\"],\"reassignments\":[],\"mol_structures\":[{\"current_structure\":true,\"reference_doi\":\"10.1021/ja01187a006\",\"structure_smiles\":\"C[C@H]1[C@@]([C@H]([C@@H](O1)O[C@@H]2[C@H]([C@@H]([C@H]([C@@H]([C@H]2O)O)N=C(N)N)O)N=C(N)N)O[C@H]3[C@H]([C@@H]([C@H]([C@@H](O3)CO)O)O)NC)(C=O)O\",\"is_reassignment\":false,\"version\":1}],\"exclusions\":[],\"external_ids\":[{\"external_db_name\":\"mibig\",\"external_db_code\":\"BGC0000717\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00012112970%Suspect related to Massbank: Streptomycin (predicted molecular formula SIRIUS: C22H43N7O13 / BUDDY: C33H43NO10) with delta m/z 32.026 (putative explanation: unspecified; atomic difference: 1C,4H,1O) [M+H]+%4\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00000075309%Streptomycin%3!CCMSLIB00000075310%Streptomycin%3!CCMSLIB00000206377%Massbank: Streptomycin%3!CCMSLIB00000206378%Massbank: Streptomycin%3!CCMSLIB00000206379%Massbank: Streptomycin%3!CCMSLIB00000206380%Massbank: 
Streptomycin%3!CCMSLIB00000206381%Massbank: Streptomycin%3!CCMSLIB00000220513%Massbank:KO003997 Streptomycin%3!CCMSLIB00000220515%Massbank:KO003998 Streptomycin%3!CCMSLIB00000220517%Massbank:KO003999 Streptomycin%3!CCMSLIB00000220519%Massbank:KO004000 Streptomycin%3!CCMSLIB00000220521%Massbank:KO004001 Streptomycin%3!CCMSLIB00000220524%Massbank:KO004002 Streptomycin%3!CCMSLIB00000220526%Massbank:KO004003 Streptomycin%3!CCMSLIB00000220528%Massbank:KO004004 Streptomycin%3!CCMSLIB00000220530%Massbank:KO004005 Streptomycin%3!CCMSLIB00000220532%Massbank:KO004006 Streptomycin%3!CCMSLIB00000570650%MoNA:2366441 Streptomycin (TN)%3!CCMSLIB00000571996%MoNA:2303472 Streptomycin%3!CCMSLIB00000571998%MoNA:2303213 Streptomycin%3!CCMSLIB00000572125%MoNA:2312045 Streptomycin%3!CCMSLIB00000572135%MoNA:2354240 Streptomycin%3!CCMSLIB00000574208%MoNA:2366441 Streptomycin (TN)%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005723215%Streptomycin_20eV%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005723216%Streptomycin_40eV%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005723217%Streptomycin_50eV%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00009952766%Suspect related to Massbank: Streptomycin (predicted molecular formula: C20H39N11O9) with delta m/z 18.01 (putative explanation: Proline oxidation to 5-hydroxy-2-aminovaleric acid|water; atomic difference: 2H,1O|2H,1O)%4\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00009952767%Suspect related to Massbank: Streptomycin (predicted molecular formula: C22H43N7O13) with delta m/z 32.026 (putative explanation: unspecified; atomic difference: 1C,4H,1O)%4\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005728509%Massbank:KO001831 Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005728729%Massbank:KO001828 
Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005729154%Massbank:KO001827 Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005729246%Massbank:KO001830 Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005729422%Massbank:KO001829 Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005771303%Massbank: Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005771106%Massbank: Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005771137%Massbank: Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005771154%Massbank: Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00005771334%Massbank: Streptomycin%3\"},{\"external_db_name\":\"gnps\",\"external_db_code\":\"CCMSLIB00012112969%Suspect related to Massbank: Streptomycin (predicted molecular formula SIRIUS: C20H39N11O9 / BUDDY: C32H41NO10) with delta m/z 18.01 (putative explanation: Proline oxidation to 5-hydroxy-2-aminovaleric acid|water; atomic difference: 2H,1O|2H,1O) [M+H]+%4\"},{\"external_db_name\":\"npmrd\",\"external_db_code\":\"NP0008060\"}]}'"
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "response.text\n",
+    "# response.json()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As you can see, there's quite a lot here from one simple request. APIs can offer a lot of information if you give them the right data to search. There are GET requests for quick inquiries, POST requests for specifying different types and levels of information (think about it as an 'advanced search' function), and PUT requests for updating databases or adding new information. Typically, PUT requests are locked down, but with the right credentials you can add new information for others to use. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Functions\n",
+    "\n",
+    "Sometimes, we are running through an analysis and just want bits and pieces of information from specific inputs. Luckily, we can design functions that take a number of inputs and give us results, so we do not have to do things one variable at a time.\n",
+    "\n",
+    "To start out, we can define a function in a script and then re-use it later. In advanced applications, you can import functions from other places and use them directly. This is handy if you re-use functions all the time, but don't want to waste time rewriting them every time you make a new script. Take a look at the function below: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def get_compound_data(npaid):\n",
+    "\tresponse = r.get(\"https://www.npatlas.org/api/v1/compound/\"+npaid)\n",
+    "\treturn response.json()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, we're constructing a function that takes an Atlas ID and returns the information we want about that compound. If we have a list of compounds, we can fetch information on each one, parse it, and print or store the relevant information outside of the function. See below for a quick example: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Lincomycin is produced by Streptomyces and has a molecular weight of 406.5450 Da.\n",
+      "Erythromycin B is produced by Streptomyces and has a molecular weight of 717.9380 Da.\n",
+      "Collismycin A is produced by Streptomyces and has a molecular weight of 275.3330 Da.\n"
+     ]
+    }
+   ],
+   "source": [
+    "npaid_list = [\"NPA024602\",\"NPA015585\",\"NPA020595\"]\n",
+    "for npaid in npaid_list: \n",
+    "\tcompound_data = get_compound_data(npaid)\n",
+    "\tprint(compound_data['original_name'],\"is produced by\",compound_data['origin_organism']['genus'],\"and has a molecular weight of\",compound_data['mol_weight'],\"Da.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the above example, we are combining a number of things we've already learned - list construction, for-loops, dictionary manipulation, and retrieving information from a JSON response as if it were a dictionary filled with other dictionaries. As you can see, putting these elements together means you can find all kinds of information systematically in just a few lines of code. \n",
+    "\n",
+    "In these examples, we use the NPAID - the identifier associated with a compound - to look up information. But how can we construct an API query to search for compounds from a list of names? \n",
+    "\n",
+    "HINT: use the [NPAtlas API Documentation](https://www.npatlas.org/api/v1/docs#/) to see how to construct the URL "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## Exercise 3 Workspace:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 58,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# %load ./exercise_solutions/exercise_3.py"
+   ]
   }
  ],
  "metadata": {

diff --git a/posts/intro_to_python_and_jupyter/exercise_solutions/exercise_2.py b/posts/intro_to_python_and_jupyter/exercise_solutions/exercise_2.py
@@ -0,0 +1,2 @@
+for genus in np_atlas_set:
+	bacteria_genera.append(genus)
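
Review note on exercise_2.py: the loop above appends one item at a time, which matches the for-loop lesson. As a possible alternative, here is a minimal sketch using `list.extend`. The `bacteria_genera` values are copied from the notebook output in this diff; the contents of `np_atlas_set` are hypothetical, since the real set is defined outside the lines shown here.

```python
# bacteria_genera as printed by the notebook's for-loop cell
bacteria_genera = ["Escherichia", "Salmonella", "Bacillus",
                   "Staphylococcus", "Streptococcus", "Bhurkholderia"]

# Hypothetical stand-in: the real np_atlas_set is defined elsewhere in the notebook
np_atlas_set = {"Streptomyces", "Pseudomonas"}

# extend() appends every item of an iterable in one call; sorted() makes the
# order reproducible, because sets have no guaranteed iteration order
bacteria_genera.extend(sorted(np_atlas_set))

print(bacteria_genera[-2:])  # ['Pseudomonas', 'Streptomyces']
```

Either form leaves `bacteria_genera` with the new genera at the end; the explicit loop may still be the better fit for a lesson that has only just introduced for-loops.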
diff --git a/posts/intro_to_python_and_jupyter/exercise_solutions/exercise_3.py b/posts/intro_to_python_and_jupyter/exercise_solutions/exercise_3.py
@@ -0,0 +1,13 @@
+import requests as r
+
+def get_compound_data_by_name(compound_name):
+	response = r.get("https://www.npatlas.org/api/v1/compounds/full/?name="+compound_name)
+	return response.json()[0]
+
+compound_list = ['Lincomycin','Collismycin A','Streptomycin','Erythromycin B']
+for compound in compound_list:
+	try:
+		compound_data = get_compound_data_by_name(compound)
+		print(compound_data['original_name'],"is produced by",compound_data['origin_organism']['genus'],"and has a molecular weight of",compound_data['mol_weight'],"Da.")
+	except Exception:
+		print(compound,"was not found in the NPAtlas database.")
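
Review note on exercise_3.py: compound names such as 'Collismycin A' contain spaces, so it may be worth encoding the query string explicitly and separating the formatting step from the network call. The sketch below runs offline and makes several assumptions: the endpoint is copied from the solution above and should be confirmed against the NPAtlas API documentation, the helper names `build_name_query_url` and `describe_compound` are hypothetical, and the sample record reuses fields from the Streptomycin response shown in the notebook.

```python
from urllib.parse import urlencode

# Endpoint copied from exercise_3.py; an assumption to verify in the API docs
BASE_URL = "https://www.npatlas.org/api/v1/compounds/full/"


def build_name_query_url(compound_name):
    """Return the search URL with the compound name safely URL-encoded."""
    # urlencode turns {"name": "Collismycin A"} into "name=Collismycin+A"
    return BASE_URL + "?" + urlencode({"name": compound_name})


def describe_compound(compound_data):
    """Build the summary sentence from one record (a nested dictionary)."""
    name = compound_data["original_name"]
    genus = compound_data["origin_organism"]["genus"]  # dictionary inside a dictionary
    weight = compound_data["mol_weight"]
    return f"{name} is produced by {genus} and has a molecular weight of {weight} Da."


# Offline sample built from fields in the Streptomycin response shown in the notebook
sample = {
    "original_name": "Streptomycin",
    "mol_weight": "581.5800",
    "origin_organism": {"genus": "Streptomyces"},
}

print(build_name_query_url("Collismycin A"))
print(describe_compound(sample))
```

The URL returned by `build_name_query_url` can be passed to `r.get(...)` exactly as in the solution. The requests library can also do the encoding itself through its `params` argument, for example `r.get(BASE_URL, params={"name": compound_name})`.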