Commit ef043d3: Josh's edits for whitespace

joshlawrimore committed Jan 28, 2025 (1 parent: 17207f6)

Showing 1 changed file (README.md) with 28 additions and 26 deletions.
A collection of scripts for extracting, transforming, and loading data.

## Development setup


The following will allow you to run the scripts in this project:

```bash
docker compose -f .docker/postgres-compose.yaml up -d
```
### Tidying the local database resources

The following will remove the Postgres container and its associated volume (the `-v` flag):

```bash
docker compose -f .docker/postgres-compose.yaml down -v
```

### Install the pre-commit hooks

If you are developing locally, you should use pre-commit hooks to ensure that your code is formatted correctly and passes linting checks.

```bash
pre-commit install
# run the pre-commit hooks on all files
pre-commit run --all-files
```


### Run the tests

You can run the test suite (assuming you have activated the virtual environment and set up the required resources) with the following command:

```bash
pytest
```

### Database Setup

To set up the database for this project, follow these steps:

1. **Create the Database**:
   - If the database does not exist, you need to create it. This can be done using a database client or command-line tool specific to your database system. For example, using PostgreSQL, you might run:

     ```bash
     createdb your_database_name
     ```

2. **Initialize the Database Schema**:
   - Once the database is created, you need to apply the database schema using Alembic. Run the following command to apply all migrations:

     ```bash
     alembic upgrade head
     ```
3. **Verify the Setup**:
   - After running the migrations, verify that the database schema is correctly set up by checking the tables and their structures.
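One quick check is listing the tables with `psql -d dsst_etl -c "\dt"`. The same kind of existence check can be sketched in Python; here an in-memory SQLite database stands in for Postgres so the snippet is self-contained, and the `works` table name is purely hypothetical:

```python
import sqlite3

# In-memory SQLite stands in for the project's Postgres database so this
# sketch is self-contained; the 'works' table name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (id INTEGER PRIMARY KEY, doi TEXT)")

# List all tables, analogous to psql's \dt against the real database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
assert "works" in tables, "expected table missing; rerun 'alembic upgrade head'"
print(tables)
```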


### Database Migrations

This project uses Alembic for database migrations. Follow the steps below to generate and apply migrations to the database.

#### Prerequisites

- Ensure your database is running. If you're using Docker, you can start the database with:

  ```bash
  docker compose -f .docker/postgres-compose.yaml up -d
  ```
Expand All @@ -92,6 +93,7 @@ This project uses Alembic for database migrations. Follow the steps below to gen
1. **Configure Alembic**: Ensure that the `alembic/env.py` file is correctly set up to connect to your database. The connection settings are managed through environment variables in your `.env` file.
2. **Create a New Migration**: To create a new migration script, run the following command:

   ```bash
   alembic revision --autogenerate -m "Description of changes"
   ```
Expand All @@ -101,6 +103,7 @@ This project uses Alembic for database migrations. Follow the steps below to gen
3. **Review the Migration Script**: Open the generated migration script and review it to ensure it accurately reflects the changes you want to make to the database schema.
4. **Apply the Migration**: To apply the migration to the database, run:

   ```bash
   alembic upgrade head
   ```
For more detailed information on using Alembic, refer to the [Alembic documentation](https://alembic.sqlalchemy.org/en/latest/).
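For orientation, an autogenerated script (step 2 above) defines paired `upgrade`/`downgrade` functions that you review in step 3. The table and columns below are hypothetical, not this project's actual schema:

```python
"""add works table

Revision ID: abc123  (hypothetical)
"""
from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    # Forward migration: create the new table.
    op.create_table(
        "works",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("doi", sa.Text, nullable=True),
    )


def downgrade() -> None:
    # Reverse migration: undo exactly what upgrade() did.
    op.drop_table("works")
```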

### Database Maintenance

The shared database is deployed using OpenTofu (see the `terraform` directory).

A connection example (supplying the database password and host as required):

```bash
PGPASSWORD=<password> psql -h <host> -U postgres -d dsst_etl -c "\l"
```

To list snapshots:

```bash
aws rds describe-db-snapshots --db-instance-identifier dsst-etl-postgres-prod --query 'DBSnapshots[*].{SnapshotId:DBSnapshotIdentifier,SnapshotType:SnapshotType,Status:Status,Created:SnapshotCreateTime}'
```

To manually create a snapshot:

```bash
aws rds create-db-snapshot \
    --db-instance-identifier dsst-etl-postgres-prod \
    --db-snapshot-identifier dsst-etl-postgres-prod-manual-1
```

To delete a snapshot:

```bash
aws rds delete-db-snapshot \
    --db-snapshot-identifier dsst-etl-postgres-prod-manual-1
```

## Script descriptions

### get_ipids.py

#### 'IC': Institute or Center abbreviation

- Values are defined in the list 'ICs', which includes abbreviations for various NIH institutes and centers.
- A regular expression (`re.findall`) is used to extract IPID numbers from the response text.
- For each unique IPID, a row with 'IC', 'YEAR', and 'IPID' is added to the CSV, avoiding duplicates.
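The extraction loop described above can be sketched as follows; the response text, the `ipid` pattern, and the IC/YEAR values are assumptions for illustration, not the script's exact inputs:

```python
import re

# Hypothetical response text; the real script fetches report pages from
# the NIH intramural reports site. The regex is an illustrative assumption.
response_text = "ipid=12345 ... ipid=67890 ... ipid=12345"

rows, seen = [], set()
for ipid in re.findall(r"ipid=(\d+)", response_text):
    if ipid not in seen:  # skip duplicate IPIDs
        seen.add(ipid)
        rows.append({"IC": "NIA", "YEAR": "2023", "IPID": ipid})
print(rows)
```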

### get_pmids.py

#### 'PI': Principal Investigator(s)

- The 'headings' and 'showname' HTML elements are searched for relevant labels to extract the names of Principal Investigators.

#### 'PMID': PubMed ID

- A regular expression is used to find patterns matching PubMed IDs in the HTML content.

#### 'DOI': Digital Object Identifier

- A regular expression is used to find patterns matching DOI values in the HTML content.

#### 'PROJECT': Project associated with the report

- Extracted from the 'contentlabel' HTML element within the reports.
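A minimal sketch of the PMID, DOI, and project extraction; the sample HTML and all three patterns are illustrative assumptions rather than the script's exact regexes:

```python
import re

# Hypothetical report HTML; the real script parses pages fetched from
# the reports site.
html = """
<div class="contentlabel">Project ZIA AG000123-01</div>
PMID: 31452104 ... doi:10.1038/s41586-020-2649-2
"""

pmids = re.findall(r"PMID:\s*(\d{7,8})", html)
dois = re.findall(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+", html)
projects = re.findall(r'class="contentlabel">([^<]+)<', html)
print(pmids, dois, projects)
```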

### get_pmids_articles.py

#### 'pmids_articles.csv': Filtered CSV containing articles that meet specific criteria

- Removes publications with types: ['Review', 'Comment', 'Editorial', 'Published Erratum'].
- Only includes publications identified as articles based on PubMed API data.
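The filtering rule can be sketched like this; the record shape is an assumption about what comes back from the PubMed API, not the script's actual data structure:

```python
# Publication types excluded by the filter described above.
EXCLUDED_TYPES = {"Review", "Comment", "Editorial", "Published Erratum"}

# Hypothetical records; the real script derives these from PubMed API data.
records = [
    {"pmid": "111", "pub_types": ["Journal Article"]},
    {"pmid": "222", "pub_types": ["Journal Article", "Review"]},
    {"pmid": "333", "pub_types": ["Editorial"]},
]

articles = [
    r for r in records
    if "Journal Article" in r["pub_types"]
    and not EXCLUDED_TYPES.intersection(r["pub_types"])
]
print([r["pmid"] for r in articles])  # only the plain journal article survives
```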

### data_conversion.py

#### Fetches information for PubMed articles, specifically titles and journal names

- 'pmid': PubMed ID (unique identifier for a publication in PubMed).
- 'title': Title of the PubMed article.
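The fetch step can be sketched against a canned E-utilities `esummary`-style response; the field names (`title`, `fulljournalname`) follow PubMed's esummary JSON, but the record itself is invented so no network call is needed:

```python
import json

# Canned esummary-style response so the parsing is visible without a
# network call; the record is invented for illustration.
canned = json.loads("""
{"result": {"uids": ["31452104"],
            "31452104": {"title": "Example article title.",
                         "fulljournalname": "Example Journal"}}}
""")

rows = [
    {"pmid": uid,
     "title": canned["result"][uid]["title"],
     "journal": canned["result"][uid]["fulljournalname"]}
    for uid in canned["result"]["uids"]
]
print(rows)
```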
- renv
- [Open Data Detection in Publications (ODDPub)](https://github.com/quest-bih/oddpub). Required for [rtransparent](https://github.com/serghiou/rtransparent). *Must use v6.0!* If installing manually, run `devtools::install_github("quest-bih/oddpub@v6")`. ODDPub releases newer than v6 use different parameters.
- [CrossRef Minter (crminer)](https://github.com/cran/crminer). Required for [metareadr](https://github.com/serghiou/metareadr)
- [Meta Reader (metareadr)](https://github.com/serghiou/metareadr). Required for [rtransparent](https://github.com/serghiou/rtransparent).
