Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
Added information
  • Loading branch information
Movanna authored Dec 12, 2020
1 parent fc12077 commit 72534de
Showing 1 changed file with 18 additions and 10 deletions.
28 changes: 18 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,15 @@
# database_migration
Python scripts for migrating a MySQL database to a PostgreSQL database with a different structure
Python scripts for migrating a MySQL database to a PostgreSQL database with a different structure.

# Database migration scripts for Zacharias Topelius Skrifter, topelius.sls.fi
The databases belong to an edition project, Zacharias Topelius Skrifter, which publishes the works of the author Zacharias Topelius. The edition is published on a website (http://topelius.sls.fi), as e-books and in print. The PostgreSQL database is part of a generic digital edition platform, managed by the Society of Swedish Literature in Finland (see https://github.com/slsfi/digital_edition_documentation/wiki).

The database contains information about editions, texts, manuscripts, versions, facsimiles and editorial texts. The scripts listed below were written in the first half of 2020 and then used for migrating the relevant information. The new database improves the work flow of the publishing process considerably by storing more information, such as file paths for each text. Before, when (re)publishing a text on the web, the editor had to choose the file path by hand, and it wasn't permanently stored. On the new platform, this is no longer necessary. A lot of the functionality in the scripts deals with finding out the correct file paths for each text. Obviously, we didn't want to add 7 000 file paths manually into the new database.

There is also a script which creates the tables of contents for the editions (TOC was a table in the old database, but each TOC is a separate JSON file on the new platform). Another script splits large XML files into smaller ones, based on information from the database. Yet another one uses an API to fetch metadata for URL:s and inserts it into the database. The two last scripts update a remaining MySQL database, connected to the new platform.

All scripts were written in a pair programming context.

# Database migration scripts for Zacharias Topelius Skrifter

As a general rule, the scripts only migrate data concerning collections that are or will be a part of the published edition.

Expand All @@ -12,17 +20,17 @@ The scripts should be run in the following order:
## 1. migrate_main_tables.py
Enter the project name as the value of the variable project_name before running the script.

The script creates two json-files which map id:s from the old db to id:s created in the new db, in the id_dictionaries folder.
The script creates two JSON files which map id:s from the old db to id:s created in the new db, in the id_dictionaries folder.

## 2. migrate_manuscripts_and_versions.py
This script uses publication_ids.json which was created by script 1.

The script creates two json-files which map id:s from the old db to id:s created in the new db, in the id_dictionaries folder.
The script creates two JSON files which map id:s from the old db to id:s created in the new db, in the id_dictionaries folder.

## 3. migrate_facsimiles.py
This script uses publication_ids.json which was created by script 1 and manuscript_ids.json which was created by script 2.

The script creates a json-file which map id:s from the old db to id:s created in the new db, in the id_dictionaries folder.
The script creates a JSON file which map id:s from the old db to id:s created in the new db, in the id_dictionaries folder.

It creates a log file with info about facsimiles which do not belong to migrated publications. In some cases, these facsimiles appear elsewere on the website and might need special attention.

Expand Down Expand Up @@ -65,26 +73,26 @@ Enter the project id as the value of the variable PROJECT_ID before running the

This script uses collection_ids.json which was created by script 1 and Lfb_split.csv (for Läsning för barn).

The script fetches info from table tableofcontents in old db and transforms it into one properly ordered toc json file for each new collection. It sorts the toc items based on different values in the db.
The script fetches info from table tableofcontents in old db and transforms it into one properly ordered toc JSON file for each new collection. It sorts the toc items based on different values in the db.

## 9. split_lasning_for_barn.py
This script uses the XML files for Lasning for barn and the list Lfb_split.csv, which contains publication info.

The script splits the 8 large XML files so that each story/publication is in a separate file.

The script also creates a csv file (Lfb_signum_filer.csv) mapping legacy id:s with the newly created file paths. This file is used by update_publication_with_filepaths.py.
The script also creates a CSV file (Lfb_signum_filer.csv) mapping legacy id:s with the newly created file paths. This file is used by update_publication_with_filepaths.py.

## 10. split_lasning_for_barn_comments.py
This script uses Lfb_split.csv, which contains publication info, and Lfb_kommentarer.xml, which contains all the general comments for Lasning for barn.

The script creates an XML file for each comment, adds the right content and saves the file path. It then inserts comment info into the database, connecting it to the right publication.

The script also creates a csv file (Lfb_kommentarer_filer.csv) mapping legacy id:s with the newly created file paths.
The script also creates a CSV file (Lfb_kommentarer_filer.csv) mapping legacy id:s with the newly created file paths.

## 11. facsimile_url_info.py
This script uses publicistiktabell.csv to create a list of facsimile legacy id:s and url:s.

It fetches metadata based on url using API and inserts it into table publication_facsimile_collection. The script also inserts id:s into table publication_facsimile.
It fetches metadata based on URL using API and inserts it into table publication_facsimile_collection. The script also inserts id:s into table publication_facsimile.

## 12. create_introduction_and_title.py
This script uses collection_ids.json, which was created by script 1, and introduction_title_names.csv, which contains the name bases for each title page and introduction XML file. It also needs the current project_id.
Expand All @@ -95,4 +103,4 @@ It inserts data for introductions and title pages and updates table publication_
This script updates the document_id:s in table documentnote in topelius_notes for collection Läsning för barn. It has to be run only once; document_id:s are not continously updated. The XML files need to be accessible for the script.

## 14. update_comment_with_lasning_for_barn.py
This script was used for updating table document in db topelius_notes with filepaths to comments for Läsning för barn.
This script was used for updating table document in db topelius_notes with filepaths to comments for Läsning för barn.

0 comments on commit 72534de

Please sign in to comment.