This document describes the process that a URL/dataset goes through from the time it has been identified by a Seeder & Sorter as "uncrawlable" until it is made available as a record in the datarefuge.org CKAN data catalog. The process involves several distinct stages, and is designed to make hand-offs smooth, so that each phase is handled by someone with expertise in the area they're tackling, while the data is tracked throughout for security.
- Get account credentials and go over the workflow.
1. Seeding and Sorting
Seeders and Sorters will use the EDGI subprimer systems (found here), or a similar set of resources, to identify important/at-risk data. Individual events should set up spreadsheets or other tools in which search efforts can be recorded. The work of this group includes:
- canvassing the resources of a given government agency and identifying important URLs.
- identifying whether those URLs can be crawled by the Internet Archive's web crawler.
- if URLs are crawlable, nominating them for the EOT crawl using the EDGI Nomination Tool.
- if they are not crawlable, adding them to the "uncrawlable" spreadsheet and generating a UUID for each dataset. The web-based tool UUID Generator can generate individual or multiple UUIDs; a UUID can also be generated from a script, as sketched below.
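As an alternative to the web-based tool, here is a minimal sketch of generating UUIDs with Python's standard library; the batch size and the idea of pasting the results into the spreadsheet are illustrative, not a prescribed part of the workflow.

```python
import uuid

# Generate one random (version 4) UUID for a single dataset.
print(str(uuid.uuid4()))

# Generate a batch, e.g. to paste alongside several rows of the
# "uncrawlable" spreadsheet.
batch = [str(uuid.uuid4()) for _ in range(10)]
print("\n".join(batch))
```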
2. Researching
Researchers inspect the "uncrawlable" list to confirm that Seeders' assessments were correct (that is, that the URL/dataset is indeed uncrawlable). Research.md describes this process in more detail.
Often this step is incorporated into either "Seeding and Sorting" or "Harvesting".
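One quick check a Researcher might script, as an assumption about practice rather than a documented requirement, is whether the Internet Archive already holds a capture of the URL, using the public Wayback Machine availability API; an existing capture suggests the page itself is crawlable, though dynamic content behind it may still not be. The example URL is a placeholder.

```python
import requests

def wayback_snapshot(url):
    """Return the closest Wayback Machine capture of `url`, or None."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

# Hypothetical example URL from the "uncrawlable" list.
hit = wayback_snapshot("https://www.example.gov/some-data-portal")
print(hit["url"] if hit else "no capture found")
```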
3. Harvesting
Harvesters take the "uncrawlable" data and try to figure out how to capture it. This is a complex task which can require substantial technical expertise, and different kinds of data call for different techniques (a minimal example is sketched below). Harvesters should see the included Harvesting Toolkit for more details and tools. Group Leaders should familiarize themselves with this process before the start of the event.
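As one illustrative technique for the simplest case only — a set of direct file links that the crawler cannot reach — the sketch below downloads each file into a local directory named after the dataset's UUID. The function name, directory layout, and example URL are assumptions for illustration; most harvests will need the tools in the Harvesting Toolkit instead.

```python
import os
import requests

def harvest(file_urls, dataset_uuid, dest_root="harvested"):
    """Download each direct file link into harvested/<dataset_uuid>/."""
    dest_dir = os.path.join(dest_root, dataset_uuid)
    os.makedirs(dest_dir, exist_ok=True)
    for url in file_urls:
        filename = url.rstrip("/").split("/")[-1] or "index.html"
        path = os.path.join(dest_dir, filename)
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
        print("saved", path)

# Hypothetical usage: URLs would come from the "uncrawlable" spreadsheet.
harvest(["https://www.example.gov/data/report-2016.csv"], "0a1b2c3d-dataset-uuid")
```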
4. Bagging
- do quality assurance on the work of the Harvesters, so that a second pair of eyes has passed over each dataset
- ensure that everything a researcher would need to understand the data is present
- package the data into a BagIt bag, which includes basic technical metadata (see the sketch below)
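A minimal sketch of the packaging step using the bagit-python library is below; the directory path and the bag-info fields are placeholders, as individual events may prescribe their own metadata.

```python
import bagit

# Directory produced by the harvester; path and bag-info fields are placeholders.
dataset_dir = "harvested/0a1b2c3d-dataset-uuid"

# make_bag() rearranges the directory in place into BagIt layout
# (a data/ payload directory, checksum manifests, and bag-info.txt).
bag = bagit.make_bag(
    dataset_dir,
    {
        "Source-Organization": "DataRefuge",
        "External-Identifier": "0a1b2c3d-dataset-uuid",
        "Contact-Name": "Bagger Name",
    },
)

# Verify checksums and completeness before handing off to the uploader.
print("bag is valid:", bag.is_valid())
```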
5. Uploading
- take the finished BagIt bag and upload it to the S3 instance (see the sketch below)
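A minimal upload sketch using boto3 is below. The bucket name and key layout are hypothetical, credentials are assumed to already be configured (the account credentials obtained before the event), and it assumes the bag has first been archived into a single tarball.

```python
import boto3

# Bucket name and key layout are hypothetical placeholders.
s3 = boto3.client("s3")
dataset_uuid = "0a1b2c3d-dataset-uuid"

# Upload the tarred bag under a key derived from the dataset UUID.
s3.upload_file(
    Filename=f"{dataset_uuid}.tar.gz",
    Bucket="datarefuge-example-bucket",
    Key=f"bags/{dataset_uuid}.tar.gz",
)
```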
6. Metadata
- create a CKAN record for this S3 resource (see the sketch below)
- link the bag to the record and make it public
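A minimal sketch of this step using the ckanapi client library is below. The catalog URL, API key, organization, dataset name, and S3 URL are all placeholders, and the exact metadata fields required by the datarefuge.org catalog are not specified here.

```python
from ckanapi import RemoteCKAN

# Catalog URL, API key, organization, and names below are all placeholders.
ckan = RemoteCKAN("https://catalog.example.org", apikey="YOUR-API-KEY")

# Create the dataset (package) record in the catalog.
package = ckan.action.package_create(
    name="0a1b2c3d-dataset-uuid",
    title="Example agency dataset",
    notes="Harvested copy of data that could not be captured by web crawl.",
    owner_org="datarefuge",
)

# Attach the bag uploaded to S3 as a resource, which links it publicly.
ckan.action.resource_create(
    package_id=package["id"],
    url="https://datarefuge-example-bucket.s3.amazonaws.com/bags/0a1b2c3d-dataset-uuid.tar.gz",
    name="BagIt bag",
    format="TAR.GZ",
)
```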