
Add docs around getting started. #13

Closed
wants to merge 2 commits into from
Conversation

@parkr parkr commented Aug 6, 2014

I had no clue how to get started, so I monkeyed around until I figured most of it out. I'm still unclear on how to download the report data so `tasks/inspectors.js` can process it and index it in Elasticsearch.

#### Initializing the data

1. Run `bundle install && rake -f ./tasks/elasticsearch.rake elasticsearch:init`
2. Run `??????????????`, which places the report data in `data/`
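Step 1 above can be wrapped in a small guard so it fails loudly when run outside the repo root. This is a sketch only: the `-f` rakefile flag and the assumption that Ruby, Bundler, and a local Elasticsearch are already installed are mine, and step 2 remains the open question above.

```shell
# Hypothetical wrapper around step 1 (sketch, not the project's own script).
# Assumes Ruby, Bundler, and a running local Elasticsearch.
run_init() {
  if [ -f ./tasks/elasticsearch.rake ]; then
    bundle install && rake -f ./tasks/elasticsearch.rake elasticsearch:init
  else
    echo "tasks/elasticsearch.rake not found; run this from the repo root" >&2
    return 1
  fi
}
```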
@konklone What do I have to do here?

@konklone

Sorry for the delay in responding! I was at DEFCON without a computer for several days.

I should really do some more dedicated docs in the README, but the short of it is that the data comes from this @unitedstates project. The `data/` directory for that project should be, or be symlinked to be, the `data/` directory for this project.
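Concretely, that symlink might look like this; the sibling checkout path `../inspectors-general` is an assumption on my part, not something stated in the thread:

```shell
# Point this project's data/ at the scraper project's data/ directory.
# ../inspectors-general is an assumed location for the scraper checkout.
ln -sfn ../inspectors-general/data data
```

`ln -sfn` replaces any existing link, so it is safe to re-run if the scraper checkout moves.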

@konklone

Let's leave this PR open, and I'll add to it with more docs later (unless you beat me to it) and merge it in.


parkr commented Aug 14, 2014

> the short of it is that the data comes from this @unitedstates project. The `data/` directory for that project should be, or be symlinked to be, the `data/` directory for this project.

That's easy enough. Curious, however, that there is no data directory in the project you linked to. Is that intentional? Why aren't you tracking the data? Does everyone have to scrape it themselves? Why not use a submodule in both projects for this directory, and populate and update it as new reports come in?

Then installing this directory is as simple as `git submodule update --init`.
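For illustration, the proposed flow might look like the following scratch-repo sketch; the submodule URL is a guess at the @unitedstates scraper repo, not confirmed in the thread:

```shell
# Scratch-repo demonstration of the proposed submodule flow.
# The repo URL is an assumption; the actual `git submodule add` step
# needs network access, so it is left as a comment.
set -e
scratch=$(mktemp -d)
cd "$scratch"
git init -q .
# git submodule add https://github.com/unitedstates/inspectors-general data
git submodule update --init  # no-op here until a submodule is registered
```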

@konklone

It's currently 32GB of data, which is not a good fit for git. So someone needs to scrape it all themselves if they want a complete import. However, you can scrape any subset you want in order to have something to load in.

It might be nice to provide a helpful bulk data sample, though keeping it up to date if there are schema changes would be annoying. In the long run, I want to get the entire dataset regularly into the Internet Archive's archives, but that's not done yet. In the meantime, the setup instructions require a separate project to be downloaded and its scrapers run.

@konklone

The Internet Archive thing needn't be a long-run thing, actually; it's something I plan to do in the next few weeks. I'm tracking that at unitedstates/inspectors-general#63. IA has a very nice S3-compatible interface for bulk uploading.

IA integration is an issue I'm not looking for help on; I want to be the point person on that. I don't know whether the size of the collection will require any negotiation on my part or the project's, or whether IA would be interested in making a more dedicated collection view for it; I'm not certain because I just haven't dug in yet.

@konklone konklone closed this in 59d7508 Aug 16, 2014
@konklone

@parkr, I just spent some time and more fully documented the project in the README. Let me know if there's still stuff missing.

@parkr parkr deleted the better-docs branch August 17, 2014 00:32