OpenEDGAR is designed to be run on Amazon Web Services to provide high-quality, reliable Internet access and intra-DC access to Amazon S3 for storage. While users can run OpenEDGAR from outside of AWS, an AWS account is required for S3 usage and performance will be substantially reduced.
-
Launch an EC2 instance
-
Update all packages
a.
$ sudo apt update
b.
$ sudo apt upgrade
-
Reboot
-
Format and mount disks (optional)
a.
$ mkfs.ext4 /dev/nvme1n1
b. add to
/etc/fstab
c. Reboot to test mount
-
Install Python:
$ sudo apt install build-essential python3-dev python3-pip virtualenv
-
Install Postgres:
$ sudo apt install postgresql-9.5 postgresql-client-common libpq-dev
-
Install Oracle Java
a.
$ sudo add-apt-repository ppa:webupd8team/java
b.
$ sudo apt-get update
c.
$ sudo apt-get install oracle-java8-installer oracle-java8-set-default oracle-java8-unlimited-jce-policy
d.
$ java -version
-
Clone repo (you may need to ensure you have permissions to create a directory under /opt)
a.
$ cd /opt
b.
$ git clone https://github.com/LexPredict/openedgar.git
-
Setup virtual environment
a.
$ cd /opt/openedgar
b.
$ virtualenv -p /usr/bin/python3 env
c.
$ ./env/bin/pip install -r lexpredict_openedgar/requirements/full.txt
-
Setup database. Note that the password chosen for openegar must be set as DJANGO_PASSWORD in the .env later
a.
$ sudo -u postgres createuser -l -P -s openedgar
b.
$ sudo -u postgres createdb -O openedgar openedgar
c. Move PG data folder (optional)
$ sudo systemctl stop postgresql $ sudo systemctl status postgresql $ sudo mv /var/lib/postgresql /data $ sudo ln -s /data/postgresql /var/lib/postgresql $ sudo chown -R postgres:postgres /var/lib/postgresql $ sudo systemctl start postgresql $ sudo systemctl status postgresql $ sudo -u postgres psql
-
Install and configure RabbitMQ
a.
$ wget https://packages.erlang-solutions.com/erlang-solutions_1.0_all.deb
b.
$ sudo dpkg -i erlang-solutions_1.0_all.deb
c.
$ sudo apt update
d.
$ sudo apt install rabbitmq-server
e.
$ sudo rabbitmqctl add_user openedgar openedgar
f.
$ sudo rabbitmqctl add_vhost openedgar
g.
$ sudo rabbitmqctl set_permissions -p openedgar openedgar ".*" ".*" ".*"
h. Move rabbitmq data folder (optional)
$ sudo systemctl stop rabbitmq-server.service $ sudo mv /var/lib/rabbitmq /data/ $ sudo ln -s /data/rabbitmq /var/lib/rabbitmq $ sudo chown -R rabbitmq:rabbitmq /var/lib/rabbitmq $ sudo systemctl start rabbitmq-server.service $ sudo systemctl status rabbitmq-server.service
-
Update .env file. For local testing (downloading files locally, instead of to S3), set CLIENT_TYPE to LOCAL and DOWNLOAD_PATH to a local path
a.
$ cp lexpredict_openedgar/sample.env lexpredict_openedgar/.env
b. Update DATABASE_URL
c. Update CELERY_BROKER_URL
d. Setup AWS S3 bucket
e. Setup IAM policy
{ "Version": "2012-10-17", "Statement": [ { "Sid": "[REPLACE:unique ID]", "Effect": "Allow", "Action": [ "s3:*" ], "Resource": [ "arn:aws:s3:::[REPLACE:your bucket]" ] }, { "Sid": "[REPLACE:unique ID]", "Effect": "Allow", "Action": [ "s3:*" ], "Resource": [ "arn:aws:s3:::[REPLACE:your bucket]/*" ] } ] }
f. Update
S3_ACCESS_KEY
,S3_SECRET_KEY
, andS3_BUCKET
-
Initial database migration
a.
$ cd /opt/openedgar/lexpredict_openedgar
b.
$ source ../env/bin/activate
c.
$ source .env
d.
$ python manage.py migrate
-
Setup Apache Tika and run
a.
$ cd /opt/openedgar/tika
b.
$ bash download_tika.sh
c.
$ bash run_tika.sh
(run with&
,nohup
, or as service) -
Setup Celery
a.
$ cd /opt/openedgar/lexpredict_openedgar
b.
$ source ../env/bin/activate
c.
$ source .env
d.
$ bash scripts/run_celery.sh
(run with&
,nohup
, or as service)
-
Build database of 10-Ks from 2018 from latest SEC EDGAR data
a.
$ cd /opt/openedgar/lexpredict_openedgar
b.
$ source ../env/bin/activate
c.
$ source .env
d.
$ python manage.py shell_plus
e. Retrieve all 10-Ks from 2018
>>> from openedgar.processes.edgar import download_filing_index_data, process_all_filing_index >>> download_filing_index_data(year=2018) >>> process_all_filing_index(year=2018, form_type_list=["10-K"])
f. Sample timing on
m5.large
(2 core, 8GB RAM): ~24 hours to retrieve and parse all 2018 10-Ksg. Sample statistics for 2018 10-Ks as of May
# Data on S3 Size of edgar/ on S3: Objects: 1645 Size: 2.4 GB Size of documents/raw/ on S3: Objects: 135497 Size: 2.1 GB Size of documents/text/ on S3: Object: 130469 Size: 1000.4 MB # Data in Postgres In [7]: Filing.objects.count() Out[7]: 1521 In [8]: FilingDocument.objects.count() Out[8]: 147598 In [9]: Company.objects.count() Out[9]: 1451