- The application is fully dockerized for ease of deployment; nevertheless, it is a large system with more than a dozen containers.
- There are two major subsystems, Sitehound-* and deep-*. You can read more here.
- Ideally these two components are deployed separately and communicate via Apache Kafka (a minimal connectivity check is sketched after this list).
- For reference, there is an Architecture Diagram shown below.
- Once deployed you will only interact with the application via your browser.
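If you do split the two subsystems across hosts, a simple port check confirms that the deep-* host can reach the Kafka broker on the Sitehound host. This is only a sketch: {sitehound-host-ip} is our placeholder, and we assume Kafka and ZooKeeper listen on their default ports 9092 and 2181 (the same ports opened in the EC2 step below).

# From the deep-* host, check that the Sitehound host's Kafka broker and ZooKeeper are reachable
nc -vz {sitehound-host-ip} 9092
nc -vz {sitehound-host-ip} 2181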
We are aware that the hardware requirements are not easy to meet, so we also provide a hosted version. Send an email to support@hyperiongray.com and we will set up an account for you.
Since the stack of applications contains several infrastructure containers (mongo, elasticsearch, redis) and is designed to take advantage of the multicore architecture, we would recommend:
For a single host:
- At least 100GB of storage (if you plan to do serious crawling, 1TB is better)
- At least 16GB of dedicated RAM, 32GB is better
- 8 CPU cores
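You can quickly confirm that a candidate host meets these numbers with standard tools (a convenience sketch, not part of the official requirements):

# Check CPU cores, RAM and available disk space against the recommendations above
nproc
free -h
df -h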
- For a single host installation, we recommend the `m4.2xlarge` instance type.
- On the security groups panel, open the inbound traffic for ports `5081`, `2181` and `9092` on the Sitehound host EC2.
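If you prefer the AWS CLI to the console, opening those ports can look roughly like the sketch below. The security group ID is a placeholder of ours, and you should restrict the source CIDR to your own network rather than 0.0.0.0/0 where possible.

# Open the Sitehound UI (5081), ZooKeeper (2181) and Kafka (9092) ports on the EC2 security group.
# sg-0123456789abcdef0 is a placeholder; tighten --cidr to your own network if you can.
for port in 5081 2181 9092; do
  aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port "$port" \
    --cidr 0.0.0.0/0
done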
- Ubuntu 16.04 is the recommended OS, but it should play well with other distros. It won't work on Windows or Mac though, so a virtual machine would be needed in those cases. You can get one here.
- Update the system:
sudo apt update
- Docker CE or better installed. Docker API version should be at least 1.24. For Ubuntu 16.04 run:
sudo apt install docker.io
- docker-compose installed on the Deep-deep server. Version should be at least 1.10. For Ubuntu 16.04 run:
sudo apt install -y python-pip
export LC_ALL="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
sudo -H pip install docker-compose
- `$USER` (current user) added to the docker group. For Ubuntu 16.04 and user `ubuntu`, run `sudo usermod -aG docker ubuntu` and re-login. A quick way to verify these Docker prerequisites is sketched below.
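The following quick checks are our own suggestion for confirming the Docker prerequisites above:

# Docker API version should be at least 1.24
docker version --format '{{.Server.APIVersion}}'
# docker-compose version should be at least 1.10
docker-compose --version
# After re-login, the current user should reach the daemon without sudo
docker info > /dev/null && echo "docker is usable without sudo"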
- From the provided .zip file, copy the folder ./sitehound-configs to the home directory of the server, or servers if you choose the dual host installation:
scp -r sitehound-configs ubuntu@{host-ip}:~
- All further actions are executed in this `sitehound-configs` directory:
cd sitehound-configs
- Download the deep-deep models on the host where deep-deep will be running:
./download-deep-deep-models.sh
- Sitehound uses external services that are not part of this suite, such as Splash, excavaTor (currently private) for onion searches, and Crawlera as a proxy. In order to have the app fully running, you need to specify hosts/credentials for them in the config file by replacing the placeholder values with the actual ones.
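One way to locate the values that still need to be filled in is to search the configuration files for placeholder-looking text. The patterns below are only guesses on our part; check the files shipped in sitehound-configs for the exact placeholder names.

# Search for likely placeholder values (the patterns are assumptions, adjust as needed)
grep -rniE 'changeme|placeholder|your-' .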
- Make sure port `5081` is open. Start all services with:
docker-compose \
-f docker-compose.yml \
-f docker-compose.deep-deep.yml \
up -d
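Once the stack is up, the following checks (our suggestion) confirm that the containers started cleanly; the -f file list matches the command above:

# List the state of every container in the stack
docker-compose -f docker-compose.yml -f docker-compose.deep-deep.yml ps
# Tail recent logs if something looks unhealthy
docker-compose -f docker-compose.yml -f docker-compose.deep-deep.yml logs --tail=50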
Sitehound data is kept in mongodb and elasticsearch databases, outside the containers, using mounted volumes in the `volumes/` folder, so if you do `docker-compose down` or remove containers with `docker rm`, data will be persisted. Crawl results of deep-deep will be in `deep-deep-jobs/`, and broad crawl results will be in `./dd-jobs/`. Crawled data in CDRv3 format will be in `./dd-jobs/*/out/*.jl.gz`, and downloaded media items in `./dd-jobs/*/out/media/`.
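Since the .jl.gz files are gzipped JSON lines, a quick way to peek at a crawl result is to decompress and pretty-print the first record. This is just a convenience sketch and assumes at least one broad crawl has already written output:

# Pretty-print the first CDRv3 record from the broad crawl output
zcat ./dd-jobs/*/out/*.jl.gz | head -n 1 | python -m json.tool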
The application uses several external services:
- Crawlera: a proxy rotator.
- Our clustered hosted version of Splash.
- An onion index.
- Navigate to http://localhost:5081 (or your Sitehound's IP address).
- Log in with user `admin@hyperiongray.com` and password `changeme!`.
- Create a new workspace, and click on the row to select it.
- Follow this walk-through for more insight.
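If the login page does not load, a quick check from the host itself (our suggestion) tells you whether the UI is answering at all:

# Expect an HTTP status line back from the Sitehound UI
curl -I http://localhost:5081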
You can reach us at support@hyperiongray.com.