MapPTTH uses:
- libcURL (>= 7.62.0)
- Lexbor (see Installation if you don't want to install it on your system)
- libxml2
- libPCRE
- CMake (>= 3.1.0)
- GraphViz (libgvc and libcgraph): optional, used to generate graphs
- libcheck: optional, used for unit tests
On Ubuntu (with GraphViz support):
sudo apt install cmake libpcre3-dev libcurl4-openssl-dev libxml2-dev libgraphviz-dev
If Lexbor is not installed and you don't want to install it, you can clone it as a submodule together with MapPTTH and build without installing anything:
git clone --recurse-submodules https://github.com/A3onn/mapptth/
cd mapptth/
mkdir build/ && cd build/
cmake .. && make -j5
If you have all dependencies installed on your system:
git clone https://github.com/A3onn/mapptth/
cd mapptth/
mkdir build/ && cd build/
cmake .. && make -j5
If you want to generate unit tests
If GraphViz is found on the system when running CMake, you will be able to generate graphs.
If you want to disable it, run `cmake -DMAPPTTH_NO_GRAPHVIZ=1 ..` instead of `cmake ..`.
The only required argument is a URL. It specifies where the crawler starts.
Here is the list of available parameters grouped by category:
Name | Argument |
---|---|
URL where crawling starts. If several are given, the last one is used. (REQUIRED) | <URL> |
String that will be used as user-agent. You can disable sending the user-agent header by giving an empty string. (default='MAPPTTH/') | -U <user-agent> |
Timeout in seconds for each connection. If a connection times out, an error is printed to standard error, but no information about the URL is printed. (default=3) | -m <timeout> |
Only resolve to IPv4 addresses. | -4 |
Only resolve to IPv6 addresses. | -6 |
Add headers to the HTTP request, in the form "<key>:<value>;". The ':' and the value are optional, and each header must end with a ';'. | -Q <header> |
Allow insecure connections when using SSL/TLS. | -i |
Add cookies to the HTTP request, in the form "<key>:<value>;". You can specify multiple cookies at once by separating them with a ';'. Note that they won't be modified during the crawl. | -C <cookies> |
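
To make the `-Q` format concrete, here is a minimal Python sketch of how a string of the form "<key>:<value>;" could be split into a header name and value. The function name `parse_header_arg` and the handling of edge cases are assumptions made for illustration; this is not MapPTTH's actual parsing code.

```python
from typing import Tuple


def parse_header_arg(arg: str) -> Tuple[str, str]:
    """Split a '<key>:<value>;' string into (key, value).

    Hypothetical helper: it only mirrors the format described in the table.
    """
    if not arg.endswith(";"):
        raise ValueError("a header must end with ';'")
    body = arg[:-1]                      # drop the trailing ';'
    key, _, value = body.partition(":")  # the ':' and the value are optional
    if not key.strip():
        raise ValueError("a header needs a key")
    return key.strip(), value.strip()


print(parse_header_arg("X-Token:abc;"))  # ('X-Token', 'abc')
print(parse_header_arg("X-Empty;"))      # ('X-Empty', '')
```

Splitting on the first ':' only keeps values that themselves contain a colon, such as a URL, intact.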
Name | Argument |
---|---|
Allow the crawler to go into subdomains of the initial URL and allowed domains. (default=false) | -s |
Allow the crawler to go to these domains. | -a <domain> |
Prevent the crawler from going to these domains. | -d <domain> |
Allow the crawler to only fetch URLs starting with these paths. Can be a regex (extended and case-sensitive). | -p <path or regex> |
Prevent the crawler from fetching URLs starting with these paths. Can be a regex (extended and case-sensitive). | -P <path or regex> |
Maximum depth of paths. If a path has a longer depth, it won't be fetched. | -D <depth> |
Only fetch URLs with HTTP as scheme (Don't forget to add '-r 80' if you start with an 'https://' URL). | -f |
Only fetch URLs with HTTPS as scheme (Don't forget to add '-r 443' if you start with an 'http://' URL). | -F |
Allow the crawler to only fetch files with these extensions. If no extension is found then this filter won't apply. | -x .<extension> |
Prevent the crawler from fetching files with these extensions. If no extension is found then this filter won't apply. | -X .<extension> |
Allow the crawler to go to these ports. | -r <port> |
Keep the query part of the URL. Note that if the same URL is found with different queries, each variant will be fetched. | -q |
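
The depth and extension filters (`-D`, `-x`) can be pictured with a short Python sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, and "depth" is taken to mean the number of non-empty path segments, which may differ from how MapPTTH counts it.

```python
import posixpath
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count the non-empty segments of the URL path (assumed definition of depth)."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def extension_allowed(url: str, allowed: set) -> bool:
    """Apply an allow-list of extensions to the path of a URL."""
    _, ext = posixpath.splitext(urlsplit(url).path)
    if not ext:  # no extension found: the filter does not apply
        return True
    return ext in allowed


print(path_depth("https://example.com/a/b/file.html"))                # 3
print(extension_allowed("https://example.com/index.php", {".html"}))  # False
print(extension_allowed("https://example.com/dir/", {".html"}))       # True
```

Under this reading, a URL with no extension always passes the extension filter, which matches the note in the table that the filter does not apply in that case.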
Name | Argument |
---|---|
Only parse the <head> part. | -H |
Only parse the <body> part. | -B |
Name | Argument |
---|---|
Don't print with colors. | -c |
Print the title of the page if there is one when displaying a URL. | -T |
File to write output into (without colors). | -o <file name> |
Print a summary of what was found as a directory structure. | -O |
Print when encountering tel: and mailto: URLs. | -I |
MapPTTH must be compiled with GraphViz support.
Name | Argument |
---|---|
Create a graph. | -g |
Change the layout of the graph. (default='sfdp') | -L <layout> |
Change the output graph file format. (default='png') | -G <format> |
Name | Argument |
---|---|
Number of threads that will fetch URLs. (default=5) | -t <number of threads> |
Parse the sitemap of the site. This should speed up the crawl and may provide URLs that couldn't be found without the sitemap. | -S <URL of the sitemap> |
Parse the robots.txt of the site, paths found in 'allowed' and 'disallowed' directives are added to the list of found URLs. Other directives are ignored. | -R <URL of the robots.txt file> |
URL of the proxy to use. | -z <URL of the proxy> |
Print the help. | -h |
Print the version. | -V |
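
As an illustration of what the `-R` option describes, the following Python sketch collects the paths named by Allow and Disallow directives in a robots.txt body and ignores every other directive. The function name `robots_paths` and the sample content are assumptions; this is a simplified reading of the behaviour, not MapPTTH's parser.

```python
def robots_paths(robots_txt: str) -> list:
    """Collect the paths named by Allow and Disallow directives, in order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and whitespace
        name, sep, value = line.partition(":")
        if not sep:
            continue                          # not a "name: value" line
        if name.strip().lower() in ("allow", "disallow"):
            path = value.strip()
            if path and path not in paths:    # skip empty values and duplicates
                paths.append(path)
    return paths


sample = """User-agent: *
Disallow: /admin/
Allow: /public/ # comment
Crawl-delay: 5
Disallow:
"""
print(robots_paths(sample))  # ['/admin/', '/public/']
```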
You can stop the crawler with CTRL-C at any moment. It will stop gracefully and finish as normal.
Simple crawl:
mapptth https://google.com
Start crawling at a certain URL:
mapptth https://google.com/some/url/file.html
More threads:
mapptth https://google.com -t 10
Allow crawling into subdomains (e.g. www.google.com, mail.google.com, www.mail.google.com):
mapptth https://google.com -s
Allow crawling certain domains and their subdomains (e.g. www.google.com, mail.gitlab.com, www.mail.github.com):
mapptth http://google.com -s -a gitlab.com -a github.com -r 443
Disallow some paths:
mapptth https://google.com -P /path -P /some-path
Disallow a path and only fetch .html and .php files:
mapptth https://google.com -P /some-path -x .html -x .php
Only crawl in the /path directory:
mapptth https://google.com -p /path
A more complete and complicated one:
mapptth https://google.com/mail -x .html -P /some-path -t 10 -m 5 -s -q -D 6 -T -o output.txt -H -S http://www.google.com/sitemap.xml
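
The domain examples above (`-s` and `-a`) can be read as a suffix match on the hostname. The Python sketch below shows that reading; the function name `host_allowed` and the exact matching rule are assumptions made for illustration and may not match MapPTTH's implementation.

```python
def host_allowed(host: str, allowed_domains: list, subdomains: bool) -> bool:
    """Return True if host is an allowed domain or, optionally, one of its subdomains."""
    host = host.lower().rstrip(".")
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain:
            return True
        # Require a leading dot so that "notgoogle.com" does not match "google.com".
        if subdomains and host.endswith("." + domain):
            return True
    return False


domains = ["google.com", "gitlab.com", "github.com"]
print(host_allowed("www.mail.github.com", domains, subdomains=True))  # True
print(host_allowed("mail.google.com", domains, subdomains=False))     # False
```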
ASAP:
- Handling the <base> tag
Without any priority:
- Add a parameter to control the connection rate
- Create logo (maybe)
- Print when encountering mailto: or tel:
- Add robots.txt parser
- Add proxy support
- Use regex in filters (disallowed paths, allowed paths, etc.)
- Add examples in readme
- More unit tests
- Use only getopt to parse arguments
- GraphViz support to generate graphs
- Output to file
- Add parameters to control: disallowed domains, only allowed paths and disallowed extensions