Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.

edgi-govdata-archiving/web-monitoring-diff

web-monitoring-diff

Web-Monitoring-Diff is a suite of functions that diff (find the differences between) content commonly found on the web, such as HTML and text files, in a variety of ways. It also includes an optional web server that generates diffs as an HTTP service.

This package was originally built as a component of EDGI’s Web Monitoring Project, but is also used by other organizations and tools.

Installation

web-monitoring-diff requires Python 3.7 or newer. Before anything else, make sure you’re using a supported version of Python. If you need to support different local versions of Python on your computer, we recommend using pyenv or Conda.
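
If you are not sure which interpreter your shell picks up, a quick check like the following can help. This is only a minimal sketch: the `MINIMUM_PYTHON` constant and the `is_supported` helper are names made up for this example and are not part of web-monitoring-diff.

```python
import sys

# Minimum version stated in this README.
MINIMUM_PYTHON = (3, 7)


def is_supported(version_info=None):
    """Return True if the given version tuple meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); patch releases do not matter here.
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "too old"
    print("Python {} is {} for web-monitoring-diff".format(current, status))
```

Run it with the same `python` executable you plan to use for `pip install`.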

  1. web-monitoring-diff depends on several system-level libraries that you may need to install first. Specifically, you’ll need: libxml2, libxslt, openssl, and libcurl.

    On MacOS, we recommend installing these with Homebrew:

    brew install libxml2
    brew install libxslt
    brew install openssl
    # libcurl is built-in, so you generally don't need to install it

    On Debian Linux:

    apt-get install libxml2-dev libxslt-dev libssl-dev openssl libcurl4-openssl-dev

    Other systems may have different package managers or names for the packages, so you may need to look them up.

  2. Install this package with pip. Be sure to include the --no-binary lxml option:

    pip install web-monitoring-diff --no-binary lxml

    Or, to also install the web server for generating diffs on demand, install the server extras:

    pip install web-monitoring-diff[server] --no-binary lxml

    The --no-binary flag ensures that pip downloads and builds a fresh copy of lxml (one of web-monitoring-diff’s dependencies) rather than using a pre-built version. It’s slower to install, but is required for all the dependencies to work correctly together. If you publish a package that depends on web-monitoring-diff, your package will need to be installed with this flag, too.

    On MacOS, you may need additional configuration to get pycurl to use the Homebrew openssl. Try one of the following:

    # Homebrew install locations vary by architecture.
    # For Apple silicon/ARM:
    PYCURL_SSL_LIBRARY=openssl \
      LDFLAGS="-L/opt/homebrew/opt/openssl/lib" \
      CPPFLAGS="-I/opt/homebrew/opt/openssl/include" \
      pip install web-monitoring-diff --no-binary lxml --no-cache-dir
    
    # Or for Intel:
    PYCURL_SSL_LIBRARY=openssl \
      LDFLAGS="-L/usr/local/opt/openssl/lib" \
      CPPFLAGS="-I/usr/local/opt/openssl/include" \
      pip install web-monitoring-diff --no-binary lxml --no-cache-dir

    The --no-cache-dir flag tells pip to re-build the dependencies instead of using versions it’s built already. If you tried to install once before but had problems with pycurl, this will make sure pip actually builds it again instead of re-using the version it built last time around.

    For local development, make sure to do an editable installation instead. See the “contributing” section below for more.

  3. (Optional) Install experimental diffs. Some additional types of diffs are considered “experimental” — they may be new and still have lots of edge cases, may not be publicly available via PyPI or another package server, or may have any number of other issues. To install them, run:

    pip install -r requirements-experimental.txt
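
The two MacOS variants in step 2 differ only in the Homebrew prefix. The sketch below picks the prefix from the machine architecture and returns the same three environment variables. The `pycurl_build_env` helper and the `HOMEBREW_PREFIXES` table are hypothetical names for illustration, and the mapping assumes `platform.machine()` reports `arm64` on Apple silicon and `x86_64` on Intel (a Python running under Rosetta would report `x86_64`).

```python
import platform

# Homebrew prefixes as used in the shell examples in step 2.
HOMEBREW_PREFIXES = {
    "arm64": "/opt/homebrew",  # Apple silicon/ARM
    "x86_64": "/usr/local",    # Intel
}


def pycurl_build_env(machine=None):
    """Return env vars for building pycurl against Homebrew's openssl."""
    if machine is None:
        machine = platform.machine()
    if machine not in HOMEBREW_PREFIXES:
        raise ValueError("No known Homebrew prefix for {!r}".format(machine))
    openssl = HOMEBREW_PREFIXES[machine] + "/opt/openssl"
    return {
        "PYCURL_SSL_LIBRARY": "openssl",
        "LDFLAGS": "-L{}/lib".format(openssl),
        "CPPFLAGS": "-I{}/include".format(openssl),
    }


if __name__ == "__main__":
    try:
        # Print the variables in a form you can paste before `pip install`.
        for name, value in pycurl_build_env().items():
            print('{}="{}"'.format(name, value))
    except ValueError as error:
        print(error)
```

The output is meant to be copied in front of the `pip install` command shown above; the sketch does not run pip itself.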

Basic Usage

This package can be imported as a library that provides diffing functions for use in your own Python code, or it can be run as a standalone web server.

Library Usage

Import web_monitoring_diff, then call a diff function:

import web_monitoring_diff

page1 = "<!doctype html>\n<html><body>This is page 1.</body></html>"
page2 = "<!doctype html>\n<html><body>This is page 2.</body></html>"
comparison = web_monitoring_diff.html_diff_render(page1, page2)
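
To see what a diff function gives back, it can help to list the fields of the result. The sketch below assumes the result is a dictionary, which is not stated in this README, so check the documentation of the diff you are using. `describe_result` is a hypothetical helper, and the import is guarded so the snippet still loads when the package is not installed.

```python
try:
    import web_monitoring_diff
except ImportError:  # The package is optional for this sketch.
    web_monitoring_diff = None


def describe_result(result):
    """Map each key of a diff result to the type name of its value."""
    return {key: type(value).__name__ for key, value in sorted(result.items())}


if __name__ == "__main__" and web_monitoring_diff is not None:
    page1 = "<!doctype html>\n<html><body>This is page 1.</body></html>"
    page2 = "<!doctype html>\n<html><body>This is page 2.</body></html>"
    comparison = web_monitoring_diff.html_diff_render(page1, page2)
    # Assumption: `comparison` is a dict; print its keys and value types.
    for key, type_name in describe_result(comparison).items():
        print(key, type_name)
```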

Web Server

Start the web server:

$ web-monitoring-diff-server

This starts the web server on port 8888.

Then use cURL, a web browser, or any other HTTP tools to get a list of supported diff types:

$ curl "http://localhost:8888/"

That should output some JSON like:

{"diff_types": ["length", "identical_bytes", "side_by_side_text", "links", "links_json", "html_text_dmp", "html_source_dmp", "html_token", "html_tree", "html_perma_cc", "links_diff", "html_text_diff", "html_source_diff", "html_visual_diff", "html_tree_diff", "html_differ"], "version": "0.1.0"}
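
If you are calling the server from Python rather than cURL, the index response can be unpacked with the standard library. This is a minimal sketch based only on the sample output above; `parse_index` is a hypothetical helper name, and it assumes the response always has the `diff_types` and `version` keys shown in that sample.

```python
import json


def parse_index(body):
    """Extract the supported diff types and server version from the index JSON."""
    data = json.loads(body)
    return list(data["diff_types"]), data["version"]


if __name__ == "__main__":
    # A shortened copy of the sample response shown above.
    sample = '{"diff_types": ["length", "links_json"], "version": "0.1.0"}'
    diff_types, version = parse_index(sample)
    print("Server version:", version)
    for name in diff_types:
        print(" -", name)
```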

You can use each of these diff types by requesting the URL:

http://localhost:8888/<diff_type>?a=<url_to_left_side_of_comparison>&b=<url_to_right_side_of_comparison>
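
When the URLs being compared contain their own query strings, it is safer to percent-encode them before placing them in `a` and `b`. The sketch below builds a request URL following the pattern above. `diff_url` is a hypothetical helper, not part of this package, and the default base URL assumes the server is running locally on port 8888.

```python
from urllib.parse import urlencode


def diff_url(diff_type, a, b, base="http://localhost:8888", **extra):
    """Build a diff request URL with percent-encoded query parameters.

    Any keyword arguments in `extra` are appended to the query string as-is;
    which options a given diff accepts is not covered by this sketch.
    """
    query = urlencode({"a": a, "b": b, **extra})
    return "{}/{}?{}".format(base.rstrip("/"), diff_type, query)


if __name__ == "__main__":
    print(diff_url(
        "links_json",
        "http://web.archive.org/web/20180918073921id_/https://www.nrel.gov/about/",
        "http://web.archive.org/web/20201006001420id_/https://www.nrel.gov/about/",
    ))
```

Encoding matters most when a compared URL contains `&` or `#`, which would otherwise be read as part of the request to the diff server itself.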

For example, to compare how the links on the National Renewable Energy Laboratory’s “About” page changed between 2018 and 2020 using data from the Internet Archive:

# URL of a version of the page archived in 2018:
$ VERSION_2018='http://web.archive.org/web/20180918073921id_/https://www.nrel.gov/about/'
# URL of a version of the page archived in 2020:
$ VERSION_2020='http://web.archive.org/web/20201006001420id_/https://www.nrel.gov/about/'
# Use the `links_json` diff to compare the page’s links and output as JSON:
$ curl "http://localhost:8888/links_json?a=${VERSION_2018}&b=${VERSION_2020}"

If you have jq installed, you might want to use it to format the result in a nicer way:

$ curl "http://localhost:8888/links_json?a=${VERSION_2018}&b=${VERSION_2020}" | jq
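
If jq is not available, Python's standard library can do roughly the same formatting. The following is a sketch under the assumption that the server responds with a UTF-8 JSON body; `fetch_text` and `pretty` are hypothetical helper names.

```python
import json
import sys
from urllib.request import urlopen


def fetch_text(url, timeout=60):
    """Download a URL and return its body decoded as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def pretty(body):
    """Re-indent a JSON document, similar to piping it through jq."""
    return json.dumps(json.loads(body), indent=2, ensure_ascii=False)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python pretty_diff.py "<full diff URL>"
    print(pretty(fetch_text(sys.argv[1])))
```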

You can pass additional arguments to the various diffs in the query string. See the full documentation of the server and of the various diffs for more details.

Docker

You can deploy the web server as you might any Python application, or as a Docker image. We publish official images at: https://hub.docker.com/repository/docker/envirodgi/web-monitoring-diff. The most recent stable release is always available using the :latest tag.

Specific versions are tagged with the SHA-1 of the git commit they were built from. For example, the image envirodgi/web-monitoring-diff:446ae83e121ec8c2207b2bca563364cafbdf8ce0 was built from commit 446ae83e121ec8c2207b2bca563364cafbdf8ce0.
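
For deployment scripts that pin an image to a commit, the tagging scheme above can be expressed as a small helper. This is only an illustrative sketch: `image_for_commit` is a hypothetical name, and the check assumes tags use the full 40-character lowercase SHA-1, as in the example above.

```python
import re

IMAGE = "envirodgi/web-monitoring-diff"
# A full git SHA-1 is 40 lowercase hexadecimal characters.
SHA1_PATTERN = re.compile(r"[0-9a-f]{40}")


def image_for_commit(sha):
    """Return the image reference for the given full commit SHA-1."""
    if not SHA1_PATTERN.fullmatch(sha):
        raise ValueError("Expected a full 40-character SHA-1, got {!r}".format(sha))
    return "{}:{}".format(IMAGE, sha)


if __name__ == "__main__":
    print(image_for_commit("446ae83e121ec8c2207b2bca563364cafbdf8ce0"))
```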

Note that, unlike running the command locally, the Docker image defaults to listening/serving on port 80 in the container. When you run it, you’ll want to map your ports. For example, to use port 8888 on your machine:

$ docker run -p 8888:80 envirodgi/web-monitoring-diff

Building Images

To build a production image, use the web-monitoring-diff target:

# Build it:
$ docker build -t web-monitoring-diff .

# Then run it:
$ docker run -p 8888:80 web-monitoring-diff

Point your browser or curl at http://localhost:8888.

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributing

This project wouldn’t exist without a lot of amazing people’s help. It could use yours, too: please file bugs or feature requests or make a pull request to address an issue or help improve the documentation.

If you’re looking for ways to help with the project, issues with the label “good-first-issue” are usually a good place to start.

When contributing to this project, please make sure to follow EDGI's Code of Conduct.

Developing Locally

When developing locally, you’ll want to do an editable install from your local git checkout, rather than installing normally from PyPI as described in the “installation” section above.

First, make sure you have an appropriate Python version and the necessary system-level dependencies described above in the “installation” section. Then:

  1. Clone this repository wherever you’d like to edit it on your hard drive:

    $ git clone https://github.com/edgi-govdata-archiving/web-monitoring-diff.git
    $ cd web-monitoring-diff

  2. Perform an editable install of the package in the repo:

    $ pip install -e .[server,dev,docs] --no-binary lxml

    NOTE: if you are using Python 3.9 or earlier you may not be able to install both the development and docs dependencies at the same time. Instead, just install the dev dependencies:

    $ pip install -e .[server,dev] --no-binary lxml

  3. Install additional dependencies for experimental features:

    $ pip install -r requirements-experimental.txt

  4. Make sure it works without errors by running a Python interpreter and importing the package:

    import web_monitoring_diff

  5. Edit some code!

  6. Before pushing your commits and making a PR, run the tests and lint your code:

    # Run tests:
    $ pytest .
    
    # Lint your code to make sure it doesn't have any style issues:
    $ pyflakes .
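
The import check in step 4 can also be scripted, which is handy if you rebuild your environment often. The sketch below is one possible way to do it; `import_error` is a hypothetical helper and is not part of this repository's tooling.

```python
import importlib


def import_error(module_name):
    """Try to import a module; return None on success or the error text on failure."""
    try:
        importlib.import_module(module_name)
    except ImportError as error:
        return str(error)
    return None


if __name__ == "__main__":
    problem = import_error("web_monitoring_diff")
    if problem is None:
        print("web_monitoring_diff imported without errors")
    else:
        print("Import failed:", problem)
```

A failure here usually points back to the system-level libraries or the `--no-binary lxml` option described in the installation section.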

Contributors

Thanks to the following people for all their contributions! This project depends on their work.

Contributions Name
💻 ⚠️ 🚇 📖 💬 👀 Dan Allan
💻 Vangelis Banos
💻 📖 Chaitanya Prakash Bapat
💻 ⚠️ 🚇 📖 💬 👀 Rob Brackett
💻 Stephen Buckley
💻 📖 📋 Ray Cha
💻 ⚠️ Janak Raj Chadha
💻 Autumn Coleman
💻 Luming Hao
🤔 Mike Hucka
💻 Stuart Lynn
💻 ⚠️ Julian Mclain
💻 Allan Pichardo
📖 📋 Matt Price
💻 Mike Rotondo
📖 Susan Tan
💻 ⚠️ Fotis Tsalampounis
📖 📋 Dawn Walker

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

License & Copyright

Copyright (C) 2017-2022 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the LICENSE file for details.