Getting Started Developing CourtListener

Chad Weider edited this page Oct 19, 2024 · 25 revisions

CourtListener is built upon a few key pieces of technology that come together to make everything work. Whether you're contributing platform code, web design, or something else, it helps to know the easiest ways to run and test the project.

But before we can get into that, we must address...

Legal Matters

Not surprisingly, we have a lot of legal, and, in particular, intellectual property, lawyers around here. As a result, we endeavor to be a model for other open source projects in how we handle IP contributions and concerns.

We do this in a couple of ways. First, we use a copyleft license for CourtListener, the GNU Affero General Public License (AGPL). Read the details in the license itself, but at a high level, it's a copyleft license designed specifically for code that runs on servers and isn't distributed to end users (the way an app would be, say).

The other thing we do is require a contributor license agreement from any non-employees or non-contractors that contribute code to the project. The first time you make a contribution to any of our repos, a bot will ask you to sign the agreement. Please do so. If you have any questions about it, please ask.

On with the show.

Discussing things

You can use GitHub Discussions to ask questions and search past ones. We should use this more, but mostly people seem to find their way into our Slack and ask things there. When they do that, the answers to their questions go into a black hole and only ever help the person who asked them.

If you can, please try to ask questions in the discussions board and if answers are useful, they should go into a wiki page.

Architecture

The major components of CourtListener are:

  • PostgreSQL - For database storage. We love PostgreSQL.

  • Redis - For in-memory fast storage, caching, task queueing, some stats logging, etc. Everybody loves Redis for a reason. It's great. If you have something small you want to store quickly and kind of durably, it's fantastic.

  • Celery - For running asynchronous tasks. We've been using this a long time. It causes a lot of annoyance and sometimes will have unsolvable bugs, but as of 2019 it's better than any of the competition that we've tried.

  • Judge Pics and Court Seals microservices - These are services we host via S3. They're not complicated. They just give you a photo of a judge or a seal at a given URL.

  • Doctor - This is our microservice for document and audio file conversions. Inside the service we do text extraction, OCR (using tesseract), mp3 enhancements, etc.

  • Solr - For making things searchable. It's decent. Our version is currently very old, but it hangs in there. We've also tried Sphinx. Lately, we've been moving towards Elastic. It's being deployed one object at a time right now.

  • React and HTMX - We're experimenting with React for dynamic front-end features, but increasingly it feels overly complex. Instead, the tool we generally reach for is HTMX.

  • Python/Django/et al - And their associated bits and pieces.

Developer Installation

We use a Docker Compose file to make development easier. Below is the process for getting everything working. If you get stuck, note that we run all tests in GitHub Actions and that there is a complete, automated setup script in tests.yml that you can consult.

To set up a development machine, do the following:

  1. Clone the courtlistener and courtlistener-solr-server repositories so that they are side-by-side in the same folder.

  2. Next, you'll need to update the group permissions for the Solr server. cd into the courtlistener-solr-server directory, and run the following commands:

    sudo chown -R :1024 data
    sudo chown -R :1024 solr
    sudo find data -type d -exec chmod g+s {} \;
    sudo find solr -type d -exec chmod g+s {} \;
    sudo find data -type d -exec chmod 775 {} \;
    sudo find solr -type d -exec chmod 775 {} \;
    sudo find data -type f -exec chmod 664 {} \;
    sudo find solr -type f -exec chmod 664 {} \;
  3. Create a personal settings file in the courtlistener directory. To do that, copy-paste the .env.example file to .env.dev, and then minimally uncomment these settings:

    • ALLOWED_HOSTS: This is needed so tests can pass. You can set it to localhost for more security, or set it to * if you're on a safe LAN.
    • SECRET_KEY: This is a Django setting used for cryptographic signing, among other things. Just set it to something random, if you like.

That will get you pretty far, but CourtListener does rely on a number of cloud services, as you'll see in the .env.example file. To make all features work, you'll need to get tokens for these services.
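If you need "something random" for SECRET_KEY, one option is Python's standard secrets module. This is a minimal sketch, not part of CourtListener; the function name make_secret_key is hypothetical, and the 50-byte default is an arbitrary choice.

```python
import secrets


def make_secret_key(num_bytes: int = 50) -> str:
    """Return a random URL-safe string to use as a development SECRET_KEY.

    The output contains only letters, digits, "-" and "_", so it can be
    pasted into a .env file without quoting.
    """
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Copy the printed line into your .env.dev file.
    print(f"SECRET_KEY={make_secret_key()}")
```

Use a key generated this way only for development. Production keys should be created and stored separately.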

See "How Settings Work in CourtListener" below for more information about settings files.

  4. Next, create the bridge network that docker relies on:

    docker network create -d bridge --attachable cl_net_overlay

    This is important so that each service in the compose file can have a hostname.

  5. cd into courtlistener/docker/courtlistener, then launch the server by running:

     docker compose up
    

    Docker Desktop for Mac users: By default, Docker runs with very little memory (2GB), so to run everything properly you will need to change the default values:

    • Go to Docker Settings/Resources/Advanced
    • Increase Memory to at least 4GB and Swap to 2GB
    • Apply changes and Restart.

  6. Generate some dummy data for your database:

     docker exec -it cl-django python /opt/courtlistener/manage.py make_dev_data
    

    Note that /opt/courtlistener/manage.py is a path inside the cl-django container, which is where docker exec runs the command. It is not a path on your host machine, so it will not match the location of your local checkout.

    If this does not make an object that you want for your work, you should update it so it does. It's a relatively new tool and it's growing as we use it.

    If you need specific data from CourtListener (to debug something, say), you can use clone_from_cl, which pulls data from the CourtListener API into your dev database.

  7. Finally, create a new superuser login by running this command, and entering the required information:

     docker exec -it cl-django python /opt/courtlistener/manage.py createsuperuser
    

Speed Tip: If you want your tests and docker images to go faster, you might be able to run:

docker compose -f docker-compose.yml -f docker-compose.tmpfs.yml up

If you do that, you'll run PostgreSQL in memory. That means it'll get wiped out whenever you restart docker, but it should provide a speed bump. We do this in CI, for example.

M1 Mac/ARM Tip: If you are running tests on Apple silicon, you may need to change the selenium image to image: seleniarm/standalone-chromium:latest to properly run the tests.

So that should be it! You should now be able to access the following URLs:

A good next step is to run the test suite to verify that your development server is configured correctly.

AWS

If you are working on something that needs access to AWS resources, reach out to us. If the use is justified, we will provide you with credentials that you'll use to generate your temporary keys.

Export the variables in your terminal, then run docker compose from that same terminal. The variables will be passed into the containers.

export AWS_ACCESS_KEY_ID=XXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXX
export AWS_SESSION_TOKEN=XXXXXXXXXXXX
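
Before starting compose, it can help to confirm that all three variables are actually set in the current shell. The following is a small sketch for that purpose; the script and the function name missing_aws_vars are hypothetical and not part of CourtListener.

```python
import os
from collections.abc import Mapping

# The three variables named above.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required AWS variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing AWS variables: " + ", ".join(missing))
    else:
        print("All AWS variables are set.")
```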

Problem Solving

  1. It seems like dependencies are missing when I start docker.

    Sometimes the code is ahead of the images that the docker containers (e.g. cl-django) were created with. If so, you'll want to rebuild the image and recreate the container. The surefire way is:

     docker compose up -d --build
    

    But if you know exactly which image you want to rebuild:

     docker compose up -d --build cl-django
    
  2. Add your problem here...

Ongoing code upgrades you should do when possible

Python and its ecosystem are always evolving. There are a number of best practices that we follow to try to keep up:

  1. We are gradually adding type hints to our code. It's too difficult to do this all at once, so all new code should be hinted. If you have time while in a file, you should add hints where you can. It may help to enable an error in your IDE for unhinted code.

    After you've added hints, you should add the file or the module to the list of supported modules in lint.yml. A great pull request will add a file or module to lint.yml.

  2. We no longer use Django fixtures in tests. They are slow and confusing and should be removed when you see them. Instead, they should be replaced with setUpTestData() and factories. Search the code for examples. New fixtures should be strictly avoided.

  3. Wherever possible, we should use pathlib instead of os.path and friends. It's almost always nicer, and it's worth learning and using.
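
To illustrate the first and third points together, here is a short sketch of fully hinted functions that use pathlib. It is not taken from the CourtListener codebase; the function names are hypothetical, and the comments show the rough os.path equivalents being replaced.

```python
from pathlib import Path


def find_files(root: Path, suffix: str = ".txt") -> list[Path]:
    """Return every file under root that ends with suffix, sorted by path.

    With os.path this would need os.walk(), os.path.join(), and
    str.endswith(); Path.rglob() does the recursive matching in one call.
    """
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


def read_first_line(path: Path) -> str:
    """Return the first line of a text file without its trailing newline."""
    # Path.open() replaces open(os.path.join(...)).
    with path.open(encoding="utf-8") as f:
        return f.readline().rstrip("\n")
```

Because every parameter and return value is hinted, a type checker can verify callers of these functions once the module is added to the checked list.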

Logs

You can see most of the logs via docker when you start it. CourtListener also keeps a log in the cl-django container that you can tail with:

docker exec -it cl-django tail -f /var/log/courtlistener/django.log

But usually you won't need to look at these logs.

Logs in production are currently only available to a few people, but once we move to Kubernetes, they'll be available to all.

Uncaught exceptions and logger.error calls are monitored using Sentry. Some advice on how to improve Sentry's grouping of events can be found here.

If you want to collect the Sentry event data for debugging purposes, you can access it using the API as explained here.

How Settings Work in CourtListener

CourtListener has two kinds of settings: Those that can and should be adjusted by end users and those that are the same for everybody. Ones that can be adjusted have sane, safe defaults that can be overridden by environment variables. Ones that should not be overridden are simply hardcoded until somebody makes the case that they should be adjustable.

For developers (it's different in prod), adjustable settings are configured via a file located at .env.dev. Variables set in that file are picked up via the docker compose file or a Kubernetes manifest, which passes them to the docker image running your code. Finally, our Django code uses django-environ to pick up and use those variables.

All defaults — for adjustable and fixed settings alike — can be found in the cl/settings directory, where they are organized into a few categories of settings (third-party vs. first-party, django, and misc).
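
As a rough illustration of the two kinds of settings, here is a sketch that uses only the standard library. CourtListener itself uses django-environ for this, so the helpers env_bool and env_list are hypothetical stand-ins, and the DEBUG default and the EXAMPLE_PAGE_SIZE setting are assumptions made for the example.

```python
import os
from collections.abc import Mapping


def env_bool(name: str, default: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean setting, falling back to default when it is unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_list(
    name: str, default: list[str], env: Mapping[str, str] = os.environ
) -> list[str]:
    """Read a comma-separated setting, falling back to default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return [item.strip() for item in raw.split(",") if item.strip()]


# Adjustable settings: safe defaults that an environment variable
# (for example one set in .env.dev) can override.
ALLOWED_HOSTS = env_list("ALLOWED_HOSTS", default=["localhost"])
DEBUG = env_bool("DEBUG", default=False)

# Fixed setting: hardcoded, the same for everybody (hypothetical name).
EXAMPLE_PAGE_SIZE = 20
```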

This design comports with factor three of the 12-factor app guidelines.

Guidelines for Contributions

For the most part, we use GitHub flow to get our work done. Our BDFL and primary developer is @mlissner. For better and/or for worse, he doesn't care too much about git, provided things get done smoothly and his life is fairly easy. What that means, generally, is:

  1. Commits should represent a unit of work. In other words, if you're working on a big feature, each commit should be a discrete step along the path of getting that feature ready to land. Bad or experimental work shouldn't be in a commit that you submit as part of a PR, if you can avoid it. Often you can clean up your commits with an interactive rebase followed by a force push to your branch.

  2. After a code review has begun, do not force push. If you do, it makes it difficult for the next review to identify your latest changes. It's better to have a messy commit log during review than to force your poor reviewer to look at the entire PR again just to see the one or two lines you changed. After the review is completed, a force push might be used to clean things up, but a squash merge might do it too, and those are easy.

  3. Your commit messages should use the format defined by the Angular.js project. This is pretty easy if you use this plugin for Intellij/PyCharm/et al.

    Your commits should therefore look something like:

     fix(component): Makes the whatsit do the thingsit
    
     Some longer explanation might go here, explaining why the whatsit is
     broken and how you fixed it.
    
     Fixes: #xyz
    

    We welcome LONG commit messages that could literally double as blog posts. If somebody is looking at the commit in five years, they want an essay, not a tweet.

  4. PRs that do anything visual (email templates, HTML pages, etc.) should include a comment with a screenshot or gif of the visual changes.

  5. KEEP YOUR PRs SMALL. A good PR should land a specific thing of some sort. It doesn't have to be done (it doesn't even have to work!), but it should be clean, and it should be your best effort at clean progress. PRs are both a way of getting your work into the system and a way to communicate your work. The latter is more important. 10 small, clean PRs are about 50× better than a monolithic one that is fully functional.

    Say you are developing a system that relies on regexes to do something. Why not submit the regexes (and their tests!) in one PR and the thing that uses those regexes in another? That'd be much easier to review than trying to see the whole thing at once.

  6. We use a number of linters to make our code better. Some of these are enforced by GitHub Actions, and others are not. The easiest way to do your work is to use pre-commit.

    You can run pre-commit with:

     docker exec -it cl-django pre-commit run --all-files
    

    You might also want to install it locally. If you do that, you can make it run automatically every time you do a commit, as a "pre-commit hook" (hence the name).

    However you run it, it will run a bunch of linters, including black, isort, and flynt. Unfortunately, pre-commit doesn't work well with mypy, but mypy will need to pass too. More on that below.

  7. We use the black code formatter to make sure all our Python code has the same formatting. This is an automated tool that you must run on any code you write before you push it to GitHub. When you run it, it will reformat your code. We recommend integrating it into your editor.

    Beyond what black will do for you by default, if you somehow find a way to make whitespace or other formatting changes, do so in their own commit and ideally in their own PR. When whitespace changes are combined with other code changes, the PRs become impossible to read and risky to merge. This is a big reason we use black.

  8. We are beginning to use mypy to check the type hints in our Python code. New code should include hints, and updates to old code should add hints to the old code. The idea is for our hints to gradually get better and more complete. Our GitHub Action for mypy is in lint.yml, and should be updated to run against any areas that have hints. This just takes a second once mypy is working properly on a file or module.

  9. We use isort to sort our imports. If your imports aren't sorted properly, isort will tell you so when you push your code to GitHub. Again, we recommend getting isort integrated into your editor or workflow.

  10. We have an editorconfig, an eslint configuration, and a black configuration. Please use them.

  11. We do not yet have a Code of Conduct, but we do have our employee manual, and we expect all our employees and volunteers to abide by it.
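
The commit message format from item 3 can be checked mechanically. Below is a minimal sketch of such a check. It is not a tool the project ships; the function name check_commit_message is hypothetical, and the list of allowed types is an assumption based on the Angular convention, so adjust it if the project accepts others.

```python
import re

# Assumed set of commit types, following the Angular convention.
COMMIT_TYPES = ("build", "ci", "docs", "feat", "fix", "perf", "refactor", "test")

# Header shape: type(scope): subject, where "(scope)" is optional.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    problems: list[str] = []
    lines = message.splitlines()
    if not lines or not HEADER_RE.match(lines[0]):
        problems.append("header must look like 'type(scope): subject'")
    if len(lines) > 1 and lines[1].strip():
        problems.append("leave a blank line between the header and the body")
    return problems
```

For example, the sample message from item 3 passes with no problems, while a bare "Fixed stuff" is flagged for its header.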

Some of these guidelines are a little sloppy compared with many projects. Those projects have greater quality needs, are popular enough to demand a high bar, and can envision coding techniques as a part of their overall goal. We don't have to lead the industry with our approach; we just need to get good work done. That's the goal here.

Special notes for special types of code

  1. If your PR includes a migration of the DB, we need SQL files for those migrations. See the migrations page for details.

CI/CD

We use GitHub Actions to run the full test and linting suite on every push. If the tests fail or your code is not formatted properly according to our linters, your code probably won't get merged.

When code is merged into main, we also automatically build and push new docker images, then deploy that code to our Kubernetes cluster. If the code has a database migration, it is not deployed without manual review and a re-run of the deploy step after the migration is applied by hand.

Production Monitoring

ElasticSearch/Kibana

If you are working with our ElasticSearch cluster, we can provide access to our Kibana deployment. We can create a user for you with the permissions needed to run queries, monitor the indices, and so on. You will need to give us the public IP address from which you'll be connecting to Kibana.

We can also provide direct access to our ElasticSearch cluster in case you need to use a tool or run scripts against it directly.

Grafana

We can grant you access to our internal Grafana application, which lets you view logs and metrics (from both our Prometheus instance and AWS CloudWatch) and run queries against our replica database.

Logs

We use Grafana Loki to centralize the logs from our Kubernetes cluster. All logs are available through Grafana by going to the explore utility and selecting Loki as the datasource.

More sophisticated queries can be created using the LogQL language: https://grafana.com/docs/loki/latest/query/

Metrics

CloudWatch Metrics

From Grafana you can query metrics available on AWS CloudWatch using the explore utility. You have access to all the AWS resources we currently have that feed metrics into CloudWatch.

Prometheus Metrics

Likewise, selecting Prometheus as the datasource will give you access to all the metrics from the k8s resources. See the metrics documentation for details.

PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/

SQL Queries

Through Grafana, you can run PostgreSQL queries directly against one of our replicas. You can use the Builder UI, which lets you create your query with drop-down selectors.

Alternatively, you can switch to the code view, which lets you enter your query in a text editor.