Support running OpenWPM crawls on Windows #503

Open
2 of 8 tasks
motin opened this issue Oct 9, 2019 · 2 comments
Labels
backlog Not a priority, but nice to have feature-request

Comments

@motin
Contributor

motin commented Oct 9, 2019

This is, I hope, a path to supporting Windows. There may be some limitations, but it is a first step.

ToDo:

  • Once "Conda install dependencies" (#648) is merged, three dependencies will not work on Windows:
    • leveldb
    • plyvel
      • I believe we can swap plyvel for python-leveldb with almost no fuss.
    • python-virtualdriver
      • This is for running Xvfb, which won't work on Windows anyway, so we just need to figure out a package management solution / environment.yaml that accommodates both platforms (most likely by making the python-xvfb install a manual step, since installing Xvfb is manual anyway; moving to pip may work around this).
  • Make some tweaks in deploy_firefox so that we no longer build paths manually by concatenating strings.
  • I also suggest tweaking deploy_firefox so that we let geckodriver set the profile path and then read it back. This will help with the goal of restoring stateful crawls and will make the work here easier.
  • Find a replacement for the log interceptor, which uses mkfifo and is therefore Unix-only. This Stack Overflow thread has something that we may be able to drop in as a replacement. Alternatively, I used a different approach in faust-selenium and created something that constantly "tails" geckodriver.log (https://github.com/birdsarah/faust-selenium/blob/master/crawler/geckodriver_log_reader.py). A third option is to save geckodriver.log at the end and not weave it into our logging. @englehardt - what is the motivation for interleaving the geckodriver logs?
    • A first step could be to skip geckodriver logs on the Windows platform. As best I can tell, they are not essential to a crawl.
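
To make the "tail geckodriver.log" option concrete, here is a minimal sketch of a portable log reader that avoids mkfifo. It is not the code from faust-selenium or from OpenWPM's current log interceptor. The names `LogTailer`, `geckodriver_log_path`, `on_line`, and `poll_interval` are hypothetical and chosen only for illustration. It also builds the log path with `pathlib` rather than string concatenation, in the spirit of the deploy_firefox items above.

```python
import threading
import time
from pathlib import Path


def geckodriver_log_path(profile_dir):
    """Build the log path with pathlib instead of string concatenation.

    Hypothetical helper: the real location of geckodriver.log may differ.
    """
    return Path(profile_dir) / "geckodriver.log"


class LogTailer(threading.Thread):
    """Poll a log file and pass each complete new line to a callback.

    Uses only regular file reads, so it does not depend on mkfifo.
    """

    def __init__(self, log_path, on_line, poll_interval=0.1):
        super().__init__(daemon=True)
        self.log_path = Path(log_path)
        self.on_line = on_line
        self.poll_interval = poll_interval
        self._stop_event = threading.Event()

    def stop(self):
        """Ask the thread to finish after draining what is already written."""
        self._stop_event.set()

    def run(self):
        # The log file may not exist until the driver has started.
        while not self.log_path.exists():
            if self._stop_event.wait(self.poll_interval):
                return
        with self.log_path.open("r", encoding="utf-8", errors="replace") as f:
            buffer = ""
            while True:
                chunk = f.readline()
                if chunk:
                    buffer += chunk
                    # Only forward a line once its newline has been written.
                    if buffer.endswith("\n"):
                        self.on_line(buffer.rstrip("\r\n"))
                        buffer = ""
                    continue
                if self._stop_event.is_set():
                    break
                time.sleep(self.poll_interval)
            if buffer:
                # Forward a trailing partial line, if any.
                self.on_line(buffer)
```

A caller would start the thread before launching the browser, for example `LogTailer(geckodriver_log_path(profile_dir), logger.debug).start()`, and call `stop()` followed by `join()` at shutdown. Polling trades a small delay for portability; skipping the tailer entirely on Windows, as suggested in the last bullet, remains the simpler first step.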

Future (open issues):

  • Add CircleCI tests and test on Win, OSX, and Linux (at least once per PR - or once a week).
@birdsarah
Contributor

An alternate version of OpenWPM was created as a proof of concept and has been used to run crawls on Windows. It uses essentially the same OpenWPM instrumentation extension, but replaces the socket with a websocket and uses Kafka to orchestrate the crawl: https://github.com/birdsarah/faust-selenium

@birdsarah
Contributor

birdsarah commented May 15, 2020

Moved to issue ToDos.

@birdsarah birdsarah self-assigned this May 18, 2020
@englehardt englehardt added the backlog Not a priority, but nice to have label Nov 9, 2020
3 participants