Net Scan: 72TB BAB&anOptions, 210 TB app_vars #4368

Closed
jspenguin2017 opened this issue Dec 22, 2018 · 73 comments

Comments

@jspenguin2017
Contributor

jspenguin2017 commented Dec 22, 2018

I'm not sure why, but I can't manage to keep 10 worker children alive, so I'm downscaling my cluster to 4 worker servers (8 children). So far I have finished processing 72 TB out of 210 TB of data.

In order to prevent maintainers from burning out, I will now add a daily cap on how many links I post:

BlockAdBlock  500
app_vars      10
Other         100

In addition, I will not post more than 500 links combined per day.
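
The caps above can be sketched as a small selection routine. This is only an illustration of the stated limits, not the tooling actually used for the scan; the `select_links_for_today` helper and the rule that unlisted categories count as "Other" are assumptions.

```python
from collections import Counter

# Daily caps as stated in the comment above.
CATEGORY_CAPS = {"BlockAdBlock": 500, "app_vars": 10, "Other": 100}
COMBINED_CAP = 500


def select_links_for_today(candidates):
    """Pick links to post today, honoring per-category and combined caps.

    `candidates` is an iterable of (category, url) pairs in posting order.
    Categories not listed in CATEGORY_CAPS are counted as "Other"
    (an assumption; the thread does not say how they are bucketed).
    """
    posted = Counter()
    selected = []
    for category, url in candidates:
        if len(selected) >= COMBINED_CAP:
            break  # combined daily limit reached
        bucket = category if category in CATEGORY_CAPS else "Other"
        if posted[bucket] >= CATEGORY_CAPS[bucket]:
            continue  # this category is full for today, try the next link
        posted[bucket] += 1
        selected.append((category, url))
    return selected
```

Because the combined cap equals the BlockAdBlock cap, any app_vars or Other links posted on a given day reduce the room left for BlockAdBlock links.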


Note that some domains may be NSFW.

@jspenguin2017

This comment has been minimized.

@jspenguin2017 jspenguin2017 changed the title Net Scan: 72TB BAB&app_vars Net Scan: 72TB BAB Dec 22, 2018
@gwarser

This comment has been minimized.

@jspenguin2017

This comment has been minimized.

@jspenguin2017
Contributor Author

jspenguin2017 commented Dec 22, 2018

@gwarser
I thought we were sorting alphabetically?

The Internet is much, much bigger than 1 million websites; it's close to 2 billion. I use Common Crawl, which (I think) crawls the top 40 million websites or so.

Maybe it's time for a generic scriptlet?

@okiehsch
Contributor

okiehsch commented Dec 22, 2018

http://ad1.ink/GUHa3Za is a parked domain.
http://www.binnenland.nl is for sale.

@jspenguin2017
Contributor Author

jspenguin2017 commented Dec 22, 2018

Whitelisted, and whitelisted.

@okiehsch
Contributor

Can you test http://3050.pw/? I see literally nothing.

@jspenguin2017
Contributor Author

Yea, the website is a bit broken, but BAB still works...


@jspenguin2017
Contributor Author

It's up to you. If you think it's not worth fixing, I'll just whitelist it.

@okiehsch
Contributor

Yes, I meant nothing until BAB kicks in 😆.
I will pass and add the rest; I tested approximately every 5th domain.

@jspenguin2017
Contributor Author

jspenguin2017 commented Dec 22, 2018

Alright, whitelisted.

I'm still working on my Chromium as a Service. It refuses to load extensions in headless mode, and I can't find a working tutorial to make EC2 headful...
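
One commonly used workaround for a display-less server is to run a regular (headful) Chromium under a virtual X display via `xvfb-run`. The sketch below only builds such a command line. It assumes `xvfb-run` is installed; the `chromium` binary name, the extension path, and the `headful_chromium_command` helper are hypothetical; and whether this works on a given EC2 image or Chromium build is untested here.

```python
def headful_chromium_command(extension_dir, url, chromium_bin="chromium"):
    """Build an argv list that starts headful Chromium inside a virtual display.

    extension_dir: path to an unpacked extension (hypothetical path).
    chromium_bin:  name of the Chromium executable (varies by distribution).
    """
    return [
        "xvfb-run",
        "--auto-servernum",                      # pick a free X display number
        "--server-args=-screen 0 1280x1024x24",  # virtual screen geometry
        chromium_bin,
        f"--disable-extensions-except={extension_dir}",
        f"--load-extension={extension_dir}",
        "--no-first-run",
        url,
    ]


# Example: pass the resulting list to subprocess.Popen or an equivalent launcher.
command = headful_chromium_command("/opt/extension", "https://example.com/")
```

Building the list separately from launching keeps the arguments easy to inspect and reuse, for example when handing them to Puppeteer's launch options instead of a subprocess.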

okiehsch added a commit that referenced this issue Dec 22, 2018
@jspenguin2017

This comment has been minimized.

okiehsch added a commit that referenced this issue Dec 23, 2018
@jspenguin2017

This comment has been minimized.

@jspenguin2017
Contributor Author

Main type (FAB, anOptions, etc.) is from the scanner output, which may not be accurate.
Sub type (hard / soft) is tested.
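
To illustrate why a scanner-reported main type can be inaccurate, here is a minimal sketch of substring-based detection. The marker strings and the `guess_main_types` helper are assumptions for illustration; the actual scanner's rules are not shown in this thread.

```python
# Hypothetical marker strings per main type; not the real scanner's rule set.
SIGNATURES = {
    "BAB": ("BlockAdBlock",),
    "FAB": ("FuckAdBlock",),
    "anOptions": ("anOptions",),
    "app_vars": ("app_vars",),
}


def guess_main_types(page_source):
    """Return the sorted main types whose marker string occurs in the page source."""
    return sorted(
        name
        for name, markers in SIGNATURES.items()
        if any(marker in page_source for marker in markers)
    )


# A page that merely mentions a marker in an HTML comment is still flagged.
flagged = guess_main_types("<!-- anOptions removed --><script>var x = 1;</script>")
```

Here `flagged` is `['anOptions']` even though no such script runs on the page, which is the kind of mismatch that only testing the site itself can catch.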

@okiehsch
Contributor

We should probably add a warning that some of the listed domains may be NSFW.

> Main type (FAB, anOptions, etc.) is from the scanner output, which may not be accurate.

Yes, I fixed https://www.1x1trainer.net and the site was not using FAB.

@jspenguin2017
Contributor Author

OK, updated opening.

@jspenguin2017 jspenguin2017 changed the title Net Scan: 72TB BAB Net Scan: 72TB BAB&app_vars Dec 23, 2018
@jspenguin2017 jspenguin2017 changed the title Net Scan: 72TB BAB&app_vars Net Scan: 72TB BAB&app_vars&anOptions Dec 23, 2018
@okiehsch
Contributor

https://www.anton-hilft.de is not using anOptions, it uses antiblock.org.
No other false detections in the first "hard batch".

@okiehsch
Contributor

Second "hard batch" domains are all valid.

okiehsch added a commit that referenced this issue Dec 23, 2018
@jspenguin2017
Contributor Author

jspenguin2017 commented Dec 23, 2018

Well yea, I tested those ones.

I'll be testing everything from now on, either manually or semi-automated with Puppeteer.

@dumbusernameidk

dumbusernameidk commented Dec 23, 2018

didn't want to open an issue since it's related to this.

nextdoordolls.com (nsfw) uses anOptions as well. Never got it to trigger but probably should add.

@garry-ut99

This comment was marked as abuse.

@garry-ut99

This comment was marked as abuse.

7 participants