Memory issues (leak?) investigation #5817
We're seeing a persistent memory issue since one week ago on Saturday, and I'm compiling information about it here to investigate. Wondering if it's related to this controller method for the dashboard:
https://www.skylight.io/app/applications/GZDPChmcfm1Q/1559320320/1d/endpoints/HomeController%23dashboard?responseType=html
Noting @icarito's comment:
And this graph: [graph screenshot]
We're seeing a lot of SMTP test errors too: [screenshot]
Yes, load is very high too. From the [...]: [screenshot]
The log is filled with cycles of these, no error: [log excerpt]
Looks like mailman is crashing and being immediately respawned! I've decided to stop this container for tonight in order to monitor the effect on performance.
I think we may also look at what gem updates were merged in the days leading up to this code publication. Thanks!
That's so weird about mailman. I will look at the config, but I don't remember any changes to the rate.
Oh, you know what? We set it to retry 3 times. Maybe these are overlapping now? At the very least it could have increased the rate of attempts, since it retries 3 times for every scheduled run. (Line 32 in faf66c0)
OK, modified it to 20 seconds, which should mean at most one attempt every 5 seconds -- that'll be the same rate as before we added retries.
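For the arithmetic above: with a 20-second run interval and up to 3 retries per run, the worst case is 4 attempts per cycle, i.e. one attempt every ~5 seconds. A minimal sketch of that poll-plus-bounded-retry pattern (hypothetical names; the actual script is at the commit referenced above):

```ruby
# Hypothetical sketch of the poll-plus-bounded-retry pattern discussed
# above; not the actual plots2 mailman script (see faf66c0 for that).
POLL_INTERVAL = 20 # seconds between scheduled runs
MAX_RETRIES   = 3  # retries per run, so at most 4 attempts per cycle

# stand-in for the real mail-fetching call
def check_mail
  # fetch and process new messages here
end

def run_with_retries
  attempts = 0
  begin
    check_mail
  rescue StandardError => e
    attempts += 1
    retry if attempts <= MAX_RETRIES
    warn "giving up after #{attempts} retries: #{e.message}"
  end
end

loop do
  run_with_retries
  sleep POLL_INTERVAL
end
```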
OK, now working on analysis after a few hours: https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559574420/6h/endpoints
Overall it looks good. But on closer look, it's ramping up in load time: [screenshot]
Comparing the latter portion, where it's starting to go back up: [screenshot]
...to the earlier portion just after the reboot: [screenshot]
And then to this, from a couple of weeks ago, before all our trouble: [screenshot]
Then finally, just after we started seeing issues on the 22nd-23rd of May: [screenshot]
Overall it's not conclusive. Resources: [screenshot]
One of the tough things about this is that it's right around where these two commits happened:
I'd like to think it relates to the addition of the [...]. This could mean that a) something else is driving it, or b) the "rescue/retry" cycle itself could be causing memory leak buildup?
Shall I comment out the rescue/retry code entirely? Maybe the hanging while waiting for MySQL to pick up is actually taking up threads? I'll try this. The site is almost unresponsive. I removed the [...]. Deploying... it'll take a while.
Hmm, it really doesn't seem solved... https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559577660/8h13m/endpoints
OK, I wonder if the container setup affected the mailman container at all? Because at this point we've reverted all the likely stuff from the mailman script.
The stats range calls are taking up to 40+ seconds!
@icarito could there be like an issue on the read/write IO, or something on cache generation? I'm just not sure why it would take this long to pack all the data into the cache.
Leaky gems -- check off if we're OK: [...]
Non-leaky, but memory issues in any case: [...]
I'm still seeing this massive cache generation time. For reference: https://guides.rubyonrails.org/v3.2/caching_with_rails.html#activesupport-cache-memorystore
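For context, the MemoryStore described in that guide section is configured roughly like this (a sketch; whether the app actually uses `:memory_store`, and with what size, is an assumption here):

```ruby
# config/environments/production.rb -- sketch only; the real setting
# may differ. MemoryStore lives inside the Rails process itself, so
# each app worker holds its own copy of every cached entry, and the
# cache counts against the process's memory footprint. A very large
# cached stats payload therefore shows up as app memory growth rather
# than as load on an external cache like memcached or Redis.
Rails.application.configure do
  config.cache_store = :memory_store, { size: 64.megabytes }
end
```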
Also looking at https://www.skylight.io/support/performance-tips
I'll look at the email configuration now, but if it doesn't yield anything I'll merge this, turning off the [...]
OK, our next step for #5841 is to develop a monitoring strategy for when mailman goes down.
Deploying with the new email credentials, AND the [...]
Latest error:
That's here: Line 265 in e62bb49
Fixed that! c02b560
Ugh, finally really publishing the comment.rb fix....
Doing an operation on the live production database: [...]
Tested https://publiclab.org - session was retained!
Mitigation done! Hopefully this will free us!
I'll leave it for tonight, the site looks speedy to me... 😝 hopefully this is it!
Nooooooooooooo! Well, there's only one other explanation and that's ghosts. I'll open up another issue and look into finding an exorcist or ghostbusters gem.
I think actually there's been improvement on I/O use, because using a 30GB table is heavy - if you look closely, the peaks seem related to StatsController... maybe we could do the stats work on staging? I can make it copy the production database regularly, say weekly?
Hey @icarito, I was wondering if you could answer some "educational" questions for me:
Why would this be? Due to the caching? I can only think of three people who would be using it, and I'm one of them, and I haven't been.
I've been hearing... er... seeing you use the word "staging" a lot lately. What is that, and how does it play into the site/workflow? If it's a part of the docs, let me know which one and I'll take a crack at understanding it first.
I think that'd be good. It's not so much that the freshest data are important, but between the Q&A system being changed and the recent tags migration, I suppose weekly is a good idea since it will catch any structural changes as they come in. @cesswairimu, what do you think?
This was a really awesome thread to read. Yeah, it's a great idea having the stats in staging, and copying weekly is fine too 👍
Hey @icarito, can we increase the RAM of the server? Maybe that'll help in speeding up the website until we improve our query response rate? Thanks!
Thanks for your replies! I am thankful for the work that you are doing, and for replying to this issue and reading through our efforts! I don't want to sound accusing or anything! I'm just looking at the data and trying to improve our site's reliability. Regarding staging and production, currently we have three instances: [...] You are right that documentation-wise we should do a better job describing this process. Currently I found some docs here, https://github.com/publiclab/plots2/blob/master/doc/TESTING.md#testing-branches, but it's not clear at all that these branches build when we push to those branches. The database is currently updated manually every so often, but it should be simple to automate it now that we have daily database dumps. I will set it up and ping you! This doesn't mean we shouldn't implement more solutions; next, I think a threaded webserver (Puma) could help!
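A rough sketch of what automating that copy could look like as a Rake task (the dump path, database name, and credential handling here are purely illustrative assumptions, not the actual setup):

```ruby
# lib/tasks/staging_sync.rake -- illustrative sketch only; the real
# dump location, database name, and credentials will differ.
namespace :staging do
  desc "Load the most recent production database dump into the staging DB"
  task :sync_db do
    # pick the newest of the daily dumps (hypothetical path)
    dump = Dir.glob("/var/backups/plots2/*.sql.gz").max_by { |f| File.mtime(f) }
    abort "No database dump found" unless dump
    # stream the dump straight into the staging database
    # (assumes MySQL credentials are provided via ~/.my.cnf)
    sh "gunzip -c #{dump} | mysql plots_staging"
  end
end
```

Run weekly (from cron or whatever scheduler is handy), this would keep staging's data no more than a week behind production.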
That is a good question! We are in the process of moving our hosting to a new provider, and we were hoping to deploy as a container cluster in the new hosting provider. Since running in containers isn't immediately trivial (because our app container isn't immutable), an alternative to start is that we could move the database first to make room. I don't think we should increase our hosting usage in our current host, as we are barely within our allowed quota, but @jywarren can confirm? Thanks for your work!
Actually, I wonder if we could temporarily boost our RAM in that container until we do the move, and if it would help short term. I think we'd be OK with that cost increasing!
Oh, @icarito, no, no, I didn't sense any accusation, not at all. I read, "this is what's happening" and I was just saying "that's odd, why would it be doing that if no one was on it...?" Along the same lines, I didn't mean to imply the documentation was poor, only that you didn't have to explain it if there was any. And hey, it's not an entirely unfounded accusation : ) although I am having a bit of fun pretending that I've been framed and I've gone underground and have to prove my innocence, but that's a whole other screenplay that I'm working on. Thankfully these lurid and baseless accusations ; ) on both our parts have been cleared up and we can get back to the business at hand. Related question: why would the stats controller be active if no one was using it, or is that the mystery? Regarding the staging, thanks for the explanation. To make sure I've got it, is saying "I'll try this in stable staging instance" interchangeable with saying "I'll try this on stable.publiclab.org"?
To the stable.publiclab.org Q -- yes! And that's built off of any push to the `master` branch - hope that helps!
@jywarren, yup! Got it now. Thank you!
Thanks for the clarification @skilfullycurled! A few moments ago we had another peak that knocked us down for a few minutes: the trigger in this case was actually the Full Text Search. I think this may be significantly affecting our baseline performance. If this usage is not known, then perhaps crawlers are hitting these endpoints? Maybe a robots.txt or some access control would fix it? @jywarren thanks for the clarification, I'll look into doing it ASAP then.
Shall we robots.txt all stats routes? So /stats* basically?
OK, I did, and also exempted /api/* - we had already blocked /stats/range*, but now it's all /stats*:
aa93dc3
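The resulting rules would look roughly like this (a sketch reconstructed from the description above, reading "exempted" as also excluded from crawling; commit aa93dc3 has the actual file):

```
# public/robots.txt (sketch only; see aa93dc3 for the real rules)
User-agent: *
Disallow: /stats
Disallow: /api/
```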
So you don't think it's the caching?
The cache is use-generated, that is, it generates when a) it's expired, AND b) a new request comes in. So something has to be requesting it for the cache to generate... If I can resolve a couple of unrelated issues and merge their PRs, I'll start a new publication to production tonight (otherwise tomorrow) and we can see if the robots.txt helps at all?
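That lazy, request-driven regeneration is essentially the Rails.cache.fetch pattern. A minimal sketch (the key and the query are hypothetical, not the actual StatsController code):

```ruby
# Hypothetical sketch of request-driven cache regeneration; not the
# actual StatsController code. The block only runs on a cache miss,
# so nothing is recomputed when an entry expires -- the first request
# that arrives after expiry pays the full regeneration cost.
def cached_range_stats(start_time, end_time)
  Rails.cache.fetch("stats/range/#{start_time.to_i}/#{end_time.to_i}", expires_in: 1.day) do
    # stand-in for the expensive aggregation the stats pages do
    Node.where(created: start_time.to_i..end_time.to_i).count
  end
end
```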
via @icarito - so on tonight's update we can see if the robots.txt changes help this.
Hey @jywarren, I saw that the robots.txt update commit was pushed to stable some days ago. Any improvement you noticed?
Yes, would love an update! Not sure I grabbed the correct data, but here are some images from Skylight of before the commit, after the commit, and the last ~24 hours. The red line indicates when the commit was made. On the surface it looks like the answer is yes, but it may not be significant, or I might be interpreting the data incorrectly.
Yes, I think a full analysis would be great. But the short answer is that we've almost halved our average problem response time for all site requests, from 5.5+ to 3 or less. It's really a huge improvement. It was a combination of a) almost doubling RAM from 8 to 15GB, b) blocking a marketing bot in robots.txt, and c) blocking it in the nginx configs as well (I think by IP address range). The tough part is knowing how much the bot/stats_controller was part of it, because we didn't want to hold back the overall site upgrade. The timing was:
1. robots.txt at about 5-6pm ET, I think
2. nginx block hours later, after we weren't sure how quickly robots.txt was read or respected
3. ~7am ET site memory expansion on Saturday
In any case, we're doing really well now. Load average is <4 instead of ~8, and we have 6 instead of 4 CPUs.
Closing this now!