Memory issues (leak?) investigation #5817
We're seeing a persistent memory issue since one week ago on Saturday, and I'm compiling information about it here to investigate. Wondering if it's related to this controller method for the dashboard:
https://www.skylight.io/app/applications/GZDPChmcfm1Q/1559320320/1d/endpoints/HomeController%23dashboard?responseType=html
Noting @icarito's comment:
And this graph: [graph screenshot]
We're seeing a lot of SMTP test errors too: [screenshot]
Yes, load is very high too. From the [...]: [screenshot]
The log is filled with cycles of these, no error: [log excerpt]
Looks like mailman is crashing and being immediately respawned! I've decided to stop this container for tonight in order to monitor the effect on performance.
I think we may also look at what gem updates were merged in the days leading up to this code publication. Thanks!
That's so weird about mailman. I will look at the config, but I don't remember any changes to the rate.
Oh, you know what? We set it to retry 3 times. Maybe these are overlapping now? At the very least it could have increased the rate of attempts, since it retries 3 times for every scheduled run. (Line 32 in faf66c0)
OK, modified it to 20 seconds, which should mean at most one attempt every 5 seconds -- that'll be the same rate as before we added retries.
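For the arithmetic above: with a 20-second run interval and up to 3 retries per run, the worst case is 4 attempts per cycle, i.e. one attempt every ~5 seconds. A minimal sketch of that poll-plus-bounded-retry pattern (hypothetical names; the actual script is at the commit referenced above):

```ruby
# Hypothetical sketch of the poll-plus-bounded-retry pattern discussed
# above; not the actual plots2 mailman script (see faf66c0 for that).
POLL_INTERVAL = 20 # seconds between scheduled runs
MAX_RETRIES   = 3  # retries per run, so at most 4 attempts per cycle

# stand-in for the real mail-fetching call
def check_mail
  # fetch and process new messages here
end

def run_with_retries
  attempts = 0
  begin
    check_mail
  rescue StandardError => e
    attempts += 1
    retry if attempts <= MAX_RETRIES
    warn "giving up after #{attempts} retries: #{e.message}"
  end
end

loop do
  run_with_retries
  sleep POLL_INTERVAL
end
```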
OK, now working on analysis after a few hours: https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559574420/6h/endpoints
Overall it looks good. But on closer look, it's ramping up in load time: [screenshot]
Comparing the latter portion, where it's starting to go back up: [screenshot]
...to the earlier portion just after the reboot: [screenshot]
And then to this, from a couple of weeks ago, before all our trouble: [screenshot]
Then finally, just after we started seeing issues on the 22nd-23rd of May: [screenshot]
Overall it's not conclusive. Resources: [screenshot]
One of the tough things about this is that it's right around where these two commits happened:
I'd like to think it relates to the addition of the [...]. This could mean that a) something else is driving it, or b) the "rescue/retry" cycle itself could be causing memory leak buildup?
Shall I comment out the rescue/retry code entirely? Maybe the hanging while waiting for MySQL to pick up is actually taking up threads? I'll try this. The site is almost unresponsive. I removed the [...]. Deploying... it'll take a while.
Hmm, it really doesn't seem solved... https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559577660/8h13m/endpoints
OK, I wonder if the container setup affected the mailman container at all? Because at this point we've reverted all the likely stuff from the mailman script.
The stats range calls are taking up to 40+ seconds!
@icarito could there be like an issue on the read/write IO, or something on cache generation? I'm just not sure why it would take this long to pack all the data into the cache.
Leaky gems -- check off if we're OK: [...]
Non-leaky, but memory issues in any case: [...]
I'm still seeing this massive cache generation time. For reference: https://guides.rubyonrails.org/v3.2/caching_with_rails.html#activesupport-cache-memorystore
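For context, the MemoryStore described in that guide section is configured roughly like this (a sketch; whether the app actually uses `:memory_store`, and with what size, is an assumption here):

```ruby
# config/environments/production.rb -- sketch only; the real setting
# may differ. MemoryStore lives inside the Rails process itself, so
# each app worker holds its own copy of every cached entry, and the
# cache counts against the process's memory footprint. A very large
# cached stats payload therefore shows up as app memory growth rather
# than as load on an external cache like memcached or Redis.
Rails.application.configure do
  config.cache_store = :memory_store, { size: 64.megabytes }
end
```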
Also looking at https://www.skylight.io/support/performance-tips
I'll look at the email configuration now, but if it doesn't yield anything I'll merge this, turning off the [...]
OK, our next step for #5841 is to develop a monitoring strategy for when mailman goes down.
Deploying with the new email credentials, AND the [...]
Latest error:
That's here: Line 265 in e62bb49
Fixed that! c02b560
Ugh, finally really publishing the comment.rb fix....
Doing an operation on the live production database: [...]
Tested https://publiclab.org - session was retained!
Mitigation done! Hopefully this will free us!
I'll leave it for tonight, the site looks speedy to me... 😝 hopefully this is it!
Nooooooooooooo! Well, there's only one other explanation and that's ghosts. I'll open up another issue and look into finding an exorcist or ghostbusters gem.
I think actually there's been improvement on I/O use, because using a 30GB table is heavy - if you look closely, the peaks seem related to StatsController... maybe we could do the stats work on staging? I can make it copy the production database regularly, say weekly?
Hey @icarito, I was wondering if you could answer some "educational" questions for me:
Why would this be? Due to the caching? I can only think of three people who would be using it, and I'm one of them, and I haven't been.
I've been hearing... er... seeing you use the word "staging" a lot lately. What is that, and how does it play into the site/workflow? If it's a part of the docs, let me know which one and I'll take a crack at understanding it first.
I think that'd be good. It's not so much that the freshest data are important, but between the Q&A system being changed and the recent tags migration, I suppose weekly is a good idea since it will catch any structural changes as they come in. @cesswairimu, what do you think?
This was a really awesome thread to read. Yeah, it's a great idea having the stats in staging, and copying weekly is fine too 👍
Hey @icarito, can we increase the RAM of the server? Maybe that'll help in speeding up the website until we improve our query response rate? Thanks!
Thanks for your replies! I am thankful for the work that you are doing, and for replying to this issue and reading through our efforts! I don't want to sound accusing or anything! I'm just looking at the data and trying to improve our site's reliability. Regarding staging and production, currently we have three instances: [...] You are right that documentation-wise we should do a better job describing this process. Currently I found some docs here, https://github.com/publiclab/plots2/blob/master/doc/TESTING.md#testing-branches, but it's not clear at all that these branches build when we push to those branches. The database is currently updated manually every so often, but it should be simple to automate it now that we have daily database dumps. I will set it up and ping you! This doesn't mean we shouldn't implement more solutions; next, I think a threaded webserver (Puma) could help!
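A rough sketch of what automating that copy could look like as a Rake task (the dump path, database name, and credential handling here are purely illustrative assumptions, not the actual setup):

```ruby
# lib/tasks/staging_sync.rake -- illustrative sketch only; the real
# dump location, database name, and credentials will differ.
namespace :staging do
  desc "Load the most recent production database dump into the staging DB"
  task :sync_db do
    # pick the newest of the daily dumps (hypothetical path)
    dump = Dir.glob("/var/backups/plots2/*.sql.gz").max_by { |f| File.mtime(f) }
    abort "No database dump found" unless dump
    # stream the dump straight into the staging database
    # (assumes MySQL credentials are provided via ~/.my.cnf)
    sh "gunzip -c #{dump} | mysql plots_staging"
  end
end
```

Run weekly (from cron or whatever scheduler is handy), this would keep staging's data no more than a week behind production.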
That is a good question! We are in the process of moving our hosting to a new provider, and we were hoping to deploy as a container cluster in the new hosting provider. Since running in containers isn't immediately trivial (because our app container isn't immutable), an alternative to start is that we could move the database first to make room. I don't think we should increase our hosting usage in our current host, as we are barely within our allowed quota, but @jywarren can confirm? Thanks for your work!
Actually, I wonder if we could temporarily boost our RAM in that container until we do the move, and if it would help short term. I think we'd be OK with that cost increasing!
Oh, @icarito, no, no, I didn't sense any accusation, not at all. I read, "this is what's happening" and I was just saying "that's odd, why would it be doing that if no one was on it...?" Along the same lines, I didn't mean to imply the documentation was poor, only that you didn't have to explain it if there was any. And hey, it's not an entirely unfounded accusation : ) although I am having a bit of fun pretending that I've been framed and I've gone underground and have to prove my innocence, but that's a whole other screenplay that I'm working on. Thankfully these lurid and baseless accusations ; ) on both our parts have been cleared up and we can get back to the business at hand. Related question: why would the stats controller be active if no one was using it, or is that the mystery? Regarding the staging, thanks for the explanation. To make sure I've got it, is saying "I'll try this in stable staging instance" interchangeable with saying "I'll try this on stable.publiclab.org"?
To the stable.publiclab.org Q -- yes! And that's built off of any push to the `master` branch - hope that helps!
@jywarren, yup! Got it now. Thank you!
Thanks for the clarification @skilfullycurled! A few moments ago we had another peak that knocked us down for a few minutes: the trigger in this case was actually the Full Text Search. I think this may be significantly affecting our baseline performance. If this usage is not known, then perhaps crawlers are hitting these endpoints? Maybe a robots.txt or some access control would fix it? @jywarren thanks for the clarification, I'll look into doing it ASAP then.
Shall we robots.txt all stats routes? So /stats* basically?
OK, I did, and also exempted /api/* - we had already blocked /stats/range*, but now it's all /stats*:
aa93dc3
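The resulting rules would look roughly like this (a sketch reconstructed from the description above, reading "exempted" as also excluded from crawling; commit aa93dc3 has the actual file):

```
# public/robots.txt (sketch only; see aa93dc3 for the real rules)
User-agent: *
Disallow: /stats
Disallow: /api/
```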
So you don't think it's the caching?
The cache is use-generated, that is, it generates when a) it's expired, AND b) a new request comes in. So something has to be requesting it for the cache to generate... If I can resolve a couple of unrelated issues and merge their PRs, I'll start a new publication to production tonight (otherwise tomorrow) and we can see if the robots.txt helps at all?
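That lazy, request-driven regeneration is essentially the Rails.cache.fetch pattern. A minimal sketch (the key and the query are hypothetical, not the actual StatsController code):

```ruby
# Hypothetical sketch of request-driven cache regeneration; not the
# actual StatsController code. The block only runs on a cache miss,
# so nothing is recomputed when an entry expires -- the first request
# that arrives after expiry pays the full regeneration cost.
def cached_range_stats(start_time, end_time)
  Rails.cache.fetch("stats/range/#{start_time.to_i}/#{end_time.to_i}", expires_in: 1.day) do
    # stand-in for the expensive aggregation the stats pages do
    Node.where(created: start_time.to_i..end_time.to_i).count
  end
end
```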
via @icarito - so on tonight's update we can see if the robots.txt changes help this.
Hey @jywarren, I saw that the robots.txt update commit was pushed to stable some days ago. Any improvement you noticed?
Yes, would love an update! Not sure I grabbed the correct data, but here are some images from Skylight of before the commit, after the commit, and the last ~24 hours. The red line indicates when the commit was made. On the surface it looks like the answer is yes, but it may not be significant, or I might be interpreting the data incorrectly.
Yes, I think a full analysis would be great. But the short answer is that we've almost halved our average problem response time for all site requests, from 5.5+ to 3 or less. It's really a huge improvement. It was a combination of a) almost doubling RAM from 8 to 15GB, b) blocking a marketing bot in robots.txt, and c) blocking it in the nginx configs as well (I think by IP address range). The tough part is knowing how much the bot/stats_controller was part of it, because we didn't want to hold back the overall site upgrade. The timing was:
1. robots.txt at about 5-6pm ET, I think
2. nginx block hours later, after we weren't sure how quickly robots.txt was read or respected
3. ~7am ET site memory expansion on Saturday
In any case, we're doing really well now. Load average is <4 instead of ~8, and we have 6 instead of 4 CPUs.
Closing this now!