InfluxDB 1.7 uses way more memory and disk i/o than 1.6.4 #10468
I've spent a lot more time trying to understand what's happening here. I've also downgraded back to 1.6.4 and the same thing happens - so I'm suspecting that after the upgrade, maybe one of the shards got corrupted in some way. What seems to be happening is, when InfluxDB starts up, everything is fine. At some point after this, the read I/O load suddenly gets maxed out and never seems to stop until InfluxDB is restarted. I'm attaching here a profile I took while it was in this state (this is 1.6.4). |
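For anyone else trying to capture the same kind of evidence, here is a minimal sketch of pulling heap, goroutine, and CPU profiles from a running node over the HTTP API. It assumes the default bind address localhost:8086 and that pprof-enabled has not been turned off in the config.

```python
# Sketch: download Go pprof profiles from a running InfluxDB 1.x node so they
# can be attached to an issue. Assumes the HTTP API on localhost:8086 with
# pprof enabled (the default); adjust HOST for your setup.
import urllib.request

HOST = "http://localhost:8086"

PROFILES = {
    "heap.pb.gz": "/debug/pprof/heap",                   # in-use memory
    "goroutines.txt": "/debug/pprof/goroutine?debug=1",  # what the process is doing
    "cpu.pb.gz": "/debug/pprof/profile?seconds=30",      # 30-second CPU profile
}

for filename, path in PROFILES.items():
    with urllib.request.urlopen(HOST + path) as resp, open(filename, "wb") as out:
        out.write(resp.read())
    print(f"wrote {filename}")
```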
@scotloach @cxhuawei @nktl thanks for your reports. We will prioritise an investigation into this issue today. In the meantime, if you have any more information on your setups, that would be very helpful. For example:
@scotloach thanks for your 1.6.4 profiles. When you downgraded back to 1.6.4, would you describe the system behaviour as being exactly the same as 1.7.0 (same heap usage, CPU and IO), or would you say there were differences in the performance characteristics? I will look at your profiles this morning. |
The following comment was wrong; I analyzed the profile using the wrong binary, corrected version here. Just for context, I analyzed a bit the heap profile that @scotloach provided; here are the top heap-consuming calls.
Attached the profile image too.
|
Could you please also provide some data from the output of the following:
You don't need to provide all the output (there could be a lot); we're just interested in seeing whether the min time and max time are correct in the TSM index data. |
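As an illustration only (the exact command asked for above did not survive the formatting), one way to eyeball the min and max times recorded in each TSM file's index is to run influx_inspect dumptsm -index over every file in a shard directory; the shard path in this sketch is a placeholder:

```python
# Sketch: dump the index section of every TSM file in a shard directory so the
# recorded min/max timestamps can be checked. Assumes influx_inspect is on PATH;
# the shard path below is a placeholder for your own layout.
import pathlib
import subprocess

SHARD_DIR = pathlib.Path("/var/lib/influxdb/data/mydb/autogen/1001")  # placeholder

for tsm in sorted(SHARD_DIR.glob("*.tsm")):
    print(f"==== {tsm} ====")
    # -index prints the index entries, including the min/max time per block
    subprocess.run(["influx_inspect", "dumptsm", "-index", str(tsm)], check=True)
```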
@fntlnz I'm not sure you're using the right binary. When I look at the heap profile I see all the heap being used by the decoders for a compaction:
This issue looks very similar to the one we recently fixed, which is somewhat confusing. |
Thanks! Regarding your questions:
Around 10000
There are roughly 100 databases, the 'worst' one has a cardinality of around 40000 (one way to check this per database is sketched after this comment)
Getting 404 on this influx endpoint for some reason |
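For reference, a minimal sketch of pulling per-database series cardinality through the HTTP /query endpoint; it assumes InfluxDB 1.4 or later (which added SHOW SERIES CARDINALITY), listening on localhost:8086 with auth disabled:

```python
# Sketch: print the estimated series cardinality of every database via the
# HTTP /query endpoint. Assumes InfluxDB >= 1.4 on localhost:8086, no auth.
import json
import urllib.parse
import urllib.request

HOST = "http://localhost:8086"

def influxql(q, db=None):
    params = {"q": q}
    if db:
        params["db"] = db
    with urllib.request.urlopen(f"{HOST}/query?{urllib.parse.urlencode(params)}") as resp:
        return json.load(resp)

databases = [row[0] for row in
             influxql("SHOW DATABASES")["results"][0]["series"][0]["values"]]

for db in databases:
    result = influxql("SHOW SERIES CARDINALITY", db=db)
    # print the raw series payload; the exact shape varies a little by version
    print(db, json.dumps(result["results"][0].get("series", [])))
```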
@nktl thanks. Do you have |
Yes, that was it! |
Sorry for the confusion, uploading the profile output analyzed with the right binary from the downloads page. And here are the top calls.
|
@nktl ok thanks. In the meantime, if you can identify a shard that you think is being continually compacted do you think you could send it to us? We have a secure SFTP site for uploading it. Feel free to email me |
For my system, I'm using tsi, about 6K shards, 70 databases. Largest DB has about 8M series. I use 7 day shards for the most part so they all rolled over last evening, and it's been stable and looking good on the 1.6.4 release since then. I'm happy to help you out with your investigation if I can, but I'm not going to put 1.7 back on it anytime soon. |
@e-dard, the following line using
|
Another thing I noticed: the current _internal.monitor shard was growing very fast when this was happening; it churned through about 250GB of disk before I noticed and deleted that shard. I'm not sure what it was, and I needed to clean the space up. |
That is a really great lead, @scotloach. Perhaps the behavior of the |
My instinct is more that something was happening (compaction I guess?) and got into some internal loop; this process was continually reading from disk and, as a side effect, was updating some state in the monitor db. |
@stuartcarnie The profiles are taken from a |
My current theory based on this ticket and #10465 is that there is a compaction bug somewhere, but it's not caused by a problem with encoding bad TSM data. The symptom seems to be that after upgrading to 1.7.0 a full compaction goes on indefinitely. Profiles from this ticket and #10465 indicate a significant amount of time decoding blocks and checking for overlaps, suggesting the routines are being called many times.
I'm awaiting logs from #10465 to see if there are any clues there. |
To clarify, downgrading to 1.6.4 and restarting did not make the issue go away. The issue seems to have gone away now that I'm writing to new shards created by 1.6.4. It's only been about 12 hours since I restarted it post shard-rollover, so it's possible it will happen again, I'm keeping a close eye on it. |
Ahh right profile is 1.6.4, my mistake @e-dard |
@scotloach sorry yeah, I think the shard rolling over has something to do with it too. |
@scotloach You should be able to tell by checking if the most recent cold shard (the shard that was problematic) is now completely cold. An |
I'm not actually sure what the problematic shard was. |
@e-dard Hi, regarding your questions: index-version = "tsi1", and profiles. I use InfluxDB as Prometheus's storage. The write IOPS are normal but the read IOPS surge periodically. When I query 30 days of data, InfluxDB has a high probability of falling into a high read load that I can only recover from by restarting it. Thanks. : ) |
I'm still running into problems with influxdb 1.6.4. I think I can identify problematic files now, if you still need some I can probably provide them. |
@e-dard Thank you. When will the 1.7.1 docker image be available on Docker Hub? |
@brava-vinh we have submitted a PR for the docker hub image to be updated |
Running 1.7.1 since ~Nov 16 02:00:00 UTC 2018 (from the official docker image) but still having .tsm.tmp files that seem stuck:
Exact same result at The shard sizes are The container was manually restarted at 2018-11-16T09:36:13.006788691Z to set max-concurrent-compactions to 4. It has been running since then. I don't really know if the files behave as expected or not. However InfluxDB is heavily using all 16 CPU cores. Could it be that this is a corrupt shard from 1.7.0 that is not being fixed by 1.7.1? My history has been:
I did a and got the results:
Is there anything else I can provide here or on e-mail to help sort this issue? |
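As an aside for anyone chasing the stuck .tsm.tmp symptom described above, here is a small sketch that lists temporary TSM files which have not been modified recently; the data directory and the one-hour threshold are placeholders, not values from this thread:

```python
# Sketch: list .tsm.tmp files that have not been written to for a while, one
# hint that a compaction may be stuck. The data directory and the threshold
# are placeholders; adjust them for your installation.
import pathlib
import time

DATA_DIR = pathlib.Path("/var/lib/influxdb/data")  # placeholder
STALE_AFTER = 60 * 60  # seconds of no modification before we flag the file

now = time.time()
for tmp in DATA_DIR.rglob("*.tsm.tmp"):
    stat = tmp.stat()
    idle = now - stat.st_mtime
    if idle > STALE_AFTER:
        print(f"{tmp}  size={stat.st_size}  idle={idle / 60:.0f}m")
```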
I see the same thing as @hallvar. I ended up just removing all the DBs since they were new and I was doing double writes from 1.6.4, and it's been good since then. |
Indeed from the log of InfluxDB I can see an entry for starting the compaction process of ./asd/full_resolution/1001/, but never an entry for it completing. |
@hallvar few questions:
|
@hallvar also, do you see a log entry for a compaction for |
Some more details about full_resolution/1001
The tsm.tmp file has indeed stopped growing in size:
I can upload this shard somewhere for you to reach it. I'll send the link to your gmail address once it is up. Unable to say if the influxdb log had a completion entry for |
|
Are you sure that you're running 1.7.1? Docker hub shows updates "an hour ago". |
@conet have you had any issues since upgrading to 1.7.1? |
@stuartcarnie no, but I just read the backlog and another difference besides packaging is that I haven't switched to |
@conet glad to hear you are still stable. The index type was unrelated to this issue. |
@hallvar Are you sure that the 4 stuck compaction processes account for the used CPU? What I noticed is that a single compaction process would only consume one core. Could you run this query on the internal database? (A related check is sketched after this comment.)
How about queries hitting the DB, aren't they causing the load?
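The actual query above did not survive the formatting, but purely as an illustration: the _internal monitoring database keeps per-engine compaction counters in a tsm1_engine measurement, so one way to see what compaction activity the engine is reporting is something like the sketch below. The measurement and field names come from the standard _internal schema and may differ between versions; localhost:8086 and disabled auth are assumptions.

```python
# Sketch (illustration only, not the exact query from this thread): list the
# compaction-related fields that the _internal tsm1_engine measurement exposes
# and dump their latest values per series. Assumes default monitoring is
# enabled and the API is on localhost:8086 with no auth.
import json
import urllib.parse
import urllib.request

HOST = "http://localhost:8086"

def influxql(q):
    url = HOST + "/query?" + urllib.parse.urlencode({"db": "_internal", "q": q})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

fields = influxql('SHOW FIELD KEYS FROM "tsm1_engine"')
names = [v[0] for v in fields["results"][0]["series"][0]["values"]]
compaction_fields = [n for n in names if "compaction" in n.lower()]
if not compaction_fields:
    raise SystemExit("no compaction-related fields found on tsm1_engine")

selected = ", ".join(f'LAST("{n}") AS "{n}"' for n in compaction_fields)
latest = influxql(f'SELECT {selected} FROM "tsm1_engine" '
                  f'WHERE time > now() - 5m GROUP BY *')
print(json.dumps(latest, indent=2))
```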
@stuartcarnie I was trying to spot differences and that was the only thing I noticed, anyway I don't want to create more noise around this issue 😄 |
No worries @conet, appreciate your input! |
@stuartcarnie Very interesting, I am running docker image influxdb:1.7.1 but the startup log does indeed say something else:
After following the docker image PR very closely, I very eagerly pulled it about 15 minutes after the tag first appeared on Docker Hub. Go figure, something must have been off with the initially uploaded docker image.. When pulling the image tag again just now I get an updated newer image. After restarting influx using the new image I finally see the full_resolution/1001 shard compaction complete successfully. Sorry for bringing in the extra confusion, got completely side tracked by the initially incorrectly tagged docker image. @stuartcarnie Excellent work and nicely fixed. Thanks for making me double check the version! @conet You are right, that was probably query load I was seeing. |
Ah, yes, been some long days.. thanks @conet! Finally everything is back to normal :-) |
@v9n yes, that makes me certain the initially uploaded docker image tagged 1.7.1 was in fact 1.7.0. The pull request to docker official images seems correct: docker-library/official-images/pull/5076. Anyway that is a separate issue that can be opened there I guess. |
Linking influxdata/influxdata-docker#279 for visibility. The official docker images likely need to be rebuilt for both normal and alpine versions. |
Nevermind, I didn't notice that the image was just updated. |
I upgraded to v1.7.1 and deleted all the data, so far everything is normal. Thank you. @e-dard |
I'm currently on 1.7.7, and I'm facing a huge memory issue. (I was facing a similar issue with 1.7.6 also.)
Series cardinality = 1. I'm not sure what is causing the huge memory increase. TIA |
@prabhumarappan please open a new issue. |
you may have a look at #13318 for 1.7.7 |
Thanks @mqu! |
System info:
InfluxDB 1.7, upgraded from 1.6.4. Running on the standard Docker image.
Steps to reproduce:
I upgraded a large InfluxDB server to InfluxDB 1.7. Nothing else changed. We are running two InfluxDB servers of a similar size, the other one was left at 1.6.4.
This ran fine for about a day, then it started running into our memory limit and continually OOMing.
We upped the memory and restarted. It ran fine for about 4 hours, then started using very high disk I/O (read), which caused our stats-writing processes to back off.
Please see the timeline on the heap metrics below:
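For anyone wanting to reproduce a similar heap timeline from their own node rather than from a screenshot, here is a minimal sketch assuming the default _internal monitoring database, a standard runtime measurement carrying a HeapAlloc field (names can vary slightly between versions), and the API on localhost:8086 with no auth:

```python
# Sketch: pull a heap-usage timeline from the _internal monitoring database.
# Assumes default monitoring is enabled, the API is on localhost:8086 with no
# auth, and a "runtime" measurement carrying a HeapAlloc field.
import json
import urllib.parse
import urllib.request

HOST = "http://localhost:8086"
QUERY = ('SELECT MAX("HeapAlloc") FROM "runtime" '
         'WHERE time > now() - 24h GROUP BY time(10m)')

url = HOST + "/query?" + urllib.parse.urlencode({"db": "_internal", "q": QUERY})
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for ts, heap in data["results"][0]["series"][0]["values"]:
    if heap is not None:
        print(f"{ts}  {heap / 1024**3:.2f} GiB")
```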