
InfluxDB 1.7 uses way more memory and disk i/o than 1.6.4 #10468

Closed
scotloach opened this issue Nov 11, 2018 · 63 comments

@scotloach

System info:
InfluxDB 1.7, upgraded from 1.6.4. Running on the standard Docker image.

Steps to reproduce:

I upgraded a large InfluxDB server to InfluxDB 1.7. Nothing else changed. We are running two InfluxDB servers of a similar size; the other one was left at 1.6.4.

This ran fine for about a day, then it started running into our memory limit and continually OOMing.
We upped the memory and restarted. It ran fine for about 4 hours, then started using very high disk I/O (read), which caused our stats-writing processes to back off.

Please see the timeline on the heap metrics below:

  • you can see relatively stable heap usage before we upgraded at 6am on Nov 9
  • at around 4pm there is a step-up in heap usage
  • around 11:30pm there is another step-up and it starts hitting the limit (OOM, causes a restart)
  • at 12:45pm on the 10th we restart with more RAM
  • around 4 hours later it starts using high i/o and you can see spikes in heap usage

[screenshot: heap usage timeline]

@scotloach
Author

I've spent a lot more time trying to understand what's happening here. I've also downgraded back to 1.6.4 and the same thing happens, so I suspect that the upgrade may have corrupted one of the shards in some way. What seems to happen is: when InfluxDB starts up, everything is fine. At some point after that, the read I/O load suddenly maxes out and never stops until InfluxDB is restarted. I'm attaching a profile I took while it was in this state (this is 1.6.4).
profiles.tar.gz

@chenxu1990

I have encountered the same problem.
The read load continues to be high.

[screenshot: read load graph]

@nktl

nktl commented Nov 12, 2018

Same problem here: a visible increase in memory usage after the upgrade, and query performance dropped dramatically - the platform is borderline unusable at the moment. No changes other than upgrading to 1.7.

[screenshot: memory usage graph]

@e-dard
Contributor

e-dard commented Nov 12, 2018

@scotloach @cxhuawei @nktl thanks for your reports. We will prioritise an investigation into this issue today.

In the meantime, any more information on your setups would be very helpful. For example:

  • What index are you using (TSI / inmem)?
  • How many shards do you have?
  • What's your database cardinality? SHOW SERIES CARDINALITY.
  • If you could provide profiles while your 1.7 system is using lots of heap (the higher the heap use, the better), that would really help us. You can run curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"; the request takes about 30s to complete. (A consolidated sketch of these commands follows below.)
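For anyone gathering these, a minimal sketch of the commands involved (the database name mydb and the default config path are assumptions - adjust for your setup):

# index type is configured in the [data] section
grep index-version /etc/influxdb/influxdb.conf

# shard count and per-database series cardinality via the influx CLI
influx -execute 'SHOW SHARDS'
influx -database mydb -execute 'SHOW SERIES CARDINALITY'

# profiles while heap usage is high; the request takes ~30s
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"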

@scotloach thanks for your 1.6.4 profiles. When you downgraded back to 1.6.4, would you describe the system behaviour as exactly the same as on 1.7.0 (same heap usage, CPU and IO), or would you say there were differences in the performance characteristics?

I will look at your profiles this morning.

@fntlnz
Contributor

fntlnz commented Nov 12, 2018

The following comment was wrong - I analyzed the profile using the wrong binary; corrected version here


Just for context, I did a quick analysis of the heap profile that @scotloach provided; here are the top heap-consuming calls.

Total: 88992.1 MB
 33940.0  38.1%  38.1%  33991.0  38.2% github.com/influxdata/influxdb/prometheus/remote..Unmarshal
 14590.3  16.4%  54.5%  14590.3  16.4% runtime.getitab
  7334.0   8.2%  62.8%   9060.1  10.2% runtime..sysAlloc
  4771.5   5.4%  68.1%   5255.3   5.9% runtime..Error
  2462.0   2.8%  70.9%   2462.0   2.8% github.com/influxdata/influxdb/query..Close
  2262.0   2.5%  73.4%   3940.5   4.4% text/tabwriter..append
  1790.8   2.0%  75.5%   4249.8   4.8% github.com/influxdata/influxdb/query..Next
  1662.8   1.9%  77.3%   1665.8   1.9% github.com/influxdata/influxdb/vendor/collectd.org/network.parseValues
  1508.5   1.7%  79.0%   1567.5   1.8% bytes..String
  1369.5   1.5%  80.6%  13209.1  14.8% type..eq.github.com/influxdata/influxdb/vendor/github.com/BurntSushi/toml.item

Attached the profile image too.

[pprof graph image: profile001]

@e-dard
Contributor

e-dard commented Nov 12, 2018

Could you please also provide some data from the output of the following:

influx_inspect dumptsm -index path/to/tsm/file/in/shard/being/compacted/0000000xy-00000000z.tsm

You don't need to provide all the output (there could be a lot); we're just interested in seeing whether the min time and max time are correct in the TSM index data.
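A hedged sketch of trimming that output down to the summary and the first index entries (the path here is a placeholder, not a real shard):

influx_inspect dumptsm -index /var/lib/influxdb/data/<db>/<rp>/<shard>/0000000xy-00000000z.tsm | head -n 40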

@e-dard
Contributor

e-dard commented Nov 12, 2018

@fntlnz I'm not sure you're using the right binary. When I look at the heap profile I see all the heap being used by the decoders for a compaction:

Showing top 10 nodes out of 130
      flat  flat%   sum%        cum   cum%
   33.57GB 37.40% 37.40%    33.57GB 37.40%  github.com/influxdata/influxdb/tsdb/engine/tsm1.DecodeFloatBlock
   19.43GB 21.65% 59.05%    27.74GB 30.91%  github.com/influxdata/influxdb/tsdb/index/tsi1.(*LogFile).execSeriesEntry
    7.07GB  7.88% 66.92%     7.79GB  8.68%  github.com/influxdata/influxdb/tsdb/index/tsi1.(*logTagValue).addSeriesID
    2.66GB  2.96% 69.89%     2.66GB  2.96%  github.com/influxdata/influxdb/pkg/rhh.(*HashMap).insert
    2.50GB  2.79% 72.68%     6.14GB  6.84%  github.com/influxdata/influxdb/tsdb/index/inmem.(*Index).CreateSeriesListIfNotExists
    2.24GB  2.49% 75.17%     2.24GB  2.49%  github.com/influxdata/influxdb/models.Tags.Clone
    1.92GB  2.14% 77.31%     6.01GB  6.69%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints
    1.63GB  1.82% 79.13%     1.63GB  1.82%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*entry).add
    1.57GB  1.75% 80.88%     3.24GB  3.61%  github.com/influxdata/influxdb/pkg/rhh.(*HashMap).grow
    1.30GB  1.44% 82.32%     5.52GB  6.15%  github.com/influxdata/influxdb/tsdb.(*SeriesIndex).execEntry

This issue looks very similar to the one we recently fixed, which is somewhat confusing.

@nktl

nktl commented Nov 12, 2018

Thanks!

Regarding your questions:

  • What index are you using (TSI / inmem)?
    index-version = "tsi1"

  • How many shards do you have?

Around 10000

  • What's your database cardinality? SHOW SERIES CARDINALITY.

There are roughly 100 databases; the 'worst' one has a cardinality of around 40000

  • If you could provide profiles whilst your 1.7 system is using lots of heap (the higher the heap use the better) that would really help us. You can do curl -o profiles.tar.gz http://localhost:8086/debug/pprof/all?cpu=true It will take 30s for the request to complete.

I'm getting a 404 on this endpoint for some reason

@e-dard
Contributor

e-dard commented Nov 12, 2018

@nktl thanks. Do you have pprof-enabled = true in the http section of your config? It defaults to true, but maybe you disabled it?
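For reference, the relevant setting lives in the [http] section of influxdb.conf; a minimal sketch (true is the default):

[http]
  # the /debug/pprof endpoints are only served when this is enabled
  pprof-enabled = true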

@nktl

nktl commented Nov 12, 2018

@nktl thanks. Do you have pprof-enabled = true in the http section of your config? It defaults to true, but maybe you disabled it?

Yes, that was it!
Unfortunately I cannot restart Influx at the moment to apply this; it will have to wait until a bit later today or tomorrow.

@fntlnz
Contributor

fntlnz commented Nov 12, 2018

Sorry for the confusion - here is the profile output again, this time analysed with the right binary from the downloads page.

[pprof graph image: profile002]

and here are the top calls.

Dropped 549 nodes (cum <= 0.45GB)
Showing top 10 nodes out of 130
      flat  flat%   sum%        cum   cum%
   33.57GB 37.40% 37.40%    33.57GB 37.40%  github.com/influxdata/influxdb/tsdb/engine/tsm1.DecodeFloatBlock
   19.43GB 21.65% 59.05%    27.74GB 30.91%  github.com/influxdata/influxdb/tsdb/index/tsi1.(*LogFile).execSeriesEntry
    7.07GB  7.88% 66.92%     7.79GB  8.68%  github.com/influxdata/influxdb/tsdb/index/tsi1.(*logTagValue).addSeriesID
    2.66GB  2.96% 69.89%     2.66GB  2.96%  github.com/influxdata/influxdb/pkg/rhh.(*HashMap).insert
    2.50GB  2.79% 72.68%     6.14GB  6.84%  github.com/influxdata/influxdb/tsdb/index/inmem.(*Index).CreateSeriesListIfNotExists
    2.24GB  2.49% 75.17%     2.24GB  2.49%  github.com/influxdata/influxdb/models.Tags.Clone
    1.92GB  2.14% 77.31%     6.01GB  6.69%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).WritePoints
    1.63GB  1.82% 79.13%     1.63GB  1.82%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*entry).add
    1.57GB  1.75% 80.88%     3.24GB  3.61%  github.com/influxdata/influxdb/pkg/rhh.(*HashMap).grow
    1.30GB  1.44% 82.32%     5.52GB  6.15%  github.com/influxdata/influxdb/tsdb.(*SeriesIndex).execEntry

@e-dard
Contributor

e-dard commented Nov 12, 2018

@nktl ok, thanks. In the meantime, if you can identify a shard that you think is being continually compacted, do you think you could send it to us? We have a secure SFTP site for uploading it. Feel free to email me: edd at the product name dot com. It may help us get to the bottom of the issue more quickly.

@scotloach
Author

For my system, I'm using tsi, about 6K shards, 70 databases. Largest DB has about 8M series.

I use 7-day shards for the most part, so they all rolled over last evening, and it's been stable and looking good on the 1.6.4 release since then. I'm happy to help you out with your investigation if I can, but I'm not going to put 1.7 back on it anytime soon.

@stuartcarnie
Contributor

@e-dard, I would expect the following line using DecodeFloatBlock to be query load, as compaction should be using the new DecodeFloatArrayBlock API:

33.57GB 37.40% 37.40%    33.57GB 37.40%  github.com/influxdata/influxdb/tsdb/engine/tsm1.DecodeFloatBlock

@scotloach
Author

Another thing I noticed: the current _internal.monitor shard was growing very fast while this was happening - it churned through about 250GB of disk before I noticed and deleted that shard. I'm not sure what it was, and I needed to reclaim the space.

@stuartcarnie
Contributor

That is a really great lead, @scotloach. Perhaps the behavior of the _internal monitoring database has changed.

@scotloach
Author

My instinct is more that something (compaction, I guess?) got into some internal loop; this process was continually reading from disk and, as a side effect, was updating some state in the monitor DB.

@e-dard
Contributor

e-dard commented Nov 12, 2018

@stuartcarnie The profiles are taken from a 1.6.4 release. @scotloach downgraded when they initially saw problems.

@e-dard
Contributor

e-dard commented Nov 12, 2018

My current theory, based on this ticket and #10465, is that there is a compaction bug somewhere, but it's not caused by a problem with encoding bad TSM data.

The symptom seems to be that after upgrading to 1.7.0, a full compaction goes on indefinitely. Profiles from this ticket and #10465 indicate a significant amount of time spent decoding blocks and checking for overlaps, suggesting those routines are being called many times.

I'm awaiting logs from #10465 to see if there are any clues there.

@scotloach
Author

To clarify, downgrading to 1.6.4 and restarting did not make the issue go away. The issue seems to have gone away now that I'm writing to new shards created by 1.6.4. It's only been about 12 hours since I restarted it post shard-rollover, so it's possible it will happen again; I'm keeping a close eye on it.

@stuartcarnie
Contributor

Ahh right, the profile is from 1.6.4 - my mistake, @e-dard

@e-dard
Contributor

e-dard commented Nov 12, 2018

@scotloach sorry, yeah - I think the shard rolling over has something to do with it too.

@e-dard
Contributor

e-dard commented Nov 12, 2018

@scotloach You should be able to tell by checking whether the most recent shard (the one that was problematic) is now completely cold. An ls -lah on that shard directory should provide the information to help determine that.
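A hedged example of that check (the data directory, database, retention policy and shard id are placeholders):

# if the newest .tsm file's mtime is old and no .tsm.tmp files remain, the shard has gone cold
ls -lah /var/lib/influxdb/data/<db>/<rp>/<shard_id>/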

@scotloach
Author

I'm not actually sure what the problematic shard was.

@chenxu1990

@e-dard Hi, regarding your questions:

index-version = "tsi1"

and profiles

profiles.tar.gz

I use InfluxDB as Prometheus's storage. The write IOPS are normal, but the read IOPS surge periodically.

[screenshot: read IOPS graph]

When I query 30 days of data, InfluxDB very often falls into sustained high read load, and the only way to recover is to restart it.

[screenshot: read load during query]

thanks. : )

@scotloach
Author

I'm still running into problems with InfluxDB 1.6.4. I think I can identify problematic files now; if you still need some, I can probably provide them.

@brava-vinh

brava-vinh commented Nov 15, 2018

@e-dard Thank you. When will the 1.7.1 Docker image be available on Docker Hub?

@e-dard
Contributor

e-dard commented Nov 15, 2018

@brava-vinh we have submitted a PR to get the Docker Hub image updated

@hallvar

hallvar commented Nov 16, 2018

I've been running 1.7.1 since ~Nov 16 02:00:00 UTC 2018 (from the official Docker image) but still have .tsm.tmp files that seem stuck:

$ date; sudo find . -name "*.tsm.tmp"|xargs sudo ls -lh 
Fri Nov 16 18:48:01 UTC 2018
-rw-r--r-- 1 root root 523M Nov 16 09:41 ./asd/downsampled/1002/000013067-000000004.tsm.tmp
-rw-r--r-- 1 root root 312M Nov 16 09:41 ./asd/full_resolution/1001/000010126-000000002.tsm.tmp
-rw-r--r-- 1 root root    0 Nov 16 09:41 ./asd/full_resolution_calculations/1005/000008481-000000003.tsm.tmp
-rw-r--r-- 1 root root    0 Nov 16 09:41 ./asd/full_resolution_calculations/1030/000000033-000000003.tsm.tmp

The exact same result at Fri Nov 16 19:18:38 UTC 2018.

The shard sizes are: ./asd/full_resolution/1001/: 41 GB; ./asd/downsampled/1002/: 24 GB; ./asd/full_resolution_calculations/1005/: 1.6 GB; ./asd/full_resolution_calculations/1030/: 929 MB.

The container was manually restarted at 2018-11-16T09:36:13.006788691Z to set max-concurrent-compactions to 4. It has been running since then.
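For context, a sketch of the setting being changed in the [data] section of influxdb.conf (4 mirrors the value applied here; 0, the default, lets InfluxDB choose a limit itself):

[data]
  # cap the number of concurrent full and level compactions
  max-concurrent-compactions = 4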

I don't really know whether these files are behaving as expected or not. However, InfluxDB is heavily using all 16 CPU cores.

[screenshot: CPU usage across all 16 cores]

Could it be that this is a corrupt shard from 1.7.0 that is not being fixed by 1.7.1?

My history has been:

  1. Running 1.6.4.
  2. Stop 1.6.4
  3. Run buildtsi using 1.7.0 client
  4. Run on 1.7.0
  5. Major issues with CPU/mem/io
  6. Downgrade to 1.6.4 (not rebuilding indexes)
  7. Set max_concurrent_compactions to 1
  8. Run for ~1-2 days
  9. Stop 1.6.4
  10. Run buildtsi using 1.7.1 client
  11. Run on 1.7.1 with max_concurrent_compactions=0
  12. Running for a few hours, noticing reduced performance/high CPU/high IO
  13. Restart 1.7.1 with max_concurrent_compactions=4
  14. Still noticing reduced performance
  15. File this report :-)

I did a
curl -o profile6.pprof http://localhost:8086/debug/pprof/profile

and then ran:

go tool pprof -top influxd-1.7.1 profile6.pprof


File: influxd-1.7.1
Type: cpu
Time: Nov 16, 2018 at 8:27pm (CET)
Duration: 30.17s, Total samples = 96.34s (319.31%)
Showing nodes accounting for 87.61s, 90.94% of 96.34s total
Dropped 240 nodes (cum <= 0.48s)
      flat  flat%   sum%        cum   cum%
    11.01s 11.43% 11.43%     11.01s 11.43%  strconv.ParseUint
     5.65s  5.86% 17.29%     18.84s 19.56%  github.com/influxdata/influxdb/tsdb/engine/tsm1.FloatArrayDecodeAll
     5.42s  5.63% 22.92%      5.42s  5.63%  runtime.memclrNoHeapPointers
     4.96s  5.15% 28.07%      6.18s  6.41%  path/filepath.Base
     4.59s  4.76% 32.83%      9.75s 10.12%  runtime.scanobject
     3.61s  3.75% 36.58%     26.44s 27.44%  github.com/influxdata/influxdb/tsdb/engine/tsm1.DefaultParseFileName
     3.58s  3.72% 40.29%      4.05s  4.20%  runtime.(*mheap).freeSpanLocked
     3.10s  3.22% 43.51%     13.11s 13.61%  runtime.sweepone
     2.85s  2.96% 46.47%      2.85s  2.96%  indexbytebody
     2.84s  2.95% 49.42%      3.18s  3.30%  runtime.findObject
     2.83s  2.94% 52.36%      2.83s  2.94%  runtime.futex
     2.36s  2.45% 54.81%      5.64s  5.85%  strings.Index
     2.12s  2.20% 57.01%     13.34s 13.85%  github.com/influxdata/influxdb/tsdb/engine/tsm1.timeBatchDecodeAllSimple
     1.90s  1.97% 58.98%      5.53s  5.74%  runtime.lock
     1.89s  1.96% 60.94%        33s 34.25%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmGeneration).level
     1.79s  1.86% 62.80%      1.79s  1.86%  runtime.procyield
     1.71s  1.77% 64.57%     28.15s 29.22%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*FileStore).ParseFileName
     1.61s  1.67% 66.24%     29.76s 30.89%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).ParseFileName
     1.45s  1.51% 67.75%      3.32s  3.45%  runtime.unlock
     1.35s  1.40% 69.15%     31.11s 32.29%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).ParseFileName-fm
     1.32s  1.37% 70.52%     34.27s 35.57%  github.com/influxdata/influxdb/tsdb/engine/tsm1.tsmGenerations.level
     1.22s  1.27% 71.79%      1.22s  1.27%  os.IsPathSeparator (inline)
     1.20s  1.25% 73.03%      1.20s  1.25%  runtime.(*mSpanList).remove
     1.10s  1.14% 74.17%      1.10s  1.14%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*block).overlapsTimeRange (inline)
     1.07s  1.11% 75.29%      9.96s 10.34%  runtime.(*mspan).sweep
     1.02s  1.06% 76.34%     19.04s 19.76%  runtime.(*mcentral).cacheSpan
     0.98s  1.02% 77.36%      3.51s  3.64%  github.com/influxdata/influxdb/pkg/encoding/simple8b.DecodeBytesBigEndian
     0.95s  0.99% 78.35%      0.95s  0.99%  runtime.(*gcBitsArena).tryAlloc (inline)
     0.93s  0.97% 79.31%      0.93s  0.97%  runtime.osyield
     0.90s  0.93% 80.25%      0.90s  0.93%  runtime.memmove
     0.88s  0.91% 81.16%      1.84s  1.91%  runtime.greyobject
     0.76s  0.79% 81.95%      0.76s  0.79%  runtime.markBits.isMarked (inline)
     0.75s  0.78% 82.73%      0.75s  0.78%  github.com/influxdata/influxdb/tsdb/engine/tsm1.ZigZagDecode (inline)
     0.71s  0.74% 83.46%      0.85s  0.88%  runtime.(*gcSweepBuf).pop (inline)
     0.70s  0.73% 84.19%      6.34s  6.58%  github.com/influxdata/influxdb/tsdb/engine/tsm1.integerBatchDecodeAllSimple
     0.70s  0.73% 84.92%      0.70s  0.73%  runtime.(*fixalloc).alloc
     0.70s  0.73% 85.64%      7.46s  7.74%  runtime.(*mcentral).freeSpan
     0.56s  0.58% 86.23%      0.56s  0.58%  github.com/influxdata/influxdb/pkg/encoding/simple8b.unpack240
     0.56s  0.58% 86.81%     29.40s 30.52%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmBatchKeyIterator).combineFloat
     0.52s  0.54% 87.35%     23.16s 24.04%  runtime.mallocgc
     0.51s  0.53% 87.88%     13.20s 13.70%  runtime.growslice
     0.40s  0.42% 88.29%      5.54s  5.75%  runtime.(*mheap).alloc_m
     0.39s   0.4% 88.70%         2s  2.08%  runtime.(*mheap).allocSpanLocked
     0.37s  0.38% 89.08%      0.53s  0.55%  runtime.markrootSpans
     0.21s  0.22% 89.30%      0.51s  0.53%  runtime.gcmarknewobject
     0.17s  0.18% 89.47%      1.27s  1.32%  runtime.newMarkBits
     0.15s  0.16% 89.63%      0.49s  0.51%  runtime.gentraceback
     0.14s  0.15% 89.78%      4.90s  5.09%  runtime.deductSweepCredit
     0.13s  0.13% 89.91%      0.82s  0.85%  runtime.heapBits.initSpan
     0.12s  0.12% 90.04%      9.45s  9.81%  runtime.(*mcentral).grow
     0.11s  0.11% 90.15%      5.98s  6.21%  runtime.(*mheap).freeSpan.func1
     0.07s 0.073% 90.22%     27.55s 28.60%  github.com/influxdata/influxdb/tsdb/engine/tsm1.DecodeFloatArrayBlock
     0.07s 0.073% 90.29%     19.26s 19.99%  runtime.(*mcache).nextFree
     0.07s 0.073% 90.37%     10.10s 10.48%  runtime.gcDrain
     0.06s 0.062% 90.43%     11.63s 12.07%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmBatchKeyIterator).combineInteger
     0.06s 0.062% 90.49%      0.55s  0.57%  runtime.schedule
     0.05s 0.052% 90.54%     13.39s 13.90%  github.com/influxdata/influxdb/tsdb/engine/tsm1.TimeArrayDecodeAll
     0.05s 0.052% 90.60%     19.09s 19.82%  runtime.(*mcache).refill
     0.05s 0.052% 90.65%     12.29s 12.76%  runtime.makeslice
     0.05s 0.052% 90.70%     37.77s 39.20%  runtime.systemstack
     0.04s 0.042% 90.74%     19.13s 19.86%  runtime.(*mcache).nextFree.func1
     0.03s 0.031% 90.77%      6.37s  6.61%  github.com/influxdata/influxdb/tsdb/engine/tsm1.IntegerArrayDecodeAll
     0.03s 0.031% 90.80%      0.98s  1.02%  runtime.futexsleep
     0.03s 0.031% 90.83%     12.19s 12.65%  runtime.gosweepone
     0.02s 0.021% 90.86%      1.90s  1.97%  runtime.futexwakeup
     0.02s 0.021% 90.88%     12.16s 12.62%  runtime.gosweepone.func1
     0.01s  0.01% 90.89%     25.43s 26.40%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).PlanLevel
     0.01s  0.01% 90.90%     11.46s 11.90%  github.com/influxdata/influxdb/tsdb/engine/tsm1.DecodeIntegerArrayBlock
     0.01s  0.01% 90.91%      8.49s  8.81%  runtime.(*mheap).alloc
     0.01s  0.01% 90.92%      5.55s  5.76%  runtime.(*mheap).alloc.func1
     0.01s  0.01% 90.93%      5.99s  6.22%  runtime.(*mheap).freeSpan
     0.01s  0.01% 90.94%      0.53s  0.55%  runtime.mProf_Malloc
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles
         0     0% 90.94%      9.07s  9.41%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).PlanOptimize
         0     0% 90.94%     34.57s 35.88%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compact
         0     0% 90.94%     21.16s 21.96%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactFull.func1
         0     0% 90.94%     19.87s 20.62%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactHiPriorityLevel.func1
         0     0% 90.94%     34.57s 35.88%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions.func1
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmBatchKeyIterator).Next
         0     0% 90.94%     41.03s 42.59%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmBatchKeyIterator).merge
         0     0% 90.94%     29.40s 30.52%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmBatchKeyIterator).mergeFloat
         0     0% 90.94%     11.63s 12.07%  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmBatchKeyIterator).mergeInteger
         0     0% 90.94%      1.07s  1.11%  runtime.(*mheap).reclaim
         0     0% 90.94%      7.44s  7.72%  runtime.bgsweep
         0     0% 90.94%     10.12s 10.50%  runtime.gcBgMarkWorker
         0     0% 90.94%     10.09s 10.47%  runtime.gcBgMarkWorker.func2
         0     0% 90.94%      0.70s  0.73%  runtime.goschedImpl
         0     0% 90.94%      0.69s  0.72%  runtime.gosched_m
         0     0% 90.94%      0.62s  0.64%  runtime.markroot
         0     0% 90.94%      0.75s  0.78%  runtime.mcall
         0     0% 90.94%      0.55s  0.57%  runtime.profilealloc

go tool pprof -gif influxd-1.7.1 profile6.pprof
gives me

[pprof graph image: profile002]

Is there anything else I can provide here or on e-mail to help sort this issue?

@v9n

v9n commented Nov 16, 2018

I'm seeing the same thing as @hallvar. I ended up just removing all the databases, since they were new and I was double-writing from 1.6.4, and it has been good since then.

@hallvar

hallvar commented Nov 16, 2018

Indeed, in the InfluxDB log I can see an entry for the compaction of ./asd/full_resolution/1001/ starting, but never an entry for it completing.

@stuartcarnie
Contributor

stuartcarnie commented Nov 16, 2018

@hallvar a few questions:

  • is this an older shard that is stuck?
  • have the .tsm.tmp files stopped growing in size, suggesting the compactor is no longer making progress?
  • can you share one of the smaller shards for us to examine?

@stuartcarnie
Contributor

@hallvar also, do you see a log entry for a compaction for full_resolution_calculations/1005 that starts but does not stop?

@hallvar

hallvar commented Nov 16, 2018

Some more details about full_resolution/1001

$ ls -l
total 42339144
-rw-r--r--  1 root root 2147486635 Nov  6 00:47 000001280-000000005.tsm
-rw-r--r--  1 root root 2147488291 Nov  6 00:49 000001280-000000006.tsm
-rw-r--r--  1 root root 1364751604 Nov  6 00:50 000001280-000000007.tsm
-rw-r--r--  1 root root 2147488268 Nov  6 18:39 000002176-000000005.tsm
-rw-r--r--  1 root root 2147488779 Nov  6 18:42 000002176-000000006.tsm
-rw-r--r--  1 root root  537029393 Nov  6 18:43 000002176-000000007.tsm
-rw-r--r--  1 root root 2147486794 Nov  7 17:26 000003456-000000005.tsm
-rw-r--r--  1 root root 2147487880 Nov  7 17:27 000003456-000000006.tsm
-rw-r--r--  1 root root 1487047140 Nov  7 17:28 000003456-000000007.tsm
-rw-r--r--  1 root root 2147486948 Nov  8 15:41 000004736-000000005.tsm
-rw-r--r--  1 root root 2147487245 Nov  8 15:42 000004736-000000006.tsm
-rw-r--r--  1 root root 1514402097 Nov  8 15:43 000004736-000000007.tsm
-rw-r--r--  1 root root 2147485943 Nov  9 14:25 000006016-000000005.tsm
-rw-r--r--  1 root root 2147486486 Nov  9 14:26 000006016-000000006.tsm
-rw-r--r--  1 root root 1630170600 Nov  9 14:28 000006016-000000007.tsm
-rw-r--r--  1 root root 2147486552 Nov 10 13:48 000007296-000000005.tsm
-rw-r--r--  1 root root 2147486659 Nov 10 13:49 000007296-000000006.tsm
-rw-r--r--  1 root root 1552881210 Nov 10 13:51 000007296-000000007.tsm
-rw-r--r--  1 root root 2147488256 Nov 11 10:34 000008192-000000005.tsm
-rw-r--r--  1 root root 2147489347 Nov 11 10:37 000008192-000000006.tsm
-rw-r--r--  1 root root   74587847 Nov 11 10:37 000008192-000000007.tsm
-rw-r--r--  1 root root 2147489288 Nov 11 23:28 000008704-000000005.tsm
-rw-r--r--  1 root root  463276511 Nov 11 23:29 000008704-000000006.tsm
-rw-r--r--  1 root root  633100657 Nov 12 02:15 000008833-000000004.tsm
-rw-r--r--  1 root root  470640594 Nov 12 06:50 000008962-000000004.tsm
-rw-r--r--  1 root root  139974092 Nov 12 12:36 000009099-000000004.tsm
-rw-r--r--  1 root root  156140475 Nov 13 17:45 000009233-000000004.tsm
-rw-r--r--  1 root root  110167094 Nov 13 17:45 000009361-000000004.tsm
-rw-r--r--  1 root root  111120955 Nov 14 02:01 000009490-000000004.tsm
-rw-r--r--  1 root root  108211921 Nov 14 02:01 000009618-000000004.tsm
-rw-r--r--  1 root root  118685885 Nov 16 01:22 000009748-000000004.tsm
-rw-r--r--  1 root root  134496994 Nov 16 01:23 000009876-000000004.tsm
-rw-r--r--  1 root root  109944892 Nov 16 01:25 000010004-000000004.tsm
-rw-r--r--  1 root root   25862631 Nov 16 01:22 000010036-000000003.tsm
-rw-r--r--  1 root root   26852524 Nov 16 01:22 000010068-000000003.tsm
-rw-r--r--  1 root root   26196516 Nov 16 01:23 000010100-000000003.tsm
-rw-r--r--  1 root root    5690656 Nov 16 01:21 000010108-000000002.tsm
-rw-r--r--  1 root root    6081352 Nov 16 01:22 000010116-000000002.tsm
-rw-r--r--  1 root root    6690825 Nov 16 01:23 000010124-000000002.tsm
-rw-r--r--  1 root root     775892 Nov 14 16:36 000010125-000000001.tsm
-rw-r--r--  1 root root     780584 Nov 14 16:39 000010126-000000001.tsm
-rw-r--r--  1 root root  327155712 Nov 16 20:21 000010126-000000002.tsm.tmp
-rw-r--r--  1 root root      25684 Nov  8 12:16 fields.idx
drwxr-xr-x 10 root root       4096 Nov 16 00:06 index
$ influx_inspect dumptsm 000001280-000000005.tsm
Summary:
  File: 000001280-000000005.tsm
  Time Range: 2018-11-05T00:00:00Z - 2018-11-05T22:29:59.9Z
  Duration: 22h29m59.9s   Series: 1996   File Size: 2147486635


Statistics
  Blocks:
    Total: 1021620 Size: 2118552906 Min: 39 Max: 5866 Avg: 2073
  Index:
    Total: 1021620 Size: 28933716
  Points:
    Total: 1020426448
  Encoding:
    Timestamp: 	none: 0 (0%) 	s8b: 975933 (95%) 	rle: 45687 (4%) 
    Float: 	none: 0 (0%) 	gor: 1021620 (100%) 
  Compression:
    Per block: 2.08 bytes/point
    Total: 2.10 bytes/point
$ influx_inspect dumptsm 000010126-000000001.tsm
Summary:
  File: 000010126-000000001.tsm
  Time Range: 2018-11-11T19:00:00Z - 2018-11-11T19:29:59.9Z
  Duration: 29m59.9s   Series: 86   File Size: 780584

Statistics
  Blocks:
    Total: 1634 Size: 717326 Min: 30 Max: 1847 Avg: 439
  Index:
    Total: 1634 Size: 63245
  Points:
    Total: 1548086
  Encoding:
    Timestamp: 	none: 0 (0%) 	s8b: 516 (31%) 	rle: 1118 (68%) 
    Float: 	none: 0 (0%) 	gor: 1634 (100%) 
  Compression:
    Per block: 0.46 bytes/point
    Total: 0.50 bytes/point
$ influx_inspect dumptsm 000010126-000000002.tsm.tmp
dumptsm: Error opening TSM files: indirectIndex: not enough data for max time

The tsm.tmp file has indeed stopped growing in size:

$ date; ls -l 000010126-000000002.tsm.tmp
Fri Nov 16 20:51:50 UTC 2018
-rw-r--r-- 1 root root 327155712 Nov 16 20:21 000010126-000000002.tsm.tmp

$ date; ls -l 000010126-000000002.tsm.tmp
Fri Nov 16 20:52:25 UTC 2018
-rw-r--r-- 1 root root 327155712 Nov 16 20:21 000010126-000000002.tsm.tmp

I can upload this shard somewhere you can reach it. I'll send the link to your Gmail address once it is up.

I'm unable to say whether the InfluxDB log had a completion entry for full_resolution_calculations/1005. I've been restarting back and forth, and at some point I thought all the shards were fixed except 1005, so unfortunately I sacrificed and deleted that shard already.

@hallvar

hallvar commented Nov 16, 2018

$ influx_inspect report 1001
DB      RP              Shard   File                    Series  New (est) Min Time                 Max Time                 Load Time
*redacted*     full_resolution 1001    000001280-000000005.tsm 1996    1997      2018-11-05T00:00:00Z     2018-11-05T22:29:59.9Z   2.127496ms
*redacted*     full_resolution 1001    000001280-000000006.tsm 3500    3494      2018-11-05T00:00:00Z     2018-11-06T00:29:59.975Z 114.600155ms
*redacted*     full_resolution 1001    000001280-000000007.tsm 2350    2369      2018-11-05T00:00:00Z     2018-11-06T00:29:59.95Z  104.123384ms
*redacted*     full_resolution 1001    000002176-000000005.tsm 2193    0         2018-11-05T22:00:00Z     2018-11-06T17:59:59.975Z 106.93131ms
*redacted*     full_resolution 1001    000002176-000000006.tsm 4771    1         2018-11-05T22:00:00.05Z  2018-11-06T18:29:59.975Z 111.297607ms
*redacted*     full_resolution 1001    000002176-000000007.tsm 937     60        2018-11-05T22:30:00.05Z  2018-11-06T18:29:59.975Z 81.33326ms
*redacted*     full_resolution 1001    000003456-000000005.tsm 1921    0         2018-11-06T18:00:00.025Z 2018-11-07T16:59:59.975Z 101.851983ms
*redacted*     full_resolution 1001    000003456-000000006.tsm 3111    2         2018-11-06T17:30:00.075Z 2018-11-07T16:59:59.975Z 110.449569ms
*redacted*     full_resolution 1001    000003456-000000007.tsm 2810    0         2018-11-06T18:00:00Z     2018-11-07T16:59:59.975Z 94.000429ms
*redacted*     full_resolution 1001    000004736-000000005.tsm 2543    547       2018-11-06T11:48:00.475Z 2018-11-08T14:59:59.95Z  104.866018ms
*redacted*     full_resolution 1001    000004736-000000006.tsm 3019    0         2018-11-07T13:00:00.025Z 2018-11-08T15:29:59.975Z 97.934706ms
*redacted*     full_resolution 1001    000004736-000000007.tsm 2913    88        2018-11-06T11:48:00.525Z 2018-11-08T15:29:59.975Z 126.143675ms
*redacted*     full_resolution 1001    000006016-000000005.tsm 1936    0         2018-11-08T15:00:00.05Z  2018-11-09T13:59:59.975Z 89.927246ms
*redacted*     full_resolution 1001    000006016-000000006.tsm 2810    0         2018-11-08T12:00:00Z     2018-11-09T13:59:59.975Z 91.292107ms
*redacted*     full_resolution 1001    000006016-000000007.tsm 3092    0         2018-11-08T15:30:00Z     2018-11-09T13:59:59.975Z 61.211254ms
*redacted*     full_resolution 1001    000007296-000000005.tsm 1912    0         2018-11-09T13:30:00.15Z  2018-11-10T13:29:59.95Z  43.394452ms
*redacted*     full_resolution 1001    000007296-000000006.tsm 2843    0         2018-11-09T13:30:00Z     2018-11-10T13:29:59.975Z 95.845912ms
*redacted*     full_resolution 1001    000007296-000000007.tsm 3082    0         2018-11-09T14:00:00Z     2018-11-10T13:29:59.95Z  45.762259ms
*redacted*     full_resolution 1001    000008192-000000005.tsm 2175    0         2018-11-10T13:00:00Z     2018-11-11T09:59:59.8Z   36.166727ms
*redacted*     full_resolution 1001    000008192-000000006.tsm 4163    0         2018-11-10T09:30:00.05Z  2018-11-11T10:29:59.975Z 94.093133ms
*redacted*     full_resolution 1001    000008192-000000007.tsm 466     0         2018-11-10T10:00:00.05Z  2018-11-10T13:59:59.925Z 9.267417ms
*redacted*     full_resolution 1001    000008704-000000005.tsm 4535    0         2018-11-10T12:30:00Z     2018-11-11T22:59:59.95Z  69.444225ms
*redacted*     full_resolution 1001    000008704-000000006.tsm 2268    0         2018-11-10T14:00:00Z     2018-11-11T22:59:59.95Z  33.501507ms
*redacted*     full_resolution 1001    000008833-000000004.tsm 6660    0         2018-11-10T14:00:00.025Z 2018-11-11T23:59:59.975Z 15.833312ms
*redacted*     full_resolution 1001    000008962-000000004.tsm 857     0         2018-11-10T22:00:00.025Z 2018-11-11T17:59:59.925Z 27.914255ms
*redacted*     full_resolution 1001    000009099-000000004.tsm 999     0         2018-11-11T05:30:00.05Z  2018-11-11T20:59:59.95Z  12.381657ms
*redacted*     full_resolution 1001    000009233-000000004.tsm 2206    174       2018-11-10T10:00:00Z     2018-11-11T23:59:59.95Z  21.198654ms
*redacted*     full_resolution 1001    000009361-000000004.tsm 1426    0         2018-11-10T12:30:00Z     2018-11-11T13:29:59.95Z  27.671145ms
*redacted*     full_resolution 1001    000009490-000000004.tsm 1205    0         2018-11-10T17:00:00Z     2018-11-10T21:59:59.975Z 23.743869ms
*redacted*     full_resolution 1001    000009618-000000004.tsm 1206    0         2018-11-10T21:30:00Z     2018-11-11T02:29:59.975Z 23.800471ms
*redacted*     full_resolution 1001    000009748-000000004.tsm 1499    0         2018-11-11T02:00:00Z     2018-11-11T15:59:59.975Z 24.383997ms
*redacted*     full_resolution 1001    000009876-000000004.tsm 1498    0         2018-11-11T06:30:00Z     2018-11-11T22:59:59.975Z 19.937697ms
*redacted*     full_resolution 1001    000010004-000000004.tsm 1412    0         2018-11-11T10:00:00Z     2018-11-11T23:59:59.975Z 24.115485ms
*redacted*     full_resolution 1001    000010036-000000003.tsm 1118    0         2018-11-11T14:30:00.025Z 2018-11-11T15:59:59.975Z 6.45649ms
*redacted*     full_resolution 1001    000010068-000000003.tsm 1118    0         2018-11-11T16:00:00Z     2018-11-11T17:29:59.975Z 12.507463ms
*redacted*     full_resolution 1001    000010100-000000003.tsm 1118    0         2018-11-11T17:00:00Z     2018-11-11T18:29:59.975Z 14.894271ms
*redacted*     full_resolution 1001    000010108-000000002.tsm 516     0         2018-11-11T18:00:00.075Z 2018-11-11T18:59:59.975Z 13.749419ms
*redacted*     full_resolution 1001    000010116-000000002.tsm 688     0         2018-11-11T18:30:00Z     2018-11-11T19:29:59.975Z 7.964258ms
*redacted*     full_resolution 1001    000010124-000000002.tsm 688     0         2018-11-11T19:00:00Z     2018-11-11T19:29:59.975Z 6.293983ms
*redacted*     full_resolution 1001    000010125-000000001.tsm 86      0         2018-11-11T19:00:00.025Z 2018-11-11T19:29:59.925Z 2.907131ms
*redacted*     full_resolution 1001    000010126-000000001.tsm 86      0         2018-11-11T19:00:00Z     2018-11-11T19:29:59.9Z   166.908µs

Summary:
  Files: 41
  Time Range: 2018-11-05T00:00:00Z - 2018-11-11T23:59:59.975Z
  Duration: 167h59m59.975s 

Statistics
  Series:
     - *redacted* (est): 8732 (100%)
  Total (est): 8732
Completed in 2.263066813s

@conet

conet commented Nov 16, 2018

Are you sure that you're running 1.7.1? Docker hub shows updates "an hour ago".
[screenshot: Docker Hub tag update times]
It is either that, or maybe you are running into a different kind of problem. I'm running on bare metal, installed from the RPM: https://dl.influxdata.com/influxdb/releases/influxdb-1.7.1.x86_64.rpm

@stuartcarnie
Contributor

@conet have you had any issues since upgrading to 1.7.1?

@conet

conet commented Nov 16, 2018

@stuartcarnie no, but I just read the backlog, and another difference besides packaging is that I haven't switched to TSI yet - I'm using in-memory indexes. The CPU has been idling ever since the upgrade to 1.7.1.

@stuartcarnie
Contributor

@conet glad to hear you are still stable. The index type was unrelated to this issue.

@conet

conet commented Nov 16, 2018

@hallvar Are you sure that the 4 stuck compaction processes account for the CPU being used? What I noticed is that a single compaction process would only consume 1 core. Could you run this query on the internal database?

select /.*CompactionsActive/ from tsm1_engine where time > now() - 1m AND id = '1001' group by *

How about queries hitting the DB - aren't they causing the load?

The index type was unrelated to this issue.

@stuartcarnie I was trying to spot differences and that was the only thing I noticed; anyway, I don't want to create more noise around this issue 😄

@stuartcarnie
Contributor

No worries @conet, appreciate your input!

@hallvar

hallvar commented Nov 16, 2018

@stuartcarnie Very interesting - I am running the Docker image influxdb:1.7.1, but the startup log does indeed say something else:

InfluxDB starting {"log_id": "0BohEL_l000", "version": "1.7.0", "branch": "1.7", "commit": "dac4c6f571662c63dc0d73346787b8c7f113222a"}

After following the Docker image PR very closely, I very eagerly pulled it about 15 minutes after the tag first appeared on Docker Hub. Go figure - something must have been off with the initially uploaded Docker image.

When pulling the image tag again just now, I get an updated image.

After restarting influx using the new image I finally see the full_resolution/1001 shard compaction complete successfully.

Sorry for bringing in the extra confusion; I got completely sidetracked by the initially mis-tagged Docker image.

@stuartcarnie Excellent work and nicely fixed. Thanks for making me double check the version!

@conet You are right, that was probably query load I was seeing.
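For anyone else hitting this, a quick way to confirm what is actually running (the container name influxdb is a placeholder):

# the /ping endpoint reports the running version in a response header
curl -si http://localhost:8086/ping | grep -i x-influxdb-version

# or ask the binary inside the container directly
docker exec influxdb influxd version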

@stuartcarnie
Contributor

@hallvar great news. Credit for checking the version goes to @conet!

@v9n

v9n commented Nov 16, 2018

Sadly I was indeed running 1.7.0 despite using the Docker image tagged 1.7.1

[screenshot: docker image listing showing tag 1.7.1]

but when I connected over the shell, it said 1.7.0 :(...

Not sure why, but I force-pulled 1.7.1 again and it looks right now :D thanks @hallvar @conet

@hallvar

hallvar commented Nov 16, 2018

Ah, yes, it's been some long days... thanks @conet! Finally everything is back to normal :-)

@hallvar

hallvar commented Nov 17, 2018

@v9n yes, that makes me certain the initially uploaded Docker image tagged 1.7.1 was in fact 1.7.0. The pull request to the docker official images repo seems correct: docker-library/official-images/pull/5076. Anyway, that's a separate issue that can be opened there, I guess.

@randombk

Linking influxdata/influxdata-docker#279 for visibility. The official docker images likely need to be rebuilt for both normal and alpine versions.

@randombk

Nevermind, I didn't notice that the image was just updated.

@chenxu1990

I upgraded to v1.7.1 and deleted all the data; so far everything is normal. Thank you @e-dard.

@prabhumarappan

prabhumarappan commented Jul 8, 2019

I'm currently on 1.7.7, and I'm facing a huge memory issue. (I was facing a similar issue with 1.7.6 as well.)
These are the things I have tried:

  1. Disabled pprof in the config file
  2. Changed to the tsi1 index-version to check (previously it was inmem, the default)

Number of series cardinality = 1

I'm not sure what is causing the huge memory increase.

TIA

@e-dard
Contributor

e-dard commented Jul 8, 2019

@prabhumarappan please open a new issue.

@mqu

mqu commented Aug 26, 2019

You may want to have a look at #13318 for 1.7.7.

@prabhumarappan

Thanks @mqu!
