
[0.9.4.1] CQ's run into a "deadlock" - especially by using count() or count(distinct(field_key)) #3158

Closed
markuspatrick opened this issue Jun 26, 2015 · 9 comments


@markuspatrick

I tried inserting my data (in batches of 500 data tuples per minute, in real time) using the new InfluxDB 0.9.1-rc1 (Ubuntu). Additionally, one CQ downsamples the data to a "5m" precision level. No retention policy other than the default is used or created. A Python script inserts the data using the influxdb-python lib (the new version for InfluxDB 0.9). InfluxDB is installed on Ubuntu, no cluster.
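
For reference, the insert side looks roughly like this (a minimal sketch with influxdb-python; host, credentials and the measurement/tag/field names are placeholders, not my exact schema):

from influxdb import InfluxDBClient

# Sketch of the write job: one batch of ~500 points per minute.
client = InfluxDBClient(host='localhost', port=8086,
                        username='admin', password='admin',
                        database='intraV3')

points = [
    {
        "measurement": "transactions",
        "tags": {"partner_id": "p1", "product_id": "42", "type": "sale"},
        "fields": {"turnover": 19.99, "sale": 1, "view": 3},
        "time": "2015-06-26T15:03:00Z",
    },
    # ... roughly 500 such points per batch
]

client.write_points(points)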

After a couple of minutes, or up to half an hour (different each time), the insert operation breaks off with a timeout error (see error log below).

Using 0.9 (in that case under OS X):

  • The timeout happens after 3 to 10 inserts

Observations with 0.9.1-rc1:
1.) The CQ uses 4 sum() functions (on floats) and one count() function on a string field - count(distinct(id)). Without the count function it works better, i.e. it crashes after 3 or 4 minutes (3 or 4 batches). With the count function, it crashes almost immediately (after the 2nd or 3rd tuple).

2.) After the timeout, InfluxDB must be restarted. If not, every other POST request (no matter what kind, e.g. insert, show statements ...) runs into a timeout again. The log file contains no error messages or stack traces, only the POST entries with timeouts.

3.) The insert (with CQs in the background) is quite slow. It took approx. 40 sec. to insert 500 data tuples (6 float fields and 4 tags). Without CQs in the background, it is done in one or two seconds.

4.) Just to be sure: it is not a network problem or a load problem on the server.

You can find the history of this issue here:
https://groups.google.com/d/msg/influxdb/H11ivieFPG4/3YPM2yYi93wJ

InfluxDB log file (0.9.1-rc1):

[http] 2015/06/26 17:03:06 192.168.2.57 - - [26/Jun/2015:17:03:06 +0200] POST /write?p=admin&u=admin HTTP/1.1 204 0 - python-requests/2.7.0 CPython/2.7.6 Darwin/14.3.0 731380ea-1c14-11e5-87fd-000000000000 66.764963ms

[http] 2015/06/26 17:03:06 192.168.2.57 - - [26/Jun/2015:17:03:06 +0200] POST /write?p=admin&u=admin HTTP/1.1 204 0 - python-requests/2.7.0 CPython/2.7.6 Darwin/14.3.0 731e362b-1c14-11e5-87fe-000000000000 116.77357ms

[http] 2015/06/26 17:03:11 192.168.2.57 - - [26/Jun/2015:17:03:06 +0200] POST /write?p=admin&u=admin HTTP/1.1 500 44 - python-requests/2.7.0 CPython/2.7.6 Darwin/14.3.0 73308ab3-1c14-11e5-87ff-000000000000 5.004348282s

[continuous_querier] 2015/06/26 17:03:11 timeout

[continuous_querier] 2015/06/26 17:03:11 error during recompute previous: timeout. running: SELECT sum(turnover) AS turnover, sum(sale) AS sale, sum(view) AS view INTO "intraV3"."default"."transactions.product.5m" FROM "intraV3"."default".transactions WHERE time >= '2015-06-26 14:55:00' AND time < '2015-06-26 15:00$

[continuous_querier] 2015/06/26 17:03:11 error executing query: CREATE CONTINUOUS QUERY CQ1 ON intraV3 BEGIN SELECT sum(turnover) AS turnover, sum(sale) AS sale, sum(view) AS view INTO "intraV3"."default"."transactions.product.5m" FROM "intraV3"."default".transactions GROUP BY time(5m), partner_id, product_id, type$

@markuspatrick
Author

I think I found the problem.

My CQs use count(distinct(value)) as an aggregate function on string and integer values.
After creating the CQ, the next insert of data fails with the timeout error 500.
If you:

  • Restart InfluxDB
  • Drop the corresponding CQ
  • Restart the insert job (with the same data)

everything works perfectly again, without a timeout error.

Issue #3171 also describes inconsistent behavior of nested statements with count and distinct.
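
For completeness, the workaround can be scripted roughly like this (a sketch with influxdb-python; the CQ name CQ1 and the database intraV3 are taken from the log above):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086,
                        username='admin', password='admin',
                        database='intraV3')

# After restarting the server, list the CQs and drop the offending one,
# then restart the insert job with the same data.
print(client.query('SHOW CONTINUOUS QUERIES'))
client.query('DROP CONTINUOUS QUERY CQ1 ON intraV3')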

@markuspatrick markuspatrick changed the title timeout problems using CQ in 0.9 an 0.9.1-rc1 [0.9.1-rc1 and 0.9.0] count(distinct(field_key)) in CQ's results in write-timeouts Jul 1, 2015
@beckettsean beckettsean self-assigned this Jul 15, 2015
@beckettsean beckettsean added this to the 0.9.3 milestone Jul 15, 2015
@beckettsean beckettsean modified the milestones: 0.9.4, 0.9.3 Aug 6, 2015
@markuspatrick
Author

Using
InfluxDB starting, version 0.9.4.1, branch 0.9.4, commit c4f85f8
@beckettsean:
It seems that count() and also the combination count(distinct(field_key)) now work better in CQs.
But InfluxDB crashes silently after a minute, i.e. it looks like an endless loop.
There is no response to read queries (see below):

[wal] 2015/09/18 16:24:33 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:24:33 write to index of partition 1 took 2.216323ms
[wal] 2015/09/18 16:24:43 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:24:43 write to index of partition 1 took 1.39374ms
[http] 2015/09/18 16:24:52 ::1 - - [18/Sep/2015:16:24:52 +0200] GET /ping HTTP/1.1 204 0 - InfluxDBShell/0.9.4.1 06797b55-5e11-11e5-8012-000000000000 42.037µs
[wal] 2015/09/18 16:24:53 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:24:53 write to index of partition 1 took 1.296851ms
[http] 2015/09/18 16:24:56 ::1 - - [18/Sep/2015:16:24:56 +0200] GET /query?db=&q=select+sum%28view%29+from+%2F.%2A%2F HTTP/1.1 200 72 - InfluxDBShell/0.9.4.1 09188567-5e11-11e5-8013-000000000000 410.69µs
[http] 2015/09/18 16:25:00 ::1 - admin [18/Sep/2015:16:24:30 +0200] POST /write?db=intra10 HTTP/1.1 500 32 - python-requests/2.7.0 CPython/2.7.10 Darwin/14.5.0 f9cad412-5e10-11e5-8011-000000000000 30.010060316s

[query] 2015/09/18 16:25:01 SELECT sum(view) FROM "intra10"."default"./.*/

[wal] 2015/09/18 16:25:03 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:25:03 write to index of partition 1 took 2.330035ms
[wal] 2015/09/18 16:25:13 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:25:13 write to index of partition 1 took 1.428896ms

Write Queries run into timeouts.

Restarting the db helps, but after a minute it happens again.

@markuspatrick markuspatrick changed the title [0.9.1-rc1 and 0.9.0] count(distinct(field_key)) in CQ's results in write-timeouts [0.9.4.1] count(distinct(field_key)) in CQ's results in write-timeouts Sep 18, 2015
@markuspatrick markuspatrick changed the title [0.9.4.1] count(distinct(field_key)) in CQ's results in write-timeouts [0.9.4.1] count() or count(distinct(field_key)) in CQ's results in endless loop/silent crash Sep 18, 2015
@beckettsean
Contributor

@markuspatrick I'm not sure that is a valid query. Can you try running SELECT sum(view) FROM "intra10"."default"./.*/ (leaving out the trailing asterisk)?

Not sure what you mean by

But InfluxDB crashes silently after a minute, i.e. it looks like an endless loop.

When CQs are running with COUNT(DISTINCT()), then eventually InfluxDB has problems where writes time out and queries just hang? What about the /ping endpoint?

If you disable CQs does the problem still happen after a minute?

Is it exactly a minute or is it variable in time?

@beckettsean beckettsean removed their assignment Sep 18, 2015
@markuspatrick
Author

@beckettsean:

Sorry, the trailing asterisk was a copy-and-paste error on my side (I have corrected it in the log above). The query was SELECT sum(view) FROM "intra10"."default"./.*/

About my config:
The database receives approximately 1200 points per minute to write.
The CQ aggregates every 5 minutes with a count(distinct(view)) (and other fields with sum()). The measurement used in the CQ (it holds the raw data) contains only approx. 300K points (just for testing).
The CQ is simply used to count the unique session_ids of a web page (data type string).
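
For illustration, the session-counting CQ looks roughly like this (a sketch; the session_id field and the target measurement name are placeholders for my actual schema):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='intra10')

# Sketch of the problematic CQ: count unique session ids (a string field)
# per 5-minute bucket, alongside a plain sum on another field.
cq = ('CREATE CONTINUOUS QUERY cq_sessions_5m ON intra10 BEGIN '
      'SELECT count(distinct(session_id)) AS sessions, sum(view) AS view '
      'INTO "sessions.5m" FROM "raw_data" '
      'GROUP BY time(5m) END')
client.query(cq)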

Using the CQ without count(distinct(field_key)): everything works fine.
Using it only with count(field_key) seems to be more stable - at least it does not happen every minute, but after a while the same problem occurs.
Using it with count(distinct(field_key)) is not stable: as soon as new data is written, the read query above no longer works. The same holds for other queries on the raw data measurement or on the measurements created by the CQ. The corresponding write job gets timeouts for the next points to write. Interestingly, queries like show measurements still work.

It always happens after the insert of new data (which is why the problem occurs every minute).
Restarting helps, and the read queries work fine until the next data points come in (after a minute).

"What about the /ping endpoint?"
I can ping the endpoint and queries like show measurementsworks fine (see above).

@markuspatrick
Author

@beckettsean:
A little update on the problem.
It seems that the CQs run into a kind of lock - even without a count() or count(distinct()) function - after a while (20-40 minutes). Lock means I can write successfully, but I can't read any measurement (no response to queries even after minutes, although they normally have under 1s response time).

In contrast to the CQs with distinct and/or count (see above), the write jobs still work without timeouts. After restarting the db, the raw data measurement (written by the external write job every minute) has no lagged data, but the downsampled measurements do.

I can provoke the behavior with a lot of requests on the raw data and downsampled measurements, e.g. with a Grafana dashboard and auto-refresh every 5s.
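
The same kind of load can be generated without Grafana, for example with a small script that repeatedly queries the raw and downsampled measurements (a hypothetical sketch, not the exact queries from my dashboard):

import time
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='intra10')

# Hypothetical reproduction: hit the raw and downsampled measurements the
# way a Grafana dashboard with 5s auto-refresh would.
queries = [
    'SELECT sum(view) FROM "raw_data" WHERE time > now() - 1h GROUP BY time(1m)',
    'SELECT sum(view) FROM "downsampled.5m" WHERE time > now() - 6h GROUP BY time(5m)',
]

while True:
    for q in queries:
        try:
            client.query(q)
        except Exception as exc:  # a hang or timeout surfaces here
            print('query failed: %s' % exc)
    time.sleep(5)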

The monitor is active and I can give you some information from the _internal database. What would be interesting for you? As I said above, I can't see any error, warning or anything else in the log.

Here is some information about my configuration.


All CQs look like the example below. The CQs only differ in their GROUP BY clause (5m, 1h, 1d). In total, I have 6 CQs and 2 raw data measurements (3 CQs on each measurement). The first raw data measurement is filled every minute with approx. 1200 points, the second one with 160 points per minute.

CREATE CONTINUOUS QUERY CQ1_5m ON db1
BEGIN
  SELECT
    sum("integer-field1") AS A,
    sum("integer-field2") AS B,
    sum("float-field") AS C,
    sum("float-field") AS D,
    sum("integer-field1") AS E,
    sum("integer-field1") AS F,
    sum("integer-field1") AS G,
    sum("integer-field1") AS H,
    sum("integer-field1") AS I
  INTO "downsampled.5m"
  FROM "raw_data"
  GROUP BY time(5m), tag1, tag2, tag3, tag4, tag5, tag6, tag7
END;
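
Since the six CQs only differ in the grouping interval, they can be generated in a loop, roughly like this (a sketch; only two of the aggregated fields are shown and all names are placeholders):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='db1')

# Sketch: generate one CQ per downsampling interval for a raw measurement.
for interval in ('5m', '1h', '1d'):
    cq = ('CREATE CONTINUOUS QUERY CQ1_%s ON db1 BEGIN '
          'SELECT sum("integer-field1") AS A, sum("float-field") AS C '
          'INTO "downsampled.%s" FROM "raw_data" '
          'GROUP BY time(%s), tag1, tag2 END'
          % (interval, interval, interval))
    client.query(cq)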

System config:
Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty

InfluxDB 0.9.4.1 (stable) for Ubuntu
Single node instance

@markuspatrick markuspatrick changed the title [0.9.4.1] count() or count(distinct(field_key)) in CQ's results in endless loop/silent crash [0.9.4.1] CQ's run into a "deadlock" - especially by using count() or count(distinct(field_key)) Sep 22, 2015
@markuspatrick
Author

Next update - after 24 h of testing

What looks like a deadlock seems to be only "hanging" CQs at the small grouping level (5m).
After 24h, the downsampled measurements with 1h and 1d grouping contain correct data without lags.
In contrast, the downsampled measurements with 5m grouping contain a lot of lags. The lagging starts 2h after the start of the test. The lags are 20 to 40 minutes long, with about 20 minutes of correct data between them.
During the "lag time", read queries have quite long response times.

@markuspatrick
Author

I found a solution that works in my case and avoids hanging CQs:
1.) Drop all CQs with a grouping level smaller than 1h
2.) Give every measurement, or rather every time level, its own retention policy (for 1h and 1d)
3.) Disable internal monitoring (it writes into the db every 10s...)

With these restrictions, InfluxDB runs without hanging CQs or lagged data. Unfortunately, CQs with count() or count(distinct()) are still a "deadlock maker".
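
The retention policy part of the workaround (point 2) looks roughly like this (a sketch; the RP names and durations are my choice, and the exact [monitor] config keys for point 3 may differ slightly between versions):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='db1')

# One retention policy per downsampling level, then point the CQ's INTO
# clause at it (names and durations are just an example).
client.query('CREATE RETENTION POLICY "rp_1h" ON db1 DURATION 90d REPLICATION 1')
client.query('CREATE RETENTION POLICY "rp_1d" ON db1 DURATION INF REPLICATION 1')

client.query('CREATE CONTINUOUS QUERY CQ1_1h ON db1 BEGIN '
             'SELECT sum("integer-field1") AS A '
             'INTO "db1"."rp_1h"."downsampled.1h" FROM "raw_data" '
             'GROUP BY time(1h), tag1 END')

# Point 3 (disabling internal monitoring) is a config file change, roughly:
#   [monitor]
#     store-enabled = false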

Maybe it gets better (even without the restrictions above) with the new storage engine (#4086).

@toddboom
Contributor

Closing this out since I'm pretty sure it's been solved in the current releases. Please reopen if you still see this issue in v0.11.0 or higher.

@nanicpc

nanicpc commented Nov 2, 2017

I've been having the same problem with InfluxDB 1.2. Have you found a solution?
