
[0.9.4.1] CQ's run into a "deadlock" - especially by using count() or count(distinct(field_key)) #3158

Closed
markuspatrick opened this issue Jun 26, 2015 · 9 comments


@markuspatrick

I tried inserting my data (in batches of 500 data tuples per minute, in real time) using the new InfluxDB 0.9.1-rc1 (Ubuntu). Additionally, one CQ downsamples the data to a "5m" precision level. No retention policy other than the default is used or created. A Python script inserts the data using the influxdb-python lib (the new version for InfluxDB 0.9). InfluxDB is installed on Ubuntu, no cluster.
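
For reference, the insert side looks roughly like this (a minimal sketch with influxdb-python; host, credentials and the measurement/tag/field names are placeholders, not my exact schema):

from influxdb import InfluxDBClient

# Sketch of the write job: one batch of ~500 points per minute.
client = InfluxDBClient(host='localhost', port=8086,
                        username='admin', password='admin',
                        database='intraV3')

points = [
    {
        "measurement": "transactions",
        "tags": {"partner_id": "p1", "product_id": "42", "type": "sale"},
        "fields": {"turnover": 19.99, "sale": 1, "view": 3},
        "time": "2015-06-26T15:03:00Z",
    },
    # ... roughly 500 such points per batch
]

client.write_points(points)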

After a couple of minutes, or up to half an hour (different each time), the insert operation breaks off with a timeout error (see error log below).

Using 0.9 (in that case under OS X):

  • The timeout happens after 3 to 10 inserts

Observations with 0.9.1-rc1:
1.) The CQ uses 4 sum() functions (on floats) and one count() function on a string field - count(distinct(id)). Without the count function it works better, i.e. it crashes after 3 or 4 minutes (3 or 4 batches). With the count function, it crashes almost immediately (after the 2nd or 3rd tuple).

2.) After the timeout, InfluxDB must be restarted. If not, every other POST request (no matter what kind, e.g. insert, show statements ...) runs into a timeout again. The log file contains no error messages or stack traces, only the POST entries with timeouts.

3.) The insert (with CQs in the background) is quite slow. It took approx. 40 sec. to insert 500 data tuples (6 float fields and 4 tags). Without CQs in the background, it is done in one or two seconds.

4.) Just to be sure: it is not a network problem or a load problem on the server.

You can find the history of this issue here:
https://groups.google.com/d/msg/influxdb/H11ivieFPG4/3YPM2yYi93wJ

InfluxDB log file (0.9.1-rc1):

[http] 2015/06/26 17:03:06 192.168.2.57 - - [26/Jun/2015:17:03:06 +0200] POST /write?p=admin&u=admin HTTP/1.1 204 0 - python-requests/2.7.0 CPython/2.7.6 Darwin/14.3.0 731380ea-1c14-11e5-87fd-000000000000 66.764963ms

[http] 2015/06/26 17:03:06 192.168.2.57 - - [26/Jun/2015:17:03:06 +0200] POST /write?p=admin&u=admin HTTP/1.1 204 0 - python-requests/2.7.0 CPython/2.7.6 Darwin/14.3.0 731e362b-1c14-11e5-87fe-000000000000 116.77357ms

[http] 2015/06/26 17:03:11 192.168.2.57 - - [26/Jun/2015:17:03:06 +0200] POST /write?p=admin&u=admin HTTP/1.1 500 44 - python-requests/2.7.0 CPython/2.7.6 Darwin/14.3.0 73308ab3-1c14-11e5-87ff-000000000000 5.004348282s

[continuous_querier] 2015/06/26 17:03:11 timeout

[continuous_querier] 2015/06/26 17:03:11 error during recompute previous: timeout. running: SELECT sum(turnover) AS turnover, sum(sale) AS sale, sum(view) AS view INTO "intraV3"."default"."transactions.product.5m" FROM "intraV3"."default".transactions WHERE time >= '2015-06-26 14:55:00' AND time < '2015-06-26 15:00$

[continuous_querier] 2015/06/26 17:03:11 error executing query: CREATE CONTINUOUS QUERY CQ1 ON intraV3 BEGIN SELECT sum(turnover) AS turnover, sum(sale) AS sale, sum(view) AS view INTO "intraV3"."default"."transactions.product.5m" FROM "intraV3"."default".transactions GROUP BY time(5m), partner_id, product_id, type$

@markuspatrick
Author

I think I found the problem.

My CQs use count(distinct(value)) as an aggregate function on string and integer values.
After creating the CQ, the next insert of data fails with the timeout error 500.
If you:

  • Restart InfluxDB
  • Drop the corresponding CQ
  • Restart the insert job (with the same data)

everything works perfectly again, without a timeout error.

Issue #3171 also describes inconsistent behavior of nested statements with count and distinct.
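
For completeness, the workaround can be scripted roughly like this (a sketch with influxdb-python; the CQ name CQ1 and the database intraV3 are taken from the log above):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086,
                        username='admin', password='admin',
                        database='intraV3')

# After restarting the server, list the CQs and drop the offending one,
# then restart the insert job with the same data.
print(client.query('SHOW CONTINUOUS QUERIES'))
client.query('DROP CONTINUOUS QUERY CQ1 ON intraV3')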

@markuspatrick markuspatrick changed the title timeout problems using CQ in 0.9 an 0.9.1-rc1 [0.9.1-rc1 and 0.9.0] count(distinct(field_key)) in CQ's results in write-timeouts Jul 1, 2015
@beckettsean beckettsean self-assigned this Jul 15, 2015
@beckettsean beckettsean added this to the 0.9.3 milestone Jul 15, 2015
@beckettsean beckettsean modified the milestones: 0.9.4, 0.9.3 Aug 6, 2015
@markuspatrick
Author

Using
InfluxDB starting, version 0.9.4.1, branch 0.9.4, commit c4f85f8
@beckettsean:
It seems that count() and also the combination count(distinct(field_key)) now work better in CQs.
But InfluxDB crashes silently after a minute, i.e. it looks like an endless loop.
There is no response to read queries (see below):

[wal] 2015/09/18 16:24:33 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:24:33 write to index of partition 1 took 2.216323ms
[wal] 2015/09/18 16:24:43 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:24:43 write to index of partition 1 took 1.39374ms
[http] 2015/09/18 16:24:52 ::1 - - [18/Sep/2015:16:24:52 +0200] GET /ping HTTP/1.1 204 0 - InfluxDBShell/0.9.4.1 06797b55-5e11-11e5-8012-000000000000 42.037µs
[wal] 2015/09/18 16:24:53 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:24:53 write to index of partition 1 took 1.296851ms
[http] 2015/09/18 16:24:56 ::1 - - [18/Sep/2015:16:24:56 +0200] GET /query?db=&q=select+sum%28view%29+from+%2F.%2A%2F HTTP/1.1 200 72 - InfluxDBShell/0.9.4.1 09188567-5e11-11e5-8013-000000000000 410.69µs
[http] 2015/09/18 16:25:00 ::1 - admin [18/Sep/2015:16:24:30 +0200] POST /write?db=intra10 HTTP/1.1 500 32 - python-requests/2.7.0 CPython/2.7.10 Darwin/14.5.0 f9cad412-5e10-11e5-8011-000000000000 30.010060316s

[query] 2015/09/18 16:25:01 SELECT sum(view) FROM "intra10"."default"./.*/

[wal] 2015/09/18 16:25:03 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:25:03 write to index of partition 1 took 2.330035ms
[wal] 2015/09/18 16:25:13 Flush due to idle. Flushing 10 series with 10 points and 656 bytes from partition 1
[wal] 2015/09/18 16:25:13 write to index of partition 1 took 1.428896ms

Write Queries run into timeouts.

Restarting the db helps, but after a minute it happens again.

@markuspatrick markuspatrick changed the title [0.9.1-rc1 and 0.9.0] count(distinct(field_key)) in CQ's results in write-timeouts [0.9.4.1] count(distinct(field_key)) in CQ's results in write-timeouts Sep 18, 2015
@markuspatrick markuspatrick changed the title [0.9.4.1] count(distinct(field_key)) in CQ's results in write-timeouts [0.9.4.1] count() or count(distinct(field_key)) in CQ's results in endless loop/silent crash Sep 18, 2015
@beckettsean
Contributor

@markuspatrick I'm not sure that is a valid query. Can you try running SELECT sum(view) FROM "intra10"."default"./.*/ (leaving out the trailing asterisk)?

Not sure what you mean by

But InfluxDB crashes silently after a minute, i.e. it looks like an endless loop.

When CQs are running with COUNT(DISTINCT()), then eventually InfluxDB has problems where writes time out and queries just hang? What about the /ping endpoint?

If you disable CQs does the problem still happen after a minute?

Is it exactly a minute or is it variable in time?

@beckettsean beckettsean removed their assignment Sep 18, 2015
@markuspatrick
Author

@beckettsean:

Sorry, the trailing asterisk was a copy-and-paste error on my side (I have corrected it in the log above). The query was SELECT sum(view) FROM "intra10"."default"./.*/

About my config:
The database receives approximately 1200 points per minute to write.
The CQ aggregates every 5 minutes with a count(distinct(view)) (and other fields with sum()). The measurement used in the CQ (it holds the raw data) contains only approx. 300K points (just for testing).
The CQ is simply used to count the unique session_ids of a web page (data type string).
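
For illustration, the session-counting CQ looks roughly like this (a sketch; the session_id field and the target measurement name are placeholders for my actual schema):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='intra10')

# Sketch of the problematic CQ: count unique session ids (a string field)
# per 5-minute bucket, alongside a plain sum on another field.
cq = ('CREATE CONTINUOUS QUERY cq_sessions_5m ON intra10 BEGIN '
      'SELECT count(distinct(session_id)) AS sessions, sum(view) AS view '
      'INTO "sessions.5m" FROM "raw_data" '
      'GROUP BY time(5m) END')
client.query(cq)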

Using the CQ without count(distinct(field_key)): everything works fine.
Using it only with count(field_key) seems to be more stable - at least it does not happen every minute, but after a while the same problem occurs.
Using it with count(distinct(field_key)) is not stable: as soon as new data is written, the read query above no longer works. The same holds for other queries on the raw data measurement or on the measurements created by the CQ. The corresponding write job gets timeouts for the next points to write. Interestingly, queries like show measurements still work.

It always happens after the insert of new data (which is why the problem occurs every minute).
Restarting helps, and the read queries work fine until the next data points come in (after a minute).

"What about the /ping endpoint?"
I can ping the endpoint and queries like show measurementsworks fine (see above).

@markuspatrick
Author

@beckettsean:
A little update on the problem.
It seems that the CQs run into a kind of lock - even without a count() or count(distinct()) function - after a while (20-40 minutes). Lock means I can write successfully, but I can't read any measurement (no response to queries even after minutes, although they normally have under 1s response time).

In contrast to the CQs with distinct and/or count (see above), the write jobs still work without timeouts. After restarting the db, the raw data measurement (written by the external write job every minute) has no lagged data, but the downsampled measurements do.

I can provoke the behavior with a lot of requests on the raw data and downsampled measurements, e.g. with a Grafana dashboard and auto-refresh every 5s.
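
The same kind of load can be generated without Grafana, for example with a small script that repeatedly queries the raw and downsampled measurements (a hypothetical sketch, not the exact queries from my dashboard):

import time
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='intra10')

# Hypothetical reproduction: hit the raw and downsampled measurements the
# way a Grafana dashboard with 5s auto-refresh would.
queries = [
    'SELECT sum(view) FROM "raw_data" WHERE time > now() - 1h GROUP BY time(1m)',
    'SELECT sum(view) FROM "downsampled.5m" WHERE time > now() - 6h GROUP BY time(5m)',
]

while True:
    for q in queries:
        try:
            client.query(q)
        except Exception as exc:  # a hang or timeout surfaces here
            print('query failed: %s' % exc)
    time.sleep(5)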

The monitor is active and I can give you some information from the _internal database. What would be interesting for you? As I said above, I can't see any error, warning or anything else in the log.

Here is some information about my configuration.


All CQs look like the example below. The CQs only differ in their GROUP BY clause (5m, 1h, 1d). In total, I have 6 CQs and 2 raw data measurements (3 CQs on each measurement). The first raw data measurement is filled every minute with approx. 1200 points, the second one with 160 points per minute.

CREATE CONTINUOUS QUERY CQ1_5m ON db1
BEGIN
  SELECT
    sum("integer-field1") AS A,
    sum("integer-field2") AS B,
    sum("float-field") AS C,
    sum("float-field") AS D,
    sum("integer-field1") AS E,
    sum("integer-field1") AS F,
    sum("integer-field1") AS G,
    sum("integer-field1") AS H,
    sum("integer-field1") AS I
  INTO "downsampled.5m"
  FROM "raw_data"
  GROUP BY time(5m), tag1, tag2, tag3, tag4, tag5, tag6, tag7
END;
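
Since the six CQs only differ in the grouping interval, they can be generated in a loop, roughly like this (a sketch; only two of the aggregated fields are shown and all names are placeholders):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='db1')

# Sketch: generate one CQ per downsampling interval for a raw measurement.
for interval in ('5m', '1h', '1d'):
    cq = ('CREATE CONTINUOUS QUERY CQ1_%s ON db1 BEGIN '
          'SELECT sum("integer-field1") AS A, sum("float-field") AS C '
          'INTO "downsampled.%s" FROM "raw_data" '
          'GROUP BY time(%s), tag1, tag2 END'
          % (interval, interval, interval))
    client.query(cq)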

System config:
Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty

InfluxDB 0.9.4.1 (stable) for Ubuntu
Single node instance

@markuspatrick markuspatrick changed the title [0.9.4.1] count() or count(distinct(field_key)) in CQ's results in endless loop/silent crash [0.9.4.1] CQ's run into a "deadlock" - especially by using count() or count(distinct(field_key)) Sep 22, 2015
@markuspatrick
Author

Next update - after 24 h of testing

What looks like a deadlock seems to be only "hanging" CQs at the small grouping level (5m).
After 24h, the downsampled measurements with 1h and 1d grouping contain correct data without lags.
In contrast, the downsampled measurements with 5m grouping contain a lot of lags. The lagging starts 2h after the start of the test. The lags are 20 to 40 minutes long, with about 20 minutes of correct data between them.
During the "lag time", read queries have quite long response times.

@markuspatrick
Author

I found a solution that works in my case and avoids hanging CQs:
1.) Drop all CQs with a grouping level smaller than 1h
2.) Give every measurement, or rather every time level, its own retention policy (for 1h and 1d)
3.) Disable internal monitoring (it writes into the db every 10s...)

With these restrictions, InfluxDB runs without hanging CQs or lagged data. Unfortunately, CQs with count() or count(distinct()) are still a "deadlock maker".
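
The retention policy part of the workaround (point 2) looks roughly like this (a sketch; the RP names and durations are my choice, and the exact [monitor] config keys for point 3 may differ slightly between versions):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='db1')

# One retention policy per downsampling level, then point the CQ's INTO
# clause at it (names and durations are just an example).
client.query('CREATE RETENTION POLICY "rp_1h" ON db1 DURATION 90d REPLICATION 1')
client.query('CREATE RETENTION POLICY "rp_1d" ON db1 DURATION INF REPLICATION 1')

client.query('CREATE CONTINUOUS QUERY CQ1_1h ON db1 BEGIN '
             'SELECT sum("integer-field1") AS A '
             'INTO "db1"."rp_1h"."downsampled.1h" FROM "raw_data" '
             'GROUP BY time(1h), tag1 END')

# Point 3 (disabling internal monitoring) is a config file change, roughly:
#   [monitor]
#     store-enabled = false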

Maybe it gets better (even without the restrictions above) with the new storage engine (#4086).

@toddboom
Contributor

Closing this out since I'm pretty sure it's been solved in the current releases. Please reopen if you still see this issue in v0.11.0 or higher.

@nanicpc

nanicpc commented Nov 2, 2017

I've been having the same problem with InfluxDB 1.2. Have you found a solution?
