
Search failing when cluster busy #99

Closed
clintongormley opened this issue Mar 31, 2010 · 13 comments

Comments

@clintongormley
Contributor

Hiya Shay

It turns out that the issue I was having earlier with NFS was a red herring. What seems to be happening is:

My process:

  • I'm reindexing old_index to new_index
  • I read 5,000 docs from the old index, then create each one in the new index
  • If there is an error, I delete new_index
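The three steps above can be sketched as follows. This is a minimal Python sketch, not the actual script from this thread: the dict-backed "indices" and the helper names are stand-ins for the real client calls.

```python
# Sketch of the reindex process described above, using in-memory dicts
# to stand in for the old and new indices. The window size (5,000) and
# the delete-on-error cleanup mirror the steps in the comment; all
# function and variable names here are hypothetical.

WINDOW = 5000

def reindex(old_index, new_index, create_doc):
    """Copy docs from old_index to new_index in windows of WINDOW docs.

    On any error, delete everything written to new_index so far:
    the cleanup step that, in this issue, fired on a spurious error.
    """
    docs = sorted(old_index.items())
    try:
        for start in range(0, len(docs), WINDOW):
            for doc_id, doc in docs[start:start + WINDOW]:
                create_doc(new_index, doc_id, doc)
    except Exception:
        new_index.clear()  # "if there is an error, I delete new_index"
        raise
```

The point of the sketch is only the control flow: any failure mid-copy (including a spurious one) triggers the deletion of the whole target index.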

So:

  • The cluster gets busy, and a search for the next 5,000 docs results in this error: `select failed: No child processes`.

  • This was triggering the cleanup in my script, which deleted the index.

  • It appears the index has been deleted by one node, while another node is still trying to write snapshot info for the (now deleted) index, which results in these errors:

    [14:48:09,948][WARN ][index.gateway ] [Nameless One][ia_object_1270046679][0] Failed to snapshot on close
    org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [ia_object_1270046679][0] Failed to append snapshot translog into [/opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3]
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:199)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:154)
    at org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:350)
    at org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:369)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:150)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.close(IndexShardGatewayService.java:176)
    at org.elasticsearch.index.service.InternalIndexService.deleteShard(InternalIndexService.java:244)
    at org.elasticsearch.index.service.InternalIndexService.close(InternalIndexService.java:159)
    at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:208)
    at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:185)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.google.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:108)
    at $Proxy19.deleteIndex(Unknown Source)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:178)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:193)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
    Caused by: java.io.FileNotFoundException: /opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3 (Stale NFS file handle)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.(RandomAccessFile.java:212)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:184)
    ... 20 more

Now, I'm catching the `select failed: No child processes` errors, sleeping for a few seconds, then trying again, and everything is working well.
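The catch-sleep-retry workaround can be sketched like this. Hypothetical names throughout: `do_search` stands in for whatever performs the HTTP search request in the real script.

```python
import time

def search_with_retry(do_search, retries=3, delay=2.0):
    """Retry a flaky search a few times, sleeping between attempts.

    do_search is whatever performs the HTTP search request; the
    transient cluster-busy failures surface as exceptions.
    """
    last_error = None
    for _ in range(retries):
        try:
            return do_search()
        except OSError as err:   # transient, cluster-busy failure
            last_error = err
            time.sleep(delay)    # give the cluster a moment to recover
    raise last_error
```

A short sleep between attempts is usually enough here, since the comments below note the cluster responds correctly again after a second or two.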

ta

clint

@kimchy
Member

kimchy commented Mar 31, 2010

What is this `select failed: No child processes` error? Very strange; there are no child processes in elasticsearch... Is it an exception that you get as part of the search request, originating from elasticsearch? Do you see a stack trace? If not, can you set `action: DEBUG` in the `logging.yml` file and reproduce it? I would love to see the stack trace...

@kimchy
Member

kimchy commented Mar 31, 2010

Just committed a fix (I hope) for the snapshot/delete index thingy.

@clintongormley
Contributor Author

'Fraid not :)

It logs that the index was deleted, but I see the directory growing for quite a long time after the delete happened.

@clintongormley
Contributor Author

... then eventually I get this in the logs, and the dir size drops to 44kB (but continues to exist):

```
[WARN ][index.gateway            ] [Blade][ia_object_1270136239][4] Failed to snapshot on close
org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [ia_object_1270136239][4] Failed to finalize index snapshot into [/opt/elasticsearch/data/iAnnounce/ia_object_1270136239/4/index/segments_4]
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:220)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:154)
    at org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:350)
    at org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:370)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:150)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$SnapshotRunnable.run(IndexShardGatewayService.java:214)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No such file or directory
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:883)
    at org.elasticsearch.util.lucene.Directories.copyFromDirectory(Directories.java:86)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:209)
    ... 14 more
```

@kimchy
Member

kimchy commented Apr 1, 2010

Can you try it again? I just pushed further validation that this will not happen. Just to make sure we're talking about the same thing: I am trying to fix the delete-index problem, which seems to collide with the scheduled snapshotting done for a shard.

@clintongormley
Contributor Author

Success+++ The index is deleted correctly, and no errors in the logs!

By the way, it turns out that the `select failed: no child processes` was a spurious error caused by a fault in my code. However, it was being thrown because of a real problem: it looks like, when the cluster is busy, sometimes it doesn't return anything.

There is nothing in the logs, I just get this from the module doing the HTTP request:
`Server closed connection without sending any data back`

It takes quite a while (around 500,000 reindexed records) before I get this error, so I'm not sure whether the current build still does it, but I'll rerun and let you know.

@kimchy
Member

kimchy commented Apr 1, 2010

Do you know on which operation you get this exception? There is an HTTP keep-alive mechanism that closes connections after 30s (this is configurable; see the `httpKeepAlive` option at http://www.elasticsearch.com/docs/elasticsearch/modules/http/netty/). I wonder if the value is too small, but it might be that the request is simply stuck on the server... If so, I would first love to understand which request it is...

@clintongormley
Contributor Author

It's not that the keep-alive is timing out - the HTTP module I'm using handles that gracefully. So it successfully makes the request, but then the ES server closes the connection.

The operation in question is a search:

```
{ query: { matchAll: {}}, sort: ["id"], from: 500000, size: 5000 }
```

The 500,000 is an example. If I wait a second or two, then the cluster responds correctly to the next request, although often there are a few such error messages close together.

There is nothing in the ES log.

@kimchy
Member

kimchy commented Apr 1, 2010

The keepAlive thingy will close a connection even if a response was not sent (I need to double-check that). Can you time your requests, and for the one that fails, print how long it took? If it's 30s, then that might be it...

@clintongormley
Contributor Author

Your thought was correct - it is timing out after 30 seconds. The request is as mentioned above, and as the nodes get busier, it takes 15+ seconds to execute.

I'm comfortable with a timeout of 30 seconds - seems a reasonable setting to me.

However, I'd consider changing the HTTP response code from 500 to 503 (Service Unavailable: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.4).

This kind of error is easy to catch and recover from, especially when you know to expect it (e.g. while hammering the cluster!).
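The recovery pattern a distinct 503 would enable can be sketched like this. A minimal Python sketch with hypothetical helper names, not ES client code: retry only when the server says it is temporarily unavailable, and fail fast on a genuine server error.

```python
# Sketch: a 503 is a transient "back off and try again" signal, while a
# 500 is treated as a real failure. The handler and retry callback are
# hypothetical stand-ins for whatever the client script actually does.

RETRYABLE = {503}  # Service Unavailable: safe to retry

def handle_response(status, body, retry):
    """Return the body on success, retry on 503, fail on anything else."""
    if status == 200:
        return body
    if status in RETRYABLE:
        return retry()  # transient: back off and issue the request again
    raise RuntimeError(f"search failed with HTTP {status}")
```

With a 500 for both cases, the client cannot distinguish "cluster busy, try again" from "something is actually broken", which is the point being made above.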

@kimchy
Member

kimchy commented Apr 2, 2010

Actually, I need to think more about how to handle this properly. There might be requests that take more than 30s (for example, optimization), and I would not want to close the connection on the client calling it...

Can you verify that the invocation actually returns in the end by changing the timeout to a higher value, for example `http.netty.httpKeepAlive: 5m`?
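For reference, the suggested override can go in the node configuration. A sketch, assuming the YAML config file and the setting name quoted in this thread:

```yaml
# elasticsearch.yml (sketch): raise the HTTP keep-alive timeout from
# the default 30s to 5 minutes while diagnosing slow requests
http.netty.httpKeepAlive: 5m
```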

@clintongormley
Contributor Author

Yeah, I can verify that it does eventually return, with results.

Keep-alive is really meant for idle connections rather than active requests. You could have a request timeout set much higher than the keep-alive.

@kimchy
Member

kimchy commented Apr 2, 2010

Exactly. I will see what I can do to fix this..., cheers!

rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.

You can run `StandaloneRunner` class using:

*  `-u file://URL/TO/YOUR/DOC`
*  `--size` to set the extracted size (defaults to the mapper attachment size)
*  a `BASE64`-encoded binary (passed directly)

Example:

```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```

It produces something like:

```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```

Closes elastic#99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Oct 2, 2023