Search failing when cluster busy #99
What is this `select failed: No child processes` error? Very strange, there are no child processes in elasticsearch... Is it an exception that you get as part of the search request that originates from elasticsearch? Do you see a stack trace? If not, can you set …
Just committed a fix (I hope) for the snapshot/delete index thingy.
'Fraid not :) It logs that the index was deleted, but I see the directory growing for quite a long time after the delete happened.
... then eventually I get this in the logs, and the dir size drops to 44kB (but continues to exist): …
Can you try it again? I just pushed a further validation that this will not happen. Just to make sure: I am trying to fix the delete index problem, which seems to collide with the scheduled snapshotting done for a shard.
Success+++ The index is deleted correctly, and no errors in the logs! By the way, it turns out that the … There is nothing in the logs; I just get this from the module doing the HTTP request: … It takes quite a while (like 500,000 reindexed records) before I get this error, so I'm not sure whether it still happens in the current build, but I'll rerun and let you know.
Do you know on which operation you get this exception? There is an HTTP keep-alive mechanism that closes connections after …
It's not that the keep-alive is timing out - the HTTP module I'm using handles that gracefully. So it successfully makes the request, but then the ES server closes the connection. The operation in question is a search: …
The 500,000 is an example. If I wait a second or two, then the cluster responds correctly to the next request, although often there are a few such error messages close together. There is nothing in the ES log.
The keepAlive thingy will close a connection even if a response was not sent (I need to double-check it). Can you time your requests, and for the one that fails, print how long it took? If it's …
Your thought was correct - it is timing out after 30 seconds. The request is as mentioned above, and as the nodes get busier, it takes 15 seconds plus to execute. I'm comfortable with a timeout of 30 seconds - it seems a reasonable setting to me. However, I'd consider changing the HTTP response code from 500 to 503 (Service Unavailable - http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.4). This kind of error is easy to catch and recover from, especially when you know to expect it (e.g. while hammering the cluster!)
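(For reference, one way to time each request as suggested above, assuming a Python client and a hypothetical local endpoint; neither appears in the thread:)

```python
import time
import requests  # assumed HTTP client; the thread does not name clint's module

start = time.monotonic()
try:
    # Hypothetical endpoint; the real index and query are not shown here.
    resp = requests.get("http://localhost:9200/myindex/_search", timeout=60)
    print(f"status {resp.status_code} after {time.monotonic() - start:.1f}s")
except requests.ConnectionError:
    # The server dropped the connection, e.g. at the ~30s keep-alive cutoff.
    print(f"connection closed after {time.monotonic() - start:.1f}s")
```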
Actually, I need to think more about how to handle this properly. There might be requests that take more than 30s (for example, optimization) and I would not want to close the connection on the client calling it... Can you verify that the invocation actually returns at the end by changing the timeout to a higher value? For example, set …
Yeah, I can verify that it does eventually return, with results. Keep-alive is really meant for idle connections, rather than active requests; you could have a request timeout set much higher than the keep-alive …
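(To illustrate the distinction, a sketch under the same assumed Python client: the per-request read timeout bounds an active request, while keep-alive only governs idle connections sitting in the pool.)

```python
import requests  # assumed HTTP client

session = requests.Session()  # reuses idle connections via keep-alive

# timeout=(connect, read): the read timeout bounds an *active* request,
# so a slow search may run for minutes even if idle connections get
# recycled after a few seconds of inactivity.
resp = session.get(
    "http://localhost:9200/myindex/_search",  # hypothetical endpoint
    timeout=(5, 300),  # 5s to connect, up to 5 minutes for the response
)
print(resp.status_code)
```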
Exactly. I will see what I can do to fix this... Cheers!
It could be sometimes useful to have a standalone runner to see how exactly Tika extracts content from a given file. You can run the `StandaloneRunner` class using:

* `-u file://URL/TO/YOUR/DOC`
* `--size` set extracted size (defaults to mapper attachment size)
* `BASE64` encoded binary

Example:

```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```

It produces something like:

```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------

## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```

Closes elastic#99.

(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
Relates elastic#40754
Relates elastic#99
Hiya Shay
It turns out that the issue I was having earlier with NFS was a red herring. What seems to be happening is:
My process: …

So:
the cluster gets busy, and a search for the next 5,000 docs results in this error: `select failed: No child processes`.
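(For concreteness, a minimal sketch of that kind of paged search; the thread never shows the real query, so the Python client, the page size parameterization, and from/size paging are all assumptions. The index name is taken from the log excerpt further down.)

```python
import requests  # assumed HTTP client; clint's actual module is not named

# Index name taken from the logs below; the query itself is hypothetical.
URL = "http://localhost:9200/ia_object_1270046679/_search"

def fetch_next_page(offset, size=5000):
    """Fetch the next batch of documents, paging with from/size."""
    resp = requests.get(URL, params={"from": offset, "size": size}, timeout=60)
    resp.raise_for_status()  # the busy-cluster failure surfaces here
    return resp.json()["hits"]["hits"]
```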
That error was triggering the cleanup in my script, which deleted the index.
It appears the index has been deleted by one node, while another node is still trying to write snapshot info for the (now deleted) index, which results in these errors:
```
[14:48:09,948][WARN ][index.gateway ] [Nameless One][ia_object_1270046679][0] Failed to snapshot on close
org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [ia_object_1270046679][0] Failed to append snapshot translog into [/opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3]
	at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:199)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:154)
	at org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:350)
	at org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:369)
	at org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:150)
	at org.elasticsearch.index.gateway.IndexShardGatewayService.close(IndexShardGatewayService.java:176)
	at org.elasticsearch.index.service.InternalIndexService.deleteShard(InternalIndexService.java:244)
	at org.elasticsearch.index.service.InternalIndexService.close(InternalIndexService.java:159)
	at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:208)
	at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:185)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.google.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:108)
	at $Proxy19.deleteIndex(Unknown Source)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:178)
	at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:193)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.FileNotFoundException: /opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3 (Stale NFS file handle)
	at java.io.RandomAccessFile.open(Native Method)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
	at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:184)
	... 20 more
```
Now I'm catching the `select failed: No child processes` errors, sleeping for a few seconds, then trying again, and everything is working well.

ta
clint
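(A minimal sketch of that catch-sleep-retry workaround, assuming a Python client; the thread never shows clint's script, so the function, parameters, and failure mapping below are illustrative only:)

```python
import time
import requests  # assumed HTTP client; the real script's module is not named

def search_with_retry(url, params, attempts=5, pause=3.0):
    """Retry a search that fails while the cluster is busy.

    The original script caught the 'select failed: No child processes'
    error from its HTTP layer; here a dropped connection or a 5xx status
    stands in for that condition.
    """
    for _ in range(attempts):
        try:
            resp = requests.get(url, params=params, timeout=60)
            if resp.status_code < 500:
                return resp.json()
        except requests.ConnectionError:
            pass  # dropped connection: treat like a busy-cluster failure
        time.sleep(pause)  # give the cluster a few seconds to recover
    raise RuntimeError(f"search failed after {attempts} attempts")
```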