
Search failing when cluster busy #99

Closed
clintongormley opened this issue Mar 31, 2010 · 13 comments

Comments

@clintongormley
Contributor

Hiya Shay

It turns out that the issue I was having earlier with NFS was a red herring. What seems to be happening is:

My process:

  • I'm reindexing old_index to new_index
  • I read 5,000 docs from the old index, then create each one in the new index
  • If there is an error, I delete new_index
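The three steps above can be sketched as follows. This is a minimal Python sketch, not the actual script from this thread: the dict-backed "indices" and the helper names are stand-ins for the real client calls.

```python
# Sketch of the reindex process described above, using in-memory dicts
# to stand in for the old and new indices. The window size (5,000) and
# the delete-on-error cleanup mirror the steps in the comment; all
# function and variable names here are hypothetical.

WINDOW = 5000

def reindex(old_index, new_index, create_doc):
    """Copy docs from old_index to new_index in windows of WINDOW docs.

    On any error, delete everything written to new_index so far:
    the cleanup step that, in this issue, fired on a spurious error.
    """
    docs = sorted(old_index.items())
    try:
        for start in range(0, len(docs), WINDOW):
            for doc_id, doc in docs[start:start + WINDOW]:
                create_doc(new_index, doc_id, doc)
    except Exception:
        new_index.clear()  # "if there is an error, I delete new_index"
        raise
```

The point of the sketch is only the control flow: any failure mid-copy (including a spurious one) triggers the deletion of the whole target index.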

So:

  • The cluster gets busy, and a search for the next 5,000 docs results in this error: `select failed: No child processes`.

  • This was triggering the cleanup in my script, which deleted the index.

  • It appears the index has been deleted by one node, while another node is still trying to write snapshot info for the (now deleted) index, which results in these errors:

    [14:48:09,948][WARN ][index.gateway ] [Nameless One][ia_object_1270046679][0] Failed to snapshot on close
    org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [ia_object_1270046679][0] Failed to append snapshot translog into [/opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3]
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:199)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:154)
    at org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:350)
    at org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:369)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:150)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.close(IndexShardGatewayService.java:176)
    at org.elasticsearch.index.service.InternalIndexService.deleteShard(InternalIndexService.java:244)
    at org.elasticsearch.index.service.InternalIndexService.close(InternalIndexService.java:159)
    at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:208)
    at org.elasticsearch.indices.InternalIndicesService.deleteIndex(InternalIndicesService.java:185)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.google.inject.internal.ConstructionContext$DelegatingInvocationHandler.invoke(ConstructionContext.java:108)
    at $Proxy19.deleteIndex(Unknown Source)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:178)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:193)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
    Caused by: java.io.FileNotFoundException: /opt/elasticsearch/data/iAnnounce/ia_object_1270046679/0/translog/translog-3 (Stale NFS file handle)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.(RandomAccessFile.java:212)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:184)
    ... 20 more

Now, I'm catching the `select failed: No child processes` errors, sleeping for a few seconds, then trying again, and everything is working well.
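The catch-sleep-retry workaround can be sketched like this. Hypothetical names throughout: `do_search` stands in for whatever performs the HTTP search request in the real script.

```python
import time

def search_with_retry(do_search, retries=3, delay=2.0):
    """Retry a flaky search a few times, sleeping between attempts.

    do_search is whatever performs the HTTP search request; the
    transient cluster-busy failures surface as exceptions.
    """
    last_error = None
    for _ in range(retries):
        try:
            return do_search()
        except OSError as err:   # transient, cluster-busy failure
            last_error = err
            time.sleep(delay)    # give the cluster a moment to recover
    raise last_error
```

A short sleep between attempts is usually enough here, since the comments below note the cluster responds correctly again after a second or two.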

ta

clint

@kimchy
Member

kimchy commented Mar 31, 2010

What is this `select failed: No child processes` error? Very strange; there are no child processes in elasticsearch... Is it an exception that you get as part of the search request, originating from elasticsearch? Do you see a stack trace? If not, can you set `action: DEBUG` in the `logging.yml` file and reproduce it? I would love to see the stack trace...

@kimchy
Member

kimchy commented Mar 31, 2010

Just committed a fix (I hope) for the snapshot/delete index thingy.

@clintongormley
Contributor Author

'Fraid not :)

It logs that the index was deleted, but I see the directory growing for quite a long time after the delete happened.

@clintongormley
Contributor Author

... then eventually I get this in the logs, and the dir size drops to 44kB (but continues to exist):

```
[WARN ][index.gateway            ] [Blade][ia_object_1270136239][4] Failed to snapshot on close
org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [ia_object_1270136239][4] Failed to finalize index snapshot into [/opt/elasticsearch/data/iAnnounce/ia_object_1270136239/4/index/segments_4]
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:220)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.snapshot(IndexShardGatewayService.java:154)
    at org.elasticsearch.index.engine.robin.RobinEngine.snapshot(RobinEngine.java:350)
    at org.elasticsearch.index.shard.service.InternalIndexShard.snapshot(InternalIndexShard.java:370)
    at org.elasticsearch.index.gateway.IndexShardGatewayService.snapshot(IndexShardGatewayService.java:150)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$SnapshotRunnable.run(IndexShardGatewayService.java:214)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: No such file or directory
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:883)
    at org.elasticsearch.util.lucene.Directories.copyFromDirectory(Directories.java:86)
    at org.elasticsearch.index.gateway.fs.FsIndexShardGateway.snapshot(FsIndexShardGateway.java:209)
    ... 14 more
```

@kimchy
Member

kimchy commented Apr 1, 2010

Can you try it again? I just pushed further validation that this will not happen. Just to make sure we're talking about the same thing: I am trying to fix the delete-index problem, which seems to collide with the scheduled snapshotting done for a shard.

@clintongormley
Contributor Author

Success+++ The index is deleted correctly, and no errors in the logs!

By the way, it turns out that the `select failed: no child processes` was a spurious error caused by a fault in my code. However, it was being thrown because of a real problem: it looks like, when the cluster is busy, sometimes it doesn't return anything.

There is nothing in the logs, I just get this from the module doing the HTTP request:
`Server closed connection without sending any data back`

It takes quite a while (around 500,000 reindexed records) before I get this error, so I'm not sure whether the current build still does it, but I'll rerun and let you know.

@kimchy
Member

kimchy commented Apr 1, 2010

Do you know on which operation you get this exception? There is an HTTP keep-alive mechanism that closes connections after 30s (this is configurable; see the `httpKeepAlive` option at http://www.elasticsearch.com/docs/elasticsearch/modules/http/netty/). I wonder if the value is too small, but it might be that the request is simply stuck on the server... If so, I would first love to understand which request it is...

@clintongormley
Contributor Author

It's not that the keep-alive is timing out - the HTTP module I'm using handles that gracefully. So it successfully makes the request, but then the ES server closes the connection.

The operation in question is a search:

```
{ query: { matchAll: {}}, sort: ["id"], from: 500000, size: 5000 }
```

The 500,000 is an example. If I wait a second or two, then the cluster responds correctly to the next request, although often there are a few such error messages close together.

There is nothing in the ES log.

@kimchy
Member

kimchy commented Apr 1, 2010

The keepAlive thingy will close a connection even if a response was not sent (I need to double-check that). Can you time your requests, and for the one that fails, print how long it took? If it's 30s, then that might be it...

@clintongormley
Contributor Author

Your thought was correct - it is timing out after 30 seconds. The request is as mentioned above, and as the nodes get busier, it takes 15+ seconds to execute.

I'm comfortable with a timeout of 30 seconds - seems a reasonable setting to me.

However, I'd consider changing the HTTP response code from 500 to 503 (Service Unavailable: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.4).

This kind of error is easy to catch and recover from, especially when you know to expect it (e.g. while hammering the cluster!).
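The recovery pattern a distinct 503 would enable can be sketched like this. A minimal Python sketch with hypothetical helper names, not ES client code: retry only when the server says it is temporarily unavailable, and fail fast on a genuine server error.

```python
# Sketch: a 503 is a transient "back off and try again" signal, while a
# 500 is treated as a real failure. The handler and retry callback are
# hypothetical stand-ins for whatever the client script actually does.

RETRYABLE = {503}  # Service Unavailable: safe to retry

def handle_response(status, body, retry):
    """Return the body on success, retry on 503, fail on anything else."""
    if status == 200:
        return body
    if status in RETRYABLE:
        return retry()  # transient: back off and issue the request again
    raise RuntimeError(f"search failed with HTTP {status}")
```

With a 500 for both cases, the client cannot distinguish "cluster busy, try again" from "something is actually broken", which is the point being made above.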

@kimchy
Member

kimchy commented Apr 2, 2010

Actually, I need to think more about how to handle this properly. There might be requests that take more than 30s (for example, optimization), and I would not want to close the connection on the client calling it...

Can you verify that the invocation actually returns in the end by changing the timeout to a higher value, for example `http.netty.httpKeepAlive: 5m`?
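For reference, the suggested override can go in the node configuration. A sketch, assuming the YAML config file and the setting name quoted in this thread:

```yaml
# elasticsearch.yml (sketch): raise the HTTP keep-alive timeout from
# the default 30s to 5 minutes while diagnosing slow requests
http.netty.httpKeepAlive: 5m
```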

@clintongormley
Contributor Author

Yeah, I can verify that it does eventually return, with results.

Keep-alive is really meant for idle connections rather than active requests. You could have a request timeout set much higher than the keep-alive.

@kimchy
Member

kimchy commented Apr 2, 2010

Exactly. I will see what I can do to fix this..., cheers!

rmuir pushed a commit to rmuir/elasticsearch that referenced this issue Nov 8, 2015
It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file.

You can run `StandaloneRunner` class using:

*  `-u file://URL/TO/YOUR/DOC`
*  `--size` to set the extracted size (defaults to the mapper attachment size)
*  a `BASE64`-encoded binary (passed directly)

Example:

```sh
StandaloneRunner BASE64Text
StandaloneRunner -u /tmp/mydoc.pdf
StandaloneRunner -u /tmp/mydoc.pdf --size 1000000
```

It produces something like:

```
## Extracted text
--------------------- BEGIN -----------------------
This is the extracted text
---------------------- END ------------------------
## Metadata
- author: null
- content_length: null
- content_type: application/pdf
- date: null
- keywords: null
- language: null
- name: null
- title: null
```

Closes elastic#99.
(cherry picked from commit 720b3bf)
(cherry picked from commit 990fa15)
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Oct 2, 2023