
arakoon --remote-collapse failing on arakoon version 1.6.10 #428

Open

michael-arion opened this issue May 23, 2014 · 11 comments


@michael-arion

Hi,

We have a 3-node arakoon cluster that was recently upgraded to version 1.6.10.
Remote collapsing is not working on it.

Output of dump-store of the head.db:

root@CCAUMAN1:~# arakoon --dump-store /opt/tlogs/Marketplace/head.db
i: Some ("99279767")
master: Some(Marketplace_2,0)
routing : --
interval: {(_,_),(_,_)}

I have all tlogs starting from 992.tlf:

root@CCAUMAN1:~# ls -ltrh /opt/tlogs/Marketplace/
total 3.7G
-rw-r--r-- 1 root root 7.4M 2014-05-13 03:20 992.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-13 12:31 993.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-13 21:37 994.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-14 06:48 995.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-14 15:54 996.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 01:04 997.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 10:11 998.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 19:17 999.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 04:27 1000.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 13:34 1001.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 22:40 1002.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-17 07:51 1003.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-17 16:56 1004.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 02:07 1005.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 11:17 1006.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 20:23 1007.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 05:31 1008.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 14:36 1009.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 23:46 1010.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-20 08:56 1011.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-20 18:02 1012.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 03:10 1013.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 12:20 1014.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 21:26 1015.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-22 06:36 1016.tlf
-rw-r--r-- 1 root root 1.8G 2014-05-22 12:33 head.db
-rw-r--r-- 1 root root  37M 2014-05-22 14:00 1017.tlog

Still, I get an error when trying a collapse-remote call:

root@CCAUMAN1:~# arakoon --collapse-remote marketplace 10.74.8.130 7081 3
arakoon: main: remote_collapsing_failed: Failure("unknown failure (EC: 255)")

This is the relevant info in the arakoon server log during the collapse-remote client call:

May 22 09:53:39 8335: (client_protocol|info): connection=10.74.8.130:client_service_23036 COLLAPSE_TLOGS: n=3
May 22 09:53:39 8335: (client_protocol|info): ... Start collapsing ... (n=3)
May 22 09:53:39 8336: (main|info): Starting collapse
May 22 09:53:39 8345: (main|info): Creating local store at /opt/tlogs/Marketplace/head.db
May 22 09:53:39 8345: (main|info): returning assuming no I "/opt/tlogs/Marketplace/head.db": Failure("success")
May 22 09:53:39 8346: (main|info): Going to collapse 1014 tlogs
May 22 09:53:39 8356: (main|info): copy_file /opt/tlogs/Marketplace/head.db /opt/tlogs/Marketplace/head.db.clone
May 22 09:53:43 0370: (main|info): 10.74.8.130:client_service:session=2 connection=10.74.8.130:client_service_23037 socket_address=ADDR_INET 10.74.8.132,42279 file_descriptor_inode=1159833830
May 22 09:53:43 0396: (main|info): exiting session (2) connection=10.74.8.130:client_service_23037: End_of_file
May 22 09:53:43 4503: (main|info): Marketplace_2 is master
May 22 09:53:50 0873: (main|info): done: copy_file
May 22 09:53:50 0875: (main|info): rename /opt/tlogs/Marketplace/head.db.clone.tmp -> /opt/tlogs/Marketplace/head.db.clone
May 22 09:53:57 4566: (main|info): Creating local store at /opt/tlogs/Marketplace/head.db.clone
May 22 09:53:57 4568: (client_protocol|error): Exception during client request (Failure("success")) => rc:ff msg:unknown failure
May 22 09:53:57 4570: (main|info): exiting session (1) connection=10.74.8.130:client_service_23036: Lwt_io.Channel_closed("output")
May 22 09:53:57 4570: (main|info): Exception on closing of socket (connection=10.74.8.130:client_service_23036): Unix.Unix_error(Unix.EBADF, "check_descriptor", "")

When checking the files in my tlog dir, I see that the copy of head.db to head.db.clone completed, but head.db itself is untouched.

Thx for looking into this.
Michael Van Wesenbeeck

@NicolasT
Contributor

Could you

  • Run the node in debug mode, then try to collapse, and provide that log output
  • Run arakoon --dump-store on head.db.clone after it was created
  • Calculate the MD5 checksum of head.db and head.db.clone and check whether these are the same

@michael-arion
Author

root@CCAUMAN3:~# vi /opt/qbase5/cfg/qconfig/arakoon/marketplace/marketplace.cfg
root@CCAUMAN3:~# grep -C3 github /opt/qbase5/cfg/qconfig/arakoon/marketplace/marketplace.cfg
ip = 10.74.8.132
client_port = 7081
messaging_port = 10001

# https://github.com/Incubaid/arakoon/issues/428

log_level = debug
log_dir = /opt/qbase5/var/log/arakoon/Marketplace
home = /opt/ssd/arakoon/Marketplace
root@CCAUMAN3:~# ps -ef | grep daemonize
root 18780 5105 0 15:31 pts/14 00:00:00 grep --color=auto daemonize
root 21738 1 0 May07 ? 00:16:45 arakoon --node arakoon_3 -config /opt/qbase5/cfg/qconfig/arakoon/arakoon/arakoon.cfg -daemonize
root 23138 1 0 May07 ? 00:49:33 arakoon --node Marketplace_3 -config /opt/qbase5/cfg/qconfig/arakoon/marketplace/marketplace.cfg -daemonize
root 23999 1 0 May07 ? 00:11:35 arakoon --node Marketplacefe_3 -config /opt/qbase5/cfg/qconfig/arakoon/marketplacefe/marketplacefe.cfg -daemonize
root@CCAUMAN3:~# kill -USR1 23138
=> now in debug mode.
Dump-store from before collapse:
root@CCAUMAN3:~# arakoon --dump-store /opt/tlogs/Marketplace/head.db
i: Some ("99279767")
master: Some(Marketplace_2,0)
routing : --
interval: {(_,_),(_,_)}
root@CCAUMAN3:~# ls -ltrh /opt/tlogs/Marketplace/
total 3.7G
-rw-r--r-- 1 root root 7.4M 2014-05-13 03:20 992.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-13 12:31 993.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-13 21:37 994.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-14 06:48 995.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-14 15:54 996.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 01:04 997.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 10:11 998.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 19:17 999.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 04:27 1000.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 13:34 1001.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 22:40 1002.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-17 07:51 1003.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-17 16:56 1004.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 02:07 1005.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 11:17 1006.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 20:23 1007.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 05:31 1008.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 14:36 1009.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 23:46 1010.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-20 08:56 1011.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-20 18:02 1012.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 03:10 1013.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 12:20 1014.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 21:26 1015.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-22 06:36 1016.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-22 15:42 1017.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-23 00:58 1018.tlf
-rw-r--r-- 1 root root 1.8G 2014-05-23 03:03 head.db.clone
-rw-r--r-- 1 root root 7.4M 2014-05-23 10:09 1019.tlf
-rw-r--r-- 1 root root 1.8G 2014-05-23 15:33 head.db
-rw-r--r-- 1 root root 27M 2014-05-23 15:34 1020.tlog
root@CCAUMAN3:~#

=> I run the remote-collapse on the node itself (ip 10.74.8.132)
root@CCAUMAN3:~# arakoon --collapse-remote marketplace 10.74.8.132 7081 3
arakoon: main: remote_collapsing_failed: Failure("unknown failure (EC: 255)")
root@CCAUMAN3:~# ls -ltrh /opt/tlogs/Marketplace/
total 3.7G
-rw-r--r-- 1 root root 7.4M 2014-05-13 03:20 992.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-13 12:31 993.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-13 21:37 994.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-14 06:48 995.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-14 15:54 996.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 01:04 997.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 10:11 998.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-15 19:17 999.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 04:27 1000.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 13:34 1001.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-16 22:40 1002.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-17 07:51 1003.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-17 16:56 1004.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 02:07 1005.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 11:17 1006.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-18 20:23 1007.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 05:31 1008.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 14:36 1009.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-19 23:46 1010.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-20 08:56 1011.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-20 18:02 1012.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 03:10 1013.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 12:20 1014.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-21 21:26 1015.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-22 06:36 1016.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-22 15:42 1017.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-23 00:58 1018.tlf
-rw-r--r-- 1 root root 7.4M 2014-05-23 10:09 1019.tlf
-rw-r--r-- 1 root root 1.8G 2014-05-23 15:33 head.db
-rw-r--r-- 1 root root 1.8G 2014-05-23 15:37 head.db.clone
-rw-r--r-- 1 root root 27M 2014-05-23 15:38 1020.tlog

=> now turning off debug mode again:
root@CCAUMAN3:~# vi /opt/qbase5/cfg/qconfig/arakoon/marketplace/marketplace.cfg
root@CCAUMAN3:~# ps -ef | grep daemonize
root 21738 1 0 May07 ? 00:16:45 arakoon --node arakoon_3 -config /opt/qbase5/cfg/qconfig/arakoon/arakoon/arakoon.cfg -daemonize
root 23138 1 0 May07 ? 00:49:42 arakoon --node Marketplace_3 -config /opt/qbase5/cfg/qconfig/arakoon/marketplace/marketplace.cfg -daemonize
root 23835 5105 0 15:38 pts/14 00:00:00 grep --color=auto daemonize
root 23999 1 0 May07 ? 00:11:35 arakoon --node Marketplacefe_3 -config /opt/qbase5/cfg/qconfig/arakoon/marketplacefe/marketplacefe.cfg -daemonize
root@CCAUMAN3:~# kill -USR1 23138
root@CCAUMAN3:~# grep "May 23 15" /opt/qbase5/var/log/arakoon/Marketplace/Marketplace_3.log > arakoon_Marketplace_debug.log
root@CCAUMAN3:~# less arakoon_Marketplace_debug.log
root@CCAUMAN3:~# arakoon --dump-store /opt/tlogs/Marketplace/head.db
i: Some ("99279767")
master: Some(Marketplace_2,0)
routing : --
interval: {(_,_),(_,_)}
root@CCAUMAN3:~# arakoon --dump-store /opt/tlogs/Marketplace/head.db.clone
i: Some ("99279767")
master: Some(Marketplace_2,0)
routing : --
interval: {(_,_),(_,_)}
root@CCAUMAN3:~# md5sum /opt/tlogs/Marketplace/head.db
d125bdf8ebc5d651e97d47b7ad753fba /opt/tlogs/Marketplace/head.db
root@CCAUMAN3:~# md5sum /opt/tlogs/Marketplace/head.db.clone
d125bdf8ebc5d651e97d47b7ad753fba /opt/tlogs/Marketplace/head.db.clone
root@CCAUMAN3:~#

@michael-arion
Author

Hi Nicolas,

Attached is the debug logfile from during the collapse-remote.

Regards,
Michael Van Wesenbeeck
Belgacom Cloud Services Operations


@NicolasT
Contributor

GitHub 'Issues' doesn't keep e-mail attachments. Could you post it as a Gist?

@michael-arion
Author

@NicolasT
Contributor

Could you provide the output of arakoon --version and tcbmgr inform /path/to/head.db?

@michael-arion
Author

The version we had before the upgrade:
root@CCAUMAN3:~# arakoon --version
version: 1.6.5
git_revision: "1.6.5-0-ga879267-dirty"
compiled: "01/10/2013 15:47:27 UTC"
machine: "ubuntu 3.0.0-32-generic x86_64 x86_64 x86_64 GNU/Linux"
tlogEntriesPerFile: 100000
dependencies:
camltc 0.7.0 OCaml bindings for tokyo cabinet
lwt 2.4.3 A cooperative threads library for OCaml
ounit 2.0.0 Unit testing framework loosely based on HUnit. It is similar to JUnit, and other XUnit testing frameworks

The current version:
root@CCAUMAN3:~# arakoon --version
version: 1.6.10
git_revision: "tags/1.6.10-0-ged8263b"
compiled: "04/04/2014 11:46:38 UTC"
machine: "jenkins-arakoon-git-build-and-package-18 3.5.0-42-generic x86_64 x86_64 x86_64 GNU/Linux"
tlogEntriesPerFile: 100000
dependencies:
camltc 0.9.0 Tokyo Cabinet bindings for OCaml.
lwt 2.4.4 A cooperative threads library for OCaml
ounit 2.0.0 Unit testing framework loosely based on HUnit. It is similar to JUnit, and other XUnit testing frameworks

and the tcbmgr inform:
root@CCAUMAN3:~# /opt/qbase5/apps/arakoon/bin/tcbmgr inform /opt/tlogs/Marketplace/head.db
path: /opt/tlogs/Marketplace/head.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 128
max node member: 256
leaf number: 151647
node number: 954
bucket number: 229373
alignment: 256
free block pool: 1024
inode number: 917749
modified time: 2014-05-23T14:46:00+01:00
options: large
record number: 218798
file size: 1827966976

@NicolasT
Contributor

For reference: should look into toolslive/camltc@a28c703 & check whether tcbdbsetmutex could report failure without setting an error code on the bdb reference for some reason.

@NicolasT
Contributor

Since camltc 0.8.1, tcbdbsetmutex is called whenever a database is created (toolslive/camltc@a28c703). This call, as implemented in TokyoCabinet, can fail without setting any error code, see https://github.com/Incubaid/tokyocabinet/blob/2eed757b82bd2854c729ad5f4c8cc82169f4bc51/tcbdb.c#L212: when pthread_rwlock_init or pthread_mutex_init fails, the procedure returns false without calling tcbdbsetecode.

Whilst this might explain what you're experiencing, it's hard to know why these calls would fail on your system. According to the man pages, EAGAIN, ENOMEM or EPERM are most likely (EBUSY and EINVAL are not applicable).
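
To make that failure mode concrete, here is a condensed paraphrase of the relevant path in tcbdb.c (cleanup of the allocated locks elided; see the linked source for the authoritative version):

/* condensed paraphrase of tcbdbsetmutex() from tcbdb.c, not verbatim */
bool tcbdbsetmutex(TCBDB *bdb){
  bool err = false;
  /* allocate the method rwlock and the cache mutex */
  TCMALLOC(bdb->mmtx, sizeof(pthread_rwlock_t));
  TCMALLOC(bdb->cmtx, sizeof(pthread_mutex_t));
  if(pthread_rwlock_init(bdb->mmtx, NULL) != 0) err = true;
  if(pthread_mutex_init(bdb->cmtx, NULL) != 0) err = true;
  if(err){
    /* failure path (lock cleanup elided): returns false WITHOUT
       calling tcbdbsetecode(), so the last error code on the
       handle still reads "success" */
    return false;
  }
  return tchdbsetmutex(bdb->hdb);
}

That would match the Failure("success") in the server log above: the call reports failure, but the handle's last error code was never updated from "success".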

@michael-arion
Author

What do you suggest to debug this further? Run a node under strace?
Thx.

@NicolasT
Contributor

It's not necessarily a system call going wrong, so strace might not help. I'd rather run the node under gdb, break in tcbdbsetmutex, then step through it to check the value of err after the 2 pthread_*_init calls. Note you might enter tcbdbsetmutex a couple of times (at startup and after sending the collapse command).

Something like

$ gdb ./arakoon.native
...
Reading symbols from ./arakoon.native...done.
(gdb) break tcbdbsetmutex 
Breakpoint 1 at 0x71d950: file tcbdb.c, line 212.
(gdb) run --node arakoon_0
Starting program: /home/nicolas/Projects/arakoon/arakoon.native --node arakoon_0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff1451700 (LWP 2789)]
[New Thread 0x7ffff0c50700 (LWP 2790)]

Breakpoint 1, tcbdbsetmutex (bdb=0xb1fe20) at tcbdb.c:212
212 bool tcbdbsetmutex(TCBDB *bdb){
...
(gdb) next
215   if(bdb->mmtx || bdb->open){
(gdb) 
219   TCMALLOC(bdb->mmtx, sizeof(pthread_rwlock_t));
(gdb) 
220   TCMALLOC(bdb->cmtx, sizeof(pthread_mutex_t));
(gdb) 
222   if(pthread_rwlock_init(bdb->mmtx, NULL) != 0) err = true;
(gdb) 
223   if(pthread_mutex_init(bdb->cmtx, NULL) != 0) err = true;
(gdb) p err
$1 = false
(gdb) next
222   if(pthread_rwlock_init(bdb->mmtx, NULL) != 0) err = true;
(gdb) 
223   if(pthread_mutex_init(bdb->cmtx, NULL) != 0) err = true;
(gdb) 
224   if(err){
(gdb) next
231   return tchdbsetmutex(bdb->hdb);
(gdb) next
232 }
(gdb) continue
Continuing.

Note you should adjust the arguments passed to run (but don't pass -daemonize) according to the way you launch Arakoon. You might get slightly different output than the above; if so, print err more often.
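
If stepping by hand gets tedious, you could also attach a command list to a breakpoint on the if(err) check, so gdb prints err automatically each time it is reached — a sketch, assuming the same tcbdb.c line numbers as in the session above:

(gdb) break tcbdb.c:224
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
> silent
> print err
> continue
> end
(gdb) run --node arakoon_0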

NicolasT added a commit to Incubaid/tokyocabinet that referenced this issue May 27, 2014
Failures in `pthread_rwlock_init` or `pthread_mutex_init` called from
`tcbdbsetmutex` could result in the call returning failure (as desired),
but not setting an error code on the database object. This results in
applications reporting failures with error message 'success'.

It's unclear why `pthread_rwlock_init` or `pthread_mutex_init` would fail
though.

See: Incubaid/arakoon#428
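
For reference, such a fix presumably amounts to setting an error code before bailing out of the error branch of tcbdbsetmutex; a hypothetical sketch (the actual commit in Incubaid/tokyocabinet may differ in detail):

  if(err){
    /* hypothetical sketch of the fix: record TokyoCabinet's generic
       "miscellaneous error" code so callers no longer observe a stale
       "success"; the actual commit may differ */
    tcbdbsetecode(bdb, TCEMISC, __FILE__, __LINE__, __func__);
    return false;
  }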