smartos server crashes every hour #554

Open · aortmannm opened this issue Feb 5, 2016 · 16 comments

@aortmannm

Hi,

we have a SmartOS server that has been crashing every hour since Jan 15 21:08 (picture attached). IPMI shows no downtime, so it is not a power-supply problem.

[screenshot: core-dump]

The system had been stable for a year, with no problems and nothing mysterious.
Here is part of the core dump; we can upload the whole thing if needed.

Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm sata mpt_sas crypto random cpc logindmux ptm sppp nfs ipc ]
> ::status
debugging crash dump vmcore.498 (64-bit) from 0c-c4-7a-18-3a-c0
operating system: 5.11 joyent_20150219T102159Z (i86pc)
image uuid: (not set)
panic message:
assertion failed: 0 == dmu_object_info(bpo->bpo_os, subsubobjs, &doi) (0x0 == 0x2), file: ../../common/fs/zfs/bpobj.c, line: 422
dump content: kernel pages only

SmartOS Version: 20150709T171818Z
Mainboard: X9DRD-7LN4F(-JBOD)/X9DRD-EF
CPU: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
RAM: Samsung 16GB DDR3-1866 dual-rank ECC registered DIMM
HDDs: several HGST HUS726060AL5210 disks

Hope to find a fix for this problem.

@rmustacc (Contributor) commented Feb 5, 2016

We'll probably want to start with getting the full stack trace (the $C command in mdb) and/or uploading a dump. I'd also highly recommend you upgrade from Feb 2015 to a more recent release to help rule out various issues that have already been fixed.

@matthiasg

It seems to be related to deleting snapshots. We are using a service called zsnapper, which creates snapshots at regular intervals and automatically deletes older ones; that explains why it happens so regularly. We will now try a newer release of SmartOS to find out whether this has been fixed, since deleting a snapshot obviously shouldn't cause a reboot.

[root@0c-c4-7a-18-3a-c0 /var/crash/volatile]# mdb -k unix.498 vmcore.498
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm sata mpt_sas crypto random cpc logindmux ptm sppp nfs ipc ]
> $C
ffffff00f78cf650 vpanic()
ffffff00f78cf6a0 0xfffffffffba6be7d()
ffffff00f78cf7b0 bpobj_enqueue_subobj+0x35a(ffffff21f1abb150, 28377,
ffffff21d32f25c0)
ffffff00f78cf8b0 dsl_deadlist_move_bpobj+0x14e(ffffff2200c7a9c0,
ffffff21f1abb150, 1, ffffff21d32f25c0)
ffffff00f78cf970 dsl_destroy_snapshot_sync_impl+0x23a(ffffff21f9bb3d00, 0,
ffffff21d32f25c0)
ffffff00f78cf9d0 dsl_destroy_snapshot_sync+0x67(ffffff00f55caaf0,
ffffff21d32f25c0)
ffffff00f78cfa10 dsl_sync_task_sync+0x10a(ffffff00f55caa10, ffffff21d32f25c0)
ffffff00f78cfaa0 dsl_pool_sync+0x28b(ffffff21f1abb080, 14127b6)
ffffff00f78cfb70 spa_sync+0x27e(ffffff21f210b000, 14127b6)
ffffff00f78cfc20 txg_sync_thread+0x227(ffffff21f1abb080)
ffffff00f78cfc30 thread_start+8()

@matthiasg

We tried it with the newest release and it is still happening. We can reproduce it by deleting all snapshots on the ZFS filesystem, which causes the system to crash after reaching a certain point.

Anything else we can try?

@rmustacc (Contributor) commented Feb 8, 2016

Would it be possible to get a crash dump uploaded? We can also help reach out to the ZFS folks to see if they've seen anything like this.

@aortmannm (Author)

Here are the dump files. Before the crashes started, we ran a "zfs send | ssh zfs recv" from another server, which we had to stop; we don't know if that created this problem.

http://downloads.curasystems.de/vmdump.520.tgz
http://downloads.curasystems.de/unix.520.tgz

@matthiasg

@rmustacc did anybody look at this? Is there anything we can do ourselves?

@rmustacc (Contributor)

Apologies, I've been swamped and haven't been able to look at this personally; sorry I've dropped the ball on this issue. I know it must be quite frustrating. I'd suggest mailing the details to the ZFS developer lists, where there are folks who are more knowledgeable about these kinds of ZFS issues and will hopefully be able to provide guidance a bit faster than we can.

@aortmannm (Author)

Which ZFS group should we raise this issue with? We now have a second machine where the problem exists; it restarted every 5 minutes until we deactivated zsnapper.

@jclulow (Contributor) commented Feb 22, 2016

You could try the OpenZFS mailing list: http://open-zfs.org/wiki/Mailing_list

@stevenburgess

Your quote "which will cause the system to crash after reaching a certain point" makes me wonder if it's this issue: basically, trying to delete too many snapshots at once holds open the next TXG, which, depending on write workload, can fill some buffers and, in our case, lead to crashes.

You can check on this by figuring out what zsnapper would be deleting; if any of the datasets or snapshot ranges passed to zfs destroy contain too many snapshots, try breaking the destroy down into smaller pieces, as sketched below.
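
For illustration only, here is a minimal sketch of that batching idea using libzfs_core's lzc_destroy_snaps(); the pool and snapshot names are made up, and in practice you can get the same effect by simply passing fewer snapshots to each zfs destroy invocation.

/*
 * Hedged sketch: destroy snapshots in small batches rather than
 * handing one huge range to a single destroy.  Snapshot names are
 * hypothetical.  Build with: cc batch.c -lzfs_core -lnvpair
 */
#include <stdio.h>
#include <libzfs_core.h>
#include <libnvpair.h>

int
main(void)
{
	/* Hypothetical list of snapshots a zsnapper-like tool would remove. */
	const char *snaps[] = {
		"tank/vm01@zsnap-2016-02-01",
		"tank/vm01@zsnap-2016-02-02",
		"tank/vm01@zsnap-2016-02-03",
		"tank/vm01@zsnap-2016-02-04",
	};
	const int nsnaps = sizeof (snaps) / sizeof (snaps[0]);
	const int batch = 2;		/* keep each destroy small */

	if (libzfs_core_init() != 0) {
		perror("libzfs_core_init");
		return (1);
	}

	for (int i = 0; i < nsnaps; i += batch) {
		nvlist_t *todo = fnvlist_alloc();
		nvlist_t *errlist = NULL;

		for (int j = i; j < i + batch && j < nsnaps; j++)
			fnvlist_add_boolean(todo, snaps[j]);

		/* B_FALSE: do not defer the destroy. */
		if (lzc_destroy_snaps(todo, B_FALSE, &errlist) != 0)
			(void) fprintf(stderr, "batch at %d failed\n", i);

		fnvlist_free(todo);
		if (errlist != NULL)
			nvlist_free(errlist);
	}

	libzfs_core_fini();
	return (0);
}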

@rmustacc (Contributor)

From the dump, we're not holding open a single txg for a long time. Instead, the issue is that ZFS expects some bit of data to be present that isn't there. @ahrens, have you ever seen issues like the ones in the dump above?

@matthiasg

By the way, we now have a second server exhibiting the same issue, this time a live customer machine. It was rebooting every minute due to creating and deleting snapshots. Disabling the zsnapper service fixed the reboots, but we are now stuck with another system where we cannot delete certain snapshots anymore. We will post to the ZFS groups too.

@ahrens commented Feb 23, 2016

It's possible that this is the fallout from hitting bug 3603, which was fixed a few years ago. Have you run this pool without the following fix? A workaround for this problem would be to add code to bpobj_enqueue_subobj to ignore the subsubobjs if it does not exist (i.e. dmu_object_info() returns ENOENT, as it has here). This would leak some space (perhaps a very small amount of space) but allow you to recover all the blocks that can be found.

commit d04756377ddd1cf28ebcf652541094e17b03c889
Author: Matthew Ahrens <mahrens@delphix.com>
Date: Mon Mar 4 12:27:52 2013 -0800

3603 panic from bpobj_enqueue_subobj()
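
For illustration, a rough sketch of the workaround described above. This is not the actual illumos change; the variable names (bpo, subsubobjs, doi) are taken from the panic message rather than the exact bpobj.c source.

/*
 * Hedged sketch of the suggested workaround in bpobj_enqueue_subobj():
 * where the code currently asserts that the nested subobj list exists,
 * tolerate ENOENT and skip it instead of panicking.
 */
dmu_object_info_t doi;
int err = dmu_object_info(bpo->bpo_os, subsubobjs, &doi);

if (err == 0) {
	/* ... existing path: fold the sub-subobj entries into bpo ... */
} else if (err == ENOENT) {
	/*
	 * The sub-subobj list object is missing, likely damage left
	 * behind by an earlier bug such as 3603.  Skip it: this leaks
	 * whatever space those entries described, but the snapshot
	 * destroy can complete instead of tripping the
	 * "0 == dmu_object_info(...)" assertion.
	 */
} else {
	VERIFY0(err);	/* any other error remains fatal */
}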

@aortmannm (Author)

We never ran such an old SmartOS version on these two servers.

@aortmannm (Author)

We updated today to the newest version, 20160317T000621Z.

The problem still exists on both servers. Does nobody really recognize this problem, or can someone help investigate it?

Here is the new dump from the newest version:
http://downloads.curasystems.de/vmdump.538.tgz

@Adel-Magebinary

This problem is more likely related to your RAID controller. Make sure you are using a real HBA (like the Dell HBA330) rather than a logical RAID controller.
