smartos server crashes every hour #554

Open · aortmannm opened this issue Feb 5, 2016 · 16 comments

@aortmannm

Hi,

we have a SmartOS server that has been crashing every hour since Jan 15 21:08 (picture attached). IPMI shows no downtime, so it is not a power-supply problem.

[screenshot: core-dump]

The system had been stable for a year, with no problems and nothing mysterious.
Here is part of the core dump; we can upload the whole thing if needed.

Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm sata mpt_sas crypto random cpc logindmux ptm sppp nfs ipc ]
> ::status
debugging crash dump vmcore.498 (64-bit) from 0c-c4-7a-18-3a-c0
operating system: 5.11 joyent_20150219T102159Z (i86pc)
image uuid: (not set)
panic message:
assertion failed: 0 == dmu_object_info(bpo->bpo_os, subsubobjs, &doi) (0x0 == 0x2), file: ../../common/fs/zfs/bpobj.c, line: 422
dump content: kernel pages only

SmartOS Version: 20150709T171818Z
Mainboard: X9DRD-7LN4F(-JBOD)/X9DRD-EF
CPU: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
RAM: Samsung 16GB DDR3-1866 dual-rank ECC registered DIMM
HDDs: several HGST HUS726060AL5210 disks

Hope to find a fix for this problem.

@rmustacc (Contributor) commented Feb 5, 2016

We'll probably want to start with getting the full stack trace (the $C command in mdb) and/or uploading a dump. I'd also highly recommend you upgrade from Feb 2015 to a more recent release to help rule out various issues that have already been fixed.

@matthiasg

It seems to be related to deleting snapshots. We are using a service called zsnapper, which creates snapshots at regular intervals and automatically deletes older ones; that explains why it happens so regularly. We will now try a newer release of SmartOS to find out whether this has been fixed, since deleting a snapshot obviously shouldn't cause a reboot.

[root@0c-c4-7a-18-3a-c0 /var/crash/volatile]# mdb -k unix.498 vmcore.498
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm sata mpt_sas crypto random cpc logindmux ptm sppp nfs ipc ]
> $C
ffffff00f78cf650 vpanic()
ffffff00f78cf6a0 0xfffffffffba6be7d()
ffffff00f78cf7b0 bpobj_enqueue_subobj+0x35a(ffffff21f1abb150, 28377,
ffffff21d32f25c0)
ffffff00f78cf8b0 dsl_deadlist_move_bpobj+0x14e(ffffff2200c7a9c0,
ffffff21f1abb150, 1, ffffff21d32f25c0)
ffffff00f78cf970 dsl_destroy_snapshot_sync_impl+0x23a(ffffff21f9bb3d00, 0,
ffffff21d32f25c0)
ffffff00f78cf9d0 dsl_destroy_snapshot_sync+0x67(ffffff00f55caaf0,
ffffff21d32f25c0)
ffffff00f78cfa10 dsl_sync_task_sync+0x10a(ffffff00f55caa10, ffffff21d32f25c0)
ffffff00f78cfaa0 dsl_pool_sync+0x28b(ffffff21f1abb080, 14127b6)
ffffff00f78cfb70 spa_sync+0x27e(ffffff21f210b000, 14127b6)
ffffff00f78cfc20 txg_sync_thread+0x227(ffffff21f1abb080)
ffffff00f78cfc30 thread_start+8()

@matthiasg

We tried it with the newest release and it is still happening. We can reproduce it by deleting all snapshots on the ZFS filesystem, which causes the system to crash after reaching a certain point.

Anything else we can try?

@rmustacc (Contributor) commented Feb 8, 2016

Would it be possible to get a crash dump uploaded? We can also help reach out to the ZFS folks to see if they've seen anything like this.

@aortmannm (Author)

Here are the dump files. Before the crashes started, we ran a "zfs send | ssh zfs recv" from another server, which we had to stop; we don't know if that created this problem.

http://downloads.curasystems.de/vmdump.520.tgz
http://downloads.curasystems.de/unix.520.tgz

@matthiasg

@rmustacc did anybody look at this? Is there anything we can do ourselves?

@rmustacc (Contributor)

Apologies, I've been swamped and haven't been able to look at this personally; sorry I've dropped the ball on this issue. I know it must be quite frustrating. I'd suggest mailing the details to the ZFS developer lists, where there are folks who are more knowledgeable about these kinds of ZFS issues and will hopefully be able to provide guidance a bit faster than we can.

@aortmannm (Author)

Which ZFS group should we raise this issue with? We now have a second machine where the problem exists; it restarted every 5 minutes until we deactivated zsnapper.

@jclulow (Contributor) commented Feb 22, 2016

You could try the OpenZFS mailing list: http://open-zfs.org/wiki/Mailing_list

@stevenburgess

Your quote "which will cause the system to crash after reaching a certain point" makes me wonder if it's this issue: basically, trying to delete too many snapshots at once holds open the next TXG, which, depending on write workload, can fill some buffers and, in our case, lead to crashes.

You can check on this by figuring out what zsnapper would be deleting; if any of the datasets or snapshot ranges passed to zfs destroy contain too many snapshots, try breaking the destroy down into smaller pieces, as sketched below.
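
For illustration only, here is a minimal sketch of that batching idea using libzfs_core's lzc_destroy_snaps(); the pool and snapshot names are made up, and in practice you can get the same effect by simply passing fewer snapshots to each zfs destroy invocation.

/*
 * Hedged sketch: destroy snapshots in small batches rather than
 * handing one huge range to a single destroy.  Snapshot names are
 * hypothetical.  Build with: cc batch.c -lzfs_core -lnvpair
 */
#include <stdio.h>
#include <libzfs_core.h>
#include <libnvpair.h>

int
main(void)
{
	/* Hypothetical list of snapshots a zsnapper-like tool would remove. */
	const char *snaps[] = {
		"tank/vm01@zsnap-2016-02-01",
		"tank/vm01@zsnap-2016-02-02",
		"tank/vm01@zsnap-2016-02-03",
		"tank/vm01@zsnap-2016-02-04",
	};
	const int nsnaps = sizeof (snaps) / sizeof (snaps[0]);
	const int batch = 2;		/* keep each destroy small */

	if (libzfs_core_init() != 0) {
		perror("libzfs_core_init");
		return (1);
	}

	for (int i = 0; i < nsnaps; i += batch) {
		nvlist_t *todo = fnvlist_alloc();
		nvlist_t *errlist = NULL;

		for (int j = i; j < i + batch && j < nsnaps; j++)
			fnvlist_add_boolean(todo, snaps[j]);

		/* B_FALSE: do not defer the destroy. */
		if (lzc_destroy_snaps(todo, B_FALSE, &errlist) != 0)
			(void) fprintf(stderr, "batch at %d failed\n", i);

		fnvlist_free(todo);
		if (errlist != NULL)
			nvlist_free(errlist);
	}

	libzfs_core_fini();
	return (0);
}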

@rmustacc (Contributor)

From the dump, we're not holding open a single txg for a long time. Instead, the issue is that ZFS expects some bit of data to be present that isn't there. @ahrens, have you ever seen issues like the ones in the dump above?

@matthiasg

By the way, we now have a second server exhibiting the same issue, this time a live customer machine. It was rebooting every minute due to creating and deleting snapshots. Disabling the zsnapper service fixed the reboots, but we are now stuck with another system where we cannot delete certain snapshots anymore. We will post to the ZFS groups too.

@ahrens commented Feb 23, 2016

It's possible that this is the fallout from hitting bug 3603, which was fixed a few years ago. Have you run this pool without the following fix? A workaround for this problem would be to add code to bpobj_enqueue_subobj to ignore the subsubobjs if it does not exist (i.e. dmu_object_info() returns ENOENT, as it has here). This would leak some space (perhaps a very small amount of space) but allow you to recover all the blocks that can be found.

commit d04756377ddd1cf28ebcf652541094e17b03c889
Author: Matthew Ahrens <mahrens@delphix.com>
Date: Mon Mar 4 12:27:52 2013 -0800

3603 panic from bpobj_enqueue_subobj()
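
For illustration, a rough sketch of the workaround described above. This is not the actual illumos change; the variable names (bpo, subsubobjs, doi) are taken from the panic message rather than the exact bpobj.c source.

/*
 * Hedged sketch of the suggested workaround in bpobj_enqueue_subobj():
 * where the code currently asserts that the nested subobj list exists,
 * tolerate ENOENT and skip it instead of panicking.
 */
dmu_object_info_t doi;
int err = dmu_object_info(bpo->bpo_os, subsubobjs, &doi);

if (err == 0) {
	/* ... existing path: fold the sub-subobj entries into bpo ... */
} else if (err == ENOENT) {
	/*
	 * The sub-subobj list object is missing, likely damage left
	 * behind by an earlier bug such as 3603.  Skip it: this leaks
	 * whatever space those entries described, but the snapshot
	 * destroy can complete instead of tripping the
	 * "0 == dmu_object_info(...)" assertion.
	 */
} else {
	VERIFY0(err);	/* any other error remains fatal */
}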

@aortmannm (Author)

We never ran such an old SmartOS version on these two servers.

@aortmannm (Author)

We updated today to the newest version, 20160317T000621Z.

The problem still exists on both servers. Does nobody really recognize this problem, or can someone help investigate it?

Here is the new dump from the newest version:
http://downloads.curasystems.de/vmdump.538.tgz

@Adel-Magebinary

This problem is more likely related to your RAID controller. Make sure you are using a real HBA (like the Dell HBA330) rather than a logical RAID controller.
