Very slow metadata reads during writing #1179

Closed
dominikh opened this issue Jan 5, 2013 · 19 comments
Labels
Type: Performance (performance improvement or performance problem)

Comments

@dominikh
Contributor

dominikh commented Jan 5, 2013

I'm experiencing the following problem during any big sequential write, such as copying a big file from a different file system or generating one:

Directory listings (e.g. with ls) and computing disk usage with du get very slow, to the point that tree prints only a couple of lines per second.

Normal reads, such as cat-ing another big file, perform perfectly (they actually slow writes almost to a halt), so the slowness is limited to metadata reads.

I have tested this with both a pool on a single SATA disk and a pool on three SATA disks without redundancy, on different controllers as well. I have also tested setting zfs_vdev_max_pending to 1 as well as to a value higher than the default, but neither made a noticeable difference.

Under idle load, running tree causes constant ~2 MB/s reads (according to zpool iostat). During a write, tree causes between 6 and 600 kB/s of reads, not constantly but in bursts: it reads for 1-2 seconds, then pauses for another 1-2 seconds, and so on.
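
A simple way to watch this read pattern is a one-second zpool iostat loop while tree runs in another shell (a sketch; the pool name home is taken from the zdb output further down):

# per-second, per-vdev read/write bandwidth; run tree in a second terminal
zpool iostat -v home 1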

I'd be more than happy to provide any information you need.

@ryao
Contributor

ryao commented Jan 9, 2013

What is the output of zdb?

@dominikh
Contributor Author

dominikh commented Jan 9, 2013

home:
    version: 28
    name: 'home'
    state: 0
    txg: 14462
    pool_guid: 11329702098934403967
    hostname: 'dominikh-pc'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 11329702098934403967
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 12057560379909090469
            path: '/dev/sdb8'
            whole_disk: 0
            metaslab_array: 30
            metaslab_shift: 29
            ashift: 9
            asize: 64904232960
            is_log: 0
            DTL: 92
            create_txg: 4

@cousins

cousins commented Jul 15, 2013

Was this ever addressed? I'm seeing the same thing.

Here is output from zdb:

pool1:
version: 5000
name: 'pool1'
state: 0
txg: 724667
pool_guid: 11849325496751575525
hostname: 'nfs1.localdomain'
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 11849325496751575525
create_txg: 4
children[0]:
type: 'raidz'
id: 0
guid: 16232289015635245001
nparity: 1
metaslab_array: 34
metaslab_shift: 37
ashift: 12
asize: 16003128885248
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 15323940241176049209
path: '/dev/disk/by-vdev/0-18'
whole_disk: 0
DTL: 42
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 16107610421697630772
path: '/dev/disk/by-vdev/0-19'
whole_disk: 0
DTL: 41
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 4592313039804463070
path: '/dev/disk/by-vdev/0-20'
whole_disk: 0
DTL: 40
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 4286523150378293281
path: '/dev/disk/by-vdev/0-21'
whole_disk: 0
DTL: 39
create_txg: 4
features_for_read:
pool2:
version: 5000
name: 'pool2'
state: 0
txg: 853752
pool_guid: 1449893506355619091
hostname: 'nfs1.localdomain'
vdev_children: 6
vdev_tree:
type: 'root'
id: 0
guid: 1449893506355619091
create_txg: 4
children[0]:
type: 'raidz'
id: 0
guid: 13535368675214857022
nparity: 2
metaslab_array: 41
metaslab_shift: 38
ashift: 12
asize: 40007822213120
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 15367600075020876362
path: '/dev/mapper/mpathay'
whole_disk: 0
DTL: 423
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 2883425663032000826
path: '/dev/mapper/mpathaw'
whole_disk: 0
DTL: 422
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 15523232341984025474
path: '/dev/mapper/mpathba'
whole_disk: 0
DTL: 421
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 8494900005291235235
path: '/dev/mapper/mpathbr'
whole_disk: 0
DTL: 420
create_txg: 4
children[4]:
type: 'disk'
id: 4
guid: 10004260158543572020
path: '/dev/mapper/mpathbs'
whole_disk: 0
DTL: 419
create_txg: 4
children[5]:
type: 'disk'
id: 5
guid: 15472209448453876362
path: '/dev/mapper/mpathc'
whole_disk: 0
DTL: 418
create_txg: 4
children[6]:
type: 'disk'
id: 6
guid: 13459517471440984652
path: '/dev/mapper/mpathe'
whole_disk: 0
DTL: 417
create_txg: 4
children[7]:
type: 'disk'
id: 7
guid: 9432346345837365482
path: '/dev/mapper/mpathd'
whole_disk: 0
DTL: 416
create_txg: 4
children[8]:
type: 'disk'
id: 8
guid: 12680336969170452170
path: '/dev/mapper/mpathaa'
whole_disk: 0
DTL: 415
create_txg: 4
children[9]:
type: 'disk'
id: 9
guid: 4723109204817423965
path: '/dev/mapper/mpathag'
whole_disk: 0
DTL: 414
create_txg: 4
children[1]:
type: 'raidz'
id: 1
guid: 12807610639044923521
nparity: 2
metaslab_array: 39
metaslab_shift: 38
ashift: 12
asize: 40007822213120
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 16525629931697231711
path: '/dev/mapper/mpathav'
whole_disk: 0
DTL: 443
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 214752360207951468
path: '/dev/mapper/mpathax'
whole_disk: 0
DTL: 442
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 15667663214024081754
path: '/dev/mapper/mpathaz'
whole_disk: 0
DTL: 441
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 16389023265712432347
path: '/dev/mapper/mpathbt'
whole_disk: 0
DTL: 440
create_txg: 4
children[4]:
type: 'disk'
id: 4
guid: 7239552607776229924
path: '/dev/mapper/mpathbv'
whole_disk: 0
DTL: 439
create_txg: 4
children[5]:
type: 'disk'
id: 5
guid: 11133414635665401074
path: '/dev/mapper/mpathf'
whole_disk: 0
DTL: 438
create_txg: 4
children[6]:
type: 'disk'
id: 6
guid: 5962597296578748074
path: '/dev/mapper/mpathg'
whole_disk: 0
DTL: 437
create_txg: 4
children[7]:
type: 'disk'
id: 7
guid: 7503372555309017085
path: '/dev/mapper/mpathi'
whole_disk: 0
DTL: 436
create_txg: 4
children[8]:
type: 'disk'
id: 8
guid: 12832984865691365552
path: '/dev/mapper/mpathad'
whole_disk: 0
DTL: 435
create_txg: 4
children[9]:
type: 'disk'
id: 9
guid: 9450736022982777539
path: '/dev/mapper/mpathac'
whole_disk: 0
DTL: 434
create_txg: 4
children[2]:
type: 'raidz'
id: 2
guid: 10982097010528358280
nparity: 2
metaslab_array: 38
metaslab_shift: 38
ashift: 12
asize: 40007822213120
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 2754247701997483318
path: '/dev/mapper/mpathbb'
whole_disk: 0
DTL: 403
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 7508610659816953514
path: '/dev/mapper/mpathbe'
whole_disk: 0
DTL: 402
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 7532169281697963699
path: '/dev/mapper/mpathbc'
whole_disk: 0
DTL: 401
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 10774162423908615811
path: '/dev/mapper/mpathbu'
whole_disk: 0
DTL: 400
create_txg: 4
children[4]:
type: 'disk'
id: 4
guid: 9948663893352823412
path: '/dev/mapper/mpathbw'
whole_disk: 0
DTL: 399
create_txg: 4
children[5]:
type: 'disk'
id: 5
guid: 7416277044794958713
path: '/dev/mapper/mpathh'
whole_disk: 0
DTL: 398
create_txg: 4
children[6]:
type: 'disk'
id: 6
guid: 16785171220053622657
path: '/dev/mapper/mpathj'
whole_disk: 0
DTL: 397
create_txg: 4
children[7]:
type: 'disk'
id: 7
guid: 6450617676614101201
path: '/dev/mapper/mpathk'
whole_disk: 0
DTL: 396
create_txg: 4
children[8]:
type: 'disk'
id: 8
guid: 3042762451379655665
path: '/dev/mapper/mpathab'
whole_disk: 0
DTL: 395
create_txg: 4
children[9]:
type: 'disk'
id: 9
guid: 7469536874470028810
path: '/dev/mapper/mpathae'
whole_disk: 0
DTL: 394
create_txg: 4
children[3]:
type: 'raidz'
id: 3
guid: 2057058372264219532
nparity: 2
metaslab_array: 37
metaslab_shift: 38
ashift: 12
asize: 40007822213120
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 14627859175975193481
path: '/dev/mapper/mpathbd'
whole_disk: 0
DTL: 393
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 8579806227451864558
path: '/dev/mapper/mpathbf'
whole_disk: 0
DTL: 392
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 16786422642316301974
path: '/dev/mapper/mpathbg'
whole_disk: 0
DTL: 391
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 7558932231101274593
path: '/dev/mapper/mpathbx'
whole_disk: 0
DTL: 390
create_txg: 4
children[4]:
type: 'disk'
id: 4
guid: 17308681924675649441
path: '/dev/mapper/mpathby'
whole_disk: 0
DTL: 389
create_txg: 4
children[5]:
type: 'disk'
id: 5
guid: 11315208517481449901
path: '/dev/mapper/mpathl'
whole_disk: 0
DTL: 388
create_txg: 4
children[6]:
type: 'disk'
id: 6
guid: 9116019710600322230
path: '/dev/mapper/mpathm'
whole_disk: 0
DTL: 387
create_txg: 4
children[7]:
type: 'disk'
id: 7
guid: 13792086835240651436
path: '/dev/mapper/mpathn'
whole_disk: 0
DTL: 386
create_txg: 4
children[8]:
type: 'disk'
id: 8
guid: 18275221268381600893
path: '/dev/mapper/mpathas'
whole_disk: 0
DTL: 385
create_txg: 4
children[9]:
type: 'disk'
id: 9
guid: 4649142271934045153
path: '/dev/mapper/mpathai'
whole_disk: 0
DTL: 384
create_txg: 4
children[4]:
type: 'raidz'
id: 4
guid: 11403491051027443587
nparity: 2
metaslab_array: 36
metaslab_shift: 38
ashift: 12
asize: 40007822213120
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 12543208754326323010
path: '/dev/mapper/mpathbh'
whole_disk: 0
DTL: 413
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 8031831489193426255
path: '/dev/mapper/mpathbi'
whole_disk: 0
DTL: 412
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 16763372292077154269
path: '/dev/mapper/mpathbk'
whole_disk: 0
DTL: 411
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 2132699546901437428
path: '/dev/mapper/mpathcb'
whole_disk: 0
DTL: 410
create_txg: 4
children[4]:
type: 'disk'
id: 4
guid: 6204024202877216302
path: '/dev/mapper/mpathbz'
whole_disk: 0
DTL: 409
create_txg: 4
children[5]:
type: 'disk'
id: 5
guid: 10271336551288709155
path: '/dev/mapper/mpatho'
whole_disk: 0
DTL: 408
create_txg: 4
children[6]:
type: 'disk'
id: 6
guid: 2948958354676228940
path: '/dev/mapper/mpathp'
whole_disk: 0
DTL: 407
create_txg: 4
children[7]:
type: 'disk'
id: 7
guid: 17630532148077612341
path: '/dev/mapper/mpathq'
whole_disk: 0
DTL: 406
create_txg: 4
children[8]:
type: 'disk'
id: 8
guid: 17122390723182134782
path: '/dev/mapper/mpathah'
whole_disk: 0
DTL: 405
create_txg: 4
children[9]:
type: 'disk'
id: 9
guid: 1831818142236287837
path: '/dev/mapper/mpathaf'
whole_disk: 0
DTL: 404
create_txg: 4
children[5]:
type: 'raidz'
id: 5
guid: 4555316044482278270
nparity: 2
metaslab_array: 34
metaslab_shift: 38
ashift: 12
asize: 40007822213120
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 9167167153092562059
path: '/dev/mapper/mpathbj'
whole_disk: 0
DTL: 433
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 16862282001395020981
path: '/dev/mapper/mpathbo'
whole_disk: 0
DTL: 432
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 4474773783325290484
path: '/dev/mapper/mpathbl'
whole_disk: 0
DTL: 431
create_txg: 4
children[3]:
type: 'disk'
id: 3
guid: 5757464137864938800
path: '/dev/mapper/mpathca'
whole_disk: 0
DTL: 430
create_txg: 4
children[4]:
type: 'disk'
id: 4
guid: 9876005222154795437
path: '/dev/mapper/mpathcc'
whole_disk: 0
DTL: 429
create_txg: 4
children[5]:
type: 'disk'
id: 5
guid: 1564324532825522320
path: '/dev/mapper/mpathr'
whole_disk: 0
DTL: 428
create_txg: 4
children[6]:
type: 'disk'
id: 6
guid: 7312149929362521741
path: '/dev/mapper/mpaths'
whole_disk: 0
DTL: 427
create_txg: 4
children[7]:
type: 'disk'
id: 7
guid: 5233402612129867781
path: '/dev/mapper/mpatht'
whole_disk: 0
DTL: 426
create_txg: 4
children[8]:
type: 'disk'
id: 8
guid: 14019631778607735571
path: '/dev/mapper/mpathat'
whole_disk: 0
DTL: 425
create_txg: 4
children[9]:
type: 'disk'
id: 9
guid: 934589199848764476
path: '/dev/mapper/mpathaj'
whole_disk: 0
DTL: 424
create_txg: 4
features_for_read:

Thanks.

Steve

@ryao
Contributor

ryao commented Jul 15, 2013

@dominikh I am rather late in responding to this, but your issue could be caused by having an ashift=9 pool on an advanced format disk. That would incur read-modify-write overhead inside the disk itself.

@cousins I am not sure about your case; I do not have much information to go on about what you are doing.
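
A rough way to check whether that applies is to compare the sector sizes the kernel reports for the disk with the pool's ashift, along these lines (a sketch; sdb matches the /dev/sdb8 vdev above, adjust the device and pool names for your system):

# logical vs. physical sector size reported by the kernel
cat /sys/block/sdb/queue/logical_block_size
cat /sys/block/sdb/queue/physical_block_size
# ashift actually in use for the pool
zdb -C home | grep ashift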

@dominikh
Contributor Author

It's not an advanced format disk; it's a SAMSUNG HD103SJ, which still uses real 512-byte sectors.

Besides, the performance penalties I am seeing are far higher than anything read-modify-write would incur.

@cousins

cousins commented Jul 16, 2013

Hi Richard,

The current case is that I have two rsyncs going (about 31 TB between the two, so they will be running for days), and meanwhile I wanted to check how much data had been copied, so I was running:

du -cbsh /pool2/omg/lab/*/*

The load on the system is between 1.5 and 2.0, with the majority of it coming from ssh (rsync runs over ssh) and the two rsync processes:

top - 16:37:35 up 24 days, 1:14, 7 users, load average: 1.70, 1.69, 2.02
Tasks: 1002 total, 2 running, 1000 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.0%us, 4.7%sy, 0.0%ni, 86.8%id, 5.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 132126220k total, 127384060k used, 4742160k free, 258236k buffers
Swap: 4192184k total, 3476k used, 4188708k free, 953300k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26186 root 20 0 64104 7876 2792 S 55.0 0.0 979:16.03 ssh
26187 root 20 0 371m 132m 744 S 36.3 0.1 673:27.05 rsync
26199 root 20 0 384m 233m 516 R 34.0 0.2 1835:19 rsync
26198 root 20 0 64200 7972 2792 S 12.4 0.0 698:40.63 ssh
4587 root RT 0 5543m 15m 3376 S 3.9 0.0 539:13.52 multipathd
21469 root 20 0 99.4m 1640 672 D 2.9 0.0 1:02.46 du
10338 root 39 19 0 0 0 S 2.0 0.0 179:13.06 z_wr_int/5
10341 root 39 19 0 0 0 S 2.0 0.0 192:01.52 z_wr_int/8
10342 root 39 19 0 0 0 S 2.0 0.0 193:33.35 z_wr_int/9
10324 root 39 19 0 0 0 S 1.6 0.0 150:41.96 z_wr_iss/21
...

The last time I ran it, it took about 52 minutes to run, after 10 TB had been copied. On the source volumes (XFS on a different machine) it takes about 35 minutes to do the whole 31 TB, which is longer than I expected but still quite a bit faster than the ZFS system right now. The 31 TB volumes contain a total of about 16 million files, and so far about 6.6 million files have been copied.

So:
ZFS system: 10 TB, 6.6 million files: du takes 52 minutes
XFS system: 31 TB, 16 million files: du takes 35 minutes

FWIW, the XFS system is an old Supermicro Opteron file server (circa 2005) with two Opteron 248 CPUs and 32 GB of RAM, connected to two separate RAID systems via an LSI Ultra320 SCSI HBA. The ZFS system is a new Supermicro box with two Xeon E5-2620 processors and 128 GB of RAM, connected to the drives with LSI SAS controllers.

You mentioned ashift. I used -o ashift=12, and these are Seagate ST4000NM0023 4 TB drives. Please let me know if I should have used something else.

Thanks very much for your help. Let me know if you need any other information.

Steve

@behlendorf
Contributor

@cousins Can you try setting the following module option to see if it helps? If your filesystem is relatively full or fragmented, it may help considerably.

echo 1 >/sys/module/zfs/parameters/metaslab_debug
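
The setting should apply without reloading the module; to confirm the current value, or to back it out after testing, roughly:

cat /sys/module/zfs/parameters/metaslab_debug
echo 0 >/sys/module/zfs/parameters/metaslab_debug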

@chrisrd
Copy link
Contributor

chrisrd commented Jul 17, 2013

Intriguing... what's the theory as to how enabling this option might improve performance? Better caching? Perhaps it should become something other than a debug option?

@cousins

cousins commented Jul 17, 2013

I gave it a try, but I don't think it helped. This is a fairly new system, so I can't imagine it's fragmented, but the pool is filling up; right now it is at 78% full. Here is what I got:

[root@nfs1 lab]# echo " ";echo " Total size: "; time du -cbsh *;echo -n " Number of files: "; time find . -type f | wc -l; echo " "

Total size:
5.7T tr1
7.4T tr3
14T total

real 203m41.660s
user 0m28.227s
sys 11m16.876s
Number of files: 11364238

real 94m49.238s
user 0m30.117s
sys 8m1.627s

So 3 hours and 23 minutes to du 14 TB with 11 million files. The find command itself took 95 minutes.

I'm willing to try other things. This is a beta system; I'm trying to make sure it is stable and that I'm comfortable with zfsonlinux before we put it into production. That said, I don't really want to lose the data on it, but I'll try just about anything that makes sense.

@behlendorf
Contributor

@chrisrd One of the known soft spots in ZFS is the block allocator. For a highly fragmented pool it's possible to have to unload and load multiple metaslabs before finding a good place to write. Setting metaslab_debug=1 prevents ZFS from unloading the old metaslabs, which can improve things in this case by avoiding wasted I/O, at the expense of some memory. There's work planned to improve the allocator when it falls into this case on a full pool.

@cousins However, since it didn't help, that's clearly not the issue here. One other thing you might try is increasing zfs_arc_meta_limit. This is also a module option; it defaults to 1/4 of the ARC size (in bytes), and it may be helpful to increase it to 1/2 of the ARC size. You can check the current value in /proc/spl/kstat/zfs/arcstats and see whether you're at the limit. It would actually be helpful if you could post the contents of that file in this issue.
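
For reference, checking whether metadata is bumping into the limit looks roughly like this; whether a runtime write to /sys/module/zfs/parameters/zfs_arc_meta_limit takes effect depends on the ZFS version, so setting it in /etc/modprobe.d/zfs.conf is the safe route:

# metadata usage vs. limit, plus the ARC maximum, all in bytes
grep -E 'arc_meta_used|arc_meta_limit|^c_max' /proc/spl/kstat/zfs/arcstats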

@cousins

cousins commented Jul 17, 2013

Hi Brian,

Here you go. It looks like the limit is being reached. The system has 128 GB of RAM, so is the ARC size 64 GB? Should I then set zfs_arc_meta_limit to 32 GB? Can this be done on a running system, or do I set it and restart? Just put:

options zfs zfs_arc_meta_limit=34359738368

in /etc/modprobe.d/zfs.conf and reboot?

Thanks,

Steve

[root@nfs1 lab]# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 80 3840 48111157957 2245317612953595
name type data
hits 4 2162393439
misses 4 178676011
demand_data_hits 4 1558808364
demand_data_misses 4 385517
demand_metadata_hits 4 453856992
demand_metadata_misses 4 16382066
prefetch_data_hits 4 5866936
prefetch_data_misses 4 144060739
prefetch_metadata_hits 4 143861147
prefetch_metadata_misses 4 17847689
mru_hits 4 345033964
mru_ghost_hits 4 8171881
mfu_hits 4 1668547983
mfu_ghost_hits 4 13487455
deleted 4 718637732
recycle_miss 4 19642479
mutex_miss 4 4607923
evict_skip 4 2841287524
evict_l2_cached 4 0
evict_l2_eligible 4 90259312766976
evict_l2_ineligible 4 241109578240
hash_elements 4 2641521
hash_elements_max 4 4364158
hash_collisions 4 286813905
hash_chains 4 552827
hash_chain_max 4 10
p 4 28400452096
c 4 69793218560
c_min 4 8724152320
c_max 4 69793218560
size 4 69561361528
hdr_size 4 892601632
data_size 4 56497236992
other_size 4 12171522904
anon_size 4 18432
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 54007403520
mru_evict_data 4 52112571392
mru_evict_metadata 4 0
mru_ghost_size 4 15548960768
mru_ghost_evict_data 4 58982400
mru_ghost_evict_metadata 4 15489978368
mfu_size 4 2489815040
mfu_evict_data 4 0
mfu_evict_metadata 4 3751936
mfu_ghost_size 4 8761552384
mfu_ghost_evict_data 4 3584
mfu_ghost_evict_metadata 4 8761548800
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 40827
memory_indirect_count 4 87846541
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 61043
arc_meta_used 4 17448790136
arc_meta_limit 4 17448304640
arc_meta_max 4 17531662576

@chrisrd
Contributor

chrisrd commented Jul 22, 2013

(Apologies for the ticket topic hijack, but I thought it best to keep the information together for anyone following.)

@behlendorf Would metaslab_debug=1 also mean the deferred free list doesn't get compacted until the next mount, and the compact on mount could then take excessive amounts of time? I had metaslab_debug=1 on whilst doing a lot of removes, then the machine was power cycled (possibly due to zfs memory issues), and after the reboot the mount took over 7 hours(!). Would this be an explanation for that behaviour, and why metaslab_debug=1 is a debug option rather than a recommended option for some scenarios? (For anyone interested, my reference for suspecting this behaviour is the 'deferred frees' section of: https://blogs.oracle.com/bonwick/entry/space_maps)

@dweeezil
Contributor

@chrisrd Any chance most/all of the files you were removing had xattrs? If so, this is a known issue (see #1548).

@chrisrd
Contributor

chrisrd commented Jul 22, 2013

@dweeezil Yep, I have (lots of) xattrs (w/ xattr=dir), but I'm removing the xattrs before unlinking the files (ref: #457 (comment))
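
For anyone following along, that delete pattern amounts to something like the following (an illustrative sketch using the standard attr tools, invoked with the files to delete as arguments; not the exact script used):

#!/bin/sh
# strip every xattr from each file, then unlink the file itself
for f in "$@"; do
    getfattr --absolute-names -m - "$f" 2>/dev/null \
        | sed -e '/^#/d' -e '/^$/d' \
        | while read -r attr; do setfattr -x "$attr" "$f"; done
    rm -- "$f"
done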

@chrisrd
Contributor

chrisrd commented Jul 22, 2013

...oh yes, I've also previously used stap to confirm that, if I unlink the xattrs first, I don't get into the problematic code in zfs_unlinked_drain():

  /*
   * If this is an attribute directory, purge its contents.
   */
  if (S_ISDIR(ZTOI(zp)->i_mode) && (zp->z_pflags & ZFS_XATTR)) {
    /*
     * We don't need to check the return value of
     * zfs_purgedir here, because zfs_rmnode will just
     * return this xattr directory to the unlinked set
     * until all of its xattrs are gone.
     */
     (void) zfs_purgedir(zp);
  }
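
The check itself can be as simple as counting hits on that function while the unlink workload runs; for example, something along these lines (a sketch, assuming SystemTap can resolve symbols in the zfs module; not necessarily the exact probe used):

stap -e 'probe module("zfs").function("zfs_purgedir") { printf("%s called\n", probefunc()) }'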

@dweeezil
Contributor

@chrisrd With code from today's master (and, I suspect 0.6.1), the hidden directory hangs around in the unlinked set until a remount even if you manually remove all the xattrs. If you actually perform an operation that fetches an xattr during the current mount session, both the file's object and the hidden directory object get stuck in the unlinked set until the next remount. This problem is a side-effect of the various attempts at fixing xattr-deleting deadlocks.

I realize now that this has nothing to do with the original poster's issue but, as I mentioned, any postings involving the unlinked set, xattrs, etc. get my attention.

@FransUrbo
Contributor

@dominikh Is this still a problem with the latest Git, or can this be closed?

@dominikh
Contributor Author

dominikh commented Jun 9, 2014

@FransUrbo Can't tell; I'm not currently using ZoL. Feel free to close, and someone will refile the issue if it happens again.

@behlendorf
Contributor

This issue has gotten pretty cluttered. Let's close it out; we can open a new issue if this is reported again.

@behlendorf modified the milestone: 0.6.6 on Nov 8, 2014