Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disk activity widget inactive - Tumbleweed #2844

Closed
phillxnet opened this issue May 29, 2024 · 14 comments
Closed

Disk activity widget inactive - Tumbleweed #2844

phillxnet opened this issue May 29, 2024 · 14 comments
Assignees

Comments

@phillxnet
Copy link
Member

On our pending 5.0.9-0 installers we have no activity shown in the Dashboard disk activity widget. This is only for a Tumblweed base and affects all our installer machine/arch targets: i.e. TW on x86_64, ARM64EFI, Pi4.
Example command output from a failing TW.x86_64 installer instance:

cat /proc/diskstats
 254       0 vda 7749 14 789428 5441 13465 27 537993 11456 0 13727 19696 7539 1 42603224 245 920 2552                                                                          
 254       1 vda1 902 0 7464 76 0 0 0 0 0 274 76 0 0 0 0 0 0                                                                                                                   
 254       2 vda2 119 0 2008 49 1 0 1 0 0 90 49 2 0 59808 0 0 0                                                                                                                
 254       3 vda3 6631 14 776236 5298 13460 27 537992 11455 0 13394 17000 7537 1 42543416 245 0 0                                                                              
   8       0 sda 731 0 30104 128 7 1 176 6 0 274 138 0 0 0 0 2 3                                                                                                               
   7       0 loop0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       1 loop1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       2 loop2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       3 loop3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       4 loop4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       5 loop5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       6 loop6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           
   7       7 loop7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0                                                                                                                           

Example command output from a working 15.6.x86_64 installer instance:

cat /proc/diskstats
   8       0 sda 9383 38 822820 3620 1583 459 56489 1244 0 7352 5480 0 0 0 0 274 615                                                                                           
   8       1 sda1 238 0 1904 45 0 0 0 0 0 120 45 0 0 0 0 0 0                                                                                                                   
   8       2 sda2 106 12 1752 69 1 0 1 0 0 108 69 0 0 0 0 0 0                                                                                                                  
   8       3 sda3 8941 26 815700 3485 1578 459 56488 1243 0 7216 4729 0 0 0 0 0 0                                                                                              
   8      16 sdb 320 0 13048 50 0 0 0 0 0 152 50 0 0 0 0 0 0                                                                                                                   

It is assumed currently that we have a parsing issue afoot, and that there is something in the TW output that is throwing our disk activity data retrieval in all TW targets.

@Hooverdan96
Copy link
Member

When checking here:

https://www.kernel.org/doc/Documentation/ABI/testing/procfs-diskstats

it seems that additional fields have been added since Kernel 5.5 and depending on the way the parsing occurs that might throw it for a loop (pun intended).

Kernel 5.5+ appends two more fields for flush requests:

==  =====================================
19  flush requests completed successfully
20  time spent flushing
==  =====================================

@FroggyFlox
Copy link
Member

Nice find, @Hooverdan96 !

Looks like we should update our unit tests there and then our parsing.

The following should most likely be moved to a dedicated issue but before that, I wanted to briefly get your feedback on considering it further: should we consider and investigate whether the python psutil library would help us better than having to do our own parsing and try to have it keep up-to-date with kernel versions? I would presume that library would be better at that than we can spend time on, but it remains to be looked at. We would also need to check whether it provides all the metrics we currently use. Thoughts?

@Hooverdan96
Copy link
Member

Hooverdan96 commented May 31, 2024

using psutil might be the no-brainer (a bit more effort, but removing the kernel/OpenSUSE flavor dependency).

I think this is where the parsing occurs (only showing the upper code snippet)

class DisksWidgetNamespace(RockstorIO):
switch = False
byid_disk_map = {}
def on_connect(self, sid, environ):
self.byid_disk_map = get_byid_name_map()
self.switch = True
self.spawn(self.send_top_disks, sid)
def on_disconnect(self, sid):
self.cleanup(sid)
self.switch = False
def send_top_disks(self):
def disk_stats(prev_stats):
disks_stats = []
# invoke body of disk_stats with empty cur_stats
stats_file_path = "/proc/diskstats"
cur_stats = {}
interval = 1
# TODO: Consider refactoring the following to use Disk.temp_name or
# TODO: building byid_disk_map from the same. Ideally we would have
# TODO: performance testing in place prior to this move.
# Build a list of our db's disk names, now in by-id type format.
disks = [d.name for d in Disk.objects.all()]
# /proc/diskstats has lines of the following form:
# 8 64 sde 1034 0 9136 702 0 0 0 0 0 548 702
# 8 65 sde1 336 0 2688 223 0 0 0 0 0 223 223
with open(stats_file_path) as stats_file:
for line in stats_file.readlines():
fields = line.split()
# As the /proc/diskstats lines contain transient type names
# we need to convert those to our by-id db names.

@Hooverdan96
Copy link
Member

Quick Research: here is the disk readout from 'psutil'

https://psutil.readthedocs.io/en/latest/#psutil.disk_io_counters

giving these fields:

Return system-wide disk I/O statistics as a named tuple including the following fields:
read_count: number of reads
write_count: number of writes
read_bytes: number of bytes read
write_bytes: number of bytes written

Platform-specific fields:
read_time: (all except NetBSD and OpenBSD) time spent reading from disk (in milliseconds)
write_time: (all except NetBSD and OpenBSD) time spent writing to disk (in milliseconds)
busy_time: (Linux, FreeBSD) time spent doing actual I/Os (in milliseconds)
read_merged_count (Linux): number of merged reads (see [iostats doc](https://www.kernel.org/doc/Documentation/iostats.txt))
write_merged_count (Linux): number of merged writes (see [iostats doc](https://www.kernel.org/doc/Documentation/iostats.txt))

in the Rockstor parsing it seems to map these fields:

disks_stats.append(
{
"name": disk,
"reads_completed": data[0],
"reads_merged": data[1],
"sectors_read": data[2],
"ms_reading": data[3],
"writes_completed": data[4],
"writes_merged": data[5],
"sectors_written": data[6],
"ms_writing": data[7],
"ios_progress": data[8],
"ms_ios": data[9],
"weighted_ios": data[10],
"ts": str(
datetime.utcnow().replace(tzinfo=utc).isoformat()
), # noqa E501

sooo, unless we restructure the widget, I think psutil might not cover the requirement ...

@Hooverdan96
Copy link
Member

Hooverdan96 commented May 31, 2024

ok, some more analysis. I attempted a mapping, and there are a few I couldn't really map (or find that they're derived/calculated). So, any refactoring might not be so bad ...

<style> </style>
In psutil Maps to in Rockstor diskstats Not mapped in psutil Comments
       
read_count: number of reads "reads_completed": data[0],    
write_count: number of writes "writes_completed": data[4],    
read_bytes: number of bytes read -    
write_bytes: number of bytes written -    
read_time: (all except NetBSD and OpenBSD) time spent reading from disk (in milliseconds) "ms_reading": data[3],    
write_time: (all except NetBSD and OpenBSD) time spent writing to disk (in milliseconds) "ms_writing": data[7],    
busy_time: (Linux, FreeBSD) time spent doing actual I/Os (in milliseconds) "ms_ios": data[9],    
read_merged_count (Linux): number of merged reads (see iostats doc) "reads_merged": data[1],    
write_merged_count (Linux): number of merged writes (see iostats doc) "writes_merged": data[5],    
    "name": disk, derived
    "sectors_read": data[2],  
    "sectors_written": data[6],  
    "ios_progress": data[8],  
    "weighted_ios": data[10],
    "ts": str(datetime.utcnow().replace(tzinfo=utc).isoformat() derived

@Hooverdan96
Copy link
Member

and finally, it would have been too easy, but there is this python package:

https://pypi.org/project/proc/

but it doesn't seem to have been updated since 2020 so probably not an option, unless the reason is that it only is updated, when the Linux proc is materially changed ...

@phillxnet
Copy link
Member Author

@Hooverdan96
Re:

sooo, unless we restructure the widget, I think psutil might not cover the requirement ...

We can presumably drop some of the dimensions monitored. It may end up making the widget let cpu heavy as it goes: especially give we may well incur more overhead in using psutil.

@FroggyFlox
Copy link
Member

@Hooverdan96 , @phillxnet ,

If we indeed "just" need to update our current parsing for TW, would you agree on implementing this fix for the PR targeting this issue given we are in feature freeze and only bug fix for now?
We can then dedicate an issue to consider alternatives for our next testing phase.

@Hooverdan96
Copy link
Member

I'm ok with that approach, too. It seems to me that somewhere the "number of columns available" needs to be evaluated before mapping.
Question is whether anybody can take this (or even the other approach) on before the stable release...

@phillxnet
Copy link
Member Author

@Hooverdan96
Re:

Question is whether anybody can take this (or even the other approach) on before the stable release...

Although highly visible, this is not a show-stopper - but a nice to have. Plus only presenting in Tumbleweed, which we have on the downloads page as "Development/Advanced-user/Rescue use only" lowers the significance of this issue. But on the other hand it is likely only a matter of time before this may present in our current Leap 15.6 target. So I'm currently not thinking of delaying our next stable rpm version designation as a result of this issue (it remains not on that Milestone). And if we take care to limit any related PR to allow a back-port from future testing branch to the then stable master we can release a fix for master also.

To fully assess the issue we need to know exactly the root cause: it may not be the additional number of parameters per device; but the devices presented. Linking to a now very old issue that may, or may not, have the same cause but from a 2019 kernel:

"Disk activity plot showing nothing, all drives" #2049

I.e. it may be those loop devices showing in TW that are not shown in Leap (at least for now).

@phillxnet
Copy link
Member Author

Linking to the last major change in this area:
"regression in dashboard disk activity widget. Fixes #1978" #1979

Given we seem to have additional device showing in TW than in 15.6 we may just have an issue with get_byid_name_map():

Looking into extending the existing test/tests we have for get_byid_name_map() (added in #1979) re:

  • Add unit tests for both prior and current behaviour of get_byid_name_map() to ensure nature of change is understood.

to try and rule that out as a potential cause for this TW failure.

@phillxnet
Copy link
Member Author

OK, so hopefully some progress:

Adding via @Hooverdan96 pointer to rockstor-core/src/rockstor/smart_manager/data_collector.py the following filter:
N.B. last 3 lines

            with open(stats_file_path) as stats_file:
                for line in stats_file.readlines():
                    fields = line.split()
                    # Sanity check:
                    if len(fields) < 14 or fields[2].startswith("loop"):
                        continue

We have a functional disk activity widget in latest TW. Suggesting the loop* devices are throwing us and we can avoid them entirely by this filter. Having more of a look but reporting here by way of findings to date.

@phillxnet phillxnet added this to the 5.1.X-X Stable release milestone Jul 22, 2024
@phillxnet phillxnet self-assigned this Jul 23, 2024
@phillxnet
Copy link
Member Author

Moving to establishing a device blacklist via major device numbers (first column in /proc/diskstats): https://www.kernel.org/doc/Documentation/admin-guide/devices.txt

   7 block	Loopback devices
		  0 = /dev/loop0	First loop device
		  1 = /dev/loop1	Second loop device
		    ...

		The loop devices are used to mount filesystems not
		associated with block devices.	The binding to the
		loop devices is handled by mount(8) or losetup(8).

As cleaner/faster and more easily extensible. I.e. we also have potential noise re sr0 (scsi rom) devices:

8 16 sdb 100 0 4192 88 0 0 0 0 0 72 88 0 0 0 0 0 0
11 0 sr0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

  11 block	SCSI CD-ROM devices
		  0 = /dev/scd0		First SCSI CD-ROM
		  1 = /dev/scd1		Second SCSI CD-ROM
		    ...

		The prefix /dev/sr (instead of /dev/scd) has been deprecated.

We can then move, in time, to using block major numbers as a canonical reference to replace a number of device name matches elsewhere that are less robust: i.e. note the /dev/sr deprecation that we still see in Leap 15.6 that is to be replaced by /dev/scd as per the above kernel doc reference.

phillxnet added a commit to phillxnet/rockstor-core that referenced this issue Jul 23, 2024
Add filter to `/proc/diskstats` excluding block loopback major
device 7. This avoids a failed byid_name lookup via
byid_disk_map() as loop* block devices do not have byid names,
and are not devices of interest.

Includes
- Additional sanity check on diskstats re ignoring < 14 columns.
- Filter out SCSI CD-ROM and Floppy disks re /proc/diskstats.
- Associated comment & TODO updates/additions.
- Minor type hint improvement.
- Updated get_byid_name_map() test data re newer kernels.
phillxnet added a commit that referenced this issue Jul 26, 2024
…ctive---Tumbleweed

Disk activity widget inactive - Tumbleweed #2844
@phillxnet
Copy link
Member Author

Closing as:
Fixed by #2880

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants