Releases: UCBerkeleySETI/turbo_seti
New plotSETI parameter: --h5dat_lists for pre-generated/edited lists of h5 files and dat files
Internally, plotSETI uses one text-file-resident list of h5 files and another for the corresponding dat files. The list of h5 files is formatted as a text file like this:
/home/giraffe/BASIS/seti_data/voyager_2020/h5_dir/single_coarse_guppi_59046_80036_DIAG_VOYAGER-1_0011.rawspec.0000.h5
/home/giraffe/BASIS/seti_data/voyager_2020/h5_dir/single_coarse_guppi_59046_80354_DIAG_VOYAGER-1_0012.rawspec.0000.h5
/home/giraffe/BASIS/seti_data/voyager_2020/h5_dir/single_coarse_guppi_59046_80672_DIAG_VOYAGER-1_0013.rawspec.0000.h5
/home/giraffe/BASIS/seti_data/voyager_2020/h5_dir/single_coarse_guppi_59046_80989_DIAG_VOYAGER-1_0014.rawspec.0000.h5
/home/giraffe/BASIS/seti_data/voyager_2020/h5_dir/single_coarse_guppi_59046_81310_DIAG_VOYAGER-1_0015.rawspec.0000.h5
/home/giraffe/BASIS/seti_data/voyager_2020/h5_dir/single_coarse_guppi_59046_81628_DIAG_VOYAGER-1_0016.rawspec.0000.h5
The list of corresponding dat files is formatted in the same manner.
Normally, both lists are generated internally by plotSETI and are never seen by the user. However, it has been proposed that in some circumstances, the lists should be prepared by the user. So, if parameter --h5dat_lists (NEW!) is set to 2 file paths (one text file for h5s, one text file for dats), then those list files should be used instead of autogeneration. User-supplied lists will be:
- Checked for existence and consistency.
- Used for internal list processing.
E.g. plotSETI --h5dat_lists /dir_a/list_h5_files.txt /dir_b/list_dat_files.txt --out_dir .....
tells plotSETI that there exists a list of h5 files in /dir_a/list_h5_files.txt and a list of dat files in /dir_b/list_dat_files.txt.
If --h5dat_lists is absent (default i.e. most common usage), plotSETI will internally generate the 2 list files as it has been doing in the past.
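For users who want to prepare the two lists themselves, the following is a minimal sketch of one way to do it. It is not part of turbo_seti: the helper names write_list_file and build_h5dat_lists are hypothetical, and the sketch assumes that each dat file sits in the same directory as its h5 file and shares its file stem (xx.h5 and xx.dat).

```python
from pathlib import Path


def write_list_file(paths, list_path):
    """Write one absolute file path per line, as in the h5 list shown above."""
    lines = [str(Path(p).resolve()) for p in paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def build_h5dat_lists(data_dir, h5_list_path, dat_list_path):
    """Hypothetical helper: pair every .h5 file with the .dat file of the same stem."""
    h5_files = sorted(Path(data_dir).glob("*.h5"))
    # Assumption: xx.dat lives next to xx.h5.
    dat_files = [p.with_suffix(".dat") for p in h5_files]
    missing = [str(p) for p in dat_files if not p.exists()]
    if missing:
        raise FileNotFoundError(f"dat files missing for: {missing}")
    write_list_file(h5_files, h5_list_path)
    write_list_file(dat_files, dat_list_path)
    return len(h5_files)
```

Because both lists are written from the same sorted sequence, line N of the h5 list corresponds to line N of the dat list. The resulting two text files could then be passed to --h5dat_lists, h5 list first.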
Add diagnostics when plotSETI detects mismatch of h5 and dat files
The following circumstances are anomalous:
- No h5 files are found.
- No dat files are found.
- The number of h5 files != the number of dat files.
They are now explicitly diagnosed to prevent confusion.
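The three anomalies above can be sketched as a single check. This is an illustration only, not the plotSETI implementation; the function name diagnose_lists and the message wording are hypothetical.

```python
def diagnose_lists(h5_paths, dat_paths):
    """Return human-readable problems; an empty list means no anomaly was detected."""
    problems = []
    if not h5_paths:
        problems.append("No h5 files were found.")
    if not dat_paths:
        problems.append("No dat files were found.")
    # Report a count mismatch only when both lists are non-empty,
    # so an empty list is not reported twice.
    if h5_paths and dat_paths and len(h5_paths) != len(dat_paths):
        problems.append(
            f"Count mismatch: {len(h5_paths)} h5 files vs {len(dat_paths)} dat files."
        )
    return problems
```

Returning messages rather than raising on the first problem lets a caller report every anomaly at once.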
New utility for showing the differences between 2 dat files
The new utility (dat_diff) executes the overall comparison as 2 independent processes in succession:
- For each entry in dat file dat1, look for a match in dat file dat2.
- For each entry in dat file dat2, look for a match in dat file dat1.
Given 2 dat file entries, the comparison is performed using the following data elements:
- Coarse channel number (exact match)
- Frequency (within rtol), where rtol is the relative tolerance given to {math,numpy}.isclose() (E.g. 0.0001, which signifies 0.01%)
- Drift rate (within rtol)
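The two-pass comparison described above can be sketched as follows. This is a simplified illustration, not the dat_diff source: the Hit record and the function names are hypothetical, and it uses math.isclose with its rel_tol argument standing in for rtol.

```python
import math
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record holding the three fields used in the comparison."""
    coarse_channel: int
    frequency: float
    drift_rate: float


def hits_match(a, b, rtol=0.0001):
    """Exact match on coarse channel; frequency and drift rate within rtol."""
    return (
        a.coarse_channel == b.coarse_channel
        and math.isclose(a.frequency, b.frequency, rel_tol=rtol)
        and math.isclose(a.drift_rate, b.drift_rate, rel_tol=rtol)
    )


def unmatched(source, target, rtol=0.0001):
    """Entries of source that have no matching entry in target."""
    return [h for h in source if not any(hits_match(h, t, rtol) for t in target)]


def dat_diff_sketch(dat1, dat2, rtol=0.0001):
    """Run both passes: dat1 against dat2, then dat2 against dat1."""
    return unmatched(dat1, dat2, rtol), unmatched(dat2, dat1, rtol)
```

Both passes are needed because the relation is not symmetric at the file level: an entry can be present in dat2 with no counterpart in dat1 even when every entry of dat1 finds a match. Note that a purely relative tolerance treats a drift rate of exactly 0 as matching only another exact 0.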
Correct duplicate hits design
The tophitsearch code considers each candidate hit and checks a window of frequencies near that hit. If the window contains another, larger hit, the candidate is not reported. This prevents a single signal from being reported as multiple hits.
There was a small bug in this logic. Previously, the index of the edge of the window was calculated as
i - obs_length*max_drift/2
If you just check the units, obs_length is measured in seconds, max_drift is measured in Hz/s, so obs_length * max_drift has units of Hz, and we were subtracting it from i which is a unitless index. So, this is basically just a meaningless calculation.
Also, it should be multiplying by 2 instead of dividing by 2, because the two signals could be moving toward each other. These two bugs were roughly canceling each other out, so for Green Bank data, for example, we were deduplicating over a window of radius 58 when it should have been a window of radius 80. That is not too big a difference, and this fix won't change very much in practice, but it's better to be using the right calculation here.
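The unit argument can be made concrete with a small sketch. It is not the turbo_seti code: the function names are hypothetical, and it assumes the conversion from Hz to index units is a division by the frequency resolution of one bin. The input values are illustrative assumptions, so the results need not reproduce the radii quoted above exactly.

```python
def window_radius_old(obs_length, max_drift):
    """Previous expression: the result is in Hz but was used as an index count."""
    return obs_length * max_drift / 2


def window_radius_bins(obs_length, max_drift, freq_resolution_hz):
    """Sketch of the corrected idea: two hits may drift toward each other,
    so cover twice the maximum drift in Hz, then convert Hz to frequency bins."""
    return 2 * obs_length * max_drift / abs(freq_resolution_hz)


# Illustrative inputs (assumptions): seconds, Hz/s, Hz per bin.
obs_length = 292.0
max_drift = 0.4
freq_resolution_hz = -2.79  # the sign only encodes channel ordering

old_radius = window_radius_old(obs_length, max_drift)
new_radius = window_radius_bins(obs_length, max_drift, freq_resolution_hz)
```

With these inputs the old expression gives a value near 58 and the corrected one a value in the low 80s, which illustrates how the two errors partly offset each other.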
Correct drift rate calculation in find_doppler.py
There was an off-by-one error when calculating the resolution of the drift rate. You don't want to divide by number of timesteps. Instead, you want to divide by (number of timesteps - 1). Think of the line as being between the centroid of a bin in the first and last row, rather than the very start of the first row and the end of the last row.
@lacker discussed this in the iseti meeting of 5/4/2022 and also with @stevecroft on the previous day. There is a general astronomer-consensus that this fix is an improvement.
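The off-by-one correction can be illustrated with a short sketch. The function name and parameter names are hypothetical and the code is not copied from find_doppler.py; it only assumes that the drift rate resolution is a shift of one frequency bin over the time between the centroids of the first and last rows.

```python
def drift_rate_resolution(freq_resolution_hz, tsamp_s, n_timesteps):
    """Smallest drift rate step in Hz/s: one frequency bin over the span
    between the centroid of the first row and the centroid of the last row."""
    if n_timesteps < 2:
        raise ValueError("need at least 2 timesteps to define a drift rate")
    # n_timesteps rows have only (n_timesteps - 1) centroid-to-centroid intervals.
    span_s = tsamp_s * (n_timesteps - 1)
    return abs(freq_resolution_hz) / span_s
```

For example, with 2.0 Hz bins, 10.0 s per row, and 5 rows, the span is 40 s rather than 50 s, so the resolution is 0.05 Hz/s rather than 0.04 Hz/s.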
GPU Performance Improvement
This release replaces the flt function with a new implementation when turbo_seti is running in GPU mode. Thanks to Franklin Antonio (@fantonio2 on github) for his code at https://github.com/UCBerkeleySETI/dedopplerperf/blob/main/CudaTaylor5demo.cu; these turbo_seti changes are based on that. Kevin Lacker (@lacker on github) used a C++ template to handle multiple float types and other miscellaneous amendments.
Note that some of the surrounding code is refactored because the CPU implementation of flt stores rows of the output by using a bit reversal technique. The GPU implementation doesn't, so the format is slightly different.
This speeds up the flt function by a factor of 5x or so. The flt function previously accounted for around 30% of the time spent by turbo_seti in the search_coarse_channel function. Overall, this change seems to provide a ~15% performance improvement.
Note that the output of the search_coarse_channel function is unchanged. This is purely a performance change when running in GPU mode.
Profiling before this change:
https://bldata.berkeley.edu/pipeline/tmp/turboseti_profile.svg
Profiling after this change:
https://bldata.berkeley.edu/pipeline/tmp/new_turboseti_profile.svg
Plot Event Improvements
Some of the turbo_seti plot_event.py code was fixed in a couple of ways:
- When using a plot offset (red barbell) with the red guideline, it was not being placed correctly.
- Some of the code was extraneous, which could be quite confusing.
Using Filter Parameters after turboSETI Completes
Sometimes, when running turboSETI (or the FindDoppler Python class object), one or more of the 3 filtering parameter values (minimum drift rate, maximum drift rate, and minimum SNR) are guessed, misspecified, or omitted. It is desirable to have a second chance at filtering out dedoppler top hits that are not interesting for analysis (E.g. RFI). Also, this will reduce the number of plots (PNG files) produced which then need to be pruned manually.
This release of turbo_seti adds 2 courses of action that can be taken after the turboSETI execution:
- With a new utility (dat_filter), apply one or more of the 3 above filtering parameters to permanently update the DAT file produced by turboSETI.
- In the plotSETI program or through the use of the find_event_pipeline API, specify values for one or more of the 3 filtering parameters. Note that in this case, the DAT file is not updated.
For example, suppose turboSETI has produced xx.dat from xx.h5 with drift rates varying from -0.5 to 0.5. All of the SNR values are acceptable but we'd like to avoid signals with drift rate absolute values below 0.1 and above 0.4. Then, the following dat_filter execution will permanently purge the signals near 0 drift rates:
dat_filter -m 0.1 -M 0.4 xx.dat
Alternatively, without modifying xx.dat, we could use plotSETI with new parameters instead. Assume that both xx.h5 and xx.dat are in the same directory abc. Then, the following execution will do event analysis and produce plots without permanently purging the signals near 0 drift rates from xx.dat:
plotSETI -m 0.1 -M 0.4 abc
The latter might be a useful tool for experimentation.
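The filtering rule in the example above can be sketched in a few lines. This is an illustration of the idea rather than the dat_filter or plotSETI implementation: the function name keep_hit and its parameter names are hypothetical, and it assumes, following the example, that the drift rate limits apply to absolute values.

```python
def keep_hit(drift_rate, snr, min_drift=None, max_drift=None, min_snr=None):
    """Return True if a hit passes every filter that was specified.
    A filter left as None is not applied."""
    magnitude = abs(drift_rate)
    if min_drift is not None and magnitude < min_drift:
        return False
    if max_drift is not None and magnitude > max_drift:
        return False
    if min_snr is not None and snr < min_snr:
        return False
    return True


# Hypothetical hits as (drift rate in Hz/s, SNR) pairs.
hits = [(-0.5, 30.0), (-0.25, 12.0), (0.02, 50.0), (0.3, 25.0), (0.45, 40.0)]
kept = [h for h in hits if keep_hit(h[0], h[1], min_drift=0.1, max_drift=0.4)]
```

Here kept holds only the hits at -0.25 and 0.3 Hz/s. The hits list itself is unchanged, which mirrors the non-destructive plotSETI path; overwriting the list with kept would mirror the permanent dat_filter path.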
Correct drift rate resolution calculation
When the number of time integrations is not a power of 2, the next power of 2 was used in place of the actual number of time integrations in the drift rate resolution calculation (data_handler.py DATAH instantiation). This threw off the drift rate calculations in the doppler search main loop (find_doppler.py).
Visible symptom: the red-colored fit lines of the plots were not falling on the signal in the waterfall plots of the signal candidates.
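The size of the error can be estimated with a small sketch. The helper names are hypothetical and the code is not taken from data_handler.py. It assumes only that the drift rate resolution is inversely proportional to the number of time integrations, so substituting the padded count scales the resolution by roughly the ratio of the actual count to the padded count.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is greater than or equal to n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def padding_ratio(n_integrations):
    """Approximate factor by which the resolution shrinks when the padded
    count is used in place of the actual number of time integrations."""
    return n_integrations / next_power_of_two(n_integrations)
```

For 12 integrations the padded count is 16 and the ratio is 0.75, so drift rates derived from the padded count would be off by about a quarter, which is consistent with fit lines visibly missing the signal. For counts that are already a power of 2 the ratio is 1 and nothing changes.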
Enhance logging information
- Show versions of hdf5plugin and the HDF5 library.
- Enable the display of HDF5 library error messages which are inhibited by default.