
Initial PR, full dataset and notebooks #1

Merged · 12 commits merged into main · Feb 21, 2025
Conversation

@sunandascript (Collaborator) commented Jan 27, 2025

Overview

This PR adds the raw datasets collected on the DIY Raman system that are used in the calibration, analysis, and figure-making notebooks. It also adds the generate_calibration and generate_figures notebooks, the apply_calibration script, and the spectral_library CSV of the names and peaks for the samples from the pub. All of the data is organized by date and stored in CSV format. I tested the notebooks and script using the dataset from 2024-08-27 (found in the data/raw folder), then ran them with all calibration data collected for this pub, and I'm unaware of any bugs.

The notebook is fairly long and has a lot of outputs since the intent is for the user to thoroughly check the performance of the instrument and the quality of the acquired data by looking at the spectra and peaks of standard materials. It uses modules and packages that are in the dev.yml file.

PR checklist

  • Describe the changes you've made.
  • Describe the tests you have conducted to confirm that your changes behave as expected. If you haven't conducted any tests, please explain why.
  • If you've added new functionality, make sure that the documentation is updated accordingly.

This is one of three notebooks (generate_calibration, apply_calibration, and figure_generation). It was tested with data from 2024-08-27, solid configuration.
@sunandascript sunandascript changed the title initial dataset from DIY Raman full dataset from DIY Raman and generate_calibration notebook Jan 29, 2025
@sunandascript sunandascript changed the title full dataset from DIY Raman and generate_calibration notebook full dataset and generate_calibration notebook Jan 29, 2025
Ran formatting and fixed some markdown sections
Ran through all calibration data collected from 8/27 to 10/18; added notes to markdown sections on suggested parameters
…beling problem in generate_calibration

Major update is adding apply_calibration.py, which applies the calibration from neon and acetonitrile to all data in the raw folder and also additional processing, plus generates plots
@sunandascript sunandascript changed the title full dataset and generate_calibration notebook Initial PR, full dataset and notebooks Feb 1, 2025
Used apc.save_figure instead of plt.savefig; added notes on unbaselined data vs baselined
Made csvs consistent across generate_calibration and apply_calibration, reprocessed all acetonitrile and neon cal files
Figures are the base versions for those in the pub
@keithchev (Member) left a comment

This looks very thorough! I think most of the important elements for reproducibility are in good shape, and the notebooks are generally well-documented. And I was able to run the generate_calibration.ipynb notebook without any trouble.

General comments

  • I would add a more detailed description of the contents of the repo to the README, along with an explanation of the intended usage. My understanding is that there are two steps: first the notebook generate_calibration.ipynb is used to generate a calibration file from neon and acetonitrile standards, and then the apply_calibration.py script can be used to calibrate raw data using the generated calibration files. Assuming I've got this right, I think it'd be good to state this explicitly in the README, as it may not be obvious to external readers.

  • I would clarify the naming convention of the raw data files. This is explained in the notebook, but it's kind of hard to find, and it's very brief. The format is given as:

    "YYYY-MM-DD_samplename_blanked?_baselined?_filtered?_exposure(ms)_gain(dB)_#averagedspectra.csv"
    

    It was hard for me to figure out what "blanked", "baselined", or "filtered" meant, or what "samplename" would look like. I was also confused when choosing an acetonitrile file in the generate_calibration notebook, because those filenames are of the form "2024-08-27_acetonitrile_n_n_y_liquid_1000_0_5.csv" which at first glance is hard to map to the provided format. (I eventually figured it out, though I'm still not sure where "liquid" fits in.) To address this I would use a clearer way to describe the format, like this:

    YYYY-MM-DD_<sample-name>_<blanked>_<baselined>_<filtered>_<exposure-time-ms>_<gain-dB>_<num-averaged-spectra>.csv
    

    And I'd explicitly explain what each of the parameters means; e.g. explain what the parameters "blanked", "baselined", and "filtered" mean and that they can be either "y" or "n". Also, giving some real example filenames would help to further orient the reader.

  • In the generate_calibration notebook, I think it would be preferable to specify the input files by typing out their filenames in the notebook, rather than using a file browser window to pick the files to process, because typing the names is more explicit and more reproducible. This is particularly important because the notebooks in the repo currently include the output figures, but these are not really meaningful without an explicit record of the input files that were used to generate them. If you do stick with using the file browser windows to pick files, I would recommend removing the outputs from the notebook in the repo.
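To make the naming convention concrete, here is a minimal sketch of a parser for it. This is not code from the repo: the field names and the optional descriptor token (e.g. "liquid" in the acetonitrile example above) are inferred from the example filenames, so treat them as assumptions.

```python
# Hypothetical parser for the raw-data filename convention discussed
# above. The "descriptor" field (e.g. "liquid") is an assumption based
# on the example filenames in this PR, not documented behavior.
from pathlib import Path

def parse_raw_filename(filename):
    parts = Path(filename).stem.split("_")
    date, sample = parts[0], parts[1]
    blanked, baselined, filtered = parts[2:5]
    # Some files carry an extra token (e.g. "liquid") between the y/n
    # flags and the numeric fields; treat anything between them as an
    # optional descriptor.
    exposure_ms, gain_db, n_averaged = parts[-3:]
    extra = parts[5:-3]
    return {
        "date": date,
        "sample": sample,
        "blanked": blanked == "y",
        "baselined": baselined == "y",
        "filtered": filtered == "y",
        "descriptor": extra[0] if extra else None,
        "exposure_ms": int(exposure_ms),
        "gain_db": float(gain_db),
        "num_averaged_spectra": int(n_averaged),
    }

info = parse_raw_filename("2024-08-27_acetonitrile_n_n_y_liquid_1000_0_5.csv")
print(info["sample"], info["filtered"], info["exposure_ms"])  # acetonitrile True 1000
```

A helper like this could also double as documentation of the convention, since it names each underscore-separated field explicitly.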

Comments from running the generate_calibration notebook

  • In the initial step, I wasn't sure which neon file to select, so I just picked a random one. Same story with the acetonitrile file; I just wasn't sure how to choose a single file.
  • In the next step, to find and fit peaks, the cell outputs a large block of text and a ton of figures, and I wasn't sure what all of this was or how to interpret it. I guess they're zoomed-in plots of individual peaks? But I found this confusing, and I wasn't sure what to look for in these plots. It might be better to explain what to look for here, or else change the cell to output something more succinct by default.
  • Also, in the cells that find and fit peaks, it looks like the while loops can potentially run forever if not enough peaks are found. I'd recommend adding a maximum iteration count to prevent this.
  • On the steps that save the data, it wasn't immediately obvious to me that the notebook would overwrite existing files in the repo. It might be good to make this clear, and/or explicitly specify an output folder here, so the user has the option to change it if they want.
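The bounded-loop suggestion above can be sketched in a generic form. This is illustrative only: `find_peaks`, the threshold-relaxation step, and the parameter names are placeholders, not functions from this repo.

```python
# Sketch of capping a peak-search loop so it cannot run forever when
# too few peaks are found. All names here are hypothetical stand-ins
# for the repo's actual peak-finding code.
MAX_ITERATIONS = 50

def search_until_enough_peaks(find_peaks, min_peaks, threshold, step=0.9):
    """Lower the detection threshold until enough peaks are found,
    giving up after MAX_ITERATIONS attempts instead of looping forever."""
    for _ in range(MAX_ITERATIONS):
        peaks = find_peaks(threshold)
        if len(peaks) >= min_peaks:
            return peaks
        threshold *= step  # relax the threshold and retry
    raise RuntimeError(
        f"Gave up after {MAX_ITERATIONS} iterations; "
        f"only found {len(peaks)} of {min_peaks} required peaks."
    )
```

Replacing an open-ended `while` with a bounded `for` plus an explicit error keeps the failure mode visible to the user instead of hanging the notebook.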

Housekeeping things

  • I would recommend to avoid adding so many files directly to the repo (this PR has over 1,000 files added), as it makes PRs hard to review because the "files changed" page becomes slow to render and sometimes crashes. Basically, GitHub is not designed to support large numbers of files. I think the best alternative is to compress the directories that contain the CSV files, and then commit the single zipfile instead.
  • Don't forget to update the project name in pyproject.toml and update the README (it still has some placeholder sections).
  • I'd recommend not including a "last update" timestamp in the notebooks and scripts; these are hard to keep up to date, and GitHub should be the source of truth for when files were last changed. Also, without more context about what was updated, a last-updated time is not really meaningful to an outside reader.
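The compress-and-commit suggestion can be done with the standard library alone. The directory layout below is illustrative (a temporary stand-in for a folder of CSVs), not the repo's actual structure.

```python
# Sketch of committing one zip archive per data directory instead of
# thousands of individual CSV files. A temporary stand-in directory is
# created here so the sketch runs anywhere; in the repo you would point
# this at the real data directories.
import shutil
import tempfile
from pathlib import Path

# Stand-in for a data directory full of CSVs.
root = Path(tempfile.mkdtemp())
raw = root / "raw"
raw.mkdir()
(raw / "2024-08-27_neon_n_n_y_1000_0_5.csv").write_text("wavelength,counts\n")

# Compress the whole directory into a single raw.zip next to it.
archive = shutil.make_archive(
    base_name=str(root / "raw"),  # -> <root>/raw.zip
    format="zip",
    root_dir=root,
    base_dir="raw",
)
print(Path(archive).name)  # raw.zip
```

Committing the single archive keeps the PR reviewable, and consumers can unpack it with `shutil.unpack_archive` (or any zip tool) before running the notebooks.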

Added max iteration counter in generate_calibration, added descriptions in the markdown sections, reordered figures in the generate_figures notebook
Added file names and potassium phosphate
@sunandascript sunandascript removed the request for review from dgmets February 21, 2025 09:31
@sunandascript sunandascript reopened this Feb 21, 2025
@sunandascript (Collaborator, Author) commented
Thanks @keithchev ! Your comments were very helpful.
I went through and made changes in the repo and the notebooks to address your comments. I added more detail to the README for how to use this, updated the placeholder texts, compressed the data directories, added more detail in the markdown sections of the generate_calibration notebook, removed the outputs, and added counters for the while loops.

@keithchev keithchev self-requested a review February 21, 2025 19:03
@keithchev (Member) left a comment

Looks good!

@sunandascript sunandascript merged commit fae176b into main Feb 21, 2025
2 checks passed
3 participants