
Initial PR, full dataset and notebooks #1

Merged · 12 commits merged into main · Feb 21, 2025
Conversation

@sunandascript (Collaborator) commented Jan 27, 2025

Overview

This PR adds the raw datasets collected on the DIY Raman system that are used in the calibration, analysis, and figure-making notebooks. It also adds the generate_calibration and generate_figures notebooks, the apply_calibration script, and the spectral_library CSV of the names and peaks for the samples from the pub. All of the data is organized by date and stored in CSV format. I tested the notebooks and script using the dataset from 2024-08-27 (found in the data/raw folder), then ran them with all calibration data collected for this pub, and I'm unaware of any bugs.

The notebook is fairly long and has a lot of outputs since the intent is for the user to thoroughly check the performance of the instrument and the quality of the acquired data by looking at the spectra and peaks of standard materials. It uses modules and packages that are in the dev.yml file.

PR checklist

  • Describe the changes you've made.
  • Describe the tests you have conducted to confirm that your changes behave as expected. If you haven't conducted any tests, please explain why.
  • If you've added new functionality, make sure that the documentation is updated accordingly.

This is one of three notebooks (generate_calibration, apply_calibration, and figure_generation). It was tested with data from 2024-08-27, solid configuration.
@sunandascript sunandascript changed the title initial dataset from DIY Raman full dataset from DIY Raman and generate_calibration notebook Jan 29, 2025
@sunandascript sunandascript changed the title full dataset from DIY Raman and generate_calibration notebook full dataset and generate_calibration notebook Jan 29, 2025
Ran formatting and fixed some markdown sections
Ran through all calibration data collected from 8/27 to 10/18; added notes to markdown sections on suggested parameters
…beling problem in generate_calibration

Major update is adding apply_calibration.py, which applies the calibration from neon and acetonitrile to all data in the raw folder and also additional processing, plus generates plots
@sunandascript sunandascript changed the title full dataset and generate_calibration notebook Initial PR, full dataset and notebooks Feb 1, 2025
Used apc.save_figure instead of plt.savefig; added notes on unbaselined data vs baselined
Made csvs consistent across generate_calibration and apply_calibration, reprocessed all acetonitrile and neon cal files
Figures are the base versions for those in the pub
@keithchev (Member) left a comment

This looks very thorough! I think most of the important elements for reproducibility are in good shape, and the notebooks are generally well-documented. And I was able to run the generate_calibration.ipynb notebook without any trouble.

General comments

  • I would add a more detailed description of the contents of the repo to the README, along with an explanation of the intended usage. My understanding is that there are two steps: first the notebook generate_calibration.ipynb is used to generate a calibration file from neon and acetonitrile standards, and then the apply_calibration.py script can be used to calibrate raw data using the generated calibration files. Assuming I've got this right, I think it'd be good to state this explicitly in the README, as it may not be obvious to external readers.

  • I would clarify the naming convention of the raw data files. This is explained in the notebook, but it's kind of hard to find, and it's very brief. The format is given as:

    "YYYY-MM-DD_samplename_blanked?_baselined?_filtered?_exposure(ms)_gain(dB)_#averagedspectra.csv"
    

    It was hard for me to figure out what "blanked", "baselined", or "filtered" meant, or what "samplename" would look like. I was also confused when choosing an acetonitrile file in the generate_calibration notebook, because those filenames are of the form "2024-08-27_acetonitrile_n_n_y_liquid_1000_0_5.csv" which at first glance is hard to map to the provided format. (I eventually figured it out, though I'm still not sure where "liquid" fits in.) To address this I would use a clearer way to describe the format, like this:

    YYYY-MM-DD_<sample-name>_<blanked>_<baselined>_<filtered>_<exposure-time-ms>_<gain-dB>_<num-averaged-spectra>.csv
    

    And I'd explicitly explain what each of the parameters means; e.g. explain what the parameters "blanked", "baselined", and "filtered" mean and that they can be either "y" or "n". Also, giving some real example filenames would help to further orient the reader.

  • In the generate_calibration notebook, I think it would be preferable to specify the input files by typing out their filenames in the notebook, rather than using a file browser window to pick the files to process, because typing the names is more explicit and more reproducible. This is particularly important because the notebooks in the repo currently include the output figures, but these are not really meaningful without an explicit record of the input files that were used to generate them. If you do stick with using the file browser windows to pick files, I would recommend removing the outputs from the notebook in the repo.
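To make the naming convention concrete, here is a minimal sketch of a parser for it. This is not code from the repo: the field names and the optional descriptor token (e.g. "liquid" in the acetonitrile example above) are inferred from the example filenames, so treat them as assumptions.

```python
# Hypothetical parser for the raw-data filename convention discussed
# above. The "descriptor" field (e.g. "liquid") is an assumption based
# on the example filenames in this PR, not documented behavior.
from pathlib import Path

def parse_raw_filename(filename):
    parts = Path(filename).stem.split("_")
    date, sample = parts[0], parts[1]
    blanked, baselined, filtered = parts[2:5]
    # Some files carry an extra token (e.g. "liquid") between the y/n
    # flags and the numeric fields; treat anything between them as an
    # optional descriptor.
    exposure_ms, gain_db, n_averaged = parts[-3:]
    extra = parts[5:-3]
    return {
        "date": date,
        "sample": sample,
        "blanked": blanked == "y",
        "baselined": baselined == "y",
        "filtered": filtered == "y",
        "descriptor": extra[0] if extra else None,
        "exposure_ms": int(exposure_ms),
        "gain_db": float(gain_db),
        "num_averaged_spectra": int(n_averaged),
    }

info = parse_raw_filename("2024-08-27_acetonitrile_n_n_y_liquid_1000_0_5.csv")
print(info["sample"], info["filtered"], info["exposure_ms"])  # acetonitrile True 1000
```

A helper like this could also double as documentation of the convention, since it names each underscore-separated field explicitly.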

Comments from running the generate_calibration notebook

  • In the initial step, I wasn't sure which neon file to select, so I just picked a random one. Same story with the acetonitrile file; I just wasn't sure how to choose a single file.
  • In the next step, to find and fit peaks, the cell outputs a large block of text and a ton of figures, and I wasn't sure what all of this was or how to interpret it. I guess they're zoomed-in plots of individual peaks? But I found this confusing, and I wasn't sure what to look for in these plots. It might be better to explain what to look for here, or else change the cell to output something more succinct by default.
  • Also, in the cells that find and fit peaks, it looks like the while loops can potentially run forever if not enough peaks are found. I'd recommend adding a maximum iteration count to prevent this.
  • On the steps that save the data, it wasn't immediately obvious to me that the notebook would overwrite existing files in the repo. It might be good to make this clear, and/or explicitly specify an output folder here, so the user has the option to change it if they want.
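The bounded-loop suggestion above can be sketched in a generic form. This is illustrative only: `find_peaks`, the threshold-relaxation step, and the parameter names are placeholders, not functions from this repo.

```python
# Sketch of capping a peak-search loop so it cannot run forever when
# too few peaks are found. All names here are hypothetical stand-ins
# for the repo's actual peak-finding code.
MAX_ITERATIONS = 50

def search_until_enough_peaks(find_peaks, min_peaks, threshold, step=0.9):
    """Lower the detection threshold until enough peaks are found,
    giving up after MAX_ITERATIONS attempts instead of looping forever."""
    for _ in range(MAX_ITERATIONS):
        peaks = find_peaks(threshold)
        if len(peaks) >= min_peaks:
            return peaks
        threshold *= step  # relax the threshold and retry
    raise RuntimeError(
        f"Gave up after {MAX_ITERATIONS} iterations; "
        f"only found {len(peaks)} of {min_peaks} required peaks."
    )
```

Replacing an open-ended `while` with a bounded `for` plus an explicit error keeps the failure mode visible to the user instead of hanging the notebook.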

Housekeeping things

  • I would recommend to avoid adding so many files directly to the repo (this PR has over 1,000 files added), as it makes PRs hard to review because the "files changed" page becomes slow to render and sometimes crashes. Basically, GitHub is not designed to support large numbers of files. I think the best alternative is to compress the directories that contain the CSV files, and then commit the single zipfile instead.
  • Don't forget to update the project name in pyproject.toml and update the README (it still has some placeholder sections).
  • I'd recommend not including a "last update" timestamp in the notebooks and scripts; these are hard to keep up to date, and GitHub should be the source of truth for when files were last changed. Also, without more context about what was updated, a last-updated time is not really meaningful to an outside reader.
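The compress-and-commit suggestion can be done with the standard library alone. The directory layout below is illustrative (a temporary stand-in for a folder of CSVs), not the repo's actual structure.

```python
# Sketch of committing one zip archive per data directory instead of
# thousands of individual CSV files. A temporary stand-in directory is
# created here so the sketch runs anywhere; in the repo you would point
# this at the real data directories.
import shutil
import tempfile
from pathlib import Path

# Stand-in for a data directory full of CSVs.
root = Path(tempfile.mkdtemp())
raw = root / "raw"
raw.mkdir()
(raw / "2024-08-27_neon_n_n_y_1000_0_5.csv").write_text("wavelength,counts\n")

# Compress the whole directory into a single raw.zip next to it.
archive = shutil.make_archive(
    base_name=str(root / "raw"),  # -> <root>/raw.zip
    format="zip",
    root_dir=root,
    base_dir="raw",
)
print(Path(archive).name)  # raw.zip
```

Committing the single archive keeps the PR reviewable, and consumers can unpack it with `shutil.unpack_archive` (or any zip tool) before running the notebooks.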

Added max iteration counter in generate_calibration, added descriptions in the markdown sections, reordered figures in the generate_figures notebook
Added file names and potassium phosphate
@sunandascript sunandascript removed the request for review from dgmets February 21, 2025 09:31
@sunandascript sunandascript reopened this Feb 21, 2025
@sunandascript (Collaborator, Author) commented
Thanks @keithchev ! Your comments were very helpful.
I went through and made changes in the repo and the notebooks to address your comments. I added more detail to the README for how to use this, updated the placeholder texts, compressed the data directories, added more detail in the markdown sections of the generate_calibration notebook, removed the outputs, and added counters for the while loops.

@keithchev keithchev self-requested a review February 21, 2025 19:03
@keithchev (Member) left a comment

Looks good!

@sunandascript sunandascript merged commit fae176b into main Feb 21, 2025
2 checks passed
3 participants