Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A unified way of unit conversion #125

Closed
xiki-tempula opened this issue Apr 28, 2021 · 11 comments · Fixed by #129
Closed

A unified way of unit conversion #125

xiki-tempula opened this issue Apr 28, 2021 · 11 comments · Fixed by #129

Comments

@xiki-tempula
Copy link
Collaborator

Currently, the internal unit is always kt. However, when presenting the data, it might be desirable to present the data in kcal/mol or kj/mol. Thus, it would be a good idea to have a discussion on how to do the unit conversion.

Currently, the only place that requires unit conversion is the plotting functions and later on the workflow #114.

My proposal is that the plotting function will be taking a conversion factor that converts kt to the desired unit.

@orbeckst
Copy link
Member

One thing I would like to keep constant is that the internal units are always kT — that's one of the fundamental assumptions. The parsers make sure that the data are in this common format. This means that unit conversion should only happen from kT into something else for downstream processing.

The plotting functions take DataFrames as input so a natural way inside alchemlyb would be to have something like a postprocessor function (say postprocessors.units.to_kcalmol()). I'd say a postprocessor is similar to a preprocessor but creates data transformations for downstream analysis or plotting. @xiki-tempula brought up using DataFrame attributes (DataFrame.attrs – flagged as experimental in their API) to store metadata. We could store the unit as metadata, to check that we don't make a mistake on conversion. We could also store the temperature there, if known, which would make conversion from kT to absolute units less error prone. (However, using attr would require us to carefully check if and how to propagate attr through various functions and estimators. It might be a bigger project if we wanted to keep tracking meta data.)

We could also look into any number of python unit packages although I'd favor a light weight solution, especially as we are really only concerned with energy conversion from a fixed unit system. But if anyone has any ideas please make suggestions.

Our plotting function could call the appropriate unit conversion depending on the value of units — at the moment, that just sets the axes labels but we could use it in a more substantial role.

@dotsdl @davidlmobley , any suggestions?

What do you think @xiki-tempula ?

@davidlmobley
Copy link

I like the idea of carrying around units to avoid problems, so this sounds good.

@xiki-tempula
Copy link
Collaborator Author

So I think the current general workflow would be

Input file (Energy: kT; kcal/mol; kJ/mol, Temperature K)
↓
Preprocessing (Energy: kT)
↓
Estimator (Energy: kT)
↓
Graphic/text output (Energy: kT; kcal/mol; kJ/mol) require the user to rescale the unit.

So the new plan would be

Input file (Energy: kT; kcal/mol; kJ/mol, Temperature K)
↓
Preprocessing (Energy: kT)
↓
Estimator (Energy: kT)
↓
Unit conversion (require Temperature K)
↓
Graphic/text output (Energy: kT; kcal/mol; kJ/mol).

So the Unit conversion could happen through postprocessors.units.to_kcalmol().
The question would then be how to pass the temperature to this function.

The temperature could be added to DataFrame.attrs, which could have problem of propagation from dhdl data frame to d_delta_f_ data frame.

I think one way would be adding temperature as an input postprocessors.units.to_kcalmol(temperature).
Another way would be setting the temperature global variable alchemlyb.envs.temperature.

I'm not quite sure as to how to pass units around.

@orbeckst
Copy link
Member

I had in mind what you sketched out in your new plan.

Originally I was thinking that the Graphic/text output step can do a unit conversion internally (using the same tools that one can use explicitly). However, if we find a way to carry units and temperature around with the df then this can be done automatically.

I do not like to keep global state with global variables. All of alchemlyb is written in a way to avoid keeping global state as much as possible.

@xiki-tempula
Copy link
Collaborator Author

So I would imagine using attrs to pass the temperature around (if the temperature is not set as a global variable).

Input file (Energy: kT; kcal/mol; kJ/mol, Temperature K): extract_u_nk(xvg, T=300) > u_nk.attrs['temp'] = 300
↓
Preprocessing (Energy: kT): statistical_inefficiency(u_nk) > u_nk_sample.attrs['temp'] = 300
↓
Estimator (Energy: kT) > MBAR().fit(u_nk_sample) > mbar.delta_f_.attrs['temp'] = 300; mbar. d_delta_f_.attrs['temp'] = 300 (one might need to make sure that the attribute temp is being passed through)
↓
Graphic/text output (Energy: kT; kcal/mol; kJ/mol). which uses the internal postprocessors.units.to_kcalmol(temperature) to do the unit conversion.

@orbeckst
Copy link
Member

orbeckst commented May 4, 2021

Makes sense to me. In addition to df.attrs['temperature'] (let's write out the name instead of "temp" which could also mena "temporary") I'd then also include df.attrs['unit'] = "kT" for completeness.

The main challenge is going to be what happens when someone manipulates the df externally and the attrs get lost. They do not automatically get copied on slicing a df. I have two ideas:

  1. The simplest solution is to have all alchemlyb functions that input df check the df and raise an error if the attrs are missing. (We might be able to write a decorator for this check.) Then we just need to document which attrs are required.
  2. Alternatively, we can subclass pandas.DataFrame as our own df and then add code to copy attrs.

However, I'd like to keep things simple so I am in favor of 1.

@orbeckst
Copy link
Member

orbeckst commented May 4, 2021

Btw, maybe attrs['unit'] is not very clear, given that dataframes can have different columns. Maybe attrs['energy_unit'] and attrs['time_unit'] (if necessary) would be clearer?

@xiki-tempula
Copy link
Collaborator Author

xiki-tempula commented May 4, 2021

In the Input file layer, the output data frame would have the attributes.

I'm thinking of having a decorator pass the attributes from the input data frame to the output data frame during the preprocessing and estimator layer.
Something similar to

def pass_attrs(func):
    def wrapper(input_dataframe, *args,**kwargs):
        dataframe = func(input_dataframe, *args,**kwargs)
        dataframe.attrs = input_dataframe.attrs
        return dataframe
    return wrapper

The Graphic/text output layer would then just use these attributes to do the work.

I suppose the checking procedure could be implemented to the Graphic/text output layer while preprocessing and estimator layer would not check them as it is not being used. I think it might be better for the functions in the Graphic/text output layer to check the attributes in a case to case manner as the input are quite heterogenous.

@orbeckst
Copy link
Member

I think your plan is good.

@xiki-tempula
Copy link
Collaborator Author

xiki-tempula commented May 15, 2021

I tried to implement this plan in #129

It seems that attr is only supported from pandas 1.0 which doesn't support py2 and py35.
The attr only works as intended from 1.2, which doesn't support py36.

@orbeckst
Copy link
Member

orbeckst commented May 15, 2021 via email

@orbeckst orbeckst added this to the 0.5.0 milestone Jun 10, 2021
orbeckst added a commit that referenced this issue Aug 1, 2021
- state explicitly supported Python on OS
- move Changes to top position and reordered changes by importance
- added explicit item for unit-awareness (#125)
- add date
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants