JTS - Method for describing units for a field #35

rufuspollock · 2013-04-07T09:13:35Z

Say I have real GDP in 2009 £m (i.e. in millions of £ in the year 2009). At present I have no way to specify this.

I propose two new attributes on a field:

"scaler" attribute. Value is a number. The default is 1; for £m it would be one million, i.e. 1000000.
"unit" attribute whose value is a hash with the following properties:

{
   type: "currency",
   value: "GBP",
   # base date in iso 8601 format
   date: "2009"
}
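
A minimal sketch of how a client might read these two attributes, assuming a field descriptor is a plain dictionary. The field id `gdp` and the helpers `to_base_units` and `describe` are made up for illustration and are not part of JTS.

```python
# Hypothetical field descriptor using the proposed "scaler" and "unit" attributes.
field = {
    "id": "gdp",
    "type": "number",
    "scaler": 1000000,  # stored values are in millions
    "unit": {
        "type": "currency",
        "value": "GBP",
        "date": "2009",  # base date in ISO 8601 format
    },
}


def to_base_units(value, field):
    """Multiply a stored value by the field's scaler (default 1)."""
    return value * field.get("scaler", 1)


def describe(field):
    """Build a short label such as 'GBP (2009)' from the unit hash."""
    unit = field.get("unit", {})
    label = unit.get("value", "")
    if "date" in unit:
        label += " (" + unit["date"] + ")"
    return label


print(to_base_units(1.5, field), describe(field))
```

A field without a "scaler" falls back to 1, so existing descriptors would keep working unchanged.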

Concerns

This has the potential for massively increased complexity
Would this not be better as part of a proper dimension description approach (keeping JTS simple)?
Further research into existing work is needed, e.g. SDMX

Floppy · 2013-04-30T18:13:01Z

+1 for some sort of units support. It would need to handle scientific/physical units as well. To be compatible with @rgrp's proposal above, the unit hash for an acceleration could be something like this:

{
    type: 'physical',
    value: 'm/s^2'
}

This is based on the unit strings generated by https://github.com/spatchcock/quantify.

This would be very handy for automatic inspection, presentation and analysis of data.

Floppy · 2013-04-30T18:20:29Z

Perhaps id might be a better field name than value, above. Suggests something unique and precise, which these should certainly be.

rufuspollock · 2013-05-04T14:08:36Z

@Floppy sounds good. Would we have a good enumeration of values? I guess @spatchcock's quantify work is the best candidate.

spatchcock · 2013-05-04T20:02:47Z

A work in progress of mine (with @Floppy) is https://github.com/spatchcock/calcJSON where we extend JTS and add a "unit" attribute. I am not sure it needs to be any more complex than a single string in the vast majority of cases, although currency units are perhaps exceptions to the rule given that their values vary through time (as opposed to, say, the 'metre') and therefore something like a time-stamp is necessary.

Here are my thoughts regardless:

There is no need to specify the dimension or another descriptor such as "physical" or "currency".
There needs to be an agreed list of known units each with a unique identifier.
There needs to be a standard syntax for describing compound units.

Dimensions

Dimensions are part of any unit definition, so it is entirely sufficient to specify the unit; the dimension is implied. I would argue that it is the role of the user (probably via client libraries) to understand how to handle units and any role for dimensions (e.g. in operations, producing compound units, etc.) and the protocol simply needs to describe the unit. "Currency" is not a formal (physical) dimension but it is analogous to a dimension. Still, the specification of, for example, "GBP", would suffice to completely describe the intended meaning.

Valid units and identifiers

I have established a tentative list of units with unique identifiers here https://github.com/spatchcock/quantify/blob/master/lib/quantify/config.rb . In many cases the unique identifier ("label") for each unit is simply the unit symbol (e.g. m, kg, K, J, Pa, etc.). In other cases, where exotic chars are used in the symbols (e.g. "°"), or multiple unit variants exist (e.g. US and UK versions), the label is something different. Many follow JScience, so there is already some precedent for many of them, although I added many others myself. I tried to stay close to the unit symbol where possible, although I think enforcing basic characters and using underscores for whitespace would be good practice. Standard unit symbols are case sensitive, so if they are used as unique identifiers, case sensitivity has to be okay.

Compound unit syntax

Compound units need to describe base unit multiplication, division (denominators) and raising to powers. There are a few ways to do this, e.g.:

Multiplication
kW·h (middot)
kW h (whitespace)
kW x h (multiply symbol)

Power
m2 (crap)
m^2

Denominator
m/s
m s^-1

I think JScience uses the "·", "/" and "^" characters. The Quantify Rubygem supports "·" and white-space for multiplication, "/" or negative powers for denoting denominators and either "^" or superscript characters (2,3 only) for powers. I would suggest not supporting superscript characters (these are only for presentation in Quantify - NOT unique unit labels). I would also suggest using white-space for multiplication rather than "·", which would be okay as long as unit labels use underscores in place of internal white-space so that there is no ambiguity. The "/" for delimiting denominators must only occur once; otherwise parentheses would be needed, which opens the door to very messy unit descriptions.
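
A rough sketch of a parser for the syntax suggested above (white-space for multiplication, "^" for powers, at most one "/"). It is not how Quantify or JScience parse units; the function name `parse_unit` and the label pattern (a letter followed by letters, digits or underscores) are assumptions for illustration.

```python
import re

# One factor: a label with an optional integer power, e.g. "m", "s^-2", "btu_39f".
_FACTOR = re.compile(r"^([A-Za-z][A-Za-z0-9_]*)(?:\^(-?\d+))?$")


def parse_unit(text):
    """Parse a compound unit string into a {label: power} mapping.

    White-space means multiplication, "^" raises to a power, and a single
    "/" separates the numerator from the denominator.
    """
    if text.count("/") > 1:
        raise ValueError("at most one '/' is allowed: %r" % text)
    numerator, _, denominator = text.partition("/")
    powers = {}
    for part, sign in ((numerator, 1), (denominator, -1)):
        for token in part.split():
            match = _FACTOR.match(token)
            if match is None:
                raise ValueError("invalid unit factor: %r" % token)
            label = match.group(1)
            power = int(match.group(2) or 1)
            powers[label] = powers.get(label, 0) + sign * power
    # Drop any factors whose powers cancelled out.
    return {label: p for label, p in powers.items() if p != 0}


print(parse_unit("kg m^2/s^2"))
print(parse_unit("kg m^2 s^-2"))
```

Because both denominator styles reduce to the same mapping, a client could compare units without caring which form the publisher used.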

Examples

"m"	metre
"kW h"	kilowatt hour
"kg m^2 s^-2"	implied joule
"kg m^2/s^2"	implied joule (alternative denominator delimiter)
"J"	joule
"btu_39f/lb"	British thermal units per pound
"btu_39f lb^-1"	British thermal units per pound (alternative denominator delimiter)
"t km"	metric tonne kilometre
"ton_uk km"	imperial ton kilometre
"ton_us km"	US (short) ton kilometre
"deg_c/h"	degrees celsius per hour
"deg_c h^-1"	degrees celsius per hour (alternative denominator delimiter)
"GBP/USD"	exchange rate

Currency

In my view, these strings are all that is needed to uniquely describe a unit (as long as the identifiers and compound syntax are standardised). Since currency requires more metadata (a time-stamp), perhaps that is reason enough to require a more complex structure like those described above. In the vast majority of cases, though, this would reduce to:

{
    value: "m/s^2"
}

so it could be argued that a simple string should be allowed since (I imagine) that will cover most cases (maybe not?). Either way, "date" (or perhaps a metadata object) should probably be optional.
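
A minimal sketch of how a client could accept both forms, assuming the object form uses the "value" and optional "date" keys discussed in this thread. The helper name `normalise_unit` is hypothetical.

```python
def normalise_unit(unit):
    """Return a unit descriptor as a dict, accepting a string or a dict."""
    if isinstance(unit, str):
        # The simple-string form is shorthand for {"value": "<string>"}.
        return {"value": unit}
    if isinstance(unit, dict) and isinstance(unit.get("value"), str):
        # Copy so the caller's descriptor is not mutated later on.
        return dict(unit)
    raise ValueError("unit must be a string or a dict with a 'value' string")


print(normalise_unit("m/s^2"))
print(normalise_unit({"value": "GBP", "date": "2009"}))
```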

Hope that helps!

rufuspollock · 2013-05-04T20:44:26Z

@spatchcock this is awesome. I would like to get this officially in, or at least make it an official extension (do we need a way to have JTS "extensions"?).

Re the currency I should explain that the year part is when people say stuff like: "prices in 2000 dollars" meaning they have also deflated to have everything in the same year prices.

Re the type field: I agree it is redundant, but it might make rendering easier for the average client (e.g. how do I work out that GBP is a currency?)

spatchcock · 2013-05-04T23:07:50Z

@rgrp

My understanding was that JTS requires a few core attributes then allows any additional ones. I guess we can define a "unit protocol" in its own right which could simply be adopted by/used in conjunction with JTS without any changes. Is that correct?

My take on the "type" field was that clients would need to understand how to use units. But then, I am not sure what use cases you have in mind. Rendering data could be one use case, as could using it in calculations/conversions.

For rendering, a user might want to be able to render a humanised unit name or standard symbol (e.g. "Great British Pounds", "£"). For this case, simply labelling the unit as "currency" would not be enough, as the additional information is not included (everyone could handle pounds but not necessarily other currencies). These humanised names/symbols could be specified as part of the protocol, but my question would be whether they therefore need to be described in the data every time or whether they are simply part of the standard, supported by client libraries, etc.

A "currency" type would be useful for rendering rules such as limiting a quantity value to 2 decimal places. But then I wonder what other "types" we would support. My instinct is to think in terms of physical dimensions. There are 7 "base" physical dimensions, from which all other physical dimensions are derived. Would we want to support these (e.g. length, mass, temperature, time...)? It is also conventional to name some "compound" dimensions (e.g. "energy" = mass x length^2 x time^-2; "acceleration" = length x time^-2). Do we support these? Or do we support unnamed compound "types" ("currency per time")? And again, does this "type" actually need to be included in the data or simply inferred from the protocol?

My answer would be that we define the units supported by the protocol and include in the definitions their base dimensions. This way, anyone wanting to comply with the protocol can do so, but the information does not have to be contained within the data description itself - the unit description suffices and everything else is implied and gets delegated to the protocol.
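
A sketch of what "delegating to the protocol" could look like for a client, assuming the protocol publishes a registry that maps each unit identifier to its dimension. The registry contents, `dimension_of` and `decimal_places` are hypothetical, not an agreed list.

```python
# Hypothetical excerpt of a protocol-level registry: identifier -> definition.
UNIT_DIMENSIONS = {
    "m": {"name": "metre", "dimension": "length"},
    "kg": {"name": "kilogram", "dimension": "mass"},
    "s": {"name": "second", "dimension": "time"},
    "GBP": {"name": "pound sterling", "dimension": "currency"},
    "USD": {"name": "United States dollar", "dimension": "currency"},
}


def dimension_of(label):
    """Look up the dimension implied by a unit identifier."""
    try:
        return UNIT_DIMENSIONS[label]["dimension"]
    except KeyError:
        raise ValueError("unknown unit identifier: %r" % label)


def decimal_places(label):
    """Example rendering rule: currencies get 2 decimal places, others no rule."""
    return 2 if dimension_of(label) == "currency" else None


print(dimension_of("GBP"), decimal_places("GBP"), decimal_places("m"))
```

With a registry like this the data only needs to carry "GBP"; the client can work out that it is a currency without a "type" field in every descriptor.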

There are at least 3 types of quantity that I can think of that are not covered in the standard set of physical base dimensions but would be useful regardless: "currency" (e.g. £, $), "information" (e.g. bit, byte) and "item" (e.g. GDP in £ per capita, methane emissions in kg per head of livestock). We could choose to define these if we are going down the route of using dimensions.

The other use case I can think of for data with machine-readable units is using the data in operations (e.g. addition, multiplication,... of quantities) or converting quantities into different units. This, to me, is the real important use case - being able to read in data with possibly a variety of units and perform the same operations on each with units dynamically identified and accounted for. This would certainly require knowledge of dimensions as well as conversion factors amongst other things. Again, I don't think the dimensions (the "type"?) need to be communicated in the data, but are simply implied by the units and compliance with the protocol.
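
To make the conversion use case concrete, here is a small sketch assuming each unit definition carries a dimension and a fixed factor to a reference unit. The table `CONVERSIONS` and the function `convert` are illustrative assumptions; units with offsets (such as temperature scales) or floating rates (currencies) would need more than a single factor.

```python
# Hypothetical definitions: identifier -> (dimension, factor to the reference unit).
CONVERSIONS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "s": ("time", 1.0),
    "h": ("time", 3600.0),
}


def convert(value, source, target):
    """Convert a quantity between two units of the same dimension."""
    source_dimension, source_factor = CONVERSIONS[source]
    target_dimension, target_factor = CONVERSIONS[target]
    if source_dimension != target_dimension:
        raise ValueError(
            "cannot convert %s (%s) to %s (%s)"
            % (source, source_dimension, target, target_dimension)
        )
    return value * source_factor / target_factor


print(convert(2, "km", "m"))
print(convert(7200, "s", "h"))
```

The dimension check is what stops a client from silently converting metres into hours, and it only needs the unit identifiers from the data.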

On the "prices in 2000 dollars" use case: I am not sure what someone might want to do with this data. If they want to operate on it using other non-currency units (e.g. $10 / 1 hr = $10/hr), then that would be okay. If they want to convert it to a different currency or the same currency at a different time, then that would be impossible without more information on the exchange rates. In the latter case, the "date" attribute would certainly convey useful information, as it would if the intention is simply to be able to render the data with the context described accurately. This seems to me to be quite an exceptional requirement in comparison with all other types of quantities (albeit perhaps quite common).

Currency is odd in that the "conversion factors" float. This means that, in a sense, a yr 2000 $ is a different unit to a yr 2012 $ - they require different conversion factors; they mean different things. These types of units are similar to the variants of physical units that exist (e.g. UK/US gallons, UK/US tons) in that they have the same name but a different meaning. However, the other units have a small number of standard, fixed variants whereas currencies have an infinite number of different "meanings" depending on the time period and resolution you choose. The British Thermal Unit is perhaps the closest to this in that it has several definitions based on different experimental temperatures. I can imagine defining the BTU with a metadata field which describes the reference temperature in the same way as a $ might have a year or timestamp. However, BTUs do have a set of standard variants, so the analogy is not too close. I am not sure whether a separate "currency protocol" makes more sense...
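
One way to express "a yr 2000 $ is a different unit to a yr 2012 $" in code is to include the optional date in the comparison. This is only a sketch, assuming unit descriptors are dicts with "value" and optional "date" keys; `same_unit` is a hypothetical helper.

```python
def same_unit(a, b):
    """Two descriptors name the same unit only if both the identifier
    and the optional base date match (a missing date compares as None)."""
    return a.get("value") == b.get("value") and a.get("date") == b.get("date")


usd_2000 = {"value": "USD", "date": "2000"}
usd_2012 = {"value": "USD", "date": "2012"}
metre = {"value": "m"}

print(same_unit(usd_2000, usd_2012))
print(same_unit(metre, {"value": "m"}))
```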

spatchcock · 2013-05-05T21:35:05Z

I have made a start here: https://github.com/spatchcock/dataprotocols/blob/master/source/unit-protocol.markdown

Perfectly happy to extend into a more complex data structure if we think that is required. It would be good to try to think of any other cases (beyond the currency one) where more metadata would be required in association with units.

(I've used markdown for now - not familiar with .rst).

spatchcock · 2013-05-18T13:47:45Z

@rgrp

Any thoughts on the above?

rufuspollock · 2013-05-20T02:35:01Z

@spatchcock I like this :-)

Suggestions:

Let's rst-ify it (markdown and RST are pretty similar - see here for a primer: http://sphinx-doc.org/rest.html#rst-primer)
Seems a good idea to have this as a separate doc. I think a simple units.rst would suffice as a file name (no need for "protocol" here ..)
We can then reference this from JSON table schema :-)

spatchcock · 2013-05-20T07:30:58Z

@rgrp Great. I'll make those changes and raise a pull request.

rufuspollock · 2013-06-23T14:40:23Z

FIXED. See http://www.dataprotocols.org/en/latest/units.html


spatchcock mentioned this issue May 22, 2013

Add units protocol #47

Merged

rufuspollock closed this as completed Jun 23, 2013

rufuspollock mentioned this issue Apr 10, 2020

Units and scales (and currency) in Table Schema #216

Closed

roll added this to Open Knowledge Jun 27, 2023
