Lesson 4: Fix Path and more complex transformations in Fix

In the last sessions we learned how to construct a Metafacture workflow, how to use the Playground and how Metafacture Fix can be used to parse structured information. Today we will go deeper into Metafacture Fix and describe how to pluck data out of structured information.

Today we will fetch a new weather report with the Metafacture Playground:

"https://fcc-weather-api.glitch.me/api/current?lat=50.93414&lon=6.93147"
| open-http
| as-lines
| decode-json
| encode-yaml
| print
;

We also saw in the previous post how you can use Metafacture to transform the JSON format into the YAML format which is easier to read and contains the same information.

We also learned some fixes to retrieve information out of the JSON file like retain("name","main.temp").

In this post we delve a bit deeper into ways to point to fields in a JSON or YAML file:

---
coord:
  lon: "6.9315"
  lat: "50.9341"
weather:
- id: "800"
  main: "Clear"
  description: "clear sky"
  icon: "https://cdn.glitch.com/6e8889e5-7a72-48f0-a061-863548450de5%2F01d.png?1499366022009"
base: "stations"
main:
  temp: "12.56"
  feels_like: "11.93"
  temp_min: "11.62"
  temp_max: "14.68"
  pressure: "1022"
  humidity: "79"
visibility: "10000"
wind:
  speed: "1.03"
  deg: "0"
clouds:
  all: "0"
dt: "1654153727"
sys:
  type: "2"
  id: "43069"
  country: "DE"
  sunrise: "1654140191"
  sunset: "1654198658"
timezone: "7200"
id: "2886242"
name: "Cologne"
cod: "200"

main.temp is called a path. It is JSON Path-like and points to the part of the data set (here our YAML record) that you are interested in. The data, as shown above, is structured like a tree.

There are simple top-level fields like base, cod, dt and id, which contain only text values or numbers. Depending on the context, simple fields are also called elements, properties, attributes or keys.

There are also fields like coord that contain a deeper structure, here lat and lon. Nested elements that contain one or more subfields or subelements are also called objects or hashes.

Metafacture Fix uses Fix Path, a path syntax that is JSON Path-like but not identical. It also uses dot notation, but the path structure differs for arrays and repeated fields, which matters especially when working with JSON or YAML.

Using a JSON path you can point to every part of the JSON file using a dot-notation. For simple top level fields the path is just the name of the field:

  • base
  • cod
  • dt
  • id
  • name

For the nested objects with deeper structure you add a dot . to point to the subfields:

  • clouds.all
  • coord.lat
  • coord.lon
  • main.temp
  • etc…

For example, if you had a deeply nested structure like this object:

x:
  y:
    z:
      a:
        b:
          c: Hello :-)

Then the path to reference the c field would be x.y.z.a.b.c.

So let's do some simple exercises:
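To make the dot notation concrete, here is a minimal Python sketch that follows a dot-separated path through nested dictionaries. It is only a toy model of the idea, not how Metafacture resolves paths internally; the helper name get_path is made up for this illustration.

```python
def get_path(record, path):
    """Follow a dot-separated path through nested dicts (toy model, not Metafacture code)."""
    current = record
    for key in path.split("."):
        current = current[key]  # raises KeyError if a path segment does not exist
    return current


# The deeply nested example from above, written as Python dicts.
record = {"x": {"y": {"z": {"a": {"b": {"c": "Hello :-)"}}}}}}

print(get_path(record, "x.y.z.a.b.c"))  # prints: Hello :-)
```

Each dot in the path corresponds to one step down the tree, which is why a path that stops early, such as x.y.z, points to an object rather than to a simple value.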

Try and complete the fix functions. Transform the element a into title and combine the subfields of b and c to the element author.

Answer [See here](https://metafacture.org/playground/?flux=inputFile%0A%7Copen-file%0A%7Cas-records%0A%7Cdecode-yaml%0A%7Cfix%28transformationFile%29%0A%7Cencode-json%28prettyPrinting%3D%22true%22%29%0A%7Cprint%0A%3B&transformation=move_field%28%22a%22%22%2C+%22title%22%29%0Apaste%28%22author%22%2C+%22...%22%2C+...%2C+%22~from%22%2C+...%29%0Aretain%28%22title%22%2C+%22author%22%29&data=---%0Aa%3A+Faust%0Ab+%3A%0A++ln%3A+Goethe%0A++fn%3A+JW%0Ac%3A+Weimar%0A%0A---%0Aa%3A+R%C3%A4uber%0Ab+%3A%0A++++ln%3A+Schiller%0A++++fn%3A+F%0Ac%3A+Weimar)

There are two extra path structures that need to be explained:

  • repeatable fields
  • arrays

In a data set an element can sometimes have multiple instances. Different data models handle this differently. In XML records every element can occur multiple times: element repetition is possible, and many schemas (partly) allow it. E.g. the subject element exists three times:

<subject>Metadata</subject>
<subject>Datatransformation</subject>
<subject>ETL</subject>

Repeatable elements also exist e.g. in JSON and YAML but are unusual:

creator: Justus
creator: Peter
creator: Bob

In our two examples the subject and creator elements each exist three times. To point to one of the elements you need to use an index. The index is one-based: the first element has index 1, the second index 2, the third index 3. So the path of the creator Bob would be creator.3. (This is a main difference between Catmandu and Metafacture, because Catmandu has a zero-based index.)

If you want to refer to all creators, you can use the array wildcard *, which replaces the concrete index number: creator.* refers to all creator elements. You can also select the first instance with the array wildcard $first and the last instance with $last. This is especially handy if you do not know how often an element is repeated. When adding an additional repeated element you usually use the $append wildcard.
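The one-based index and the wildcards can be sketched in a few lines of Python. This is a toy model under the assumption that a repeated field is held as a simple list; the function name select is hypothetical, and $append is left out because it creates a new element instead of selecting an existing one.

```python
def select(values, selector):
    """Pick values of a repeated field by a Fix-like selector (toy model, not Metafacture code)."""
    if selector == "*":
        return list(values)   # all instances
    if selector == "$first":
        return values[:1]     # first instance, empty list if there is none
    if selector == "$last":
        return values[-1:]    # last instance, empty list if there is none
    index = int(selector)     # concrete index, one-based as in Fix paths
    if index < 1:
        raise IndexError("Fix-style indexes start at 1")
    return [values[index - 1]]


creators = ["Justus", "Peter", "Bob"]

print(select(creators, "3"))      # prints: ['Bob']
print(select(creators, "$last"))  # prints: ['Bob']
print(select(creators, "*"))      # prints: ['Justus', 'Peter', 'Bob']
```

Note the subtraction in the last line of the function: Python lists are zero-based, so creator.3 in Fix corresponds to position 2 in Python.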

Append the correct last name to each of the three investigators so that they read Justus Jonas, Peter Shaw and Bob Andrews. Also prepend Investigator to all of them.

Answer [See here](https://metafacture.org/playground/?flux=inputFile%0A%7Copen-file%0A%7Cas-records%0A%7Cdecode-yaml%0A%7Cfix%28transformationFile%29%0A%7Cencode-json%28prettyPrinting%3D%22true%22%29%0A%7Cprint%0A%3B&transformation=append%28%22creator.1%22%2C%22+Jonas%22%29%0Aappend%28%22creator.2%22%2C%22+Shaw%22%29%0Aappend%28%22creator.3%22%2C%22+Andrews%22%29%0Aprepend%28%22creator.%2A%22%2C%22Investigator+%22%29&data=---%0Acreator%3A+Justus%0Acreator%3A+Peter%0Acreator%3A+Bob%0A)

In JSON or YAML element repetition is possible but unusual. Instead of repeating elements, repetition is modelled as a list, so that one element can have more than one value. This is called an array. Our example from above would look like this in YAML if creator was a list instead of a repeated field:

creator:
  - Justus
  - Peter
  - Bob

or:

my:
  colors:
    - black
    - red
    - yellow

Lists can also be deeply nested if they are not just lists of strings (array of strings) but lists of objects (array of objects).

characters:
  - name: Justus
    role: Investigator
  - name: Peter
    role: Investigator
  - name: Bob
    role: Research & Archive

In the examples above you see a field my which contains a deeper field colors, which has three values. To point to one of the colors you need to use an index, but genuine arrays also carry a marker in Metafacture: []. Here too the index is one-based: the first element of an array has index 1, the second index 2, the third index 3. The array markers are generated by the JSON decoder and the YAML decoder. Likewise, if you want to generate an array in the target format, you need to add [] at the end of the list element, as in newArray[]. (While the path handling of Catmandu and Metafacture is similar so far, the two differ at this point.)

So the path of red would be my.colors[].2, and the path for Peter would be characters[].2.name.

There is one array in the JSON weather report from the beginning of this lesson: the weather field. To point to the description of the weather you need the path weather[].1.description.
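As a rough illustration of how such paths can be read, the following Python sketch walks a path that mixes field names, [] array markers and one-based indexes. It is a toy model only, not Metafacture's implementation; the function name resolve is hypothetical, and the sample data is abridged from the examples in this lesson.

```python
def resolve(record, path):
    """Resolve a Fix-like path with [] markers and one-based indexes (toy model)."""
    current = record
    for segment in path.split("."):
        if segment.endswith("[]"):
            # Array marker: the field must hold a list.
            current = current[segment[:-2]]
            if not isinstance(current, list):
                raise TypeError(f"{segment} is marked as an array but is not a list")
        elif isinstance(current, list):
            index = int(segment)  # one-based position inside the array
            if index < 1:
                raise IndexError("Fix-style indexes start at 1")
            current = current[index - 1]
        else:
            current = current[segment]
    return current


report = {"weather": [{"id": "800", "main": "Clear", "description": "clear sky"}]}
colors = {"my": {"colors": ["black", "red", "yellow"]}}
series = {"characters": [{"name": "Justus"}, {"name": "Peter"}, {"name": "Bob"}]}

print(resolve(report, "weather[].1.description"))  # prints: clear sky
print(resolve(colors, "my.colors[].2"))            # prints: red
print(resolve(series, "characters[].2.name"))      # prints: Peter
```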

| elements | objects | array/repeated field |
|---|---|---|
| need path | need dots to mark nested structure | need index/array-wildcards to refer to specific position |
| id | title.subtitle | author.*.firstName |
| name | very.nested.element | my.color.2 |

Exercise:

Only retain the elements of title, the element of the series and the role of Bob Andrews. You have to identify the paths for said elements.

TODO: Solution

Again append the last names to the specific characters so that they read Justus Jonas, Peter Shaw and Bob Andrews. Also add a field "type":"Person" to each character.

Answer [See here](https://metafacture.org/playground/?flux=inputFile%0A%7Copen-file%0A%7Cas-records%0A%7Cdecode-yaml%0A%7Cfix%28transformationFile%29%0A%7Cencode-json%28prettyPrinting%3D%22true%22%29%0A%7Cprint%0A%3B&transformation=append%28%22characters%5B%5D.1.name%22%2C%22+Jonas%22%29%0Aappend%28%22characters%5B%5D.2.name%22%2C%22+Shaw%22%29%0Aappend%28%22characters%5B%5D.3.name%22%2C%22+Andrews%22%29%0Aadd_field%28%22characters%5B%5D.%2A.type%22%2C%22+Andrews%22%29&data=---%0Acharacters%3A+%0A++-+name%3A+Justus%0A++++role%3A+Investigator%0A++-+name%3A+Peter%0A++++role%3A+Investigator%0A++-+name%3A+Bob%0A++++role%3A+Research+%26+Archive%0A)

In this post we learned the JSON Path-like Fix Path syntax and how it can be used to point to the parts of a data set that you want to manipulate. We explained Fix Path using a YAML transformation as an example because YAML is easier to read.

Especially when working with complex bibliographic data, you have to get to know the paths so that you do not have to guess the path to a certain element.

There are multiple ways to find out the path names of records:

E.g. here is a way to show paths in combination with values.

Here is a way to collect and count all paths in all records by using the list-fix-paths-command.
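To show what collecting and counting paths means, here is a small Python sketch that walks a few records and counts how often each leaf path occurs. It only imitates the idea; the output format of the real list-fix-paths command may differ. The function name iter_paths is made up for this illustration, and the records are taken from the exercise data used earlier in this lesson.

```python
from collections import Counter


def iter_paths(node, prefix=""):
    """Yield a Fix-style path for every leaf value below node (toy model)."""
    if isinstance(node, dict):
        for key, value in node.items():
            marker = "[]" if isinstance(value, list) else ""  # mark genuine arrays
            yield from iter_paths(value, f"{prefix}{key}{marker}.")
    elif isinstance(node, list):
        for position, item in enumerate(node, start=1):  # one-based index
            yield from iter_paths(item, f"{prefix}{position}.")
    else:
        yield prefix[:-1]  # leaf reached: drop the trailing dot


records = [
    {"a": "Faust", "b": {"ln": "Goethe", "fn": "JW"}, "c": "Weimar"},
    {"a": "Räuber", "b": {"ln": "Schiller", "fn": "F"}, "c": "Weimar"},
    {"characters": [
        {"name": "Justus", "role": "Investigator"},
        {"name": "Peter", "role": "Investigator"},
        {"name": "Bob", "role": "Research & Archive"},
    ]},
]

counts = Counter(path for record in records for path in iter_paths(record))
for path, count in sorted(counts.items()):
    print(path, count)
```

Paths such as a and b.ln are counted twice because they occur in two records, while every characters[] path occurs once.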

Other ways are possible too.

Bonus: XML in Metafacture and its paths

<title>This is the title</title>

The path for the value This is the title is not title but title.value

XML does not just consist of simple elements with key-value pairs or objects with subfields; each element can also have additional attributes. In Metafacture the XML decoder (decode-xml with handle-generic-xml) groups the attributes and the value as subfields of an object.

<title type="mainTitle" lang="eng">This is the title</title>

The paths for the different attributes and the element value are the following:

title.value
title.type
title.lang
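The grouping of attributes and text value into one object can be mimicked with Python's standard XML parser. This is only a sketch of the resulting structure, not of how decode-xml and handle-generic-xml work internally; the helper name element_to_fields is hypothetical, and child elements are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def element_to_fields(element):
    """Group an element's attributes and its text under one object (toy model)."""
    fields = dict(element.attrib)    # every attribute becomes a subfield
    fields["value"] = element.text   # the element text is stored as "value"
    return {element.tag: fields}


title = ET.fromstring('<title type="mainTitle" lang="eng">This is the title</title>')
record = element_to_fields(title)

print(record["title"]["value"])  # prints: This is the title
print(record["title"]["type"])   # prints: mainTitle
print(record["title"]["lang"])   # prints: eng
```

Seen this way, title.value, title.type and title.lang are ordinary dot paths into an object whose subfields happen to come from XML attributes and element text.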

If you want to create XML with attributes, you need to map to this structure too. We will come back to working with XML in lesson 10.

Next lessons: 05 More Fix Concepts