wip
dagardner-nv committed Sep 19, 2022
1 parent b45832d commit 9e2da5e
Showing 1 changed file with 44 additions and 6 deletions.
50 changes: 44 additions & 6 deletions docs/source/developer_guide/guides/5_digital_fingerprinting.md
@@ -239,12 +239,9 @@ from dfp.utils.column_info import DataFrameInputSchema
from dfp.utils.column_info import DateTimeColumn
```
```python
-schema = DataFrameInputSchema(json_columns=['event'],
-                              column_info=[
-                                  DateTimeColumn(name=config.ae.timestamp_column_name,
-                                                 dtype=datetime,
-                                                 input_name='event.timestamp')
-                              ])
+schema = DataFrameInputSchema(
+    json_columns=['event'],
+    column_info=[DateTimeColumn(name=config.ae.timestamp_column_name, dtype=datetime, input_name='event.timestamp')])
```

In the above examples, three operations were performed:
@@ -261,6 +258,47 @@ The `DFPFileToDataFrameStage` is executed first and is responsible for flattenin
| `preserve_columns` | `List[str]` or `str` | Optional regular expression string or list of regular expression strings that define columns in the input data which should be preserved in the output `DataFrame`. By default this is an empty `list`. |
| `row_filter` | `function` or `None` | Optional function to be called after all other processing has been performed. This function receives the `DataFrame` as its only argument and returns a `DataFrame`. |
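
As a rough illustration of the last two arguments, the sketch below uses plain Python, with a list of row dicts standing in for a `DataFrame`, so that it runs without any dependencies. The helper `select_preserved`, the use of `re.match`, and the column names are assumptions made for this example, not the stage's actual implementation. Only the idea that `preserve_columns` holds one or more regular expressions, and that `row_filter` receives and returns the data, comes from the table above.

```python
import re


def select_preserved(column_names, preserve_columns):
    """Return the column names matching any of the given regular expressions.

    Hypothetical helper: accepts a single pattern or a list of patterns,
    mirroring the `List[str]` or `str` type of `preserve_columns`.
    """
    if isinstance(preserve_columns, str):
        preserve_columns = [preserve_columns]
    patterns = [re.compile(p) for p in preserve_columns]
    # re.match anchors at the start of the name (an assumption of this sketch).
    return [name for name in column_names if any(p.match(name) for p in patterns)]


def row_filter(rows):
    """Example filter: keep only rows whose `result` field is not empty."""
    return [row for row in rows if row.get("result")]


columns = ["timestamp", "user_id", "_batch_id", "_row_hash"]
preserved = select_preserved(columns, r"_.*")

rows = [{"user_id": "a", "result": "OK"}, {"user_id": "b", "result": ""}]
filtered = row_filter(rows)
```

With a real `DataFrame`, a `row_filter` of this kind would typically return a filtered `DataFrame` rather than a list.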

#### ColumnInfo Classes
The `ColumnInfo` class and its subclasses each define a single column.

##### ColumnInfo
Defines a single column and type-cast.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the column |
| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) |

##### CustomColumn
Defines a column to be computed by a user-defined function `process_column_fn`.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the column |
| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) |
| `process_column_fn` | `function` | Function which receives the entire `DataFrame` as its only input, returning a new [`pandas.Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) object to be stored in column `name`. |
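
To make the contract of `process_column_fn` concrete, here is a dependency-free sketch in which a dict of equal-length lists stands in for the `DataFrame` and a list stands in for the returned `pandas.Series`. The column names `username` and `domain` and the function `build_email` are hypothetical.

```python
def build_email(frame):
    """Receive the whole frame and return one new column of values.

    `frame` is a dict of column name -> list of values, a stand-in for a
    DataFrame; the returned list is a stand-in for a pandas.Series.
    """
    return [f"{user}@{domain}" for user, domain in zip(frame["username"], frame["domain"])]


frame = {
    "username": ["alice", "bob"],
    "domain": ["example.com", "example.org"],
}

# Store the computed values under the destination column name.
frame["email"] = build_email(frame)
```

With the real class, a function of this shape would be passed as the `process_column_fn` argument and would operate on `DataFrame` columns instead of lists.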

##### RenameColumn
Similar to `ColumnInfo` but adds the ability to also perform a rename.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the destination column |
| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) |
| `input_name` | `str` | Original column name |
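
The combination of a rename and a type-cast can be pictured with the following sketch. It is not the `RenameColumn` implementation: the helper `rename_and_cast`, the dict-of-lists stand-in for a `DataFrame`, the sample column names, and the choice to drop the original column from the result are all assumptions made for illustration.

```python
def rename_and_cast(frame, input_name, name, dtype):
    """Return a copy of `frame` with `input_name` cast via `dtype` and stored as `name`.

    Hypothetical helper; `frame` is a dict of column name -> list of values.
    """
    # Copy every column except the one being renamed.
    result = {key: list(values) for key, values in frame.items() if key != input_name}
    # Cast each value and store it under the destination name.
    result[name] = [dtype(value) for value in frame[input_name]]
    return result


frame = {"event.status_code": ["200", "404"], "user": ["alice", "bob"]}
renamed = rename_and_cast(frame, input_name="event.status_code", name="status_code", dtype=int)
```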

##### BoolColumn
Subclass of `RenameColumn` that adds the ability to map a set of custom values to boolean values. For example, say we had a string input field containing one of five possible enum values: `OK`, `SUCCESS`, `DENIED`, `CANCELED` and `EXPIRED`. We could map these values into a single boolean field as:
```python
from dfp.utils.column_info import BoolColumn
```
```python
field = BoolColumn(name="result",
                   dtype=bool,
                   input_name="result",
                   true_values=["OK", "SUCCESS"],
                   false_values=["DENIED", "CANCELED", "EXPIRED"])
```

We used strings in this example; however, we could just as easily have mapped integer status codes.
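
The mapping itself can be illustrated with integer codes in a few lines of plain Python. This is a sketch of the idea only, not the `BoolColumn` implementation: the helper `map_to_bool`, the specific status codes, and the decision to return `None` for values that appear in neither list are assumptions of this example.

```python
def map_to_bool(values, true_values, false_values):
    """Map each value to True or False; values in neither list become None.

    Returning None for unmapped values is a choice made for this sketch.
    """
    value_map = {value: True for value in true_values}
    value_map.update({value: False for value in false_values})
    return [value_map.get(value) for value in values]


status_codes = [200, 403, 201, 500, 418]
success = map_to_bool(status_codes, true_values=[200, 201], false_values=[403, 500])
```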

### Output Stages
![Output Stages](img/dfp_output_config.png)