Rename variables and update comments to be generalized
dagardner-nv committed Sep 19, 2022
1 parent 9e2da5e commit bb82168
Showing 2 changed files with 54 additions and 4 deletions.
52 changes: 51 additions & 1 deletion docs/source/developer_guide/guides/5_digital_fingerprinting.md
@@ -298,7 +298,57 @@ field = BoolColumn(name="result",
false_values=["DENIED", "CANCELED", "EXPIRED"])
```

We used strings in this example, however we also could have just as easily mapped integer status codes.
We used strings in this example; however, we could just as easily have mapped integer status codes. We can also map onto types other than boolean by providing custom values for true and false (e.g. `1`/`0`, `yes`/`no`).

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the destination column |
| `dtype` | `str` or Python type | Typically this should be `bool`; however, it could be another type if `true_value` and `false_value` are specified. |
| `input_name` | `str` | Original column name |
| `true_value` | Any | Optional value to store for true values; should be of type `dtype`. Defaults to `True`. |
| `false_value` | Any | Optional value to store for false values; should be of type `dtype`. Defaults to `False`. |
| `true_values` | `List[str]` | List of string values to be interpreted as true. |
| `false_values` | `List[str]` | List of string values to be interpreted as false. |
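
The mapping described above can be sketched in plain pandas. This is an illustrative approximation only, not the Morpheus implementation: `map_status` is a hypothetical helper, the `SUCCESS` and `PENDING` status strings are made up for the example, and leaving unlisted values as missing is an assumption rather than documented `BoolColumn` behavior.

```python
import pandas as pd


def map_status(series, true_values, false_values, true_value=True, false_value=False):
    """Hypothetical helper illustrating the mapping; not the Morpheus implementation."""
    mapping = {value: true_value for value in true_values}
    mapping.update({value: false_value for value in false_values})
    # Assumption: values found in neither list are left as missing (NaN)
    return series.map(mapping)


results = pd.Series(["SUCCESS", "DENIED", "EXPIRED", "PENDING"])
false_values = ["DENIED", "CANCELED", "EXPIRED"]

# Default behavior: map onto True/False
as_bool = map_status(results, ["SUCCESS"], false_values)

# Custom stored values, analogous to passing `true_value` and `false_value`
as_text = map_status(results, ["SUCCESS"], false_values, true_value="yes", false_value="no")
```

Passing custom stored values is what allows a `dtype` other than `bool`, as long as the stored values match that type.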

##### DateTimeColumn
Subclass of `RenameColumn` specific to casting UTC localized datetime values. When incoming values contain a time-zone offset string, the values are converted to UTC; values without a time zone are assumed to be UTC.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the destination column |
| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) |
| `input_name` | `str` | Original column name |
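
The UTC handling described above resembles what `pd.to_datetime(..., utc=True)` does. The sketch below uses only pandas and made-up timestamps; it assumes, without confirming, that `DateTimeColumn` behaves the same way.

```python
import pandas as pd

# Values carrying an explicit offset are converted to UTC
with_offset = pd.Series(["2022-09-19T12:00:00+02:00", "2022-09-19T08:30:00-04:00"])
utc_from_offset = pd.to_datetime(with_offset, utc=True)

# Values without an offset are treated as already being in UTC
naive = pd.Series(["2022-09-19T12:00:00", "2022-09-19T08:30:00"])
utc_from_naive = pd.to_datetime(naive, utc=True)
```

Both results are time-zone-aware UTC series, so downstream comparisons and period calculations work on a single consistent clock.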

##### StringJoinColumn
Subclass of `RenameColumn` that converts incoming `list` values to a string by joining the elements with `sep`.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the destination column |
| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) |
| `input_name` | `str` | Original column name |
| `sep` | `str` | Separator string to use for the join |
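
As a rough pandas-only illustration of the join (not the Morpheus implementation), `Series.str.join` produces the same kind of result. The `roles` data and the `sep` value here are made up for the example.

```python
import pandas as pd

# Each element of the incoming column is a list of strings
roles = pd.Series([["admin", "dev"], ["guest"], []])

sep = ","
joined = roles.str.join(sep)
```

An empty list becomes an empty string, which may or may not be what a pipeline wants for missing data.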

##### StringCatColumn
Concatenate values from multiple columns into a new string column, separated by `sep`.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the destination column |
| `dtype` | `str` or Python type | Any type string or Python class recognized by [Pandas](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) |
| `input_columns` | `List[str]` | List of columns to concatenate |
| `sep` | `str` | Separator string |
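
A minimal pandas sketch of this kind of concatenation follows. It is an approximation for illustration, not the Morpheus implementation, and the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Oslo"],
    "country": ["US", "NO"],
    "zone": [5, 1],
})

input_columns = ["city", "country", "zone"]
sep = ", "

# Cast every input column to string first so non-string values can be joined
location = df[input_columns].astype(str).apply(sep.join, axis=1)
```

Casting to string up front is a design choice of this sketch; it keeps the join from failing on numeric columns.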

##### IncrementColumn
Subclass of `DateTimeColumn`, counts the unique occurrences of a value in `groupby_column` over a specific time window `period` based on dates in the `input_name` field.

| Argument | Type | Description |
| -------- | ---- | ----------- |
| `name` | `str` | Name of the destination column |
| `dtype` | `str` or Python type | Should be `int` or other integer class |
| `input_name` | `str` | Original column name containing timestamp values |
| `groupby_column` | `str` | Column name to group by |
| `period` | `str` | Optional time period to perform the calculation over; the value must be [one of pandas' offset strings](https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases). Defaults to `D` (one day). |
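
The per-period count can be sketched with the same pandas calls that appear in `process_column` in this commit (`dt.to_period`, `groupby`, `cumcount`). The `username` and `timestamp` columns and their values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "username": ["alice", "alice", "bob", "alice", "alice", "bob"],
    "timestamp": pd.to_datetime([
        "2022-09-19 08:00",
        "2022-09-19 09:00",
        "2022-09-19 10:00",
        "2022-09-19 23:00",
        "2022-09-20 08:00",
        "2022-09-20 09:00",
    ]),
})

# Bucket each timestamp into a one-day period ("D" is the documented default)
period = df["timestamp"].dt.to_period("D")

# Zero-based running count of rows per (username, period) group, in row order
df["logcount"] = df.groupby(["username", period]).cumcount()
```

Because `cumcount` is a zero-based running count, the first event for a given group in each period gets `0`, and the count restarts when a new period begins.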

### Output Stages
![Output Stages](img/dfp_output_config.png)
@@ -159,10 +159,10 @@ class IncrementColumn(DateTimeColumn):
    period: str = "D"

    def process_column(self, df: pd.DataFrame) -> pd.Series:
        per_day = super().process_column(df).dt.to_period(self.period)
        period = super().process_column(df).dt.to_period(self.period)

        # Create the per-user, per-day log count
        return df.groupby([self.groupby_column, per_day]).cumcount()
        # Create the `groupby_column`, per-period log count
        return df.groupby([self.groupby_column, period]).cumcount()


@dataclasses.dataclass