Skip to content

Commit

Permalink
PPL geoip function (#871)
Browse files Browse the repository at this point in the history
* geoip function implementation

Signed-off-by: Kenrick Yap <14yapkc1@gmail.com>

* Fixed integration tests

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

* linting

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

* addressing PR comments (added addtional integ tests, doc changes)

Signed-off-by: Kenrick Yap <14yapkc1@gmail.com>

* fixed new integ tests

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

* addressing pr comments

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

* address review comments

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

* moved validateGeoIpProperty to relevant class

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

* updated scalaudf function descriptions

Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>

---------

Signed-off-by: Kenrick Yap <14yapkc1@gmail.com>
Signed-off-by: Kenrick Yap <kenrick.yap@improving.com>
Signed-off-by: kenrickyap <121634635+kenrickyap@users.noreply.github.com>
Co-authored-by: Kenrick Yap <14yapkc1@gmail.com>
  • Loading branch information
kenrickyap and 14yapkc1 authored Dec 19, 2024
1 parent 957de4e commit 20ef890
Show file tree
Hide file tree
Showing 15 changed files with 1,314 additions and 34 deletions.
65 changes: 64 additions & 1 deletion docs/ppl-lang/functions/ppl-ip.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,4 +32,67 @@ Note:
- `ip` can be an IPv4 or an IPv6 address
- `cidr` can be an IPv4 or an IPv6 block
- `ip` and `cidr` must be either both IPv4 or both IPv6
- `ip` and `cidr` must both be valid and non-empty/non-null
- `ip` and `cidr` must both be valid and non-empty/non-null

### `GEOIP`

**Description**

`GEOIP(ip[, property]...)` retrieves geospatial data corresponding to the provided `ip`.

**Argument type:**
- `ip` is string be **STRING** representing an IPv4 or an IPv6 address.
- `property` is **STRING** and must be one of the following:
- `COUNTRY_ISO_CODE`
- `COUNTRY_NAME`
- `CONTINENT_NAME`
- `REGION_ISO_CODE`
- `REGION_NAME`
- `CITY_NAME`
- `TIME_ZONE`
- `LOCATION`
- Return type:
- **STRING** if one property given
- **STRUCT_TYPE** if more than one or no property is given

Example:

_Without properties:_

os> source=ips | eval a = geoip(ip) | fields ip, a
fetched rows / total rows = 2/2
+---------------------+-------------------------------------------------------------------------------------------------------+
|ip |lol |
+---------------------+-------------------------------------------------------------------------------------------------------+
|66.249.157.90 |{JM, Jamaica, North America, 14, Saint Catherine Parish, Portmore, America/Jamaica, 17.9686,-76.8827} |
|2a09:bac2:19f8:2ac3::|{CA, Canada, North America, PE, Prince Edward Island, Charlottetown, America/Halifax, 46.2396,-63.1355}|
+---------------------+-------+------+-------------------------------------------------------------------------------------------------------+

_With one property:_

os> source=users | eval a = geoip(ip, COUNTRY_NAME) | fields ip, a
fetched rows / total rows = 2/2
+---------------------+-------+
|ip |a |
+---------------------+-------+
|66.249.157.90 |Jamaica|
|2a09:bac2:19f8:2ac3::|Canada |
+---------------------+-------+

_With multiple properties:_

os> source=users | eval a = geoip(ip, COUNTRY_NAME, REGION_NAME, CITY_NAME) | fields ip, a
fetched rows / total rows = 2/2
+---------------------+---------------------------------------------+
|ip |a |
+---------------------+---------------------------------------------+
|66.249.157.90 |{Jamaica, Saint Catherine Parish, Portmore} |
|2a09:bac2:19f8:2ac3::|{Canada, Prince Edward Island, Charlottetown}|
+---------------------+---------------------------------------------+

Note:
- To use `geoip` user must create spark table containing geo ip location data. Instructions to create table can be found [here](../../opensearch-geoip.md).
- `geoip` command by default expects the created table to be called `geoip_ip_data`.
- if a different table name is desired, can set `spark.geoip.tablename` spark config to new table name.
- `ip` can be an IPv4 or an IPv6 address.
- `geoip` commands will always calculated first if used with other eval functions.
59 changes: 59 additions & 0 deletions docs/ppl-lang/planning/ppl-geoip-command.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
## geoip syntax proposal

geoip function to add information about the geographical location of an IPv4 or IPv6 address

**Implementation syntax**
- `... | eval geoinfo = geoip(ipAddress *[,properties])`
- generic syntax
- `... | eval geoinfo = geoip(ipAddress)`
- retrieves all geo data
- `... | eval geoinfo = geoip(ipAddress, city, location)`
- retrieve only city, and location

**Implementation details**
- Current implementation requires user to have created a geoip table. Geoip table has the following schema:

```SQL
CREATE TABLE geoip (
cidr STRING,
country_iso_code STRING,
country_name STRING,
continent_name STRING,
region_iso_code STRING,
region_name STRING,
city_name STRING,
time_zone STRING,
location STRING,
ip_range_start BIGINT,
ip_range_end BIGINT,
ipv4 BOOLEAN
)
```

- `geoip` is resolved by performing a join on said table and projecting the resulting geoip data as a struct.
- an example of using `geoip` is equivalent to running the following SQL query:

```SQL
SELECT source.*, struct(geoip.country_name, geoip.city_name) AS a
FROM source, geoip
WHERE geoip.ip_range_start <= ip_to_int(source.ip)
AND geoip.ip_range_end > ip_to_int(source.ip)
AND geoip.ip_type = is_ipv4(source.ip);
```
- in the case that only one property is provided in function call, `geoip` returns string of specified property instead:

```SQL
SELECT source.*, geoip.country_name AS a
FROM source, geoip
WHERE geoip.ip_range_start <= ip_to_int(source.ip)
AND geoip.ip_range_end > ip_to_int(source.ip)
AND geoip.ip_type = is_ipv4(source.ip);
```

**Future plan for additional data-sources**

- Currently only using pre-existing geoip table defined within spark is possible.
- There is future plans to allow users to specify data sources:
- API data sources - if users have their own geoip provided will create ability for users to configure and call said endpoints
- OpenSearch geospatial client - once geospatial client is published we can leverage client to utilize OpenSearch geo2ip functionality.
- Additional datasource connection params will be provided through spark config options.
Original file line number Diff line number Diff line change
Expand Up @@ -771,6 +771,79 @@ trait FlintSparkSuite extends QueryTest with FlintSuite with OpenSearchSuite wit
| """.stripMargin)
}

protected def createGeoIpTestTable(testTable: String): Unit = {
sql(s"""
| CREATE TABLE $testTable
| (
| ip STRING,
| ipv4 STRING,
| isValid BOOLEAN
| )
| USING $tableType $tableOptions
|""".stripMargin)

sql(s"""
| INSERT INTO $testTable
| VALUES ('66.249.157.90', '66.249.157.90', true),
| ('2a09:bac2:19f8:2ac3::', 'Given IPv6 is not mapped to IPv4', true),
| ('192.168.2.', '192.168.2.', false),
| ('2001:db8::ff00:12:', 'Given IPv6 is not mapped to IPv4', false)
| """.stripMargin)
}

protected def createGeoIpTable(): Unit = {
sql(s"""
| CREATE TABLE geoip
| (
| cidr STRING,
| country_iso_code STRING,
| country_name STRING,
| continent_name STRING,
| region_iso_code STRING,
| region_name STRING,
| city_name STRING,
| time_zone STRING,
| location STRING,
| ip_range_start DECIMAL(38,0),
| ip_range_end DECIMAL(38,0),
| ipv4 BOOLEAN
| )
| USING $tableType $tableOptions
|""".stripMargin)

sql(s"""
| INSERT INTO geoip
| VALUES (
| '66.249.157.0/24',
| 'JM',
| 'Jamaica',
| 'North America',
| '14',
| 'Saint Catherine Parish',
| 'Portmore',
| 'America/Jamaica',
| '17.9686,-76.8827',
| 1123654912,
| 1123655167,
| true
| ),
| (
| '2a09:bac2:19f8::/45',
| 'CA',
| 'Canada',
| 'North America',
| 'PE',
| 'Prince Edward Island',
| 'Charlottetown',
| 'America/Halifax',
| '46.2396,-63.1355',
| 55878094401180025937395073088449675264,
| 55878094401189697343951990121847324671,
| false
| )
| """.stripMargin)
}

protected def createNestedJsonContentTable(tempFile: Path, testTable: String): Unit = {
val json =
"""
Expand Down
Loading

0 comments on commit 20ef890

Please sign in to comment.