PPL geoip function #871

Merged
merged 10 commits into from
Dec 19, 2024
Changes from 7 commits
65 changes: 64 additions & 1 deletion docs/ppl-lang/functions/ppl-ip.md
@@ -32,4 +32,67 @@ Note:
- `ip` can be an IPv4 or an IPv6 address
- `cidr` can be an IPv4 or an IPv6 block
- `ip` and `cidr` must be either both IPv4 or both IPv6
- `ip` and `cidr` must both be valid and non-empty/non-null

### `GEOIP`

**Description**

`GEOIP(ip[, property]...)` retrieves geospatial data corresponding to the provided `ip`.

**Argument type:**
- `ip` is a **STRING** representing an IPv4 or an IPv6 address.
- `property` is a **STRING** and must be one of the following:
  - `COUNTRY_ISO_CODE`
  - `COUNTRY_NAME`
  - `CONTINENT_NAME`
  - `REGION_ISO_CODE`
  - `REGION_NAME`
  - `CITY_NAME`
  - `TIME_ZONE`
  - `LOCATION`

**Return type:**
- **STRING** if exactly one property is given
- **STRUCT_TYPE** if no property or more than one property is given

Example:

_Without properties:_

os> source=ips | eval a = geoip(ip) | fields ip, a
fetched rows / total rows = 2/2
+---------------------+-------------------------------------------------------------------------------------------------------+
|ip |a |
+---------------------+-------------------------------------------------------------------------------------------------------+
|66.249.157.90 |{JM, Jamaica, North America, 14, Saint Catherine Parish, Portmore, America/Jamaica, 17.9686,-76.8827} |
|2a09:bac2:19f8:2ac3::|{CA, Canada, North America, PE, Prince Edward Island, Charlottetown, America/Halifax, 46.2396,-63.1355}|
+---------------------+-------------------------------------------------------------------------------------------------------+

_With one property:_

os> source=users | eval a = geoip(ip, COUNTRY_NAME) | fields ip, a
fetched rows / total rows = 2/2
+---------------------+-------+
|ip |a |
+---------------------+-------+
|66.249.157.90 |Jamaica|
|2a09:bac2:19f8:2ac3::|Canada |
+---------------------+-------+

_With multiple properties:_

os> source=users | eval a = geoip(ip, COUNTRY_NAME, REGION_NAME, CITY_NAME) | fields ip, a
fetched rows / total rows = 2/2
+---------------------+---------------------------------------------+
|ip |a |
+---------------------+---------------------------------------------+
|66.249.157.90 |{Jamaica, Saint Catherine Parish, Portmore} |
|2a09:bac2:19f8:2ac3::|{Canada, Prince Edward Island, Charlottetown}|
+---------------------+---------------------------------------------+
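
The return-type rule shown in these examples can be pictured with a small Python sketch. It only illustrates the documented behaviour and is not the Spark implementation: `GEO_PROPERTIES`, `geoip_result` and the dict-based row are hypothetical names introduced here, and a plain `dict` stands in for the struct.

```python
# Illustrative only: mimics the documented return shape of geoip(),
# not the actual Spark implementation. All names here are hypothetical.
GEO_PROPERTIES = (
    "COUNTRY_ISO_CODE", "COUNTRY_NAME", "CONTINENT_NAME", "REGION_ISO_CODE",
    "REGION_NAME", "CITY_NAME", "TIME_ZONE", "LOCATION",
)


def geoip_result(row, *properties):
    """Return a string for one property, otherwise a dict standing in for a struct."""
    for name in properties:
        if name not in GEO_PROPERTIES:
            raise ValueError(f"unsupported property: {name}")
    if len(properties) == 1:
        return row[properties[0]]
    selected = properties or GEO_PROPERTIES  # no property means all of them
    return {name: row[name] for name in selected}


# Row values copied from the example output above.
jamaica = {
    "COUNTRY_ISO_CODE": "JM",
    "COUNTRY_NAME": "Jamaica",
    "CONTINENT_NAME": "North America",
    "REGION_ISO_CODE": "14",
    "REGION_NAME": "Saint Catherine Parish",
    "CITY_NAME": "Portmore",
    "TIME_ZONE": "America/Jamaica",
    "LOCATION": "17.9686,-76.8827",
}

print(geoip_result(jamaica, "COUNTRY_NAME"))  # Jamaica
```

Calling `geoip_result(jamaica)` with no properties returns all eight fields, matching the "without properties" example.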

Note:
- To use `geoip`, the user must create a Spark table containing geo IP location data. Instructions for creating the table can be found [here](../../opensearch-geoip.md).
  - By default, `geoip` expects this table to be named `geoip_ip_data`.
  - To use a different table name, set the `spark.geoip.tablename` Spark config to the new table name.
- `ip` can be an IPv4 or an IPv6 address.
- `geoip` calls are always evaluated first when used together with other eval functions.
74 changes: 74 additions & 0 deletions docs/ppl-lang/planning/ppl-geoip-command.md
@@ -0,0 +1,74 @@
## geoip syntax proposal

The `geoip` function adds information about the geographical location of an IPv4 or IPv6 address.

**Implementation syntax**
- `... | eval geoinfo = geoip(ipAddress *[,properties])`
  - generic syntax
- `... | eval geoinfo = geoip(ipAddress)`
  - retrieves all geo data
- `... | eval geoinfo = geoip(ipAddress, city, location)`
  - retrieves only city and location

**Implementation details**
- The current implementation requires the user to have created a geoip table with the following schema:

```SQL
CREATE TABLE geoip (
cidr STRING,
country_iso_code STRING,
country_name STRING,
continent_name STRING,
region_iso_code STRING,
region_name STRING,
city_name STRING,
time_zone STRING,
location STRING,
ip_range_start BIGINT,
ip_range_end BIGINT,
ipv4 BOOLEAN
)
```
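
As a rough illustration of how the `ip_range_start`, `ip_range_end` and `ipv4` columns relate to `cidr`, the Python sketch below derives them with the standard `ipaddress` module. It assumes the range columns hold the integer values of the first and last address in the block, which is consistent with the test fixture rows in this PR; `cidr_to_range` is a hypothetical helper, not part of the implementation.

```python
import ipaddress


def cidr_to_range(cidr):
    """Return (ip_range_start, ip_range_end, ipv4) for a CIDR block.

    Assumption: start/end are the integer values of the first and last
    address in the block, as the test fixture values suggest.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    start = int(network.network_address)
    end = int(network.broadcast_address)  # last address in the block
    return start, end, network.version == 4


print(cidr_to_range("66.249.157.0/24"))  # (1123654912, 1123655167, True)
```

IPv6 blocks produce integers far beyond the 64-bit range, so the same helper returns arbitrary-precision Python integers for them.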

- `geoip` is resolved by performing a join on this table and projecting the resulting geoip data as a struct.
- A `geoip` call is equivalent to running a SQL query such as the following:

```SQL
SELECT source.*, struct(geoip.country_name, geoip.city_name) AS a
FROM source, geoip
WHERE geoip.ip_range_start <= ip_to_int(source.ip)
AND geoip.ip_range_end > ip_to_int(source.ip)
AND geoip.ipv4 = is_ipv4(source.ip);
```
- If only one property is provided in the function call, `geoip` returns the specified property as a string instead:

```SQL
SELECT source.*, geoip.country_name AS a
FROM source, geoip
WHERE geoip.ip_range_start <= ip_to_int(source.ip)
AND geoip.ip_range_end > ip_to_int(source.ip)
AND geoip.ipv4 = is_ipv4(source.ip);
```
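
The join condition in these queries can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not the Spark plan: `ip_to_int` and `is_ipv4` are stand-ins built on the standard `ipaddress` module, `geoip_lookup` and `GEOIP_ROWS` are hypothetical names, the address-family flag is read from the `ipv4` column of the schema, and the single row is taken from this PR's test fixture.

```python
import ipaddress

# One row with a subset of the geoip table columns (values from the test fixture).
GEOIP_ROWS = [
    {
        "country_name": "Jamaica",
        "city_name": "Portmore",
        "ip_range_start": 1123654912,
        "ip_range_end": 1123655167,
        "ipv4": True,
    },
]


def ip_to_int(ip):
    """Integer value of an IPv4 or IPv6 address (raises ValueError if invalid)."""
    return int(ipaddress.ip_address(ip))


def is_ipv4(ip):
    return ipaddress.ip_address(ip).version == 4


def geoip_lookup(ip, properties):
    """Mirror the WHERE clause above: range match plus address-family match."""
    value = ip_to_int(ip)
    for row in GEOIP_ROWS:
        if (row["ip_range_start"] <= value
                and row["ip_range_end"] > value
                and row["ipv4"] == is_ipv4(ip)):
            if len(properties) == 1:
                return row[properties[0]]  # single property: plain string
            return {name: row[name] for name in properties}  # struct stand-in
    return None  # no matching range


print(geoip_lookup("66.249.157.90", ["country_name"]))  # Jamaica
```

A linear scan is used only for readability; the SQL above expresses the same match as a join.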

**Future plan for additional data-sources**

- Currently, only a pre-existing geoip table defined within Spark can be used.
- There are future plans to allow users to specify data sources:
  - API data sources: if users have their own geoip provider, they will be able to configure and call its endpoints.
  - OpenSearch geospatial client: once the geospatial client is published, it can be leveraged to use OpenSearch geo2ip functionality.
- Additional data source connection parameters will be provided through Spark config options.

### New syntax definition in ANTLR

```ANTLR

// functions
evalFunctionCall
: evalFunctionName LT_PRTHS functionArgs RT_PRTHS
| geoipFunction
;

geoipFunction
: GEOIP LT_PRTHS (datasource = functionArg COMMA)? ipAddress = functionArg (COMMA properties = stringLiteral)? RT_PRTHS
;
```
@@ -771,6 +771,79 @@ trait FlintSparkSuite extends QueryTest with FlintSuite with OpenSearchSuite wit
| """.stripMargin)
}

protected def createGeoIpTestTable(testTable: String): Unit = {
sql(s"""
| CREATE TABLE $testTable
| (
| ip STRING,
| ipv4 STRING,
| isValid BOOLEAN
| )
| USING $tableType $tableOptions
|""".stripMargin)

sql(s"""
| INSERT INTO $testTable
| VALUES ('66.249.157.90', '66.249.157.90', true),
| ('2a09:bac2:19f8:2ac3::', 'Given IPv6 is not mapped to IPv4', true),
| ('192.168.2.', '192.168.2.', false),
| ('2001:db8::ff00:12:', 'Given IPv6 is not mapped to IPv4', false)
| """.stripMargin)
}

protected def createGeoIpTable(): Unit = {
sql(s"""
| CREATE TABLE geoip
| (
| cidr STRING,
| country_iso_code STRING,
| country_name STRING,
| continent_name STRING,
| region_iso_code STRING,
| region_name STRING,
| city_name STRING,
| time_zone STRING,
| location STRING,
| ip_range_start DECIMAL(38,0),
| ip_range_end DECIMAL(38,0),
| ipv4 BOOLEAN
| )
| USING $tableType $tableOptions
|""".stripMargin)

sql(s"""
| INSERT INTO geoip
| VALUES (
| '66.249.157.0/24',
| 'JM',
| 'Jamaica',
| 'North America',
| '14',
| 'Saint Catherine Parish',
| 'Portmore',
| 'America/Jamaica',
| '17.9686,-76.8827',
| 1123654912,
| 1123655167,
| true
| ),
| (
| '2a09:bac2:19f8::/45',
| 'CA',
| 'Canada',
| 'North America',
| 'PE',
| 'Prince Edward Island',
| 'Charlottetown',
| 'America/Halifax',
| '46.2396,-63.1355',
| 55878094401180025937395073088449675264,
| 55878094401189697343951990121847324671,
| false
| )
| """.stripMargin)
}

protected def createNestedJsonContentTable(tempFile: Path, testTable: String): Unit = {
val json =
"""