OpenSearch DataType map to Spark DataType #505

Open
Tracked by #185
penghuo opened this issue Aug 1, 2024 · 2 comments

Comments

@penghuo
Collaborator

penghuo commented Aug 1, 2024

By mapping each OpenSearch DataType to a SQL DataType and embedding all other OpenSearch mapping parameters in metadata, we achieve a clean separation between the logical data type used in SQL and the physical storage details. For instance, the OpenSearch type keyword maps to the Spark SQL type StringType.
All additional mapping parameters from OpenSearch (such as doc_values, index, and store) will be stored in the metadata of the Spark SQL schema. For an OpenSearch keyword field named name, the corresponding Spark SQL schema might be defined as follows:

import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// Logical SQL type is StringType; OpenSearch mapping parameters live in the metadata.
val nameField = StructField(
  "name",
  StringType,
  nullable = true,
  metadata = new MetadataBuilder()
    .putBoolean("doc_values", true)
    .putBoolean("index", true)
    // Add any other mapping parameters as needed
    .build()
)
  • Data Retrieval:
    The query engine will use the metadata to determine the appropriate retrieval mechanism. For instance, if _source is disabled and doc_values is enabled, the engine will know to extract data from doc_values instead.
  • Query Optimization:
    The metadata provides hints on how the underlying data is stored, which can help optimize sorting, filtering, and aggregations. For example, if doc_values are disabled, the engine might handle the query differently due to potential performance impacts when extracting data from _source.
  • Extensibility:
    Storing the mapping parameters as metadata allows future extensions without altering the logical SQL type. As OpenSearch evolves or as additional parameters are needed, they can be incorporated into the metadata without impacting SQL query logic.
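
To make the data retrieval point concrete, here is a minimal sketch of how an engine could pick a retrieval path from the field metadata. The RetrievalSource names and the RetrievalPlanner object are hypothetical and not part of any existing API, the sourceEnabled flag is assumed to come from the index-level _source setting, and treating a missing doc_values key as false is also an assumption.

```scala
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField}

// Hypothetical retrieval strategies, for illustration only.
sealed trait RetrievalSource
case object FromSource extends RetrievalSource
case object FromDocValues extends RetrievalSource
case object Unavailable extends RetrievalSource

object RetrievalPlanner {
  // Reads a boolean mapping parameter from the field metadata.
  // Assumption: a missing key is treated as false.
  private def flag(md: Metadata, key: String): Boolean =
    md.contains(key) && md.getBoolean(key)

  // Prefer _source when it is enabled; otherwise fall back to doc_values
  // if the field metadata says they are available.
  def choose(field: StructField, sourceEnabled: Boolean): RetrievalSource =
    if (sourceEnabled) FromSource
    else if (flag(field.metadata, "doc_values")) FromDocValues
    else Unavailable
}

// Example field mirroring the keyword field above.
val keywordField = StructField(
  "name",
  StringType,
  nullable = true,
  metadata = new MetadataBuilder().putBoolean("doc_values", true).build()
)
val plan = RetrievalPlanner.choose(keywordField, sourceEnabled = false)
```

With _source disabled and doc_values enabled, choose returns FromDocValues, which matches the behavior described in the first bullet.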
@dblock dblock added enhancement New feature or request and removed untriaged labels Aug 19, 2024
@dblock
Member

dblock commented Aug 19, 2024

Catch All Triage - 1, 2, 3

@penghuo Is there a regular triage meeting for this repo? Please work with @krisfreedain to set one up.

@penghuo
Collaborator Author

penghuo commented Feb 3, 2025

The following table shows how to infer SQL data types from the field types in an OpenSearch index mapping.

| OpenSearch Data Type | SQL Data Type | Extended | IndexType | Description |
|---|---|---|---|---|
| binary | BinaryType | No | binary | |
| boolean | BooleanType | No | boolean | |
| keyword | StringType | No | keyword | |
| text | StringType | No | text | |
| long | LongType | No | long | |
| integer | IntegerType | No | integer | |
| short | ShortType | No | short | |
| byte | ByteType | No | byte | |
| double | DoubleType | No | double | |
| float | FloatType | No | float | |
| half_float | FloatType | No | half_float | |
| scaled_float | DoubleType | No | scaled_float | |
| unsigned_long | TBD | n/a | TBD | No SQL data type supports unsigned_long. |
| date | DateType | No | date | |
| date_nanos | DateType | No | date_nanos | |
| alias | TBD | n/a | TBD | |
| object | MapType | No | object | |
| flat_object | UDT | | flat_object | |
| nested | UDT | | nested | |
| join | UDT | | join | |
| integer_range | IntegerRangeType | Yes | integer_range | Follows https://www.postgresql.org/docs/current/rangetypes.html |
| long_range | LongRangeType | Yes | long_range | Follows https://www.postgresql.org/docs/current/rangetypes.html |
| double_range | DoubleRangeType | Yes | double_range | Follows https://www.postgresql.org/docs/current/rangetypes.html |
| float_range | FloatRangeType | Yes | float_range | Follows https://www.postgresql.org/docs/current/rangetypes.html |
| date_range | TimestampRangeType | Yes | date_range | Depending on the format, date_range could map to TimestampRangeType or TimestampRangeNTZType. |
| ip_range | IPRangeType | Yes | ip_range | Follows https://www.postgresql.org/docs/current/rangetypes.html |
| ip | IPType | Yes | ip | Follows https://www.postgresql.org/docs/current/datatype-net-types.html |
| autocomplete | UDT | | autocomplete | |
| geo_point | GeoPointType | Yes | geo_point | Follows https://www.postgresql.org/docs/current/datatype-geometric.html |
| geo_shape | GeoShapeType | Yes | geo_shape | Follows https://www.postgresql.org/docs/current/datatype-geometric.html |
| point | PointType | Yes | point | Follows https://www.postgresql.org/docs/current/datatype-geometric.html |
| shape | ShapeType | Yes | shape | Follows https://www.postgresql.org/docs/current/datatype-geometric.html |
| percolator | UDT | | percolator | |
| k-NN vector | KNNType | Yes | k-NN vector | |
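
As a rough illustration, the rows that map to built-in Spark types could be expressed as a lookup like the one below. This is only a sketch under a few assumptions: the OpenSearchTypeMapping object and the "os_type" metadata key are hypothetical names, and rows marked TBD, UDT, or Extended (as well as object, whose MapType key and value types the table does not specify) are left unmapped and return None.

```scala
import org.apache.spark.sql.types._

object OpenSearchTypeMapping {
  // Maps an OpenSearch field type name to a built-in Spark SQL type.
  // Returns None for anything the table lists as TBD, UDT, or Extended.
  def toSparkType(osType: String): Option[DataType] = osType match {
    case "binary"                  => Some(BinaryType)
    case "boolean"                 => Some(BooleanType)
    case "keyword" | "text"        => Some(StringType)
    case "long"                    => Some(LongType)
    case "integer"                 => Some(IntegerType)
    case "short"                   => Some(ShortType)
    case "byte"                    => Some(ByteType)
    case "double" | "scaled_float" => Some(DoubleType)
    case "float" | "half_float"    => Some(FloatType)
    case "date" | "date_nanos"     => Some(DateType)
    case _                         => None
  }

  // Builds a StructField and keeps the original OpenSearch type in metadata,
  // so that keyword and text stay distinguishable after both become StringType.
  // "os_type" is a hypothetical metadata key chosen for this sketch.
  def toField(name: String, osType: String): Option[StructField] =
    toSparkType(osType).map { dataType =>
      val metadata = new MetadataBuilder().putString("os_type", osType).build()
      StructField(name, dataType, nullable = true, metadata)
    }
}

val titleField = OpenSearchTypeMapping.toField("title", "text")
val rangeField = OpenSearchTypeMapping.toField("age_range", "integer_range")
```

Returning an Option keeps the unmapped rows explicit, so callers have to decide what to do with TBD and extended types instead of silently falling back to a default.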
