Update dependency io.openlineage:openlineage-java to v1.13.1 #2789
This PR contains the following updates:

| Package | Change |
| --- | --- |
| io.openlineage:openlineage-java | `1.9.1` -> `1.13.1` |
## Release Notes

**OpenLineage/OpenLineage (io.openlineage:openlineage-java)**
### v1.13.1
#### Added

- #2609 (@pawel-big-lebowski): Extends the circuit breaker mechanism to contain a global timeout that stops running OpenLineage integration code when a specified amount of time has elapsed.
- **`DataSetEvent` and `JobEvent` in `Transport.emit`** (#2611, @dolfinus): Adds overloads `Transport.emit(OpenLineage.DatasetEvent)` and `Transport.emit(OpenLineage.JobEvent)`, reusing the implementation of `Transport.emit(OpenLineage.RunEvent)`. Please note: `Transport.emit(String)` is now deprecated and will be removed in 1.16.0. See the usage sketch after this list.
- **`GZIP` compression to `HttpTransport`** (#2603, #2604, @dolfinus): Adds a `compression` option to the `HttpTransport` config in the Java and Python clients, with a `gzip` implementation. See the config sketch at the end of this version's notes.
- #2571, #2597, #2598 (@dolfinus): Adds a new `messageKey` option to the `KafkaTransport` config in the Python and Java clients, as well as the Proxy. This option replaces the `localServerId` option, which is now deprecated. The default value is generated using the run id (for `RunEvent`), the job name (for `JobEvent`), or the dataset name (for `DatasetEvent`). The Kafka producer uses this value to distribute messages across topic partitions instead of sending all events to the same partition, allowing full use of Kafka's performance advantages. See the partitioning sketch after this list.
- #2633 (@mobuchowski): Adds a mechanism for forwarding metrics to any Micrometer-compatible implementation for Flink, as has been implemented for Spark. Included: `MeterRegistry`, `CompositeMeterRegistry`, `SimpleMeterRegistry`, and `MicrometerProvider`.
- #2520 (@JDarDagran): Objects specified with JSON Schema needed to be manually developed and checked in Python, leading to many discrepancies, including wrong schema URLs. This adds `datamodel-code-generator` for parsing JSON Schema and generating Pydantic or dataclasses classes, etc. In order to use `attrs` (a more modern alternative to dataclasses) and overcome some limitations of the tool, a number of steps have been added to customize the generated code to meet OpenLineage requirements. Included: updated references to the latest base JSON Schema spec for all child facets. Please note: the newly generated code creates a v2 interface that will be implemented in existing integrations in a future release. The v2 interface introduces some breaking changes: facets are put into separate modules per JSON Schema spec file, some names are changed, and several classes are now `kw_only`.
- #2583 (@pawel-big-lebowski): Creates a `SparkOpenlineageConfig` and `FlinkOpenlineageConfig` for a more uniform configuration experience for the user. Renames `OpenLineageYaml` to `OpenLineageConfig` and modifies the code to use only `OpenLineageConfig` classes. Includes a doc update to mention that both ways can be used interchangeably and that the final configuration will merge all provided values.
- #2613 (@tnazarew): Adds support for `FQCN` as `spark.openlineage.transport.auth.type`.
- #2609 (@pawel-big-lebowski): Implements an optional timeout within the circuit breaker that switches off the OpenLineage integration code.
- #2613 (@tnazarew): Adds a `TokenProviderTypeIdResolver` to handle both `FQCN` and (for backward compatibility) `api_key` types in `spark.openlineage.transport.auth.type`.
- **`SparkConf` & `FlinkConf` approaches** (#2583, @pawel-big-lebowski): Adds support for config entries being provided by both a YAML file and integration-specific configuration (`SparkConf`/`FlinkConf`). Allows each integration to have its own config entries.
- #2533 (@pawel-big-lebowski): Enables configuration entries specifying ownership of the job, resulting in an `OwnershipJobFacet` being attached to job facets.
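A minimal sketch of the new `Transport.emit` overloads from #2611, assuming the client surface mirrors them. The HTTP endpoint, namespace, and job names are placeholders, and the generated-model builder calls (`newJobEventBuilder`, etc.) follow the Java client's conventions but are not verified against this exact release:

```java
import io.openlineage.client.OpenLineage;
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpTransport;

import java.net.URI;
import java.time.ZonedDateTime;
import java.util.UUID;

public class EmitOverloadsSketch {
  public static void main(String[] args) {
    // Client wired to a local HTTP backend (placeholder URI).
    OpenLineageClient client = OpenLineageClient.builder()
        .transport(HttpTransport.builder().uri("http://localhost:5000").build())
        .build();

    OpenLineage ol = new OpenLineage(URI.create("https://example.com/demo-producer"));

    // Existing path: emit a RunEvent.
    OpenLineage.RunEvent runEvent = ol.newRunEventBuilder()
        .eventType(OpenLineage.RunEvent.EventType.START)
        .eventTime(ZonedDateTime.now())
        .run(ol.newRunBuilder().runId(UUID.randomUUID()).build())
        .job(ol.newJobBuilder().namespace("my-namespace").name("my-job").build())
        .build();
    client.emit(runEvent);

    // New in 1.13.1: JobEvent (and DatasetEvent) can be emitted the same way,
    // reusing the RunEvent transport implementation under the hood.
    OpenLineage.JobEvent jobEvent = ol.newJobEventBuilder()
        .eventTime(ZonedDateTime.now())
        .job(ol.newJobBuilder().namespace("my-namespace").name("my-job").build())
        .build();
    client.emit(jobEvent);
  }
}
```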
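The `messageKey` option above maps onto Kafka's standard key-based partitioning. This sketch uses the plain Kafka producer API rather than OpenLineage classes to show the mechanism; the topic name, key value, and payload are illustrative only, with the key following the `run:{jobNamespace}/{jobName}` default format described in these notes:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MessageKeySketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // A key in the style of the generated default: run:{jobNamespace}/{jobName}.
      String messageKey = "run:my-namespace/my-job";
      String eventJson = "{\"eventType\": \"START\"}"; // placeholder payload

      // Kafka hashes the key to select a partition, so events sharing a key
      // stay ordered on one partition while different runs spread across
      // partitions instead of all landing on the same one.
      producer.send(new ProducerRecord<>("openlineage.events", messageKey, eventJson));
    }
  }
}
```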
#### Changed
- **`partitionKey` format with the Kafka implementation** (#2620, @dolfinus): Changes the format of the Kinesis `partitionKey` from `{jobNamespace}:{jobName}` to `run:{jobNamespace}/{jobName}` to match the Kafka transport implementation.

#### Fixed
- **`load_config` returns an empty dict instead of `None` when the file is empty** (#2596, @kacpermuda): `utils.load_config()` now returns an empty dict instead of `None` in the case of an empty file, to prevent an `OpenLineageClient` crash.
- #2614 (@dolfinus): Fixes rendering of Javadoc for methods generated by `lombok` annotations by adding a `delombok` step.
- #2599 (@mobuchowski): Fixes an NPE when using the query option while reading from Snowflake.
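For the `GZIP` compression entry above (#2603, #2604), a sketch of what enabling compression programmatically might look like. `HttpConfig` and `HttpTransport` exist in the Java client, and the `compression` config key with value `gzip` is what these notes describe, but the exact setter and enum names below are assumptions rather than verified API:

```java
import io.openlineage.client.OpenLineageClient;
import io.openlineage.client.transports.HttpConfig;
import io.openlineage.client.transports.HttpTransport;

import java.net.URI;

public class GzipHttpTransportSketch {
  public static void main(String[] args) {
    // Field names mirror the YAML config keys (url, compression); the
    // setter and enum constant names are assumed, not verified.
    HttpConfig config = new HttpConfig();
    config.setUrl(URI.create("http://localhost:5050"));
    config.setCompression(HttpConfig.Compression.GZIP); // new in 1.13.1

    OpenLineageClient client = OpenLineageClient.builder()
        .transport(new HttpTransport(config))
        .build();
    // client.emit(...) payloads are now gzip-compressed on the wire.
  }
}
```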
### v1.12.0
#### Added

- **`lineage_job_namespace` and `lineage_job_name` macros** (#2582, @dolfinus): Adds new Airflow macros `lineage_job_namespace()` and `lineage_job_name(task)` that return the Airflow namespace and the Airflow job name, respectively.
- **Nested fields in `SchemaDatasetFacet`** (#2548, @dolfinus): Adds support for nested fields in `SchemaDatasetFacet`. See the sketch after this list.
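A sketch of a nested schema built with the Java client's generated model. The nested `fields` attribute is what #2548 introduces; the builder method names follow the generated-model conventions but are assumptions, and the field names are invented for illustration:

```java
import io.openlineage.client.OpenLineage;

import java.net.URI;
import java.util.List;

public class NestedSchemaSketch {
  public static void main(String[] args) {
    OpenLineage ol = new OpenLineage(URI.create("https://example.com/demo-producer"));

    // Two leaf fields nested under a struct-typed `location` field.
    OpenLineage.SchemaDatasetFacetFields city = ol.newSchemaDatasetFacetFieldsBuilder()
        .name("city").type("string").build();
    OpenLineage.SchemaDatasetFacetFields country = ol.newSchemaDatasetFacetFieldsBuilder()
        .name("country").type("string").build();

    OpenLineage.SchemaDatasetFacetFields location = ol.newSchemaDatasetFacetFieldsBuilder()
        .name("location")
        .type("struct")
        .fields(List.of(city, country)) // nested fields, new in 1.12.0
        .build();

    OpenLineage.SchemaDatasetFacet schema = ol.newSchemaDatasetFacetBuilder()
        .fields(List.of(location))
        .build();
    System.out.println(schema.getFields().get(0).getName()); // location
  }
}
```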
#### Fixed
- **`airflow.macros.lineage_parent_id`** (#2578, @blacklight): Fixes the run format returned by the `lineage_parent_id` Airflow macro and simplifies the format of the `lineage_parent_id` and `lineage_run_id` macros.
- #2591 (@blacklight): `dbt-ol` now propagates the exit code of the underlying dbt process even if no lineage events are emitted.
- #2579 (@JDarDagran): Adds an upper limit on supported versions of Dagster, as the integration is no longer actively maintained and recent releases introduce breaking changes.
- #2585 (@harels): String lookup was not accounting for empty strings, causing a `java.lang.StringIndexOutOfBoundsException`.
- #2624 (@pawel-big-lebowski): Improves the developer experience by fixing issues that caused warnings on build.
- **`pkg_resources` module on Python 3.12** (#2572, @dolfinus): Removes the `pkg_resources` dependency and replaces it with the `packaging` lib.
- **`HashSet` in column-level lineage instead of iterating through `LinkedList`** (#2584, @mobuchowski): Takes advantage of the performance gains available from using `HashSet` for collection lookups. See the sketch after this list.
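A self-contained JDK illustration of the gain referenced in #2584: membership checks against a `LinkedList` are linear scans, while a `HashSet` does constant-time expected lookups. The column names are hypothetical stand-ins for column-level lineage state:

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class LookupSketch {
  public static void main(String[] args) {
    List<String> columns = new LinkedList<>();
    for (int i = 0; i < 100_000; i++) {
      columns.add("col_" + i);
    }

    // LinkedList#contains walks the list node by node: O(n) per lookup.
    boolean slow = columns.contains("col_99999");

    // HashSet#contains hashes the key: O(1) expected per lookup,
    // which is the gain the change takes advantage of.
    Set<String> columnSet = new HashSet<>(columns);
    boolean fast = columnSet.contains("col_99999");

    System.out.println(slow + " " + fast);
  }
}
```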
### v1.11.3
#### Added

- **`SCRIPT`-type jobs in BigQuery** (#2564, @kacpermuda): For `SCRIPT`-type jobs in BigQuery, no lineage was being extracted because the `SCRIPT` job had no lineage information; it only spawned child jobs that did. With this change, the integration extracts lineage information from the child jobs when dealing with `SCRIPT`-type jobs.
- #2272 (@pawel-big-lebowski): Adds a `spark-interfaces-scala` package that allows lineage extraction to be implemented within Spark extensions (Iceberg, Delta, GCS, etc.). When traversing the query plan, the OpenLineage integration checks whether nodes implement the defined interfaces; if so, the interface methods are used to extract lineage. Refer to the README for more details.
- #2496 (@mobuchowski): Adds a mechanism for forwarding metrics to any Micrometer-compatible implementation. Included: `MeterRegistryFactory`, `MicrometerProvider`, `StatsDMetricsBuilder`, metrics config in the OpenLineage config, and a Java client implementation. See the registry sketch after this list.
- #2528 (@mobuchowski): Adds timers, counters, and additional instrumentation in order to implement Micrometer metrics collection.
- #2556 (@mobuchowski): Adds support for the Spark-BigQuery connector's query input option, which executes a query directly on BigQuery, storing the result in an intermediate dataset and bypassing Spark's computation layer. Because of this, the lineage is retrieved using the SQL parser, similarly to `JDBCRelation`.
- **`SparkPropertyFacetBuilder` to support recording the Spark runtime config** (#2523, @Ruihua98): Modifies `SparkPropertyFacetBuilder` to capture the `RuntimeConfig` of the Spark session, because the existing `SparkPropertyFacet` could only capture the static config of the Spark context. This facet will be added in both RDD-related and SQL-related runs.
- **`fileCount` to dataset stat facets** (#2562, @dolfinus): Adds a `fileCount` field to the `DataQualityMetricsInputDatasetFacet` and `OutputStatisticsOutputDatasetFacet` specifications.
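To make the Micrometer entries above concrete, a minimal sketch of the Micrometer API that such forwarding targets. The registry is Micrometer's own in-memory `SimpleMeterRegistry`, and the meter names are invented for illustration, not the integration's actual metric names:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class MetricsSketch {
  public static void main(String[] args) {
    // Any Micrometer-compatible registry can be the forwarding target;
    // SimpleMeterRegistry is the in-memory reference implementation.
    MeterRegistry registry = new SimpleMeterRegistry();

    // Hypothetical meter names, for illustration only.
    Counter eventsEmitted = registry.counter("openlineage.emit.attempts");
    Timer buildTimer = registry.timer("openlineage.event.build");

    // Time a unit of work and count it.
    buildTimer.record(eventsEmitted::increment);

    System.out.println(eventsEmitted.count()); // 1.0
    System.out.println(buildTimer.count());    // 1
  }
}
```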
#### Fixed
- **`dbt-ol` should transparently exit with the same exit code as the child `dbt` process** (#2560, @blacklight): Makes `dbt-ol` transparently exit with the same exit code as the child `dbt` process.
- #2531 (@HuangZhenQiu): Disables module metadata generation for Flink to fix the problem of Gradle dependencies on submodules within `openlineage-flink.jar`.
- #2507 (@pawel-big-lebowski): Fixes the class-not-found issue when checking for Cassandra classes. Also fixes the Maven POM dependency on subprojects.
- **`.emit()` method logging & annotations** (#2539, @dolfinus): Updates `OpenLineage.emit` debug messages and annotations.
- #2547 (@dolfinus): When the `OpenLineageSql` class could not load a native library, it returned `None` for all operations. But because the error message was suppressed, the user could not determine the reason.
- #2510 (@mobuchowski): Includes tests and cosmetic improvements.
- #2535 (@pawel-big-lebowski): Changes behavior so `IllegalStateException` is always caught when accessing `SparkSession`.
- #2537 (@pawel-big-lebowski): Fixes the `ClassNotFoundError` occurring on the Databricks runtime and extends the integration test to verify `DatabricksEnvironmentFacet`.
- #2565 (@d-m-h): The `JobMetricsHolder#cleanUp(int)` method now correctly purges unneeded state from both maps.
- **`UnknownEntryFacetListener`** (#2557, @pawel-big-lebowski): Prevents storing state when the facet is disabled, and purges the state after populating run facets.
- **`JDBCOptions(table=...)` containing a subquery** (#2546, @dolfinus): Prevents `openlineage-spark` from producing datasets with names like `database.(select * from table)` for JDBC sources.
- #2563 (@mobuchowski): When a Snowflake job bypasses Spark's computation layer, the SQL parser is now used to get the lineage.
- **`IllegalStateException` when accessing `SparkSession`** (#2535, @pawel-big-lebowski): The `IllegalStateException` was not being caught. See the sketch after this list.
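A sketch of the defensive pattern behind the two #2535 entries, using Spark's real `SparkSession.active()`, which throws `IllegalStateException` when no session is available. The wrapper method is hypothetical and only illustrates the catch-rather-than-crash approach described above:

```java
import org.apache.spark.sql.SparkSession;

public class SessionAccessSketch {
  // Accessing the active session can throw IllegalStateException, for
  // example when no active or default session exists. Catching it lets
  // lineage collection degrade gracefully instead of breaking the job.
  static SparkSession activeSessionOrNull() {
    try {
      return SparkSession.active();
    } catch (IllegalStateException e) {
      return null; // no active session; skip lineage collection
    }
  }
}
```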
### v1.10.2
#### Added

- #2518 (@JDarDagran): Adds the new provider required by the latest version of Dagster.
- #2491 (@HuangZhenQiu): Adds support for hybrid source lineage for users of Kafka and Iceberg sources in backfill use cases.
- #2479 (@HuangZhenQiu): Uses the Cassandra cluster info as the dataset namespace, and combines the keyspace with the table name to form the dataset name.
- #2472 (@HuangZhenQiu): Bumps the Flink JDBC connector version to 3.1.2-1.18 for Flink 1.18.
- **`OpenLineageClientUtils#loadOpenLineageJson(InputStream)` added; `OpenLineageClientUtils#loadOpenLineageYaml(InputStream)` changed** (#2490, @d-m-h): This improves the explicitness of the methods. Previously, `loadOpenLineageYaml(InputStream)` expected the `InputStream` to contain bytes that represented JSON. See the sketch after this list.
- #2486 (@davidjgoss): Adds the status code and body as properties on the thrown exception when a non-success response is encountered in the HTTP transport.
- #2478 (@mattiabertorello): Eases publication of events to MSK with IAM authentication.
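A sketch of the #2490 method pair, with config content inlined from strings for self-containment. The `console` transport type is real; the return type shown matches the 1.10.x client (`OpenLineageYaml`, later renamed to `OpenLineageConfig`, as noted under v1.13.1):

```java
import io.openlineage.client.OpenLineageClientUtils;
import io.openlineage.client.OpenLineageYaml;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ConfigLoadingSketch {
  public static void main(String[] args) {
    // YAML-formatted config goes through loadOpenLineageYaml...
    InputStream yaml = new ByteArrayInputStream(
        "transport:\n  type: console\n".getBytes(StandardCharsets.UTF_8));
    OpenLineageYaml fromYaml = OpenLineageClientUtils.loadOpenLineageYaml(yaml);

    // ...while JSON-formatted config now has its own explicit entry point
    // instead of being smuggled through the YAML method.
    InputStream json = new ByteArrayInputStream(
        "{\"transport\": {\"type\": \"console\"}}".getBytes(StandardCharsets.UTF_8));
    OpenLineageYaml fromJson = OpenLineageClientUtils.loadOpenLineageJson(json);
  }
}
```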
#### Removed

- #2524 (@kacpermuda): Refines the operator's attribute inclusion logic in facets to include only attributes known to be important or compact, ensuring that custom operator attributes with substantial data do not inflate the event size.
#### Fixed
- **`task_instance` copy fails** (#2492, @kacpermuda): Airflow will now proceed without rendering templates if the `task_instance` copy fails in `listener.on_task_instance_running`.
- **`HttpTransport` timeout** (#2475, @pawel-big-lebowski): The existing `timeout` config parameter is ambiguous: the implementation treats the value as a double in seconds, although the documentation claims it is in milliseconds. A new config param, `timeoutInMillis`, has been added. The existing `timeout` has been removed from the docs and will be deprecated in 1.13.
- #2515 (@pawel-big-lebowski): Adds a check for a null context before executing `end(jobEnd)`.
- #2507 (@pawel-big-lebowski): Fixes the class-not-found issue when checking for Cassandra classes. Also fixes the Maven POM dependency on subprojects.
- #2512 (@HuangZhenQiu): Enables the JDBC table name with a schema prefix.
- #2508 (@pawel-big-lebowski): For JDBC, the Flink integration was not adjusted to the OpenLineage naming convention: the code that extracts the dataset namespace/name from the JDBC connection URL lived in the Spark integration. As a solution, this code has been extracted into the Java client and reused by both the Spark and Flink integrations.
- #2507 (@pawel-big-lebowski): Flink was failing when no Cassandra classes were present on the class path. This happened because the `CassandraUtils` class has a static `hasClasses` method but imports Cassandra-related classes in its header. Also, the Flink subproject contained an unnecessary `maven-publish` plugin.
- #2504 (@HuangZhenQiu): The shadow jar of Flink is not minimized, so some internal jars were listed as runtime dependencies. This removes them from the final pom.xml file in the Flink module.
- #2479 (@HuangZhenQiu): Following the namespace definition, we should use `cassandra://host:port`.

## Configuration
📅 Schedule: Branch creation - "every 3 months on the first day of the month" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR has been generated by Mend Renovate. View repository job log here.