The IcebergWriter writes data-frames to Iceberg tables in batch mode with the DataFrame API.

  • The write location can be either an existing Iceberg table or a directory path.
  • The supported write options are as follows (a rough plain-Scala equivalent follows this list):
    • write-format - the file format to use for the write operation; one of parquet, avro, or orc. Default: the table's write.format.default property set when the table was created.
    • target-file-size-bytes - override the target table's write.target-file-size-bytes property. Default: as per the table property.
    • check-nullability - enable the nullability check on fields. Default: true.
    • snapshot-property.custom-key - add an entry with custom-key and its value to the snapshot summary. Default: null.
    • fanout-enabled - override the target table's write.spark.fanout.enabled property. Default: false. The fanout writer opens one file per partition value and does not close these files until the write task finishes. This is discouraged for batch queries, since explicitly sorting the output rows is not expensive for a batch workload.
    • check-ordering - check whether the input schema and the table schema are the same. Default: true.
  • The tablePartitionedBy property lists the column(s) (comma-separated if there is more than one) used in the "partitioned by" clause of the target table's creation statement. If it is not specified, the data-frame being written to the target Iceberg table must be explicitly sorted by these columns before calling this writer. The property only applies when the write location is a partitioned Iceberg table. Please note that the column(s) are not used for partitioning the data-frame being written out.
  • The write mode can only be overwrite or append.
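
These option names match Iceberg's Spark write options, which the writer presumably passes through to a standard DataFrame write. The sketch below is a rough plain-Scala equivalent, not the IcebergWriter's implementation; the data-frame, partition columns, table name and option values are only assumptions.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical example: write `df` to a partitioned Iceberg table with the options listed above.
def writeFeatures(df: DataFrame): Unit = {
  df.sortWithinPartitions("city", "timestamp")                       // pre-sort by the table's partition columns
    .write
    .format("iceberg")
    .option("write-format", "parquet")                               // file format for this write
    .option("target-file-size-bytes", (128L * 1024 * 1024).toString) // override write.target-file-size-bytes
    .option("check-ordering", "true")                                // verify the input schema matches the table schema
    .option("snapshot-property.batch-id", "2020-07-01")              // custom entry in the snapshot summary
    .mode(SaveMode.Overwrite)                                        // or SaveMode.Append
    .save("events.db.features")
}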

Actor Class: com.qwshen.etl.sink.IcebergWriter

The definition of the IcebergWriter:

  • In YAML format
  actor:
    type: iceberg-writer
    properties:
      options:
        check-ordering: "true"
      tablePartitionedBy: "city, timestamp"
      mode: overwrite
      table: "events.db.features"
      view: users      
  • In JSON format
  {
    "actor": {
      "type": "iceberg-writer",
      "properties": {
        "options": {
          "check-orering": "true"
        },
        "tablePartitionedBy": "city, timestamp",
        "mode": "overwrite",
        "table": "events.db.users",
        "view": "users"
      }
    }
  }
  • In XML format
  <actor type="iceberg-writer">
    <properties>
      <options>
        <check-ordering>true</check-ordering>
      </options>
      <tablePartitionedBy>city, timestamp</tablePartitionedBy>  
      <mode>overwrite</mode>
      <table>hdfs:///event-warehouse/users</table>
      <view>users</view>
    </properties>
  </actor>

Note:

  1. When submitting a job that writes to an Iceberg table, dynamic overwrite mode is recommended by setting spark.sql.sources.partitionOverwriteMode=dynamic (see the snippet after the SQL examples below).
  2. The IcebergWriter provides only fundamental write operations. It is recommended to use SQL commands with the SqlWriter for more complex writes, for example:
    1. INSERT INTO (for Spark 3.0 or later)
    INSERT INTO prod.db.table VALUES (1, 'a'), (2, 'b')
    -- or
    INSERT INTO prod.db.table SELECT ...
    
    2. MERGE INTO (for Spark 3.0 or later)
    MERGE INTO prod.db.target t
    USING (SELECT * from source) s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED AND t.count IS NULL AND s.op = 'increment' THEN UPDATE SET t.count = 0
    WHEN MATCHED AND s.op = 'increment' THEN UPDATE SET t.count = t.count + 1
    WHEN NOT MATCHED AND s.event_time > still_valid_threshold THEN INSERT (id, count) VALUES (s.id, 1)
    WHEN NOT MATCHED THEN INSERT *
    
    3. INSERT OVERWRITE (for Spark 3.0.1 or later)
    INSERT OVERWRITE prod.my_app.logs
    SELECT uuid, first(level), first(ts), first(message)
    FROM prod.my_app.logs
    WHERE cast(ts as date) = '2020-07-01'
    GROUP BY uuid
    
    -- overwrite by partition
    INSERT OVERWRITE prod.my_app.logs
    PARTITION (level = 'INFO')
    SELECT uuid, first(level), first(ts), first(message)
    FROM prod.my_app.logs
    WHERE level = 'INFO'
    GROUP BY uuid
    
    4. DELETE FROM (for Spark 3.0 or later)
    DELETE FROM prod.db.table
    WHERE ts >= '2020-05-01 00:00:00' and ts < '2020-06-01 00:00:00'
    
    DELETE FROM prod.db.all_events
    WHERE session_time < (SELECT min(session_time) FROM prod.db.good_events)
    
    DELETE FROM prod.db.orders AS t1
    WHERE EXISTS (SELECT oid FROM prod.db.returned_orders WHERE t1.oid = oid)
    
    5. UPDATE (for Spark 3.1 or later)
    UPDATE prod.db.table
    SET c1 = 'update_c1', c2 = 'update_c2'
    WHERE ts >= '2020-05-01 00:00:00' and ts < '2020-06-01 00:00:00'
    
    UPDATE prod.db.all_events
    SET session_time = 0, ignored = true
    WHERE session_time < (SELECT min(session_time) FROM prod.db.good_events)
    
    UPDATE prod.db.orders AS t1
    SET order_status = 'returned'
    WHERE EXISTS (SELECT oid FROM prod.db.returned_orders WHERE t1.oid = oid)
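
The dynamic overwrite mode recommended in note 1 can be set on the Spark session (or at submit time) before the job runs; a minimal sketch, assuming an active SparkSession named spark:

// Enable dynamic partition overwrite for this session (see note 1 above);
// the same setting can be passed at submit time:
//   spark-submit --conf spark.sql.sources.partitionOverwriteMode=dynamic ...
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")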
    

Please use the SqlActor to create/alter/drop Iceberg tables, including calling stored procedures:

CREATE TABLE prod.db.sample (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

ALTER TABLE prod.db.sample ADD COLUMN point.z double

ALTER TABLE prod.db.sample WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY category, id
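
Iceberg stored procedures are invoked with CALL statements. A minimal sketch in Scala, assuming an active SparkSession named spark with the Iceberg SQL extensions enabled, a catalog named prod, and a placeholder table and snapshot id; the same CALL statement can be supplied as the SQL of a SqlActor:

// Roll the table back to an earlier snapshot via an Iceberg stored procedure;
// the table and snapshot id here are only placeholders.
spark.sql("CALL prod.system.rollback_to_snapshot('db.sample', 1)")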