Hudi Key Generation

Applies to version 0.10.1.

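Every record in Hudi is uniquely identified by a primary key, which is the pair of record key and partition path.

With primary keys, Hudi can enforce a partition-level uniqueness constraint and update or delete records efficiently.

The partitioning scheme directly affects both ingestion latency and query latency.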

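Hudi currently supports both partitioned and global indexes. With a partitioned index, each record is uniquely identified by the pair of record key and partition path; with a global index, the record key alone guarantees uniqueness.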
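The difference between the two index types can be pictured with a toy model. The sketch below is not how Hudi implements its indexes; it only uses Python dictionaries to show which value acts as the identity of a record. The sample records and the function names are made up for illustration.

```python
# Toy model only: Hudi's real indexes are far more involved.
# Each record is (record_key, partition_path, payload); all values are made up.
records = [
    ("id-1", "2020/01/06", "first version"),
    ("id-1", "2020/01/07", "same key, different partition"),
    ("id-1", "2020/01/06", "update of the first version"),
]

def upsert_partitioned(rows):
    """Partitioned index: identity is the (record key, partition path) pair."""
    table = {}
    for key, partition, payload in rows:
        table[(key, partition)] = payload
    return table

def upsert_global(rows):
    """Global index: identity is the record key alone, across all partitions."""
    table = {}
    for key, partition, payload in rows:
        table[key] = (partition, payload)
    return table

partitioned = upsert_partitioned(records)
global_view = upsert_global(records)
print(len(partitioned))  # 2
print(len(global_view))  # 1
```

With the partitioned model, the same record key may exist once per partition; with the global model, it exists once in the whole table.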

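Hudi ships several key generators out of the box and also provides an interface for custom implementations.

The following configuration is needed to use a key generator: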

| Config | Meaning/purpose |
| --- | --- |
| hoodie.datasource.write.recordkey.field | Record key field. Mandatory. |
| hoodie.datasource.write.partitionpath.field | Partition path field. Mandatory. |
| hoodie.datasource.write.keygenerator.class | Fully qualified key generator class name; either one of the built-in generators or a user-defined one. Mandatory. |
| hoodie.datasource.write.partitionpath.urlencode | When set to true, the partition path is URL-encoded. Default: false. |
| hoodie.datasource.write.hive_style_partitioning | When set to true, Hive-style partitioning is used: the partition field name is prefixed to the value, in the form <partition_path_field_name>=<partition_path_value>. Default: false. |
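As a rough illustration of how these settings fit together, the sketch below collects them in a Python dictionary and mimics the effect of the two partition path flags. The column names `uuid` and `region` are hypothetical, and `format_partition_path` is an illustrative helper, not a Hudi function; Hudi's own URL encoding runs in Java and may differ in detail from `urllib.parse.quote`.

```python
from urllib.parse import quote

# Column names "uuid" and "region" are hypothetical.
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "region",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.partitionpath.urlencode": "false",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

def format_partition_path(field, value, hive_style=False, url_encode=False):
    """Illustrative helper (not part of Hudi): apply the two path flags."""
    text = str(value)
    if url_encode:
        # Assumption: percent-encode every reserved character, including "/".
        text = quote(text, safe="")
    if hive_style:
        # Hive style prefixes the field name to the value.
        text = f"{field}={text}"
    return text

print(format_partition_path("region", "emea"))                      # emea
print(format_partition_path("region", "emea", hive_style=True))     # region=emea
print(format_partition_path("day", "2020/01/06", url_encode=True))  # 2020%2F01%2F06
```

In a PySpark job, a dictionary like this would typically be handed to the DataFrame writer through `.options(**hudi_options)`; a real write needs further settings, such as the table name, that are outside the scope of this section.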

Note: the timestamp-based key generator requires additional configuration.

1 SimpleKeyGenerator

The record key and the partition path each refer to a single column by name. Values are converted to strings.
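A minimal sketch of that behaviour, assuming a record is available as a plain dictionary. The function `simple_key` and the sample row are hypothetical stand-ins, not Hudi code.

```python
def simple_key(row, record_key_field, partition_path_field):
    """Toy stand-in for a simple key generator: one column each, stringified."""
    return str(row[record_key_field]), str(row[partition_path_field])

row = {"uuid": 42, "region": "emea", "amount": 9.5}  # hypothetical record
print(simple_key(row, "uuid", "region"))  # ('42', 'emea')
```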

2 ComplexKeyGenerator

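The record key and the partition path are each built from a combination of columns, given as a comma-separated list, for example "hoodie.datasource.write.recordkey.field": "col1,col4".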
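The comma-separated form can be sketched as follows. Only the `col1,col4` record key setting comes from the text; the partition columns and the sample row are hypothetical. How Hudi joins the selected values into the final key string is deliberately left out, since that format should be checked against the Hudi version in use.

```python
# Hypothetical column names; only the comma-separated form comes from the text.
complex_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "col1,col4",
    "hoodie.datasource.write.partitionpath.field": "col2,col3",
}

def split_fields(config_value):
    """Split a comma-separated field list, ignoring surrounding whitespace."""
    return [name.strip() for name in config_value.split(",") if name.strip()]

def select_values(row, config_value):
    """Pick the configured columns from a row and stringify them, in order."""
    return [str(row[name]) for name in split_fields(config_value)]

row = {"col1": 7, "col2": "us", "col3": "2020", "col4": "x9"}
print(select_values(row, complex_options["hoodie.datasource.write.recordkey.field"]))      # ['7', 'x9']
print(select_values(row, complex_options["hoodie.datasource.write.partitionpath.field"]))  # ['us', '2020']
```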

3 GlobalDeleteKeyGenerator

Does not use the partition path when generating the key.

4 NonpartitionedKeyGenerator

For non-partitioned tables; returns an empty partition path.
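A possible configuration for a non-partitioned table is sketched below. The column name `uuid` is hypothetical, and leaving the partition path field empty is an assumption of this sketch rather than something stated above. `nonpartitioned_key` is a toy stand-in that only mirrors the "empty partition path" behaviour.

```python
# Assumption: the partition path field is left empty for a non-partitioned table.
nonpartitioned_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",  # hypothetical column
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

def nonpartitioned_key(row, record_key_field):
    """Toy stand-in: stringified record key plus an empty partition path."""
    return str(row[record_key_field]), ""

print(nonpartitioned_key({"uuid": 42}, "uuid"))  # ('42', '')
```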

5 CustomKeyGenerator

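A generic implementation that combines several key generators.

It suits complex partition path scenarios. Specify the partition path in hoodie.datasource.write.partitionpath.field in the form field1:PartitionKeyType1,field2:PartitionKeyType2… PartitionKeyType is an enum; SIMPLE and TIMESTAMP are currently supported.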
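The `field:PartitionKeyType` notation can be illustrated with a small parser. This is a sketch, not Hudi's parsing code; the column names `uuid`, `region` and `created_at` are hypothetical, and `parse_partition_spec` is an illustrative helper.

```python
SUPPORTED_TYPES = {"SIMPLE", "TIMESTAMP"}

def parse_partition_spec(spec):
    """Split 'field1:TYPE1,field2:TYPE2' into (field, type) pairs."""
    pairs = []
    for item in spec.split(","):
        field, _, key_type = item.strip().partition(":")
        key_type = key_type.upper()
        if not field or key_type not in SUPPORTED_TYPES:
            raise ValueError(f"bad partition spec entry: {item!r}")
        pairs.append((field, key_type))
    return pairs

# Hypothetical columns: "region" kept as a plain value, "created_at" as a timestamp.
custom_options = {
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "region:SIMPLE,created_at:TIMESTAMP",
}

print(parse_partition_spec(custom_options["hoodie.datasource.write.partitionpath.field"]))
# [('region', 'SIMPLE'), ('created_at', 'TIMESTAMP')]
```

A field declared as TIMESTAMP would additionally need the timestamp-related settings described for TimestampBasedKeyGenerator.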

6 Custom implementation

Implement a custom key generator by extending Hudi's public key generator API.

7 TimestampBasedKeyGenerator

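Interprets the partition path value as a timestamp instead of simply converting it to a string, as is done for the record key.

The following parameters must be set: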

| Config | Meaning/purpose |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | One of the supported timestamp types (UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR) |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | Output date format |
| hoodie.deltastreamer.keygen.timebased.timezone | Time zone of the date format |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | Input date format |

Examples:

GMT

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "EPOCHMILLISECONDS" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |

Input: "1578283932000L"; output: "2020-01-06 12"

Note: a null input produces "1970-01-01 08".
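The arithmetic behind this example can be reproduced with Python's standard library. This is only a sketch of the conversion, not Hudi's code: `epoch_millis_to_partition` is an illustrative name, the Python pattern `%Y-%m-%d %I` stands in for the Java pattern `yyyy-MM-dd hh` (`%I` and `hh` are both the 12-hour clock), and treating null as epoch 0 is an assumption inferred from the note above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def epoch_millis_to_partition(value, utc_offset_hours=8, out_format="%Y-%m-%d %I"):
    """Convert epoch milliseconds to a partition path string in a fixed-offset zone."""
    # Assumption: a missing value is treated as epoch 0.
    millis = 0 if value is None else int(value)
    zone = timezone(timedelta(hours=utc_offset_hours))
    moment = (EPOCH + timedelta(milliseconds=millis)).astimezone(zone)
    return moment.strftime(out_format)

print(epoch_millis_to_partition(1578283932000))  # 2020-01-06 12
print(epoch_millis_to_partition(None))           # 1970-01-01 08
```

The first value corresponds to 04:12:12 UTC, which is 12:12:12 at GMT+8:00, so the hour field prints as 12.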

DATE_STRING

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd hh:mm:ss" |

Input: "2020-01-06 12:12:12"; output: "2020-01-06 12"

Note: a null input produces "1970-01-01 12:00:00".

SCALAR

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "SCALAR" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT" |
| hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit | "days" |

Input: "20000L"; output: "2024-10-04 12"

Note: a null input produces "1970-01-02 12".
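The SCALAR case counts whole time units since the Unix epoch. A minimal sketch of that arithmetic, assuming the unit name maps directly onto a `timedelta` keyword and using `%Y-%m-%d %I` as the Python counterpart of `yyyy-MM-dd hh`; `scalar_to_partition` is an illustrative helper, and the null case from the note above is not modelled here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def scalar_to_partition(value, unit="days", out_format="%Y-%m-%d %I"):
    """Add `value` units to the epoch and format the result in GMT."""
    # `unit` must be a timedelta keyword such as "days", "hours" or "seconds".
    moment = EPOCH + timedelta(**{unit: int(value)})
    return moment.strftime(out_format)

print(scalar_to_partition(20000))  # 2024-10-04 12
```

20000 days after 1970-01-01 is midnight on 2024-10-04, and midnight prints as 12 on a 12-hour clock.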

ISO8601WithMsZ with Single Input format

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "GMT" |

Input: "2020-04-01T13:01:33.428Z"; output: "2020040113"

ISO8601WithMsZ with Multiple Input formats

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |

Input: "2020-04-01T13:01:33.428Z"; output: "2020040113"

ISO8601NoMs with offset using multiple input formats

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |

Input: "2020-04-01T13:01:33-05:00"; output: "2020040118"

Short date string as input, formatted date as output

| Config field | Value |
| --- | --- |
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "UTC" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "MM/dd/yyyy" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |

Input: "20200401"; output: "04/01/2020"
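The multi-format DATE_STRING examples amount to "try each input format in order, then render the result in the output time zone". The sketch below imitates that idea with Python's `strptime`; it is not Hudi's parser. The Python patterns are hand-translated stand-ins for the Java patterns in the tables, `parse_first_match` and `to_partition` are illustrative names, and applying the input time zone only when the string carries no offset is an assumption.

```python
from datetime import datetime, timezone

# Python counterparts of the three Java input patterns (hand-translated).
INPUT_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y%m%d",
]

def parse_first_match(text, formats=INPUT_FORMATS, default_zone=timezone.utc):
    """Return the datetime produced by the first format that matches."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit; try the next one
        if parsed.tzinfo is None:
            # Assumption: strings without an offset use the input time zone.
            parsed = parsed.replace(tzinfo=default_zone)
        return parsed
    raise ValueError(f"no input format matches {text!r}")

def to_partition(text, out_format, out_zone=timezone.utc):
    """Parse `text`, convert it to the output zone, and format it."""
    return parse_first_match(text).astimezone(out_zone).strftime(out_format)

print(to_partition("2020-04-01T13:01:33.428Z", "%Y%m%d%H"))   # 2020040113
print(to_partition("2020-04-01T13:01:33-05:00", "%Y%m%d%H"))  # 2020040118
print(to_partition("20200401", "%m/%d/%Y"))                   # 04/01/2020
```

The second call shows the offset handling: 13:01 at -05:00 is 18:01 in UTC, which is why the hour in the partition path becomes 18.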

References