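Applies to version 0.10.1.
Every record is uniquely identified by its primary key, which is the pair of record key and partition path.
The primary key enforces uniqueness at the partition level and makes updates and deletes of individual records efficient.
The partitioning scheme directly determines record ingestion and query latency.
Hudi currently supports partitioned and global indexes. A partitioned index identifies each record by the combination of record key and partition path, whereas a global index guarantees uniqueness by record key alone.
Hudi ships several ready-to-use key generators and also provides an interface for custom implementations.
The following configurations are required to use a key generator: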
Config | Meaning/purpose |
---|---|
hoodie.datasource.write.recordkey.field | Refers to the record key field. This is a mandatory field. |
hoodie.datasource.write.partitionpath.field | Refers to the partition path field. This is a mandatory field. |
hoodie.datasource.write.keygenerator.class | Refers to the key generator class (including the full path). Can be any of the available ones or a user-defined one. This is a mandatory field. |
hoodie.datasource.write.partitionpath.urlencode | When set to true, the partition path is URL-encoded. Default value is false. |
hoodie.datasource.write.hive_style_partitioning | When set to true, uses Hive-style partitioning. The partition field name is prefixed to the value. Format: “ |
Note: the timestamp-based key generator requires additional configuration.
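As a rough illustration, the options in the table above could be collected into a write-options dictionary as sketched below. The column names `uuid` and `region` are placeholders that do not come from this page, and the helper `key_generator_options` is hypothetical. With PySpark such a dictionary is commonly passed through `df.write.format("hudi").options(**opts)`, which this sketch does not execute.

```python
# Sketch only: collects the key-generator write options into a plain dict.
# "uuid" and "region" are hypothetical column names.

def key_generator_options(record_key_field, partition_path_field,
                          keygen_class="org.apache.hudi.keygen.SimpleKeyGenerator",
                          url_encode=False, hive_style=False):
    """Return the write options that configure key generation."""
    return {
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.partitionpath.field": partition_path_field,
        "hoodie.datasource.write.keygenerator.class": keygen_class,
        # Both flags default to false, matching the table above.
        "hoodie.datasource.write.partitionpath.urlencode": str(url_encode).lower(),
        "hoodie.datasource.write.hive_style_partitioning": str(hive_style).lower(),
    }

opts = key_generator_options("uuid", "region")
for name, value in opts.items():
    print(f"{name} = {value}")
```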
1 SimpleKeyGenerator
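The record key and the partition path are each taken from a single data column specified by name. Values are converted to strings.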
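Conceptually this amounts to the simplified model below. It is not Hudi's implementation; the record layout and the field names `id` and `country` are invented for illustration.

```python
# Simplified model of a single-column key generator; not Hudi's code.

def simple_key(record, record_key_field, partition_path_field):
    """Return (record_key, partition_path), both converted to strings."""
    return str(record[record_key_field]), str(record[partition_path_field])

row = {"id": 42, "country": "us", "amount": 9.5}  # hypothetical record
print(simple_key(row, "id", "country"))           # ('42', 'us')
```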
2 ComplexKeyGenerator
The record key and the partition path are each built from a comma-separated list of columns, e.g. "hoodie.datasource.write.recordkey.field" : "col1,col4"
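A minimal sketch of how a comma-separated column list might be turned into a composite key follows. The exact string layout used here (`name:value` pairs joined by commas for the record key, values joined by `/` for the partition path) is an assumption about the generator's output and is not stated on this page; the record and its values are invented.

```python
# Simplified model of a multi-column key generator; not Hudi's code.
# ASSUMPTION: "name:value" pairs joined by "," form the record key,
# and the partition column values are joined by "/".

def split_fields(spec):
    """Split a comma-separated field list such as "col1,col4"."""
    return [f.strip() for f in spec.split(",") if f.strip()]

def complex_key(record, record_key_spec, partition_path_spec):
    """Return (record_key, partition_path) built from several columns."""
    key = ",".join(f"{f}:{record[f]}" for f in split_fields(record_key_spec))
    path = "/".join(str(record[f]) for f in split_fields(partition_path_spec))
    return key, path

row = {"col1": "a", "col2": "2021", "col3": "12", "col4": 7}  # hypothetical record
print(complex_key(row, "col1,col4", "col2,col3"))  # ('col1:a,col4:7', '2021/12')
```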
3 GlobalDeleteKeyGenerator
Does not use the partition path.
4 NonPartitionedKeyGenerator
For non-partitioned tables; returns an empty partition.
5 CustomKeyGenerator
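A generic implementation that combines multiple key generators.
Suited to complex partition path scenarios. Specify the partition path in hoodie.datasource.write.partitionpath.field in the form field1:PartitionKeyType1,field2:PartitionKeyType2…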
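The `field:PartitionKeyType` syntax can be illustrated with a small parser. This is a sketch rather than Hudi's own parsing logic; it assumes the allowed key types are `SIMPLE` and `TIMESTAMP`, and the field names `country` and `created_at` are invented.

```python
# Sketch of parsing "field1:PartitionKeyType1,field2:PartitionKeyType2".
# ASSUMPTION: the allowed key types are SIMPLE and TIMESTAMP.

ALLOWED_TYPES = {"SIMPLE", "TIMESTAMP"}

def parse_partition_spec(spec):
    """Return a list of (field, key_type) pairs from a partition path spec."""
    pairs = []
    for item in spec.split(","):
        field, sep, key_type = item.strip().partition(":")
        if not sep or not field or key_type.upper() not in ALLOWED_TYPES:
            raise ValueError(f"malformed partition path entry: {item!r}")
        pairs.append((field, key_type.upper()))
    return pairs

print(parse_partition_spec("country:SIMPLE,created_at:TIMESTAMP"))
# [('country', 'SIMPLE'), ('created_at', 'TIMESTAMP')]
```

Rejecting entries without a type keeps a plain column list from being silently accepted where a typed list is expected.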
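6 Custom implementation
7 TimestampBasedKeyGenerator
Parses the partition path value as a timestamp instead of converting it to a string as is done for the record key.
The following parameters need to be set: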
Config | Meaning/purpose |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | One of the supported timestamp types (UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR) |
hoodie.deltastreamer.keygen.timebased.output.dateformat | Output date format |
hoodie.deltastreamer.keygen.timebased.timezone | Timezone of the data format |
hoodie.deltastreamer.keygen.timebased.input.dateformat | Input date format |
Examples:
GMT
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "EPOCHMILLISECONDS" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
Input: "1578283932000L", Output: "2020-01-06 12"
Note: a null input yields the output "1970-01-01 08".
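The arithmetic behind this example can be checked with a short sketch using only the Python standard library. It does not call Hudi; treating a null input as epoch 0 is an assumption made here because it reproduces the note above, and `%Y-%m-%d %I` is the Python counterpart of `yyyy-MM-dd hh` (where `hh` is the 12-hour clock).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def format_epoch_millis(millis, utc_offset_hours, out_format):
    """Format epoch milliseconds in a fixed-offset zone; None is treated as 0."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    instant = EPOCH + timedelta(milliseconds=millis or 0)
    return instant.astimezone(zone).strftime(out_format)

print(format_epoch_millis(1578283932000, 8, "%Y-%m-%d %I"))  # 2020-01-06 12
print(format_epoch_millis(None, 8, "%Y-%m-%d %I"))           # 1970-01-01 08
```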
DATE_STRING
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd hh:mm:ss" |
Input: "2020-01-06 12:12:12", Output: "2020-01-06 12"
Note: a null input yields the output "1970-01-01 12:00:00".
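A minimal sketch of the parse-then-reformat step, written with Python's `datetime` rather than Hudi itself. The Python format strings are hand-translated stand-ins for the date patterns in the table, and the helper name is hypothetical; input and output share the same zone here, so no offset conversion is involved.

```python
from datetime import datetime

def reformat_date_string(value, in_format, out_format):
    """Parse value with in_format and render it with out_format (same zone)."""
    return datetime.strptime(value, in_format).strftime(out_format)

IN_FORMAT = "%Y-%m-%d %H:%M:%S"
OUT_FORMAT = "%Y-%m-%d %I"  # %I is the 12-hour clock, like "hh"

print(reformat_date_string("2020-01-06 12:12:12", IN_FORMAT, OUT_FORMAT))  # 2020-01-06 12
print(reformat_date_string("2020-01-06 15:12:12", IN_FORMAT, OUT_FORMAT))  # 2020-01-06 03
```

The second call shows that a 12-hour output pattern maps 15:00 and 03:00 to the same partition value; a 24-hour pattern (`HH`, or `%H` in Python) avoids that collision.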
SCALAR
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "SCALAR" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
hoodie.deltastreamer.keygen.timebased.timezone | "GMT" |
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit | "days" |
Input: "20000L", Output: "2024-10-04 12"
Note: a null input yields the output "1970-01-02 12".
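The scalar case reduces to adding a number of time units to the epoch. The sketch below reproduces the example with the Python standard library only; it passes the unit name straight to `timedelta`, so it handles only units that `timedelta` accepts (such as `days`, `hours`, `seconds`), which is a simplification made here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def format_scalar(value, unit, out_format):
    """Interpret value as a count of `unit` since the epoch and format it in GMT."""
    instant = EPOCH + timedelta(**{unit: value})
    return instant.strftime(out_format)

# 20000 days after 1970-01-01 is midnight on 2024-10-04; midnight prints as 12 on a 12-hour clock.
print(format_scalar(20000, "days", "%Y-%m-%d %I"))  # 2024-10-04 12
```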
ISO8601WithMsZ with Single Input format
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
hoodie.deltastreamer.keygen.timebased.output.timezone | "GMT" |
Input: "2020-04-01T13:01:33.428Z", Output: "2020040113"
ISO8601WithMsZ with Multiple Input formats
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |
Input: "2020-04-01T13:01:33.428Z", Output: "2020040113"
ISO8601NoMs with offset using multiple input formats
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |
Input: "2020-04-01T13:01:33-05:00", Output: "2020040118"
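Trying several input formats in order and then converting to the output timezone can be sketched as follows. This uses Python's `datetime` (3.7 or later, where `%z` accepts both `Z` and offsets such as `-05:00`), not Hudi's parser; the Python format strings are hand-translated stand-ins for the two ISO 8601 patterns above, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Python counterparts of "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ".
FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]

def parse_with_formats(value, formats):
    """Return the first successful parse of value among formats."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"no input format matched {value!r}")

def to_partition(value, out_format="%Y%m%d%H"):
    """Parse value, shift it to UTC, and render the partition string."""
    parsed = parse_with_formats(value, FORMATS)
    return parsed.astimezone(timezone.utc).strftime(out_format)

print(to_partition("2020-04-01T13:01:33-05:00"))  # 2020040118
print(to_partition("2020-04-01T13:01:33.428Z"))   # 2020040113
```

The first input carries a -05:00 offset, so 13:01 local time becomes 18:01 in UTC, which explains the hour 18 in the output.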
Input as short date string and expect date in date format
Config field | Value |
---|---|
hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd" |
hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
hoodie.deltastreamer.keygen.timebased.input.timezone | "UTC" |
hoodie.deltastreamer.keygen.timebased.output.dateformat | "MM/dd/yyyy" |
hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |
Input: "20200401", Output: "04/01/2020"
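When the input text carries no offset, as with an eight-digit date such as `20200401`, the input timezone setting decides how it is interpreted. The sketch below models that with Python's `datetime`; it is an illustration under the assumption that the input zone is applied only to values parsed without an offset, and the helper name `short_date_to_partition` is hypothetical.

```python
from datetime import datetime, timezone

# Python counterparts of the three input patterns, tried in order.
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z", "%Y%m%d"]

def short_date_to_partition(value, input_zone=timezone.utc, out_format="%m/%d/%Y"):
    """Parse value with the first matching format and render it in UTC."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # ASSUMPTION: no offset in the text, so the input zone applies.
            parsed = parsed.replace(tzinfo=input_zone)
        return parsed.astimezone(timezone.utc).strftime(out_format)
    raise ValueError(f"no input format matched {value!r}")

print(short_date_to_partition("20200401"))  # 04/01/2020
```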