Applies to version 0.10.1.
1 Asynchronous Compaction
Hudi uses asynchronous compaction by default. It consists of two phases:
Scheduling
Done by the ingestion job: Hudi scans the partitions, selects the file slices to compact, and writes a compaction plan to the timeline.
Execution
A separate process reads the compaction plan and compacts the selected file slices.
(1) Spark Structured Streaming
Enabled by default.
```scala
import org.apache.hudi.DataSourceWriteOptions;
```
(2) DeltaStreamer Continuous Mode
In continuous mode, a long-running Spark application ingests and compacts data continuously.
```shell
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
  ...
```
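The command above is truncated in the original. A fuller invocation might look like the sketch below; the source class, properties file, jar name, and paths are placeholder assumptions, not taken from the original.

```shell
# Sketch of a DeltaStreamer continuous-mode job (placeholders throughout):
# ingestion and asynchronous compaction share one long-lived Spark application.
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle_2.11-0.6.0.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --props /path/to/kafka-source.properties \
  --target-base-path /path/to/hudi_table \
  --target-table hudi_table \
  --continuous
```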
2 Synchronous Compaction
If you care about how quickly committed data becomes visible to queries, you can compact synchronously within the same job.
Pass --disable-compaction to turn off asynchronous compaction and compact synchronously instead. You can also tune resource allocation in the DeltaStreamer CLI via --delta-sync-scheduling-weight, --compact-scheduling-weight, --delta-sync-scheduling-minshare, and --compact-scheduling-minshare.
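As a sketch of the two modes (flags as named above; jar name, paths, and the weight/minshare values are illustrative assumptions, not recommendations):

```shell
# (a) Synchronous compaction: disable the async compaction service so the
#     ingestion job compacts within the same job.
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle_2.11-0.6.0.jar \
  --table-type MERGE_ON_READ \
  --target-base-path /path/to/hudi_table --target-table hudi_table \
  --disable-compaction

# (b) Keep async compaction, but bias resource allocation toward ingestion.
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle_2.11-0.6.0.jar \
  --table-type MERGE_ON_READ --continuous \
  --target-base-path /path/to/hudi_table --target-table hudi_table \
  --delta-sync-scheduling-weight 2 \
  --compact-scheduling-weight 1 \
  --delta-sync-scheduling-minshare 1 \
  --compact-scheduling-minshare 1
```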
3 Offline Compaction
By default, MOR tables run a compaction once every 5 commits.
Note: a compaction task consists of two steps: scheduling the compaction plan and executing it. It is recommended to let the write task trigger plan scheduling periodically; compaction.schedule.enable is on by default.
(1) Compaction Tool
Hudi ships a standalone tool for asynchronous compaction. --instant-time is optional; if omitted, the earliest scheduled compaction is executed.
```shell
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
  ...
```
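The command above is truncated in the original; it typically continues with the compactor class and its arguments. The sketch below uses placeholder values and assumes the utilities bundle ships org.apache.hudi.utilities.HoodieCompactor.

```shell
# Sketch: standalone compaction via HoodieCompactor (placeholders in <>).
spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle_2.11-0.6.0.jar \
  --base-path <base_path> \
  --table-name <table_name> \
  --parallelism <parallelism> \
  --schema-file <schema_file> \
  --instant-time <compaction_instant>
# --instant-time is optional; the earliest scheduled compaction runs if omitted.
```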
(2) CLI
```shell
hudi:trips->compaction run --tableName <table_name> --parallelism <parallelism> --compactionInstant <InstantTime>
```
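A minimal session might first schedule a plan, inspect pending compactions, and then run one. This is a hypothetical sketch: the connect path is a placeholder and exact hudi-cli prompts and arguments vary by version.

```shell
# Connect to the table, schedule a compaction plan, list plans, execute one.
hudi->connect --path /path/to/hudi_table
hudi:trips->compaction schedule
hudi:trips->compactions show all
hudi:trips->compaction run --tableName <table_name> --parallelism <parallelism> --compactionInstant <InstantTime>
```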
(3) Flink Offline Compaction
```shell
# Command line
./bin/flink run -c org.apache.hudi.sink.compact.HoodieFlinkCompactor lib/hudi-flink-bundle_2.11-0.10.1.jar --path <path-to-hudi-table>
```
Option Name | Required | Default | Remarks
---|---|---|---
--path | true | -- | The path where the target table is stored on Hudi
--compaction-max-memory | false | 100 | The index map size (in MB) for log data during compaction, 100 MB by default. If you have enough memory, you can increase this parameter
--schedule | false | false | Whether to schedule a compaction plan. Turning this on while a write task is still writing risks losing data, so make sure no write task is currently writing to this table before enabling it
--seq | false | LIFO | The order in which compaction plans are executed. LIFO: start from the latest plan (default). FIFO: start from the oldest plan
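Putting the options together, an offline compaction run might look like the following sketch; the bundle jar, memory value, and table path are placeholder assumptions.

```shell
# Execute pending compaction plans, oldest first, with a larger index map.
# --schedule is deliberately omitted: scheduling while a writer is active
# risks data loss, per the table above.
./bin/flink run -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
  lib/hudi-flink-bundle_2.11-0.10.1.jar \
  --path hdfs:///path/to/hudi_table \
  --compaction-max-memory 200 \
  --seq FIFO
```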