Spark SQL Execution Flow

Posted on 2020-09-26 | Category: Spark

1 Key data structures

Tree: the syntax tree.
Rule: a rule applied to the syntax tree.
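
To make the relationship between a tree and a rule concrete, here is a minimal sketch in plain Python. It is not Spark's Catalyst code: the `Node` class, the `transform_up` method, and the `fold_constants` rule are hypothetical names chosen for illustration. The idea it shows is that a rule is a function from one tree to an equivalent tree, applied node by node.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical tree node: an operator name, an optional value, child nodes."""
    op: str
    value: object = None
    children: Tuple["Node", ...] = ()

    def transform_up(self, rule: Callable[["Node"], "Node"]) -> "Node":
        """Apply `rule` bottom-up: rewrite the children first, then this node."""
        new_children = tuple(c.transform_up(rule) for c in self.children)
        return rule(Node(self.op, self.value, new_children))


def fold_constants(node: Node) -> Node:
    """Hypothetical rule: replace an Add of literals with a single Literal."""
    if (node.op == "Add" and node.children
            and all(c.op == "Literal" for c in node.children)):
        return Node("Literal", sum(c.value for c in node.children))
    return node


# The expression x + (1 + 2), written as a tree.
plan = Node("Add", children=(
    Node("Attribute", "x"),
    Node("Add", children=(Node("Literal", 1), Node("Literal", 2))),
))

# After the rule runs, the inner Add has been folded into Literal(3).
optimized = plan.transform_up(fold_constants)
```

The outer `Add` is left alone because one of its children is an attribute rather than a literal, so only the constant subtree is rewritten.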

Call the explain() method on a SQL query to view its execution plan.


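Data Quality

Posted on 2020-09-19 | Category: Data Warehouse

Data quality is the foundation for the validity and accuracy of conclusions drawn from data analysis.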


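Internet Online Operations Analytics Platform

Posted on 2020-09-19 | Category: Data Warehouse

1 Business scenario and development process

(1) Business scenario

Count the website's PV (page views) and UV (unique visitors)
Aggregate by terminal type and region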
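
The two requirements above can be sketched in a few lines of plain Python. The record layout `(user_id, terminal, region)` and the sample values are assumptions made for illustration; a real platform would read these fields from its access logs and would likely run the aggregation in a distributed engine.

```python
from collections import defaultdict

# Hypothetical page-view records: (user_id, terminal, region).
events = [
    ("u1", "mobile", "Beijing"),
    ("u1", "mobile", "Beijing"),
    ("u2", "mobile", "Beijing"),
    ("u2", "pc", "Shanghai"),
    ("u3", "pc", "Shanghai"),
]


def pv_uv_by_dimension(records):
    """Return {(terminal, region): (pv, uv)} for the given records."""
    pv = defaultdict(int)     # every record counts as one page view
    users = defaultdict(set)  # distinct users seen per dimension key
    for user_id, terminal, region in records:
        key = (terminal, region)
        pv[key] += 1
        users[key].add(user_id)
    return {key: (pv[key], len(users[key])) for key in pv}


stats = pv_uv_by_dimension(events)
for (terminal, region), (pv, uv) in sorted(stats.items()):
    print(terminal, region, "PV:", pv, "UV:", uv)
```

PV counts every record, while UV counts each user once per (terminal, region) group, which is why the two numbers diverge as soon as a user visits more than once.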


Building and Optimizing a Data Warehouse

Posted on 2020-09-14 | Category: Data Warehouse

1 Process


Two Ways to Receive Kafka Data in Spark Streaming

Posted on 2020-09-10 | Category: Spark

Note: Kafka 0.8 support is marked as deprecated as of Spark 2.3.


Spark Streaming + Kafka

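Posted on 2020-08-29 | Category: Spark

Applies to Spark 3.0.1.

The integration built on the new consumer API provides parallelism, a 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. Its usage differs from that of the older direct API.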
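
The 1:1 partition mapping and the offset access can be pictured with a small conceptual model in plain Python. This is not the Spark API: `OffsetRange` here is a hypothetical tuple that only mirrors the idea of a per-partition offset range, and `plan_batch` with its `consumed` and `latest` arguments is an assumption made for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class OffsetRange(NamedTuple):
    """Hypothetical model of the slice of one Kafka partition read in one batch."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


TopicPartition = Tuple[str, int]


def plan_batch(consumed: Dict[TopicPartition, int],
               latest: Dict[TopicPartition, int]) -> List[OffsetRange]:
    """Build exactly one range, and so one task, per Kafka partition."""
    ranges = []
    for (topic, partition), until in sorted(latest.items()):
        # Assumption: a partition never read before starts at offset 0.
        start = consumed.get((topic, partition), 0)
        ranges.append(OffsetRange(topic, partition, start, until))
    return ranges


batch = plan_batch(
    consumed={("image-topic", 0): 100},
    latest={("image-topic", 0): 150, ("image-topic", 1): 40},
)
```

Because each range names its topic, partition, and offsets explicitly, the application can record exactly what a batch covered, which is what makes offset management possible on the application side.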


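High-Performance Indexes

Posted on 2020-08-19 | Category: High Performance MySQL

1 Basics

MySQL can only make efficient use of a leftmost prefix of an index's columns.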
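
The leftmost-prefix rule can be illustrated with a deliberately simplified model in Python. It is not how the MySQL optimizer is implemented: the function `usable_prefix` and the column names are hypothetical, and the model covers equality predicates only, ignoring range conditions and other optimizer behavior.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that equality predicates can use.

    Simplified model: walk the index columns in order and stop at the
    first one the query does not constrain.
    """
    used = []
    for column in index_columns:
        if column not in equality_columns:
            break
        used.append(column)
    return used


# Hypothetical composite index on (last_name, first_name, dob).
index = ["last_name", "first_name", "dob"]

print(usable_prefix(index, {"last_name", "first_name"}))  # both leading columns
print(usable_prefix(index, {"last_name", "dob"}))         # stops after last_name
print(usable_prefix(index, {"first_name", "dob"}))        # no usable prefix
```

In the second call the predicate on `dob` cannot narrow the index lookup because `first_name` is skipped, and in the third call the index offers no efficient entry point at all.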

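Queries generated by an ORM are merely valid queries; check whether they can actually make use of an index.

MySQL indexes are implemented by the storage engine.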


Scala Pitfalls

Posted on 2020-08-14 | Category: Scala

1 implicit

(1) Implicit type conversion

Used to define custom type conversions.


Kafka Pitfalls

Posted on 2020-08-11 | Category: Kafka

1 [2020-08-10 18:27:46,648] WARN [Consumer clientId=consumer-1, groupId=console-consumer-18513] Synchronous auto-commit of offsets {image-topic-0=OffsetAndMetadata{offset=0, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
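
The warning ties two settings together: a batch of up to max.poll.records messages has to be processed within max.poll.interval.ms. A rough way to size the batch is sketched below in Python. The helper `max_safe_poll_records`, the per-record processing time, and the `safety_factor` are assumptions made for illustration; the default of 300000 ms is the Kafka consumer default for max.poll.interval.ms.

```python
def max_safe_poll_records(per_record_ms, max_poll_interval_ms=300_000,
                          safety_factor=0.5):
    """Estimate the largest batch that fits in the poll interval.

    per_record_ms: measured average processing time per message (assumption).
    safety_factor: fraction of the interval to budget, leaving headroom
                   for slow records and pauses (assumption).
    """
    budget_ms = max_poll_interval_ms * safety_factor
    return max(1, int(budget_ms // per_record_ms))


# If each message takes about one second to process, the default batch
# size of 500 would need roughly 500 s, which exceeds the 300 s interval.
print(max_safe_poll_records(per_record_ms=1000))
```

If the estimate comes out below the configured max.poll.records, either lower that setting or raise max.poll.interval.ms, as the log message suggests.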


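Virtual Machine Pitfalls

Posted on 2020-08-11 | Category: Virtual Machines

1 Opening a virtual machine fails with the message "in use"

Solution: delete the .lck files in the virtual machine's directory.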
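
Before deleting anything, it can help to list what would be removed. The Python sketch below only finds entries whose names end in .lck and prints them; the function name `find_lock_entries` and the directory path in the usage lines are hypothetical, and it assumes the virtual machine is not actually running in another process.

```python
from pathlib import Path


def find_lock_entries(vm_dir):
    """Return every path under `vm_dir` whose name ends in .lck, sorted.

    Lock entries may be files or directories, so both are reported.
    """
    return sorted(Path(vm_dir).rglob("*.lck"))


# Hypothetical usage: review the list first, then delete the entries by hand.
if __name__ == "__main__":
    for entry in find_lock_entries("."):
        print(entry)
```

Keeping the script read-only is a deliberate choice: removing a lock that belongs to a running virtual machine could corrupt its disk files.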


Hopeful Nick

To Explore
