ClickHouse Pitfall Notes


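Background: Presto was too slow for our queries on large data volumes. After weighing ClickHouse's technical characteristics against the shape of our data, we decided to replace Presto with ClickHouse.

Approach: sync Hive data into ClickHouse incrementally every day.

Note: ClickHouse is abbreviated as ck below.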

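Steps:

1. Create a Hive-engine table in ck.
2. Create a MergeTree-engine table in ck.
3. Every day, incrementally sync the Hive-engine table into the MergeTree-engine table.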
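The three steps can be sketched roughly as follows. This is only an illustration, not our actual setup: the table names, the columns, the metastore address thrift://metastore-host:9083, the Hive database and table names, and the partition column dt are all hypothetical placeholders. The Python code only assembles the SQL text; running the statements is left to whichever ClickHouse client you use.

```python
from datetime import date

# All names below are hypothetical placeholders, not taken from a real setup.
HIVE_SRC = "default.hive_src"  # ck table backed by the Hive engine
TARGET = "default.local_mt"    # ck table backed by the MergeTree engine


def hive_engine_ddl() -> str:
    """Step 1: a ck table that reads a partitioned Hive table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {HIVE_SRC} "
        "(user_id UInt64, amount Float64, dt String) "
        "ENGINE = Hive('thrift://metastore-host:9083', 'hive_db', 'hive_table') "
        "PARTITION BY dt"
    )


def merge_tree_ddl() -> str:
    """Step 2: the local MergeTree table that queries will hit."""
    return (
        f"CREATE TABLE IF NOT EXISTS {TARGET} "
        "(user_id UInt64, amount Float64, dt String) "
        "ENGINE = MergeTree PARTITION BY dt ORDER BY user_id"
    )


def daily_sync_sql(day: date) -> str:
    """Step 3: copy a single day's partition from the Hive-backed table."""
    return (
        f"INSERT INTO {TARGET} SELECT user_id, amount, dt FROM {HIVE_SRC} "
        f"WHERE dt = '{day.isoformat()}'"
    )


if __name__ == "__main__":
    for stmt in (hive_engine_ddl(), merge_tree_ddl(), daily_sync_sql(date(2016, 12, 2))):
        print(stmt + ";")
```

Filtering on the partition column keeps each daily run limited to one partition, which is what makes the sync incremental.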

Pitfalls

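When we first created the table in ck, the Hive table behind it was stored as plain text. We later switched it to ORC on the Hive side but did not recreate the table in ck, and queries then failed with the error below. Dropping the ck table and recreating it fixed the problem.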

ERROR: There is no line feed. "P" found instead.
It's like your file has more columns than expected.
And if your file has the right number of columns, maybe it has an unquoted string value with a comma.

: While executing HiveTextRowInputFormat: While executing Hive. (INCORRECT_DATA)

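For a non-partitioned Hive table, the Hive-engine table we created in ck returned a row count of 0. The official documentation seems to imply that the Hive engine only works with partitioned tables, so we used the HDFS engine instead.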
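A rough sketch of the HDFS-engine workaround is shown below. The table name, the columns, and the HDFS URI are hypothetical, and the ORC format is an assumption carried over from the previous pitfall. The helper just builds the DDL text for a ck table that reads the Hive table's files directly.

```python
from typing import Dict


def hdfs_engine_ddl(table: str, columns: Dict[str, str], hdfs_uri: str, fmt: str = "ORC") -> str:
    """Build DDL for a ck table that reads files under hdfs_uri with the HDFS engine."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"ENGINE = HDFS('{hdfs_uri}', '{fmt}')"
    )


if __name__ == "__main__":
    # Hypothetical dimension table stored without partitions in Hive.
    print(
        hdfs_engine_ddl(
            "default.dim_city",
            {"city_id": "UInt32", "city_name": "String"},
            "hdfs://namenode:8020/warehouse/dim_city/*",
        )
    )
```

The trailing * in the URI is a glob, so every file in the table directory is read. The column list has to match the file schema, since the HDFS engine does not consult the Hive metastore.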

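A Hive table whose fields are delimited by '\t' fails in ck with the error below. Changing the delimiter on the Hive side back to the default fixes it.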

Row 1:
Column 0, name: data_date, type: Date, parsed text: "2016-12-02"
ERROR: There is no line feed. "▒" found instead.
It's like your file has more columns than expected.
And if your file has the right number of columns, maybe it has an unquoted string value with a comma.

: While executing HiveTextRowInputFormat: While executing Hive. (INCORRECT_DATA)
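One way to see why the delimiter matters: Hive's default field delimiter is the control character \x01 (Ctrl-A). The sketch below assumes the reader on the ck side splits rows on that default, which is consistent with the fix above but is not something we verified in the ck source. The sample rows are made up.

```python
from typing import List

HIVE_DEFAULT_DELIM = "\x01"  # Hive's default field delimiter (Ctrl-A)


def split_row(line: str, delim: str = HIVE_DEFAULT_DELIM) -> List[str]:
    """Split one text row into fields the way a reader expecting delim would."""
    return line.rstrip("\n").split(delim)


# Made-up rows: the same three values, written with two different delimiters.
tab_row = "2016-12-02\t42\tbeijing\n"
default_row = "2016-12-02\x0142\x01beijing\n"

if __name__ == "__main__":
    print(len(split_row(tab_row)))      # the tab-delimited row is not split at all
    print(len(split_row(default_row)))  # the default-delimited row yields three fields
```

A reader that expects three columns but cannot find its delimiter has no way to line the row up with the schema, which is in line with the "more columns than expected" hint in the error message.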

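When importing incremental Hive data into ck, the data in a new partition sometimes cannot be queried. A likely cause is a problem with the HDFS files: ck only scans normally named files, and temporary files whose names start with something like -ext are not picked up. The fix is to overwrite that partition again.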
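To spot such a partition early, one option is to list the partition directory (for example with hdfs dfs -ls) and flag file names that look temporary. The sketch below is a minimal version of that check. The only prefix it knows by default is the -ext case mentioned above, the exact rule ck applies when scanning is not something we confirmed, and the paths in the example are made up.

```python
from typing import Iterable, List, Tuple


def suspicious_files(names: Iterable[str], prefixes: Tuple[str, ...] = ("-ext",)) -> List[str]:
    """Return the paths whose base name starts with one of the given prefixes."""
    flagged = []
    for name in names:
        base = name.rsplit("/", 1)[-1]  # keep only the last path component
        if base.startswith(prefixes):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Made-up listing of a single partition directory.
    listing = [
        "/warehouse/t/dt=2016-12-02/000000_0",
        "/warehouse/t/dt=2016-12-02/-ext-10000",
    ]
    print(suspicious_files(listing))
```

If the check flags anything, that partition is a candidate for the overwrite described above.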

To be continued...
