Our customer data platform project relies on a large, wide user table for real-time, multi-dimensional analysis of customer data. It has been running on MySQL, but as the data volume keeps growing, moving to ClickHouse looks like the best way to optimize it.
ClickHouse resembles MySQL in that it is also a database, but it is built for OLAP (On-Line Analytical Processing) workloads, whereas traditional databases target OLTP (On-Line Transaction Processing).
Use cases
1. Far more reads than writes
Data is typically imported in bulk and then explored flexibly along arbitrary dimensions: BI-tool analysis, report building, and so on. After a one-time write, analysts probe the data from every angle until they uncover business value, trends, and other insights. This is an iterative process of trial, adjustment, and refinement, in which reads vastly outnumber writes.
2. Wide tables: many rows scanned, few columns used, small result sets
OLAP workloads usually revolve around one or a few wide tables with hundreds or even thousands of columns. A query picks a handful of columns as dimensions, another handful as metrics, then aggregates over the whole table or some large slice of it. Such a scan touches a huge number of rows but only a few columns, and the aggregated result set is tiny compared with the billions of raw rows.
3. No transactions, relaxed consistency requirements
OLAP workloads rarely need transactions. Data is usually imported from historical logs, or synced in real time from a companion transactional database. In fact, most OLAP systems only guarantee eventual consistency.
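The wide-table query shape from use case 2 can be sketched as a small builder. Everything here is hypothetical for illustration (the class, method, and column names are mine; the column names merely echo the demo table later in this post):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the typical OLAP query shape described above:
// a few dimension columns, a few aggregated metric columns, a scan over many rows.
public class OlapQueryBuilder {

    // Builds e.g. SELECT d1, d2, sum(m1) AS m1_sum FROM t GROUP BY d1, d2
    public static String build(String table, List<String> dims, List<String> metrics) {
        String dimCols = String.join(", ", dims);
        String aggCols = metrics.stream()
                .map(m -> "sum(" + m + ") AS " + m + "_sum")
                .collect(Collectors.joining(", "));
        return "SELECT " + dimCols + ", " + aggCols +
                " FROM " + table +
                " GROUP BY " + dimCols;
    }

    public static void main(String[] args) {
        // Two dimensions and two metrics out of a table that may have hundreds of columns
        System.out.println(build("whatever_table",
                List.of("province", "gender"),
                List.of("comment_like_cnt", "last30d_share_cnt")));
    }
}
```

However many columns the table has, the generated query only ever touches the handful that were selected, which is exactly the access pattern columnar storage rewards.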
Commonly used at big companies; see, for example, "How Suning achieves second-level user-profile queries across 600M+ members".
The internals are covered in detail in "ClickHouse Deep Dive": columnar storage, indexes, data partitioning/sharding, TTL, replication, and more.
This article goes further into ClickHouse's architecture: "Why is ClickHouse so fast?"
As an example, let's spin up a ClickHouse container with Docker. First pull the image:
docker pull yandex/clickhouse-server
Start the container; ClickHouse's HTTP interface listens on port 8123 by default (the native protocol used by clickhouse-client is on 9000), so a run command could look like:
docker run -d --name clickhouse-server -p 8123:8123 -p 9000:9000 yandex/clickhouse-server
Enter the CLI with clickhouse-client -h 127.0.0.1 (or docker exec -it clickhouse-server clickhouse-client).
IntelliJ IDEA can also connect to ClickHouse; see "How to connect IntelliJ IDEA to ClickHouse, a detailed tutorial".
Run a simple SQL statement, e.g. SELECT version();, and the result comes back successfully.
Add the Maven dependency:
<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.1.40</version>
</dependency>
The code is plain JDBC usage:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLTimeoutException;
import java.sql.Statement;
import java.util.Random;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest
public class ClickhouseWelcomeTest {
    private static final String[] GREETINGS = {
            "Welcome to DevStudio",
            "May all bugs be gone, and may your neck and back recover soon! Sincerely",
            "Got an object yet? No worries, here's one! Claim it in DingTalk group: 35991139",
            "404?!! Don't panic, App Observer has your back ~ check it out",
            "Tried Cosy's code completion yet? It can search, too",
            "Which code platform rules? Codeup! Codeup!! Codeup!!!"
    };

    @Test
    public void welcome() throws ClassNotFoundException {
        Random random = new Random();
        int position = random.nextInt(GREETINGS.length);
        String url = "172.27.31.10";
        String username = "";
        String password = "";
        Class.forName("ru.yandex.clickhouse.ClickHouseDriver");
        String connectionStr = "jdbc:clickhouse://" + url + ":8123";
        try {
            // Or DriverManager.getConnection(connectionStr, username, password) if auth is enabled
            Connection connection = DriverManager.getConnection(connectionStr);
            Statement stmt = connection.createStatement();
            // Create the local table
            String createLocalTableDDL = "CREATE TABLE IF NOT EXISTS whatever_table ON CLUSTER default " +
                    "(user_id UInt64, " +
                    "city_level SimpleAggregateFunction(anyLast, Nullable(Enum('一线城市' = 0, '二线城市' = 1, '三线城市' = 2, '四线城市' = 3))), " +
                    "gender SimpleAggregateFunction(anyLast, Nullable(Enum('女' = 0, '男' = 1))), " +
                    "interest_sports SimpleAggregateFunction(anyLast, Nullable(Enum('否' = 0, '是' = 1))), " +
                    "reg_date SimpleAggregateFunction(anyLast, DateTime), " +
                    "comment_like_cnt SimpleAggregateFunction(anyLast, Nullable(UInt32)), " +
                    "last30d_share_cnt SimpleAggregateFunction(anyLast, Nullable(UInt32)), " +
                    "user_like_consume_trend_type SimpleAggregateFunction(anyLast, Nullable(String)), " +
                    "province SimpleAggregateFunction(anyLast, Nullable(String)), " +
                    "last_access_version SimpleAggregateFunction(anyLast, Nullable(String)), " +
                    "others SimpleAggregateFunction(anyLast, Array(String))) " +
                    "ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(reg_date) ORDER BY user_id;";
            stmt.execute(createLocalTableDDL);
            System.out.println("create local table done.");
            // Create the distributed table
            String createDistributedTableDDL = "CREATE TABLE IF NOT EXISTS whatever_table_dist ON CLUSTER default " +
                    "AS default.whatever_table " +
                    "ENGINE = Distributed(default, default, whatever_table, intHash64(user_id));";
            stmt.execute(createDistributedTableDDL);
            System.out.println("create distributed table done");
            // Insert mock data
            String insertSQL = "INSERT INTO whatever_table (" +
                    "user_id, city_level, gender, interest_sports, reg_date, comment_like_cnt, " +
                    "last30d_share_cnt, user_like_consume_trend_type, province, last_access_version, others" +
                    ") SELECT " +
                    "number AS user_id, " +
                    "toUInt32(rand(11) % 4) AS city_level, " +
                    "toUInt32(rand(30) % 2) AS gender, " +
                    "toUInt32(rand(28) % 2) AS interest_sports, " +
                    "(toDateTime('2020-01-01 00:00:00') + rand(1) % (3600*24*30*4)) AS reg_date, " +
                    "toUInt32(rand(15) % 10) AS comment_like_cnt, " +
                    "toUInt32(rand(16) % 10) AS last30d_share_cnt, " +
                    "randomPrintableASCII(64) AS user_like_consume_trend_type, " +
                    "randomPrintableASCII(64) AS province, " +
                    "randomPrintableASCII(64) AS last_access_version, " +
                    "[randomPrintableASCII(64)] AS others " +
                    "FROM numbers(100000);";
            stmt.execute(insertSQL);
            System.out.println("Mock data and insert done.");
            System.out.println("Select count(user_id)...");
            ResultSet rs = stmt.executeQuery("select count(user_id) from whatever_table_dist");
            while (rs.next()) {
                int count = rs.getInt(1);
                System.out.println("user_id count: " + count);
            }
            // Merge data parts
            String optimizeSQL = "OPTIMIZE TABLE whatever_table FINAL;";
            // If the merge takes too long, it can be run per partition, in parallel
            String optimizeByPartitionSQL = "OPTIMIZE TABLE whatever_table PARTITION 202001 FINAL;";
            try {
                stmt.execute(optimizeByPartitionSQL);
            } catch (SQLTimeoutException e) {
                // Check merge progress with:
                // select * from system.merges where database = 'default' and table = 'whatever_table';
                Thread.sleep(60 * 1000);
            }
            // Cohort selection (city_level = '一线城市', gender = '男',
            // interest_sports = '是', reg_date <= '2020-01-31 23:59:59')
            String selectSQL = "SELECT user_id FROM whatever_table_dist " +
                    "WHERE city_level = 0 AND gender = 1 AND interest_sports = 1 AND reg_date <= NOW();";
            rs = stmt.executeQuery(selectSQL);
            while (rs.next()) {
                int user_id = rs.getInt(1);
                System.out.println("Got suitable user: " + user_id);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println(GREETINGS[position]);
    }
}
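The DDL above encodes categorical dimensions as ClickHouse Enum columns, i.e. named labels backed by integer codes, and the cohort query filters on those codes (city_level = 0, gender = 1). A small sketch of keeping the Java side in sync with that encoding; the class and constant names here are my own, not from any library:

```java
// Hypothetical helper mirroring the Enum codes declared in the ClickHouse DDL,
// e.g. Enum('一线城市' = 0, '二线城市' = 1, '三线城市' = 2, '四线城市' = 3).
public class UserProfileCodes {

    // Mirrors the city_level column: tier-1 through tier-4 cities
    public enum CityLevel {
        TIER_1(0), TIER_2(1), TIER_3(2), TIER_4(3);

        public final int code;

        CityLevel(int code) { this.code = code; }

        public static CityLevel fromCode(int code) {
            for (CityLevel c : values()) {
                if (c.code == code) return c;
            }
            throw new IllegalArgumentException("unknown city_level code: " + code);
        }
    }

    // Mirrors the gender column: Enum('女' = 0, '男' = 1)
    public enum Gender {
        FEMALE(0), MALE(1);

        public final int code;

        Gender(int code) { this.code = code; }
    }

    public static void main(String[] args) {
        // The cohort filter from the demo (tier-1 city, male) expressed via the mapping
        System.out.println("WHERE city_level = " + CityLevel.TIER_1.code
                + " AND gender = " + Gender.MALE.code);
    }
}
```

Keeping the codes in one place avoids magic numbers drifting apart between the DDL and the query layer.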
3. Alibaba Cloud
Product page: ApsaraDB for ClickHouse
The SDK: ApsaraDB for ClickHouse SDK
Debugging: Alibaba Cloud OpenAPI Explorer
Since the project has to sync the MySQL wide-table data into ClickHouse, this step is critical. We use DataX; there is a Bilibili tutorial: tis-datax-mysql-clickhouse
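A DataX job is just a reader/writer pair described in JSON. The following is only a rough sketch of what a MySQL-to-ClickHouse job might look like: host names, credentials, and column lists are placeholders, and the exact parameters of the community `clickhousewriter` plugin should be checked against the DataX plugin documentation before use.

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "...",
          "password": "...",
          "column": ["user_id", "city_level", "gender", "reg_date"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://mysql-host:3306/profile_db"],
            "table": ["user_wide_table"]
          }]
        }
      },
      "writer": {
        "name": "clickhousewriter",
        "parameter": {
          "username": "...",
          "password": "...",
          "column": ["user_id", "city_level", "gender", "reg_date"],
          "connection": [{
            "jdbcUrl": "jdbc:clickhouse://ck-host:8123/default",
            "table": ["whatever_table"]
          }],
          "batchSize": 65536
        }
      }
    }]
  }
}
```

Batching writes (rather than inserting row by row) matters for ClickHouse, since each INSERT creates a new data part that later has to be merged.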
Alibaba Cloud can also import directly: "Import/sync from RDS MySQL".
For SQL syntax, go straight to the official docs: https://clickhouse.com/docs/zh/sql-reference/syntax/



