Learning HBase: Integrating Phoenix and Hive

Phoenix

Introduction to Phoenix

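Definition

Phoenix is an open-source SQL skin for HBase. It lets you use the standard JDBC API, instead of the HBase client API, to create tables, insert data, and query HBase data.

Features

1) Easy integration with Spark, Hive, Pig, Flume, and MapReduce;

2) Simple operation: DML commands, plus DDL commands for creating and altering tables, including versioned incremental changes;

3) Support for creating secondary indexes on HBase.

Architecture

Phoenix quick start: deployment and installation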

# 1. Official site
http://phoenix.apache.org/

# 2. Deploy Phoenix: upload and extract the tarball
[atguigu@hadoop102 module]$ tar -zxvf apache-phoenix-5.0.0-HBase-2.0-bin.tar.gz -C /opt/module/

[atguigu@hadoop102 module]$ mv apache-phoenix-5.0.0-HBase-2.0-bin phoenix
# 3. Copy the server jar into hbase/lib and distribute it to every node
[atguigu@hadoop102 module]$ cd /opt/module/phoenix/
[atguigu@hadoop102 phoenix]$ cp /opt/module/phoenix/phoenix-5.0.0-HBase-2.0-server.jar /opt/module/hbase/lib/
[atguigu@hadoop102 phoenix]$ xsync /opt/module/hbase/lib/phoenix-5.0.0-HBase-2.0-server.jar
# 4. Configure environment variables
# phoenix
export PHOENIX_HOME=/opt/module/phoenix
export PHOENIX_CLASSPATH=$PHOENIX_HOME
export PATH=$PATH:$PHOENIX_HOME/bin

# 5. Restart HBase
[atguigu@hadoop102 ~]$ stop-hbase.sh
[atguigu@hadoop102 ~]$ start-hbase.sh
# 6. Connect to Phoenix with the thick (heavyweight) client; if the ZooKeeper address is omitted, the local host is used by default
[atguigu@hadoop102 phoenix]$ sqlline.py hadoop102,hadoop103,hadoop104:2181

# 7. Connect to Phoenix with the thin (lightweight) client; the Query Server must be started first with queryserver.py
[atguigu@hadoop102 ~]$ queryserver.py start
[atguigu@hadoop102 ~]$ sqlline-thin.py hadoop102:8765 # port 8765 on the local host is the default

Phoenix shell operations

schema (database)

By default, a schema cannot be created directly in Phoenix. Add the following property to hbase-site.xml in HBase's conf directory and to hbase-site.xml in Phoenix's bin directory:

<property>
    <name>phoenix.schema.isNamespaceMappingEnabled</name>
    <value>true</value>
</property>

# Restart HBase and reconnect the Phoenix client.
[atguigu@hadoop102 ~]$ stop-hbase.sh
[atguigu@hadoop102 ~]$ start-hbase.sh
[atguigu@hadoop102 ~]$ sqlline.py hadoop102,hadoop103,hadoop104:2181

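# Create a schema. Note that an unquoted name such as mydb becomes uppercase in HBase; wrap the name in double quotes to keep it lowercase
create schema if not exists mydb;
create schema if not exists "mydb2";

# Drop a schema
drop schema if exists "mydb2";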

table

-- 1. List all tables
!table or !tables
-- 2. Create a table (a single column as the rowkey)
CREATE TABLE IF NOT EXISTS student(
id VARCHAR primary key,
name VARCHAR,
addr VARCHAR);

-- 3. Insert or update data
upsert into student(id,name,addr) values('1001','pihao','shenzhen');

-- 4. Query data
select id,name,addr from student;

-- 5. Delete data
delete from student where id = '1001';

-- 6. Create a table with a composite primary key; in HBase the key columns are concatenated to form the rowkey
CREATE TABLE IF NOT EXISTS us_population (
State CHAR(2) NOT NULL,
City VARCHAR NOT NULL,
Population BIGINT
CONSTRAINT my_pk PRIMARY KEY (state, city));

upsert into us_population values('CA','LOGSJ',1000000);

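Viewing the table in HBase:

What value=x means:

HBase does not allow a row that has only a rowkey and no column.
Phoenix does allow a row that has only a primary key (every other field is null).
So how is such a row stored in HBase? That is what value=x is for: it can be understood as an empty placeholder cell that lets the HBase row correspond to the Phoenix row.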

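Table mapping

1) No table in HBase: creating a table in Phoenix also creates the table in HBase.

2) The table already exists in HBase: create a (read-only) view in Phoenix to map it.
   # In the HBase shell
   create 'emp','info'
   put 'emp','1001','info:name','zhangsan'
   put 'emp','1001','info:addr','beijing'

   # In Phoenix
   create view "emp"(
     id varchar primary key ,
     "info"."name" varchar ,
     "info"."addr" varchar
   );

   # Queries work as expected
   select * from "emp" ;
   select id , "name","addr" from "emp" ;

   # A view only supports queries; inserts and deletes are rejected, so this upsert fails
   upsert into "emp" values('1002','lisi','shanghai');

   drop view "emp";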

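3) The table already exists in HBase: create a table in Phoenix to map it.
   # When Phoenix creates a table it stores column names in an encoded form by default. If the HBase table already exists, the columns then fail to map: for example, select name is translated underneath into a lookup of an encoded qualifier, so no data comes back. The fix is to add COLUMN_ENCODED_BYTES = NONE when creating the table, which turns column encoding off so that the columns map correctly. Also watch the case of identifiers: unquoted lowercase names in Phoenix are converted to uppercase in HBase.

   create table "emp"(
     id varchar primary key ,
     "info"."name" varchar ,
     "info"."addr" varchar
   )
   COLUMN_ENCODED_BYTES = NONE;


   select * from "emp" ;
   select id , "name","addr" from "emp" ;

   drop table "emp";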

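Numeric values

When a number is stored through Phoenix, HBase shows it in a formatted form. The HBase shell displays values with toStringBinary by default, and a different converter can be specified.

# Symptoms
Written by Phoenix, read by Phoenix: no problem
Written by Phoenix, read by HBase: problem
Written by HBase, read by HBase: no problem
Written by HBase, read by Phoenix: problem

# Create the table
 create table test (
   id varchar primary key ,
   name varchar ,
   salary integer
 )
 COLUMN_ENCODED_BYTES = NONE;

# Insert through Phoenix
 upsert into test values('1001','zs',123456);
 Viewed from HBase, the number now shows up as raw binary:
 scan 'TEST',{COLUMNS=>['0:SALARY:toInt']}
 Even with the toInt converter it comes back as a negative number.
 delete from test where id = '1001';
# Insert through HBase
 put 'TEST','1002','0:NAME','ls'
 put 'TEST','1002','0:SALARY',Bytes.toBytes(456789)   # stored as a Long
 scan 'TEST',{COLUMNS=>['0:SALARY:toLong']} # looks correct in HBase
 Viewed from Phoenix, it is a negative number again.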
 
 
 -- Fix: use the unsigned type UNSIGNED_INT, or avoid mixing HBase and Phoenix access to the same data.
   create table test1 (
   id varchar primary key ,
   name varchar ,
   salary UNSIGNED_INT -- unsigned int
 )
 COLUMN_ENCODED_BYTES = NONE;
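
The negative numbers above come from the two systems laying out a signed integer differently. The sketch below illustrates this under one assumption about Phoenix internals: a Phoenix INTEGER is taken to be 4 big-endian bytes with the sign bit flipped (so that values sort correctly as raw bytes), while HBase's Bytes utility writes plain big-endian two's complement. The class and method names are made up for illustration and use only the JDK. For brevity both directions use a 4-byte int, although the HBase shell example above wrote an 8-byte Long.

```java
import java.nio.ByteBuffer;

// Illustration only: SignBitDemo is a hypothetical helper, not part of Phoenix or HBase.
class SignBitDemo {

    // Plain big-endian bytes, the same layout HBase's Bytes.toBytes(int) produces.
    static byte[] hbaseEncode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Plain big-endian decode, the same layout HBase's Bytes.toInt(byte[]) expects.
    static int hbaseDecode(byte[] b) {
        return ByteBuffer.wrap(b).getInt();
    }

    // Assumed Phoenix INTEGER layout: big-endian with the sign bit flipped.
    static byte[] phoenixEncode(int v) {
        return hbaseEncode(v ^ 0x80000000);
    }

    // Reverse of phoenixEncode: read big-endian, then flip the sign bit back.
    static int phoenixDecode(byte[] b) {
        return hbaseDecode(b) ^ 0x80000000;
    }

    public static void main(String[] args) {
        // Written by Phoenix, read by HBase: prints -2147360192
        System.out.println(hbaseDecode(phoenixEncode(123456)));
        // Written by HBase, read by Phoenix: prints -2147360192
        System.out.println(phoenixDecode(hbaseEncode(123456)));
        // Written and read by the same side: prints 123456
        System.out.println(phoenixDecode(phoenixEncode(123456)));
    }
}
```

UNSIGNED_INT sidesteps the mismatch because, for non-negative values, it uses the same byte layout as Bytes.toBytes(int), so no sign bit is flipped.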

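Phoenix API operations

Add the dependency (thin client)

<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-queryserver-client</artifactId>
    <version>5.0.0-HBase-2.0</version>
</dependency>

Write the Java code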

// Start the Query Server (queryserver.py) before running this test
package com.pihao;

import java.sql.*;
import org.apache.phoenix.queryserver.client.ThinClientUtil;

public class PhoenixTest {
    public static void main(String[] args) throws SQLException {

        // 1. Connection URL of the Query Server
        String connectionUrl = ThinClientUtil.getConnectionUrl("hadoop102", 8765);

        // 2-4. Open the connection, prepare the SQL statement, and execute it;
        //      try-with-resources closes all three when the block exits
        try (Connection connection = DriverManager.getConnection(connectionUrl);
             PreparedStatement preparedStatement = connection.prepareStatement("select * from student");
             ResultSet resultSet = preparedStatement.executeQuery()) {

            // Print every row instead of discarding the values
            while (resultSet.next()) {
                System.out.println(resultSet.getString("id") + ":"
                        + resultSet.getString("name") + ":"
                        + resultSet.getString("addr"));
            }
        }
    }
}

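Add the dependencies (thick client)

<dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-core</artifactId>
    <version>5.0.0-HBase-2.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.glassfish</groupId>
            <artifactId>javax.el</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.glassfish</groupId>
    <artifactId>javax.el</artifactId>
    <version>3.0.1-b06</version>
</dependency>

Write the Java code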

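package com.atguigu.phoenix.thin;

import java.sql.*;
import java.util.Properties;

public class TestThick {

    public static void main(String[] args) throws SQLException {
        // The thick client connects through the ZooKeeper quorum
        String url = "jdbc:phoenix:hadoop102,hadoop103,hadoop104:2181";
        Properties props = new Properties();
        // Must match the namespace-mapping setting configured on the server
        props.put("phoenix.schema.isNamespaceMappingEnabled", "true");
        // The inner double quotes must be escaped in Java. A quoted name is case-sensitive;
        // use plain test (no quotes) for the TEST table created earlier without quotes.
        try (Connection connection = DriverManager.getConnection(url, props);
             PreparedStatement ps = connection.prepareStatement("select * from \"test\"");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ":" + rs.getString(2));
            }
        }
    }
}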
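
The thin and thick clients differ mainly in the JDBC URL they connect with. The sketch below builds both forms as plain strings. The thick form matches the URL used in the code above. The thin form (jdbc:phoenix:thin:url=http://host:port;serialization=PROTOBUF) is assumed here to be what ThinClientUtil.getConnectionUrl produces by default, so verify it against your Phoenix version. PhoenixUrlDemo is a hypothetical helper that uses only the JDK.

```java
// Illustration only: PhoenixUrlDemo is a hypothetical helper, not a Phoenix class.
class PhoenixUrlDemo {

    // Thick client: points at the ZooKeeper quorum that HBase uses.
    static String thickUrl(String zkQuorum, int zkPort) {
        return "jdbc:phoenix:" + zkQuorum + ":" + zkPort;
    }

    // Thin client: points at the Query Server over HTTP (format is an assumption).
    static String thinUrl(String host, int port) {
        return "jdbc:phoenix:thin:url=http://" + host + ":" + port + ";serialization=PROTOBUF";
    }

    public static void main(String[] args) {
        System.out.println(thickUrl("hadoop102,hadoop103,hadoop104", 2181));
        System.out.println(thinUrl("hadoop102", 8765));
    }
}
```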

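Phoenix secondary indexes

Why jump straight to "secondary" indexes? Think of it this way: the rowkey is HBase's primary (first-level) index, and a secondary index is one built on columns other than the rowkey.

Secondary index configuration

Add the following configuration to hbase-site.xml on the HBase HRegionServer nodes:

<property>
    <name>hbase.regionserver.wal.codec</name>
    <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>

<property>
    <name>hbase.region.server.rpc.scheduler.factory.class</name>
    <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
    <description>Factory to create the Phoenix RPC Scheduler that uses separate queues for index and metadata updates</description>
</property>

<property>
    <name>hbase.rpc.controllerfactory.class</name>
    <value>org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory</value>
    <description>Factory to create the Phoenix RPC Scheduler that uses separate queues for index and metadata updates</description>
</property>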

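Global secondary indexes

A global secondary index means that creating the index creates a separate index table.
In the index table, the indexed column is combined with the original table's rowkey to form the index table's rowkey.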
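
As a rough illustration of that rowkey layout, the sketch below joins an indexed value and a data-table rowkey. It assumes, rather than reproduces, Phoenix's actual encoding: the two VARCHAR parts are separated here by a single zero byte. IndexRowKeyDemo and its methods are hypothetical helpers built on the JDK only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustration only: IndexRowKeyDemo is a hypothetical helper, not Phoenix code.
class IndexRowKeyDemo {

    // Index rowkey = indexed column value + 0x00 separator (assumed) + original rowkey.
    static byte[] indexRowKey(String indexedValue, String dataRowKey) {
        byte[] value = indexedValue.getBytes(StandardCharsets.UTF_8);
        byte[] key = dataRowKey.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value, 0, value.length);
        out.write(0);
        out.write(key, 0, key.length);
        return out.toByteArray();
    }

    // Render bytes roughly the way the HBase shell does: printable ASCII as-is, the rest as \xNN.
    static String printable(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (b >= 32 && b < 127) {
                sb.append((char) b);
            } else {
                sb.append(String.format("\\x%02X", b & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // name = 'pihao' indexed for the row whose id is '1001'; prints pihao\x001001
        System.out.println(printable(indexRowKey("pihao", "1001")));
    }
}
```

Because the indexed value comes first in the key, a filter on that value can be served by a range scan over the index table rather than a full scan of the data table.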

CREATE TABLE IF NOT EXISTS student(
  id VARCHAR primary key,
  name VARCHAR,
  addr VARCHAR);

upsert into student values('1001','pihao','shanghai');
upsert into student values('1002','zhangsan','shenzhen');
-- Test with explain before any index exists
explain select id from student ;   -- FULL SCAN
explain select id from student where id = '1001' ;  -- POINT LOOKUP (the primary key is unique)
explain select id from student where name = 'pihao' ; -- FULL SCAN

-- Create an index on the name column
create index idx_student_name on student(name);
explain select id from student where name = 'lixiaosi' ; -- RANGE SCAN

-- Test with explain after indexing name
explain select id ,name from student where id ='1001' ;  -- POINT LOOKUP
explain select id ,name from student where name  ='pihao' ; -- RANGE SCAN: the index is used now

-- New problem: the query also selects addr, which is not in the index
explain select id ,name ,addr  from student where name  ='pihao' ; -- FULL SCAN

-- Fix: create a composite index
drop index idx_student_name on student;
create index idx_student_addr_name on student(name,addr);

-- Tests
-- RANGE SCAN
explain select id ,name ,addr  from student where name  ='lixiaosi' ;
-- RANGE SCAN
explain select id ,name ,addr from student where name ='lixiaosi' and addr = 'beijing';
-- FULL SCAN (the leading index column, name, is missing from the filter)
explain select id ,name ,addr from student where addr = 'beijing';
-- RANGE SCAN (the optimizer reorders the predicates)
explain select id ,name ,addr from student where addr = 'beijing' and name ='lixiaosi';


-- Index the name column and include the addr column
drop index idx_student_addr_name on student;
create index idx_student_name on student(name) include(addr); -- uses the include keyword
explain select id ,name ,addr  from student where name  ='pihao' ; -- RANGE SCAN

-- Forcing index use (no longer available at the time of writing; not officially recommended)
select id ,name ,addr  from student where name  ='pihao' ;
Take this statement as an example: with only the name index in place, forcing the name index means first finding the id in the index and then using that id to look up the original table, which amounts to querying two tables.

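Local secondary indexes

-- Index only the name column
create local index idx_student_name on student(name);
-- The query also selects addr
explain select id ,name ,addr  from student where name  ='pihao' ; -- RANGE SCAN

# This approach adds the index data to the original table: the name value and the rowkey are concatenated into a new rowkey and written to HBase, and no new table is created.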

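Integrating Hive

As a database, HBase lacks strong data-analysis capabilities; integrating it with Hive fills that gap.

Integration steps

Add the ZooKeeper properties to hive-site.xml, as follows:

<property>
    <name>hive.zookeeper.quorum</name>
    <value>hadoop102,hadoop103,hadoop104</value>
</property>

<property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
</property>

Start Hive and test with a new table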

1) Create a table in Hive, which also creates the corresponding table in HBase

CREATE TABLE hive_hbase_emp_table(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");
-- Tip: afterwards, check in both Hive and HBase; the corresponding table has been created in each
-- Tip: data cannot be loaded directly (with load) into the Hive table that is associated with HBase
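
The hbase.columns.mapping string is positional: its entries line up one-to-one with the Hive columns, and :key stands for the HBase rowkey. The following sketch, using a hypothetical helper class and only the JDK, pairs the two lists the way the table definition above implies. It is an illustration, not Hive's own parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: ColumnsMappingDemo is a hypothetical helper, not part of Hive.
class ColumnsMappingDemo {

    // Pair each Hive column with the mapping entry at the same position.
    static Map<String, String> pair(String[] hiveColumns, String mapping) {
        String[] targets = mapping.split(",");
        if (targets.length != hiveColumns.length) {
            throw new IllegalArgumentException(
                    "expected " + hiveColumns.length + " mapping entries, got " + targets.length);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < hiveColumns.length; i++) {
            result.put(hiveColumns[i], targets[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] hiveColumns = {"empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno"};
        String mapping = ":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno";
        // Prints one "hive column -> HBase target" line per column, starting with empno -> :key
        pair(hiveColumns, mapping).forEach((hive, hbase) -> System.out.println(hive + " -> " + hbase));
    }
}
```

Read this way, empno becomes the HBase rowkey and every other Hive column lands in the info column family under its own qualifier.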

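# Prepare an intermediate Hive table
CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';

# Load the data file into the intermediate table (run in Hive)
load data local inpath '/opt/module/hive/data/emp.txt' into table emp;

# Write the data from the intermediate table into the HBase-backed table
insert into table hive_hbase_emp_table select * from emp;

# Query the data in both Hive and HBase
# In Hive:
select * from hive_hbase_emp_table;
# In the HBase shell:
scan 'hbase_emp_table'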

2) The table already exists in HBase, and Hive maps onto it

# Note: the statement is the same as before; the only difference is that the table must be created as EXTERNAL
CREATE EXTERNAL TABLE relevance_hbase_emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,info:ename,info:job,info:mgr,info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "hbase_emp_table");
