6.1.7. HBase

1. Purpose of filters

A filter checks on the server side whether data satisfies a condition, and returns only the matching data to the client.

2. Bloom filter

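Overview: a Bloom filter tests whether an element belongs to a set. Its strengths are space efficiency and query time far better than general-purpose approaches; its weaknesses are a nonzero false-positive rate and the difficulty of deleting elements.
Client asks: is this row key present?
Server answers: "definitely not" or "don't know" (possibly present).
In short: small memory footprint, a probabilistic answer, fast lookups.
How it works:
A Bloom filter uses k independent hash functions. Every hash function maps each element of the set to a position in a bit array (say, positions 1 to 100), and the bit at each of those positions is set to 1.
For example, if a hash function maps element x1 to 8, bit 8 of the bit array is set to 1. Every key is hashed this way by all k functions.
Lookup: hash the row key with the same k functions and check the corresponding bits. If any bit is 0, the row key is definitely absent; if all bits are 1, the answer is "don't know" (the row key may be present).
Parameters: the number of hash functions and the size of the bit array.
A large HFile needs a large bit array, so one HFile can hold several bit arrays (Bloom blocks); a Bloom index block records which Bloom block covers a given row key.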
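
The add and lookup steps can be sketched in plain Java. This is a minimal illustration, not HBase's implementation: the class name `BloomFilterSketch`, the bit-array size of 1024, the choice of 3 hash functions, and the double-hashing scheme that derives the k positions are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter sketch. Sizes and the hashing scheme are illustrative assumptions.
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a, used here only as a second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Set the k bits that the row key hashes to.
    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // false: definitely absent. true: possibly present ("don't know").
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch bloom = new BloomFilterSketch(1024, 3);
        bloom.add("1500100001");
        bloom.add("1500100002");
        // Added keys are never reported as absent.
        System.out.println(bloom.mightContain("1500100001"));
        // Usually false for a key that was never added, but a false positive is possible.
        System.out.println(bloom.mightContain("1500109999"));
    }
}
```

A larger bit array lowers the false-positive rate at the cost of memory, which is the trade-off behind the two parameters listed above.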

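2. There are many filter types, but they fall into two broad categories

Comparison filters: can be applied to the row key, column family, column qualifier, or cell value.
Dedicated filters: each one serves a single, specific purpose.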

3. Comparison operators

LESS  <

LESS_OR_EQUAL <=

EQUAL =

NOT_EQUAL <>

GREATER_OR_EQUAL >=

GREATER >

NO_OP excludes everything
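
To make the operators concrete, here is a small plain-Java sketch of how an operator could be applied to the result of a byte-wise comparison. It is a model of the semantics only: the names `CompareOpSketch`, `keep`, and `compareBytes` are hypothetical, and the sketch assumes the comparison reads as "stored value OP reference value".

```java
import java.nio.charset.StandardCharsets;

// Sketch of operator semantics over a lexicographic byte comparison (illustrative names).
class CompareOpSketch {
    enum Op { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER, NO_OP }

    // Lexicographic comparison treating each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // true if "stored OP reference" holds, i.e. the data passes the filter.
    static boolean keep(Op op, byte[] stored, byte[] reference) {
        int cmp = compareBytes(stored, reference);
        switch (op) {
            case LESS:             return cmp < 0;
            case LESS_OR_EQUAL:    return cmp <= 0;
            case EQUAL:            return cmp == 0;
            case NOT_EQUAL:        return cmp != 0;
            case GREATER_OR_EQUAL: return cmp >= 0;
            case GREATER:          return cmp > 0;
            default:               return false; // NO_OP: nothing passes
        }
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] reference = utf8("1500100500");
        System.out.println(keep(Op.LESS, utf8("1500100031"), reference));    // true
        System.out.println(keep(Op.GREATER, utf8("1500100031"), reference)); // false
        System.out.println(keep(Op.NO_OP, utf8("1500100031"), reference));   // false
    }
}
```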

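4. Common filters

A comparison filter pairs one comparator with one target (row key, column family, column qualifier, or value), which gives 6 × 4 = 24 possible combinations.
The six comparators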

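BinaryComparator: compares the given byte array lexicographically, byte by byte, using Bytes.compareTo(byte[])
BinaryPrefixComparator: same as BinaryComparator, but compares only the leading (leftmost) bytes, up to the length of the given prefix
NullComparator: checks whether the given value is null
BitComparator: compares bit by bit
RegexStringComparator: matches the value against a regular expression; supports only EQUAL and NOT_EQUAL
SubstringComparator: checks whether the given substring occurs in the value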
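
The matching rules of the prefix, substring, and regex comparators can be sketched in plain Java. The class and method names below are hypothetical, and two behaviours are assumptions about how the HBase comparators work rather than facts taken from this section: that substring matching ignores case, and that the regex only has to match somewhere in the value (so anchors such as `^` and `$` matter).

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Pattern;

// Plain-Java model of three comparator matching rules (illustrative, not the HBase classes).
class ComparatorSketch {
    // BinaryPrefixComparator-style check: the value starts with the given bytes.
    static boolean hasPrefix(byte[] value, byte[] prefix) {
        if (value.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (value[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // SubstringComparator-style check; case-insensitivity is an assumption.
    static boolean containsSubstring(String value, String substr) {
        return value.toLowerCase(Locale.ROOT).contains(substr.toLowerCase(Locale.ROOT));
    }

    // RegexStringComparator-style check; "find" rather than full match is an assumption.
    static boolean regexFinds(String value, String regex) {
        return Pattern.compile(regex).matcher(value).find();
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hasPrefix(utf8("1500100101"), utf8("150010010"))); // true
        System.out.println(containsSubstring("clazz", "a"));                  // true
        System.out.println(containsSubstring("gender", "a"));                 // false
        System.out.println(regexFinds("1500100101", "^150010010"));           // true
    }
}
```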

Dedicated filters: used on their own, or together with the six comparators

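Single-column value filter: SingleColumnValueFilter
----Returns every cell of each row whose tested cell satisfies the condition (that is, the whole row)
Single-column value exclude filter: SingleColumnValueExcludeFilter----selects rows the same way as SingleColumnValueFilter, but leaves the tested column out of the result and returns all other columns
Row key prefix filter: PrefixFilter
----For example, use PrefixFilter to find all rows whose row key starts with 150010008
Page filter: PageFilter
----For example, use PageFilter to fetch the third page at 10 rows per page. Paging with PageFilter is inefficient: every request has to scan the preceding rows until it reaches the rows it needs
Combined query with multiple filters (FilterList)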

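Filters by target:

Row key filter: RowFilter----filters on the row key; for a Scan, setting startRow/stopRow is generally the better choice
Column family filter: FamilyFilter----filters on the column family (usually done by selecting column families on the Scan instead of using this filter directly)
Column filter: QualifierFilter----filters on the column name (qualifier)
Value filter: ValueFilter----relatively inefficient, because it requires a full table scan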

5. Examples

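Regex comparator with the value filter: the value filter returns only the matching column

//Regex comparator with a value filter: students in liberal-arts (文科) classes
//Only the clazz column is returned; cells of other columns do not match and are dropped
@Test
public void getRegQua() throws Exception {
    Scan scan = new Scan();
    //Regex comparator that will be applied to cell values
    RegexStringComparator regexStringComparator = new RegexStringComparator("^文科.*班$");
    //Value filter built from a comparison operator and the comparator
    ValueFilter valueFilter = new ValueFilter(CompareFilter.CompareOp.EQUAL, regexStringComparator);
    //Attach the value filter to the scan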
    scan.setFilter(valueFilter);
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        byte[] value = result.getValue("info".getBytes(), "clazz".getBytes());
        String string = Bytes.toString(value);
        System.out.println(string);
    }
}

Single-column value filter: filtering rows on one column's value

    //Single-column value filter: students in liberal-arts (文科) classes
    @Test
    public void getSinage()throws Exception{
        Scan scan = new Scan();

        //Create a single-column value filter
        //This form matches only the exact value 文科一班
//        SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter
//                ("info".getBytes(), "clazz".getBytes(), CompareFilter.CompareOp.EQUAL, "文科一班".getBytes());

        //A regex comparator can be passed instead, combining the dedicated filter with a comparator
        SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter
               ("info".getBytes(), "clazz".getBytes(), CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^文科.*班$"));

        //Attach the filter to the scan; the single-column value filter is used on its own
        scan.setFilter(singleColumnValueFilter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            String row = Bytes.toString(result.getRow());
            System.out.print(row+"\t");
            //The whole row comes back, so print every cell
            List<Cell> cells = result.listCells();
            for (Cell cell : cells) {
                String value = Bytes.toString(CellUtil.cloneValue(cell));
                System.out.print(value+"\t");
            }
            System.out.println();
        }
    }

SubstringComparator: filtering on the column name

//SubstringComparator with a qualifier filter
//Return the cells of columns whose name contains "a" (age, name, clazz)
@Test
public void getSub()throws Exception{
    Scan scan = new Scan();
    //Create a substring comparator
    SubstringComparator sub = new SubstringComparator("a");
    //Qualifier filter built from a comparison operator and the comparator
    QualifierFilter qualifierFilter = new QualifierFilter(CompareFilter.CompareOp.EQUAL,sub);
    //Attach the qualifier filter to the scan
    scan.setFilter(qualifierFilter);
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        String row = Bytes.toString(result.getRow());
        System.out.print(row+"\t");
        List<Cell> cells = result.listCells();
        for (Cell cell : cells) {
            String value = Bytes.toString(CellUtil.cloneValue(cell));
            System.out.print(value+"\t");
        }
        System.out.println();
    }
}

BinaryPrefixComparator with RowFilter: returns whole rows

//Using BinaryPrefixComparator
//Select the students whose id starts with 150010010
@Test
public void Binary()throws Exception{
    BinaryPrefixComparator binar = new BinaryPrefixComparator("150010010".getBytes());
    RowFilter rowFilter = new RowFilter(CompareFilter.CompareOp.EQUAL, binar);
    printRes(rowFilter);	//helper method defined elsewhere; prints the result set
}

PrefixFilter (dedicated filter)

//The same query with the dedicated PrefixFilter
//Select the students whose id starts with 150010010
@Test
public void getPri()throws Exception{
    PrefixFilter prefixFilter = new PrefixFilter("150010010".getBytes());
    printRes(prefixFilter);
}
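
Since a start/stop row range is generally the better choice for row key selection, a prefix query can also be expressed as a range. The sketch below computes the exclusive stop row for a prefix by incrementing its last byte; the class and method names are hypothetical, and it illustrates the idea only, not code from the HBase client.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Turn a row key prefix into an exclusive stop row (illustrative helper).
class PrefixRangeSketch {
    // Returns the smallest key greater than every key that starts with the prefix.
    // An empty array means "no upper bound" (the prefix was empty or all 0xFF bytes).
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented; drop them.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] prefix = "150010010".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(prefix);
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // 150010011
    }
}
```

With these two values, the scan would use the prefix as its start row and the computed key as its stop row (for example via scan.withStartRow and scan.withStopRow), so the server only reads the matching key range.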

PageFilter
Very inefficient: every request has to scan the rows of the preceding pages

//Page filter
@Test
public void getLimit() throws Exception {
    int pageSize = 10;
    int page = 4;
    //1-based position of the first row of this page
    int current_first_rk = (page - 1) * pageSize + 1;
    Scan scan = new Scan();
    //Note: PageFilter is applied per region server, so on a multi-region table
    //this scan can return more rows than requested
    PageFilter pageFilter = new PageFilter(current_first_rk);
    scan.setFilter(pageFilter);
    ResultScanner scanner = table.getScanner(scan);
    byte[] start_rowkey = null;
    for (Result result : scanner) {
        //After the loop this holds the last row scanned, i.e. the first row key of this page
        start_rowkey = result.getRow();
    }
    //Fetch page 4
    Scan scan1 = new Scan();
    //Start at the first row key of the page
    scan1.withStartRow(start_rowkey);

    //Cap the scan at one page of rows
    //(alternative: scan1.setFilter(new PageFilter(pageSize));)
    scan1.setLimit(pageSize);

    ResultScanner scanner1 = table.getScanner(scan1);
    for (Result result : scanner1) {
        String row = Bytes.toString(result.getRow());
        System.out.print(row+"\t");
        List<Cell> cells = result.listCells();
        for (Cell cell : cells) {
            byte[] bytes = CellUtil.cloneValue(cell);
            String value = Bytes.toString(bytes);
            System.out.print(value+"\t");
        }
        System.out.println();
    }
}
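
When pages are read one after another, one way to avoid rescanning earlier pages is to remember the last row key of the page just shown and start the next scan immediately after it. This is a different technique from the PageFilter approach above, and it assumes the client can keep that last row key between requests. The helper name `startAfter` is hypothetical; it relies only on the fact that appending a zero byte yields the smallest key strictly greater than the original in byte order.

```java
import java.util.Arrays;

// Compute the start row for the next page from the last row key of the current page.
class NextPageStartSketch {
    // lastRow + 0x00 is the smallest byte string strictly greater than lastRow.
    static byte[] startAfter(byte[] lastRow) {
        return Arrays.copyOf(lastRow, lastRow.length + 1);
    }

    public static void main(String[] args) {
        byte[] lastRowOfPage = {0x31, 0x35};
        // Prints the original bytes followed by a trailing zero byte.
        System.out.println(Arrays.toString(startAfter(lastRowOfPage)));
    }
}
```

The next page would then be read with this value as the start row and a row limit of one page, so each request touches only the rows it returns.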

Paging through a well-designed row key

//Paging this way depends on a well-designed row key
@Test
public void getLimitt()throws Exception{
    int baseId=1500100000;
    int pageSize=10;
    int page =4;
    //First row of page 4
    int current_first=(page-1)*pageSize+1+baseId;
    //Convert to String
    String currend_f = String.valueOf(current_first);
    //Exclusive stop row: one past the last row of page 4
    int current_end=current_first+pageSize;
    //Convert to String
    String currend_e = String.valueOf(current_end);
    Scan scan = new Scan();
    scan.withStartRow(currend_f.getBytes());
    scan.withStopRow(currend_e.getBytes());
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        String row = Bytes.toString(result.getRow());
        System.out.print(row + "\t");
        List<Cell> cells = result.listCells();
        for (Cell cell : cells) {
            String value = Bytes.toString(CellUtil.cloneValue(cell));
            System.out.print(value + "\t");
        }
        System.out.println();
    }
}
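
The row key arithmetic in this example can be checked on its own. The sketch below isolates it in a hypothetical `pageRange` helper; it assumes, as the example does, that row keys are consecutive numeric ids rendered as strings with the same number of digits, so that numeric order and byte order agree.

```java
// Compute the [start, stop) row key range for one page of consecutive numeric ids.
class PageRangeSketch {
    // Returns {startRow (inclusive), stopRow (exclusive)} as strings.
    static String[] pageRange(long baseId, int page, int pageSize) {
        long first = baseId + (long) (page - 1) * pageSize + 1;
        long stop = first + pageSize;
        return new String[] { String.valueOf(first), String.valueOf(stop) };
    }

    public static void main(String[] args) {
        String[] range = pageRange(1500100000L, 4, 10);
        System.out.println(range[0]); // 1500100031
        System.out.println(range[1]); // 1500100041
    }
}
```

Because the stop row is exclusive, page 4 covers ids 1500100031 through 1500100040, which is exactly 10 rows when no ids are missing.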

Combined query with multiple filters

//Requirement: students with age > 23 and gender = '男' (male) whose row key starts with 150010010
@Test
public void getList() throws Exception{
    //Values are compared as bytes, so "23" here is the string form of the age
    SingleColumnValueFilter single1 = new SingleColumnValueFilter("info".getBytes(), "age".getBytes(), CompareFilter.CompareOp.GREATER, "23".getBytes());
    SingleColumnValueFilter single2 = new SingleColumnValueFilter("info".getBytes(), "gender".getBytes(), CompareFilter.CompareOp.EQUAL, "男".getBytes());
    PrefixFilter prefixFilter = new PrefixFilter("150010010".getBytes());
    //A FilterList created without arguments requires every filter to pass (logical AND)
    FilterList filterList = new FilterList();
    filterList.addFilter(single1);
    filterList.addFilter(single2);
    filterList.addFilter(prefixFilter);
    printRes(filterList);
}
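
One caveat with the age condition: the filter compares bytes, not numbers. If ages are stored as plain decimal strings, the comparison is lexicographic, so a one-digit age such as 9 sorts after 23. The sketch below shows the effect in plain Java and one possible remedy, zero-padding to a fixed width; the helper names are hypothetical, and the padding only helps if the stored values are written with the same width.

```java
import java.nio.charset.StandardCharsets;

// Why byte-wise comparison of decimal strings can disagree with numeric order.
class StringNumberCompareSketch {
    // Lexicographic comparison treating each byte as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Zero-pad a non-negative number to a fixed width, e.g. pad(9, 3) gives "009".
    static String pad(int value, int width) {
        return String.format("%0" + width + "d", value);
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "9" sorts after "23" byte-wise, so age 9 would pass "age > 23".
        System.out.println(compareBytes(utf8("9"), utf8("23")) > 0);             // true
        // With fixed-width values, byte order matches numeric order.
        System.out.println(compareBytes(utf8(pad(9, 3)), utf8(pad(23, 3))) < 0); // true
    }
}
```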
