Example code for reading and writing Parquet data in Java

This article shows how to read and write Parquet-format data in Java. The full example follows:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ReadParquet {
  static Logger logger = Logger.getLogger(ReadParquet.class);

  public static void main(String[] args) throws Exception {
    // Uncomment to generate the Parquet file from a local text file first.
    // parquetWriter("test\\parquet-out2", "input.txt");
    parquetReaderV2("test\\parquet-out2");
  }
  
  
  static void parquetReaderV2(String inPath) throws Exception {
    GroupReadSupport readSupport = new GroupReadSupport();
    Builder<Group> builder = ParquetReader.builder(readSupport, new Path(inPath));
    ParquetReader<Group> reader = builder.build();
    Group line = null;
    while ((line = reader.read()) != null) {
      // "time" is a nested group; take its first occurrence.
      Group time = line.getGroup("time", 0);
      // Fields can be fetched either by name or by index.
      System.out.println(line.getString("city", 0) + "\t" +
          line.getString("ip", 0) + "\t" +
          time.getInteger("ttl", 0) + "\t" +
          time.getString("ttl2", 0) + "\t");
      // System.out.println(line.toString());
    }
    reader.close();
    System.out.println("Finished reading");
  }
  // In newer versions all of the new ParquetReader(...) constructors appear to be deprecated;
  // prefer the builder used in parquetReaderV2 above.
  static void parquetReader(String inPath) throws Exception {
    GroupReadSupport readSupport = new GroupReadSupport();
    ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath), readSupport);
    Group line = null;
    while ((line = reader.read()) != null) {
      System.out.println(line.toString());
    }
    reader.close();
    System.out.println("Finished reading");
  }
  
  static void parquetWriter(String outPath, String inPath) throws IOException {
    MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
        " required binary city (UTF8);\n" +
        " required binary ip (UTF8);\n" +
        " repeated group time {\n" +
        "  required int32 ttl;\n" +
        "  required binary ttl2;\n" +
        " }\n" +
        "}");
    GroupFactory factory = new SimpleGroupFactory(schema);
    Path path = new Path(outPath);
    Configuration configuration = new Configuration();
    GroupWriteSupport writeSupport = new GroupWriteSupport();
    // setSchema is static; it stores the schema in the configuration for the write support to pick up.
    GroupWriteSupport.setSchema(schema, configuration);
    ParquetWriter<Group> writer = new ParquetWriter<Group>(path, configuration, writeSupport);
    // Read a local text file and use its lines to build the Parquet file.
    BufferedReader br = new BufferedReader(new FileReader(new File(inPath)));
    String line = "";
    Random r = new Random();
    while ((line = br.readLine()) != null) {
      // Each line is expected to hold two whitespace-separated fields: city and ip.
      String[] strs = line.split("\\s+");
      if (strs.length == 2) {
        Group group = factory.newGroup()
            .append("city", strs[0])
            .append("ip", strs[1]);
        Group tmpG = group.addGroup("time");
        tmpG.append("ttl", r.nextInt(9) + 1);
        tmpG.append("ttl2", r.nextInt(9) + "_a");
        writer.write(group);
      }
    }
    br.close();
    System.out.println("write end");
    writer.close();
  }
}
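
The writer above expects each line of the input file to hold exactly two whitespace-separated fields, city and IP, and silently skips any other line. The following is a minimal standard-library sketch of just that parsing step. The class name InputLineDemo and the sample lines are made up for illustration; they are not the contents of the original input.txt.

```java
public class InputLineDemo {
    // Returns {city, ip} when the line has exactly two whitespace-separated
    // fields, otherwise null (the writer skips such lines).
    static String[] parse(String line) {
        String[] parts = line.split("\\s+");
        return parts.length == 2 ? parts : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; spaces and tabs both count as separators.
        String[] lines = {"beijing 10.0.0.1", "shanghai\t10.0.0.2", "malformed"};
        for (String l : lines) {
            String[] f = parse(l);
            System.out.println(f == null ? "skipped: " + l : f[0] + " -> " + f[1]);
        }
    }
}
```

One thing to keep in mind: a line with leading whitespace produces an empty first element from split, so it has three parts and would be skipped as well. Calling trim() on the line first avoids that.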

A note on the schema: writing Parquet data requires a schema, whereas on read the schema is detected "automatically" from the file.



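repeated and required differ in more than how many times a field may occur; the data type produced after serialization differs as well. For example, in the author's output the value containing ttl2 printed as WrappedArray([7,7_a]) under repeated, but as [7,7_a] under required. Besides parsing a schema string with the MessageTypeParser.parseMessageType method, a MessageType can also be built as follows: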

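(Watch out for a pitfall here, which shows up in Spark: for ttl2, as(OriginalType.UTF8) has the same effect as the (UTF8) annotation in required binary city (UTF8). With UTF8 the column can be read as StringType; without it the read fails with [B cannot be cast to java.lang.String.)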


    
// import org.apache.parquet.schema.OriginalType;
// import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
// import org.apache.parquet.schema.Types;
MessageType schema = Types.buildMessage()
    .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("city")
    .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ip")
    .repeatedGroup()
        .required(PrimitiveTypeName.INT32).named("ttl")
        .required(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("ttl2")
    .named("time")
    .named("Pair");

Resolving the [B cannot be cast to java.lang.String exception:

1. Either add the UTF8 annotation when generating the Parquet file,
2. or supply a matching schema at read time that declares the field's type.
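
For background on the message itself: [B is the JVM's internal name for byte[], so the exception means that raw bytes from a binary column without the UTF8 annotation were cast to String instead of being decoded. The standard-library sketch below reproduces that failed cast and shows explicit decoding. The class name BinaryCastDemo and the sample value are illustrative only, and the code does not touch Parquet or Spark.

```java
import java.nio.charset.StandardCharsets;

public class BinaryCastDemo {
    // Decode raw bytes explicitly instead of casting them.
    static String decode(Object raw) {
        return new String((byte[]) raw, StandardCharsets.UTF_8);
    }

    // Returns true when casting the value to String throws ClassCastException.
    static boolean castFails(Object raw) {
        try {
            String s = (String) raw;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the raw value of an unannotated binary column.
        Object raw = "7_a".getBytes(StandardCharsets.UTF_8);
        System.out.println(byte[].class.getName()); // the JVM name of byte[]
        System.out.println(castFails(raw));
        System.out.println(decode(raw));
    }
}
```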

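Maven dependency (I used 1.7):

<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.7.0</version>
</dependency>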
