First, a look at what an ORC file is and how it is formatted; then we implement the read operations.
1. The ORC file format

ORC stands for Optimized Row Columnar. Using the ORC file format improves Hive's ability to read, write, and process data.

ORC is an improvement on RCFile, so compared with RCFile it has the following advantages:

- ORC's type-aware serialization and deserialization let the ORC file writer write data out according to its type.
- It provides several kinds of indexes that RCFile lacks; these let an ORC reader jump straight to the data it needs and skip irrelevant data, so data in an ORC file can be accessed quickly.
- Because the writer is type-aware, ORC can support complex data structures (such as Map).
- Beyond these three design-level advantages, the implementation adds some others: ORC's default stripe size is larger, and the ORC writer is given a memory manager to track memory usage.

2. Reading ORC files in Java
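Before diving into the read implementations, a quick sanity check is sometimes useful: per the ORC format, every ORC file begins with the 3-byte magic "ORC". A minimal pure-JDK sketch (the class and the `looksLikeOrc` helper are my own names, not part of any ORC library):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class OrcMagic {
    // An ORC file starts with the 3-byte header magic "ORC".
    static boolean looksLikeOrc(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() < 3) {
                return false;
            }
            byte[] head = new byte[3];
            raf.readFully(head);
            return new String(head, StandardCharsets.US_ASCII).equals("ORC");
        }
    }

    public static void main(String[] args) throws IOException {
        // A tiny fake file that only carries the header magic, just to exercise the check.
        Path p = Files.createTempFile("demo", ".orc");
        Files.write(p, "ORCdummy".getBytes(StandardCharsets.US_ASCII));
        System.out.println(looksLikeOrc(p)); // true
        Files.delete(p);
    }
}
```

This only inspects the header; it does not validate the file tail or postscript, so it is a cheap pre-check rather than real validation.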
2.1 Reading via OrcSerde
public void readOrcFile(String fileName) throws SerDeException, IOException {
    Configuration hadoopConf = new Configuration();
    JobConf conf = new JobConf(hadoopConf);
    Path orcFilePath = new Path(fileName);
    Properties p = new Properties();
    p.setProperty("columns", "url,word,freq,weight");
    p.setProperty("columns.types", "string:string:string:string");
    OrcSerde serde = new OrcSerde();
    serde.initialize(conf, p);
    StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector();
    InputFormat in = new OrcInputFormat();
    FileInputFormat.setInputPaths(conf, orcFilePath);
    // getSplits does not work when run locally on Windows
    InputSplit[] splits = in.getSplits(conf, 1);
    System.out.println("splits.length==" + splits.length);
    RecordReader reader = in.getRecordReader(splits[0], conf, Reporter.NULL);
    List<? extends StructField> fields = inspector.getAllStructFieldRefs();
    System.out.println(fields.size());
    Object key = reader.createKey();
    Object value = reader.createValue();
    long count = 0L;
    while (reader.next(key, value)) {
        count++;
    }
    System.out.println("row count=" + count);
    reader.close();
}
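The `columns` and `columns.types` properties above are what OrcSerde uses to build its row object inspector. For intuition, the equivalent Hive struct type string can be assembled from those two properties like this (a standalone sketch; `structTypeString` is a hypothetical helper, not part of the Hive API):

```java
import java.util.Properties;

public class SchemaString {
    // Join the comma-separated column names with the colon-separated
    // column types into Hive's struct<name:type,...> notation.
    static String structTypeString(Properties p) {
        String[] names = p.getProperty("columns").split(",");
        String[] types = p.getProperty("columns.types").split(":");
        StringBuilder sb = new StringBuilder("struct<");
        for (int i = 0; i < names.length; i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(names[i]).append(":").append(types[i]);
        }
        return sb.append(">").toString();
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("columns", "url,word,freq,weight");
        p.setProperty("columns.types", "string:string:string:string");
        System.out.println(structTypeString(p));
        // struct<url:string,word:string,freq:string,weight:string>
    }
}
```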
2.2 Reading via OrcSerde, variant 2: extracting field values
public void readOrc2(String fileName) throws Exception {
    JobConf conf = new JobConf();
    Path testFilePath = new Path(fileName);
    Properties p = new Properties();
    OrcSerde serde = new OrcSerde();
    p.setProperty("columns", "url,word,freq,weight");
    p.setProperty("columns.types", "string:string:string:string");
    serde.initialize(conf, p);
    StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector();
    InputFormat in = new OrcInputFormat();
    FileInputFormat.setInputPaths(conf, testFilePath.toString());
    InputSplit[] splits = in.getSplits(conf, 1);
    System.out.println("splits.length==" + splits.length);
    // only materialize the column with id 1 (column pruning)
    conf.set("hive.io.file.readcolumn.ids", "1");
    RecordReader reader = in.getRecordReader(splits[0], conf, Reporter.NULL);
    Object key = reader.createKey();
    Object value = reader.createValue();
    List<? extends StructField> fields = inspector.getAllStructFieldRefs();
    long offset = reader.getPos();
    while (reader.next(key, value)) {
        Object url = inspector.getStructFieldData(value, fields.get(0));
        Object word = inspector.getStructFieldData(value, fields.get(1));
        Object freq = inspector.getStructFieldData(value, fields.get(2));
        Object weight = inspector.getStructFieldData(value, fields.get(3));
        offset = reader.getPos();
        System.out.println(url + "|" + word + "|" + freq + "|" + weight);
    }
    reader.close();
}
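The `hive.io.file.readcolumn.ids` setting above prunes the read down to the listed zero-based column ids (here only column 1, i.e. `word`). Deriving that id string from column names can be sketched like this (`readColumnIds` is a hypothetical helper of mine, not a Hive API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnPruning {
    // Map the wanted column names to their zero-based positions in the
    // full column list, producing the comma-separated id string that
    // hive.io.file.readcolumn.ids expects.
    static String readColumnIds(String allColumns, String... wanted) {
        List<String> all = Arrays.asList(allColumns.split(","));
        List<String> ids = new ArrayList<>();
        for (String w : wanted) {
            int idx = all.indexOf(w);
            if (idx < 0) {
                throw new IllegalArgumentException("unknown column: " + w);
            }
            ids.add(String.valueOf(idx));
        }
        return String.join(",", ids);
    }

    public static void main(String[] args) {
        System.out.println(readColumnIds("url,word,freq,weight", "word"));        // 1
        System.out.println(readColumnIds("url,word,freq,weight", "url", "freq")); // 0,2
    }
}
```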
3. Reading via the OrcFile API
public void readOrc(String INPUT) throws IOException {
    Configuration conf = new Configuration();
    Path file_in = new Path(INPUT);
    Reader reader = OrcFile.createReader(FileSystem.getLocal(conf), file_in);
    TypeDescription schema = reader.getSchema(); // the ORC file's schema
    System.out.println(schema.toJson());
    System.out.println(schema.getCategory());
    List<String> fieldNames = schema.getFieldNames();
    System.out.println(fieldNames.get(1));
    System.out.println(schema.toString());
    System.out.println("--------------------------------");
    StructObjectInspector inspector = (StructObjectInspector) reader.getObjectInspector();
    org.apache.hadoop.hive.ql.io.orc.RecordReader records = reader.rows();
    StructField structFieldRef = inspector.getStructFieldRef("name");
    System.out.println(structFieldRef == null);
    System.out.println(structFieldRef.getFieldID());
    System.out.println(structFieldRef.getFieldName());
    Object row = null;
    long count = 0L;
    while (records.hasNext()) {
        row = records.next(row);
        System.out.println(row.toString());
        count++;
        List<Object> value_lst = inspector.getStructFieldsDataAsList(row);
        for (int i = 0; i < value_lst.size(); i++) {
            System.out.println(value_lst.get(i));
        }
    }
    System.out.println("--------total lines=" + count);
}
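`getSchema().getFieldNames()` above returns the top-level column names. To get a feel for what the schema's string form encodes, here is a deliberately simplified parser over a flat struct string (real ORC schemas can nest structs, maps, and lists, which this sketch ignores; the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;

public class FlatSchemaParser {
    // Extract top-level field names from a flat struct<name:type,...> string.
    // Nested types inside a field are not handled here.
    static List<String> fieldNames(String schema) {
        if (!schema.startsWith("struct<") || !schema.endsWith(">")) {
            throw new IllegalArgumentException("not a struct schema: " + schema);
        }
        String body = schema.substring("struct<".length(), schema.length() - 1);
        List<String> names = new ArrayList<>();
        for (String field : body.split(",")) {
            names.add(field.split(":")[0]);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames("struct<url:string,word:string,freq:string,weight:string>"));
        // [url, word, freq, weight]
    }
}
```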
Test with the following main method:
public static void main(String[] args) throws Exception {
    String str = "D:\\myworkspace\\idea\\workspace\\myPs\\new\\mytest\\1.orc";
    // new mytest().readOrc(str);
    new mytest().readOrc2(str);
}



