
Apache ORC Deep Dive (Part 2)

In the previous article we explored Apache ORC's history, its current Hadoop compatibility, and the data types it supports. Today we look at how to actually use Apache ORC.

3. Using ORC in Hive

Hive arguably offers the most complete ORC integration of any engine. Let's look at how to use ORC in Hive, along with the relevant configuration.

Syntax in Hive

In Hive, for a new table you only need to append "STORED AS ORC" to the CREATE TABLE statement. For example, the table istari below:

CREATE TABLE istari (
  name STRING,
  color STRING
) STORED AS ORC;

To convert an existing table or table partition to ORC, use ALTER TABLE:

ALTER TABLE istari SET FILEFORMAT ORC;

Additionally, starting with Hive 0.14, you can manually merge small ORC files with the CONCATENATE command. The merge happens at the stripe level, so the files do not need to be re-serialized. Syntax:

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

To inspect an ORC file's metadata, use Hive's orcfiledump command:

% hive --orcfiledump 

Starting with Hive 1.1, the command also accepts a -d option, which dumps the file's data:

% hive --orcfiledump -d 

Related configuration in Hive

In Hive, several table-level and system-level properties affect ORC behavior; they are introduced below in table form.

Table properties

Tables stored as ORC files use a number of ORC table properties to control their storage behavior. By defining or modifying these table properties, you can ensure that all clients store data with the same settings.

| Property | Default | Notes |
|----------|---------|-------|
| orc.compress | ZLIB | high-level compression = {NONE, ZLIB, SNAPPY} |
| orc.compress.size | 262,144 | compression chunk size |
| orc.stripe.size | 67,108,864 | memory buffer in bytes for writing |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | create indexes? |
| orc.bloom.filter.columns | "" | comma-separated list of column names |
| orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate |

The following example creates an ORC table without high-level compression:

CREATE TABLE istari (
  name STRING,
  color STRING
) STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");

Configuration properties

Hive's configuration also includes many settings related to the ORC file format, summarized in the table below:

| Property | Default | Notes |
|----------|---------|-------|
| hive.default.fileformat | TextFile | This is the default file format for new tables. If it is set to ORC, new tables will default to ORC. |
| hive.stats.gather.num.threads | 10 | Number of threads used by partialscan/noscan analyze command for partitioned tables. This is applicable only for file formats that implement the StatsProvidingRecordReader interface (like ORC). |
| hive.exec.orc.memory.pool | 0.5 | Maximum fraction of heap that can be used by ORC file writers. |
| hive.exec.orc.write.format | NULL | Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the latest version. |
| hive.exec.orc.default.stripe.size | 67,108,864 | Define the default size of ORC writer buffers in bytes. |
| hive.exec.orc.default.block.size | 268,435,456 | Define the default file system block size for ORC files. |
| hive.exec.orc.dictionary.key.size.threshold | 0.8 | If the number of keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1.0 to always use dictionary encoding. |
| hive.exec.orc.default.row.index.stride | 10,000 | Define the default number of rows between row index entries. |
| hive.exec.orc.default.buffer.size | 262,144 | Define the default ORC buffer size, in bytes. |
| hive.exec.orc.default.block.padding | true | Should ORC file writers pad stripes to minimize stripes that cross HDFS block boundaries. |
| hive.exec.orc.block.padding.tolerance | 0.05 | Define the tolerance for block padding as a decimal fraction of stripe size (for example, the default value 0.05 is 5% of the stripe size). For the defaults of 64Mb ORC stripe and 256Mb HDFS blocks, a maximum of 3.2Mb will be reserved for padding within the 256Mb block with the default hive.exec.orc.block.padding.tolerance. In that case, if the available size within the block is more than 3.2Mb, a new smaller stripe will be inserted to fit within that space. This will make sure that no stripe written will cross block boundaries and cause remote reads within a node local task. |
| hive.exec.orc.default.compress | ZLIB | Define the default compression codec for ORC file. |
| hive.exec.orc.encoding.strategy | SPEED | Define the encoding strategy to use while writing data. Changing this will only affect the lightweight encoding for integers. This flag will not change the compression level of higher level compression codecs (like ZLIB). Possible options are SPEED and COMPRESSION. |
| hive.orc.splits.include.file.footer | false | If turned on, splits generated by ORC will include metadata about the stripes in the file. This data is read remotely (from the client or HiveServer2 machine) and sent to all the tasks. |
| hive.orc.cache.stripe.details.size | 10,000 | Cache size for keeping meta information about ORC splits cached in the client. |
| hive.orc.compute.splits.num.threads | 10 | How many threads ORC should use to create splits in parallel. |
| hive.exec.orc.skip.corrupt.data | false | If ORC reader encounters corrupt data, this value will be used to determine whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception. |
| hive.exec.orc.zerocopy | false | Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.) |
| hive.merge.orcfile.stripe.level | true | When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files. Note that enabling this configuration property will not honor the padding tolerance configuration (hive.exec.orc.block.padding.tolerance). |
| hive.orc.row.index.stride.dictionary.check | true | If enabled, the dictionary check will happen after the first row index stride (default 10,000 rows); otherwise the dictionary check will happen before writing the first stripe. In both cases, the decision to use a dictionary or not will be retained thereafter. |
| hive.exec.orc.compression.strategy | SPEED | Define the compression strategy to use while writing data. This changes the compression level of higher level compression codecs. Value can be SPEED or COMPRESSION. |
| orc.write.variable.length.blocks | false | Should the ORC writer use HDFS variable length blocks, if they are available? If the new stripe would straddle a block, Hadoop is ≥ 2.7, and this is enabled, it will end the block before the new stripe. |
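
Most of these properties can also be overridden per session with SET before writing data, so you can experiment without touching hive-site.xml. A minimal sketch; the values below are illustrative examples, not tuning recommendations:

```sql
-- Illustrative session-level overrides (example values only):
SET hive.exec.orc.default.compress=SNAPPY;       -- trade compression ratio for speed
SET hive.exec.orc.default.stripe.size=134217728; -- 128 MB stripes instead of 64 MB
SET hive.exec.orc.zerocopy=true;                 -- requires Hadoop 2.3 or later
```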

4. Using ORC in Python

In Python, you can work with ORC through the PyArrow package from the Apache Arrow project, or through the Dask package. Below is how to install each, followed by a usage example:

Installing and using PyArrow

Installation:

pip3 install pyarrow==7.0.0
pip3 install pandas

Reading and writing an ORC file:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: from pyarrow import orc

In [4]: orc.write_table(pa.table({"col1": [1, 2, 3]}), "test.orc")

In [5]: orc.read_table("test.orc").to_pandas()
Out[5]:
   col1
0     1
1     2
2     3

Installing and using Dask

Installation:

pip3 install "dask[dataframe]==2022.2.0"
pip3 install pandas

Reading and writing ORC files:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pf = pd.DataFrame(data={"col1": [1, 2, 3]})

In [4]: dd.to_orc(dd.from_pandas(pf, npartitions=2), path="/tmp/orc")
Out[4]: (None,)

In [5]: dd.read_orc(path="/tmp/orc").compute()
Out[5]:
   col1
0     1
1     2
2     3

5. Using ORC in Spark

Apache Spark also integrates well with ORC. Let's look at how to use ORC in Spark, along with the relevant configuration.

Syntax in Spark

In Spark's CREATE TABLE statement you can even save a few characters: just append USING ORC at the end:

CREATE TABLE istari (
  name STRING,
  color STRING
) USING ORC;

To inspect an ORC file's metadata, use the orc-tools command:

% orc-tools meta 

To display the data in an ORC file, use:

% orc-tools data 

Related configuration in Spark

As in Hive, Spark has table-level and system-level properties that affect ORC behavior; they are introduced below in table form.

Table properties

Tables stored as ORC files use a number of ORC table properties to control their storage behavior. By defining or modifying these table properties, you can ensure that all clients store data with the same settings.

| Property | Default | Notes |
|----------|---------|-------|
| orc.compress | ZLIB | high-level compression = {NONE, ZLIB, SNAPPY, ZSTD} |
| orc.compress.size | 262,144 | compression chunk size |
| orc.stripe.size | 67,108,864 | memory buffer in bytes for writing |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | create indexes? |
| orc.bloom.filter.columns | "" | comma-separated list of column names |
| orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate |
| orc.key.provider | "hadoop" | key provider |
| orc.encrypt | "" | list of keys and columns to encrypt with |
| orc.mask | "" | masks to apply to the encrypted columns |

Here is an example of an ORC table that uses these parameters to encrypt and mask columns:

CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
  orc.key.provider "hadoop",
  orc.encrypt "pii:ssn,email",
  orc.mask "nullify:ssn;sha256:email"
);

Configuration properties

Spark's configuration also includes several settings related to the ORC file format, summarized in the table below:

| Property | Default | Notes |
|----------|---------|-------|
| spark.sql.orc.impl | native | The name of the ORC implementation. It can be one of native or hive. native means the built-in ORC support; hive means the ORC library in Hive. |
| spark.sql.orc.enableVectorizedReader | true | Enables vectorized ORC decoding in the native implementation. |
| spark.sql.orc.mergeSchema | false | When true, the ORC data source merges schemas collected from all data files; otherwise the schema is picked from a random data file. |
| spark.sql.hive.convertMetastoreOrc | true | When true, Spark SQL uses the built-in ORC reader and writer, instead of the Hive SerDe, for ORC tables created with HiveQL syntax. |
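
These options can be set in spark-defaults.conf, passed with --conf to spark-submit, or changed per session. A minimal sketch using SET in spark-sql; the values are illustrative examples, not recommendations:

```sql
-- Illustrative per-session settings in spark-sql (example values only):
SET spark.sql.orc.impl=native;
SET spark.sql.orc.enableVectorizedReader=true;
SET spark.sql.orc.mergeSchema=false;
```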
Source: https://www.mshxw.com/it/758456.html