
Apache ORC Deep Dive (Part 2)

In the previous article we explored Apache ORC's history, its current Hadoop compatibility, and the data types it supports. Today we look at how to actually use Apache ORC.

3. Using ORC in Hive

Hive arguably offers the most complete ORC integration of any engine. Let's look at how to use ORC in Hive, along with the relevant configuration.

Syntax in Hive

In Hive, for a new table you only need to append "STORED AS ORC" to the CREATE TABLE statement. For example, the table istari below:

CREATE TABLE istari (
  name STRING,
  color STRING
) STORED AS ORC;

To convert an existing table or table partition to ORC, use ALTER TABLE:

ALTER TABLE istari SET FILEFORMAT ORC;

Additionally, starting with Hive 0.14, you can manually merge small ORC files with the CONCATENATE command. The merge happens at the stripe level, so the files do not need to be re-serialized. Syntax:

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

To inspect an ORC file's metadata, use Hive's orcfiledump command:

% hive --orcfiledump 

Starting with Hive 1.1, the command also accepts a -d option, which dumps the file's data:

% hive --orcfiledump -d 

Related configuration in Hive

In Hive, several table-level and system-level properties affect ORC behavior; they are introduced below in table form.

Table properties

Tables stored as ORC files use a number of ORC table properties to control their storage behavior. By defining or modifying these table properties, you can ensure that all clients store data with the same settings.

| Property | Default | Notes |
|----------|---------|-------|
| orc.compress | ZLIB | high-level compression = {NONE, ZLIB, SNAPPY} |
| orc.compress.size | 262,144 | compression chunk size |
| orc.stripe.size | 67,108,864 | memory buffer in bytes for writing |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | create indexes? |
| orc.bloom.filter.columns | "" | comma-separated list of column names |
| orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate |

The following example creates an ORC table without high-level compression:

CREATE TABLE istari (
  name STRING,
  color STRING
) STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");

Configuration properties

Hive's configuration also includes many settings related to the ORC file format, summarized in the table below:

| Property | Default | Notes |
|----------|---------|-------|
| hive.default.fileformat | TextFile | This is the default file format for new tables. If it is set to ORC, new tables will default to ORC. |
| hive.stats.gather.num.threads | 10 | Number of threads used by partialscan/noscan analyze command for partitioned tables. This is applicable only for file formats that implement the StatsProvidingRecordReader interface (like ORC). |
| hive.exec.orc.memory.pool | 0.5 | Maximum fraction of heap that can be used by ORC file writers. |
| hive.exec.orc.write.format | NULL | Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the latest version. |
| hive.exec.orc.default.stripe.size | 67,108,864 | Define the default size of ORC writer buffers in bytes. |
| hive.exec.orc.default.block.size | 268,435,456 | Define the default file system block size for ORC files. |
| hive.exec.orc.dictionary.key.size.threshold | 0.8 | If the number of keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1.0 to always use dictionary encoding. |
| hive.exec.orc.default.row.index.stride | 10,000 | Define the default number of rows between row index entries. |
| hive.exec.orc.default.buffer.size | 262,144 | Define the default ORC buffer size, in bytes. |
| hive.exec.orc.default.block.padding | true | Should ORC file writers pad stripes to minimize stripes that cross HDFS block boundaries. |
| hive.exec.orc.block.padding.tolerance | 0.05 | Define the tolerance for block padding as a decimal fraction of stripe size (for example, the default value 0.05 is 5% of the stripe size). For the defaults of 64Mb ORC stripe and 256Mb HDFS blocks, a maximum of 3.2Mb will be reserved for padding within the 256Mb block with the default hive.exec.orc.block.padding.tolerance. In that case, if the available size within the block is more than 3.2Mb, a new smaller stripe will be inserted to fit within that space. This will make sure that no stripe written will cross block boundaries and cause remote reads within a node local task. |
| hive.exec.orc.default.compress | ZLIB | Define the default compression codec for ORC file. |
| hive.exec.orc.encoding.strategy | SPEED | Define the encoding strategy to use while writing data. Changing this will only affect the lightweight encoding for integers. This flag will not change the compression level of higher level compression codecs (like ZLIB). Possible options are SPEED and COMPRESSION. |
| hive.orc.splits.include.file.footer | false | If turned on, splits generated by ORC will include metadata about the stripes in the file. This data is read remotely (from the client or HiveServer2 machine) and sent to all the tasks. |
| hive.orc.cache.stripe.details.size | 10,000 | Cache size for keeping meta information about ORC splits cached in the client. |
| hive.orc.compute.splits.num.threads | 10 | How many threads ORC should use to create splits in parallel. |
| hive.exec.orc.skip.corrupt.data | false | If ORC reader encounters corrupt data, this value will be used to determine whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception. |
| hive.exec.orc.zerocopy | false | Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.) |
| hive.merge.orcfile.stripe.level | true | When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files. Note that enabling this configuration property will not honor the padding tolerance configuration (hive.exec.orc.block.padding.tolerance). |
| hive.orc.row.index.stride.dictionary.check | true | If enabled, the dictionary check will happen after the first row index stride (default 10,000 rows); otherwise the dictionary check will happen before writing the first stripe. In both cases, the decision to use a dictionary or not will be retained thereafter. |
| hive.exec.orc.compression.strategy | SPEED | Define the compression strategy to use while writing data. This changes the compression level of higher level compression codecs. Value can be SPEED or COMPRESSION. |
| orc.write.variable.length.blocks | false | Should the ORC writer use HDFS variable length blocks, if they are available? If the new stripe would straddle a block, Hadoop is ≥ 2.7, and this is enabled, it will end the block before the new stripe. |
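
Most of these properties can also be overridden per session with SET before writing data, so you can experiment without touching hive-site.xml. A minimal sketch; the values below are illustrative examples, not tuning recommendations:

```sql
-- Illustrative session-level overrides (example values only):
SET hive.exec.orc.default.compress=SNAPPY;       -- trade compression ratio for speed
SET hive.exec.orc.default.stripe.size=134217728; -- 128 MB stripes instead of 64 MB
SET hive.exec.orc.zerocopy=true;                 -- requires Hadoop 2.3 or later
```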

4. Using ORC in Python

In Python, you can work with ORC through the PyArrow package from the Apache Arrow project, or through the Dask package. Below is how to install each, followed by a usage example:

Installing and using PyArrow

Installation:

pip3 install pyarrow==7.0.0
pip3 install pandas

Reading and writing an ORC file:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: from pyarrow import orc

In [4]: orc.write_table(pa.table({"col1": [1, 2, 3]}), "test.orc")

In [5]: orc.read_table("test.orc").to_pandas()
Out[5]:
   col1
0     1
1     2
2     3

Installing and using Dask

Installation:

pip3 install "dask[dataframe]==2022.2.0"
pip3 install pandas

Reading and writing ORC files:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pf = pd.DataFrame(data={"col1": [1, 2, 3]})

In [4]: dd.to_orc(dd.from_pandas(pf, npartitions=2), path="/tmp/orc")
Out[4]: (None,)

In [5]: dd.read_orc(path="/tmp/orc").compute()
Out[5]:
   col1
0     1
1     2
2     3

5. Using ORC in Spark

Apache Spark also integrates well with ORC. Let's look at how to use ORC in Spark, along with the relevant configuration.

Syntax in Spark

In Spark's CREATE TABLE statement you can even save a few characters: just append USING ORC at the end:

CREATE TABLE istari (
  name STRING,
  color STRING
) USING ORC;

To inspect an ORC file's metadata, use the orc-tools command:

% orc-tools meta 

To display the data in an ORC file, use:

% orc-tools data 

Related configuration in Spark

As in Hive, Spark has table-level and system-level properties that affect ORC behavior; they are introduced below in table form.

Table properties

Tables stored as ORC files use a number of ORC table properties to control their storage behavior. By defining or modifying these table properties, you can ensure that all clients store data with the same settings.

| Property | Default | Notes |
|----------|---------|-------|
| orc.compress | ZLIB | high-level compression = {NONE, ZLIB, SNAPPY, ZSTD} |
| orc.compress.size | 262,144 | compression chunk size |
| orc.stripe.size | 67,108,864 | memory buffer in bytes for writing |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | create indexes? |
| orc.bloom.filter.columns | "" | comma-separated list of column names |
| orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate |
| orc.key.provider | "hadoop" | key provider |
| orc.encrypt | "" | list of keys and columns to encrypt with |
| orc.mask | "" | masks to apply to the encrypted columns |

Here is an example of an ORC table that uses these parameters to encrypt and mask columns:

CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
  orc.key.provider "hadoop",
  orc.encrypt "pii:ssn,email",
  orc.mask "nullify:ssn;sha256:email"
);

Configuration properties

Spark's configuration also includes several settings related to the ORC file format, summarized in the table below:

| Property | Default | Notes |
|----------|---------|-------|
| spark.sql.orc.impl | native | The name of the ORC implementation. It can be one of native or hive. native means the built-in ORC support; hive means the ORC library in Hive. |
| spark.sql.orc.enableVectorizedReader | true | Enables vectorized ORC decoding in the native implementation. |
| spark.sql.orc.mergeSchema | false | When true, the ORC data source merges schemas collected from all data files; otherwise the schema is picked from a random data file. |
| spark.sql.hive.convertMetastoreOrc | true | When true, Spark SQL uses the built-in ORC reader and writer, instead of the Hive SerDe, for ORC tables created with HiveQL syntax. |
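
These options can be set in spark-defaults.conf, passed with --conf to spark-submit, or changed per session. A minimal sketch using SET in spark-sql; the values are illustrative examples, not recommendations:

```sql
-- Illustrative per-session settings in spark-sql (example values only):
SET spark.sql.orc.impl=native;
SET spark.sql.orc.enableVectorizedReader=true;
SET spark.sql.orc.mergeSchema=false;
```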
Source: https://www.mshxw.com/it/758456.html