Spark/Hive如何解析嵌套Json

Hive解析嵌套Json用get_json_object

数据是string类型，拆分的时候get_json_object的逗号后要加个空格

{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}

查询不嵌套字段 (json外加上单引号)

select get_json_object('{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}', '$.KdProjCode')

得到A20160518015NB

查询嵌套字段

select get_json_object('{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}', '$.PriceHzInfo.Item.HzAmountBefore')

得到 1.0190805449E10

Hive用explode拆分嵌套Json数组

这样数组里面套Json然后嵌套数组的， explode就是将hive一行中复杂的array或者map结构拆分成多行

[{"categoryId":"9","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]}
,{"categoryId":"5","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]}
,{"categoryId":"10","categoryName":"xxx","brandList" [{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}]}
]

[{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"516","brandName":"xxx"},{"brandId":"650","brandName":"xxx"},{"brandId":"586","brandName":"xxx"},{"brandId":"630","brandName":"xxx"}],"categoryId":"542","categoryName":"xxx"}
,{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"650","brandName":"xxx"}],"categoryId":"7","categoryName":"xxx"}
,{"brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}],"categoryId":"10","categoryName":"xxx"}
]

使用hive自带的get_json_object函数进行处理：

select get_json_object(brand_control,"$[0].brandList"),
get_json_object(get_json_object(brand_control,"$[0].brandList"), "$[0].brandName")
from 库名.表名
where dayid='20190729'

尝试之后，发现最终的效果只能是取出其中一个brandname，并不能取出全部

explode

其实Hive SQL也可以一行sql直接搞定，没必要写什么udf，思路如下：

将json 以brand_name进行切分成多行使用lateral view和explode进行配合处理对每一行数据进行处理提取出品牌名剔除为数字的情况

SQL如下：

select 
  seller_id,
  collect_set(split(split(brand_name,'":"')[1],'"')[0]) as brand_name
from 
  (select 
    seller_id,
    brand_name,
    brand_control
  from 库名.表名
  lateral view 
    explode(
      split(brand_control,'brandName')
      ) adTable as brand_name
  where dayid = '20190729'
  ) a
where (split(split(brand_name,'":"')[1],'"')[0] REGEXP '[^0-9.]')!=0  --剔除brand_name为数字的情况
group by seller_id

这样，对应的一个seller_id就将对应的所有branName给全部取了出来，以一个list的形式拼接在了一起

Spark解析JSON用get_json_object()

get_json_object() 方法从一个json 字符串中根据指定的json路径抽取一个json 对象

根据指定数据，获取一个Dataframe

import org.apache.spark.sql.functions._
val json4 = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cn": "United States"}"""))
  .toDF("id", "json")
json4.printSchema()

schema结构为

使用get_json_object 从json字符串中提取列：

// 利用get_json_object 从 json字符串中，提取列
val jsDF = json4.select($"id",
  get_json_object($"json", "$.device_type").alias("device_type"),
  get_json_object($"json", "$.ip").alias("ip"),
  get_json_object($"json", "$.cn").alias("cn"))
jsDF.printSchema()

schema信息为

读取嵌套Json

读取文件

val json: Dataframe = session.read.json("jsonlog2.json")
json.printSchema()

操作嵌套json的方式:

//DSL 语法的查询
json.select("address.province").show()
 
// 使用sql语法查询
json.createTempView("v_tmp")
session.sql("select address.city from  v_tmp").show()

操作嵌套json数组-explode函数

读取json文件

val json3 = session.read.json("jsonlog3array.json")
json3.printSchema()
json3.show()

解决方案：

利用explode函数，把数组数据进行展开。

// 导入sparksql中的函数
import org.apache.spark.sql.functions._
// 利用explode函数  把json数组进行展开， 数组中的每一条数据，都是一条记录
val explodeDF = json3
.select($"name", explode($"myScore")).toDF("name", "score")
 
explodeDF.printSchema()
 
// 再次进行查询  类似于普通的数据库表  默认schema： score1, 可以通过as 指定schema名称
val json3Res: Dataframe = explodeDF.select($"name", $"score.score1",
  $"score.score2" as "score222")
// 创建临时视图
json3Res.createTempView("v_jsonArray")
// 写sql，分别求平均值
session.sql("select name,avg(score1),avg(score222) from v_jsonArray group by name")
  .show()

Spark/Hive如何解析嵌套Json

大数据系统相关栏目本月热门文章