
How to Parse Nested JSON in Spark/Hive

Parsing nested JSON in Hive with get_json_object

The data is a plain string. When extracting fields with get_json_object, note the comma separating its two arguments (a space after the comma keeps the call readable).

{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}

Query a non-nested field (wrap the JSON literal in single quotes):

select get_json_object('{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}', '$.KdProjCode')   

This returns A20160518015NB.

Query a nested field:

select get_json_object('{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}', '$.PriceHzInfo.Item.HzAmountBefore')

This returns 1.0190805449E10.
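To see what the dotted path is doing, here is a minimal plain-Python sketch of the lookup; `get_json_object` below is a toy re-implementation of the path traversal, not the Hive UDF, and the sample string is a trimmed version of the document's JSON:

```python
import json

# Trimmed sample in the shape of the document's JSON (DTjType value translated).
doc = ('{"KdProjCode":"A20160518015NB","DTjType":"price adjustment",'
       '"PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,'
       '"HzAmountBefore":1.0190805449E10}}}')

def get_json_object(js, path):
    """Toy resolver for a dotted path like '$.PriceHzInfo.Item.HzAmountBefore'."""
    obj = json.loads(js)
    for key in path.lstrip("$.").split("."):
        obj = obj[key]   # walk one level deeper per path segment
    return obj

print(get_json_object(doc, "$.KdProjCode"))                       # A20160518015NB
print(get_json_object(doc, "$.PriceHzInfo.Item.HzAmountBefore"))  # 10190805449.0
```

Each `.`-separated segment of the path selects one level of nesting, which is exactly how the nested `PriceHzInfo.Item.HzAmountBefore` query above reaches the inner value.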

Splitting nested JSON arrays in Hive with explode

When the data is an array of JSON objects that themselves contain nested arrays, explode is the tool: it expands a complex array or map in a single Hive row into multiple rows.

[{"categoryId":"9","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]}
,{"categoryId":"5","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]}
,{"categoryId":"10","categoryName":"xxx","brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}]}
]
[{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"516","brandName":"xxx"},{"brandId":"650","brandName":"xxx"},{"brandId":"586","brandName":"xxx"},{"brandId":"630","brandName":"xxx"}],"categoryId":"542","categoryName":"xxx"}
,{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"650","brandName":"xxx"}],"categoryId":"7","categoryName":"xxx"}
,{"brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}],"categoryId":"10","categoryName":"xxx"}
]
First attempt, using Hive's built-in get_json_object function:

select get_json_object(brand_control,"$[0].brandList"),
       get_json_object(get_json_object(brand_control,"$[0].brandList"), "$[0].brandName")
from db_name.table_name
where dayid='20190729'

After trying this, it turns out the query can only pull out a single brandName, never all of them: the array indices in the path are fixed at [0].
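The limitation is easy to see in plain Python; the `brand_control` value and brand names below are made-up stand-ins in the shape of the arrays above:

```python
import json

# Hypothetical brand_control value shaped like the arrays above.
brand_control = json.loads(
    '[{"categoryId":"9","brandList":[{"brandId":"597","brandName":"Acme"},'
    '{"brandId":"650","brandName":"Beta"}]}]')

# "$[0].brandList" then "$[0].brandName" fixes both indices at 0,
# so only the first brand of the first category is ever reached:
print(brand_control[0]["brandList"][0]["brandName"])  # Acme
```

No matter how many brands the list holds, a path with literal `[0]` indices returns one element, which is why the next section turns to explode.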

explode

In fact a single Hive SQL statement can do the whole job, with no UDF needed. The idea:

1. Split the JSON string on brandName so each brand lands on its own row.
2. Use lateral view together with explode to process every resulting row.
3. Extract the brand name from each fragment and drop purely numeric values.

The SQL:

select 
  seller_id,
  collect_set(split(split(brand_name,'":"')[1],'"')[0]) as brand_name
from 
  (select 
    seller_id,
    brand_name,
    brand_control
  from db_name.table_name
  lateral view 
    explode(
      split(brand_control,'brandName')
      ) adTable as brand_name
  where dayid = '20190729'
  ) a
where split(split(brand_name,'":"')[1],'"')[0] REGEXP '[^0-9.]'  -- drop brand_name values that are purely numeric
group by seller_id

This way every brandName belonging to a seller_id is extracted and collected into a single list.
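The same split-and-filter pipeline can be checked in plain Python; the `brand_control` string and brand names below are hypothetical stand-ins, and each step mirrors one clause of the SQL:

```python
import re

# Hypothetical brand_control string in the shape shown above.
brand_control = ('[{"categoryId":"9","brandList":[{"brandId":"597",'
                 '"brandName":"Acme"}]},{"categoryId":"5","brandList":'
                 '[{"brandId":"650","brandName":"Beta"},{"brandId":"1",'
                 '"brandName":"123"}]}]')

# Step 1: "explode" -- split the string on the literal key 'brandName'.
pieces = brand_control.split("brandName")

# Step 2: in each piece, take what follows '":"' up to the next '"',
# exactly like split(split(brand_name,'":"')[1],'"')[0] in the SQL.
names = set()
for p in pieces:
    parts = p.split('":"')
    if len(parts) > 1:
        name = parts[1].split('"')[0]
        # Step 3: keep it only if it contains a non-numeric character
        # (the REGEXP '[^0-9.]' filter in the SQL).
        if re.search(r"[^0-9.]", name):
            names.add(name)

print(sorted(names))  # ['Acme', 'Beta']
```

Note that the fragments before the first `brandName` key yield numeric junk like categoryId values, which is precisely what the regex filter removes.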

Parsing JSON in Spark with get_json_object()

The get_json_object() method extracts a JSON object from a JSON string at the specified JSON path.

Build a DataFrame from the sample data:

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF on a local Seq

val json4 = Seq(
  (0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cn": "United States"}"""))
  .toDF("id", "json")
json4.printSchema()

The printed schema has two columns, id and json.

Use get_json_object to extract columns from the JSON string:

// extract columns from the json string with get_json_object
val jsDF = json4.select($"id",
  get_json_object($"json", "$.device_type").alias("device_type"),
  get_json_object($"json", "$.ip").alias("ip"),
  get_json_object($"json", "$.cn").alias("cn"))
jsDF.printSchema()

The resulting schema now shows the extracted columns device_type, ip, and cn alongside id.

Reading nested JSON

Read the file:

val json: DataFrame = session.read.json("jsonlog2.json")
json.printSchema()

Ways to query nested JSON:

// DSL-style query
json.select("address.province").show()
 
// SQL-style query
json.createTempView("v_tmp")
session.sql("select address.city from  v_tmp").show()
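The dotted column syntax maps directly onto nested-object access. A plain-Python sketch, with a hypothetical record in the shape `jsonlog2.json` implies (field values are made up):

```python
import json

# Hypothetical record shaped like the nested JSON being read.
record = json.loads(
    '{"name":"tom","address":{"province":"Zhejiang","city":"Ningbo"}}')

# "address.province" / "address.city" in the DSL and SQL queries
# correspond to stepping through the nested structure:
print(record["address"]["province"])  # Zhejiang
print(record["address"]["city"])      # Ningbo
```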
Handling nested JSON arrays with the explode function

Read the JSON file:

val json3 = session.read.json("jsonlog3array.json")
json3.printSchema()
json3.show()

Solution: use the explode function to expand the array, so each array element becomes its own row.

// import Spark SQL functions
import org.apache.spark.sql.functions._
// explode expands the json array: every array element becomes a row
val explodeDF = json3
.select($"name", explode($"myScore")).toDF("name", "score")
 
explodeDF.printSchema()
 
// query again, as with an ordinary table; the struct fields keep their
// names (score1, score2), and `as` can rename a column
val json3Res: DataFrame = explodeDF.select($"name", $"score.score1",
  $"score.score2" as "score222")
// create a temporary view
json3Res.createTempView("v_jsonArray")
// run SQL to compute the per-name averages
session.sql("select name,avg(score1),avg(score222) from v_jsonArray group by name")
  .show() 
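The explode-then-aggregate flow can be mirrored in plain Python; the rows below are hypothetical, in the shape `jsonlog3array.json` implies (each person carries an array column `myScore` of `{score1, score2}` structs):

```python
import json

# Hypothetical rows shaped like the file being read.
rows = [json.loads(s) for s in [
    '{"name":"tom","myScore":[{"score1":80,"score2":90},'
    '{"score1":60,"score2":70}]}',
    '{"name":"jerry","myScore":[{"score1":100,"score2":50}]}',
]]

# explode: one output row per element of the array column.
exploded = [(r["name"], s) for r in rows for s in r["myScore"]]

# group by name and average score1/score2, like the final Spark SQL query.
avgs = {}
for name, s in exploded:
    a = avgs.setdefault(name, [0.0, 0.0, 0])
    a[0] += s["score1"]; a[1] += s["score2"]; a[2] += 1
result = {n: (t1 / c, t2 / c) for n, (t1, t2, c) in avgs.items()}
print(result)  # {'tom': (70.0, 80.0), 'jerry': (100.0, 50.0)}
```

Because explode emits one row per array element first, the subsequent `group by name` sees every score, not just the first element of each array.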

Reprinted from www.mshxw.com. Original article: https://www.mshxw.com/it/744988.html