数据是string类型,拆分的时候get_json_object的逗号后要加个空格
查询不嵌套字段 (json外加上单引号){"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}
select get_json_object('{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}', '$.KdProjCode')
得到A20160518015NB
查询嵌套字段select get_json_object('{"KdProjCode":"A20160518015NB","DTjType":"调价","xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance","PriceHzInfo":{"Item":{"HzAmountAfter":1.0190920349E10,"HzAmountBefore":1.0190805449E10}},"TITLE":"宁波上湖城章","PriceCompareDetails":""}', '$.PriceHzInfo.Item.HzAmountBefore')
得到 1.0190805449E10
Hive用explode拆分嵌套Json数组这样数组里面套Json然后嵌套数组的, explode就是将hive一行中复杂的array或者map结构拆分成多行
[{"categoryId":"9","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]}
,{"categoryId":"5","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]}
,{"categoryId":"10","categoryName":"xxx","brandList" [{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}]}
]
[{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"516","brandName":"xxx"},{"brandId":"650","brandName":"xxx"},{"brandId":"586","brandName":"xxx"},{"brandId":"630","brandName":"xxx"}],"categoryId":"542","categoryName":"xxx"}
,{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"650","brandName":"xxx"}],"categoryId":"7","categoryName":"xxx"}
,{"brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}],"categoryId":"10","categoryName":"xxx"}
]
使用hive自带的get_json_object函数进行处理:
select get_json_object(brand_control,"$[0].brandList"),
get_json_object(get_json_object(brand_control,"$[0].brandList"), "$[0].brandName")
from 库名.表名
where dayid='20190729'
尝试之后,发现最终的效果只能是取出其中一个brandname,并不能取出全部
explode其实Hive SQL也可以一行sql直接搞定,没必要写什么udf,思路如下:
- 将json 以brand_name进行切分成多行 使用lateral view和explode进行配合处理对每一行数据进行处理 提取出品牌名剔除为数字的情况
SQL如下:
select
seller_id,
collect_set(split(split(brand_name,'":"')[1],'"')[0]) as brand_name
from
(select
seller_id,
brand_name,
brand_control
from 库名.表名
lateral view
explode(
split(brand_control,'brandName')
) adTable as brand_name
where dayid = '20190729'
) a
where (split(split(brand_name,'":"')[1],'"')[0] REGEXP '[^0-9.]')!=0 --剔除brand_name为数字的情况
group by seller_id
这样,对应的一个seller_id就将对应的所有branName给全部取了出来,以一个list的形式拼接在了一起
Spark解析JSON用get_json_object()get_json_object() 方法 从一个json 字符串中根据指定的json路径抽取一个json 对象
根据指定数据,获取一个Dataframe
import org.apache.spark.sql.functions._
val json4 = Seq(
(0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cn": "United States"}"""))
.toDF("id", "json")
json4.printSchema()
schema结构为
使用get_json_object 从json字符串中提取列:
// 利用get_json_object 从 json字符串中,提取列
val jsDF = json4.select($"id",
get_json_object($"json", "$.device_type").alias("device_type"),
get_json_object($"json", "$.ip").alias("ip"),
get_json_object($"json", "$.cn").alias("cn"))
jsDF.printSchema()
schema信息为
读取嵌套Json读取文件
val json: Dataframe = session.read.json("jsonlog2.json")
json.printSchema()
操作嵌套json的方式:
//DSL 语法的查询
json.select("address.province").show()
// 使用sql语法查询
json.createTempView("v_tmp")
session.sql("select address.city from v_tmp").show()
操作嵌套json数组-explode函数
读取json文件
val json3 = session.read.json("jsonlog3array.json")
json3.printSchema()
json3.show()
解决方案:
利用explode函数,把数组数据进行展开。
// 导入sparksql中的函数
import org.apache.spark.sql.functions._
// 利用explode函数 把json数组进行展开, 数组中的每一条数据,都是一条记录
val explodeDF = json3
.select($"name", explode($"myScore")).toDF("name", "score")
explodeDF.printSchema()
// 再次进行查询 类似于普通的数据库表 默认schema: score1, 可以通过as 指定schema名称
val json3Res: Dataframe = explodeDF.select($"name", $"score.score1",
$"score.score2" as "score222")
// 创建临时视图
json3Res.createTempView("v_jsonArray")
// 写sql,分别求平均值
session.sql("select name,avg(score1),avg(score222) from v_jsonArray group by name")
.show()



