Spark: org.apache.spark.sql.AnalysisException: Resolved attribute(s) ... missing from....

Background: processing data with Spark code written in Scala.

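Before you start:

First check whether your code is simply missing a column, that is, whether the column XXX being resolved does not exist at all. If it really is missing, add it and you are done; there is no need to read further.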

The full error log is as follows:

ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Resolved attribute(s) team_id#51L missing from team_id#479L, … in operator !Join LeftOuter, (latest_secondary_team_id#328L = team_id#51L). Attribute(s) with the same name appear in the operation: team_id. Please check if the right attribute(s) are used.;;
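The log says that team_id#51L is missing, yet the attribute list that follows does contain team_id#479L. The name is the same; only the number after the # (the ID Spark assigns to each attribute) differs.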

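When does this problem occur:

Table A is joined with table B on column abc to produce table C, and then C is joined with B on column abc again. At that point Spark may fail with: Resolved attribute(s) columnA#<some number> missing from columnA#<another number>, …. My guess is that if C join B succeeded, the result would contain two abc columns, and duplicate column names are not allowed.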

The solution in one sentence:

Reusing the same DataFrame reference creates ambiguity in naming, so you have to clone the DataFrame or rename all of the columns you use.

My code: rename every column that is needed

    // Select the columns that are needed, then rename each of them
    val recalculationTeamInfo = distinctRtxOrbacInfo.select("first_team_name")
      .withColumnRenamed("first_team_name", "new_" + "first_team_name")
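When more than one column is involved, the same idea can be applied to the whole DataFrame before the second join. The following is only a minimal sketch, not the original job: the tables a and b, their columns, the "new_" prefix, the helper withPrefix and the local SparkSession are all made up for illustration, and it assumes spark-sql is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameBeforeJoin {
  // Hypothetical helper: put a prefix in front of every column name.
  def withPrefix(df: DataFrame, prefix: String): DataFrame =
    df.toDF(df.columns.map(prefix + _): _*)

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-before-join")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-ins for table A and table B, sharing the column abc.
    val a = Seq((1L, "a1")).toDF("abc", "a_val")
    val b = Seq((1L, "b1")).toDF("abc", "b_val")

    // First join: A join B on abc gives C.
    val c = a.join(b, Seq("abc"), "left_outer")

    // Rename all columns of B before joining it to C a second time,
    // so that no column name is shared between the two sides.
    val bRenamed = withPrefix(b, "new_")
    val result = c.join(bRenamed, c("abc") === bRenamed("new_abc"), "left_outer")

    result.show()
    spark.stop()
  }
}
```

Renaming everything up front costs a few extra column names, but it means you never have to work out which of two identically named columns a join condition refers to.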

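PS: val dataframe2 = dataframe1 is not the correct way to clone a DataFrame.
I have tried val dataframe1Clone = dataframe1.as("dataframe1Clone"). Sometimes it solves the problem and sometimes it does not, so I generally recommend renaming the columns, which settles the matter once and for all.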
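For comparison, here is what the alias approach can look like when the join condition refers to columns by alias-qualified name rather than through the original DataFrame reference. This is a hedged sketch under the same made-up assumptions (tables a and b, column abc, aliases "c" and "b2", the helper joinAgain, a local SparkSession). It only shows the shape of the call and does not guarantee that aliasing works in every case.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasBeforeJoin {
  // Hypothetical helper: alias both sides, then refer to the join columns
  // by qualified name ("alias.column") instead of by DataFrame reference.
  def joinAgain(c: DataFrame, b: DataFrame): DataFrame = {
    val cAliased = c.as("c")
    val bAliased = b.as("b2")
    cAliased.join(bAliased, col("c.abc") === col("b2.abc"), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("alias-before-join")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-ins for table A and table B.
    val a = Seq((1L, "a1")).toDF("abc", "a_val")
    val b = Seq((1L, "b1")).toDF("abc", "b_val")

    // First join gives C; the second join goes through the aliases.
    val c = a.join(b, Seq("abc"), "left_outer")
    val result = joinAgain(c, b)

    // The result keeps both abc columns under the same name, so anything
    // downstream still has to tell them apart by alias.
    result.printSchema()
    spark.stop()
  }
}
```

Note that the output still contains duplicate column names, which is one more reason renaming tends to be the less surprising option.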

Reference:

https://stackoverflow.com/questions/45713290/how-to-resolve-the-analysisexception-resolved-attributes-in-spark/53848160
