Structure of column metadata in a Spark DataFrame

*******************************************************************************************
1. Structure of the schema
df.schema

StructType(
StructField(Elevation,IntegerType,true),
StructField(Aspect,IntegerType,true),
StructField(Slope,IntegerType,true)
)


*******************************************************************************************
2. Viewing metadata: inspect the metadata of a single column
df.schema("features").metadata
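
As a rough illustration of where this metadata lives, the sketch below works on the JSON form of a schema, in which every field carries its own `metadata` object. The JSON string is a hand-written sample in the shape Spark uses when serializing a schema (an assumption for illustration, not output captured from the original job), and `field_metadata` is a hypothetical helper name. Only Python's standard `json` module is used.

```python
import json

# Hand-written sample in the shape of a serialized Spark schema (assumed for illustration).
SCHEMA_JSON = """
{"type": "struct", "fields": [
  {"name": "Elevation", "type": "integer", "nullable": true, "metadata": {}},
  {"name": "Aspect", "type": "integer", "nullable": true, "metadata": {}},
  {"name": "Slope", "type": "integer", "nullable": true, "metadata": {}}
]}
"""

def field_metadata(schema_json, column):
    """Return the metadata dict of the named column (hypothetical helper)."""
    schema = json.loads(schema_json)
    for field in schema["fields"]:
        if field["name"] == column:
            return field["metadata"]
    raise KeyError(column)

meta = field_metadata(SCHEMA_JSON, "Elevation")
print(meta)  # a column that has not been transformed has empty metadata
```

Metadata is stored per field, so looking up a column that is not in the schema raises `KeyError` here rather than returning an empty dict.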


3. A column that has not been transformed has empty metadata
For example, right after a data file is read in, the column metadata is empty.
It is displayed as: {}

4. Metadata of the assembled column after VectorAssembler
{"ml_attr":{"attrs":{"numeric":[
{"idx":0,"name":"Elevation"},
{"idx":1,"name":"Aspect"},
{"idx":2,"name":"Slope"},
{"idx":3,"name":"Horizontal_Distance_To_Hydrology"},
{"idx":4,"name":"Vertical_Distance_To_Hydrology"},
{"idx":5,"name":"Horizontal_Distance_To_Roadways"},
{"idx":6,"name":"Hillshade_9am"},
{"idx":7,"name":"Hillshade_Noon"},
{"idx":8,"name":"Hillshade_3pm"},
{"idx":9,"name":"Horizontal_Distance_To_Fire_Points"},
{"idx":10,"name":"Wilderness_Area_0"},
{"idx":11,"name":"Wilderness_Area_1"},
{"idx":12,"name":"Wilderness_Area_2"},
{"idx":13,"name":"Wilderness_Area_3"},
{"idx":14,"name":"Soil_Type_0"},
{"idx":15,"name":"Soil_Type_1"},
{"idx":16,"name":"Soil_Type_2"},
{"idx":17,"name":"Soil_Type_3"},
{"idx":18,"name":"Soil_Type_4"},
{"idx":19,"name":"Soil_Type_5"},
{"idx":20,"name":"Soil_Type_6"},
{"idx":21,"name":"Soil_Type_7"},
{"idx":22,"name":"Soil_Type_8"},
{"idx":23,"name":"Soil_Type_9"},
{"idx":24,"name":"Soil_Type_10"},
{"idx":25,"name":"Soil_Type_11"},
{"idx":26,"name":"Soil_Type_12"},
{"idx":27,"name":"Soil_Type_13"},
{"idx":28,"name":"Soil_Type_14"},
{"idx":29,"name":"Soil_Type_15"},
{"idx":30,"name":"Soil_Type_16"},
{"idx":31,"name":"Soil_Type_17"},
{"idx":32,"name":"Soil_Type_18"},
{"idx":33,"name":"Soil_Type_19"},
{"idx":34,"name":"Soil_Type_20"},
{"idx":35,"name":"Soil_Type_21"},
{"idx":36,"name":"Soil_Type_22"},
{"idx":37,"name":"Soil_Type_23"},
{"idx":38,"name":"Soil_Type_24"},
{"idx":39,"name":"Soil_Type_25"},
{"idx":40,"name":"Soil_Type_26"},
{"idx":41,"name":"Soil_Type_27"},
{"idx":42,"name":"Soil_Type_28"},
{"idx":43,"name":"Soil_Type_29"},
{"idx":44,"name":"Soil_Type_30"},
{"idx":45,"name":"Soil_Type_31"},
{"idx":46,"name":"Soil_Type_32"},
{"idx":47,"name":"Soil_Type_33"},
{"idx":48,"name":"Soil_Type_34"},
{"idx":49,"name":"Soil_Type_35"},
{"idx":50,"name":"Soil_Type_36"},
{"idx":51,"name":"Soil_Type_37"},
{"idx":52,"name":"Soil_Type_38"},
{"idx":53,"name":"Soil_Type_39"}]},
"num_attrs":54}
}
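
One practical use of this structure is recovering the feature names in vector order. The sketch below is a minimal example under stated assumptions: the JSON is an abbreviated copy of the metadata above (only the first three attributes, with `num_attrs` adjusted to match), and `feature_names` is a hypothetical helper, not a Spark API.

```python
import json

# Abbreviated copy of the metadata above: only the first three of the 54 attributes.
METADATA_JSON = """
{"ml_attr": {"attrs": {"numeric": [
  {"idx": 0, "name": "Elevation"},
  {"idx": 1, "name": "Aspect"},
  {"idx": 2, "name": "Slope"}]},
 "num_attrs": 3}}
"""

def feature_names(metadata):
    """List attribute names ordered by their position (idx) in the vector."""
    attrs = metadata["ml_attr"]["attrs"]
    # "attrs" groups attributes by kind (e.g. "numeric", "nominal"); flatten all groups.
    flat = [a for group in attrs.values() for a in group]
    # .get is used because this sketch does not assume every attribute has a name.
    return [a.get("name") for a in sorted(flat, key=lambda a: a["idx"])]

names = feature_names(json.loads(METADATA_JSON))
print(names)
```

Sorting by `idx` matters once the metadata contains more than one group, because the groups are listed separately rather than in vector order.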

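5. The 54 numeric columns above really describe only 12 attributes, because the raw data stores wilderness (4 columns) and soil (40 columns) in one-hot form
Collapse them back into 12 attributes
// each categorical attribute encoded as a single value out of K categories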
{"ml_attr":{
    "attrs":{
        "numeric":[
            {"idx":0,"name":"Elevation"},
            {"idx":1,"name":"Aspect"},
            {"idx":2,"name":"Slope"},
            {"idx":3,"name":"Horizontal_Distance_To_Hydrology"},
            {"idx":4,"name":"Vertical_Distance_To_Hydrology"},
            {"idx":5,"name":"Horizontal_Distance_To_Roadways"},
            {"idx":6,"name":"Hillshade_9am"},
            {"idx":7,"name":"Hillshade_Noon"},
            {"idx":8,"name":"Hillshade_3pm"},
            {"idx":9,"name":"Horizontal_Distance_To_Fire_Points"},
            {"idx":10,"name":"wilderness"},
            {"idx":11,"name":"soil"}
                   ]
            },
    "num_attrs":12
            }
}
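
The collapse from 54 columns to 12 can be sketched in plain Python as follows. This is an illustration of the idea only, not the code used to produce the metadata above: `collapse_one_hot` is a hypothetical helper, the layout (10 numeric columns, then a 4-column and a 40-column one-hot group) is taken from the attribute list in section 4, and the sample row is made up.

```python
def collapse_one_hot(row, n_numeric=10, group_sizes=(4, 40)):
    """Turn [numeric..., one-hot groups...] into [numeric..., one index per group]."""
    out = [float(v) for v in row[:n_numeric]]
    start = n_numeric
    for size in group_sizes:
        group = row[start:start + size]
        # The position of the single 1 in the group is the category index.
        out.append(float(group.index(1)))
        start += size
    return out

# Made-up sample row: 10 numeric values, Wilderness_Area_0 set, Soil_Type_28 set.
row = [0] * 54
row[:10] = range(100, 110)
row[10 + 0] = 1
row[14 + 28] = 1

collapsed = collapse_one_hot(row)
print(len(collapsed), collapsed[10], collapsed[11])
```

A group with no 1 in it raises `ValueError` from `list.index`, which is a reasonable failure mode for malformed one-hot input in a sketch like this.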

6. Apply VectorIndexer to index the categorical columns
// after VectorIndexer the categorical features are listed under a new "nominal" group
{"ml_attr":{
    "attrs":{
        "numeric":[
            {"idx":0,"name":"Elevation"},
            {"idx":1,"name":"Aspect"},
            {"idx":2,"name":"Slope"},
            {"idx":3,"name":"Horizontal_Distance_To_Hydrology"},
            {"idx":4,"name":"Vertical_Distance_To_Hydrology"},
            {"idx":5,"name":"Horizontal_Distance_To_Roadways"},
            {"idx":6,"name":"Hillshade_9am"},
            {"idx":7,"name":"Hillshade_Noon"},
            {"idx":8,"name":"Hillshade_3pm"},
            {"idx":9,"name":"Horizontal_Distance_To_Fire_Points"}],
        "nominal":[
            {"ord":false,
            "vals":["0.0","1.0","2.0","3.0"],
            "idx":10,"name":"wilderness"},
            {"ord":false,
            "vals":["0.0","1.0","2.0","3.0","4.0","5.0","6.0","7.0","8.0","9.0","10.0","11.0","12.0","13.0","14.0","15.0","16.0","17.0","18.0","19.0","20.0","21.0","22.0","23.0","24.0","25.0","26.0","27.0","28.0","29.0","30.0","31.0","32.0","33.0","34.0","35.0","36.0","37.0","38.0","39.0"],
            "idx":11,"name":"soil"}]
            },
    "num_attrs":12
            }
}
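
From metadata of this shape it is straightforward to derive which feature indices are categorical and how many categories each has. The sketch below assumes every nominal attribute lists its categories under `vals`, as in the output above; `categorical_arity` is a hypothetical helper, and the sample dict is rebuilt by hand from the two nominal entries shown.

```python
def categorical_arity(metadata):
    """Map feature index -> number of categories for each nominal attribute."""
    nominal = metadata["ml_attr"]["attrs"].get("nominal", [])
    return {a["idx"]: len(a["vals"]) for a in nominal}

# Hand-built sample mirroring the two nominal attributes above (numeric group omitted).
metadata = {
    "ml_attr": {
        "attrs": {
            "nominal": [
                {"ord": False, "vals": [f"{i}.0" for i in range(4)],
                 "idx": 10, "name": "wilderness"},
                {"ord": False, "vals": [f"{i}.0" for i in range(40)],
                 "idx": 11, "name": "soil"},
            ]
        },
        "num_attrs": 12,
    }
}

arity = categorical_arity(metadata)
print(arity)
```

Metadata without a `nominal` group, such as the VectorAssembler output in section 4, yields an empty dict.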

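7. The label column usually needs little processing. For classification problems, a common approach is to define the label as a string column and run StringIndexer on it; the indexed column then carries metadata, which can be used, for example, to count the number of classes
// the label column has metadata after indexing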
{"ml_attr":
    {"vals":["2.0","1.0","3.0","7.0","6.0","5.0","4.0"],
    "type":"nominal",
    "name":"coverIndex"
    }
}
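
Counting the classes from this metadata can be sketched as below, using the JSON exactly as shown above. The position of each entry in `vals` is the index assigned to that label; the variable names are illustrative.

```python
import json

# Label metadata copied from the output above.
LABEL_METADATA_JSON = """
{"ml_attr": {"vals": ["2.0", "1.0", "3.0", "7.0", "6.0", "5.0", "4.0"],
             "type": "nominal",
             "name": "coverIndex"}}
"""

label_attr = json.loads(LABEL_METADATA_JSON)["ml_attr"]
labels = label_attr["vals"]

num_classes = len(labels)                 # number of distinct labels
index_to_label = dict(enumerate(labels))  # assigned index -> original label

print(num_classes)
print(index_to_label[0])
```

The same mapping is handy for translating predicted indices back into the original label values.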

When reposting, please credit the source: www.mshxw.com
Article URL: https://www.mshxw.com/it/487509.html