自定义UDF之自定义标识分组

**
功能：根据字段匹配自行分组，传入的两个参数都必须是string类型，返回值是int类型
首先添加maven依赖，我使用的hive版本是2.3.5，根据自己需求自己更改版本



    4.0.0

    com.atweimiao.udf
    selfgroup
    1.0-SNAPSHOT
    
        
            org.apache.hive
            hive-exec
            2.3.5

创建实现类

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;


public class SelfGroup extends GenericUDF {
    private transient ObjectInspectorConverters.Converter[] converters;
    
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {

        for (int i = 0; i < arguments.length; i++) {
            ObjectInspector.Category category = arguments[i].getCategory();
            if (category != ObjectInspector.Category.PRIMITIVE) {
                throw new UDFArgumentTypeException(i, "The "
                        + GenericUDFUtils.getOrdinal(i + 1)
                        + " argument of function LOCATE is expected to a "
                        + ObjectInspector.Category.PRIMITIVE.toString().toLowerCase() + " type, but "
                        + category.toString().toLowerCase() + " is found");
            }
        }

        converters = new ObjectInspectorConverters.Converter[arguments.length];
        for (int i = 0; i < arguments.length; i++) {
            converters[i] = ObjectInspectorConverters.getConverter(arguments[i],
                    PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        }

        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }
    
    private final IntWritable intWritable = new IntWritable(0);
    int flag = 0;
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments[0].get() == null || arguments[1].get() == null) {
            return null;
        }

        String fir = arguments[0].get().toString();
        String sen = arguments[1].get().toString();
        if(fir.equals(sen)){
            flag=flag+1;
        }
        intWritable.set(flag);
        return intWritable;
    }


    public String getDisplayString(String[] children) {
        return "";
    }
}

完成之后打成jar包放到hive目录下的lib目录下，使用时添加相应jar包，创建临时函数即可使用
操作如下，添加jar包，

 add jar /opt/module/hive/datas/selfgroup.jar;

之后创建临时函数，格式： create temporary function 自定义方法名 as “全类名”，如下

 create temporary function my_group as "com.test.hive.SelfGroup";

然后就可以使用了，效果如下

如果不是机房，是云环境是要将jar包传到hdfs的对应目录下引用才能正常使用

自定义UDF之自定义标识分组

大数据系统相关栏目本月热门文章