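This article is part of the column "1000个问题搞定大数据技术体系" (1000 Questions to Master the Big Data Technology Stack). The column is the author's original work; please credit the source when quoting, and point out any omissions or errors in the comments. Thank you!
Contents
For the column's table of contents and references, see 1000个问题搞定大数据技术体系.
Spark SQL functions.scala source code analysis (1): Sort functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (2): Aggregate functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (3): Window functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (4): Non-aggregate functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (5): Math functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (6): Misc functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (7): String functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (8): DateTime functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (9): Collection functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (10): Partition transform functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (11): Scala UDF functions (based on Spark 3.3.0)
Spark SQL functions.scala source code analysis (12): Java UDF functions (based on Spark 3.3.0)
Main text
ascii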
def ascii(e: Column): Column = withExpr { Ascii(e.expr) }
Usage
========== df.select(ascii($"a"), ascii($"b"), ascii($"c")).show() ==========
+--------+--------+--------+
|ascii(a)|ascii(b)|ascii(c)|
+--------+--------+--------+
|      97|      97|       0|
+--------+--------+--------+
base64
def base64(e: Column): Column = withExpr { Base64(e.expr) }
Usage
========== df.select(base64($"a"), base64($"b"), base64($"c")).show() ==========
+---------+---------+---------+
|base64(a)|base64(b)|base64(c)|
+---------+---------+---------+
|     YWJj| YWFhQmI=|         |
+---------+---------+---------+
bit_length
def bit_length(e: Column): Column = withExpr { BitLength(e.expr) }
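The article shows no usage output for bit_length, so here is a minimal plain-Scala sketch of what the three length-style functions in this article measure, assuming UTF-8 encoded strings. The object and method names (LengthSketch, charLength, octetLength, bitLength) are hypothetical and only mirror length, octet_length, and bit_length; this is not Spark's implementation.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark: mirrors length / octet_length / bit_length
// for a UTF-8 string.
object LengthSketch {
  // Number of characters (Unicode code points).
  def charLength(s: String): Int = s.codePointCount(0, s.length)

  // Number of bytes in the UTF-8 encoding.
  def octetLength(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length

  // Number of bits: eight per byte.
  def bitLength(s: String): Int = octetLength(s) * 8
}
```

For "abc" the three values are 3, 3, and 24; for "中国" they are 2, 6, and 48, because each of those characters takes three bytes in UTF-8.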
concat_ws
@scala.annotation.varargs
def concat_ws(sep: String, exprs: Column*): Column = withExpr {
ConcatWs(Literal.create(sep, StringType) +: exprs.map(_.expr))
}
Usage
========== df.select(concat_ws(";", $"a", $"b", $"c")).show() ==========
+---------------------+
|concat_ws(;, a, b, c)|
+---------------------+
| abc;aaaBb;|
+---------------------+
decode/encode
def decode(value: Column, charset: String): Column = withExpr {
StringDecode(value.expr, lit(charset).expr)
}
def encode(value: Column, charset: String): Column = withExpr {
Encode(value.expr, lit(charset).expr)
}
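As a rough plain-JVM analogue of the two definitions above, the sketch below converts between a string and its bytes with java.nio.charset.Charset. The names CharsetSketch, encode, decode, and hex are hypothetical; the hex helper only imitates the space-separated hexadecimal form in which a binary column is displayed.

```scala
import java.nio.charset.Charset

// Hypothetical helper, not part of Spark.
object CharsetSketch {
  // String -> bytes in the given charset.
  def encode(s: String, charset: String): Array[Byte] =
    s.getBytes(Charset.forName(charset))

  // Bytes -> string in the given charset.
  def decode(bytes: Array[Byte], charset: String): String =
    new String(bytes, Charset.forName(charset))

  // Render bytes as two-digit uppercase hex values separated by spaces.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString("[", " ", "]")
}
```

With this sketch, hex(encode("abc", "utf-8")) gives [61 62 63], and decoding those bytes with the same charset returns "abc".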
Usage
========== df.select(decode($"a", "utf-8")).show() ==========
+----------------------+
|stringdecode(a, utf-8)|
+----------------------+
|                   abc|
+----------------------+
========== df.select(encode($"a", "utf-8")).show() ==========
+----------------+
|encode(a, utf-8)|
+----------------+
|      [61 62 63]|
+----------------+
format_number/format_string
def format_number(x: Column, d: Int): Column = withExpr {
FormatNumber(x.expr, lit(d).expr)
}
@scala.annotation.varargs
def format_string(format: String, arguments: Column*): Column = withExpr {
FormatString((lit(format) +: arguments).map(_.expr): _*)
}
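Usage
format_number rounds with the HALF_EVEN rounding mode: it rounds toward the nearest neighbor, and when the two neighbors are equidistant it rounds toward the even one.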
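To make the HALF_EVEN rule concrete, here is a small sketch that uses java.math.BigDecimal directly rather than Spark. The name HalfEvenSketch.roundHalfEven is hypothetical, and values are passed as decimal strings so that binary floating-point representation does not blur the tie cases.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Hypothetical helper, not part of Spark: shows the HALF_EVEN rounding rule only.
object HalfEvenSketch {
  // Round a decimal string to d fractional digits using HALF_EVEN.
  def roundHalfEven(value: String, d: Int): String =
    new JBigDecimal(value).setScale(d, RoundingMode.HALF_EVEN).toPlainString
}
```

Ties go to the even neighbor: roundHalfEven("2.5", 0) is "2" while roundHalfEven("3.5", 0) is "4", and roundHalfEven("0.125", 2) is "0.12" while roundHalfEven("0.135", 2) is "0.14". Values that are not ties round to the nearest neighbor as usual, so "7.128381" at four digits becomes "7.1284".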
========== df.select(format_number(lit(5L), 4)).show() ==========
+-------------------+
|format_number(5, 4)|
+-------------------+
| 5.0000|
+-------------------+
========== df.select(format_number(lit(1.toByte), 4)).show() ==========
+-------------------+
|format_number(1, 4)|
+-------------------+
| 1.0000|
+-------------------+
========== df.select(format_number(lit(2.toShort), 4)).show() ==========
+-------------------+
|format_number(2, 4)|
+-------------------+
| 2.0000|
+-------------------+
========== df.select(format_number(lit(3.1322.toFloat), 4)).show() ==========
+------------------------+
|format_number(3.1322, 4)|
+------------------------+
| 3.1322|
+------------------------+
========== df.select(format_number(lit(4), 4)).show() ==========
+-------------------+
|format_number(4, 4)|
+-------------------+
| 4.0000|
+-------------------+
========== df.select(format_number(lit(5L), 4)).show() ==========
+-------------------+
|format_number(5, 4)|
+-------------------+
| 5.0000|
+-------------------+
========== df.select(format_number(lit(6.48173), 4)).show() ==========
+-------------------------+
|format_number(6.48173, 4)|
+-------------------------+
| 6.4817|
+-------------------------+
========== df.select(format_number(lit(BigDecimal("7.128381")), 4)).show() ==========
+--------------------------+
|format_number(7.128381, 4)|
+--------------------------+
| 7.1284|
+--------------------------+
========== df.select(format_string("aa%d%s", lit(123), lit("cc"))).show() ==========
+------------------------------+
|format_string(aa%d%s, 123, cc)|
+------------------------------+
| aa123cc|
+------------------------------+
initcap
def initcap(e: Column): Column = withExpr { InitCap(e.expr) }
Usage
========== df.select(initcap($"a"), initcap($"b"), initcap($"c")).show() ==========
+----------+----------+----------+
|initcap(a)|initcap(b)|initcap(c)|
+----------+----------+----------+
|       Abc|     Aaabb|          |
+----------+----------+----------+
instr
def instr(str: Column, substring: String): Column = withExpr {
StringInstr(str.expr, lit(substring).expr)
}
Usage
========== df.select(instr($"b", "aa")).show() ==========
+------------+
|instr(b, aa)|
+------------+
|           1|
+------------+
length
def length(e: Column): Column = withExpr { Length(e.expr) }
Usage
========== df.select(length($"a"), length($"b"), length($"c")).show() ==========
+---------+---------+---------+
|length(a)|length(b)|length(c)|
+---------+---------+---------+
|        3|        5|        0|
+---------+---------+---------+
lower
def lower(e: Column): Column = withExpr { Lower(e.expr) }
Usage
========== df.select(lower($"b")).show() ==========
+--------+
|lower(b)|
+--------+
|   aaabb|
+--------+
levenshtein
def levenshtein(l: Column, r: Column): Column = withExpr { Levenshtein(l.expr, r.expr) }
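Usage
The Levenshtein distance is a kind of edit distance: the minimum number of single-character edits needed to turn one string into another. The allowed edits are substituting one character for another, inserting a character, and deleting a character.
For example, turning kitten into sitting takes three edits:
sitten (k→s)
sittin (e→i)
sitting (→g)
The concept was introduced in 1965 by the Russian scientist Vladimir Levenshtein.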
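The definition above can be sketched with the classic dynamic-programming recurrence. This is a plain-Scala illustration under the three allowed edits (substitute, insert, delete), not Spark's own implementation; the name LevenshteinSketch.distance is hypothetical.

```scala
// Hypothetical helper, not part of Spark: textbook row-by-row edit-distance DP.
object LevenshteinSketch {
  def distance(a: String, b: String): Int = {
    // prev(j) = distance between the prefix of a processed so far and b.take(j)
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting i characters turns a.take(i) into the empty string
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        val insert = curr(j - 1) + 1
        val delete = prev(j) + 1
        val substitute = prev(j - 1) + cost
        curr(j) = math.min(math.min(insert, delete), substitute)
      }
      prev = curr
    }
    prev(b.length)
  }
}
```

distance("kitten", "sitting") is 3, matching the three steps listed above, and distance("abc", "aaaBb") is 4, because the comparison is case-sensitive.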
========== df.select(levenshtein($"a", $"b")).show() ==========
+-----------------+
|levenshtein(a, b)|
+-----------------+
|                4|
+-----------------+
locate
def locate(substr: String, str: Column): Column = withExpr {
new StringLocate(lit(substr).expr, str.expr)
}
def locate(substr: String, str: Column, pos: Int): Column = withExpr {
StringLocate(lit(substr).expr, str.expr, lit(pos).expr)
}
Usage
========== df.select(locate("aa", $"b")).show() ==========
+----------------+
|locate(aa, b, 1)|
+----------------+
| 1|
+----------------+
========== df.select(locate("aa", $"b", 2)).show() ==========
+----------------+
|locate(aa, b, 2)|
+----------------+
| 2|
+----------------+
lpad
def lpad(str: Column, len: Int, pad: String): Column = withExpr {
StringLPad(str.expr, lit(len).expr, lit(pad).expr)
}
def lpad(str: Column, len: Int, pad: Array[Byte]): Column = withExpr {
new BinaryLPad(str.expr, lit(len).expr, lit(pad).expr)
}
Usage
========== df.select(lpad($"a", 10, " ")).show() ==========
+--------------+
|lpad(a, 10,  )|
+--------------+
|           abc|
+--------------+
ltrim
def ltrim(e: Column): Column = withExpr {StringTrimLeft(e.expr) }
def ltrim(e: Column, trimString: String): Column = withExpr {
StringTrimLeft(e.expr, Literal(trimString))
}
Usage
========== df.select(ltrim(lit(" 123"))).show() ==========
+-------------+
|ltrim( 123)|
+-------------+
| 123|
+-------------+
========== df.select(ltrim(lit("aaa123"), "a")).show() ==========
+---------------------------+
|TRIM(LEADING a FROM aaa123)|
+---------------------------+
| 123|
+---------------------------+
octet_length
def octet_length(e: Column): Column = withExpr { OctetLength(e.expr) }
regexp_extract/regexp_replace
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column = withExpr {
RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
}
def regexp_replace(e: Column, pattern: String, replacement: String): Column = withExpr {
RegExpReplace(e.expr, lit(pattern).expr, lit(replacement).expr)
}
def regexp_replace(e: Column, pattern: Column, replacement: Column): Column = withExpr {
RegExpReplace(e.expr, pattern.expr, replacement.expr)
}
Usage
========== df.select(regexp_extract(lit("abc123"), "(\d+)", 1)).show() ==========
+--------------------------------+
|regexp_extract(abc123, (\d+), 1)|
+--------------------------------+
| 123|
+--------------------------------+
========== df.select(regexp_replace(lit("abc123"), "(\d+)", "num")).show() ==========
+-------------------------------------+
|regexp_replace(abc123, (\d+), num, 1)|
+-------------------------------------+
| abcnum|
+-------------------------------------+
========== df.select(regexp_replace(lit("abc123"), lit("(\d+)"), lit("num"))).show() ==========
+-------------------------------------+
|regexp_replace(abc123, (\d+), num, 1)|
+-------------------------------------+
| abcnum|
+-------------------------------------+
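For readers who want to try the same pattern outside Spark, here is a minimal sketch with Scala's own Regex class. The names RegexSketch, extract, and replaceAll are hypothetical. Note that in an ordinary Scala string literal the backslash has to be escaped, so the pattern is written "(\\d+)". Returning an empty string when nothing matches is a choice made here to follow the documented behavior of regexp_extract.

```scala
// Hypothetical helper, not part of Spark.
object RegexSketch {
  // Return the given capture group of the first match, or "" when nothing matches.
  def extract(s: String, pattern: String, groupIdx: Int): String =
    pattern.r.findFirstMatchIn(s).map(_.group(groupIdx)).getOrElse("")

  // Replace every match of the pattern with the replacement text.
  def replaceAll(s: String, pattern: String, replacement: String): String =
    pattern.r.replaceAllIn(s, replacement)
}
```

extract("abc123", "(\\d+)", 1) yields "123" and replaceAll("abc123", "(\\d+)", "num") yields "abcnum", the same values as in the tables above.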
unbase64
def unbase64(e: Column): Column = withExpr { Unbase64(e.expr) }
Usage
========== df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show() ==========
+---------------------+
|unbase64(X'01020304')|
+---------------------+
|                   []|
+---------------------+
rpad
def rpad(str: Column, len: Int, pad: String): Column = withExpr {
StringRPad(str.expr, lit(len).expr, lit(pad).expr)
}
def rpad(str: Column, len: Int, pad: Array[Byte]): Column = withExpr {
new BinaryRPad(str.expr, lit(len).expr, lit(pad).expr)
}
Usage
========== df.select(rpad($"a", 10, " ")).show() ==========
+--------------+
|rpad(a, 10,  )|
+--------------+
|    abc       |
+--------------+
repeat
def repeat(str: Column, n: Int): Column = withExpr {
StringRepeat(str.expr, lit(n).expr)
}
Usage
========== df.select(repeat($"a", 3)).show() ==========
+------------+
|repeat(a, 3)|
+------------+
|   abcabcabc|
+------------+
rtrim
def rtrim(e: Column): Column = withExpr { StringTrimRight(e.expr) }
def rtrim(e: Column, trimString: String): Column = withExpr {
StringTrimRight(e.expr, Literal(trimString))
}
Usage
========== df.select(rtrim(lit("123 "))).show() ==========
+-------------+
|rtrim(123 )|
+-------------+
| 123|
+-------------+
========== df.select(rtrim(lit("123aaa"), "a")).show() ==========
+----------------------------+
|TRIM(TRAILING a FROM 123aaa)|
+----------------------------+
| 123|
+----------------------------+
soundex
def soundex(e: Column): Column = withExpr { SoundEx(e.expr) }
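Usage
soundex is an algorithm that converts any text string into an alphanumeric pattern describing its phonetic representation. Because it takes similar-sounding characters and syllables into account, strings can be compared by how they sound rather than by how they are spelled.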
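The idea can be sketched with the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digits, skip repeated codes, and pad the result to four characters. This is a simplified plain-Scala illustration that assumes ASCII letters only. It is not Spark's implementation, and the name SoundexSketch is hypothetical.

```scala
// Hypothetical helper, not part of Spark: simplified American Soundex for ASCII letters.
object SoundexSketch {
  private def code(c: Char): Char = c match {
    case 'B' | 'F' | 'P' | 'V'                         => '1'
    case 'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => '2'
    case 'D' | 'T'                                     => '3'
    case 'L'                                           => '4'
    case 'M' | 'N'                                     => '5'
    case 'R'                                           => '6'
    case 'H' | 'W'                                     => '-' // ignored, does not separate equal codes
    case _                                             => '0' // vowels: emit nothing, but separate equal codes
  }

  def soundex(s: String): String = {
    val letters = s.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
    if (letters.isEmpty) return ""
    val sb = new StringBuilder
    sb.append(letters.head) // the first letter is kept as-is
    var prev = code(letters.head)
    for (c <- letters.tail) {
      val k = code(c)
      if (sb.length < 4 && k != '-') {
        if (k != '0' && k != prev) sb.append(k) // skip vowels and adjacent duplicates
        prev = k
      }
    }
    while (sb.length < 4) sb.append('0') // pad to four characters
    sb.toString
  }
}
```

soundex("abc") gives A120 and soundex("aaaBb") gives A100. Names that sound alike collapse to the same code, for example "Robert" and "Rupert" both give R163.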
========== df.select(soundex($"a"), soundex($"b")).show() ==========
+----------+----------+
|soundex(a)|soundex(b)|
+----------+----------+
|      A120|      A100|
+----------+----------+
split
def split(str: Column, pattern: String): Column = withExpr {
StringSplit(str.expr, Literal(pattern), Literal(-1))
}
def split(str: Column, pattern: String, limit: Int): Column = withExpr {
StringSplit(str.expr, Literal(pattern), Literal(limit))
}
Usage
========== df.select(split(lit("a;b;c"), ";")).show() ==========
+-------------------+
|split(a;b;c, ;, -1)|
+-------------------+
| [a, b, c]|
+-------------------+
========== df.select(split(lit("a;b;c"), ";", 2)).show() ==========
+------------------+
|split(a;b;c, ;, 2)|
+------------------+
| [a, b;c]|
+------------------+
========== df.select(split(lit("a;b;c"), ";", 0)).show() ==========
+------------------+
|split(a;b;c, ;, 0)|
+------------------+
| [a, b, c]|
+------------------+
========== df.select(split(lit("a;b;c"), ";", -1)).show() ==========
+-------------------+
|split(a;b;c, ;, -1)|
+-------------------+
| [a, b, c]|
+-------------------+
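The effect of the limit argument can be tried without Spark through Java's String.split(regex, limit), which the sketch below wraps. The name SplitSketch.split is hypothetical. The comparison is only an analogue: a positive limit caps the number of entries and leaves the remainder in the last one, and a negative limit applies the pattern as many times as possible. For a limit of 0, Java additionally drops trailing empty strings, so that case is deliberately left out here.

```scala
// Hypothetical helper, not part of Spark: thin wrapper around java.lang.String.split.
object SplitSketch {
  def split(s: String, regex: String, limit: Int): List[String] =
    s.split(regex, limit).toList
}
```

split("a;b;c", ";", 2) returns List("a", "b;c") and split("a;b;c", ";", -1) returns List("a", "b", "c"), in line with the two corresponding tables above.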
substring/substring_index
def substring(str: Column, pos: Int, len: Int): Column = withExpr {
Substring(str.expr, lit(pos).expr, lit(len).expr)
}
def substring_index(str: Column, delim: String, count: Int): Column = withExpr {
SubstringIndex(str.expr, lit(delim).expr, lit(count).expr)
}
Usage
========== df.select(substring(lit("abcdef"), 2, 5)).show() ==========
+-----------------------+
|substring(abcdef, 2, 5)|
+-----------------------+
| bcdef|
+-----------------------+
========== df.select(substring_index(lit("www.shockang.com"), ".", 2)).show() ==========
+---------------------------------------+
|substring_index(www.shockang.com, ., 2)|
+---------------------------------------+
| www.shockang|
+---------------------------------------+
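The counting rule of substring_index can be sketched in plain Scala: a positive count keeps everything to the left of the count-th delimiter counted from the left, and a negative count keeps everything to the right of the count-th delimiter counted from the right. The name SubstringIndexSketch is hypothetical, and returning the whole string when there are fewer delimiters than requested is an assumption of this sketch.

```scala
// Hypothetical helper, not part of Spark.
object SubstringIndexSketch {
  def substringIndex(s: String, delim: String, count: Int): String = {
    if (delim.isEmpty || count == 0) return ""
    if (count > 0) {
      // Walk forward to the count-th delimiter and keep what is on its left.
      var idx = -1
      var n = count
      while (n > 0) {
        idx = s.indexOf(delim, idx + 1)
        if (idx < 0) return s // assumption: fewer delimiters than requested
        n -= 1
      }
      s.substring(0, idx)
    } else {
      // Walk backward to the count-th delimiter and keep what is on its right.
      var idx = s.length
      var n = -count
      while (n > 0) {
        idx = s.lastIndexOf(delim, idx - 1)
        if (idx < 0) return s // assumption: fewer delimiters than requested
        n -= 1
      }
      s.substring(idx + delim.length)
    }
  }
}
```

substringIndex("www.shockang.com", ".", 2) gives "www.shockang", as in the table above, while a count of -1 gives "com" and a count of -2 gives "shockang.com".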
overlay
def overlay(src: Column, replace: Column, pos: Column, len: Column): Column = withExpr {
Overlay(src.expr, replace.expr, pos.expr, len.expr)
}
def overlay(src: Column, replace: Column, pos: Column): Column = withExpr {
new Overlay(src.expr, replace.expr, pos.expr)
}
Usage
========== df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show() ==========
+--------------------------+
|overlay(abcdef, abc, 4, 1)|
+--------------------------+
| abcabcef|
+--------------------------+
========== df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show() ==========
+---------------------------+
|overlay(abcdef, abc, 4, -1)|
+---------------------------+
| abcabc|
+---------------------------+
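The two overlay outputs above can be reproduced with a few lines of plain Scala. In this sketch pos is 1-based, and a negative len stands for "not given", in which case the length of the replacement is used; this mirrors the -1 shown in the second column name. The name OverlaySketch is hypothetical and the code is an illustration, not Spark's implementation.

```scala
// Hypothetical helper, not part of Spark.
object OverlaySketch {
  def overlay(src: String, replace: String, pos: Int, len: Int = -1): String = {
    // When len is negative, overwrite as many characters as the replacement has.
    val n = if (len < 0) replace.length else len
    val start = pos - 1 // convert the 1-based position to a 0-based offset
    src.take(start) + replace + src.drop(start + n)
  }
}
```

overlay("abcdef", "abc", 4, 1) replaces only the single character d and gives "abcabcef"; overlay("abcdef", "abc", 4) overwrites three characters and gives "abcabc".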
sentences
def sentences(string: Column, language: Column, country: Column): Column = withExpr {
Sentences(string.expr, language.expr, country.expr)
}
def sentences(string: Column): Column = withExpr {
Sentences(string.expr)
}
Usage
========== df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show() ==========
+---------------------------------------------+
|sentences(我们都有一个家,名字叫中国, zh, CN)|
+---------------------------------------------+
| [[我们都有一个家, 名字叫中国]]|
+---------------------------------------------+
========== df.select(sentences(lit("我们都有一个家,名字叫中国"))).show() ==========
+-----------------------------------------+
|sentences(我们都有一个家,名字叫中国, , )|
+-----------------------------------------+
| [[我们都有一个家, 名字叫中国]]|
+-----------------------------------------+
translate
def translate(src: Column, matchingString: String, replaceString: String): Column = withExpr {
StringTranslate(src.expr, lit(matchingString).expr, lit(replaceString).expr)
}
Usage
========== df.select(translate(lit("abcdef"), "def", "123")).show() ==========
+---------------------------+
|translate(abcdef, def, 123)|
+---------------------------+
| abc123|
+---------------------------+
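translate works character by character rather than on substrings: the i-th character of matchingString is mapped to the i-th character of replaceString. The plain-Scala sketch below illustrates that mapping. The name TranslateSketch is hypothetical, and dropping characters of matchingString that have no counterpart in replaceString is an assumption of this sketch about the intended behavior.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark.
object TranslateSketch {
  def translate(src: String, matching: String, replace: String): String = {
    // Build the character mapping; the first occurrence of a character wins.
    val dict = mutable.Map.empty[Char, Option[Char]]
    for (i <- matching.indices) {
      val m = matching(i)
      if (!dict.contains(m))
        dict(m) = if (i < replace.length) Some(replace(i)) else None
    }
    val sb = new StringBuilder
    for (c <- src) dict.get(c) match {
      case Some(Some(r)) => sb.append(r) // mapped character
      case Some(None)    => ()           // assumption: no counterpart, so drop it
      case None          => sb.append(c) // not in matching: keep unchanged
    }
    sb.toString
  }
}
```

translate("abcdef", "def", "123") gives "abc123", matching the table above. Under the stated assumption, translate("abcdef", "def", "1") gives "abc1".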
trim
def trim(e: Column): Column = withExpr { StringTrim(e.expr) }
def trim(e: Column, trimString: String): Column = withExpr {
StringTrim(e.expr, Literal(trimString))
}
Usage
========== df.select(trim(lit(" abc "))).show() ==========
+---------------+
|trim( abc )|
+---------------+
| abc|
+---------------+
========== df.select(trim(lit("aaabcaaaa"), "a")).show() ==========
+---------------------------+
|TRIM(BOTH a FROM aaabcaaaa)|
+---------------------------+
| bc|
+---------------------------+
upper
def upper(e: Column): Column = withExpr { Upper(e.expr) }
Usage
========== df.select(upper($"b")).show() ==========
+--------+
|upper(b)|
+--------+
|   AAABB|
+--------+
Practice
Code
package com.shockang.study.spark.sql.functions
import com.shockang.study.spark.util.Utils.formatPrint
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object StringFunctionsExample {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.OFF)
val spark = SparkSession.builder().appName("StringFunctionsExample").master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq(("abc", "aaaBb", "")).toDF("a", "b", "c")
// ascii
formatPrint("""df.select(ascii($"a"), ascii($"b"), ascii($"c")).show()""")
df.select(ascii($"a"), ascii($"b"), ascii($"c")).show()
// base64
formatPrint("""df.select(base64($"a"), base64($"b"), base64($"c")).show()""")
df.select(base64($"a"), base64($"b"), base64($"c")).show()
// concat_ws
formatPrint("""df.select(concat_ws(";", $"a", $"b", $"c")).show()""")
df.select(concat_ws(";", $"a", $"b", $"c")).show()
// decode/encode
formatPrint("""df.select(decode($"a", "utf-8")).show()""")
df.select(decode($"a", "utf-8")).show()
formatPrint("""df.select(encode($"a", "utf-8")).show()""")
df.select(encode($"a", "utf-8")).show()
// format_number/format_string
formatPrint("""df.select(format_number(lit(5L), 4)).show()""")
df.select(format_number(lit(5L), 4)).show()
formatPrint("""df.select(format_number(lit(1.toByte), 4)).show()""")
df.select(format_number(lit(1.toByte), 4)).show()
formatPrint("""df.select(format_number(lit(2.toShort), 4)).show()""")
df.select(format_number(lit(2.toShort), 4)).show()
formatPrint("""df.select(format_number(lit(3.1322.toFloat), 4)).show()""")
df.select(format_number(lit(3.1322.toFloat), 4)).show()
formatPrint("""df.select(format_number(lit(4), 4)).show()""")
df.select(format_number(lit(4), 4)).show()
formatPrint("""df.select(format_number(lit(5L), 4)).show()""")
df.select(format_number(lit(5L), 4)).show()
formatPrint("""df.select(format_number(lit(6.48173), 4)).show()""")
df.select(format_number(lit(6.48173), 4)).show()
formatPrint("""df.select(format_number(lit(BigDecimal("7.128381")), 4)).show()""")
df.select(format_number(lit(BigDecimal("7.128381")), 4)).show()
formatPrint("""df.select(format_string("aa%d%s", lit(123), lit("cc"))).show()""")
df.select(format_string("aa%d%s", lit(123), lit("cc"))).show()
// initcap
formatPrint("""df.select(initcap($"a"), initcap($"b"), initcap($"c")).show()""")
df.select(initcap($"a"), initcap($"b"), initcap($"c")).show()
// instr
formatPrint("""df.select(instr($"b", "aa")).show()""")
df.select(instr($"b", "aa")).show()
// length
formatPrint("""df.select(length($"a"), length($"b"), length($"c")).show()""")
df.select(length($"a"), length($"b"), length($"c")).show()
// lower
formatPrint("""df.select(lower($"b")).show()""")
df.select(lower($"b")).show()
// levenshtein
formatPrint("""df.select(levenshtein($"a", $"b")).show()""")
df.select(levenshtein($"a", $"b")).show()
// locate
formatPrint("""df.select(locate("aa", $"b")).show()""")
df.select(locate("aa", $"b")).show()
formatPrint("""df.select(locate("aa", $"b", 2)).show()""")
df.select(locate("aa", $"b", 2)).show()
// lpad
formatPrint("""df.select(lpad($"a", 10, " ")).show()""")
df.select(lpad($"a", 10, " ")).show()
// ltrim
formatPrint("""df.select(ltrim(lit(" 123"))).show()""")
df.select(ltrim(lit(" 123"))).show()
formatPrint("""df.select(ltrim(lit("aaa123"), "a")).show()""")
df.select(ltrim(lit("aaa123"), "a")).show()
// regexp_extract/regexp_replace
// In an ordinary Scala string literal the backslash must be escaped: "(\\d+)"
formatPrint("""df.select(regexp_extract(lit("abc123"), "(\d+)", 1)).show()""")
df.select(regexp_extract(lit("abc123"), "(\\d+)", 1)).show()
formatPrint("""df.select(regexp_replace(lit("abc123"), "(\d+)", "num")).show()""")
df.select(regexp_replace(lit("abc123"), "(\\d+)", "num")).show()
formatPrint("""df.select(regexp_replace(lit("abc123"), lit("(\d+)"), lit("num"))).show()""")
df.select(regexp_replace(lit("abc123"), lit("(\\d+)"), lit("num"))).show()
// unbase64
formatPrint("""df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show()""")
df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show()
// rpad
formatPrint("""df.select(rpad($"a", 10, " ")).show()""")
df.select(rpad($"a", 10, " ")).show()
// repeat
formatPrint("""df.select(repeat($"a", 3)).show()""")
df.select(repeat($"a", 3)).show()
// rtrim
formatPrint("""df.select(rtrim(lit("123 "))).show()""")
df.select(rtrim(lit("123 "))).show()
formatPrint("""df.select(rtrim(lit("123aaa"), "a")).show()""")
df.select(rtrim(lit("123aaa"), "a")).show()
// soundex
formatPrint("""df.select(soundex($"a"), soundex($"b")).show()""")
df.select(soundex($"a"), soundex($"b")).show()
// split
formatPrint("""df.select(split(lit("a;b;c"), ";")).show()""")
df.select(split(lit("a;b;c"), ";")).show()
formatPrint("""df.select(split(lit("a;b;c"), ";", 2)).show()""")
df.select(split(lit("a;b;c"), ";", 2)).show()
formatPrint("""df.select(split(lit("a;b;c"), ";", 0)).show()""")
df.select(split(lit("a;b;c"), ";", 0)).show()
formatPrint("""df.select(split(lit("a;b;c"), ";", -1)).show()""")
df.select(split(lit("a;b;c"), ";", -1)).show()
// substring/substring_index
formatPrint("""df.select(substring(lit("abcdef"), 2, 5)).show()""")
df.select(substring(lit("abcdef"), 2, 5)).show()
formatPrint("""df.select(substring_index(lit("www.shockang.com"), ".", 2)).show()""")
df.select(substring_index(lit("www.shockang.com"), ".", 2)).show()
// overlay
formatPrint("""df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show()""")
df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show()
formatPrint("""df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show()""")
df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show()
// sentences
formatPrint("""df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show()""")
df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show()
formatPrint("""df.select(sentences(lit("我们都有一个家,名字叫中国"))).show()""")
df.select(sentences(lit("我们都有一个家,名字叫中国"))).show()
// translate
formatPrint("""df.select(translate(lit("abcdef"), "def", "123")).show()""")
df.select(translate(lit("abcdef"), "def", "123")).show()
// trim
formatPrint("""df.select(trim(lit(" abc "))).show()""")
df.select(trim(lit(" abc "))).show()
formatPrint("""df.select(trim(lit("aaabcaaaa"), "a")).show()""")
df.select(trim(lit("aaabcaaaa"), "a")).show()
// upper
formatPrint("""df.select(upper($"b")).show()""")
df.select(upper($"b")).show()
}
}
Output
========== df.select(ascii($"a"), ascii($"b"), ascii($"c")).show() ==========
+--------+--------+--------+
|ascii(a)|ascii(b)|ascii(c)|
+--------+--------+--------+
| 97| 97| 0|
+--------+--------+--------+
========== df.select(base64($"a"), base64($"b"), base64($"c")).show() ==========
+---------+---------+---------+
|base64(a)|base64(b)|base64(c)|
+---------+---------+---------+
| YWJj| YWFhQmI=| |
+---------+---------+---------+
========== df.select(concat_ws(";", $"a", $"b", $"c")).show() ==========
+---------------------+
|concat_ws(;, a, b, c)|
+---------------------+
| abc;aaaBb;|
+---------------------+
========== df.select(decode($"a", "utf-8")).show() ==========
+----------------------+
|stringdecode(a, utf-8)|
+----------------------+
| abc|
+----------------------+
========== df.select(encode($"a", "utf-8")).show() ==========
+----------------+
|encode(a, utf-8)|
+----------------+
| [61 62 63]|
+----------------+
========== df.select(format_number(lit(5L), 4)).show() ==========
+-------------------+
|format_number(5, 4)|
+-------------------+
| 5.0000|
+-------------------+
========== df.select(format_number(lit(1.toByte), 4)).show() ==========
+-------------------+
|format_number(1, 4)|
+-------------------+
| 1.0000|
+-------------------+
========== df.select(format_number(lit(2.toShort), 4)).show() ==========
+-------------------+
|format_number(2, 4)|
+-------------------+
| 2.0000|
+-------------------+
========== df.select(format_number(lit(3.1322.toFloat), 4)).show() ==========
+------------------------+
|format_number(3.1322, 4)|
+------------------------+
| 3.1322|
+------------------------+
========== df.select(format_number(lit(4), 4)).show() ==========
+-------------------+
|format_number(4, 4)|
+-------------------+
| 4.0000|
+-------------------+
========== df.select(format_number(lit(5L), 4)).show() ==========
+-------------------+
|format_number(5, 4)|
+-------------------+
| 5.0000|
+-------------------+
========== df.select(format_number(lit(6.48173), 4)).show() ==========
+-------------------------+
|format_number(6.48173, 4)|
+-------------------------+
| 6.4817|
+-------------------------+
========== df.select(format_number(lit(BigDecimal("7.128381")), 4)).show() ==========
+--------------------------+
|format_number(7.128381, 4)|
+--------------------------+
| 7.1284|
+--------------------------+
========== df.select(format_string("aa%d%s", lit(123), lit("cc"))).show() ==========
+------------------------------+
|format_string(aa%d%s, 123, cc)|
+------------------------------+
| aa123cc|
+------------------------------+
========== df.select(initcap($"a"), initcap($"b"), initcap($"c")).show() ==========
+----------+----------+----------+
|initcap(a)|initcap(b)|initcap(c)|
+----------+----------+----------+
| Abc| Aaabb| |
+----------+----------+----------+
========== df.select(instr($"b", "aa")).show() ==========
+------------+
|instr(b, aa)|
+------------+
| 1|
+------------+
========== df.select(length($"a"), length($"b"), length($"c")).show() ==========
+---------+---------+---------+
|length(a)|length(b)|length(c)|
+---------+---------+---------+
| 3| 5| 0|
+---------+---------+---------+
========== df.select(lower($"b")).show() ==========
+--------+
|lower(b)|
+--------+
| aaabb|
+--------+
========== df.select(levenshtein($"a", $"b")).show() ==========
+-----------------+
|levenshtein(a, b)|
+-----------------+
| 4|
+-----------------+
========== df.select(locate("aa", $"b")).show() ==========
+----------------+
|locate(aa, b, 1)|
+----------------+
| 1|
+----------------+
========== df.select(locate("aa", $"b", 2)).show() ==========
+----------------+
|locate(aa, b, 2)|
+----------------+
| 2|
+----------------+
========== df.select(lpad($"a", 10, " ")).show() ==========
+--------------+
|lpad(a, 10, )|
+--------------+
| abc|
+--------------+
========== df.select(ltrim(lit(" 123"))).show() ==========
+-------------+
|ltrim( 123)|
+-------------+
| 123|
+-------------+
========== df.select(ltrim(lit("aaa123"), "a")).show() ==========
+---------------------------+
|TRIM(LEADING a FROM aaa123)|
+---------------------------+
| 123|
+---------------------------+
========== df.select(regexp_extract(lit("abc123"), "(\d+)", 1)).show() ==========
+--------------------------------+
|regexp_extract(abc123, (\d+), 1)|
+--------------------------------+
| 123|
+--------------------------------+
========== df.select(regexp_replace(lit("abc123"), "(\d+)", "num")).show() ==========
+-------------------------------------+
|regexp_replace(abc123, (\d+), num, 1)|
+-------------------------------------+
| abcnum|
+-------------------------------------+
========== df.select(regexp_replace(lit("abc123"), lit("(\d+)"), lit("num"))).show() ==========
+-------------------------------------+
|regexp_replace(abc123, (\d+), num, 1)|
+-------------------------------------+
| abcnum|
+-------------------------------------+
========== df.select(unbase64(typedlit(Array[Byte](1, 2, 3, 4)))).show() ==========
+---------------------+
|unbase64(X'01020304')|
+---------------------+
| []|
+---------------------+
========== df.select(rpad($"a", 10, " ")).show() ==========
+--------------+
|rpad(a, 10, )|
+--------------+
| abc |
+--------------+
========== df.select(repeat($"a", 3)).show() ==========
+------------+
|repeat(a, 3)|
+------------+
| abcabcabc|
+------------+
========== df.select(rtrim(lit("123 "))).show() ==========
+-------------+
|rtrim(123 )|
+-------------+
| 123|
+-------------+
========== df.select(rtrim(lit("123aaa"), "a")).show() ==========
+----------------------------+
|TRIM(TRAILING a FROM 123aaa)|
+----------------------------+
| 123|
+----------------------------+
========== df.select(soundex($"a"), soundex($"b")).show() ==========
+----------+----------+
|soundex(a)|soundex(b)|
+----------+----------+
| A120| A100|
+----------+----------+
========== df.select(split(lit("a;b;c"), ";")).show() ==========
+-------------------+
|split(a;b;c, ;, -1)|
+-------------------+
| [a, b, c]|
+-------------------+
========== df.select(split(lit("a;b;c"), ";", 2)).show() ==========
+------------------+
|split(a;b;c, ;, 2)|
+------------------+
| [a, b;c]|
+------------------+
========== df.select(split(lit("a;b;c"), ";", 0)).show() ==========
+------------------+
|split(a;b;c, ;, 0)|
+------------------+
| [a, b, c]|
+------------------+
========== df.select(split(lit("a;b;c"), ";", -1)).show() ==========
+-------------------+
|split(a;b;c, ;, -1)|
+-------------------+
| [a, b, c]|
+-------------------+
========== df.select(substring(lit("abcdef"), 2, 5)).show() ==========
+-----------------------+
|substring(abcdef, 2, 5)|
+-----------------------+
| bcdef|
+-----------------------+
========== df.select(substring_index(lit("www.shockang.com"), ".", 2)).show() ==========
+---------------------------------------+
|substring_index(www.shockang.com, ., 2)|
+---------------------------------------+
| www.shockang|
+---------------------------------------+
========== df.select(overlay(lit("abcdef"), lit("abc"), lit(4), lit(1))).show() ==========
+--------------------------+
|overlay(abcdef, abc, 4, 1)|
+--------------------------+
| abcabcef|
+--------------------------+
========== df.select(overlay(lit("abcdef"), lit("abc"), lit(4))).show() ==========
+---------------------------+
|overlay(abcdef, abc, 4, -1)|
+---------------------------+
| abcabc|
+---------------------------+
========== df.select(sentences(lit("我们都有一个家,名字叫中国"), lit("zh"), lit("CN"))).show() ==========
+---------------------------------------------+
|sentences(我们都有一个家,名字叫中国, zh, CN)|
+---------------------------------------------+
| [[我们都有一个家, 名字叫中国]]|
+---------------------------------------------+
========== df.select(sentences(lit("我们都有一个家,名字叫中国"))).show() ==========
+-----------------------------------------+
|sentences(我们都有一个家,名字叫中国, , )|
+-----------------------------------------+
| [[我们都有一个家, 名字叫中国]]|
+-----------------------------------------+
========== df.select(translate(lit("abcdef"), "def", "123")).show() ==========
+---------------------------+
|translate(abcdef, def, 123)|
+---------------------------+
| abc123|
+---------------------------+
========== df.select(trim(lit(" abc "))).show() ==========
+---------------+
|trim( abc )|
+---------------+
| abc|
+---------------+
========== df.select(trim(lit("aaabcaaaa"), "a")).show() ==========
+---------------------------+
|TRIM(BOTH a FROM aaabcaaaa)|
+---------------------------+
| bc|
+---------------------------+
========== df.select(upper($"b")).show() ==========
+--------+
|upper(b)|
+--------+
| AAABB|
+--------+



