Spark SQL Text Functions and Applications

1. Basic String Operations

Purpose	Function	Example	Result
Check if null value	isnull()		true/false
Check if not null value	isnotnull()		true/false
Check null value and replace null value	ifnull(expr1, expr2)	ifnull(null,0)	0
Return first non-null value	coalesce(expr1, expr2, ...)	coalesce(NULL, 1, NULL)	1
Return string length	length()	length('ABC DE')	6
Convert English string to title case	initcap()	initcap('ABC DE')	Abc De
Convert English string to lowercase	lcase()/lower()	lcase('ABC'); lower('ABC')	abc
Convert English string to uppercase	ucase()/upper()	ucase('abc'); upper('abc')	ABC
Reverse string	reverse()	reverse('abc')	cba
Convert other types to string	string()	string(123)	123
Convert other types to string	cast( expr as string)	cast(123 as string)	123

2. String Search

Purpose	Function	Example	Result
Return position of first matched string	locate(substr, str[, pos]) [, pos] used to specify starting search position, can be omitted.	locate('n','Ann',3)	3
	instr(str, substr)	instr('Ann','n')	2
	find_in_set(str, str_array) String to be searched is comma-separated string.	find_in_set('ab','abc,b,ab,c,def')	3
Return first substring matched by regex	regexp_extract(str, regexp[, idx])	regexp_extract('A1B2C3**','[A-Z]+',0)	A
Fuzzy match string with specified pattern, return true/false. Need to use with wildcards.	like	'A&B' like '%&%'; 'A&B' like '&%'	true; false
Fuzzy match string with specified pattern, return true/false. Need to use with regex.	rlike	'PURE&MILD' rlike '[A-Z]+\&[A-Z]+'	true

3. String Extraction

Purpose	Function	Example	Result
String extraction, extract length [, len] omitted when extract all characters from specified position	substr(str, pos[, len]) ; substring(str, pos[, len])	substring('123abcABC', 2, 3); substr('Spark SQL', -3)	23a; SQL
Return substring before the nth occurrence of delimiter. When n is negative, return from the right -nth delimiter to all characters on the right.	substring_index(str, delim, n)	substring_index('a.b.c.d.e', '.', 2); substring_index('a.b.c.d.e', '.', -2)	a.b; d.e
Extract fixed length string from left beginning	left(str, len)	left('Spark SQL', 3)	Spa
Extract fixed length string from right end	right(str, len)	right('Spark SQL', 3)	SQL
Remove spaces at string beginning (left side)	ltrim(str) ; trim(LEADING FROM str)	ltrim(' Spark SQL')	Spark SQL
Remove spaces at string end (right side)	rtrim(str) ; trim(TRAILING FROM str)	rtrim('Spark SQL ')	Spark SQL
Remove spaces at string beginning and end (both sides)	trim(str) ; trim(BOTH FROM str)	trim(' Spark SQL ')	Spark SQL
Remove specified characters at string beginning and end (both sides)	trim(trimStr FROM str) remove both sides; trim(LEADING trimStr FROM str) remove left side; trim(TRAILING trimStr FROM str) remove right side	trim('' from 'ABC*')	AB*C

4. String Replacement

Purpose	Function	Example	Result
Replace all matched characters. [, replace] omitted when remove all matched characters.	replace(str, search[, replace])	replace('ABCabc', 'abc', 'DEF')	ABCDEF
Multi-character replacement. For input, replace each character in from with corresponding character in to. If from string is longer than to string, extra characters in from will be deleted.	translate(input, from, to)	translate('AaBbCc', 'abc', '123'); translate('AaBbCc', 'abc', '12')	A1B2C3; A1B2C
Replace character at fixed position, can specify replacement length	overlay(input, replace, pos[, len])	overlay('Spark SQL' ,'tructured' ,2,4);overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4)	Structured SQL
Regex match replace all matched characters	regexp_replace(str, regexp, rep)	REGEXP_REPLACE('A1B2C3*','[\\d\]','') Remove all digits and *	ABC

5. String Split and Concatenation

Purpose	Function	Example	Result
Split string with single or multiple characters, return array. Delimiter supports regex, limit controls number of elements after split, omitted means split all	split(str, regex, limit)	split('A1B2C','\\d'); split('A1B2C','\\d',2); split('A-B-C','-')	[A, B, C]; [A, B2C]; [A, B, C]
String concatenation	concat	concat('Spark', 'SQL')	SparkSQL
String concatenation	expr1 \|\| expr2	'Spark' \|\| 'SQL'	SparkSQL
Concatenate strings or arrays with delimiter	concat_ws(sep[, str \| array(str)]+)	concat_ws('-', 'Spark', 'SQL')	Spark-SQL
Return new string after repeating string for specified number of times	repeat(str, n)	repeat('ABC', 2)	ABCABC

Application Cases

【Requirement】Text field "Sprint", need to extract respectively:

①Middle part "schedule", 6 to 8 character string composed of numbers and letters;

②Content in parentheses "test version";

③Based on "test version" calculate "official version";

【Logic】The number after the last decimal point is the minor version number, when it's 0, the preceding part is the official version number, when it's not 0, the middle number plus 1.

【Final Effect】

①"Schedule" implementation method (choose one):

substr([Sprint],8,8)
left(replace([Sprint],'GUANDATA'),8)
regexp_extract([Sprint], '(\\d{4,6}\\w{2})', 1)
element_at(flatten(sentences([Sprint])),2)

②Extract content in parentheses "test version" implementation method (choose one):

regexp_extract([Sprint], '(\\d\\.\\d{1,2}\\.\\d)', 1)
case when instr([Sprint],'(')>0 then replace(substr([Sprint],instr([Sprint],'(')+1),')') end
case when [Sprint] like '%(%' then substring_index(translate([Sprint],'()','-'),'-',-1) end
element_at(flatten(sentences([Sprint])),3)

③Based on "test version" calculate "official version" implementation method:

case when right([test version],1)=0 then substring_index([test version],'.',2)
else concat(left([test version],2),int(substr(substring_index([test version],'.',2),3)+1))
end

1. Basic String Operations​

2. String Search​

3. String Extraction​

4. String Replacement​

5. String Split and Concatenation​

Application Cases​