Skip to main content

Spark SQL Text Functions and Applications

1. Basic String Operations

Purpose

Function

Example

Result

Check if null value

isnull()


true/false

Check if not null value

isnotnull()


true/false

Check null value and replace null value

ifnull(expr1, expr2)

ifnull(null,0)

0

Return first non-null value

coalesce(expr1, expr2, ...)

coalesce(NULL, 1, NULL)

1

Return string length

length()

length('ABC DE')

6

Convert English string to title case

initcap()

initcap('ABC DE')

Abc De

Convert English string to lowercase

lcase()/lower()

lcase('ABC'); lower('ABC')

abc

Convert English string to uppercase

ucase()/upper()

ucase('abc'); upper('abc')

ABC

Reverse string

reverse()

reverse('abc')

cba

Convert other types to string

string()

string(123)

123

cast( expr as string)

cast(123 as string)

123

Purpose

Function

Example

Result

Return position of first matched string

locate(substr, str[, pos]) [, pos] used to specify starting search position, can be omitted.

locate('n','Ann',3)

3

instr(str, substr)

instr('Ann','n')

2

find_in_set(str, str_array) String to be searched is comma-separated string.

find_in_set('ab','abc,b,ab,c,def')

3

Return first substring matched by regex

regexp_extract(str, regexp[, idx])

regexp_extract('*A1B2*C3**','[A-Z]+',0)

A

Fuzzy match string with specified pattern, return true/false. Need to use with wildcards.

like

'A&B' like '%&%'; 'A&B' like '&%'

true; false

Fuzzy match string with specified pattern, return true/false. Need to use with regex.

rlike

'PURE&MILD' rlike '[A-Z]+\&[A-Z]+'

true

3. String Extraction

PurposeFunctionExampleResult
String extraction, extract length [, len] omitted when extract all characters from specified positionsubstr(str, pos[, len]) ; substring(str, pos[, len])substring('123abcABC', 2, 3); substr('Spark SQL', -3)23a; SQL
Return substring before the nth occurrence of delimiter. When n is negative, return from the right -nth delimiter to all characters on the right.substring_index(str, delim, n)substring_index('a.b.c.d.e', '.', 2); substring_index('a.b.c.d.e', '.', -2)a.b; d.e
Extract fixed length string from left beginningleft(str, len)left('Spark SQL', 3)Spa
Extract fixed length string from right endright(str, len)right('Spark SQL', 3)SQL
Remove spaces at string beginning (left side)ltrim(str) ; trim(LEADING FROM str)ltrim(' Spark SQL')Spark SQL
Remove spaces at string end (right side)rtrim(str) ; trim(TRAILING FROM str)rtrim('Spark SQL ')Spark SQL
Remove spaces at string beginning and end (both sides)trim(str) ; trim(BOTH FROM str)trim(' Spark SQL ')Spark SQL
Remove specified characters at string beginning and end (both sides)trim(trimStr FROM str) remove both sides; trim(LEADING trimStr FROM str) remove left side; trim(TRAILING trimStr FROM str) remove right sidetrim('*' from 'ABC**')AB*C

4. String Replacement

PurposeFunctionExampleResult
Replace all matched characters. [, replace] omitted when remove all matched characters.replace(str, search[, replace])replace('ABCabc', 'abc', 'DEF')ABCDEF
Multi-character replacement. For input, replace each character in from with corresponding character in to. If from string is longer than to string, extra characters in from will be deleted.translate(input, from, to)translate('AaBbCc', 'abc', '123'); translate('AaBbCc', 'abc', '12')A1B2C3; A1B2C
Replace character at fixed position, can specify replacement lengthoverlay(input, replace, pos[, len])overlay('Spark SQL' ,'tructured' ,2,4);overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4)Structured SQL
Regex match replace all matched charactersregexp_replace(str, regexp, rep)REGEXP_REPLACE('A1B2C3**','[\\d\*]','') Remove all digits and *ABC

5. String Split and Concatenation

Purpose

Function

Example

Result

Split string with single or multiple characters, return array. Delimiter supports regex, limit controls number of elements after split, omitted means split all

split(str, regex, limit)

split('A1B2C','\\d'); split('A1B2C','\\d',2); split('A-B-C','-')

[A, B, C]; [A, B2C]; [A, B, C]


String concatenation

concat

concat('Spark', 'SQL')

SparkSQL

expr1 || expr2

'Spark' || 'SQL'

SparkSQL

Concatenate strings or arrays with delimiter

concat_ws(sep[, str | array(str)]+)

concat_ws('-', 'Spark', 'SQL')

Spark-SQL

Return new string after repeating string for specified number of times

repeat(str, n)

repeat('ABC', 2)

ABCABC

Application Cases

【Requirement】Text field "Sprint", need to extract respectively:

①Middle part "schedule", 6 to 8 character string composed of numbers and letters;

②Content in parentheses "test version";

③Based on "test version" calculate "official version";

【Logic】The number after the last decimal point is the minor version number, when it's 0, the preceding part is the official version number, when it's not 0, the middle number plus 1.

【Final Effect】

image.png

①"Schedule" implementation method (choose one):

1. substr([Sprint],8,8)
2. left(replace([Sprint],'GUANDATA'),8)
3. regexp_extract([Sprint], '(\\d{4,6}\\w{2})', 1)
4. element_at(flatten(sentences([Sprint])),2)

②Extract content in parentheses "test version" implementation method (choose one):

1. regexp_extract([Sprint], '(\\d\\.\\d{1,2}\\.\\d)', 1)
2. case when instr([Sprint],'(')>0 then replace(substr([Sprint],instr([Sprint],'(')+1),')') end
3. case when [Sprint] like '%(%' then substring_index(translate([Sprint],'()','-'),'-',-1) end
4. element_at(flatten(sentences([Sprint])),3)

③Based on "test version" calculate "official version" implementation method:

case when right([test version],1)=0 then substring_index([test version],'.',2)
else concat(left([test version],2),int(substr(substring_index([test version],'.',2),3)+1))
end