Spark SQL Text Functions and Applications
1. Basic String Operations
Purpose | Function | Example | Result |
Check if null value | isnull() | true/false | |
Check if not null value | isnotnull() | true/false | |
Check null value and replace null value | ifnull(expr1, expr2) | ifnull(null,0) | 0 |
Return first non-null value | coalesce(expr1, expr2, ...) | coalesce(NULL, 1, NULL) | 1 |
Return string length | length() | length('ABC DE') | 6 |
Convert English string to title case | initcap() | initcap('ABC DE') | Abc De |
Convert English string to lowercase | lcase()/lower() | lcase('ABC'); lower('ABC') | abc |
Convert English string to uppercase | ucase()/upper() | ucase('abc'); upper('abc') | ABC |
Reverse string | reverse() | reverse('abc') | cba |
Convert other types to string | string() | string(123) | 123 |
cast( expr as string) | cast(123 as string) | 123 |
2. String Search
Purpose | Function | Example | Result |
Return position of first matched string | locate(substr, str[, pos]) [, pos] used to specify starting search position, can be omitted. | locate('n','Ann',3) | 3 |
instr(str, substr) | instr('Ann','n') | 2 | |
find_in_set(str, str_array) String to be searched is comma-separated string. | find_in_set('ab','abc,b,ab,c,def') | 3 | |
Return first substring matched by regex | regexp_extract(str, regexp[, idx]) | regexp_extract('*A1B2*C3**','[A-Z]+',0) | A |
Fuzzy match string with specified pattern, return true/false. Need to use with wildcards. | like | 'A&B' like '%&%'; 'A&B' like '&%' | true; false |
Fuzzy match string with specified pattern, return true/false. Need to use with regex. | rlike | 'PURE&MILD' rlike '[A-Z]+\&[A-Z]+' | true |
3. String Extraction
Purpose | Function | Example | Result |
String extraction, extract length [, len] omitted when extract all characters from specified position | substr(str, pos[, len]) ; substring(str, pos[, len]) | substring('123abcABC', 2, 3); substr('Spark SQL', -3) | 23a; SQL |
Return substring before the nth occurrence of delimiter. When n is negative, return from the right -nth delimiter to all characters on the right. | substring_index(str, delim, n) | substring_index('a.b.c.d.e', '.', 2); substring_index('a.b.c.d.e', '.', -2) | a.b; d.e |
Extract fixed length string from left beginning | left(str, len) | left('Spark SQL', 3) | Spa |
Extract fixed length string from right end | right(str, len) | right('Spark SQL', 3) | SQL |
Remove spaces at string beginning (left side) | ltrim(str) ; trim(LEADING FROM str) | ltrim(' Spark SQL') | Spark SQL |
Remove spaces at string end (right side) | rtrim(str) ; trim(TRAILING FROM str) | rtrim('Spark SQL ') | Spark SQL |
Remove spaces at string beginning and end (both sides) | trim(str) ; trim(BOTH FROM str) | trim(' Spark SQL ') | Spark SQL |
Remove specified characters at string beginning and end (both sides) | trim(trimStr FROM str) remove both sides; trim(LEADING trimStr FROM str) remove left side; trim(TRAILING trimStr FROM str) remove right side | trim('*' from 'ABC**') | AB*C |
4. String Replacement
Purpose | Function | Example | Result |
Replace all matched characters. [, replace] omitted when remove all matched characters. | replace(str, search[, replace]) | replace('ABCabc', 'abc', 'DEF') | ABCDEF |
Multi-character replacement. For input, replace each character in from with corresponding character in to. If from string is longer than to string, extra characters in from will be deleted. | translate(input, from, to) | translate('AaBbCc', 'abc', '123'); translate('AaBbCc', 'abc', '12') | A1B2C3; A1B2C |
Replace character at fixed position, can specify replacement length | overlay(input, replace, pos[, len]) | overlay('Spark SQL' ,'tructured' ,2,4);overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4) | Structured SQL |
Regex match replace all matched characters | regexp_replace(str, regexp, rep) | REGEXP_REPLACE('A1B2C3**','[\\d\*]','') Remove all digits and * | ABC |
5. String Split and Concatenation
Purpose | Function | Example | Result |
Split string with single or multiple characters, return array. Delimiter supports regex, limit controls number of elements after split, omitted means split all | split(str, regex, limit) | split('A1B2C','\\d'); split('A1B2C','\\d',2); split('A-B-C','-') | [A, B, C]; [A, B2C]; [A, B, C] |
String concatenation | concat | concat('Spark', 'SQL') | SparkSQL |
expr1 || expr2 | 'Spark' || 'SQL' | SparkSQL | |
Concatenate strings or arrays with delimiter | concat_ws(sep[, str | array(str)]+) | concat_ws('-', 'Spark', 'SQL') | Spark-SQL |
Return new string after repeating string for specified number of times | repeat(str, n) | repeat('ABC', 2) | ABCABC |
Application Cases
【Requirement】Text field "Sprint", need to extract respectively:
①Middle part "schedule", 6 to 8 character string composed of numbers and letters;
②Content in parentheses "test version";
③Based on "test version" calculate "official version";
【Logic】The number after the last decimal point is the minor version number, when it's 0, the preceding part is the official version number, when it's not 0, the middle number plus 1.
【Final Effect】

①"Schedule" implementation method (choose one):
1. substr([Sprint],8,8)
2. left(replace([Sprint],'GUANDATA'),8)
3. regexp_extract([Sprint], '(\\d{4,6}\\w{2})', 1)
4. element_at(flatten(sentences([Sprint])),2)
②Extract content in parentheses "test version" implementation method (choose one):
1. regexp_extract([Sprint], '(\\d\\.\\d{1,2}\\.\\d)', 1)
2. case when instr([Sprint],'(')>0 then replace(substr([Sprint],instr([Sprint],'(')+1),')') end
3. case when [Sprint] like '%(%' then substring_index(translate([Sprint],'()','-'),'-',-1) end
4. element_at(flatten(sentences([Sprint])),3)
③Based on "test version" calculate "official version" implementation method:
case when right([test version],1)=0 then substring_index([test version],'.',2)
else concat(left([test version],2),int(substr(substring_index([test version],'.',2),3)+1))
end