Spark SQL Array Processing Functions and Applications

Definition

Array (Array) is an ordered sequence of elements, and the individual variables that make up the array are called array elements. Arrays are a form used in programming to organize several elements of the same type in an ordered form for processing convenience. According to the different types of array elements, arrays can be divided into numeric arrays, character arrays, and various other categories. Array processing methods differ in various programming languages, and this article only lists the functions and application cases of arrays in Spark SQL.

Characteristics

Arrays are collections of elements of the same data type.
The storage of elements in arrays has a sequential order, and they are stored continuously in memory according to this sequential order.
Array elements are queried using the entire array name and their own sequential position (index) in the array.
Arrays are displayed as string-like objects enclosed in square brackets in the database, with each element separated by commas. For example [1, 2, 3].

Main Functions

Array Generation and Conversion

Common Functions

Purpose	Function	Example	Result
Merge multiple rows into one row, return array withno deduplication	collect_list(expr)	collect_list([field]) [field] values are A,A,B,B,C,D respectively	[A,A,B,B,C,D]
Merge multiple rows into one row, return array afterdeduplication	collect_set(expr)	collect_set([field]) [field] values are A,A,B,B,C,D respectively	[A,B,C,D]
Merge multiple columns into one column	array(expr, ...)	array(1, 2, 3)	[1,2,3]
Split string with delimiter, return array.	split(str, regex[, limit]) [, limit] limits the number of split elements, can be omitted	split('A-B-C', '-'); split('A-B-C', '-',2)	[A,B,C]; [A,B-C]
Split one row into multiple rows	explode(expr)	explode(array('A','B'))	A B
Join array elements with delimiter into string	array_join(array, sep[, nullRep]) [, nullRep] used to specify replacement value for null values, removes null when omitted	array_join(array('hello', null ,'world'), ' ', ',')	hello, world
Join array elements with delimiter into string	concat_ws(sep[, str \| array(str)]+) removes null, can concatenate multiple arrays and strings	concat_ws(',',array('hello', null ,'world'))	hello,world
Split text into nested arrays according to punctuation and spaces	sentences(str[, lang, country])	sentences('Hi there! Good morning.')	[["Hi","there"],["Good","morning"]]
Generate sequence array with fixed step increment, used for numeric and datetime types	sequence(start, stop, step)	sequence(1, 5); sequence(to_date('2021-01-01'), to_date('2021-03-01'), interval 1 month)	[1,2,3,4,5]; [2021-01-01,2021-02-01,2021-03-01]

Common Scenarios

Merge multiple rows of data in dataset into one row, split one row into multiple rows. Mainly uses combination of functions like concat_ws(collect_set( )), explode(split( )) etc. For specific usage cases, please refer to: 《Processing Data Row Merging and Splitting through ETL》
Because of missing data (e.g., no sales records) causing year-over-year and month-over-month data to not match, need to supplement data, or other scenarios requiring Cartesian product. For example, use combination function explode(sequence([start date],[end date],interval 1 month)) to supplement calendar. For specific usage cases, please refer to: ETL Data Supplementation Methods.
Distinct count. *count(distinct( )) function does not support window function usage, when data volume is not large, can use size(collect_set([field])over(partition by [group column name])) to perform distinct count. For specific usage cases, please refer to: Distinct Count Function Implementation with Window Functions.

Array Basic Operations

Common Functions

Purpose	Function	Example	Result
Deduplicate array elements	array_distinct(array)	array_distinct(array(1, 2, 3, null, 3))	[1,2,3,null]
Return number of array elements	size(expr)	size(array('b', 'd', 'c', 'a'))	4
Remove all specified elements from array	array_remove(array, element) (null cannot be removed)	array_remove(array(1, 2, 3, null, 3), 3)	[1,2,null]
Return array composed of element repeated specified number of times	array_repeat(element, count)	array_repeat('123', 2)	["123","123"]
Extract subarray of fixed length from fixed position	slice(x, start, length)	slice(array(1, 2, 3, 4), 2, 2)	[2,3]
Process all elements in array according to specified method, return new array	transform(expr, func)	transform(array(1, 2, 3), x -> x + 1); transform(array('A','B','C'),(x,i)->i\|\|x)	[2,3,4]; [0A,1B,2C]
Reverse sort array	reverse(array)	reverse(array(2, 1, 4, 3))	[3,4,1,2]
Randomly sort array	shuffle(array)	shuffle(array(1, 20, 3, 5))	[3,1,5,20]
Sort array	array_sort(expr, func) func specifies sorting method, ascending order when omitted, null at end	array_sort(array('b', 'd', null, 'c', 'a'))	["a","b","c","d",null]
Sort array	sort_array(array[, true/false]) When ascending (true), null at front, when descending (false), null at end	sort_array(array('b', 'd', null, 'c', 'a'), true)	[null,"a","b","c","d"]

Array Query and Calculation

Common Functions

Purpose	Function	Example	Result
Return maximum value in array	array_max(array)	array_max(array(1, 20, null, 3))	20
Return minimum value in array	array_min(array)	array_max(array(1, 20, null, 4))	1
Query if element exists in array, return true/false	array_contains(array, value)	array_contains(array(1, 2, 3), 2)	true
Query position/index of element in array	array_position(array, element)	array_position(array('A','B','C'), B)	2
Query nth element in array	element_at(array, n) n starts from 1, when negative, query from back to front	element_at(array('A','B','C'), -1)	C
Query nth element in array	array[n] n starts from 0, cannot be negative	array('A','B','C')[0]	A
Filter array with specified condition/method, return array containing elements that meet conditions	filter(expr, func)	filter(array(1, 2, 3), x -> x % 2 == 1)	[1,3]
Query if any element in array meets judgment condition, return true/false	exists(expr, pred) Need to filter null values in advance, otherwise may cause result to return null.	exists(array(1, 2, 3), x -> x % 2 == 0)	true
Query if all elements in array meet judgment condition, return true/false	forall(expr, pred) Need to filter null values in advance, otherwise may cause result to return null.	forall(array(1, 2, 3), x -> x % 2 == 0)	false
Aggregate array elements into one result value according to binary operation	aggregate(expr, start, merge, finish)	aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) Sum elements in array	6

Common Scenarios

User attributes are multi-valued, set dataset row and column permissions.

Multi-valued user attributes in database are in string format connected by delimiters, need to be split into arrays for processing when applied. For example, commonly used row permission formula array_contains(split([CURRENT_USER.city],','),[city]) uses function split() to split user attribute values with commas into arrays, then array_contains() is used to judge whether the array includes the value in dataset field [city]. In this way, "Anshan" will exactly match "Anshan", and won't mistakenly match "Maanshan". Related cases please refer to: Row Permission Usage Case Sharing.

Multiple keywords, mutually "OR" relationship, fuzzy match long strings.

For example, to filter products whose titles contain flavor keywords like "original flavor" or "spicy", can use exists(split([flavor],','), x -> instr([product title],x)>0). Related cases please refer to: Filter Fuzzy Matching Implementation Methods

Multi-array Processing

Common Functions

Purpose	Function	Example	Result
Concatenate arrays, elements not deduplicated	expr1 \|\| expr2	array(1, 2, 3) \|\| array(4, 5) \|\| array(6)	[1,2,3,4,5,6]
Concatenate arrays, elements not deduplicated	concat(col1, col2, ..., colN)	concat(array(1, 2, 3), array(4, 5), array(6))	[1,2,3,4,5,6]
Return list of elements that exist in array1 but not in array2, elements deduplicated	array_except(array1, array2)	array_except(array(1, 2, 3), array(1, 3, 5))	[2]
Return intersection of array1 and array2, result deduplicated	array_intersect(array1, array2)	array_intersect(array(1, 2, 3), array(1, 3, 5))	[1,3]
Return union of array1 and array3, result deduplicated	array_union(array1, array2)	array_union(array(1, 2, 3), array(1, 3, 5))	[1,2,3,5]
Query if array1 and array2 have non-null value intersection, return true/false	arrays_overlap(array1, array2)	arrays_overlap(array(1, 2, 3), array(3, 4, 5))	true
Merge two-dimensional array into one-dimensional array	flatten(arrayOfArrays)	flatten(array(array(1, 2), array(3, 4)))	[1,2,3,4]

Common Scenarios

Extract content from long strings.

For example, string 'GUANDATA 202109R2(3.10.3)', extract the middle 202109R2 and 3.10.3 in parentheses, there are many extraction methods, among which element_at(flatten(sentences([Sprint])),3) first uses sentences() to split the string into nested arrays, then uses flatten() to merge nested arrays into a single array, finally uses element_at() to extract element content according to element position. For details please refer to: Spark SQL Text String Processing Functions and Applications.

Merge multiple long strings, and remove duplicate content in strings. For example, to achieve the following effect, can refer to the following combination function.

array_join(array_union(split([region1],','),split([region2],',')),',') 
--or--
concat_ws(',',array_distinct(split(concat_ws(',',[region1],[region2]),',')))

Dataset field format is single-layer or multi-layer nested json array, need to extract json content. Related cases: How to Use ETL to Parse JSON.

Note: In cases with large data volume, try to avoid multi-layer nested functions, recommend splitting into multiple calculated fields for step-by-step operations.

Definition​

Characteristics​

Main Functions​

Array Generation and Conversion​

Array Basic Operations​

Common Functions​

Array Query and Calculation​

Common Functions​

Common Scenarios​

Multi-array Processing​

Common Functions​

Common Scenarios​

Definition

Characteristics

Main Functions

Array Generation and Conversion

Array Basic Operations

Common Functions

Array Query and Calculation

Common Functions

Common Scenarios

Multi-array Processing

Common Functions

Common Scenarios