Skip to main content

Spark SQL Array Processing Functions and Applications

Definition

Array (Array) is an ordered sequence of elements, and the individual variables that make up the array are called array elements. Arrays are a form used in programming to organize several elements of the same type in an ordered form for processing convenience. According to the different types of array elements, arrays can be divided into numeric arrays, character arrays, and various other categories. Array processing methods differ in various programming languages, and this article only lists the functions and application cases of arrays in Spark SQL.

Characteristics

  1. Arrays are collections of elements of the same data type.

  2. The storage of elements in arrays has a sequential order, and they are stored continuously in memory according to this sequential order.

  3. Array elements are queried using the entire array name and their own sequential position (index) in the array.

  4. Arrays are displayed as string-like objects enclosed in square brackets in the database, with each element separated by commas. For example [1, 2, 3].

Main Functions

Array Generation and Conversion

Common Functions

Purpose

Function

Example

Result

Merge multiple rows into one row, return array withno deduplication

collect_list(expr)

collect_list([field]) [field] values are A,A,B,B,C,D respectively

[A,A,B,B,C,D]

Merge multiple rows into one row, return array afterdeduplication

collect_set(expr)

collect_set([field]) [field] values are A,A,B,B,C,D respectively

[A,B,C,D]

Merge multiple columns into one column

array(expr, ...)

array(1, 2, 3)

[1,2,3]

Split string with delimiter, return array.

split(str, regex[, limit]) [, limit] limits the number of split elements, can be omitted

split('A-B-C', '-'); split('A-B-C', '-',2)

[A,B,C]; [A,B-C]

Split one row into multiple rows

explode(expr)

explode(array('A','B'))

A B

Join array elements with delimiter into string

array_join(array, sep[, nullRep]) [, nullRep] used to specify replacement value for null values, removes null when omitted

array_join(array('hello', null ,'world'), ' ', ',')

hello, world

concat_ws(sep[, str | array(str)]+) removes null, can concatenate multiple arrays and strings

concat_ws(',',array('hello', null ,'world'))

hello,world

Split text into nested arrays according to punctuation and spaces

sentences(str[, lang, country])

sentences('Hi there! Good morning.')

[["Hi","there"],["Good","morning"]]

Generate sequence array with fixed step increment, used for numeric and datetime types

sequence(start, stop, step)

sequence(1, 5); sequence(to_date('2021-01-01'), to_date('2021-03-01'), interval 1 month)

[1,2,3,4,5]; [2021-01-01,2021-02-01,2021-03-01]

Common Scenarios

  1. Merge multiple rows of data in dataset into one row, split one row into multiple rows. Mainly uses combination of functions like concat_ws(collect_set( )), explode(split( )) etc. For specific usage cases, please refer to: 《Processing Data Row Merging and Splitting through ETL

  2. Because of missing data (e.g., no sales records) causing year-over-year and month-over-month data to not match, need to supplement data, or other scenarios requiring Cartesian product. For example, use combination function explode(sequence([start date],[end date],interval 1 month)) to supplement calendar. For specific usage cases, please refer to: ETL Data Supplementation Methods.

  3. Distinct count. *count(distinct( )) function does not support window function usage, when data volume is not large, can use size(collect_set([field])over(partition by [group column name])) to perform distinct count. For specific usage cases, please refer to: Distinct Count Function Implementation with Window Functions.

Array Basic Operations

Common Functions

Purpose

Function

Example

Result

Deduplicate array elements

array_distinct(array)

array_distinct(array(1, 2, 3, null, 3))

[1,2,3,null]

Return number of array elements

size(expr)

size(array('b', 'd', 'c', 'a'))

4

Remove all specified elements from array

array_remove(array, element) (null cannot be removed)

array_remove(array(1, 2, 3, null, 3), 3)

[1,2,null]

Return array composed of element repeated specified number of times

array_repeat(element, count)

array_repeat('123', 2)

["123","123"]

Extract subarray of fixed length from fixed position

slice(x, start, length)

slice(array(1, 2, 3, 4), 2, 2)

[2,3]

Process all elements in array according to specified method, return new array

transform(expr, func)

transform(array(1, 2, 3), x -> x + 1); transform(array('A','B','C'),(x,i)->i||x)

[2,3,4]; [0A,1B,2C]

Reverse sort array

reverse(array)

reverse(array(2, 1, 4, 3))

[3,4,1,2]

Randomly sort array

shuffle(array)

shuffle(array(1, 20, 3, 5))

[3,1,5,20]

Sort array

array_sort(expr, func) func specifies sorting method, ascending order when omitted, null at end

array_sort(array('b', 'd', null, 'c', 'a'))

["a","b","c","d",null]

sort_array(array[, true/false]) When ascending (true), null at front, when descending (false), null at end

sort_array(array('b', 'd', null, 'c', 'a'), true)

[null,"a","b","c","d"]

Array Query and Calculation

Common Functions

Purpose

Function

Example

Result

Return maximum value in array

array_max(array)

array_max(array(1, 20, null, 3))

20

Return minimum value in array

array_min(array)

array_max(array(1, 20, null, 4))

1

Query if element exists in array, return true/false

array_contains(array, value)

array_contains(array(1, 2, 3), 2)

true

Query position/index of element in array

array_position(array, element)

array_position(array('A','B','C'), B)

2

Query nth element in array

element_at(array, n) n starts from 1, when negative, query from back to front

element_at(array('A','B','C'), -1)

C

array[n] n starts from 0, cannot be negative

array('A','B','C')[0]

A

Filter array with specified condition/method, return array containing elements that meet conditions

filter(expr, func)

filter(array(1, 2, 3), x -> x % 2 == 1)

[1,3]

Query if any element in array meets judgment condition, return true/false

exists(expr, pred) Need to filter null values in advance, otherwise may cause result to return null.

exists(array(1, 2, 3), x -> x % 2 == 0)

true

Query if all elements in array meet judgment condition, return true/false

forall(expr, pred) Need to filter null values in advance, otherwise may cause result to return null.

forall(array(1, 2, 3), x -> x % 2 == 0)

false

Aggregate array elements into one result value according to binary operation

aggregate(expr, start, merge, finish)

aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) Sum elements in array

6

Common Scenarios

  1. User attributes are multi-valued, set dataset row and column permissions.

Multi-valued user attributes in database are in string format connected by delimiters, need to be split into arrays for processing when applied. For example, commonly used row permission formula array_contains(split([CURRENT_USER.city],','),[city]) uses function split() to split user attribute values with commas into arrays, then array_contains() is used to judge whether the array includes the value in dataset field [city]. In this way, "Anshan" will exactly match "Anshan", and won't mistakenly match "Maanshan". Related cases please refer to: Row Permission Usage Case Sharing.

  1. Multiple keywords, mutually "OR" relationship, fuzzy match long strings.

For example, to filter products whose titles contain flavor keywords like "original flavor" or "spicy", can use exists(split([flavor],','), x -> instr([product title],x)>0). Related cases please refer to: Filter Fuzzy Matching Implementation Methods

Multi-array Processing

Common Functions

Purpose

Function

Example

Result

Concatenate arrays, elements not deduplicated

expr1 || expr2

array(1, 2, 3) || array(4, 5) || array(6)

[1,2,3,4,5,6]

concat(col1, col2, ..., colN)

concat(array(1, 2, 3), array(4, 5), array(6))

Return list of elements that exist in array1 but not in array2, elements deduplicated

array_except(array1, array2)

array_except(array(1, 2, 3), array(1, 3, 5))

[2]

Return intersection of array1 and array2, result deduplicated

array_intersect(array1, array2)

array_intersect(array(1, 2, 3), array(1, 3, 5))

[1,3]

Return union of array1 and array3, result deduplicated

array_union(array1, array2)

array_union(array(1, 2, 3), array(1, 3, 5))

[1,2,3,5]

Query if array1 and array2 have non-null value intersection, return true/false

arrays_overlap(array1, array2)

arrays_overlap(array(1, 2, 3), array(3, 4, 5))

true

Merge two-dimensional array into one-dimensional array

flatten(arrayOfArrays)

flatten(array(array(1, 2), array(3, 4)))

[1,2,3,4]

Common Scenarios

  1. Extract content from long strings.

For example, string 'GUANDATA 202109R2(3.10.3)', extract the middle 202109R2 and 3.10.3 in parentheses, there are many extraction methods, among which element_at(flatten(sentences([Sprint])),3) first uses sentences() to split the string into nested arrays, then uses flatten() to merge nested arrays into a single array, finally uses element_at() to extract element content according to element position. For details please refer to: Spark SQL Text String Processing Functions and Applications.

  1. Merge multiple long strings, and remove duplicate content in strings. For example, to achieve the following effect, can refer to the following combination function.
array_join(array_union(split([region1],','),split([region2],',')),',') 
--or--
concat_ws(',',array_distinct(split(concat_ws(',',[region1],[region2]),',')))

image.png

  1. Dataset field format is single-layer or multi-layer nested json array, need to extract json content. Related cases: How to Use ETL to Parse JSON.

Note: In cases with large data volume, try to avoid multi-layer nested functions, recommend splitting into multiple calculated fields for step-by-step operations.