Commonly Used Spark Functions

【Introduction】

Guandata's calculation engine is built on Spark, so all SQL functions supported by Spark can be used on the Guandata platform. Spark Official Documentation.

The following lists some commonly used functions.

【Applicable Scenarios】

Applicable to Guan-index datasets, ETL output datasets, Excel/CSV datasets.

For datasets directly connected to databases, there are differences in SQL standard support among various database types.

Numeric

Function	Meaning	Usage	Description	Example
ABS	Absolute value	ABS([field])	Returns absolute value	ABS(-3), returns 3
CEIL	Round up	CEIL(numeric/[field])	Returns the smallest integer not less than numeric/[field]	CEIL(4.12), returns 5
FLOOR	Round down	FLOOR(numeric/[field])	Returns the largest integer not greater than numeric/[field]	FLOOR(4.12), returns 4
INT	Convert to integer	INT([field])	Returns the largest integer not greater than numeric/[field]. Converts String format to numeric/[field] type	INT(1.5) returns 1
LOG	Logarithm	LOG(numeric/[field]A,[numeric/[field]B])	Takes A as base, finds logarithm of B, defaults to natural logarithm base	LOG(2,4), returns 2.0
POW	Power	POW(numeric/[field]A,numeric/[field]B)	Finds the Bth power of numeric/[field]A	POW(4,2), returns 16.0
RAND	Random number	RAND()	Returns random decimal greater than 0 and less than 1	RAND(), returns random number
ROUND	Round to nearest	ROUND(numeric/[field]A[,integer D])	Returns numeric/[field]A rounded to D decimal places. Defaults to 0 when not specified	ROUND(4.12,1), returns 4.1
SQRT	Square root	SQRT(numeric/[field])	Finds the square root of numeric/[field], must be greater than or equal to zero	SQRT(4), returns 2.0

String

Function	Meaning	Usage	Description	Example
CONCAT	Concatenate strings	CONCAT([field1], [field2]...)	Returns strings connected in order	CONCAT([product code], [type code]), returns the concatenated string of product code and type code
FORMAT_STRING	Format output	format_string(specified format,[field])	Formats string output in specified form	format_string('%06d',82343), gets 082343
INSTR	Find string position	INSTR([text field], [search field])	Returns the position where the search string first appears in the text [field] value, result is an integer greater than 0, returns 0 if not found	INSTR([name], ','), returns the position where ',' first appears in the name field
LENGTH	String length	LENGTH([field])	Returns the length of the string	LENGTH([product name]), returns the length of product name
LOWER	Lowercase form	LOWER([field])	Returns the string in all lowercase form of the expression or [field] value	LOWER(ABC), returns abc
REGEXP_EXTRACT	String regex parsing	REGEXP_EXTRACT(string, regex, index)	Returns string regex parsing result, 'index' is the return result (0 means return all results, 1 means return the first match result in the regex)	REGEXP_EXTRACT([product ID], '\d+', 0), returns the numeric part in product ID
REGEXP_REPLACE	String regex replacement	REGEXP_REPLACE(string A, regex, string B)	Returns the result after replacing parts of string A that match the regex with string B	REGEXP_REPLACE([product name], '\d＋', ''), replaces the numeric part in product ID with empty string
REPEAT	Repeat string	REPEAT([field], numeric)	Returns new string result after repeating the string for the specified number of times	REPEAT([product name], 2), returns the string after repeating product name 2 times, such as [product name][product name]
REVERSE	Reverse string	REVERSE([field])	Returns new string result after reversing the string	REVERSE([type code]), input ABC returns CBA
SUBSTR	Extract substring	SUBSTR([field], start position[, length])	Returns substring of specified length from start position, length is optional; negative index represents counting string characters from end forward	SUBSTR([product type], 4), returns substring from index 4 to end of product type; SUBSTR([product type], -4), returns last 4 characters of product type
TRIM	Remove spaces	TRIM([field])	Removes spaces on both sides of data in expression or [field]	TRIM(" ABC "), returns ABC
UPPER	Uppercase form	UPPER([field])	Returns the string in all uppercase form of the expression or [field] value	UPPER("abc"), returns ABC

Aggregation Statistics

Function	Meaning	Usage	Description	Example
AVG	Average	AVG([field])	Returns the average of all values in [field], only applies to numeric [field] NULL values are not calculated	AVG(sales), returns the average of all non-null values corresponding to the sales [field]
COUNT	Count	COUNT([field])	Returns the number of data entries for all valid [field] in the expression NULL values are not calculated	COUNT(sales), returns the number of data entries for all non-null values corresponding to the sales [field]
COUNT_DISTINCT	Distinct count	COUNT(DISTINCT([field]))	Distinct count, returns the number of different data entries for all valid [field] in the expression, null values are not calculated	COUNT(DISTINCT(sales)), returns the number of different data entries for all non-null values corresponding to the sales [field]
MAX	Maximum	MAX([field])	Returns the maximum value in the expression or numeric [field], only applies to numeric fields	MAX(sales), returns the maximum value corresponding to the sales field
MIN	Minimum	MIN([field])	Returns the minimum value in the expression or numeric [field], only applies to numeric fields	MIN(sales), returns the minimum value corresponding to the sales field
PERCENTILE	Median/percentile	PERCENTILE([field], percentile)	Finds median and percentile of numeric type, input parameter is numeric, second parameter range is 0 to 1, e.g. 0.1, 0.25, 0.75, 0.95	PERCENTILE([math exam score], 0.5), returns the median of exam scores
STDDEV	Sample standard deviation	STDDEV([field])	Returns sample standard deviation	STDDEV([sales])
STDDEV_POP	Population standard deviation	STDDEV_POP([field])	Returns population standard deviation	STDDEV_POP([sales])
SUM	Sum	SUM([field])	Returns the sum of all values in the expression or numeric [field], only applies to numeric [field] Null values are not calculated	SUM(sales), returns the sum of all non-null values corresponding to the sales [field]
VAR_POP	Population variance	VAR_POP([field])	Returns population variance	VAR_POP([sales])
VARIANCE	Sample variance	VARIANCE([field])	Returns sample variance	VARIANCE([sales])

Non-aggregation Statistics

Function	Meaning	Usage	Description	Example
GREATEST	Compare maximum	GREATEST([field1], [field2]……)	Compares values in each row across columns, returns the maximum value	GREATEST([department1 sales], [department2 sales]), returns the maximum value between department1 sales and department2 sales fields in that row
LEAST	Compare minimum	LEAST([field1], [field2]……)	Compares values in each row across columns, returns the minimum value	LEAST([department1 sales], [department2 sales]), returns the minimum value between department1 sales and department2 sales fields in that row

Logic

Function	Meaning	Usage	Description	Example
CASE WHEN	Conditional function	CASE WHEN(expression1) THEN result1 WHEN (expression2) THEN result2 …… ELSE resultn END	Returns corresponding result if expression is satisfied, returns default result after ELSE if none satisfied, must have END keyword at the end	CASE WHEN([score]>=60 and [score]<70) THEN 'Pass' WHEN ([score]>=70 and [score]<80) THEN 'Medium' WHEN ([score]>=80 and [score]<90) THEN 'Good' WHEN ([score]>=90) THEN 'Excellent' ELSE 'Fail' END
COALESCE	Return non-null value	COALESCE([field1], [field2], 'default')	Returns the first non-null value in a group of [field]; if all values are NULL, returns NULL	COALESCE([name1], [name2], 'Anonymous'); returns the first non-null name in the parameters, if all are empty, returns 'Anonymous'
IF	Conditional function	IF(expression,result1,result2)	Returns result1 if expression is satisfied, otherwise returns result2	IF([score]>80,'Pass','Fail')

【Introduction】​

【Applicable Scenarios】​

Numeric​

String​

Aggregation Statistics​

Non-aggregation Statistics​

Logic​