Skip to main content

Commonly Used Spark Functions

【Introduction】

Guandata's calculation engine is built on Spark, so all SQL functions supported by Spark can be used on the Guandata platform. Spark Official Documentation.

The following lists some commonly used functions.

【Applicable Scenarios】

Applicable to Guan-index datasets, ETL output datasets, Excel/CSV datasets.

For datasets directly connected to databases, there are differences in SQL standard support among various database types.


Numeric

FunctionMeaningUsageDescriptionExample
ABSAbsolute valueABS([field])Returns absolute valueABS(-3), returns 3
CEILRound upCEIL(numeric/[field])Returns the smallest integer not less than numeric/[field]CEIL(4.12), returns 5
FLOORRound downFLOOR(numeric/[field])Returns the largest integer not greater than numeric/[field]FLOOR(4.12), returns 4
INTConvert to integerINT([field])Returns the largest integer not greater than numeric/[field]. Converts String format to numeric/[field] typeINT(1.5) returns 1
LOGLogarithmLOG(numeric/[field]A,[numeric/[field]B])Takes A as base, finds logarithm of B, defaults to natural logarithm baseLOG(2,4), returns 2.0
POWPowerPOW(numeric/[field]A,numeric/[field]B)Finds the Bth power of numeric/[field]APOW(4,2), returns 16.0
RANDRandom numberRAND()Returns random decimal greater than 0 and less than 1RAND(), returns random number
ROUNDRound to nearestROUND(numeric/[field]A[,integer D])Returns numeric/[field]A rounded to D decimal places. Defaults to 0 when not specifiedROUND(4.12,1), returns 4.1
SQRTSquare rootSQRT(numeric/[field])Finds the square root of numeric/[field], must be greater than or equal to zeroSQRT(4), returns 2.0

String

FunctionMeaningUsageDescriptionExample
CONCATConcatenate stringsCONCAT([field1], [field2]...)Returns strings connected in orderCONCAT([product code], [type code]), returns the concatenated string of product code and type code
FORMAT_STRINGFormat outputformat_string(specified format,[field])Formats string output in specified formformat_string('%06d',82343), gets 082343
INSTRFind string positionINSTR([text field], [search field])Returns the position where the search string first appears in the text [field] value, result is an integer greater than 0, returns 0 if not foundINSTR([name], ','), returns the position where ',' first appears in the name field
LENGTHString lengthLENGTH([field])Returns the length of the stringLENGTH([product name]), returns the length of product name
LOWERLowercase formLOWER([field])Returns the string in all lowercase form of the expression or [field] valueLOWER(ABC), returns abc
REGEXP_EXTRACTString regex parsingREGEXP_EXTRACT(string, regex, index)Returns string regex parsing result, 'index' is the return result (0 means return all results, 1 means return the first match result in the regex)REGEXP_EXTRACT([product ID], '\d+', 0), returns the numeric part in product ID
REGEXP_REPLACEString regex replacementREGEXP_REPLACE(string A, regex, string B)Returns the result after replacing parts of string A that match the regex with string BREGEXP_REPLACE([product name], '\d+', ''), replaces the numeric part in product ID with empty string
REPEATRepeat stringREPEAT([field], numeric)Returns new string result after repeating the string for the specified number of timesREPEAT([product name], 2), returns the string after repeating product name 2 times, such as [product name][product name]
REVERSEReverse stringREVERSE([field])Returns new string result after reversing the stringREVERSE([type code]), input ABC returns CBA
SUBSTRExtract substringSUBSTR([field], start position[, length])Returns substring of specified length from start position, length is optional; negative index represents counting string characters from end forwardSUBSTR([product type], 4), returns substring from index 4 to end of product type; SUBSTR([product type], -4), returns last 4 characters of product type
TRIMRemove spacesTRIM([field])Removes spaces on both sides of data in expression or [field]TRIM(" ABC "), returns ABC
UPPERUppercase formUPPER([field])Returns the string in all uppercase form of the expression or [field] valueUPPER("abc"), returns ABC

Aggregation Statistics

FunctionMeaningUsageDescriptionExample
AVGAverageAVG([field])Returns the average of all values in [field], only applies to numeric [field] NULL values are not calculatedAVG(sales), returns the average of all non-null values corresponding to the sales [field]
COUNTCountCOUNT([field])Returns the number of data entries for all valid [field] in the expression NULL values are not calculatedCOUNT(sales), returns the number of data entries for all non-null values corresponding to the sales [field]
COUNT_DISTINCTDistinct countCOUNT(DISTINCT([field]))Distinct count, returns the number of different data entries for all valid [field] in the expression, null values are not calculatedCOUNT(DISTINCT(sales)), returns the number of different data entries for all non-null values corresponding to the sales [field]
MAXMaximumMAX([field])Returns the maximum value in the expression or numeric [field], only applies to numeric fieldsMAX(sales), returns the maximum value corresponding to the sales field
MINMinimumMIN([field])Returns the minimum value in the expression or numeric [field], only applies to numeric fieldsMIN(sales), returns the minimum value corresponding to the sales field
PERCENTILEMedian/percentilePERCENTILE([field], percentile)Finds median and percentile of numeric type, input parameter is numeric, second parameter range is 0 to 1, e.g. 0.1, 0.25, 0.75, 0.95PERCENTILE([math exam score], 0.5), returns the median of exam scores
STDDEVSample standard deviationSTDDEV([field])Returns sample standard deviationSTDDEV([sales])
STDDEV_POPPopulation standard deviationSTDDEV_POP([field])Returns population standard deviationSTDDEV_POP([sales])
SUMSumSUM([field])Returns the sum of all values in the expression or numeric [field], only applies to numeric [field] Null values are not calculatedSUM(sales), returns the sum of all non-null values corresponding to the sales [field]
VAR_POPPopulation varianceVAR_POP([field])Returns population varianceVAR_POP([sales])
VARIANCESample varianceVARIANCE([field])Returns sample varianceVARIANCE([sales])

Non-aggregation Statistics

FunctionMeaningUsageDescriptionExample
GREATESTCompare maximumGREATEST([field1], [field2]……)Compares values in each row across columns, returns the maximum valueGREATEST([department1 sales], [department2 sales]), returns the maximum value between department1 sales and department2 sales fields in that row
LEASTCompare minimumLEAST([field1], [field2]……)Compares values in each row across columns, returns the minimum valueLEAST([department1 sales], [department2 sales]), returns the minimum value between department1 sales and department2 sales fields in that row

Logic

FunctionMeaningUsageDescriptionExample
CASE WHENConditional functionCASE WHEN(expression1) THEN result1 WHEN (expression2) THEN result2 …… ELSE resultn ENDReturns corresponding result if expression is satisfied, returns default result after ELSE if none satisfied, must have END keyword at the endCASE WHEN([score]>=60 and [score]<70) THEN 'Pass' WHEN ([score]>=70 and [score]<80) THEN 'Medium' WHEN ([score]>=80 and [score]<90) THEN 'Good' WHEN ([score]>=90) THEN 'Excellent' ELSE 'Fail' END
COALESCEReturn non-null valueCOALESCE([field1], [field2], 'default')Returns the first non-null value in a group of [field]; if all values are NULL, returns NULLCOALESCE([name1], [name2], 'Anonymous'); returns the first non-null name in the parameters, if all are empty, returns 'Anonymous'
IFConditional functionIF(expression,result1,result2)Returns result1 if expression is satisfied, otherwise returns result2IF([score]>80,'Pass','Fail')