Skip to main content

Field Splitting Case Sharing

Background

In some datasets, there are fields that are strings joined by delimiters. Sometimes, you need to split them to extract a fixed position substring or split them into multiple columns.

Case Scenarios

Case 1: Create a calculated field, use string functions to locate the delimiter and split

image.png

  1. Actual route start point: extract the substring before the first dash.
Method 1: SUBSTR([Actual Route],0,INSTR([Actual Route],'-')-1)  
Method 2: LEFT([Actual Route],INSTR([Actual Route],'-')-1)
Method 3: SUBSTRING_INDEX([Actual Route],'-',1)

image.png

  1. Remaining part of the actual route: 1 and 2 can be joined to restore the full route.
SUBSTR([Actual Route],INSTR([Actual Route],'-')+1)

image.png

  1. Actual route endpoint
Method 1: right([Actual Route Endpoint],instr(REVERSE([Actual Route]),'-')-1)
Method 2: SUBSTRING_INDEX([Actual Route],'-',-1)

image.png

  1. First half of the remaining actual route: 3 and 4 can be joined to restore the full route.
REGEXP_EXTRACT([Actual Route],'(.+)(-{1}.+)',1)

image.png

  1. Is the endpoint a factory: The rule is pure letters and does not end with DC.
when SUBSTR([Actual Route Endpoint],-2)<>'DC' and (SUBSTR([Actual Route Endpoint],0,1)  <='Z' AND 'A'<= SUBSTR([Actual Route Endpoint],0,1))
then 'Yes'
when [Actual Route Endpoint] is null or [Actual Route Endpoint] = ''
then 'No'
else 'No'
end

image.png

  1. Is the endpoint WS: The rule is pure Chinese characters, no other letters or numbers.
when SUBSTR([Actual Route Endpoint],-2)='DC' or (SUBSTR([Actual Route Endpoint],0,1)  <='Z' AND 'A'<= SUBSTR([Actual Route Endpoint],0,1))
then 'No'
when [Actual Route Endpoint] is null or [Actual Route Endpoint] = ''
then 'No'
else 'Yes'
end

image.png

  1. Endpoint splitting

Example:
Actual Route=BAD-Cangzhou_Cangxian_Wulongtang
Actual Route Endpoint=Cangzhou_Cangxian_Wulongtang
Endpoint Fuzzy 1=Cangzhou_Cangxian
Endpoint Fuzzy 2=Cangzhou

-- Endpoint Fuzzy 1
REGEXP_EXTRACT([Actual Route Endpoint],'(.+)(_{1}.+)',1)
-- Endpoint Fuzzy 2
case when INSTR([Actual Route Endpoint],'_')>0 then SUBSTR([Actual Route Endpoint],0,INSTR([Actual Route Endpoint],'_')-1) end

image.png

Case 2: Create a calculated field, split the original field into an array by delimiter, and extract array elements

image.png

  1. Split into array: split([Field],'-')

image.png

  1. Remove the ending part: array_join(slice([Split Char],1,size([Split Char])-1),'-')
    Split Field 1: [Split Char][0]
    Split Field 2: [Split Char][1]
    Ending part: array_join(slice([Split Char],size([Split Char]),1),'')

image.png

Case 3: Extract Chinese/English part of a string

image.png

Implementation: You can use regular expressions in a new field.

  1. Keep Chinese:

image.png

  1. Keep English:

image.png

Final result:

image.png

Note: The above functions can be used in ETL and non-direct/non-accelerated datasets. For direct datasets, use the corresponding database functions. For high-performance (accelerated) datasets, use Clickhouse functions. For more text processing functions and cases, see Spark SQL Text Functions and Applications.