Deduplication

1. Overview

Data deduplication detects and removes duplicate records during data processing so that every record in the output is unique. Deduplicating on one or more columns prevents the analysis errors and inaccurate results that duplicate records cause.

For example, in e-commerce order processing, the order system may contain duplicate order records because of system faults or user error. Deduplicating on the order number ensures that each order number appears only once, so duplicates do not distort sales statistics or inventory management.
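
The order scenario can be sketched in a few lines of Python. This is only an illustration of the idea, not the operator's actual implementation: the field names `order_no` and `amount`, the sample values, and the choice to keep the first occurrence of each key are all assumptions made for the example.

```python
def deduplicate(records, key_columns):
    """Keep the first record seen for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for record in records:
        # Build the deduplication key from the selected column(s).
        key = tuple(record[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical order data: order A001 was recorded twice.
orders = [
    {"order_no": "A001", "amount": 30.0},
    {"order_no": "A002", "amount": 12.5},
    {"order_no": "A001", "amount": 30.0},
]

unique_orders = deduplicate(orders, ["order_no"])
total_sales = sum(order["amount"] for order in unique_orders)
print(len(unique_orders), total_sales)
```

Without deduplication, summing `amount` over `orders` would count order A001 twice and overstate sales.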

2. Operation Steps

Drag the Data Deduplication operator from the data flow operator area into the canvas editing area on the right;
Click the Data Deduplication operator, then click Add;
Select the deduplication primary key (the deduplication column). Multiple columns can be selected;
Click OK and preview the resulting data.

3. Specific Case

The following case, Product Name Deduplication, serves as an example.

Prerequisite: The upstream node is a product demonstration dataset that contains duplicate data.

Drag the Data Deduplication operator from the ETL operator area into the canvas editing area on the right and connect it to the upstream node;
Click the Data Deduplication operator. The area on the left becomes the configuration area for the current operator. Click Add and select the target fields for deduplication;

Note: The primary key of the input dataset is usually used as the deduplication column. A primary key is one or more fields in a table whose values uniquely identify a record in that table. If "Province" is selected as the deduplication column, the result is as follows:

Province	City	Product Category	Product Name	Retail Price
Shanxi Province	Xinzhou City	Daily Necessities	Plant Shampoo 500ML	12.5
Sichuan Province	Chengdu City	Daily Necessities	Drawing Paper 100 Sheets	12.5
Henan Province	Shangqiu City	Daily Necessities	English Exercise Book Complete	12.5
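
The effect of the column choice can be sketched in Python. The three rows from the table above are reused; the two extra input rows marked as hypothetical are invented for illustration, since the full upstream dataset is not shown here. The sketch also assumes that the first row for each key is the one retained, which this page does not confirm.

```python
COLUMNS = ["Province", "City", "Product Category", "Product Name", "Retail Price"]

raw_rows = [
    ("Shanxi Province", "Xinzhou City", "Daily Necessities", "Plant Shampoo 500ML", 12.5),
    # Hypothetical row: same province, different city.
    ("Shanxi Province", "Taiyuan City", "Daily Necessities", "Plant Shampoo 500ML", 12.5),
    ("Sichuan Province", "Chengdu City", "Daily Necessities", "Drawing Paper 100 Sheets", 12.5),
    ("Henan Province", "Shangqiu City", "Daily Necessities", "English Exercise Book Complete", 12.5),
    # Hypothetical row: exact duplicate of the previous one.
    ("Henan Province", "Shangqiu City", "Daily Necessities", "English Exercise Book Complete", 12.5),
]
rows = [dict(zip(COLUMNS, values)) for values in raw_rows]


def dedupe_by(rows, columns):
    """Return one row per distinct value of the given columns (first row wins)."""
    first_seen = {}
    for row in rows:
        key = tuple(row[c] for c in columns)
        # setdefault stores the row only if this key has not been seen yet.
        first_seen.setdefault(key, row)
    return list(first_seen.values())


by_province = dedupe_by(rows, ["Province"])
by_province_city = dedupe_by(rows, ["Province", "City"])

for row in by_province:
    print(row["Province"], row["City"], row["Product Name"])
```

Deduplicating on "Province" alone leaves one row per province and discards the hypothetical Taiyuan City row, even though it is not a true duplicate. Deduplicating on "Province" and "City" together keeps it. This is why a column set that uniquely identifies a record is usually the safer choice.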

Click OK. Once configuration is complete, preview the processed data to confirm that deduplication succeeded.
