Deduplication

1. Overview

Data deduplication detects and removes duplicate records from a dataset during data processing, so that every record in the result is unique. Deduplicating on one or more columns prevents the analysis errors and inaccurate results that duplicate records cause.

For example, in e-commerce order processing, system faults or user mistakes can leave duplicate order records in the order system. Deduplication ensures that each order number appears only once, so that sales statistics and inventory management are not distorted.
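The order scenario can be illustrated with a minimal Python sketch. It is not the operator's actual implementation: the `deduplicate` helper, the `order_no` and `amount` field names, and the sample orders are all hypothetical, and keeping the first occurrence of each key is an assumption, since this page does not state which of the duplicate records is retained.

```python
def deduplicate(rows, key_columns):
    """Return rows with duplicates removed, keeping the first row seen per key.

    Assumption: "keep first" is chosen only for illustration.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        # Two rows count as duplicates only if every selected column matches.
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical order records; order O1002 was recorded twice.
orders = [
    {"order_no": "O1001", "amount": 120.0},
    {"order_no": "O1002", "amount": 80.0},
    {"order_no": "O1002", "amount": 80.0},
    {"order_no": "O1003", "amount": 45.5},
]

unique_orders = deduplicate(orders, ["order_no"])

total_before = sum(o["amount"] for o in orders)         # duplicate counted twice
total_after = sum(o["amount"] for o in unique_orders)   # each order counted once

print(total_before, total_after)
```

Passing several column names, for example `["order_no", "amount"]`, treats rows as duplicates only when all of those columns match, which mirrors deduplication on multiple columns.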

2. Operation Steps

  1. Drag the Data Deduplication operator from the data flow operator area onto the canvas editing area on the right;

  2. Click the Data Deduplication operator and click Add;

  3. Select the deduplication primary key (the deduplication column); multiple columns can be selected;

  4. Click OK and preview the data results.

3. Specific Case

The following uses deduplication by product name as an example.

Prerequisites: The upstream node is a product demonstration dataset containing duplicate data.

  1. Drag the Data Deduplication operator from the ETL operator area into the right canvas editing area and connect it to the upstream node;

  2. Click the Data Deduplication operator. The area on the left becomes the configuration area for the current operator. Click Add and select the target fields for deduplication;

Note: In most cases, use the primary key of the input dataset as the deduplication column. A primary key is one or more fields in a table whose values uniquely identify a record in that table. If "Province" is selected as the deduplication column, the result is as follows:

| Province | City | Product Category | Product Name | Retail Price |
| --- | --- | --- | --- | --- |
| Shanxi Province | Xinzhou City | Daily Necessities | Plant Shampoo 500ML | 12.5 |
| Sichuan Province | Chengdu City | Daily Necessities | Drawing Paper 100 Sheets | 12.5 |
| Henan Province | Shangqiu City | Daily Necessities | English Exercise Book Complete | 12.5 |
  3. Click OK. After the configuration is complete, preview the processed data to confirm that deduplication succeeded.
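The note in step 2 about choosing the deduplication column can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the `keep_first` helper is hypothetical, the operator is assumed to keep the first record for each key, and two of the input rows (marked in the comments) are invented to give the data something to deduplicate. The other three rows are the ones shown in the Province result.

```python
COLUMNS = ["Province", "City", "Product Category", "Product Name", "Retail Price"]

raw_rows = [
    ("Shanxi Province", "Xinzhou City", "Daily Necessities", "Plant Shampoo 500ML", 12.5),
    # Hypothetical row: a different product sold in the same province.
    ("Shanxi Province", "Taiyuan City", "Daily Necessities", "Hand Soap 300ML", 12.5),
    ("Sichuan Province", "Chengdu City", "Daily Necessities", "Drawing Paper 100 Sheets", 12.5),
    ("Henan Province", "Shangqiu City", "Daily Necessities", "English Exercise Book Complete", 12.5),
    # Hypothetical row: an exact repeat of the first record.
    ("Shanxi Province", "Xinzhou City", "Daily Necessities", "Plant Shampoo 500ML", 12.5),
]
rows = [dict(zip(COLUMNS, values)) for values in raw_rows]


def keep_first(rows, key_columns):
    """Keep the first row for each distinct combination of key column values."""
    first_by_key = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        # setdefault stores the row only if this key has not been seen yet;
        # dicts preserve insertion order, so the original row order is kept.
        first_by_key.setdefault(key, row)
    return list(first_by_key.values())


by_province = keep_first(rows, ["Province"])
by_product_name = keep_first(rows, ["Product Name"])

print([r["Product Name"] for r in by_province])
print([r["Product Name"] for r in by_product_name])
```

In this sketch, deduplicating on "Province" leaves one row per province and silently drops the second Shanxi product, while deduplicating on "Product Name" removes only the repeated record. This is why a column that uniquely identifies a record is usually the better choice.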