Remove Duplicates

1. Overview

The Remove Duplicates operator detects and removes duplicate records during data processing, so that each record in the result is unique with respect to the selected columns. Deduplicating on one or more columns prevents the analysis errors and inaccurate results that duplicate records cause.

For example, in e-commerce order processing, system issues or user mistakes can produce duplicate order records. Deduplication ensures that each order number appears only once, so that sales statistics and inventory figures are not distorted.
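The effect on sales statistics can be illustrated with a minimal Python sketch. It is not the operator's implementation: the `dedupe_by_key` helper, the `order_no` and `amount` field names, and the sample orders are all hypothetical, and keeping the first occurrence of each key is an assumption, since this page does not state which duplicate is retained.

```python
def dedupe_by_key(records, key):
    """Keep only the first record seen for each value of `key` (assumed policy)."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Hypothetical order data: order "A1002" was written twice.
orders = [
    {"order_no": "A1001", "amount": 30.0},
    {"order_no": "A1002", "amount": 45.0},
    {"order_no": "A1002", "amount": 45.0},
    {"order_no": "A1003", "amount": 12.5},
]

unique_orders = dedupe_by_key(orders, "order_no")

raw_total = sum(o["amount"] for o in orders)                # 132.5, inflated by the duplicate
deduped_total = sum(o["amount"] for o in unique_orders)     # 87.5
```

Without deduplication, the duplicated order inflates the sales total by the amount of one order.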

2. Usage Guide

2.1. Operation Steps

  1. Drag the Remove Duplicates operator from the ETL operator area to the canvas editing area on the right;
  2. Click the Remove Duplicates operator, then click Add;
  3. Select the deduplication key (the deduplication column); multiple columns can be selected;
  4. Click OK and preview the resulting data.
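When several columns are selected in step 3, one plausible reading is that records count as duplicates only when they match on every selected column. The sketch below illustrates that reading in Python; the `dedupe_by_columns` helper and the sample rows are hypothetical, and keeping the first matching record is an assumption rather than documented operator behavior.

```python
def dedupe_by_columns(records, columns):
    """Keep the first record for each distinct combination of `columns`."""
    seen = set()
    unique = []
    for record in records:
        # The composite key is the tuple of values in the selected columns.
        key = tuple(record[c] for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical rows for illustration only.
rows = [
    {"province": "Sichuan", "city": "Chengdu", "product": "Drawing Paper"},
    {"province": "Sichuan", "city": "Chengdu", "product": "Plant Shampoo"},
    {"province": "Sichuan", "city": "Mianyang", "product": "Drawing Paper"},
]

by_province = dedupe_by_columns(rows, ["province"])                 # 1 row remains
by_province_city = dedupe_by_columns(rows, ["province", "city"])    # 2 rows remain
```

Adding columns to the key makes the match stricter, so fewer records are treated as duplicates.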

2.2. Detailed Description

Below is an example of deduplicating by Product Name.

Prerequisite: The upstream node is a product demo dataset containing duplicate data.

  1. Drag the Remove Duplicates operator from the ETL operator area to the right canvas editing area and connect it to the upstream node;

  2. Click the Remove Duplicates operator; the left area becomes the configuration area for the current operator. Click Add and select the target field for deduplication;

Note: Usually, the primary key of the input dataset is used as the deduplication column. A primary key is one or more fields whose values uniquely identify a record in the table. If you select a column that is not unique, such as "Province", only one record is kept for each province, as follows:

| Province | City | Product Category | Product Name | Retail Price |
| --- | --- | --- | --- | --- |
| Shanxi | Xinzhou | Daily Necessities | Plant Shampoo 500ML | 12.5 |
| Sichuan | Chengdu | Daily Necessities | Drawing Paper 100 Sheets | 12.5 |
| Henan | Shangqiu | Daily Necessities | English Exercise Book Collection | 12.5 |

  3. Click OK and preview the processed data to confirm successful deduplication.
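The note above on choosing a primary key can be checked before configuring the operator. The following Python sketch counts how many records share each candidate key, which shows in advance whether deduplicating on that key would drop distinct records. The `duplicate_key_counts` helper and the input rows are hypothetical; the rows only imitate the style of the product demo dataset.

```python
from collections import Counter


def duplicate_key_counts(records, columns):
    """Return {key: count} for every key combination shared by more than one record."""
    counts = Counter(tuple(r[c] for c in columns) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical input rows, not taken from the actual demo dataset.
products = [
    {"Province": "Shanxi", "City": "Xinzhou", "Product Name": "Plant Shampoo 500ML"},
    {"Province": "Shanxi", "City": "Xinzhou", "Product Name": "Drawing Paper 100 Sheets"},
    {"Province": "Sichuan", "City": "Chengdu", "Product Name": "Drawing Paper 100 Sheets"},
]

# "Province" alone is not unique: two different Shanxi products share it.
province_dupes = duplicate_key_counts(products, ["Province"])

# The combination of "Province" and "Product Name" identifies each row uniquely.
composite_dupes = duplicate_key_counts(products, ["Province", "Product Name"])
```

An empty result means the candidate key already identifies every record uniquely, so deduplicating on it removes nothing; a non-empty result lists the key values whose extra records would be removed.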

For details on using other data processing operators, see Getting Started.