Data testing verifies the data life cycle in a business system, covering "generation, transmission, processing, storage, and display", with the aim of ensuring the accuracy, integrity, consistency, security, and timeliness of data. It is not only an important branch of software testing, but also a key link in ensuring that data-driven services run correctly. Compared with traditional functional testing, data testing deals more with databases, interfaces, data models, ETL processes, logs, files, and other data sources, and places higher demands on a tester's logical thinking, SQL skills, data sensitivity, and business understanding.
This stage mainly verifies that the system collects source data correctly.
Test Points:
· Is the data source correct: which system, interface, or third-party source is the data pulled from? Is the source legitimate and stable?
· Are the fields complete: have all required fields been collected? Are any missing?
· Are field types and formats consistent: for example, is a time field YYYY-MM-DD or a timestamp? Is an amount stored as floating point?
· Is data collected in real time or on a schedule: does scheduled capture actually run on time? Is real-time data delayed? Does the scheduling frequency match the requirement?
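The field-completeness and format checks above can be expressed as a small validation routine. This is a minimal sketch; the schema, field names, and rules below are illustrative assumptions, not any particular system's contract:

```python
import re

REQUIRED_FIELDS = {"id", "name", "created_at", "amount"}  # assumed schema
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")         # YYYY-MM-DD

def validate_record(record: dict) -> list:
    """Return a list of problems found in one collected record."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "created_at" in record and not DATE_PATTERN.match(str(record["created_at"])):
        problems.append("created_at is not in YYYY-MM-DD format")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        problems.append("amount is not numeric")
    return problems
```

Running every collected record through such a validator turns the checklist into an automated, repeatable check instead of spot-checking by eye.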
Data is transferred from one system to another, for example via API calls, file transfers, or message queues.
Test Points:
· Is data lost or duplicated: for example, does a paginated transfer miss pages or deliver them twice?
· Are retries/compensation supported on network exceptions: for example, does the MQ consumer have an idempotency mechanism or exception-handling logic configured?
· Data encoding issues: is Chinese text garbled? Are special characters (e.g. emojis, line breaks) transmitted intact?
· Is the interface structure stable: have fields changed? Are there compatibility issues with older versions?
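The loss/duplication check can be sketched by comparing message ids before and after transfer. The `transfer_report` helper below is an illustrative assumption (it presumes both sides can export their ids, e.g. from logs):

```python
from collections import Counter

def transfer_report(source_ids, target_ids):
    """Classify transfer problems by comparing ids on both sides."""
    src, tgt = Counter(source_ids), Counter(target_ids)
    lost = sorted(set(src) - set(tgt))                       # sent but never received
    duplicated = sorted(k for k, n in tgt.items() if n > 1)  # delivered more than once
    unexpected = sorted(set(tgt) - set(src))                 # received but never sent
    return {"lost": lost, "duplicated": duplicated, "unexpected": unexpected}
```

A non-empty `duplicated` list after a retry storm is a strong hint that the consumer lacks the idempotency mechanism mentioned above.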
This stage is the most error-prone part of data testing, involving a large amount of field mapping, rule processing, data calculation, format conversion, and other operations.
· Field mapping errors: source field A should map to target field B but is incorrectly mapped to C.
· Numerical calculation errors: for example, an amount unit converted incorrectly (10,000 yuan confused with yuan), or a percentage confused with its decimal form.
· Inaccurate business logic: for example, the ID-number check digit is not validated, or an outdated credit-rating algorithm is still in use.
· Insufficient data cleaning: for example, deduplication fails, illegal values are not removed, or empty fields are not given default values.
· Source system: investment ratio investRate = "12.3456%"
· Target system requirement: keep 4 decimal places => 0.1235
The test should check that the percent sign is removed correctly and the value is converted to the right format.
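A minimal sketch of this conversion (the `percent_to_decimal` helper name is an assumption; the rounding rule follows the example above):

```python
def percent_to_decimal(raw: str, places: int = 4) -> float:
    """Convert a percent string like '12.3456%' to its decimal form."""
    value = float(raw.rstrip('%')) / 100   # strip the sign, shift by two places
    return round(value, places)            # keep the required precision
```

This gives `percent_to_decimal("12.3456%") == 0.1235`, matching the target system's expected output; a test should also feed it edge inputs such as "0%" and "100%".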
The database is where data finally lands, and its correctness is the core of data quality assurance.
Key checks:
· Field type matching: varchar for strings, decimal for amounts, datetime for times, etc.
· Precision issues: especially for amounts, percentages, and ratios, pay attention to the number of decimal places (2 digits, 4 digits, etc.).
· Primary key/unique constraints: are duplicate rows written? Are there primary key conflicts or uniqueness violations?
· Consistency between write time and creation time: for example, if the source system generates a creation-time field, is it stored accurately?
· Data archiving/update/deletion policies are correct: e.g. are tombstone (logical-delete) fields set correctly?
Example SQL validation statements:
-- Check uniqueness
SELECT credit_code, COUNT(*) FROM ent_base GROUP BY credit_code HAVING COUNT(*) > 1;
-- Check null value
SELECT * FROM t_data WHERE important_field IS NULL;
-- Accuracy check
SELECT investRate FROM t_data WHERE LENGTH(SUBSTRING_INDEX(investRate, '.', -1)) > 4;
This is where users see the data directly; testing cannot only check "whether the page looks correct" but must also verify the data source behind the page.
Test Points:
· Whether the report numbers are consistent with the database;
· Whether filtering/searching/pagination works correctly;
· Whether totals/averages are calculated accurately;
· Whether the time dimension, units, and formats are clear;
· Whether chart data corresponds to the text data.
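One way to automate the "report numbers vs. database" check is to recompute the aggregates in SQL and compare them with the totals captured from the page. The sketch below uses an in-memory SQLite table with made-up data; the table name, columns, and page totals are all illustrative assumptions:

```python
import sqlite3

# Tiny in-memory table standing in for the real reporting database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_data (region TEXT, amount REAL)")
conn.executemany("INSERT INTO t_data VALUES (?, ?)",
                 [("north", 10.0), ("north", 20.0), ("south", 5.0)])

# Totals the report page claims to show (assumed captured from the UI)
page_totals = {"north": 30.0, "south": 5.0}

# Recompute the same aggregates directly from the database
db_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM t_data GROUP BY region"))
assert page_totals == db_totals, f"report/database mismatch: {db_totals}"
```

The same pattern applies to averages and counts: compute them independently of the display layer, then compare.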
Special Note:
· Whether a data update correctly overwrites the old value;
· Whether an undo operation fully restores the previous state;
· Whether a delete operation is physical or logical;
· Whether data logs or audit records are retained.
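The physical-vs-logical delete check can be sketched against a throwaway table. The `is_deleted` tombstone column and the delete statement below are assumptions about how the application implements logical deletion:

```python
import sqlite3

# Minimal table with a tombstone flag for logical deletes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_data (id INTEGER PRIMARY KEY, val TEXT, "
             "is_deleted INTEGER DEFAULT 0)")
conn.execute("INSERT INTO t_data (id, val) VALUES (1, 'a')")

# The application's delete operation (assumed to be a logical delete)
conn.execute("UPDATE t_data SET is_deleted = 1 WHERE id = 1")

# The row must still exist physically, with the tombstone set
row = conn.execute("SELECT is_deleted FROM t_data WHERE id = 1").fetchone()
assert row is not None and row[0] == 1, "row physically deleted or flag not set"
```

If the row disappears entirely, the system is doing a physical delete, which matters for auditability and for any downstream jobs that rely on the tombstone field.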
(1) SQL script: JOIN, COUNT, HAVING, etc. for data comparison.
-- Verify data consistency between source and target tables
SELECT * FROM source_table a
LEFT JOIN target_table b ON a.code = b.code
WHERE b.code IS NULL          -- row missing from target
   OR a.value <> b.value;     -- value differs
-- Verify that row counts match
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM target_table;
(2) Excel comparison: quickly analyze differences using VLOOKUP and conditional formatting.
For small batches of sample data, you can use Excel's formulas to assist in checking:
· VLOOKUP and IF to determine whether values are consistent;
· conditional formatting to highlight differences;
· pivot tables to check aggregate consistency.
(3) Python scripting: use pandas to process large data files and compare field values.
import pandas as pd

df1 = pd.read_excel('source.xlsx')
df2 = pd.read_excel('target.xlsx')

# A positional comparison (df1['code'] != df2['code']) breaks when row order or
# row counts differ, so align the two files on the key column instead
merged = df1.merge(df2, on='code', how='outer', indicator=True)
diff = merged[merged['_merge'] != 'both']  # codes present on only one side
diff.to_excel('variance_data.xlsx', index=False)
(4) Data test case design suggestions
Design cases across five dimensions: business, data, technology, timeliness, and boundaries. Focus on data integrity, correctness, and coverage, supported by SQL scripts and sample files.
Recommendation: each test case should describe the input source, the processing logic, the expected output, and an accompanying SQL script or Excel sample.
(5) Common error-prone points in data testing
1. Field mismapping
2. Wrong decimal precision
3. Encoding handling exceptions
4. Misunderstood status codes
5. Missing anomaly/boundary data
6. Unrepresentative test data
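Points 5 and 6 can be mitigated by deliberately building anomaly and boundary inputs into the test data set. The sample values below are illustrative assumptions, not an exhaustive catalog:

```python
# Illustrative boundary/anomaly values for data-test inputs (all assumed examples)
BOUNDARY_SAMPLES = {
    "amount": [0, -1, 0.005, 9_999_999_999.99],      # zero, negative, rounding edge, huge
    "percent": ["0%", "100%", "12.34567%", ""],      # edges, extra precision, empty
    "text": ["", " ", "中文", "emoji 😀", "a\nb"],    # empty, whitespace, CJK, emoji, newline
}

def boundary_rows(samples: dict) -> list:
    """Expand each field's boundary values into one-row dicts for test input."""
    rows = []
    for field, values in samples.items():
        rows.extend({field: v} for v in values)
    return rows
```

Feeding these rows through the pipeline exercises exactly the encoding, precision, and cleaning paths listed as error-prone above.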
(6) Ability requirements for data testers
To do a good job in data testing, testers are recommended to have the following abilities:
· Proficient in SQL: able to perform complex queries, field comparisons, and statistical analysis;
· Able to use Python for simple data processing (learning pandas is recommended);
· Familiar with common ETL tools (e.g., DataX, Kettle, Flink);
· Familiar with data table structures, field types, and business logic;
· Sensitive to data details, patient with error-prone work, and skilled at troubleshooting;
· Able to write standardized test cases and comparison scripts.








