Strategies for Testing Data Warehouse Applications
Testing Goals
- Data completeness.
Ensures that all expected data is loaded. - Data transformation.
Ensures that all data is transformed correctly according to business rules and/or design specifications. - Data quality.
Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data. - Performance and scalability.
Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable. - Integration testing.
Ensures that the ETL process functions well with other upstream and downstream processes. - User-acceptance testing.
Ensures the solution meets users' current expectations and anticipates their future expectations. - Regression testing.
Ensures existing functionality remains intact each time a new release of code is completed.
Data Completeness
Preconditions.
1. Objects are deinstalled successfully.
2. Objects are installed successfully.
3. -init scripts have been fulfilled.
4. the n.a., n.d. and initial data is in dimensions, facts are empty.
Example 1, DIM. Comparing record counts between source and target data.
Script:
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT COUNT ( * ) AS cnt
FROM wrk_table_contacts
MINUS
SELECT COUNT ( * ) AS cnt
FROM mv_table_for_dataload );
Strategy:
1. Run loading process.
2. Run loading process again.
3. Test target minus source.
4. Test source minus target.
Example 2, DIM. The same script but sometimes it's needed to exclude default values.
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT table_channel_code AS cnt
FROM dim_table_channels
WHERE table_channel_id NOT IN (-98, -99)
MINUS
SELECT DISTINCT table_channel AS cnt
FROM wrk_table_contacts
WHERE table_channel NOT IN ('n.a', 'n.d.') );
Example 3, DIM. Populating the full contents of each field to validate that no truncation occurs at any step in the process.
Script:
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT agent_id_ticket
, agent_id
, table_theme_id
, table_channel
, table_channel_id
, table_status
FROM wrk_table_contacts
MINUS
SELECT agent_id_ticket
, agent_id
, table_theme_id
, table_channel
, table_channel_id
, table_status
FROM mv_table_for_dataload );
Example 4, FCT. Comparing record counts between source and target data. Sometimes it is hard to compare 1 to 1 source and target tables. Then compare all essences.
Script:
-- Verify total count of tickets
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT COUNT ( table_srcid ) AS cnt
FROM mv_table_for_dataload
MINUS
SELECT COUNT ( table_srcid ) AS cnt
FROM fct_table_contacts );
Example 5, FCT. Find lost data.
-- Verify total count of lost data
SELECT COUNT ( * ) AS LOST
FROM mv_table_for_dataload a
, fct_table_contacts b
WHERE a.table_srcid = b.table_srcid(+)
AND a.contact_srcid = b.contact_srcid(+)
AND b.contact_srcid IS NULL );
Data Transformation
Example 1. Validate that data types in the warehouse are as specified in the design and/or the data model.
Manually.
Strategy:
1. Review ETL code.
2. Wait couple of days.
3. Review ETL code again. If there are defects then repeat from 1 point.
Example 2. SCD transformations. Find wrong flags.
Test 1:
SELECT COUNT ( * ) AS ERRORS
FROM dim_table_contact_actions_scd
WHERE valid_to = TO_DATE ( '31.12.9999'
, 'DD.MM.YYYY' )
AND active_flag = 'N'
AND table_contact_action_surr_id NOT IN (-98, -99);
Test 2:
SELECT COUNT ( * ) AS ERRORS
FROM(SELECT table_contact_action_surr_id, table_contact_action_id, valid_from, valid_to, active_flag cnt
FROM dim_table_contact_actions_scd
WHERE valid_to != TO_DATE ( '31.12.9999'
, 'DD.MM.YYYY' )
AND active_flag = 'Y'
AND table_contact_action_surr_id NOT IN (-98, -99));
Example 3. SCD transformations. Find lost time frames in dimension.
Script:
SELECT SUM ( CASE WHEN valid_to <> NVL ( valid_from_next, '31.12.9999' ) THEN '1' ELSE '0' END ) AS ERRORS
FROM ( SELECT table_contact_action_surr_id
, table_contact_action_id
, valid_to
, LEAD ( valid_from ) OVER (PARTITION BY table_contact_action_id ORDER BY valid_from ASC)
AS valid_from_next
FROM dim_table_contact_actions_scd
WHERE table_contact_action_id NOT IN (-99, -98) ) );
Example 4. SCD transformations. Verify time filter.
Manually.
...
AND ticket.event_dt >= agent.valid_from(+)
AND ticket.event_dt < agent.valid_to(+) --> if (+) then verify that there is NVL.
...
Don't use BETWEEN function in this case!!!
Data Quality
Example 1. Verify all dimension IDs in fact table, every dim_id.
Script:
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT product_surr_id
FROM fct_table_contacts
MINUS
SELECT product_surr_id
FROM dim_products_scd );
Example 2. Verify all level codes in the filters.
...
FROM DIM_PRODUCTS_SCD prod,
...
Verify filter -> WHERE prod.level_code = 'AGENT'
Example 3. Verify comments in code.
...
WHERE system_id = 1
...
Verify filter -> -- source system
Example 4. Verify comments of db objects.
TableStructure.sql - generate the html file with table structure and comments.
Performance and scalability
Example 1.
DataAmount.sql - shows the size and growth index of fact table.
Example 2. Analyse log table.
SELECT *
FROM lib.vw_log_t
WHERE SYSTEM like '%Project Title%'
AND timestamp > TO_DATE('10.05.2010', 'DD.MM.YYYY')
ORDER BY timestamp DESC;
Regression Testing
Building automation during system testing will make the process of regression testing much smoother.
Fitnesse is a good point.