Skip to content

Investigate conditional uniqueness logic in marts__combined_program_enrollment_detail and downstream reporting models #1957

@quazi-h

Description

@quazi-h

While working on adding a new reporting model (program_enrollment_with_user_report), we discovered some differences in the uniqueness tests between marts__combined_program_enrollment_detail and the upstream models used to build that model.

The marts__combined_program_enrollment_detail mart model and its downstream reporting models use conditional uniqueness tests with platform-specific logic that may indicate underlying data quality issues or overly complex grain definitions. This creates maintenance burden and weakens data integrity validation.

Please investigate why marts__combined_program_enrollment_detail needs conditional uniqueness tests and whether the upstream model should be fixed.

Expected Behavior

The two models should have the same logic in the dbt tests for uniqueness.

Current Behavior

Current State:

Two separate uniqueness tests with mutually exclusive row_condition filters
Hard-coded exception for program_id = 2 (edX.org programs)
Different grain definitions for different platforms within the same model

For non-edX.org programs (program_id !=2):

uniqueness: [program_title, user_username, platform_name, programenrollment_created_on]

For edX.org programs (program_id =2):

uniqueness: [user_username, user_email]

Using the same logic for the uniqueness tests causes the reporting model to fail.
19:50:13 Failure in test dbt_expectations_expect_compound_columns_to_be_unique_program_enrollment_with_user_report_platform_name__program_id__user_username__user_email__programenrollment_created_on (models/reporting/_reporting__models.yml) 19:50:13  Got 1870062 results, configured to fail if >10

Steps to Reproduce

dbt build --select program_enrollment_with_user_report --vars 'schema_suffix: <your_username>' --target dev_production

Possible Solution (Generated by AI, may need to be revised)

Understand the Business Requirements
Interview stakeholders: Why do edX.org enrollments behave differently?
Understand re-enrollment scenarios across all platforms
Determine if programenrollment_created_on should always be part of uniqueness
Identify if there's a true unique identifier we're missing (e.g., enrollment UUID)

Analyze the Data
Query edX.org program enrollments to identify actual duplicate patterns
Check if duplicates exist only for re-enrolled users or across all users
Verify if user_username + user_email is truly unique for edX.org
Compare duplication rates across platforms

Review Upstream Sources
Check source system documentation for edX.org program enrollment grain
Review ETL logic in stg__edxorg__* models for enrollment handling
Identify if duplicates are introduced during staging or marts layer

Evaluate Alternatives
Option A: Add enrollment_id or similar unique identifier to all platforms
Option B: Standardize re-enrollment handling (e.g., is_current_enrollment flag)
Option C: Use qualified grain (most recent enrollment per user-program)
Option D: Accept conditional logic but document extensively with business rationale

Success Criteria
At minimum, this investigation should result in:
Documentation: Clear explanation of why conditional uniqueness tests exist and whether they're correct by design or masking data issues
One of the following outcomes:
Simplification: Replace conditional tests with a single comprehensive uniqueness test
Standardization: Add proper unique identifiers to eliminate need for conditional logic
Acceptance: Keep conditional tests but add thorough business logic documentation
Downstream Impact: Update all affected reporting models with findings

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions