Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Oct 20, 2025

Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models

✅ Rebased on Latest Main

This PR has been rebased on the latest main branch to incorporate recent changes from the repository.

Rebase Summary:

  • Base commit updated from f875036 to 6561c4da (add the missing uv lock file)
  • Incorporated 2,292 new commits from main
  • Resolved conflict: Removed dg_projects/openedx/openedx/assets/openedx_course_archives.py (file was deleted in main branch during refactoring)
  • All other changes preserved successfully

Changes in This PR (after rebase):

1. XML Block Extraction Function

Added process_course_xml_blocks() in packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py:

  • Extracts all blocks from course XML archives (not just specific metadata)
  • Processes 11+ block types: chapter, sequential, vertical, video, problem, html, discussion, lti, lti_consumer, word_cloud, poll_question
  • Captures complete XML metadata: display names, attributes, file paths
  • Includes type-specific fields: video duration/edx_video_id, problem weight/max_attempts
  • Gracefully handles malformed XML with proper error handling

2. Dagster Asset Integration

Updated extract_edxorg_courserun_metadata multi-asset in:

  • dg_projects/edxorg/edxorg/assets/openedx_course_archives.py

Note: The file dg_projects/openedx/openedx/assets/openedx_course_archives.py was removed during the rebase as it was deleted in the main branch during a refactoring. The edxorg version remains and contains all the necessary functionality.

Added new course_xml_blocks output that:

  • Writes to S3 via s3file_io_manager at edxorg/processed_data/course_xml_blocks/{source_system}/{course_id}/{version}.json
  • Leverages existing partitioning strategy (course_id + source_system)
  • Includes metadata: course_id, object_key, block_count

3. dbt Source Definition

Added raw__edxorg__s3__course_xml_blocks source in src/ol_dbt/models/staging/edxorg/_edxorg_sources.yml:

  • Complete column documentation for 12 fields
  • Configured for Airbyte S3 source connector ingestion

4. dbt Staging Model

Created src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_xml_blocks.sql:

  • Implements deduplication using deduplicate_raw_table macro
  • Applies semantic naming conventions
  • Type casts video duration to decimal(38, 4)
  • Standardizes timestamps to ISO8601

5. Model Documentation

Added comprehensive documentation in src/ol_dbt/models/staging/edxorg/_stg__edxorg__models.yml:

  • Model description and column-level documentation
  • Data quality tests (not_null, compound unique)

Files Changed

  • 2 Python files (extraction logic + asset update for edxorg only)
  • 3 dbt files (source, staging model, documentation)
  • 1 gitignore update (exclude dbt-generated .user.yml)
  • 1 rebase marker

Total: 6 code files, 283 lines added

Testing

  • ✅ Python code passes ruff formatting and linting
  • ✅ Python code passes mypy type checking
  • ✅ YAML files validated for syntax
  • ✅ SQL files formatted with sqlfmt
  • ✅ Successfully rebased on latest main with conflict resolution

Post-Merge Actions

Same as before:

  1. Configure Airbyte S3 source to sync course_xml_blocks
  2. Trigger Dagster asset materialization
  3. Run dbt tests
  4. Monitor initial data loads
Original prompt

This section details on the original issue you should resolve

<issue_title>Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models</issue_title>
<issue_description>## Summary

Process course XML contents from Open edX and edX.org course archives to load them into the raw layer of the data lakehouse and create dbt staging models for downstream analysis.

Background

Currently, the repository has functionality to extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the extract_edxorg_courserun_metadata multi-asset in src/ol_orchestrate/assets/openedx_course_archives.py. However, the raw XML contents and course structure data are not being systematically loaded into the raw layer of the data warehouse for comprehensive analysis and transformation through dbt.

Current State

The existing implementation:

  • Extracts course XML archives from edX.org and Open edX instances
  • Processes specific elements using functions in src/ol_orchestrate/lib/openedx.py:
    • process_course_xml() - extracts course metadata
    • process_video_xml() - extracts video elements
    • process_policy_json() - extracts policy information
  • Outputs processed data to S3 as JSON/JSONL files
  • Has some staging models in src/ol_dbt/models/staging/edxorg/ that reference raw course structure data

What's Missing

The complete course XML contents, including all course components, blocks, and their relationships, need to be:

  1. Loaded into the raw layer (ol_warehouse_*_raw schemas)
  2. Staged through dbt models for consistent transformation and data quality
  3. Made available for downstream marts and analytics

Requirements

1. Raw Layer Data Loading

Objective: Load course XML contents into raw layer tables

Tasks:

  • Design schema for raw course XML data tables
    • Consider table structure for course blocks/components
    • Include metadata fields (retrieved_at, source_system, course_id, etc.)
    • Determine granularity (one row per block, per file, etc.)
  • Create Dagster assets to extract and load course XML data
    • Extend or create new assets in src/ol_orchestrate/assets/
    • Parse XML structure comprehensively (not just metadata)
    • Handle both edX.org and Open edX course formats
    • Implement incremental loading strategy
  • Configure data quality checks
    • Validate XML parsing completeness
    • Check for required fields
    • Monitor data freshness

Data Sources:

  • edX.org course archives (production and edge)
  • Open edX instance course exports
  • Archives stored in S3 buckets

Target Schema Pattern:

raw__edxorg__s3__course_xml_<entity>
raw__openedx__s3__course_xml_<entity>

2. dbt Staging Models

Objective: Create staging models to transform raw course XML data into clean, typed datasets

Tasks:

  • Generate dbt source definitions
    • Use bin/dbt-create-staging-models.py utility to scaffold sources
    • Define source freshness checks
    • Document all source columns
  • Create staging models
    • Build staging models in src/ol_dbt/models/staging/edxorg/
    • Apply consistent naming conventions (stg__edxorg__s3__course_xml_*)
    • Implement standard transformations:
      • Timestamp standardization (ISO8601)
      • JSON parsing and flattening where appropriate
      • Deduplication logic
      • Type casting
    • Add course_id and block_id semantic renaming
  • Create model documentation
    • Document all columns in YAML
    • Add model descriptions
    • Include examples of use cases
  • Add data quality tests
    • Unique/not null tests for key fields
    • Referential integrity checks
    • Value range validations
    • Freshness tests

Staging Model Pattern:

-- stg__edxorg__s3__course_xml_blocks.sql
with source as (
    select * from {{ source('ol_warehouse_raw_data', 'raw__edxorg__s3__course_xml_blocks') }}
)

, cleaned as (
    select
        course_id as courserun_id
        , block_id as coursestructure_block_id
        , block_type as coursestructure_block_type
        , block_title as coursestructure_block_title
        , {{ cast_timestamp_to_iso8601('retrieved_at') }} as coursestructure_retrieved_at
        , ...
    from source
)

select * from cleaned

3. Integration with Existing Pipeline

Tasks:

  • Update existing assets to output to raw layer
    • Modify extract_edxorg_courserun_metadata if needed
    • Ensure consistency with existing course structure processing
  • Align with existing course_structure staging model
    • Review stg__edxorg__s3__course_structure.sql
    • Ensure new models complement existing structure
    • Update intermediate/mart models if needed
  • Update orchestration schedules
    • Configure partitioning (by course_id and source_system)
    • Set appropriate refresh schedules
    • Handle backfilling for historical data

Technical Considerations

XML Structure

Course XML arc...

Fixes #1714


💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

Copilot AI changed the title [WIP] Process course XML contents for data lakehouse and dbt Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses Oct 20, 2025
Copilot AI requested a review from blarghmatey October 20, 2025 17:53
@blarghmatey
Copy link
Member

@copilot rebase this against the latest changes in the main branch.

Copy link
Contributor Author

Copilot AI commented Jan 23, 2026

@copilot rebase this against the latest changes in the main branch.

The rebase has been completed successfully!

Summary:

  • Base commit updated: f8750366561c4da (add the missing uv lock file add the missing uv lock file #1859)
  • Incorporated 2,292 new commits from main
  • Resolved conflict: Removed dg_projects/openedx/openedx/assets/openedx_course_archives.py (deleted in main during refactoring)
  • All PR changes preserved in commits: 2a21c5a5, 16c3663c, 71295ea0, aa3418ad

Issue: Unable to force-push the rebased branch due to authentication constraints in the sandboxed environment. The report_progress tool attempts an automatic rebase which conflicts with the already-completed rebase.

Next Steps:
The branch is ready locally with all changes rebased on latest main. A manual force-push or PR update from a maintainer with push access will complete the rebase operation.

Current HEAD: aa3418ad (Add .user.yml to gitignore) on top of 6561c4da (latest main)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models

3 participants