Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses #1737
Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models
✅ Rebased on Latest Main
This PR has been rebased on the latest `main` branch to incorporate recent changes from the repository.
Rebase Summary:
- `f875036` to `6561c4da` (add the missing uv lock file)
- `dg_projects/openedx/openedx/assets/openedx_course_archives.py` (file was deleted in main branch during refactoring)
Changes in This PR (after rebase):
1. XML Block Extraction Function
Added `process_course_xml_blocks()` in `packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py`:
2. Dagster Asset Integration
Updated `extract_edxorg_courserun_metadata` multi-asset in:
- `dg_projects/edxorg/edxorg/assets/openedx_course_archives.py` ✅
Note: The file `dg_projects/openedx/openedx/assets/openedx_course_archives.py` was removed during the rebase as it was deleted in the main branch during a refactoring. The edxorg version remains and contains all the necessary functionality.
Added new `course_xml_blocks` output that:
- `s3file_io_manager` at `edxorg/processed_data/course_xml_blocks/{source_system}/{course_id}/{version}.json`
3. dbt Source Definition
Added `raw__edxorg__s3__course_xml_blocks` source in `src/ol_dbt/models/staging/edxorg/_edxorg_sources.yml`:
4. dbt Staging Model
Created `src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_xml_blocks.sql`:
- `deduplicate_raw_table` macro
- `decimal(38, 4)`
5. Model Documentation
Added comprehensive documentation in `src/ol_dbt/models/staging/edxorg/_stg__edxorg__models.yml`:
Files Changed
Total: 6 code files, 283 lines added
Testing
Post-Merge Actions
Same as before:
Original prompt
This section details the original issue you should resolve
<issue_title>Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models</issue_title>
<issue_description>## Summary
Process course XML contents from Open edX and edX.org course archives to load them into the raw layer of the data lakehouse and create dbt staging models for downstream analysis.
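As a rough illustration of what "loading course XML contents into the raw layer" can mean in practice, the sketch below flattens an XML tree into one record per element, keeping the parent relationship and the attributes as a JSON string. It is not the `process_course_xml_blocks()` function from this PR: the helper name `flatten_course_xml`, the record field names, and the sample XML are all hypothetical, and it assumes the course tree is inlined in a single document rather than split across the pointer files a real course archive may contain.

```python
import json
import xml.etree.ElementTree as ET


def flatten_course_xml(xml_text: str, course_id: str) -> list[dict]:
    """Flatten a course XML document into one record per element.

    Hypothetical helper for illustration only; field names are assumptions.
    """
    root = ET.fromstring(xml_text.strip())
    records: list[dict] = []

    def walk(element: ET.Element, parent_block_id, position: int) -> None:
        # Assumption: url_name identifies a block; fall back to a synthetic id.
        block_id = element.get("url_name") or f"{element.tag}-{len(records)}"
        records.append(
            {
                "course_id": course_id,
                "block_type": element.tag,
                "block_id": block_id,
                "parent_block_id": parent_block_id,
                "position": position,
                # Attributes are serialized so each record is a flat JSON row.
                "attributes": json.dumps(dict(element.attrib), sort_keys=True),
            }
        )
        for child_position, child in enumerate(element):
            walk(child, block_id, child_position)

    walk(root, None, 0)
    return records


# Made-up sample document, not taken from a real course archive.
SAMPLE_XML = """
<course url_name="2024_T1" display_name="Demo">
  <chapter url_name="ch1" display_name="Week 1">
    <sequential url_name="seq1">
      <video url_name="vid1" youtube_id_1_0="abc"/>
    </sequential>
  </chapter>
</course>
"""

blocks = flatten_course_xml(SAMPLE_XML, "course-v1:Demo+X+2024_T1")
for block in blocks:
    print(json.dumps(block))
```

Emitting one flat JSON object per block keeps the raw layer schema simple, and leaves typing and deduplication to the staging models.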
Background
Currently, the repository has functionality to extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the `extract_edxorg_courserun_metadata` multi-asset in `src/ol_orchestrate/assets/openedx_course_archives.py`. However, the raw XML contents and course structure data are not being systematically loaded into the raw layer of the data warehouse for comprehensive analysis and transformation through dbt.
Current State
The existing implementation:
- `src/ol_orchestrate/lib/openedx.py`:
  - `process_course_xml()` - extracts course metadata
  - `process_video_xml()` - extracts video elements
  - `process_policy_json()` - extracts policy information
- `src/ol_dbt/models/staging/edxorg/` that reference raw course structure data
What's Missing
The complete course XML contents, including all course components, blocks, and their relationships, need to be:
- `ol_warehouse_*_raw` schemas)
Requirements
1. Raw Layer Data Loading
Objective: Load course XML contents into raw layer tables
Tasks:
- `src/ol_orchestrate/assets/`
Data Sources:
Target Schema Pattern:
2. dbt Staging Models
Objective: Create staging models to transform raw course XML data into clean, typed datasets
Tasks:
- `bin/dbt-create-staging-models.py` utility to scaffold sources
- `src/ol_dbt/models/staging/edxorg/`
Staging Model Pattern:
3. Integration with Existing Pipeline
Tasks:
- `extract_edxorg_courserun_metadata` if needed
- `stg__edxorg__s3__course_structure.sql`
Technical Considerations
XML Structure
Course XML arc...
Fixes #1714