diff --git a/examples/minimal/README.md b/examples/minimal/README.md
new file mode 100644
index 0000000..7a5943b
--- /dev/null
+++ b/examples/minimal/README.md
@@ -0,0 +1,195 @@
+# Minimal PQG Example Data
+
+This directory contains small, hand-crafted example datasets that illustrate the iSamples PQG format. The same data is represented in JSON, CSV, and all three Parquet formats (export, narrow, wide).
+
+## Dataset Overview
+
+**Domain**: Geological rock samples from a Mount Rainier volcanic monitoring project
+
+**Entities**:
+- 3 MaterialSampleRecords (samples)
+- 3 SamplingEvents (collection/preparation events)
+- 2 GeospatialCoordLocations (coordinates)
+- 1 SamplingSite (Mount Rainier Summit Area)
+- 1 Agent (Jane Smith, collector)
+
+**Relationships demonstrated**:
+- Sample → produced_by → SamplingEvent (how samples are created)
+- Sample → derivedFrom → Sample (parent/child relationship)
+- SamplingEvent → sample_location → GeospatialCoordLocation
+- SamplingEvent → sampling_site → SamplingSite
+- SamplingSite → site_location → GeospatialCoordLocation
+
+## File Structure
+
+```
+minimal/
+├── json/
+│   ├── 1_sample.json          # Single sample (simplest case)
+│   └── 3_samples.json         # Three related samples
+├── csv/
+│   ├── samples.csv            # MaterialSampleRecords
+│   ├── events.csv             # SamplingEvents
+│   ├── locations.csv          # GeospatialCoordLocations
+│   ├── sites.csv              # SamplingSites
+│   ├── agents.csv             # Agents
+│   └── edges.csv              # Relationships (for narrow format)
+└── parquet/
+    ├── minimal_export.parquet # Export format (3 rows, nested)
+    ├── minimal_narrow.parquet # Narrow format (21 rows, with edges)
+    └── minimal_wide.parquet   # Wide format (10 rows, p__* columns)
+```
+
+## The Three Parquet Formats
+
+### Export Format (`minimal_export.parquet`)
+- **3 rows** - one per sample
+- Sample-centric with nested structs for related entities
+- Best for: Simple queries on sample properties
+- Coordinates pre-extracted to `sample_location_latitude/longitude`
+
+### Narrow Format (`minimal_narrow.parquet`)
+- **21 rows** - 10 entities + 11 edge rows
+- Graph-normalized with explicit `_edge_` rows
+- Columns `s` (subject), `p` (predicate), `o` (object array)
+- Best for: Graph traversal, flexible relationship queries
+
+### Wide Format (`minimal_wide.parquet`)
+- **10 rows** - one per entity (no edge rows)
+- Relationships stored as `p__*` columns with row_id arrays
+- Best for: Fast entity queries, smaller file size, analytical queries
+
+## Example Queries
+
+### Query 1: Find all samples (works in all formats)
+
+**Export format:**
+```sql
+SELECT sample_identifier, label
+FROM read_parquet('parquet/minimal_export.parquet')
+```
+
+**Wide format:**
+```sql
+SELECT pid, label
+FROM read_parquet('parquet/minimal_wide.parquet')
+WHERE otype = 'MaterialSampleRecord'
+```
+
+**Narrow format:**
+```sql
+SELECT pid, label
+FROM read_parquet('parquet/minimal_narrow.parquet')
+WHERE otype = 'MaterialSampleRecord'
+```
+
+### Query 2: Find samples with their locations
+
+**Wide format (uses p__* columns):**
+```sql
+SELECT
+    s.pid AS sample,
+    s.label,
+    loc.latitude,
+    loc.longitude
+FROM read_parquet('parquet/minimal_wide.parquet') s
+JOIN read_parquet('parquet/minimal_wide.parquet') e
+    ON e.otype = 'SamplingEvent'
+    AND list_contains(s.p__produced_by, e.row_id)
+JOIN read_parquet('parquet/minimal_wide.parquet') loc
+    ON loc.otype = 'GeospatialCoordLocation'
+    AND list_contains(e.p__sample_location, loc.row_id)
+WHERE s.otype = 'MaterialSampleRecord'
+```
+
+**Narrow format (uses edge rows):**
+```sql
+SELECT
+    s.pid AS sample,
+    s.label,
+    loc.latitude,
+    loc.longitude
+FROM read_parquet('parquet/minimal_narrow.parquet') s
+JOIN read_parquet('parquet/minimal_narrow.parquet') e1
+    ON e1.otype = '_edge_'
+    AND e1.s = s.row_id
+    AND e1.p = 'produced_by'
+JOIN read_parquet('parquet/minimal_narrow.parquet') ev
+    ON ev.otype = 'SamplingEvent'
+    AND list_contains(e1.o, ev.row_id)
+JOIN read_parquet('parquet/minimal_narrow.parquet') e2
+    ON e2.otype = '_edge_'
+    AND e2.s = ev.row_id
+    AND e2.p = 'sample_location'
+JOIN read_parquet('parquet/minimal_narrow.parquet') loc
+    ON loc.otype = 'GeospatialCoordLocation'
+    AND list_contains(e2.o, loc.row_id)
+WHERE s.otype = 'MaterialSampleRecord'
+```
+
+### Query 3: Count entities by type
+
+```sql
+SELECT otype, COUNT(*) AS count
+FROM read_parquet('parquet/minimal_wide.parquet')
+GROUP BY otype
+ORDER BY count DESC
+```
+
+Expected output:
+```
+MaterialSampleRecord      3
+SamplingEvent             3
+GeospatialCoordLocation   2
+SamplingSite              1
+Agent                     1
+```
+
+## JSON Schema Validation
+
+The JSON files validate against the iSamples Core 1.0 schema:
+
+```python
+import json
+from jsonschema import validate
+
+# Load schema (from isamplesorg-metadata repo)
+with open('path/to/iSamplesSchemaCore1.0.json') as f:
+    schema = json.load(f)
+
+# Load and validate
+with open('json/1_sample.json') as f:
+    sample = json.load(f)
+
+validate(instance=sample, schema=schema)  # Raises jsonschema.ValidationError if invalid
+```
+
+## Entity Relationship Diagram
+
+```
+MaterialSampleRecord ──produced_by──► SamplingEvent ──sample_location──► GeospatialCoordLocation
+        │                                   │
+        │                                   └──sampling_site──► SamplingSite ──site_location──► GeospatialCoordLocation
+        │
+        ├──registrant──► Agent
+        │
+        └──derivedFrom──► MaterialSampleRecord (parent sample)
+```
+
+## Size Comparison
+
+| Format | Rows | File Size | Notes |
+|--------|------|-----------|-------|
+| Export | 3 | 1.7 KB | Nested structs, sample-centric |
+| Narrow | 21 | 4.8 KB | Explicit edge rows |
+| Wide | 10 | 5.0 KB | p__* columns |
+
+In production datasets (unlike this tiny example, where wide is slightly larger than narrow):
+- Wide is typically 60-70% smaller than narrow
+- Export is smallest but less flexible for complex queries
+
+## See Also
+
+- [PQG Specification](../../docs/PQG_SPECIFICATION.md) - Full format specification
+- [Edge Types](../../pqg/edge_types.py) - All 14 iSamples edge types
+- [Schema Definitions](../../pqg/schemas/) - Python schema validators
diff --git a/examples/minimal/csv/agents.csv b/examples/minimal/csv/agents.csv
new file mode 100644
index 0000000..0e3c700
--- /dev/null
+++ b/examples/minimal/csv/agents.csv
@@ -0,0 +1,4 @@
+agent_id,name,role,affiliation,contact_information
+agent:jsmith,Jane Smith,collector,University of Washington,jsmith@uw.edu
+agent:labtech,Lab Technician,preparer,University of Washington,
+agent:curator,Collections Manager,curator,Burke Museum,
diff --git a/examples/minimal/csv/edges.csv b/examples/minimal/csv/edges.csv
new file mode 100644
index 0000000..1090eab
--- /dev/null
+++ b/examples/minimal/csv/edges.csv
@@ -0,0 +1,14 @@
+subject_id,predicate,object_id,description
+ark:/99999/example001,produced_by,event:example001,Sample was produced by this sampling event
+ark:/99999/example002,produced_by,event:example002,Sample was produced by this sampling event
+ark:/99999/example003,produced_by,event:example003,Sample was produced by this sampling event
+ark:/99999/example002,derivedFrom,ark:/99999/example001,Thin section derived from parent rock sample
+ark:/99999/example003,relatedTo,ark:/99999/example001,Sibling sample from same site
+event:example001,sample_location,loc:rainier001,Event occurred at this location
+event:example003,sample_location,loc:rainier002,Event occurred at this location
+event:example001,sampling_site,site:rainier001,Event occurred at this site
+event:example003,sampling_site,site:rainier001,Event occurred at this site
+site:rainier001,site_location,loc:rainier001,Site is at this location
+ark:/99999/example001,registrant,agent:jsmith,Sample registered by this agent
+ark:/99999/example002,registrant,agent:jsmith,Sample registered by this agent
+ark:/99999/example003,registrant,agent:jsmith,Sample registered by this agent
diff --git a/examples/minimal/csv/events.csv b/examples/minimal/csv/events.csv
new file mode 100644
index 0000000..f9d8e04
--- /dev/null
+++ b/examples/minimal/csv/events.csv
@@ -0,0 +1,4 @@
+event_id,label,description,result_time,project,feature_of_interest,site_id,location_id,collector_id
+event:example001,Mount Rainier Field Collection 2024-06-10,Field collection during summer geology survey,2024-06-10,Cascade Volcanic Monitoring Project,Recent lava flow on Mount Rainier,site:rainier001,loc:rainier001,agent:jsmith
+event:example002,Lab Preparation 2024-07-01,Thin section preparation in petrology lab,2024-07-01,,,,,agent:labtech
+event:example003,Mount Rainier Field Collection 2024-06-10 (Site B),Field collection 10m from first sample,2024-06-10,Cascade Volcanic Monitoring Project,Recent lava flow on Mount Rainier,site:rainier001,loc:rainier002,agent:jsmith
diff --git a/examples/minimal/csv/locations.csv b/examples/minimal/csv/locations.csv
new file mode 100644
index 0000000..d63ebb2
--- /dev/null
+++ b/examples/minimal/csv/locations.csv
@@ -0,0 +1,3 @@
+location_id,latitude,longitude,elevation,obfuscated
+loc:rainier001,46.8523,-121.7603,4392 m above mean sea level,false
+loc:rainier002,46.8524,-121.7601,4390 m above mean sea level,false
diff --git a/examples/minimal/csv/samples.csv b/examples/minimal/csv/samples.csv
new file mode 100644
index 0000000..b73ebd3
--- /dev/null
+++ b/examples/minimal/csv/samples.csv
@@ -0,0 +1,4 @@
+sample_id,label,description,last_modified_time,event_id,material_category,sample_object_type,registrant_id
+ark:/99999/example001,Rock Sample MR-001 (Parent),"Basalt collected during 2024 field survey. Fresh, unweathered sample from recent lava flow.",2024-06-15T10:30:00Z,event:example001,rock,physicalspecimen,agent:jsmith
+ark:/99999/example002,Rock Sample MR-001-A (Child - Thin Section),Thin section prepared from parent sample MR-001 for petrographic analysis.,2024-07-01T14:00:00Z,event:example002,rock,thinsection,agent:jsmith
+ark:/99999/example003,Rock Sample MR-002,"Second basalt sample from same site, collected 10m away from MR-001.",2024-06-15T11:00:00Z,event:example003,rock,physicalspecimen,agent:jsmith
diff --git a/examples/minimal/csv/sites.csv b/examples/minimal/csv/sites.csv
new file mode 100644
index 0000000..d2573e4
--- /dev/null
+++ b/examples/minimal/csv/sites.csv
@@ -0,0 +1,2 @@
+site_id,label,description,place_name
+site:rainier001,Mount Rainier Summit Area,Collection site near the summit crater rim,"Mount Rainier, Pierce County, Washington, USA"
diff --git a/examples/minimal/json/1_sample.json b/examples/minimal/json/1_sample.json
new file mode 100644
index 0000000..ac915fd
--- /dev/null
+++ b/examples/minimal/json/1_sample.json
@@ -0,0 +1,86 @@
+{
+  "sample_identifier": "ark:/99999/example001",
+  "label": "Rock Sample from Mount Rainier",
+  "description": "Basalt collected during 2024 field survey. Fresh, unweathered sample from recent lava flow.",
+  "last_modified_time": "2024-06-15T10:30:00Z",
+  "produced_by": {
+    "label": "Mount Rainier Field Collection 2024-06-10",
+    "description": "Field collection during summer geology survey",
+    "result_time": "2024-06-10",
+    "project": "Cascade Volcanic Monitoring Project",
+    "has_feature_of_interest": "Recent lava flow on Mount Rainier",
+    "sampling_site": {
+      "label": "Mount Rainier Summit Area",
+      "description": "Collection site near the summit crater rim",
+      "place_name": ["Mount Rainier", "Pierce County", "Washington", "USA"],
+      "sample_location": {
+        "latitude": 46.8523,
+        "longitude": -121.7603,
+        "elevation": "4392 m above mean sea level",
+        "obfuscated": false
+      }
+    },
+    "responsibility": [
+      {
+        "name": "Jane Smith",
+        "role": "collector",
+        "affiliation": "University of Washington",
+        "contact_information": "jsmith@uw.edu"
+      }
+    ]
+  },
+  "has_material_category": [
+    {
+      "identifier": "https://w3id.org/isample/vocabulary/material/1.0/rock",
+      "label": "Rock",
+      "scheme_name": "iSamples Material Type"
+    }
+  ],
+  "has_context_category": [
+    {
+      "identifier": "https://w3id.org/isample/vocabulary/sampledfeature/1.0/earthinterior",
+      "label": "Earth interior",
+      "scheme_name": "iSamples Sampled Feature Type"
+    }
+  ],
+  "has_sample_object_type": [
+    {
+      "identifier": "https://w3id.org/isample/vocabulary/specimentype/1.0/physicalspecimen",
+      "label": "Physical specimen",
+      "scheme_name": "iSamples Specimen Type"
+    }
+  ],
+  "keywords": [
+    {
+      "keyword": "basalt",
+      "scheme_name": "Free text"
+    },
+    {
+      "keyword": "volcanic rock",
+      "scheme_name": "Free text"
+    },
+    {
+      "keyword": "Cascade Range",
+      "scheme_name": "Geographic"
+    }
+  ],
+  "registrant": {
+    "name": "Jane Smith",
+    "affiliation": "University of Washington",
+    "contact_information": "jsmith@uw.edu",
+    "role": "registrant"
+  },
+  "curation": {
+    "label": "UW Geology Sample Collection",
+    "description": "Stored in climate-controlled facility",
+    "curation_location": "University of Washington, Burke Museum, Room 142, Drawer B-15",
+    "access_constraints": ["By appointment only", "Research use only"],
+    "responsibility": [
+      {
+        "name": "Collections Manager",
+        "role": "curator",
+        "affiliation": "Burke Museum"
+      }
+    ]
+  }
+}
diff --git a/examples/minimal/json/3_samples.json b/examples/minimal/json/3_samples.json
new file mode 100644
index 0000000..02e47fe
--- /dev/null
+++ b/examples/minimal/json/3_samples.json
@@ -0,0 +1,146 @@
+[
+  {
+    "sample_identifier": "ark:/99999/example001",
+    "label": "Rock Sample MR-001 (Parent)",
+    "description": "Basalt collected during 2024 field survey. Fresh, unweathered sample from recent lava flow. This is the original field sample.",
+    "last_modified_time": "2024-06-15T10:30:00Z",
+    "produced_by": {
+      "label": "Mount Rainier Field Collection 2024-06-10",
+      "identifier": "event:example001",
+      "result_time": "2024-06-10",
+      "project": "Cascade Volcanic Monitoring Project",
+      "has_feature_of_interest": "Recent lava flow on Mount Rainier",
+      "sampling_site": {
+        "identifier": "site:rainier001",
+        "label": "Mount Rainier Summit Area",
+        "place_name": ["Mount Rainier", "Pierce County", "Washington", "USA"],
+        "sample_location": {
+          "latitude": 46.8523,
+          "longitude": -121.7603,
+          "elevation": "4392 m above mean sea level"
+        }
+      },
+      "responsibility": [
+        {
+          "name": "Jane Smith",
+          "role": "collector",
+          "affiliation": "University of Washington"
+        }
+      ]
+    },
+    "has_material_category": [
+      {
+        "identifier": "https://w3id.org/isample/vocabulary/material/1.0/rock",
+        "label": "Rock",
+        "scheme_name": "iSamples Material Type"
+      }
+    ],
+    "has_sample_object_type": [
+      {
+        "identifier": "https://w3id.org/isample/vocabulary/specimentype/1.0/physicalspecimen",
+        "label": "Physical specimen"
+      }
+    ],
+    "registrant": {
+      "name": "Jane Smith",
+      "affiliation": "University of Washington"
+    }
+  },
+  {
+    "sample_identifier": "ark:/99999/example002",
+    "label": "Rock Sample MR-001-A (Child - Thin Section)",
+    "description": "Thin section prepared from parent sample MR-001 for petrographic analysis.",
+    "last_modified_time": "2024-07-01T14:00:00Z",
+    "produced_by": {
+      "label": "Lab Preparation 2024-07-01",
+      "identifier": "event:example002",
+      "result_time": "2024-07-01",
+      "description": "Thin section preparation in petrology lab",
+      "responsibility": [
+        {
+          "name": "Lab Technician",
+          "role": "preparer",
+          "affiliation": "University of Washington"
+        }
+      ]
+    },
+    "has_material_category": [
+      {
+        "identifier": "https://w3id.org/isample/vocabulary/material/1.0/rock",
+        "label": "Rock"
+      }
+    ],
+    "has_sample_object_type": [
+      {
+        "identifier": "https://w3id.org/isample/vocabulary/specimentype/1.0/thinsection",
+        "label": "Thin section"
+      }
+    ],
+    "related_resource": [
+      {
+        "label": "Parent sample",
+        "relationship": "derivedFrom",
+        "target": "ark:/99999/example001",
+        "description": "This thin section was prepared from the parent rock sample"
+      }
+    ],
+    "registrant": {
+      "name": "Jane Smith",
+      "affiliation": "University of Washington"
+    }
+  },
+  {
+    "sample_identifier": "ark:/99999/example003",
+    "label": "Rock Sample MR-002",
+    "description": "Second basalt sample from same site, collected 10m away from MR-001.",
+    "last_modified_time": "2024-06-15T11:00:00Z",
+    "produced_by": {
+      "label": "Mount Rainier Field Collection 2024-06-10 (Site B)",
+      "identifier": "event:example003",
+      "result_time": "2024-06-10",
+      "project": "Cascade Volcanic Monitoring Project",
+      "has_feature_of_interest": "Recent lava flow on Mount Rainier",
+      "sampling_site": {
+        "identifier": "site:rainier001",
+        "label": "Mount Rainier Summit Area",
+        "place_name": ["Mount Rainier", "Pierce County", "Washington", "USA"],
+        "sample_location": {
+          "latitude": 46.8524,
+          "longitude": -121.7601,
+          "elevation": "4390 m above mean sea level"
+        }
+      },
+      "responsibility": [
+        {
+          "name": "Jane Smith",
+          "role": "collector",
+          "affiliation": "University of Washington"
+        }
+      ]
+    },
+    "has_material_category": [
+      {
+        "identifier": "https://w3id.org/isample/vocabulary/material/1.0/rock",
+        "label": "Rock"
+      }
+    ],
+    "has_sample_object_type": [
+      {
+        "identifier": "https://w3id.org/isample/vocabulary/specimentype/1.0/physicalspecimen",
+        "label": "Physical specimen"
+      }
+    ],
+    "related_resource": [
+      {
+        "label": "Sibling sample",
+        "relationship": "relatedTo",
+        "target": "ark:/99999/example001",
+        "description": "Collected from same site as MR-001"
+      }
+    ],
+    "registrant": {
+      "name": "Jane Smith",
+      "affiliation": "University of Washington"
+    }
+  }
+]
diff --git a/examples/minimal/parquet/minimal_export.parquet b/examples/minimal/parquet/minimal_export.parquet
new file mode 100644
index 0000000..d06011e
Binary files /dev/null and b/examples/minimal/parquet/minimal_export.parquet differ
diff --git a/examples/minimal/parquet/minimal_narrow.parquet b/examples/minimal/parquet/minimal_narrow.parquet
new file mode 100644
index 0000000..64d12bb
Binary files /dev/null and b/examples/minimal/parquet/minimal_narrow.parquet differ
diff --git a/examples/minimal/parquet/minimal_wide.parquet b/examples/minimal/parquet/minimal_wide.parquet
new file mode 100644
index 0000000..3f7e38a
Binary files /dev/null and b/examples/minimal/parquet/minimal_wide.parquet differ