195 changes: 195 additions & 0 deletions examples/minimal/README.md
@@ -0,0 +1,195 @@
# Minimal PQG Example Data

This directory contains small, hand-crafted example datasets to help understand the iSamples PQG format. The same data is represented in JSON, CSV, and all three parquet formats (export, narrow, wide).

## Dataset Overview

**Domain**: Geological rock samples from Mount Rainier volcanic monitoring project

**Entities**:
- 3 MaterialSampleRecords (samples)
- 3 SamplingEvents (collection/preparation events)
- 2 GeospatialCoordLocations (coordinates)
- 1 SamplingSite (Mount Rainier Summit Area)
- 1 Agent (Jane Smith, collector)

**Relationships demonstrated**:
- Sample → produced_by → SamplingEvent (how samples are created)
- Sample → derivedFrom → Sample (parent/child relationship)
- SamplingEvent → sample_location → GeospatialCoordLocation
- SamplingEvent → sampling_site → SamplingSite
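- SamplingSite → site_location → GeospatialCoordLocation
- Sample → registrant → Agent (who registered the sample)
- Sample → relatedTo → Sample (sibling sample from the same site)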
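
To illustrate how these relationships chain together, the following Python sketch stores a few of them as (subject, predicate, object) triples and walks from a sample to its coordinates. The identifiers are copied from `csv/edges.csv` and `csv/locations.csv`; the `follow` and `sample_coordinates` helpers are hypothetical names used only for this illustration and are not part of the PQG library.

```python
# Illustrative only: a handful of edges copied from csv/edges.csv.
EDGES = [
    ("ark:/99999/example001", "produced_by", "event:example001"),
    ("ark:/99999/example002", "produced_by", "event:example002"),
    ("ark:/99999/example002", "derivedFrom", "ark:/99999/example001"),
    ("event:example001", "sample_location", "loc:rainier001"),
    ("event:example001", "sampling_site", "site:rainier001"),
    ("site:rainier001", "site_location", "loc:rainier001"),
]

# Coordinates copied from csv/locations.csv as (latitude, longitude).
LOCATIONS = {"loc:rainier001": (46.8523, -121.7603)}


def follow(subject, predicate):
    """Return every object linked from `subject` by `predicate` (hypothetical helper)."""
    return [o for s, p, o in EDGES if s == subject and p == predicate]


def sample_coordinates(sample_id):
    """Walk Sample -> produced_by -> SamplingEvent -> sample_location -> location."""
    coords = []
    for event_id in follow(sample_id, "produced_by"):
        for loc_id in follow(event_id, "sample_location"):
            coords.append(LOCATIONS[loc_id])
    return coords


print(sample_coordinates("ark:/99999/example001"))
print(sample_coordinates("ark:/99999/example002"))
```

The second call returns an empty list because, in `csv/edges.csv`, the lab-preparation event for the thin section has no `sample_location` edge.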

## File Structure

```
minimal/
├── json/
│   ├── 1_sample.json           # Single sample (simplest case)
│   └── 3_samples.json          # Three related samples
├── csv/
│   ├── samples.csv             # MaterialSampleRecords
│   ├── events.csv              # SamplingEvents
│   ├── locations.csv           # GeospatialCoordLocations
│   ├── sites.csv               # SamplingSites
│   ├── agents.csv              # Agents
│   └── edges.csv               # Relationships (for narrow format)
└── parquet/
    ├── minimal_export.parquet  # Export format (3 rows, nested)
    ├── minimal_narrow.parquet  # Narrow format (21 rows, with edges)
    └── minimal_wide.parquet    # Wide format (10 rows, p__* columns)
```

## The Three Parquet Formats

### Export Format (`minimal_export.parquet`)
- **3 rows** - one per sample
- Sample-centric with nested structs for related entities
- Best for: Simple queries on sample properties
- Coordinates pre-extracted to `sample_location_latitude` and `sample_location_longitude`
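
As a rough sketch of what "pre-extracted" means, the Python below copies nested coordinates from a record shaped like `json/1_sample.json` into flat top-level keys. The nesting path follows that JSON file and the flat key names follow the bullet above; the `extract_coordinates` function is a hypothetical illustration, not the actual export code.

```python
def extract_coordinates(record):
    """Copy nested coordinates to flat keys (illustrative, not the real exporter)."""
    location = (
        record.get("produced_by", {})
        .get("sampling_site", {})
        .get("sample_location", {})
    )
    return {
        "sample_identifier": record.get("sample_identifier"),
        "sample_location_latitude": location.get("latitude"),
        "sample_location_longitude": location.get("longitude"),
    }


# Trimmed copy of json/1_sample.json, keeping only the fields used here.
record = {
    "sample_identifier": "ark:/99999/example001",
    "produced_by": {
        "sampling_site": {
            "sample_location": {"latitude": 46.8523, "longitude": -121.7603}
        }
    },
}

row = extract_coordinates(record)
print(row)
```

Records without a location yield `None` for both coordinate keys rather than raising, which keeps a flat table rectangular.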

### Narrow Format (`minimal_narrow.parquet`)
- **21 rows** - 10 entities + 11 edge rows
- Graph-normalized with explicit `_edge_` rows
- Columns `s` (subject), `p` (predicate), `o` (object array)
- Best for: Graph traversal, flexible relationship queries

### Wide Format (`minimal_wide.parquet`)
- **10 rows** - one per entity (no edge rows)
- Relationships stored as `p__*` columns with row_id arrays
- Best for: Fast entity queries, smaller file size, analytical queries
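
The relationship between the narrow and wide layouts can be sketched in a few lines of Python: each `_edge_` row is folded into a `p__<predicate>` list on its subject entity. The rows and `row_id` values below are made up for illustration, and `narrow_to_wide` is a hypothetical helper, not the converter shipped with PQG.

```python
# Hypothetical narrow-format rows; the row_id values are invented for this sketch.
narrow_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "ark:/99999/example001"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event:example001"},
    {"row_id": 3, "otype": "GeospatialCoordLocation", "pid": "loc:rainier001"},
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 2, "p": "sample_location", "o": [3]},
]


def narrow_to_wide(rows):
    """Fold each edge row into a p__<predicate> list on its subject entity."""
    # Keep one output row per entity; edge rows are dropped from the result.
    entities = {r["row_id"]: dict(r) for r in rows if r["otype"] != "_edge_"}
    for edge in (r for r in rows if r["otype"] == "_edge_"):
        column = "p__" + edge["p"]
        entities[edge["s"]].setdefault(column, []).extend(edge["o"])
    return list(entities.values())


wide_rows = narrow_to_wide(narrow_rows)
for row in wide_rows:
    print(row)
```

Five narrow rows become three wide rows here, mirroring how the example dataset goes from 21 narrow rows to 10 wide rows.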

## Example Queries

### Query 1: Find all samples (works in all formats)

**Export format:**
```sql
SELECT sample_identifier, label
FROM read_parquet('parquet/minimal_export.parquet')
```

**Wide format:**
```sql
SELECT pid, label
FROM read_parquet('parquet/minimal_wide.parquet')
WHERE otype = 'MaterialSampleRecord'
```

**Narrow format:**
```sql
SELECT pid, label
FROM read_parquet('parquet/minimal_narrow.parquet')
WHERE otype = 'MaterialSampleRecord'
```

### Query 2: Find samples with their locations

**Wide format (uses p__* columns):**
```sql
SELECT
    s.pid as sample,
    s.label,
    loc.latitude,
    loc.longitude
FROM read_parquet('parquet/minimal_wide.parquet') s
JOIN read_parquet('parquet/minimal_wide.parquet') e
    ON e.otype = 'SamplingEvent'
    AND list_contains(s.p__produced_by, e.row_id)
JOIN read_parquet('parquet/minimal_wide.parquet') loc
    ON loc.otype = 'GeospatialCoordLocation'
    AND list_contains(e.p__sample_location, loc.row_id)
WHERE s.otype = 'MaterialSampleRecord'
```

**Narrow format (uses edge rows):**
```sql
SELECT
    s.pid as sample,
    s.label,
    loc.latitude,
    loc.longitude
FROM read_parquet('parquet/minimal_narrow.parquet') s
JOIN read_parquet('parquet/minimal_narrow.parquet') e1
    ON e1.otype = '_edge_'
    AND e1.s = s.row_id
    AND e1.p = 'produced_by'
JOIN read_parquet('parquet/minimal_narrow.parquet') ev
    ON ev.otype = 'SamplingEvent'
    AND list_contains(e1.o, ev.row_id)
JOIN read_parquet('parquet/minimal_narrow.parquet') e2
    ON e2.otype = '_edge_'
    AND e2.s = ev.row_id
    AND e2.p = 'sample_location'
JOIN read_parquet('parquet/minimal_narrow.parquet') loc
    ON loc.otype = 'GeospatialCoordLocation'
    AND list_contains(e2.o, loc.row_id)
WHERE s.otype = 'MaterialSampleRecord'
```

### Query 3: Count entities by type

```sql
SELECT otype, COUNT(*) as count
FROM read_parquet('parquet/minimal_wide.parquet')
GROUP BY otype
ORDER BY count DESC
```

Expected output:
```
MaterialSampleRecord 3
SamplingEvent 3
GeospatialCoordLocation 2
SamplingSite 1
Agent 1
```
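
The same aggregation can be sketched in plain Python, which may help when checking the expected counts without a SQL engine. The `otypes` list below is written out by hand to match the Dataset Overview; in practice it would be read from the `otype` column of the wide file.

```python
from collections import Counter

# Hand-written stand-in for the otype column, matching the Dataset Overview.
otypes = (
    ["MaterialSampleRecord"] * 3
    + ["SamplingEvent"] * 3
    + ["GeospatialCoordLocation"] * 2
    + ["SamplingSite"]
    + ["Agent"]
)

counts = Counter(otypes)
for otype, count in counts.most_common():
    print(otype, count)

# The total should equal the number of rows in the wide file.
total = sum(counts.values())
print("total", total)
```

Note that types with equal counts are tied in `ORDER BY count DESC`, so their relative order in the SQL result is not guaranteed.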

## JSON Schema Validation

The JSON files validate against the iSamples Core 1.0 schema:

```python
import json
from jsonschema import validate

# Load schema (from isamplesorg-metadata repo)
with open('path/to/iSamplesSchemaCore1.0.json') as f:
    schema = json.load(f)

# Load and validate
with open('json/1_sample.json') as f:
    sample = json.load(f)

validate(instance=sample, schema=schema)  # Raises jsonschema.ValidationError if invalid
```

## Entity Relationship Diagram

```
MaterialSampleRecord ──produced_by──► SamplingEvent ──sample_location──► GeospatialCoordLocation
    │                                      │
    │                                      └──sampling_site──► SamplingSite ──site_location──► GeospatialCoordLocation
    ├──registrant──► Agent
    └──derivedFrom──► MaterialSampleRecord (parent sample)
```

## Size Comparison

| Format | Rows | File Size | Notes |
|--------|------|-----------|-------|
| Export | 3 | 1.7 KB | Nested structs, sample-centric |
| Narrow | 21 | 4.8 KB | Explicit edge rows |
| Wide | 10 | 5.0 KB | p__* columns |

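The example files are too small to show realistic size ratios; here Wide is slightly larger than Narrow. In production datasets:
- Wide is typically 60-70% smaller than Narrow
- Export is smallest but less flexible for complex queries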

## See Also

- [PQG Specification](../../docs/PQG_SPECIFICATION.md) - Full format specification
- [Edge Types](../../pqg/edge_types.py) - All 14 iSamples edge types
- [Schema Definitions](../../pqg/schemas/) - Python schema validators
4 changes: 4 additions & 0 deletions examples/minimal/csv/agents.csv
@@ -0,0 +1,4 @@
agent_id,name,role,affiliation,contact_information
agent:jsmith,Jane Smith,collector,University of Washington,jsmith@uw.edu
agent:labtech,Lab Technician,preparer,University of Washington,
agent:curator,Collections Manager,curator,Burke Museum,
14 changes: 14 additions & 0 deletions examples/minimal/csv/edges.csv
@@ -0,0 +1,14 @@
subject_id,predicate,object_id,description
ark:/99999/example001,produced_by,event:example001,Sample was produced by this sampling event
ark:/99999/example002,produced_by,event:example002,Sample was produced by this sampling event
ark:/99999/example003,produced_by,event:example003,Sample was produced by this sampling event
ark:/99999/example002,derivedFrom,ark:/99999/example001,Thin section derived from parent rock sample
ark:/99999/example003,relatedTo,ark:/99999/example001,Sibling sample from same site
event:example001,sample_location,loc:rainier001,Event occurred at this location
event:example003,sample_location,loc:rainier002,Event occurred at this location
event:example001,sampling_site,site:rainier001,Event occurred at this site
event:example003,sampling_site,site:rainier001,Event occurred at this site
site:rainier001,site_location,loc:rainier001,Site is at this location
ark:/99999/example001,registrant,agent:jsmith,Sample registered by this agent
ark:/99999/example002,registrant,agent:jsmith,Sample registered by this agent
ark:/99999/example003,registrant,agent:jsmith,Sample registered by this agent
4 changes: 4 additions & 0 deletions examples/minimal/csv/events.csv
@@ -0,0 +1,4 @@
event_id,label,description,result_time,project,feature_of_interest,site_id,location_id,collector_id
event:example001,Mount Rainier Field Collection 2024-06-10,Field collection during summer geology survey,2024-06-10,Cascade Volcanic Monitoring Project,Recent lava flow on Mount Rainier,site:rainier001,loc:rainier001,agent:jsmith
event:example002,Lab Preparation 2024-07-01,Thin section preparation in petrology lab,2024-07-01,,,,,agent:labtech
event:example003,Mount Rainier Field Collection 2024-06-10 (Site B),Field collection 10m from first sample,2024-06-10,Cascade Volcanic Monitoring Project,Recent lava flow on Mount Rainier,site:rainier001,loc:rainier002,agent:jsmith
3 changes: 3 additions & 0 deletions examples/minimal/csv/locations.csv
@@ -0,0 +1,3 @@
location_id,latitude,longitude,elevation,obfuscated
loc:rainier001,46.8523,-121.7603,4392 m above mean sea level,false
loc:rainier002,46.8524,-121.7601,4390 m above mean sea level,false
4 changes: 4 additions & 0 deletions examples/minimal/csv/samples.csv
@@ -0,0 +1,4 @@
sample_id,label,description,last_modified_time,event_id,material_category,sample_object_type,registrant_id
ark:/99999/example001,Rock Sample MR-001 (Parent),"Basalt collected during 2024 field survey. Fresh, unweathered sample from recent lava flow.",2024-06-15T10:30:00Z,event:example001,rock,physicalspecimen,agent:jsmith
ark:/99999/example002,Rock Sample MR-001-A (Child - Thin Section),Thin section prepared from parent sample MR-001 for petrographic analysis.,2024-07-01T14:00:00Z,event:example002,rock,thinsection,agent:jsmith
ark:/99999/example003,Rock Sample MR-002,"Second basalt sample from same site, collected 10m away from MR-001.",2024-06-15T11:00:00Z,event:example003,rock,physicalspecimen,agent:jsmith
2 changes: 2 additions & 0 deletions examples/minimal/csv/sites.csv
@@ -0,0 +1,2 @@
site_id,label,description,place_name
site:rainier001,Mount Rainier Summit Area,Collection site near the summit crater rim,"Mount Rainier, Pierce County, Washington, USA"
86 changes: 86 additions & 0 deletions examples/minimal/json/1_sample.json
@@ -0,0 +1,86 @@
{
"sample_identifier": "ark:/99999/example001",
"label": "Rock Sample from Mount Rainier",
"description": "Basalt collected during 2024 field survey. Fresh, unweathered sample from recent lava flow.",
"last_modified_time": "2024-06-15T10:30:00Z",
"produced_by": {
"label": "Mount Rainier Field Collection 2024-06-10",
"description": "Field collection during summer geology survey",
"result_time": "2024-06-10",
"project": "Cascade Volcanic Monitoring Project",
"has_feature_of_interest": "Recent lava flow on Mount Rainier",
"sampling_site": {
"label": "Mount Rainier Summit Area",
"description": "Collection site near the summit crater rim",
"place_name": ["Mount Rainier", "Pierce County", "Washington", "USA"],
"sample_location": {
"latitude": 46.8523,
"longitude": -121.7603,
"elevation": "4392 m above mean sea level",
"obfuscated": false
}
},
"responsibility": [
{
"name": "Jane Smith",
"role": "collector",
"affiliation": "University of Washington",
"contact_information": "jsmith@uw.edu"
}
]
},
"has_material_category": [
{
"identifier": "https://w3id.org/isample/vocabulary/material/1.0/ite",
"label": "ite",
"scheme_name": "iSamples Material Type"
}
],
"has_context_category": [
{
"identifier": "https://w3id.org/isample/vocabulary/sampledfeature/1.0/activehumanoccupationsite",
"label": "Earth interior",
"scheme_name": "iSamples Sampled Feature Type"
}
],
"has_sample_object_type": [
{
"identifier": "https://w3id.org/isample/vocabulary/specimentype/1.0/physicalspecimen",
"label": "Physical specimen",
"scheme_name": "iSamples Specimen Type"
}
],
"keywords": [
{
"keyword": "basalt",
"scheme_name": "Free text"
},
{
"keyword": "volcanic rock",
"scheme_name": "Free text"
},
{
"keyword": "Cascade Range",
"scheme_name": "Geographic"
}
],
"registrant": {
"name": "Jane Smith",
"affiliation": "University of Washington",
"contact_information": "jsmith@uw.edu",
"role": "registrant"
},
"curation": {
"label": "UW Geology Sample Collection",
"description": "Stored in climate-controlled facility",
"curation_location": "University of Washington, Burke Museum, Room 142, Drawer B-15",
"access_constraints": ["By appointment only", "Research use only"],
"responsibility": [
{
"name": "Collections Manager",
"role": "curator",
"affiliation": "Burke Museum"
}
]
}
}