OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

🌐 Home Page | 📄 Paper | 🤗 Hugging Face | 🌊 Website

OceanPile is a large-scale multimodal corpus designed for ocean intelligence. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBench, a manually curated evaluation benchmark for rigorous assessment.

🔔 News

  • 04-2026, We released the OceanPile models.
  • 03-2026, We released the OceanPile datasets.
  • 02-2026, We launched the OceanPile project.

🌟 Overview

As illustrated, our approach begins with constructing a domain-specific knowledge graph by extracting and enriching concepts from authoritative scientific literature and structured marine data. Guided by this knowledge graph, we synthesize and validate instruction-response pairs, ensuring high-quality data that reflects the nature of marine science.
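The knowledge-graph-guided synthesis described above can be pictured with a small sketch. Everything in it is hypothetical: the concept names, the `CONCEPT_GRAPH` dictionary, the `leaf_paths` and `build_prompt` helpers, and the prompt wording are illustrative assumptions, not the pipeline actually used to build OceanInstruction.

```python
from typing import Dict, Iterator, List

# Hypothetical toy hierarchy: each concept maps to its child concepts.
CONCEPT_GRAPH: Dict[str, List[str]] = {
    "Ocean Science": ["Marine Biology", "Ocean Acoustics"],
    "Marine Biology": ["Coral Reefs"],
    "Ocean Acoustics": ["Side-Scan Sonar"],
    "Coral Reefs": [],
    "Side-Scan Sonar": [],
}


def leaf_paths(graph: Dict[str, List[str]], root: str) -> Iterator[List[str]]:
    """Yield the root-to-leaf path for every leaf concept under `root`."""
    stack = [[root]]
    while stack:
        path = stack.pop()
        children = graph.get(path[-1], [])
        if not children:
            yield path
        # Push in reverse so children are visited in their listed order.
        for child in reversed(children):
            stack.append(path + [child])


def build_prompt(path: List[str]) -> str:
    """Turn a concept path into an (assumed) instruction-synthesis prompt."""
    context = " > ".join(path[:-1])
    return f"Write one question and answer about '{path[-1]}' in the context of {context}."


prompts = [build_prompt(p) for p in leaf_paths(CONCEPT_GRAPH, "Ocean Science")]
for prompt in prompts:
    print(prompt)
```

Carrying the ancestor path into each prompt is one simple way a hierarchy can keep generated questions tied to their parent topic; a real pipeline would also need the validation step mentioned above.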

📚 Datasets

📘 Dataset Summary

| Dataset | # images | # samples | # tokens | Download |
| --- | --- | --- | --- | --- |
| OceanCorpus | - | > 300K PDF documents | > 50B | 🤗 Download |
| OceanInstruction | 25,730 | 141,124 | - | 🤗 Download |
| OceanBench | 1,367 | 1,469 | - | 🤗 Download |

More details about these datasets can be found in our paper or on Hugging Face.

🤖 Model Zoo

| Model Name | Domain | Download |
| --- | --- | --- |
| OceanGPT-o-OceanPile-Sci | Marine Science VQA | 🤗 Download |
| OceanGPT-basic-OceanPile-Sci | Marine Science QA | 🤗 Download |
| OceanGPT-o-OceanPile-Sonar | Sonar Image VQA | 🤗 Download |
| OceanGPT-o-OceanPile-Bio | Marine Biology VQA | 🤗 Download |

🌊 Quick Start Guide

📦 Environment Setup

Create and activate a dedicated conda environment:

conda create -n oceanbench python=3.11
conda activate oceanbench
pip install -r requirements.txt

📥 Dataset Download

Option 1: Using HuggingFace CLI

huggingface-cli download --repo-type dataset --resume-download zjunlp/OceanBenchmark --local-dir OceanBenchmark

Option 2: Using Python

from datasets import load_dataset

# Load the VQA evaluation subset
ds_test = load_dataset("zjunlp/OceanBenchmark", "Ocean_Science_VQA", split="test")
print(ds_test[0])
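Printing a raw record can be hard to read when it contains images or long strings. A minimal sketch of a helper that summarizes each field's type and a short preview follows; `describe_record` is a hypothetical name, and the example record's field names are made up for illustration, not the actual OceanBenchmark schema.

```python
from typing import Any, Dict, Mapping


def describe_record(record: Mapping[str, Any], max_len: int = 60) -> Dict[str, str]:
    """Map each field name to its type name plus a truncated preview."""
    summary = {}
    for key, value in record.items():
        preview = repr(value)
        if len(preview) > max_len:
            preview = preview[:max_len] + "..."
        summary[key] = f"{type(value).__name__}: {preview}"
    return summary


# Hypothetical record; with the real dataset, pass ds_test[0] instead.
example = {"question": "What species is shown?", "answer": "A", "options": ["A", "B"]}
for field, info in describe_record(example).items():
    print(f"{field} -> {info}")
```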

🤖 Model Download

Option 1: Git LFS

git lfs install
git clone https://huggingface.co/zjunlp/OceanGPT-o-8B-OceanPile-Sci

Option 2: HuggingFace CLI

huggingface-cli download --resume-download zjunlp/OceanGPT-o-8B-OceanPile-Sci \
    --local-dir OceanGPT-o-8B-OceanPile-Sci \
    --local-dir-use-symlinks False

Option 3: Python (Transformers)

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "zjunlp/OceanGPT-o-8B-OceanPile-Sci",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("zjunlp/OceanGPT-o-8B-OceanPile-Sci")

🖼️ Inference for MLLMs (Multimodal)

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load model on available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zjunlp/OceanGPT-o-8B-OceanPile-Sci",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("zjunlp/OceanGPT-o-8B-OceanPile-Sci")

# Prepare message with image and text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Tokenize inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Decode output
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
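The `messages` payload above hard-codes a `file://` URI. A small sketch of building the same structure from an ordinary local path and a question is shown below; `build_vqa_messages` is a hypothetical helper name, and the file name `sonar_scan.jpg` is a placeholder.

```python
from pathlib import Path
from typing import Any, Dict, List


def build_vqa_messages(image_path: str, question: str) -> List[Dict[str, Any]]:
    """Build a single-turn user message holding one image and one question."""
    # as_uri() requires an absolute path, so resolve the path first.
    image_uri = Path(image_path).expanduser().resolve().as_uri()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_uri},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_vqa_messages("sonar_scan.jpg", "Describe this image.")
print(messages[0]["content"][0]["image"])
```

The returned list has the same shape as the `messages` variable in the snippet above, so it should be usable as the first argument to `processor.apply_chat_template`.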

💬 Inference for LLMs (Text-only)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zjunlp/OceanGPT-basic-30B-OceanPile-Sci"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

question = "<Your Question>"
messages = [{"role": "user", "content": question}]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)

# Tokenize and generate
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)

# Extract and decode output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Remove thinking tokens if present
try:
    index = len(output_ids) - output_ids[::-1].index(151668)  # </think> token ID
except ValueError:
    index = 0

content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)
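The thinking-token handling in the snippet above can be factored into a function that is easy to check without loading the model. This is a sketch of the same logic; `split_thinking` is a hypothetical name, and the token ID 151668 is taken from the snippet above rather than verified against the tokenizer.

```python
from typing import List, Tuple

END_THINK_ID = 151668  # </think> token ID, as used in the snippet above


def split_thinking(
    output_ids: List[int], end_think_id: int = END_THINK_ID
) -> Tuple[List[int], List[int]]:
    """Split generated IDs into (thinking_ids, answer_ids) at the last end-of-thinking token."""
    try:
        # Position just after the last occurrence of the marker.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No marker present: treat everything as the answer.
        index = 0
    return output_ids[:index], output_ids[index:]


thinking_ids, answer_ids = split_thinking([11, 12, END_THINK_ID, 21, 22])
print(answer_ids)
```

With a real generation, `answer_ids` would be passed to `tokenizer.decode(..., skip_special_tokens=True)` exactly as `output_ids[index:]` is above.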

📊 Evaluation with OceanBenchmark

See the eval folder for more details.

🟢 Marine Science QA Evaluation (API)

python eval/sci_eval.py --input_dir "YOUR_DATA_DIR" --type qa --eval_model gpt-4o

🟢 Marine Science VQA Evaluation (API)

python eval/eval.py --input_dir "YOUR_DATA_DIR" --type vqa --eval_model gpt-4o

🔵 Marine Science VQA Evaluation (Local Model)

python eval/eval.py --input_dir "YOUR_DATA_DIR" --type vqa \
    --eval_model qwen3-vl --local \
    --local_model_path "YOUR_LOCAL_MODEL_PATH"
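When running several evaluations in a row, it can help to assemble the command line programmatically. The sketch below uses only the flags that appear in the commands in this section (`--input_dir`, `--type`, `--eval_model`, `--local`, `--local_model_path`); `build_eval_command` is a hypothetical helper, and whether the script accepts other values or options is not something this sketch assumes.

```python
import shlex
from typing import List, Optional


def build_eval_command(
    input_dir: str,
    eval_type: str,
    eval_model: str,
    local_model_path: Optional[str] = None,
    script: str = "eval/eval.py",
) -> List[str]:
    """Assemble an argv list for the evaluation script from the flags shown above."""
    if eval_type not in {"qa", "vqa"}:
        raise ValueError(f"unsupported eval type: {eval_type!r}")
    cmd = [
        "python", script,
        "--input_dir", input_dir,
        "--type", eval_type,
        "--eval_model", eval_model,
    ]
    # Local-model runs add the two extra flags used in the local example.
    if local_model_path is not None:
        cmd += ["--local", "--local_model_path", local_model_path]
    return cmd


cmd = build_eval_command(
    "YOUR_DATA_DIR", "vqa", "qwen3-vl", local_model_path="YOUR_LOCAL_MODEL_PATH"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.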

License

This dataset is released under the MIT License.

🚩 Citation

If you find the OceanPile paper or datasets helpful, please cite:

@misc{xue2026oceanpilelargescalemultimodalocean,
      title={OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models}, 
      author={Yida Xue and Ningyu Zhang and Tingwei Wu and Zhe Ma and Daxiong Ji and Zhao Wang and Guozhou Zheng and Huajun Chen},
      year={2026},
      eprint={2605.00877},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2605.00877}, 
}

Citations for our other related works:

@misc{xue2025oceangymbenchmarkenvironmentunderwater,
      title={OceanGym: A Benchmark Environment for Underwater Embodied Agents}, 
      author={Yida Xue and Mingjun Mao and Xiangyuan Ru and Yuqi Zhu and Baochang Ren and Shuofei Qiao and Mengru Wang and Shumin Deng and Xinyu An and Ningyu Zhang and Ying Chen and Huajun Chen},
      year={2025},
      eprint={2509.26536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26536}, 
}

@article{bi2024oceangpt,
  title={OceanGPT: A Large Language Model for Ocean Science Tasks},
  author={Bi, Zhen and Zhang, Ningyu and Xue, Yida and Ou, Yixin and Ji, Daxiong and Zheng, Guozhou and Chen, Huajun},
  journal={arXiv preprint arXiv:2310.02031},
  year={2024}
}
