OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

🌐 Home Page 📄 Paper 🤗 Hugging Face 🌊 Website

OceanPile is a large-scale multimodal corpus designed for ocean intelligence. It comprises three key components: OceanCorpus, a unified collection that integrates sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBench, a manually curated benchmark for rigorous evaluation.

🔔 News

04-2026: We released the OceanPile models.
03-2026: We released the OceanPile datasets.
02-2026: We launched the OceanPile project.

Contents:

🌟 Overview
🔔 News
📺 Quick Start
📚 Datasets
🚩 Citation

🌟 Overview

Our approach begins by constructing a domain-specific knowledge graph, extracting and enriching concepts from authoritative scientific literature and structured marine data. Guided by this knowledge graph, we synthesize and validate instruction-response pairs so that the resulting data is high quality and faithful to marine science.

📚 Datasets

📘 Dataset Summary

Dataset	# images	# samples	# tokens	Download
OceanCorpus	-	> 300K PDF documents	> 50B	🤗 Download
OceanInstruction	25,730	141,124	-	🤗 Download
OceanBench	1,367	1,469	-	🤗 Download

More details about these datasets can be found in our Paper or Hugging Face.

🤖 Model Zoo

Model Name	Domain	Download
OceanGPT-o-OceanPile-Sci	Marine Science VQA	🤗 Download
OceanGPT-basic-OceanPile-Sci	Marine Science QA	🤗 Download
OceanGPT-o-OceanPile-Sonar	Sonar Image VQA	🤗 Download
OceanGPT-o-OceanPile-Bio	Marine Biology VQA	🤗 Download

🌊 Quick Start Guide

📦 Environment Setup

Create and activate a dedicated conda environment:

conda create -n oceanbench python=3.11
conda activate oceanbench
pip install -r requirements.txt

📥 Dataset Download

Option 1: Using HuggingFace CLI

huggingface-cli download --repo-type dataset --resume-download zjunlp/OceanBenchmark --local-dir OceanBenchmark

Option 2: Using Python

from datasets import load_dataset

# Load the VQA evaluation subset
ds_test = load_dataset("zjunlp/OceanBenchmark", "Ocean_Science_VQA", split="test")
print(ds_test[0])
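The column names of the subset are not listed here, so it can help to inspect one record before writing evaluation code. The helper below is a minimal sketch: `summarize_record` is a hypothetical name, and the sample record uses made-up fields purely for illustration, not the real OceanBenchmark schema.

```python
def summarize_record(record, max_len=60):
    """Return {field: "type: shortened repr"} for one dataset record."""
    summary = {}
    for key, value in record.items():
        text = repr(value)
        # Shorten long values (e.g. long answers or image objects) for display
        if len(text) > max_len:
            text = text[:max_len] + "..."
        summary[key] = f"{type(value).__name__}: {text}"
    return summary


# Made-up record, for illustration only (not the real schema)
example = {"question": "What species is shown?", "answer": "A"}
for field, description in summarize_record(example).items():
    print(field, "->", description)
```

With the dataset loaded as above, `summarize_record(ds_test[0])` would give the same kind of overview for a real record.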

🤖 Model Download

Option 1: Git LFS

git lfs install
git clone https://huggingface.co/zjunlp/OceanGPT-o-8B-OceanPile-Sci

Option 2: HuggingFace CLI

huggingface-cli download --resume-download zjunlp/OceanGPT-o-8B-OceanPile-Sci \
    --local-dir OceanGPT-o-8B-OceanPile-Sci \
    --local-dir-use-symlinks False

Option 3: Python (Transformers)

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "zjunlp/OceanGPT-o-8B-OceanPile-Sci",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("zjunlp/OceanGPT-o-8B-OceanPile-Sci")

🖼️ Inference for MLLMs (Multimodal)

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load model on available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zjunlp/OceanGPT-o-8B-OceanPile-Sci",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("zjunlp/OceanGPT-o-8B-OceanPile-Sci")

# Prepare message with image and text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Tokenize inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Decode output
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
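The trimming step matters because `generate` returns the prompt tokens followed by the newly generated tokens, so decoding the raw output would repeat the prompt. The sketch below isolates that step on toy data; `trim_prompt_tokens` is a hypothetical helper name, and plain Python lists stand in for the tensors used above.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs: each output starts with a copy of its prompt
prompts = [[101, 7, 8], [101, 9, 3]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 3, 55]]
print(trim_prompt_tokens(prompts, outputs))  # [[42, 43], [55]]
```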

💬 Inference for LLMs (Text-only)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zjunlp/OceanGPT-basic-30B-OceanPile-Sci"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

question = "<Your Question>"
messages = [{"role": "user", "content": question}]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)

# Tokenize and generate
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)

# Extract and decode output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Remove thinking tokens if present
try:
    index = len(output_ids) - output_ids[::-1].index(151668)  # </think> token ID
except ValueError:
    index = 0

content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)
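The thinking-token removal above can be factored into a small function, which makes the edge cases easier to check. This is a minimal sketch on toy token IDs: `split_thinking` is a hypothetical name, and the only value taken from the snippet above is the `</think>` token ID 151668.

```python
END_THINK_ID = 151668  # </think> token ID, as used in the snippet above


def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Return (thinking_ids, answer_ids), split after the last </think>."""
    try:
        # Search from the end so the last </think> marks the boundary
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No </think> token: everything is treated as the answer
        index = 0
    return output_ids[:index], output_ids[index:]


print(split_thinking([11, 12, END_THINK_ID, 21, 22]))
print(split_thinking([21, 22]))
```

When `enable_thinking=False` is passed to the chat template, as above, the second case (no `</think>` token) is the expected one.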

📊 Evaluation with OceanBenchmark

For more details, see the eval folder.

🟢 Marine Science VQA Evaluation (API)

python eval/sci_eval.py --input_dir "YOUR_DATA_DIR" --type vqa --eval_model gpt-4o

🟢 Marine Science QA Evaluation (API)

python eval/eval.py --input_dir "YOUR_DATA_DIR" --type qa --eval_model gpt-4o

🔵 Marine Science VQA Evaluation (Local Model)

python eval/eval.py --input_dir "YOUR_DATA_DIR" --type vqa \
    --eval_model qwen3-vl --local \
    --local_model_path "YOUR_LOCAL_MODEL_PATH"
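When running several evaluations in a row, it may be convenient to assemble these commands programmatically. The sketch below uses only the flags shown above; `build_eval_command` is a hypothetical helper, the default script path is an assumption, and the check against `qa` and `vqa` reflects only the values that appear in this guide.

```python
import shlex


def build_eval_command(input_dir, task, eval_model, local_model_path=None,
                       script="eval/eval.py"):
    """Assemble an evaluation command as an argument list."""
    if task not in ("qa", "vqa"):
        raise ValueError("task must be 'qa' or 'vqa'")
    cmd = ["python", script, "--input_dir", input_dir,
           "--type", task, "--eval_model", eval_model]
    # A local model needs both the --local switch and the model path
    if local_model_path is not None:
        cmd += ["--local", "--local_model_path", local_model_path]
    return cmd


cmd = build_eval_command("YOUR_DATA_DIR", "vqa", "qwen3-vl",
                         local_model_path="YOUR_LOCAL_MODEL_PATH")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.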

🔏 License

This dataset is released under the MIT License.

🚩 Citation

If the OceanPile paper or datasets are helpful to you, please cite them as follows:

@misc{xue2026oceanpilelargescalemultimodalocean,
      title={OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models}, 
      author={Yida Xue and Ningyu Zhang and Tingwei Wu and Zhe Ma and Daxiong Ji and Zhao Wang and Guozhou Zheng and Huajun Chen},
      year={2026},
      eprint={2605.00877},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2605.00877}, 
}

💐 Citations for our other related works:

@misc{xue2025oceangymbenchmarkenvironmentunderwater,
      title={OceanGym: A Benchmark Environment for Underwater Embodied Agents}, 
      author={Yida Xue and Mingjun Mao and Xiangyuan Ru and Yuqi Zhu and Baochang Ren and Shuofei Qiao and Mengru Wang and Shumin Deng and Xinyu An and Ningyu Zhang and Ying Chen and Huajun Chen},
      year={2025},
      eprint={2509.26536},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26536}, 
}

@article{bi2024oceangpt,
  title={OceanGPT: A Large Language Model for Ocean Science Tasks},
  author={Bi, Zhen and Zhang, Ningyu and Xue, Yida and Ou, Yixin and Ji, Daxiong and Zheng, Guozhou and Chen, Huajun},
  journal={arXiv preprint arXiv:2310.02031},
  year={2024}
}
