OceanPile is a large-scale multimodal corpus designed for ocean intelligence. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBench, a manually curated evaluation benchmark for rigorous assessment.
- 04-2026: We released the OceanPile models.
- 03-2026: We released the OceanPile datasets.
- 02-2026: We launched the OceanPile project.
As illustrated, our approach begins with constructing a domain-specific knowledge graph by extracting and enriching concepts from authoritative scientific literature and structured marine data. Guided by this knowledge graph, we synthesize and validate instruction-response pairs, ensuring high-quality data that reflects the nature of marine science.
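The flow described above can be sketched in a few lines. This is a minimal illustration, assuming a toy hierarchical concept graph and a fixed question template standing in for the actual synthesis model; `Concept`, `synthesize`, and `validate` are hypothetical names and are not part of the OceanPile codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a toy hierarchical concept graph (hypothetical structure)."""
    name: str
    description: str
    children: list["Concept"] = field(default_factory=list)


def iter_paths(node, prefix=()):
    """Yield (path, concept) for every node; path runs from the root down."""
    path = prefix + (node.name,)
    yield path, node
    for child in node.children:
        yield from iter_paths(child, path)


def synthesize(root):
    """Turn each non-root concept into one instruction-response pair.

    A fixed template stands in for the model-based synthesis step.
    """
    pairs = []
    for path, concept in iter_paths(root):
        if len(path) < 2:  # skip the root: it has no parent context
            continue
        pairs.append({
            "concept_path": " > ".join(path),
            "instruction": f"In the context of {path[-2]}, what is {concept.name}?",
            "response": concept.description,
        })
    return pairs


def validate(pair, min_words=5):
    """Placeholder quality filter: keep pairs whose response is long enough."""
    return len(pair["response"].split()) >= min_words


# Toy graph; the concepts and descriptions are illustrative only.
graph = Concept("marine science", "The study of the ocean.", [
    Concept("marine biology", "The study of organisms that live in the ocean.", [
        Concept("coral bleaching",
                "Loss of symbiotic algae from coral tissue under heat stress."),
    ]),
    Concept("sonar", "Too short."),
])

candidates = synthesize(graph)
dataset = [p for p in candidates if validate(p)]
for pair in dataset:
    print(pair["concept_path"], "|", pair["instruction"])
```

Carrying the concept path with each pair keeps every sample traceable to the graph node that produced it, which makes coverage across the hierarchy easy to audit.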
| Dataset | # images | # samples | # tokens | Download |
|---|---|---|---|---|
| OceanCorpus | - | > 300K PDF documents | > 50B | 🤗 Download |
| OceanInstruction | 25,730 | 141,124 | - | 🤗 Download |
| OceanBench | 1,367 | 1,469 | - | 🤗 Download |
More details about these datasets can be found in our paper or on Hugging Face.
| Model Name | Domain | Download |
|---|---|---|
| OceanGPT-o-OceanPile-Sci | Marine Science VQA | 🤗 Download |
| OceanGPT-basic-OceanPile-Sci | Marine Science QA | 🤗 Download |
| OceanGPT-o-OceanPile-Sonar | Sonar Image VQA | 🤗 Download |
| OceanGPT-o-OceanPile-Bio | Marine Biology VQA | 🤗 Download |
Create and activate a dedicated conda environment:
conda create -n oceanbench python=3.11
conda activate oceanbench
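pip install -r requirements.txt
Download the OceanBench dataset with the Hugging Face CLI:
huggingface-cli download --repo-type dataset --resume-download zjunlp/OceanBenchmark --local-dir OceanBenchmark
Alternatively, load it with the datasets library:
from datasets import load_dataset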
# Load the VQA evaluation subset
ds_test = load_dataset("zjunlp/OceanBenchmark", "Ocean_Science_VQA", split="test")
print(ds_test[0])
Download the model weights with Git LFS:
git lfs install
git clone https://huggingface.co/zjunlp/OceanGPT-o-8B-OceanPile-Sci
Alternatively, download them with the Hugging Face CLI:
huggingface-cli download --resume-download zjunlp/OceanGPT-o-8B-OceanPile-Sci \
    --local-dir OceanGPT-o-8B-OceanPile-Sci \
    --local-dir-use-symlinks False
Load the model and processor with transformers:
from transformers import AutoModelForImageTextToText, AutoProcessor
model = AutoModelForImageTextToText.from_pretrained(
    "zjunlp/OceanGPT-o-8B-OceanPile-Sci",
    dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("zjunlp/OceanGPT-o-8B-OceanPile-Sci")
Run inference with the multimodal model:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
# Load model on available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "zjunlp/OceanGPT-o-8B-OceanPile-Sci",
    dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("zjunlp/OceanGPT-o-8B-OceanPile-Sci")
# Prepare message with image and text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Tokenize inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Keep only the newly generated tokens by dropping the echoed prompt
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode output
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
Run inference with the text-only model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "zjunlp/OceanGPT-basic-30B-OceanPile-Sci"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
question = "<Your Question>"
messages = [{"role": "user", "content": question}]
# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
# Tokenize and generate
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
# Extract and decode output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# Remove thinking tokens if present
try:
    index = len(output_ids) - output_ids[::-1].index(151668)  # </think> token ID
except ValueError:
    index = 0
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)
For more details on evaluation, see the eval folder.
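The two post-processing steps in the inference snippets above, dropping the echoed prompt tokens and removing thinking tokens, can be isolated as plain list operations and checked without loading a model. This is a minimal sketch with made-up token IDs; `trim_prompt` and `strip_thinking` are hypothetical helper names, and 151668 is the `</think>` token ID taken from the snippet above.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


def strip_thinking(output_ids, end_think_id=151668):
    """Return the tokens after the last end-of-thinking marker.

    If the marker is absent, the whole sequence is returned unchanged.
    """
    try:
        # Index just past the last occurrence of the marker.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        index = 0
    return output_ids[index:]


# Made-up token IDs, for illustration only.
prompts = [[1, 2, 3], [4, 5]]
generated = [[1, 2, 3, 10, 11], [4, 5, 20]]
print(trim_prompt(prompts, generated))

with_thinking = [7, 8, 151668, 30, 31]
without_thinking = [30, 31]
print(strip_thinking(with_thinking))
print(strip_thinking(without_thinking))
```

Searching the reversed list finds the last marker, so everything up to and including it is discarded even if the marker appears more than once.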
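Evaluate QA results using gpt-4o as the evaluation model:
python eval/sci_eval.py --input_dir "YOUR_DATA_DIR" --type qa --eval_model gpt-4o
Evaluate VQA results using gpt-4o as the evaluation model:
python eval/eval.py --input_dir "YOUR_DATA_DIR" --type vqa --eval_model gpt-4o
Evaluate VQA results using a local evaluation model:
python eval/eval.py --input_dir "YOUR_DATA_DIR" --type vqa \
    --eval_model qwen3-vl --local \
    --local_model_path "YOUR_LOCAL_MODEL_PATH"
This dataset is released under the MIT License.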
If the OceanPile paper or datasets are helpful, please cite:
@misc{xue2026oceanpilelargescalemultimodalocean,
title={OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models},
author={Yida Xue and Ningyu Zhang and Tingwei Wu and Zhe Ma and Daxiong Ji and Zhao Wang and Guozhou Zheng and Huajun Chen},
year={2026},
eprint={2605.00877},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2605.00877},
}
Citations for our other related works:
@misc{xue2025oceangymbenchmarkenvironmentunderwater,
title={OceanGym: A Benchmark Environment for Underwater Embodied Agents},
author={Yida Xue and Mingjun Mao and Xiangyuan Ru and Yuqi Zhu and Baochang Ren and Shuofei Qiao and Mengru Wang and Shumin Deng and Xinyu An and Ningyu Zhang and Ying Chen and Huajun Chen},
year={2025},
eprint={2509.26536},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.26536},
}
@article{bi2024oceangpt,
title={OceanGPT: A Large Language Model for Ocean Science Tasks},
author={Bi, Zhen and Zhang, Ningyu and Xue, Yida and Ou, Yixin and Ji, Daxiong and Zheng, Guozhou and Chen, Huajun},
journal={arXiv preprint arXiv:2310.02031},
year={2024}
}

