I'm working on training this model on the FlintstonesSV dataset. I run the training script on a GPU server with 8x 3080 Ti cards (12 GB of VRAM each). Is this server able to train this model, and what is the maximum memory usage during training?

The training process seems to be stuck at `trainer.fit(model, dataloader, ckpt_path=args.train_model_file)`. Here is the log:
```
Global seed set to 0
clip 4 new tokens added
blip 1 new tokens added
clip 4 new tokens added
blip 1 new tokens added
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Global seed set to 0
clip 4 new tokens added
blip 1 new tokens added
Global seed set to 0
clip 4 new tokens added
blip 1 new tokens added
clip 4 new tokens added
blip 1 new tokens added
clip 4 new tokens added
blip 1 new tokens added
Global seed set to 0
Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
clip 4 new tokens added
blip 1 new tokens added
clip 4 new tokens added
blip 1 new tokens added
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth
Global seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[2022-12-26 18:54:48,402][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth
Global seed set to 0
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
[2022-12-26 18:54:51,472][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 2
Global seed set to 0
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[2022-12-26 18:54:55,093][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 3
[2022-12-26 18:54:55,097][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-12-26 18:54:55,098][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
[2022-12-26 18:54:55,103][torch.distributed.distributed_c10d][INFO] - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-12-26 18:54:55,106][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
[2022-12-26 18:54:55,106][torch.distributed.distributed_c10d][INFO] - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
```
No further output appears even after waiting for 30 minutes.
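To narrow down where the startup stalls, NCCL's own logging can be turned up and the process-group setup can be tested in isolation from the training code. Below is a minimal sketch, not taken from this repo: the `NCCL_*` variables are standard NCCL environment knobs, the 4-process world size matches the `gpu_ids` in my config below, and disabling P2P is a commonly suggested workaround for consumer GPUs that hang during collective initialization.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Surface NCCL's internal logs to see where initialization stalls.
os.environ["NCCL_DEBUG"] = "INFO"
# Workaround often suggested for consumer GPUs (e.g. 3080 Ti) whose
# peer-to-peer paths hang: fall back to shared-memory/socket transport.
os.environ["NCCL_P2P_DISABLE"] = "1"


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)  # if this hangs, the problem is NCCL, not the model
    print(f"rank {rank}: all_reduce ok, value = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # one process per GPU in gpu_ids
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If this script also hangs at the `all_reduce`, the problem is in the NCCL transport on this machine rather than in the model or the dataloader.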
The config.yaml is:
```yaml
# device
mode: train  # train sample
gpu_ids: [ 0, 1, 2, 3 ]  # gpu ids
batch_size: 1  # batch size, each item denotes one story
num_workers: 4  # number of workers
num_cpu_cores: -1  # number of cpu cores
seed: 0  # random seed
ckpt_dir: results/  # checkpoint directory
run_name: first_try  # name for this run

# task
dataset: flintstones  # pororo flintstones vistsis vistdii
task: visualization  # continuation visualization

# train
init_lr: 1e-5  # initial learning rate
warmup_epochs: 1  # warmup epochs
max_epochs: 50  # max epochs
train_model_file:  # model file for resume, none for train from scratch
freeze_clip: False  # whether to freeze clip
freeze_blip: False  # whether to freeze blip
freeze_resnet: False  # whether to freeze resnet

# sample
# test_model_file:  # model file for test
# calculate_fid: True  # whether to calculate FID scores
# scheduler: ddim  # ddim pndm
# guidance_scale: 6  # guidance scale
# num_inference_steps: 250  # number of inference steps
# sample_output_dir: /path/to/save_samples  # output directory

# pororo:
#   hdf5_file: /path/to/pororo.h5
#   max_length: 85
#   new_tokens: [ "pororo", "loopy", "eddy", "harry", "poby", "tongtong", "crong", "rody", "petty" ]
#   clip_embedding_tokens: 49416
#   blip_embedding_tokens: 30530

flintstones:
  hdf5_file: Downloads/save_hdf5_files/flintstones.hdf5
  max_length: 91
  new_tokens: [ "fred", "barney", "wilma", "betty", "pebbles", "dino", "slate" ]
  clip_embedding_tokens: 49412
  blip_embedding_tokens: 30525

# vistsis:
#   hdf5_file: /path/to/vist.h5
#   max_length: 100
#   clip_embedding_tokens: 49408
#   blip_embedding_tokens: 30524

# vistdii:
#   hdf5_file: /path/to/vist.h5
#   max_length: 65
#   clip_embedding_tokens: 49408
#   blip_embedding_tokens: 30524

hydra:
  run:
    dir: .
  output_subdir: null
hydra/job_logging: disabled
hydra/hydra_logging: disabled
```
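On the memory question: once training actually gets past initialization, peak per-GPU usage can be read off with PyTorch's built-in memory statistics. A minimal sketch using standard `torch.cuda` calls (where and how often to call it is just one choice):

```python
import torch


def report_peak_memory(device: int = 0) -> None:
    # Peak memory held by tensors on this GPU since the process started.
    allocated = torch.cuda.max_memory_allocated(device) / 2**30
    # Peak memory reserved by the caching allocator (closer to nvidia-smi).
    reserved = torch.cuda.max_memory_reserved(device) / 2**30
    total = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"cuda:{device}: peak allocated {allocated:.2f} GiB, "
          f"peak reserved {reserved:.2f} GiB, total {total:.2f} GiB")


# e.g. after a few training steps on rank 0:
# report_peak_memory(0)
```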