Line 161 in 261859e:

```python
self.normC2 = Fp32LayerNorm(dim, bias=False)
```

Line 86 in 72feb0c:

```python
self.w1o = nn.Linear(dim, dim, bias=False)
```
These are not used in the last layer and should be moved into an `if not last` statement. Unused parameters make some distributed algos slow and sad: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
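For concreteness, a minimal sketch of the suggested guard; the `last` flag and the surrounding block structure here are hypothetical, and plain `nn.LayerNorm` stands in for the repo's `Fp32LayerNorm`:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int, last: bool = False):
        super().__init__()
        self.last = last
        self.w1 = nn.Linear(dim, dim, bias=False)
        if not last:
            # These modules only feed the next block, so the final layer
            # can skip creating them and DDP never sees parameters that
            # receive no gradient.
            self.normC2 = nn.LayerNorm(dim)  # stand-in for Fp32LayerNorm
            self.w1o = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        x = self.w1(x)
        if not self.last:
            x = self.w1o(self.normC2(x))
        return x
```

The alternative, passing `find_unused_parameters=True` to `DistributedDataParallel`, also works, but it pays an extra autograd-graph traversal every iteration, which is exactly the slowdown the linked note describes.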
Edit: Also (unless I misread your code), you seem to put only the timestep embedding into the AdaLN scale/shift thingy, but the SD3 paper also feeds in a pooled vector made from the image description. Did you find timestep-only conditioning worked better?
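For reference, a hedged sketch of the SD3-style conditioning path being asked about (names are illustrative, not the repo's): the pooled caption embedding is summed with the timestep embedding before the modulation MLP produces the shift/scale/gate terms.

```python
from typing import Optional

import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Produces shift, scale, and gate for one modulated sublayer.
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, t_emb: torch.Tensor,
                text_emb: Optional[torch.Tensor] = None):
        # SD3 fuses the timestep embedding with a pooled caption embedding;
        # timestep-only conditioning just drops the second term.
        cond = t_emb if text_emb is None else t_emb + text_emb
        shift, scale, gate = self.mlp(cond).chunk(3, dim=-1)
        return shift, scale, gate
```

A block would then apply it roughly as `x = x + gate * sublayer(norm(x) * (1 + scale) + shift)`, with the usual broadcast over the sequence dimension.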
Edit 2: Also also, did your muP optimization lead that far from a 1e-4 learning rate? Can you share the results of your hparam search?