Author: Moh Rafik
Duration: Month 3 (Weeks 9–12)
Learning Time: 6–7 hrs/day
Goal: Understand, implement, and experiment with modern Generative AI techniques — VAEs, GANs, Diffusion Models, and Transformers.
This repository is part of the AI Learning Roadmap (3-Month Intensive) that includes:
- Machine Learning Basics
- Deep Learning Foundations
- Generative AI Projects ← (You are here)
This repository focuses on:
- Building a conceptual and practical understanding of Generative AI
- Implementing and training core models like Autoencoders, VAEs, GANs, and Diffusion Models
- Exploring modern architectures like Transformers and LLMs
- Building mini-projects for text and image generation
generative-ai-projects/
│
├── 01_autoencoders/
│ ├── basic_autoencoder.ipynb
│ ├── variational_autoencoder.ipynb
│
├── 02_gans/
│ ├── vanilla_gan.ipynb
│ ├── dcgan_faces.ipynb
│ ├── cycle_gan_intro.ipynb
│
├── 03_diffusion_models/
│ ├── diffusion_intro.ipynb
│ ├── stable_diffusion_walkthrough.ipynb
│
├── 04_transformers_llms/
│ ├── gpt2_finetuning.ipynb
│ ├── text_generation_huggingface.ipynb
│ ├── prompt_engineering_examples.ipynb
│
├── projects/
│ ├── text_to_image_pipeline.ipynb
│ ├── custom_gan_dataset.ipynb
│
├── assets/
│ ├── images/
│ └── figures/
│
└── README.md
| Week | Focus | Topics |
|---|---|---|
| 9 | Autoencoders & VAEs | Representation learning, latent space exploration |
| 10 | GANs | Adversarial training, DCGANs, Conditional GANs |
| 11 | Diffusion Models | Denoising processes, Stable Diffusion basics |
| 12 | LLMs & Prompt Engineering | GPT, fine-tuning, text-to-image pipelines |
- Concept: Learn compressed representations (encodings) of data.
- Structure: Encoder → Bottleneck → Decoder.
- Loss Function: Reconstruction loss (MSE).
- Applications: Noise removal, dimensionality reduction.
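The encoder, bottleneck, and decoder structure above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the notebook's actual code: the layer sizes, the `Autoencoder` class name, and the random stand-in batch are all assumptions.

```python
import torch
from torch import nn


class Autoencoder(nn.Module):
    """Encoder -> bottleneck -> decoder for flattened 28x28 inputs (assumed size)."""

    def __init__(self, input_dim: int = 784, bottleneck_dim: int = 32):
        super().__init__()
        # Encoder compresses the input down to the bottleneck code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),
        )
        # Decoder maps the code back to input space; Sigmoid keeps outputs in [0, 1].
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


torch.manual_seed(0)
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch in [0, 1]; real data would come from a DataLoader.
x = torch.rand(16, 784)

# One training step with the MSE reconstruction loss.
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For denoising, the same model would typically be fed a corrupted input while the loss is still computed against the clean `x`.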
- Extension of AEs: Introduces probabilistic latent variables.
- Loss: Reconstruction term plus a KL divergence regularizer (the negative of the ELBO).
- Formula (the ELBO, which training maximizes):
\( L = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}[q(z|x) \,\|\, p(z)] \)
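As a sketch of how this objective is usually turned into a loss, the snippet below assumes a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli decoder (so the reconstruction term becomes binary cross-entropy). The function names `reparameterize` and `vae_loss` are illustrative, and minimizing the returned value corresponds to maximizing the formula above.

```python
import torch
import torch.nn.functional as F


def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so gradients can flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std


def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO: reconstruction term plus KL[q(z|x) || N(0, I)]."""
    # -E_q[log p(x|z)] under a Bernoulli decoder (assumption), summed over the batch.
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, recon, kl


torch.manual_seed(0)
x = torch.rand(4, 784)                           # stand-in inputs in [0, 1]
recon_x = torch.rand(4, 784).clamp(0.01, 0.99)   # stand-in decoder outputs
mu, logvar = torch.ones(4, 2), torch.zeros(4, 2)

z = reparameterize(mu, logvar)
total, recon, kl = vae_loss(recon_x, x, mu, logvar)
```

With `mu = 0` and `logvar = 0` the posterior equals the prior and the KL term vanishes, which is a handy sanity check when debugging.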
- Components: Generator (G) and Discriminator (D) in a minimax game.
- Objective:
\( \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \)
- Common Variants: DCGAN, CycleGAN, StyleGAN.
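One alternating update for this objective might look like the sketch below. The tiny MLPs, the toy 2-D "real" data, and the hyperparameters are assumptions chosen for illustration. Note that the generator step uses the non-saturating loss (maximize log D(G(z))) suggested in the original GAN paper rather than the literal minimax term, because it tends to give stronger gradients early in training.

```python
import torch
from torch import nn

torch.manual_seed(0)
latent_dim, data_dim, batch = 8, 2, 32

# Toy generator and discriminator; D outputs raw logits, not probabilities.
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch, data_dim) + 3.0  # stand-in for a batch of real samples
ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

# Discriminator step: push D(real) toward 1 and D(fake) toward 0.
# detach() stops gradients from reaching the generator here.
fake = G(torch.randn(batch, latent_dim)).detach()
d_loss = bce(D(real), ones) + bce(D(fake), zeros)
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step (non-saturating): push D(G(z)) toward 1.
g_loss = bce(D(G(torch.randn(batch, latent_dim))), ones)
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```

In a full training loop these two steps repeat for every batch, and `d_loss` corresponds to the negative of V(D, G) averaged over the batch.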
- Idea: Learn to reverse a diffusion (noise-adding) process.
- Famous Models: DDPM, Stable Diffusion.
- Applications: Image generation, denoising, inpainting.
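The noise-adding (forward) process has a closed form in DDPM, which the sketch below illustrates. It assumes the linear beta schedule from Ho et al. (1e-4 to 0.02 over 1000 steps); `q_sample` is an illustrative helper name, and the input batch is random stand-in data.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumption)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t


def q_sample(x0, t, noise):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    x0: (B, ...) clean samples, t: (B,) integer timesteps in [0, T).
    """
    shape = (-1,) + (1,) * (x0.dim() - 1)  # broadcast per-sample scalars
    sqrt_ab = alphas_cumprod[t].sqrt().view(shape)
    sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt().view(shape)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise


torch.manual_seed(0)
x0 = torch.rand(4, 1, 28, 28) * 2 - 1  # stand-in images scaled to [-1, 1]
t = torch.tensor([0, 250, 500, 999])
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise)
```

Training the reverse process then typically means asking a network to predict `noise` from `(xt, t)` and minimizing the MSE between the prediction and the true noise.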
- Mechanism: Self-Attention, Multi-Head Attention, Positional Encoding.
- Popular Models: GPT, BERT, T5.
- Applications: Text generation, summarization, translation.
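Self-attention, the core mechanism listed above, can be sketched as scaled dot-product attention. The single-head setup, the tensor sizes, and the projection layers `w_q`, `w_k`, `w_v` below are assumptions for illustration. The optional causal mask is what GPT-style models use so that each token attends only to earlier positions.

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for tensors shaped (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        # Mask out positions to the right of each query (future tokens).
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, tokens, model dim)
w_q, w_k, w_v = (torch.nn.Linear(16, 16, bias=False) for _ in range(3))
out, weights = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x), causal=True)
```

Multi-head attention runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.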
- PyTorch, TensorFlow, Keras – model implementations
- Hugging Face Transformers – fine-tuning and inference
- Diffusers – Stable Diffusion toolkit
- Matplotlib, NumPy, Torchvision – visualization and datasets
- Build Autoencoder & VAE from scratch.
- Train DCGAN on MNIST and CelebA datasets.
- Experiment with Diffusion Models (using Hugging Face).
- Fine-tune GPT-2 on custom text data.
- Build a Text-to-Image Pipeline using open models.
Goal: Compress and reconstruct handwritten digits.
Concepts: KL divergence, latent space sampling.
Deliverables: Visualizations of latent space clusters.
Goal: Generate realistic human faces using adversarial training.
Concepts: Generator vs. Discriminator training loop.
Dataset: CelebA.
Deliverables: Saved generated images during training.
Goal: Learn how a denoising diffusion probabilistic model (DDPM) works.
Concepts: Forward and reverse diffusion processes.
Deliverables: Generate new images from Gaussian noise.
Goal: Fine-tune GPT-2 for domain-specific text generation.
Concepts: Tokenization, causal language modeling.
Dataset: Custom text corpus.
Deliverables: Generate coherent domain-specific text samples.
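The causal language modeling objective used for this kind of fine-tuning can be sketched without loading a model: the logits at position t are scored against the token at position t + 1. The helper `causal_lm_loss` and the toy token IDs are illustrative assumptions. In Hugging Face Transformers, `GPT2LMHeadModel` performs this shift internally when `labels` is passed.

```python
import math
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids):
    """Next-token cross-entropy. logits: (B, T, V), input_ids: (B, T)."""
    shift_logits = logits[:, :-1, :]  # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]   # targets are the following tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )


vocab_size = 50
input_ids = torch.tensor([[3, 7, 7, 12, 1]])     # toy token IDs
uniform_logits = torch.zeros(1, 5, vocab_size)   # a model that knows nothing

loss = causal_lm_loss(uniform_logits, input_ids)
baseline = math.log(vocab_size)  # loss of a uniform predictor
```

Comparing the training loss against `log(vocab_size)` gives a quick check that fine-tuning is doing better than uniform guessing.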
Goal: Combine transformer and diffusion components.
Concepts: Prompt-based generation, CLIP guidance.
Deliverables: Generate image outputs from textual prompts.
| Model | Dataset | Output | Notes |
|---|---|---|---|
| VAE | MNIST | Reconstructed digits | Clear latent structure |
| DCGAN | CelebA | Generated faces | Improved over epochs |
| DDPM | CIFAR-10 | Generated images | Stable results |
| GPT-2 | Custom | Domain text | Fine-tuned successfully |
| Text-to-Image | Prompts | Realistic images | Used CLIP + Diffusers |
- Understood probabilistic generative models.
- Implemented GANs and learned adversarial optimization.
- Trained diffusion-based models for image synthesis.
- Fine-tuned and deployed transformer-based language models.
- Built creative multimodal applications (Text → Image).
🎯 Apply knowledge to real-world domains: biomedical imaging, text synthesis, or digital art generation.
- Python 3.9+
- Jupyter Notebook / VS Code
- Libraries: torch, torchvision, transformers, diffusers, matplotlib
- MIT 6.S192 — Deep Generative Models
- David Foster — Generative Deep Learning
- Hugging Face Docs — https://huggingface.co/docs
- DDPM Paper — Ho et al., 2020
- GAN Paper — Goodfellow et al., 2014
| Task | Status |
|---|---|
| Implement Autoencoder and VAE | ☐ |
| Train DCGAN | ☐ |
| Experiment with Diffusion Models | ☐ |
| Fine-tune GPT-2 | ☐ |
| Build Text-to-Image pipeline | ☐ |
| Complete all 5 projects | ☐ |
| Final documentation & Git push | ☐ |
⭐ Pro Tip:
Document every experiment visually. Include GIFs of training progression, sample generations, and model performance graphs for an impressive GitHub portfolio.
📌 Maintained by Moh Rafik
💬 For queries or collaborations: [RAFIKIITBHU@GMAIL.COM or LinkedIn]