Small Vision-Language Model
Topic
This project aims to distill knowledge from an existing large human-centric foundation model into a smaller foundation model that handles multiple vision and language tasks involving humans.
Tasks
- Run inference on multiple models to create a training dataset
- Implement a distillation architecture
- Train and evaluate the model on one benchmark dataset
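
The distillation step above could take many forms, and this description does not fix a particular objective. As a minimal sketch, assuming a logit-based setup in which the large model (teacher) and the small model (student) output scores over the same set of classes, one common choice is the temperature-softened KL divergence. The function names, the temperature value, and the example logits are illustrative assumptions, not part of the project specification. The code uses plain Python so it runs without dependencies; in the actual project the same quantity would be computed on PyTorch tensors.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Return log-probabilities of a temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: both models score the same classes, so the two lists
    have equal length. The T^2 factor keeps the gradient magnitude
    comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# Hypothetical logits for one training sample with three classes.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. A feature-based or contrastive objective would replace `distillation_loss` while the surrounding training loop stays the same.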
Expected Skills
- PyTorch (required)
- Strong programming skills (required)
- Foundation Models, Knowledge Distillation, Contrastive Learning, Dataset Curation (preferred)