Project DFKI Augmented Vision

Small Vision Language Model

Topic

This project aims to distill knowledge from an existing large human-centric foundation model into a smaller foundation model that handles multiple vision and language tasks involving humans.

Tasks

Run inference on multiple models to create a training dataset
Implement a distillation architecture
Train and evaluate the model on one benchmark dataset
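
To give a feel for the distillation task, here is a minimal sketch of one common formulation: a soft-target loss that pulls the student's temperature-softened predictions toward the teacher's. This is an illustration only, not the project's prescribed method; the project may instead distill features or embeddings. The function names and the default temperature are assumptions, and the code uses plain Python for readability rather than PyTorch tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures. Hypothetical helper for illustration.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher, student)
        if p > 0.0
    )
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher_out = [2.0, 0.5, -1.0]
    student_out = [1.0, 1.0, 0.0]
    print(distillation_loss(student_out, teacher_out))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In an actual training loop it would typically be combined with a task loss on ground-truth or pseudo-labeled data.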

Expected Skills

PyTorch (required)
Strong programming skills (required)
Foundation Models, Knowledge Distillation, Contrastive Learning, Dataset Curation (preferred)

Related Literature

[1] Sapiens
[2] TinyViT
[3] DIME-FM
