Completed November 2025

Text-to-Video Evaluation System

Systematic quality assessment for AI-generated video

Python, GPT-4V, Jupyter, pandas

Background

As text-to-video generation models rapidly improve, systematic evaluation frameworks are needed to measure quality across multiple dimensions, including visual fidelity, temporal coherence, prompt alignment, and aesthetic appeal.

Approach

Built an automated evaluation pipeline that uses GPT-4V to score each video on every quality dimension, then compares those scores against human annotations to measure how reliably the automated scores track human judgment.
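The validation step can be illustrated with a minimal sketch. It assumes each annotation is a (dimension, automated score, human score) record and uses Pearson correlation; the record layout, the function names, the scoring scale, and the toy values are all hypothetical, and the project may have used a different correlation measure.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlation_by_dimension(records):
    """Group (dimension, llm_score, human_score) records, then correlate each group."""
    grouped = defaultdict(lambda: ([], []))
    for dimension, llm_score, human_score in records:
        llm_scores, human_scores = grouped[dimension]
        llm_scores.append(llm_score)
        human_scores.append(human_score)
    return {dim: pearson(llm, human) for dim, (llm, human) in grouped.items()}


# Toy records (made-up scores, for illustration only).
records = [
    ("visual fidelity", 4.0, 4.5),
    ("visual fidelity", 2.0, 2.5),
    ("visual fidelity", 3.0, 3.0),
    ("temporal coherence", 1.0, 2.0),
    ("temporal coherence", 5.0, 4.0),
    ("temporal coherence", 3.0, 3.5),
]

per_dimension = correlation_by_dimension(records)
for dimension, r in per_dimension.items():
    print(f"{dimension}: r = {r:.2f}")
```

Reporting the correlation per dimension rather than as one pooled number makes it easier to see whether automated scoring is trustworthy on some dimensions and weaker on others.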

Key Results

Evaluated 500+ generated videos across 5 quality dimensions
Achieved a correlation of 0.85 or higher between GPT-4V scores and human judgments
Identified systematic biases in model-specific failure modes
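
One way a systematic bias of this kind could be surfaced is to compare the automated and human scores separately for each video generator. The sketch below is only an illustration under that assumption: the record layout, the function name, the generator labels `model_a` and `model_b`, and the scores are hypothetical, and the project's actual bias analysis is not described here.

```python
from collections import defaultdict


def mean_score_gap_by_model(records):
    """Return the mean (llm_score - human_score) for each generator model.

    A consistently positive gap suggests the automated scorer rates that
    generator's videos higher than humans do; a negative gap, lower.
    """
    gaps = defaultdict(list)
    for model, llm_score, human_score in records:
        gaps[model].append(llm_score - human_score)
    return {model: sum(values) / len(values) for model, values in gaps.items()}


# Toy records (made-up scores, for illustration only).
records = [
    ("model_a", 4.0, 3.0),
    ("model_a", 5.0, 4.0),
    ("model_b", 2.0, 3.0),
    ("model_b", 3.0, 3.0),
]

gaps = mean_score_gap_by_model(records)
for model, gap in gaps.items():
    print(f"{model}: mean gap = {gap:+.2f}")
```

A signed mean is used rather than an absolute one so that over-scoring and under-scoring stay distinguishable.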