LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts

1 Lomonosov Moscow State University
2 ISP RAS Research Center for Trusted Artificial Intelligence
3 MSU Institute for Artificial Intelligence
4 HSE

Samples of our dataset.

Description

LEHA-CVQAD is a novel large-scale dataset providing a diverse collection of 6,000+ compressed video streams generated with 186 modern codecs and encoding presets. Featuring a variety of real-world content and backed by crowdsourced subjective quality scores from Subjectify.us, it offers a reliable foundation for evaluating and advancing video-quality assessment methods.

Key features

LEHA-CVQAD is a subjective dataset with:

  • 186 different video codecs
  • 5+ compression standards (including H.264/AVC, H.265/HEVC, H.266/VVC, AV1, VP9)
  • 6,000+ compressed streams
  • 2M+ subjective scores
  • 15,000+ subjective assessors
  • Various content, including UGC and screen content

Videos

To gather source videos, we parsed high-quality, openly licensed, mostly FullHD clips from Vimeo, media.xiph, and YouTube UGC. We then clustered the 25,562 collected videos by their spatial and temporal information (SI/TI) and sampled 60 of them. The sampled videos were first transcoded to a uniform YUV 4:2:0 format. Each reference video was then encoded with a suite of modern codecs (AVC/H.264, HEVC/H.265, AV1, VVC/H.266, VP9, etc.) using multiple presets and bitrate levels, yielding a broad spectrum of compressed streams for subjective quality evaluation.
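Below is a minimal sketch of the SI/TI-based sampling step, assuming ITU-T P.910-style SI/TI features and k-means clustering with one representative per cluster. The function names, the choice of 60 clusters, and the clustering algorithm are illustrative assumptions, not a description of the exact LEHA-CVQAD pipeline.

```python
# Sketch: compute SI/TI per video and sample one representative per cluster.
# Assumes P.910-style SI/TI and k-means; not the authors' actual code.
import numpy as np
import cv2
from sklearn.cluster import KMeans

def si_ti(path: str) -> tuple[float, float]:
    """Spatial (SI) and temporal (TI) information of one video."""
    cap = cv2.VideoCapture(path)
    si_vals, ti_vals, prev = [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        luma = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        # SI: std. dev. of the Sobel-filtered luma plane
        sobel = np.hypot(cv2.Sobel(luma, cv2.CV_32F, 1, 0),
                         cv2.Sobel(luma, cv2.CV_32F, 0, 1))
        si_vals.append(sobel.std())
        # TI: std. dev. of the luma frame difference
        if prev is not None:
            ti_vals.append((luma - prev).std())
        prev = luma
    cap.release()
    return max(si_vals), (max(ti_vals) if ti_vals else 0.0)

def sample_sources(paths: list[str], n_clusters: int = 60) -> list[str]:
    """Cluster candidate videos in SI/TI space and pick one per cluster."""
    feats = np.array([si_ti(p) for p in paths])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # take the video closest to the cluster centroid
        dists = np.linalg.norm(feats[idx] - km.cluster_centers_[c], axis=1)
        chosen.append(paths[idx[np.argmin(dists)]])
    return chosen
```

Selecting one representative per cluster keeps the 60 sources spread across the SI/TI plane, so the compressed streams cover both simple and highly complex content.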


Distribution of SI/TI characteristics and clusters during source-video sampling.

Subjective study

To obtain reliable subjective scores for each video in LEHA-CVQAD, we conducted two consecutive subjective studies. In the first, we performed pairwise comparisons within each reference video and then applied the Elo and Bradley-Terry models. Pairwise scores obtained this way are highly accurate, but they do not account for the content of the video, which could potentially distort the resulting scores. For the second subjective study, we sampled three videos from each group and collected their MOS values. We then used maximum a posteriori (MAP) estimation to merge these two types of subjective scores, projecting them onto a single scale. More details can be found in the paper.
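As an illustration of the pairwise stage, here is a minimal sketch of Bradley-Terry score estimation from a win matrix, fitted independently per reference video. It shows the model class named above, not the authors' exact implementation, and omits the Elo stage and the MAP merging with MOS.

```python
# Sketch: Bradley-Terry abilities from pairwise preference counts (one reference video).
import numpy as np

def bradley_terry(wins: np.ndarray, n_iter: int = 200, tol: float = 1e-9) -> np.ndarray:
    """Estimate log-quality scores from pairwise preferences.

    wins[i, j] = number of times compressed stream i was preferred over stream j.
    Assumes every stream wins at least one comparison.
    """
    n = wins.shape[0]
    p = np.ones(n)                    # Bradley-Terry "abilities", one per stream
    games = wins + wins.T             # total comparisons between each pair
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            num = wins[i].sum()
            den = np.sum(games[i] / (p[i] + p), where=games[i] > 0)
            p_new[i] = num / den
        p_new /= p_new.sum()          # scores are defined only up to a common scale
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return np.log(p)                  # log-abilities serve as per-stream quality scores
```

In the full pipeline, the per-reference scores produced at this stage are then aligned across contents using the MOS values from the second study.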

BibTeX

BibTex Code Will Be Here