Video and Audio Splice Detection and Localization Challenge

Hosted by: UL Research Institutes — Digital Safety Research Institute (DSRI)
Co-located with the Authenticity and Provenance in the Age of Generative AI Workshop at CVPR 2026

The Video and Audio Splice Detection and Localization Challenge will advance techniques for detecting and temporally localizing synthetic and manipulated video and audio content. The competition will engage the research community in addressing one of the most pressing challenges in generative AI research: distinguishing between authentic and AI-synthesized video with audio.

Overview

Objective: Detection and temporal localization of synthetic content in novel, undisclosed video+audio multimedia datasets.

Recognition: Top-performing teams may be invited to attend and engage with the workshop community, with opportunities to share their approaches. Travel stipends and research grants may be offered to outstanding teams to support further research and development.


Ready to participate?

Register your team to get started. The principal investigator fills out the registration form; the organizers then manually approve the registration and issue your access token.

Register

Updates

No updates yet.

Participation

Desired submissions

Submissions to the challenge should analyze the content of the inputs only. Submissions that rely on metadata are out of scope for this challenge.

We specifically encourage submissions that are based on novel or emerging approaches. Novelty, as well as performance, will be a factor in determining recognition and possible awards for participating teams. Participants will have an opportunity to describe their approaches in detail at the conclusion of the challenge.

Submit your detector model for evaluation

We also welcome submissions that only provide a partial solution. For example, teams that can detect and localize only video or only audio are encouraged to participate.

Submission format

Submissions will take the form of a Docker image containing an HTTP server that serves a defined inference API. Submissions may optionally include a separate data volume to be mounted in the running container. Participants may implement their systems using any technologies they wish.
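As a rough illustration of the "HTTP server inside a container" shape described above, the sketch below uses only the Python standard library. The actual inference API is defined by the organizers and is not reproduced here: the `/predict` route, the raw-bytes request body, the `video_llr` and `audio_llr` response fields, and port 8000 are all placeholders chosen for this example, and `run_detector` is a hypothetical stub.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_detector(media_bytes: bytes) -> dict:
    """Hypothetical stub: always answers "real" (LLR <= 0) for both streams."""
    # Placeholder field names; the real response schema comes from the organizers.
    return {"video_llr": -1.0, "audio_llr": -1.0}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # placeholder route, not the official one
            self.send_error(404)
            return
        # Read the request body (assumed here to be the raw media file).
        length = int(self.headers.get("Content-Length", 0))
        media_bytes = self.rfile.read(length)
        body = json.dumps(run_detector(media_bytes)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this example


def make_server(host: str = "0.0.0.0", port: int = 8000) -> HTTPServer:
    """Build the server; binding to 0.0.0.0 makes it reachable from outside a container."""
    return HTTPServer((host, port), InferenceHandler)
```

A container entry point could then call `make_server().serve_forever()`. Any web framework would serve equally well, since the challenge places no restriction on implementation technology.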

Submitted systems will run in a virtual machine with access to the following resources:

  • GPU: 1x Nvidia L4
  • vCPU: 8
  • Memory: 32Gi

The main task dataset contains 4040 items. There is a time limit of 60 seconds per inference and 6 hours total to process the whole dataset.
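The two limits interact: the 6-hour total is far tighter than the 60-second per-inference cap. The arithmetic below is a small sketch that assumes items are processed one at a time (the source does not say whether requests are sequential); the function names are illustrative.

```python
ITEMS = 4040                  # main task dataset size
PER_ITEM_LIMIT_S = 60         # time limit per inference
TOTAL_LIMIT_S = 6 * 60 * 60   # 6 hours for the whole dataset, in seconds


def average_budget_seconds(items: int = ITEMS, total_s: int = TOTAL_LIMIT_S) -> float:
    """Mean time available per item if the total limit is the binding one."""
    return total_s / items


def hours_at_per_item_cap(items: int = ITEMS, per_item_s: int = PER_ITEM_LIMIT_S) -> float:
    """Total time needed if every item used the full per-inference limit."""
    return items * per_item_s / 3600


def fits_total_budget(mean_latency_s: float, items: int = ITEMS) -> bool:
    """True if a given mean latency finishes the dataset within the total limit."""
    return mean_latency_s * items <= TOTAL_LIMIT_S
```

Under that assumption the average budget is about 5.35 seconds per item, while using the full 60 seconds on every item would take roughly 67 hours. The per-inference limit therefore only leaves headroom for occasional slow items.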

The challenge will include an open pilot task that teams can use to test their submissions, validate model performance, and ensure compatibility with the challenge system. Teams will have access to input data and system logs for the pilot task, but not for the main task. Participants are strongly encouraged to verify success on the pilot task before submitting to the main task.

Example Submission

Code for an example submission is provided: VASDL 2026: Video and Audio Splice Detection and Localization Challenge.

This project provides a starting point for implementing a submission to the VASDL 2026: Video and Audio Splice Detection and Localization Challenge. You do not need to use this code to participate in the challenge.

FAQ and Tips

  • We recommend against using OpenCV to decode video and audio streams because it has limited support for certain codecs. We recommend other libraries such as PyAV.
  • We welcome submissions that only provide a partial solution. For example, teams that can detect and localize only video or only audio are encouraged to participate.
  • All “LLR” scores are assumed to be calibrated at a threshold of 0,
    i.e., llr > 0 -> synthetic and llr <= 0 -> real.
  • Only a single localization format (timeseries or timespan) is required. If both are present, timespan will take priority for scoring.
  • Timespan timestamps will be rounded to 1/4 of a second so that they are directly comparable to the timeseries prediction.
  • If any prediction type is omitted, it will be scored as "no detection" (i.e., predicting "real").
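The threshold and rounding rules in the tips above can be sketched in a few lines. This is an illustration under assumptions, not the organizers' scoring code: rounding is taken to mean "to the nearest quarter second", timespans are taken to be `(start, end)` pairs in seconds, and the timeseries is represented as one boolean per quarter-second bin. The actual prediction formats are defined in the submission instructions.

```python
import math

STEP_S = 0.25  # quarter-second grid mentioned in the tips


def llr_to_label(llr: float) -> str:
    """Apply the stated decision rule: llr > 0 is synthetic, otherwise real."""
    return "synthetic" if llr > 0 else "real"


def quarter_index(t: float) -> int:
    """Index of the nearest quarter-second grid point (assumed rounding mode)."""
    return math.floor(t / STEP_S + 0.5)


def round_to_quarter(t: float) -> float:
    """Timestamp snapped to the quarter-second grid."""
    return quarter_index(t) * STEP_S


def timespans_to_timeseries(spans, duration_s: float) -> list:
    """Mark each quarter-second bin covered by a rounded span as synthetic (True)."""
    n_bins = math.ceil(duration_s / STEP_S)
    series = [False] * n_bins
    for start, end in spans:
        first = max(quarter_index(start), 0)
        last = min(quarter_index(end), n_bins)
        for i in range(first, last):
            series[i] = True
    return series
```

For example, a predicted span of 1.1 s to 2.4 s snaps to 1.0 s to 2.5 s, which covers six quarter-second bins of a 3-second clip.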

Challenge Timeline

Task 1 will open on May 1, marking the official start of the challenge. A pilot task will remain available throughout the challenge period to help participants test and troubleshoot submission formats, evaluation pipelines, and system integration before submitting to the main task.

The challenge will run through the CVPR APAI Workshop, where organizers will share an overview of the challenge, early observations, and initial results. Following the workshop, the challenge will remain open to provide participants additional time to refine and improve their models.

The challenge will conclude later this summer. At that time, final rankings will be computed and winners and results will be announced.

After the challenge concludes, we will host post-challenge participant roundtables to discuss technical approaches, analyze performance across submissions, and share additional insights about the dataset, evaluation framework, and results.

Datasets

The challenge uses novel curated datasets created especially for the challenge, comprising both authentic and AI-generated videos in controlled and naturalistic settings. The audio and video streams will be encoded using common video and audio codecs and will have varying resolutions and frame rates.

The synthetic examples (less than 30 seconds in length) are produced using state-of-the-art video and audio models. The dataset includes varying degrees of fidelity and manipulation granularity to test model robustness in detecting subtle audio and video inconsistencies.

Input videos may be fully authentic, fully synthetic, or may contain sections of synthetic content within an otherwise authentic video. The audio may or may not contain human speech. Either or both the audio and video modalities may be synthetic, and synthetic content may not be temporally aligned across modalities.

No training data will be provided to participants, and the evaluation data will not be released.

Evaluation

Primary Metrics
For the Detection capability, the primary metrics are Balanced Accuracy and ROC AUC for binary detection of synthetic content anywhere in the video. For the Localization capability, the primary metric is Intersection-over-Union (IoU) of predicted and true time intervals during which synthetic content is present.

The competition will maintain both a public leaderboard and a private leaderboard. Teams will have access to their public leaderboard scores during the challenge, while the private scores will determine final rankings.

All capabilities will be evaluated for both the video and audio modalities separately and in combination.
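For teams who want to measure themselves locally, the primary metrics named above can be sketched in plain Python. These are the textbook definitions; how the organizers aggregate across items, and how they treat an item where neither prediction nor ground truth contains synthetic content, is not stated in the source. Returning an IoU of 1.0 in that empty case is an assumption made here.

```python
def balanced_accuracy(y_true, llrs, threshold=0.0):
    """Mean of true-positive rate and true-negative rate at the given threshold."""
    pos = sum(1 for y in y_true if y)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one synthetic and one real item")
    tp = sum(1 for y, s in zip(y_true, llrs) if y and s > threshold)
    tn = sum(1 for y, s in zip(y_true, llrs) if not y and s <= threshold)
    return 0.5 * (tp / pos + tn / neg)


def roc_auc(y_true, scores):
    """Probability that a synthetic item outscores a real one (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one synthetic and one real item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _merge(spans):
    """Merge overlapping (start, end) intervals into a sorted disjoint list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def temporal_iou(pred_spans, true_spans):
    """Intersection-over-Union of two sets of time intervals, in seconds."""
    pred, truth = _merge(pred_spans), _merge(true_spans)
    inter = sum(
        max(0.0, min(pe, te) - max(ps, ts)) for ps, pe in pred for ts, te in truth
    )
    union = sum(e - s for s, e in pred) + sum(e - s for s, e in truth) - inter
    # Assumption: two empty interval sets agree perfectly.
    return inter / union if union > 0 else 1.0
```

As a quick check, a prediction of 1 s to 3 s against a true interval of 2 s to 4 s overlaps for 1 second out of a 3-second union, giving an IoU of 1/3.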

The final ranking will be computed on the union of public and private splits.

Scoring

Predictions on audio and video streams will be scored independently. We welcome submissions that handle only a subset of the predictions, i.e., only the audio or only the video channel, or only detection or only localization.

Rules

By registering for the Video and Audio Splice Detection and Localization Challenge, participants agree to be bound by the following challenge rules.

1. Leaderboards

Both public and private leaderboards will be maintained. The private leaderboard will serve as the basis for final ranking. The organizers may, at their sole discretion, exclude from the final results any submissions deemed to have violated the spirit of the challenge.

2. Submission Limits

Participants will be limited in the number of daily submissions and agree not to attempt to circumvent those limits.

3. Confidentiality

Participants agree not to publicly compare results with others until those results are published outside the conference venue. Participants are free to publish and use their own results independently.

4. Appropriate Use

Use of provided computing resources for any non-challenge-related purposes is prohibited. Participants should take appropriate precautions to protect their authorization credentials and report any account compromise or misuse to the challenge organizers immediately. Participants agree not to attempt to reveal or exfiltrate any information about the private evaluation data by any means.

5. Recognition and Awards

Recognition and awards for participants will be given at the sole discretion of the organizers. Achieving a particular score or rank does not entitle a participant to any recognition or awards. Reporting of challenge results does not constitute an endorsement or certification of any kind.

6. Compliance

All rules and guidelines issued by the organizers must be followed. Failure to comply may result in disqualification or exclusion from future challenges.

Helpful Resources

Submission instructions and code examples

Contact The Organizers

Join the Discord

