Detecting Sponsored Content in YouTube Videos

by Anas El Mhamdi

I’m back! With extra time during lockdown, I decided to explore machine learning, which I see as central to growth engineering: both fields rely on the same skills, mostly Python development, scraping, data processing, and analytical and critical thinking.

Adblock concept

The Audio Route

A friend’s frustration with sponsored podcast content led me to SponsorBlock, a crowdsourced ad skipper for YouTube. Its database is completely open and contains labeled videos with sponsor timestamps.
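To make the starting point concrete, here is a minimal sketch of pulling sponsor windows out of a SponsorBlock-style response. The JSON shape (a list of objects with a `"category"` string and a two-element `"segment"` array of seconds) follows SponsorBlock’s public `skipSegments` API, but treat the exact field names as an assumption rather than a guarantee.

```python
# Sketch: extracting (start, end) sponsor windows from a SponsorBlock-style
# response. The field names ("category", "segment") are assumed from the
# public API and should be verified against its documentation.

def sponsor_windows(entries):
    """Return sorted (start, end) tuples for entries labeled 'sponsor'."""
    return sorted(
        tuple(e["segment"])
        for e in entries
        if e.get("category") == "sponsor"
    )

sample = [
    {"category": "sponsor", "segment": [12.0, 74.5]},
    {"category": "selfpromo", "segment": [300.0, 315.0]},
    {"category": "sponsor", "segment": [410.2, 471.9]},
]

print(sponsor_windows(sample))  # [(12.0, 74.5), (410.2, 471.9)]
```

Filtering on the category matters because SponsorBlock also labels self-promotion, intros, and outros, which I didn’t want mixed into the sponsor class.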

Labeled video showing sponsored segments

I initially pursued audio classification using a CNN-based approach inspired by the Panotti repository, which achieved 99.7% accuracy detecting guitar effects.

Results:

  • Single podcast: 99% training accuracy, 95% test accuracy
  • Multiple podcasts: Less than 60% confidence

The challenge was scale: 300 podcasts generated 50GB of data, making data management untenable. I pivoted to transcript analysis for efficiency.
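For context, the data-prep step behind the audio route can be sketched as cutting each track into fixed-length clips and tagging each clip from the sponsor timestamps. The 10-second clip length and the majority-overlap rule below are my illustrative assumptions, not Panotti’s actual pipeline.

```python
# Sketch of the labeling step for audio classification: split a track into
# fixed-length clips and tag each one "sponsor" or "content" based on the
# SponsorBlock timestamps. Clip length and the overlap rule are assumptions.

def label_clips(duration_s, sponsor_windows, clip_s=10.0):
    """Yield (start, end, label) for consecutive clips of clip_s seconds.

    A clip is labeled 'sponsor' when more than half of it overlaps a
    sponsored window.
    """
    t = 0.0
    while t < duration_s:
        end = min(t + clip_s, duration_s)
        overlap = sum(
            max(0.0, min(end, w_end) - max(t, w_start))
            for w_start, w_end in sponsor_windows
        )
        label = "sponsor" if overlap > (end - t) / 2 else "content"
        yield (t, end, label)
        t = end

clips = list(label_clips(40.0, [(12.0, 28.0)]))
print([label for _, _, label in clips])
# ['content', 'sponsor', 'sponsor', 'content']
```

Every such clip then needs to be stored as audio (or a spectrogram) for the CNN, which is exactly where the 50GB problem came from.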

The Captions Route

After studying NLP and sentiment analysis, I selected a Transformer model based on BERT, Google’s pretrained language model.

Dataset Construction:

  • Downloaded videos using youtube-dl
  • Extracted automatic YouTube captions (approximately 80,000 examples)
  • 35 hours of download time
  • Filtered ads: 10 seconds to 5 minutes duration
  • Single-ad videos only
  • Balanced sponsor/non-sponsor content portions by duration
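The balancing step above can be sketched as: label each caption line by whether its midpoint falls inside a sponsor window, then trim the non-sponsor side so both classes cover roughly equal duration. The `(start, end, text)` caption tuples are an assumed intermediate format, not the raw youtube-dl output.

```python
# Sketch of the dataset-balancing step. Caption entries as (start_s, end_s,
# text) tuples are an assumed intermediate format; the midpoint rule is one
# simple way to assign a caption line to a class.

def split_and_balance(captions, sponsor_windows):
    def is_sponsor(start, end):
        mid = (start + end) / 2
        return any(ws <= mid < we for ws, we in sponsor_windows)

    sponsor = [c for c in captions if is_sponsor(c[0], c[1])]
    other = [c for c in captions if not is_sponsor(c[0], c[1])]

    budget = sum(e - s for s, e, _ in sponsor)  # total sponsor duration
    balanced, used = [], 0.0
    for s, e, text in other:                    # keep non-sponsor lines
        if used >= budget:                      # until durations match
            break
        balanced.append((s, e, text))
        used += e - s
    return sponsor, balanced

caps = [
    (0.0, 5.0, "welcome back to the channel"),
    (5.0, 10.0, "today's video is brought to you by"),
    (10.0, 15.0, "use code TECH for ten percent off"),
    (15.0, 20.0, "now let's look at the benchmarks"),
    (20.0, 25.0, "the results were surprising"),
]

sponsor, other = split_and_balance(caps, [(5.0, 15.0)])
print(len(sponsor), len(other))  # 2 2
```

Without this step the classifier would see far more non-sponsor text than sponsor text and could score well by always predicting the majority class.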

Dataset availability: Kaggle - Sponsor Block Subtitles 80k

Performance: The BERT-based Transformer model achieved 93.79% test accuracy.

Test accuracy graph

Model notebook: Transformers SponsorBlock on Kaggle

Working with the Model

Testing on a Linus Tech Tips video revealed interesting results. The model correctly detected two sponsored segments but also generated false positives in regular content.

Error patterns identified:

  • Uppercase letters (brand names, locations like “MacBook,” “California”)
  • Bracketed text ([Music], [Applause] frequently appear in ads)
  • Brand mentions (conversations about products being flagged)
  • Rough 10-second chunking creating segmentation issues
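The first two error patterns suggest a normalization pass before classification: lowercase everything so brand-name capitalization carries no signal, and strip bracketed caption tags like [Music]. This is a mitigation sketch I did not ship; whether it actually helps would need to be re-validated against the test set.

```python
import re

# Sketch of a normalization pass targeting the error patterns above:
# lowercase the text and strip bracketed caption tags such as [Music] or
# [Applause]. This is an untested mitigation idea, not part of the model.

BRACKET_TAG = re.compile(r"\[[^\]]*\]")

def normalize(text):
    text = BRACKET_TAG.sub(" ", text)  # drop [Music], [Applause], ...
    text = text.lower()                # "MacBook" -> "macbook"
    return " ".join(text.split())      # collapse extra whitespace

print(normalize("This MacBook [Music] was sent from California [Applause]"))
# this macbook was sent from california
```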

Detailed results: Testing spreadsheet

Future Improvements

Several opportunities for enhancement:

  • Expand dataset breadth and diversity
  • Fine-tune BERT model parameters
  • Implement finer caption segmentation (sub-10-second chunks)
  • Cross-validate chunk results for higher confidence
  • Stabilize API deployment for production use
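One way to cross-validate chunk results, sketched below, is to smooth the per-chunk sponsor probabilities with a sliding majority vote so an isolated false positive gets outvoted by its neighbors. The window size and the 0.5 threshold are illustrative assumptions, not tuned values.

```python
# Sketch: smooth per-chunk sponsor probabilities with a sliding majority
# vote. Window size and threshold are illustrative assumptions.

def smooth_predictions(probs, window=3, threshold=0.5):
    """Return a boolean sponsor flag per chunk after neighborhood voting."""
    flags = [p >= threshold for p in probs]
    half = window // 2
    smoothed = []
    for i in range(len(flags)):
        lo, hi = max(0, i - half), min(len(flags), i + half + 1)
        votes = flags[lo:hi]
        smoothed.append(sum(votes) > len(votes) / 2)
    return smoothed

probs = [0.1, 0.9, 0.2, 0.8, 0.9, 0.95, 0.3, 0.1]
print(smooth_predictions(probs))
# [False, False, True, True, True, True, False, False]
```

Note how the isolated spike at index 1 is suppressed and the dip at index 2 inside the sponsored run is filled in, which is exactly the behavior needed to turn noisy chunk scores into clean skip segments.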

Conclusion

Modern machine learning has become remarkably accessible for newcomers. That a complete newbie like me could build something that somewhat works says a lot about how far the field has advanced: well-documented resources and frameworks enable functional implementations despite inexperience, making ML projects achievable during a lockdown learning journey.