Detecting Sponsored Content in YouTube Videos
by Anas El Mhamdi

I'm back! Having extra time during lockdown to explore machine learning, I recognized it as central to growth engineering, since both fields rely on similar skills: mostly Python development, scraping, data processing, and analytical and critical thinking.

The Audio Route
A friend's frustration with sponsored podcast content led me to SponsorBlock, a crowdsourced ad skipper for YouTube. Its database is completely open and contains labeled videos with sponsor timestamps.
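SponsorBlock also exposes its labels through a public API, which makes it easy to pull ground truth for any single video. A quick sketch (endpoint and response shape as documented when I checked; the video ID is just a placeholder):

```python
import requests

def get_sponsor_segments(video_id):
    """Fetch crowdsourced sponsor timestamps for one video from SponsorBlock."""
    resp = requests.get(
        "https://sponsor.ajay.app/api/skipSegments",
        params={"videoID": video_id, "category": "sponsor"},
        timeout=10,
    )
    resp.raise_for_status()  # the API answers 404 when no segments are known
    return [item["segment"] for item in resp.json()]  # [start, end] pairs in seconds

for start, end in get_sponsor_segments("dQw4w9WgXcQ"):  # placeholder video ID
    print(f"Sponsor segment: {start:.1f}s to {end:.1f}s")
```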

I initially pursued audio classification with a CNN-based approach inspired by the Panotti repository, which reports 99.7% accuracy at detecting guitar effects.
Results:
- Single podcast: 99% training accuracy, 95% test accuracy
- Multiple podcasts: Less than 60% confidence
The challenge was scale: 300 podcasts generated 50 GB of data, making data management untenable, so I pivoted to transcript analysis for efficiency.
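For context, a Panotti-style pipeline first turns each audio clip into a log-scaled mel-spectrogram, the 2-D "image" the CNN actually classifies. Here is a minimal sketch of that preprocessing using librosa (file name, sample rate, and the 10-second clip length are my assumptions, not Panotti's exact settings):

```python
import librosa
import numpy as np

def audio_to_melspectrogram(path, sr=22050, duration=10.0):
    """Load a fixed-length clip and turn it into a log-scaled mel-spectrogram,
    the 2-D 'image' a CNN classifier consumes."""
    y, _ = librosa.load(path, sr=sr, duration=duration, mono=True)
    y = librosa.util.fix_length(y, size=int(sr * duration))  # pad short clips
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)

spec = audio_to_melspectrogram("clip.wav")  # placeholder file name
print(spec.shape)  # (128, time_frames); stack these as CNN inputs
```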
The Captions Route
After studying NLP and sentiment analysis, I selected a Transformer model based on BERT, Google's pretrained language model.
Dataset Construction:
- Downloaded videos using youtube-dl (see the caption-download sketch after this list)
- Extracted automatic YouTube captions (approximately 80,000 examples)
- 35 hours of download time
- Filtered ads: 10 seconds to 5 minutes duration
- Single-ad videos only
- Balanced sponsor/non-sponsor content portions by duration
Dataset availability: Kaggle - Sponsor Block Subtitles 80k
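The caption-download step can be reproduced with youtube-dl's Python API in a few lines. A hedged sketch (option names follow youtube-dl's documentation; the video ID list is a placeholder for the IDs labeled in SponsorBlock):

```python
import youtube_dl  # pip install youtube-dl

# Grab only the automatic English captions, not the video itself
ydl_opts = {
    "skip_download": True,       # captions only; keeps downloads small
    "writeautomaticsub": True,   # YouTube's auto-generated subtitles
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "captions/%(id)s.%(ext)s",
}

video_ids = ["dQw4w9WgXcQ"]  # placeholder: in practice, every ID labeled in SponsorBlock
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={v}" for v in video_ids])
```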
Performance: The BERT-based Transformer model achieved 93.79% test accuracy.

Model notebook: Transformers SponsorBlock on Kaggle
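The notebook has the full training code; the general recipe for fine-tuning BERT as a binary sponsor classifier with Hugging Face transformers looks roughly like the sketch below (model checkpoint, sequence length, and the two toy examples are my assumptions, not the notebook's exact settings):

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

class CaptionChunks(Dataset):
    """Caption chunks with binary labels: 1 = sponsor, 0 = regular content."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy examples standing in for the 80k-chunk dataset
train = CaptionChunks(["huge thanks to today's sponsor", "let's open the box"], [1, 0])
loader = DataLoader(train, batch_size=16, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    optimizer.zero_grad()
    loss = model(**batch).loss  # cross-entropy against the sponsor labels
    loss.backward()
    optimizer.step()
```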
Working with the Model
Testing on a Linus Tech Tips video revealed interesting results. The model correctly detected two sponsored segments but also generated false positives in regular content.
Error patterns identified:
- Uppercase letters (brand names, locations like “MacBook,” “California”)
- Bracketed text ([Music], [Applause] frequently appear in ads)
- Brand mentions (conversations about products being flagged)
- Rough 10-second chunking creating segmentation issues
Detailed results: Testing spreadsheet
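The first two patterns suggest a cheap mitigation: normalize each caption chunk before inference so casing and bracketed tags cannot dominate the prediction. A sketch (the exact regexes are assumptions I have not validated against the dataset):

```python
import re

def normalize_chunk(text):
    """Reduce the surface cues the model latched onto: casing and [Music]-style tags."""
    text = re.sub(r"\[.*?\]", " ", text)  # strip bracketed tokens like [Music], [Applause]
    text = text.lower()                   # hide capitalization of brand names
    return re.sub(r"\s+", " ", text).strip()

print(normalize_chunk("[Music] Thanks to our sponsor for this MacBook"))
# -> "thanks to our sponsor for this macbook"
```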
Future Improvements
Several opportunities for enhancement:
- Expand dataset breadth and diversity
- Fine-tune BERT model parameters
- Implement finer caption segmentation (sub-10-second chunks)
- Cross-validate chunk results for higher confidence (see the smoothing sketch after this list)
- Stabilize API deployment for production use
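As a starting point for the cross-validation idea, sponsor probabilities from neighboring chunks can be averaged so an isolated false positive gets voted down while a sustained sponsor run survives. A minimal sketch (window size and threshold are assumptions):

```python
import numpy as np

def smooth_predictions(probs, window=3, threshold=0.5):
    """Average each chunk's sponsor probability with its neighbors,
    then threshold, so lone spikes in regular content are suppressed."""
    probs = np.asarray(probs)
    smoothed = []
    for i in range(len(probs)):
        lo, hi = max(0, i - window // 2), min(len(probs), i + window // 2 + 1)
        smoothed.append(bool(probs[lo:hi].mean() > threshold))
    return smoothed

# The single spike at index 2 is suppressed; the sustained run at 5-7 survives
print(smooth_predictions([0.1, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9, 0.85, 0.1]))
```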
Conclusion
Modern machine learning has become remarkably accessible to newcomers. It says a lot about how far the field has advanced that a complete beginner like me could build something that more or less works. Well-documented resources and frameworks make functional implementations possible despite inexperience, which turned this project into an achievable lockdown learning journey.