AI still can't match human accuracy in content moderation, but employing human moderators costs roughly 40 times more than running AI systems.
A new study has revealed that while human moderators are more effective at recognizing policy-violating content, multimodal large language models (MLLMs) like Gemini, GPT, and Llama offer a cost-effective complement for brand safety tasks [1][5].
The research, presented in the preprint "AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety", evaluated six models on a dataset of 1,500 videos: GPT-4o, GPT-4o-mini, Gemini-1.5-Flash, Gemini-2.0-Flash, Gemini-2.0-Flash-Lite, and Llama-3.2-11B-Vision [1].
The Llama-3.2-11B-Vision model achieved an F1 score of 0.86 at a cost of $459, while GPT-4o scored 0.87 at $419. GPT-4o-mini, a smaller version of GPT-4o, reached an F1 score of 0.88 at a cost of only $25 [1]. The compact Gemini models (Gemini-1.5-Flash, Gemini-2.0-Flash, and Gemini-2.0-Flash-Lite) achieved the highest F1 scores among the evaluated models [1].
In contrast, human moderation cost $974, significantly more than any of the evaluated models [1]. By the researchers' calculations, that makes human moderation roughly 40 times more expensive than the most cost-efficient model ($974 versus $25, a factor of about 39) [1].
Brand safety involves preventing inappropriate or toxic content from being associated with brands, a task that requires careful moderation across multiple content modalities (text, images, video, audio) [1][5]. The Zefr team explains that advertisers define content categories they wish to avoid, ranging from violent or adult-themed material to controversial political discourse [1].
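As a rough illustration of how such prompt-based moderation works, here is a minimal Python sketch that asks a multimodal model to assign a video frame to an advertiser-defined avoidance category. The category list, prompt wording, and use of the OpenAI SDK are assumptions for illustration only; the study's actual prompts are available in its published repository.

```python
# Hypothetical sketch of prompt-based brand-safety classification with a
# multimodal model. Categories and prompt wording are illustrative, not
# the study's actual prompts.
from openai import OpenAI

client = OpenAI()

# Advertiser-defined avoidance categories (illustrative).
CATEGORIES = ["violence", "adult content", "controversial political discourse", "none"]

def classify_frame(frame_url: str) -> str:
    """Ask the model which avoidance category, if any, a video frame falls into."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this frame into exactly one of these brand-safety "
                         f"categories: {', '.join(CATEGORIES)}. Reply with the category only."},
                {"type": "image_url", "image_url": {"url": frame_url}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```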
The study found that human reviewers outperformed all of the models on precision, recall, and F1, particularly on complex or nuanced classifications [1]. The MLLMs nevertheless performed well across text, audio, and visuals, falling short mainly in those harder scenarios [1][5].
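For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. A minimal sketch, using illustrative counts rather than figures from the paper:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall over flagged content."""
    precision = tp / (tp + fp)  # share of flagged items that truly violate policy
    recall = tp / (tp + fn)     # share of violating items that were flagged
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 88 true positives, 12 false positives, 12 false negatives
print(round(f1_score(88, 12, 12), 2))  # -> 0.88
```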
The authors of the paper conclude that compact MLLMs offer a significantly cheaper alternative without sacrificing accuracy [1]. A hybrid approach combining human moderators with AI is suggested as the most effective and economical strategy for brand safety content moderation, leveraging AI’s scalability and lower cost alongside human judgment for complex cases [1].
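The paper recommends the hybrid idea without prescribing an implementation. One common pattern for such pipelines is confidence-based escalation, sketched below with an assumed model confidence signal and an illustrative threshold:

```python
# Minimal sketch of a hybrid moderation pipeline, assuming the model returns
# a confidence alongside its label. The threshold and routing policy are
# illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str         # e.g. "violence" or "none"
    confidence: float  # model confidence in [0, 1]

CONFIDENCE_THRESHOLD = 0.9  # tune against a human-labeled validation set

def route(video_id: str, verdict: Verdict) -> str:
    """Accept high-confidence model decisions; escalate the rest to humans."""
    if verdict.confidence >= CONFIDENCE_THRESHOLD:
        return f"{video_id}: auto-labeled '{verdict.label}' by model"
    return f"{video_id}: queued for human review (low confidence)"

print(route("vid_001", Verdict("none", 0.97)))      # handled by the model
print(route("vid_002", Verdict("violence", 0.62)))  # escalated to a human
```

The threshold trades cost against accuracy: raising it routes more content to humans, which the study's figures suggest is where most of the expense lies.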
Jon Morra, Zefr's Chief AI Officer, agrees, stating that a hybrid human and AI approach is the most effective and economical path forward for content moderation in the brand safety and suitability landscape [1].
The dataset and prompts used in the study have been published on GitHub for further research [1]. The paper has been accepted at the upcoming Computer Vision in Advertising and Marketing (CVAM) workshop at the 2025 International Conference on Computer Vision [1].
In summary, while human moderators remain more accurate for brand safety tasks, multimodal LLMs offer a cost-effective complement, with the best results achieved by combining both [1][5].
[1] Morra, J., et al. (2025). "AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety." Preprint.
[5] Zefr. (2025). "Zefr's AI vs. Human Moderators Study." Retrieved from https://www.zefr.com/ai-vs-human-moderators-study/