Adapting Chain of Contradiction (CoC) for Sarcasm Detection
Check out the official PDF for Adapting Chain of Contradiction (CoC) for Sarcasm Detection
⚪ Overview
This project extends SarcasmBench, a benchmark for evaluating Large Language Models (LLMs) on sarcasm detection, by incorporating Chain of Contradiction (CoC) prompting. Unlike traditional Chain-of-Thought (CoT) reasoning, CoC explicitly models sarcasm as a contradiction between surface sentiment and true intent. The result is a scalable, reasoning-based prompting method that improves sarcasm classification, particularly for inputs containing direct or overt sarcasm cues.
We evaluate CoC against zero-shot, few-shot, and CoT prompting methods across five benchmark datasets and three LLMs: GPT-4o-mini, LLaMA 3-8B, and Qwen 2-7B. Across these models and tasks, CoC prompting consistently improves recall and F1 score, especially in tweet-style and conversational sarcasm where sentiment mismatches are more pronounced.
⚪ Motivation
Sarcasm detection is a challenging NLP problem because it depends on tone, sentiment reversal, and pragmatic context. Traditional models (RNNs, CNNs) and even general CoT prompting strategies underperform because they do not reason explicitly about sentiment contradictions. Inspired by recent work showing that LLMs can engage in structured sentiment reasoning, we implement CoC as a three-step prompting pipeline that guides models toward evaluating the contradiction at the core of sarcasm.
⚪ Methodology
CoC prompting operates in three stages:
1. Surface Sentiment Analysis: identify the literal sentiment from sentiment-bearing keywords or emojis.
2. True Intention Deduction: infer the speaker's actual meaning using rhetorical style and common sense.
3. Contradiction Evaluation: classify the text as "Sarcastic" if the surface sentiment (step 1) contradicts the true intention (step 2).
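To make the pipeline concrete, here is a minimal sketch of how a three-stage CoC prompt might be assembled. The instruction wording, the `build_coc_prompt` helper, and the few-shot format are illustrative assumptions rather than the exact prompts used in the experiments.

```python
from typing import Sequence

# Minimal sketch of a CoC prompt builder. The wording below is
# illustrative, not the verbatim prompt used in the experiments.
COC_INSTRUCTIONS = """You are a sarcasm detector. Reason in three steps:
1. Surface Sentiment Analysis: identify the literal sentiment using
   sentiment-bearing keywords or emojis.
2. True Intention Deduction: infer the speaker's actual meaning from
   rhetorical style and common sense.
3. Contradiction Evaluation: if the surface sentiment contradicts the
   true intention, answer "Sarcastic"; otherwise answer "Not Sarcastic".
Respond with exactly one label: Sarcastic or Not Sarcastic."""


def build_coc_prompt(text: str, shots: Sequence[tuple[str, str]] = ()) -> str:
    """Assemble a CoC prompt, optionally prepending few-shot (text, label) pairs."""
    parts = [COC_INSTRUCTIONS]
    for shot_text, shot_label in shots:
        parts.append(f"Text: {shot_text}\nLabel: {shot_label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


print(build_coc_prompt("Oh great, another Monday 🙃"))
```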
This three-step process is applied in few-shot settings using formatted dialogue prompts. Evaluation spans five sarcasm datasets (IAC-V1, IAC-V2, Ghosh, iSarcasmEval, SemEval 2018) after cleaning, normalization, and emoji-to-token conversion. Models are tested under resource-aware conditions: GPT-4o-mini via the OpenAI API, with LLaMA 3-8B and Qwen 2-7B run locally via Ollama.
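As an illustration of the cleaning step, the sketch below normalizes a raw tweet before prompting. It assumes the third-party `emoji` package for emoji-to-token conversion; the specific cleaning rules are assumptions and may differ from the per-dataset preprocessing actually applied.

```python
import re

import emoji  # third-party: pip install emoji


def preprocess(text: str) -> str:
    """Normalize a raw sample: drop URLs and @mentions, convert emojis
    to text tokens, and collapse whitespace. (Illustrative rules only;
    the per-dataset cleaning steps may differ.)"""
    text = re.sub(r"https?://\S+", "", text)              # strip URLs
    text = re.sub(r"@\w+", "", text)                      # strip @mentions
    text = emoji.demojize(text, delimiters=(" :", ": "))  # 🙃 -> :upside_down_face:
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace


print(preprocess("Loving this weather 🙃 @someone https://t.co/x"))
# -> "Loving this weather :upside_down_face:"
```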
⚪ Results
CoC prompting demonstrated strong empirical performance:
- GPT-4o-mini achieved the highest recall across most datasets (e.g., 94.0% on IAC-V2).
- LLaMA 3-8B and Qwen 2-7B saw substantial F1 improvements on tweet-heavy datasets such as Ghosh, highlighting CoC’s ability to guide weaker models.
- CoC was most effective on datasets with explicit sarcasm markers and least effective on subtle, author-labeled sarcasm.
- Average F1 gains were most pronounced on the IAC and Ghosh datasets, while SemEval and iSarcasmEval saw only marginal improvement.
🔍 Despite modest precision trade-offs, CoC reliably boosted recall, which is crucial for real-world sarcasm detection.
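To show what that trade-off looks like in numbers, here is a toy computation of precision, recall, and F1 with scikit-learn; the labels are hypothetical and are not taken from the experiments.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical predictions mimicking CoC's behavior: one extra positive
# prediction (a false positive) costs some precision but keeps recall high.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = Sarcastic
y_pred = [1, 1, 1, 0, 1, 1, 0, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.83 recall=1.00 f1=0.91
```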
⚪ Challenges & Discussion
- Prompt adherence varied between models: LLaMA occasionally returned open-ended explanations instead of the requested label (see the parsing sketch after this list).
- Subtle sarcasm still eludes CoC when contextual cues are minimal.
- Preprocessing inconsistencies across datasets required additional normalization, especially for emojis.
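The sketch below illustrates one heuristic fix for the prompt-adherence issue: recovering a label from a free-form response. The regexes and fallback behavior are assumptions, not the exact post-processing used in this project.

```python
import re


def extract_label(response: str):
    """Map a free-form model response to a binary label when possible.
    Returns "Sarcastic", "Not Sarcastic", or None if no label is found.
    (Heuristic sketch; the actual post-processing rules may differ.)"""
    lowered = response.lower()
    # Match the negative form first: "not sarcastic" also contains "sarcastic".
    if re.search(r"\bnot\s+sarcastic\b|\bnon-?sarcastic\b", lowered):
        return "Not Sarcastic"
    if re.search(r"\bsarcas(tic|m)\b", lowered):
        return "Sarcastic"
    return None


print(extract_label("The speaker is clearly being sarcastic here."))  # Sarcastic
print(extract_label("This text is not sarcastic."))                   # Not Sarcastic
```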
⚪ Future Work
Future extensions include exploring Graph of Cues (GoC) and Tensor of Cues (ToC), which model multi-modal and higher-order sarcasm features. Reinforcement learning-based prompt tuning and expansion to instruction-tuned models may further improve generalizability. Evaluation on broader domains and additional model families could confirm whether CoC’s gains are domain- and architecture-agnostic.