Creativity isn’t a score you tick on a form or a label you glue to AI output. Yet the industry is obsessed with benchmarks - TTCT (Torrance Tests of Creative Thinking), TTCW (Torrance Test of Creative Writing), and CAT (Consensual Assessment Technique). Are these the gold standard for measuring LLM and AI creativity? Hardly. Here’s why you need to question everything you hear about “creative benchmarks”.
Introduction
Let’s be honest. Measuring creativity with the TTCT, the TTCW, or a CAT panel is like testing a jazz musician with scales - technically sound, tidy in theory, but missing the wildness that makes real artistry. Add large language models (LLMs) to the mix, and all the nice, neat assumptions go out the window.
This post tears apart the limits of these benchmarks. Expect brutal honesty, sharp analogies, and a roadmap for those who want something better for Product Marketing, Growth Hacking, and B2B Content Marketing. You’ll get answers, not platitudes.
What are the TTCT, TTCW, and CAT?
- TTCT: Measures divergent thinking - fluency, flexibility, originality, elaboration. It has long been used to rank children for school gifted-and-talented programmes.
- TTCW: An adaptation of the TTCT for creative writing. It assesses how original or expressive your story is - or whether you just rewrote Harry Potter with dragons in space.
- CAT: A panel of expert judges independently rates each piece of work - subjective, but closer to how we actually experience new ideas (a minimal scoring sketch follows this list).
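To make the CAT concrete: judges rate each artifact independently, and the creativity score is simply the consensus of those ratings. Here’s a minimal sketch of that aggregation, assuming a 1-5 scale; the ratings and the crude agreement proxy are invented for illustration (real CAT studies report proper inter-rater reliability statistics).

```python
from statistics import mean

# Hypothetical ratings: each entry maps one story to the scores
# a panel of expert judges gave it on a 1-5 creativity scale.
panel_ratings = {
    "story_a": [4, 5, 4],
    "story_b": [2, 3, 2],
}

def cat_score(ratings: list[int]) -> float:
    """CAT-style score: the mean of independent expert ratings."""
    return mean(ratings)

def agreement_spread(ratings: list[int]) -> int:
    """Crude agreement proxy: max minus min rating. Real studies use
    inter-rater reliability measures such as Cronbach's alpha."""
    return max(ratings) - min(ratings)

for story, ratings in panel_ratings.items():
    print(story, round(cat_score(ratings), 2), agreement_spread(ratings))
```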
In an era dominated by marketing automation and conversational AI, these tools have become fashionable benchmarks for judging LLM performance.
The Big Lie About Benchmarking AI Creativity
A hard truth: LLMs nail the structure of these tests - sometimes outscoring humans. But it’s an illusion. A high TTCT or TTCW score says nothing about whether AI can actually invent, surprise, or move us. Here’s why:
- Training vs. Creativity: LLMs regurgitate patterns. The more benchmark-style data they absorb in training, the better they game the test - no different from a band memorising one setlist. Play them something new and they freeze (a contamination-check sketch follows this list).
- “Expert” Subjectivity: CAT relies on human panels. Humans bring bias - favouring familiar forms or styles, especially if they know it’s AI. What’s “creative” becomes what’s “expected”.
- Success is a Mirage: Some studies trumpet that AI scores in the top 1% for originality. Nonsense. Those same models fail three to ten times more often when the TTCW is applied outside their favourite dataset.
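One way to catch test-gaming is a contamination check: if benchmark items appear nearly verbatim in the training or fine-tuning corpus, the model is replaying, not creating. Here’s a minimal sketch, assuming you hold both corpora as plain strings; the 8-gram window and function names are illustrative, and real audits add fuzzy matching at scale.

```python
# Minimal contamination check: what fraction of a benchmark item's
# 8-grams appear verbatim in a training document? High overlap
# suggests the model may be replaying the test rather than creating.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_doc, n)) / len(bench)
```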
TTCT and TTCW: Built for Humans, Skewed for AIs
- Cultural and Contextual Blindness: AI doesn’t “get” the context behind answers the way a person does. TTCT rewards left-field association, which is culturally loaded - LLMs don’t natively understand this. They imitate the language but miss the music behind the words.
- Static and Stale: Longitudinal TTCT data shows scores don’t really climb with age or training. You can “improve” by teaching people to hack the test, but that’s test savviness, not a genuine creative leap.
CAT: Creative by Committee
- Subjectivity Overload: Ask three experts if a story is creative and you’ll get five answers. CAT gives the illusion of rigour, but two problems linger:
  - Panel fatigue - human raters get bored, lazy, or over-influenced by previous answers.
  - Overfitting - LLMs get tuned to “sound creative” for these panels, yet remain utterly predictable in real product marketing or campaign work.
- No Process Transparency: You can’t audit a CAT panel the way you can a codebase. The subjectivity is the “feature” - and also the fatal weakness.
LLM Creativity Benchmarks: Currently a Racket
Recent industry attempts to benchmark copywriting, storytelling, or ideation in LLMs have devolved into competitions over who can tune a model for test conditions. Example: “GPT-4 outperforms humans on originality” - except that claim collapses the moment conditions change.
Problems in Practice:
- Surprise doesn’t survive a format change - shift the task structure and the “originality” evaporates.
- Scores are earned in zero-stakes settings; the same models falter when the stakes are real.
- Marketers need genuinely fresh product roadmaps and angles, not outputs optimised for a test rubric.
Why It Matters for Product Marketing and B2B Growth
Fake benchmarks deliver fake confidence - a trap for anyone pushing product marketing or content at scale.
- Brands using AI for market differentiation can’t rely on hollow scores; you need storytelling that resonates, not storytelling that merely “ticks boxes”.
- Agencies boasting about LLM benchmarks? Demand to see actual results - case studies, unique campaign ideas, unexpected angles.
- AI-generated content should fuel creativity, not flatten it.
What Marketers and Growth Teams Should Do Instead
Not all is lost. Ditching tired benchmarks opens doors for new approaches - and braver outcomes:
Steps for Real Creative Evaluation
- Diversity in Testing: Rotate prompts and domains to avoid “seen before” patterns, and force LLMs outside their comfort zones (a harness sketch follows this list).
- Human-AI Teamwork: Pair AI with human creatives, and measure the impact on campaign freshness, conversion rates, and customer feedback.
- Qualitative Reviews: Use audience reactions, peer scoring, and real KPIs. If the idea flops in the wild, it doesn’t matter how “creative” it scored.
- Continuous Iteration: Test, remix, and break the rules. Encourage surprise, not conformity.
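To ground the first step, here’s a minimal harness sketch for rotating domains and prompt templates each evaluation round. The domains, templates, and the commented-out generate()/rate() hooks are placeholders - wire in your own model call and review loop.

```python
import random

# Rotate domains and prompt templates so the model can't lean on
# memorised benchmark patterns. Everything here is illustrative.
DOMAINS = ["B2B SaaS onboarding", "industrial logistics", "fintech compliance"]
TEMPLATES = [
    "Pitch {domain} to a sceptical CFO in 50 words.",
    "Write a product tagline for {domain} that avoids every cliché you know.",
    "Describe {domain} as if explaining it to a bored teenager.",
]

def build_eval_round(rng: random.Random, k: int = 5) -> list[str]:
    """Sample k distinct domain/template pairs for one evaluation round."""
    pairs = [(d, t) for d in DOMAINS for t in TEMPLATES]
    return [t.format(domain=d) for d, t in rng.sample(pairs, k)]

for prompt in build_eval_round(random.Random(42)):
    print(prompt)
    # output = generate(prompt)   # hypothetical: your LLM call
    # score = rate(output)        # hypothetical: human panel / KPI loop
```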
Final Word: Challenge Every Creative Metric
If you’re in product marketing, growth hacking, or steering an agency’s product roadmap - don’t settle for the “best” AI on TTCT or similar. Hold vendors and internal AI teams accountable for impact, not just test scores.
Creativity is a punk band, not a symphony orchestra. Safe is boring. If your AI makes you safe, you’re playing the wrong game.