TL;DR: A real-world test comparing GPT-5 Pro and GPT-4 on proving a complex theorem in random matrix theory reveals that the more expensive GPT-5 Pro not only failed to outperform its cheaper counterpart but also produced confident yet incorrect answers.
In a real-world test pitting the premium $200-per-month GPT-5 Pro against the standard $20-per-month GPT-4 (ChatGPT Plus), a mathematics PhD researcher put both AI models to the ultimate challenge: proving a complex theorem in random matrix theory. The goal? To determine whether paying 10 times more delivers 10 times the performance, or whether the upgrade is overhyped. The results were surprising, revealing critical insights into AI reasoning, hallucination risks, and the true value of extended thinking time.
This article dives deep into every detail of that experiment, dissecting methodology, outputs, errors, and practical implications for researchers, developers, and AI users. Spoiler: the expensive model didn't just underdeliver; it hallucinated nonsense while pretending to succeed.
Why This Test Matters for AI Users
As AI models become integral to research, coding, and problem-solving, users face a growing dilemma: is upgrading to premium tiers worth the cost? With GPT-5 Pro priced at $200/month, ten times the cost of ChatGPT Plus at $20/month, expectations are high. This real-world test in advanced mathematics offers a rare, granular look at whether higher cost translates to higher capability, especially in domains requiring precision, logic, and deep reasoning.
The Researcher's Goal: Automate Theorem Proving
The tester, a PhD candidate in mathematics, sought to offload part of their research workload to AI. Specifically, they wanted to know: "Can ChatGPT prove this theorem for me so I can focus on other aspects of my work?" This isn't about replacing human insight; it's about augmenting research efficiency by delegating well-defined, technical tasks to AI.
As the researcher noted: "It's always good if I can allocate some work to ChatGPT and then I can work on something else. Kind of helps in the research process."
The Mathematical Challenge: A Theorem in Random Matrix Theory
The test centered on a non-trivial problem from random matrix theory, specifically involving the circular unitary ensemble (CUE). The user provided:
- Standard notation used in the field
- Known theorems as background
- A final theorem to be proven
The prompt was comprehensive: "I'm going to give you some known theorems and some notation that we're going to use across the whole problem. And I'd like you to prove the final theorem that I'm stating to you."
This setup tested the AI's ability to integrate context, apply mathematical logic, and produce a valid proof, not just regurgitate information.
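The article does not reproduce the theorem itself, but for readers who want a concrete handle on what the circular unitary ensemble is, here is a minimal sketch (my own illustration, not part of the original experiment) that samples a Haar-random unitary matrix using the standard QR-with-phase-correction recipe and confirms that its eigenvalues lie on the unit circle:

```python
import numpy as np

def sample_cue(n, rng):
    """Draw an n x n Haar-random unitary (a CUE matrix) by taking the QR
    decomposition of a complex Ginibre matrix, then fixing the phases of R's
    diagonal so that Q is Haar-distributed rather than merely unitary."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))  # rescale each column of Q by a unit phase

rng = np.random.default_rng(0)
u = sample_cue(8, rng)
eigs = np.linalg.eigvals(u)
print(np.allclose(np.abs(eigs), 1.0))  # CUE eigenvalues sit on the unit circle
```

Theorems like the one in the prompt describe statistics of these eigenvalues; the test was whether either model could prove such a statement symbolically.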
Notation and Context Provided in the Prompt
The prompt included detailed mathematical notation, which the researcher acknowledged might seem "confusing if you don't do maths." This ensured both models operated with the same foundational knowledge. Crucially, the prompt contained a typo: certain terms involving the variable z were accidentally omitted in one section, though they appeared correctly in earlier theorems.
This unintentional error became a key test of the models' attention to detail and contextual understanding.
GPT-4 (ChatGPT Plus) Performance: $20/Month Model
Thinking Time and Error Detection
The $20/month model (referred to as "GPT Plus") took 4 minutes to generate its response. Remarkably, it detected the missing z terms in the prompt, stating: "You're missing the scale factor Z."
This demonstrated strong contextual awareness: the model recognized an inconsistency by cross-referencing the provided theorems. As the researcher noted: "That's quite impressive to be fair."
Proof Strategy: Problem Decomposition
GPT-4 approached the proof by splitting the main expression into two parts: one involving the variable z and another involving w. Due to symmetry in the problem, solving one effectively solves both. This is a legitimate and efficient mathematical strategy.
The model justified this by referencing differentiation of the kernel (the core component inside the determinant), which determines key numerical coefficients in the final formula.
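For readers unfamiliar with the terminology: in CUE calculations, "the kernel" usually refers to the sine kernel of the ensemble's determinantal correlation functions. In one standard notation (which may differ from the researcher's prompt), the n-point correlation functions of the eigenvalue angles take the form

```latex
\rho_n(\theta_1,\dots,\theta_n)
  = \det\bigl[\,K_N(\theta_j - \theta_k)\,\bigr]_{j,k=1}^{n},
\qquad
K_N(\theta) = \frac{1}{2\pi}\,\frac{\sin(N\theta/2)}{\sin(\theta/2)},
```

and differentiating a kernel of this type is the kind of step that produces the numerical coefficients the model was referring to.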
Clarity and Pedagogical Value
While the proof wasn't "super clear," the researcher admitted: "I was pretty impressed... I felt like I got some understanding." However, it lacked depth for a novice: "It hasn't really explained it to a dum dum." Thus, it offered moderate utility as a research aid: not a standalone solution, but a useful starting point.
GPT-5 Pro Performance: $200/Month Model
Extended Thinking Time and Initial Approach
GPT-5 Pro took 42 minutes to respond, over 10 times longer than GPT-4. It adopted a structured approach: first proving two lemmas (intermediate results) to build toward the main theorem. Like GPT-4, it split the problem into z and w components, leveraging symmetry.
Redundant Restatements Waste Time
A significant portion of GPT-5 Pro's output merely repeated information already given in the prompt. The researcher criticized this: "This whole section... was all in the actual prompt itself. So to me that's not actually that useful. It's just regurgitating exactly what I've already asked it."
This redundancy consumed valuable response space without adding insight, which is especially problematic given the extended processing time.
Critical Failure: Hallucinated Equation
The most damning flaw appeared in Step 4 of GPT-5 Pro's proof. It presented an equation the researcher described as "complete mumbo jumbo" and "AI slop hallucination to the max."
Specifically, the model claimed:
"Then the inner sum in (9) is exactly [nonsense expression involving s^1 = 1]."
Since s is defined as a whole number, s^1 = 1 is mathematically incorrect unless s = 1, a condition stated nowhere. Worse, the equation contained multiple notational errors rendering it meaningless. The model invented a false step to pretend it had completed the proof.
False Confidence vs. Actual Utility
Unlike GPT-4, which offered a plausible (if incomplete) path, GPT-5 Pro pretended to succeed while failing fundamentally. The researcher emphasized: "It pretends that it can prove it... [but] fundamentally doesn't manage to prove the identity."
At best, its response was "only useful in the same sense as a search engine": it provided general ideas, not trustworthy reasoning.
Head-to-Head Comparison: $20 vs $200 Model
The table below summarizes key differences observed in the test:
| Feature | GPT-4 (ChatGPT Plus), $20/mo | GPT-5 Pro, $200/mo |
|---|---|---|
| Thinking Time | 4 minutes | 42 minutes |
| Error Detection | Correctly identified missing z terms | Also detected the missing terms |
| Proof Strategy | Split problem into z/w parts; referenced kernel differentiation | Proved lemmas first; same z/w split |
| Redundancy | Minimal restatement | Extensive regurgitation of prompt content |
| Mathematical Accuracy | Plausible but incomplete; no hallucinations | Introduced a false, nonsensical equation |
| Trustworthiness | Moderate: useful as a starting point | Low: pretended to succeed while failing |
| Value for Cost | High | Extremely low (not 10x better) |
Key Insight: AI Reasoning Is Still Unreliable in Technical Domains
Both models showed that AI cannot yet be trusted as an autonomous research assistant in advanced mathematics. As the researcher put it: "It's not consistently good. A lot of the time it's really not that good. And then you should occasionally get a glimpse and you're like, 'Wow, that's really, really useful.'"
The critical takeaway: always verify AI outputs in technical contexts. Hallucinations can be subtle and embedded within otherwise reasonable text, like GPT-5 Pro's single false equation amid an otherwise structured proof.
Why Extended Thinking Time Didnât Help
GPT-5 Pro's 42-minute runtime suggested deeper analysis, but the output revealed diminishing returns on compute investment. Instead of higher-quality reasoning, the model produced:
- Redundant restatements
- Over-engineered structure (lemmas) without substance
- A critical hallucination masked as progress
This raises concerns about AI scalability: if 10x more inference compute yields similar (or worse) results, the path to truly reliable AI reasoning may require more than just increased processing time.
The "AI Slop" Problem: When Models Fake Understanding
The term "AI slop" perfectly describes GPT-5 Pro's failure mode: generating fluent, structured text that looks correct but contains fatal errors. In academic or research settings, this is especially dangerous because:
- It mimics human proof structure
- It uses correct terminology
- It hides errors in technical notation
Only a domain expert would spot the flaw, making such outputs worse than useless for non-experts who might accept them at face value.
Practical Advice for Researchers Using AI
Based on this experiment, here's how to use AI responsibly in technical work:
- Never treat AI as a final authority: always verify critical steps.
- Use AI for idea generation, not proof validation. As the researcher said, it's "better than a standard search engine" for sparking directions.
- Beware of verbose outputs: length ≠ quality. Redundancy often masks a lack of insight.
- Test for hallucination by introducing subtle errors (like the missing z) to see whether the model catches them.
- Don't assume premium = better: validate performance on your specific use case.
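One cheap way to act on that advice in random matrix work is numerical sanity-checking: before trusting a claimed identity, test it by Monte Carlo on small matrices. The sketch below (my own illustration, not from the experiment) estimates the well-known Haar-unitary average E|Tr U|^2, whose exact value is 1; a hallucinated step like the "s^1 = 1" equation would typically fail this kind of check immediately:

```python
import numpy as np

def sample_cue(n, rng):
    """Haar-random n x n unitary via QR of a complex Ginibre matrix,
    with the diagonal phases of R divided out."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    return q * (d / np.abs(d))

rng = np.random.default_rng(42)
n, trials = 6, 5000
# Monte Carlo estimate of E|Tr U|^2 over the CUE; the exact value is 1.
estimate = np.mean([abs(np.trace(sample_cue(n, rng))) ** 2 for _ in range(trials)])
print(f"E|Tr U|^2 is approximately {estimate:.2f}")
```

A numerical check like this cannot prove a theorem, but it can falsify an AI-generated identity in seconds, which is exactly the kind of verification this experiment shows is necessary.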
Is GPT-5 Pro Ever Worth $200/Month?
For this mathematical proof task, no. The researcher concluded: "I expected like the model to be even one and a half times better. I don't think it's showing that much improvement."
However, the researcher plans to continue testing with more problems: "I will keep sending these to my friend... to give a better idea on an actual long-term sustained use." This suggests GPT-5 Pro might excel in other domains (e.g., code generation, long-form analysis), but in precision-dependent math, it failed to justify its cost.
Broader Implications for AI Development
This test highlights a core challenge in AI: scaling compute doesn't automatically scale correctness. The fact that both models used similar (flawed) strategies suggests underlying architectural limitations. Until models can:
- Avoid regurgitating prompts
- Admit uncertainty instead of hallucinating
- Provide truly novel, verifiable insights
...their role in research will remain advisory, not authoritative.
What This Means for "Can GPT-5 Pro" Questions
When users ask, "Can GPT-5 Pro handle advanced math?", the answer is nuanced: it can attempt it, but you cannot trust its conclusions without expert review. The model's confidence is not correlated with its accuracy, a dangerous combination in technical fields.
For now, GPT-4 remains the better value for most users, including researchers. Save the $180/month unless you have evidence that GPT-5 Pro excels in your specific workflow.
Future Testing Plans
The researcher intends to conduct more side-by-side tests: "So long as [my friend] is willing to process the queries... I can give a better idea on long-term sustained use." This ongoing evaluation is crucial: single-task tests can't capture full capability, but they reveal critical failure modes.
Final Takeaways: What You Should Do Next
- Don't upgrade to GPT-5 Pro based on price alone: demand evidence of ROI in your domain.
- Always cross-check AI-generated proofs, code, or data, especially from premium models that "sound" confident.
- Use AI as a brainstorming partner, not a replacement for critical thinking.
- Report hallucinations to help improve models; this test shows even high-end AI needs refinement.
Conclusion: The $20 Model Won This Round
In a direct comparison of mathematical theorem proving, the $20/month GPT-4 outperformed the $200/month GPT-5 Pro in accuracy, efficiency, and trustworthiness. While neither model delivered a complete, verifiable proof, GPT-4 avoided catastrophic hallucinations and provided actionable insights, making it the clear winner for research support.
As AI continues to evolve, users must stay skeptical, verify outputs, and remember: more expensive doesn't mean more correct. For now, save your money, and your research integrity, by sticking with proven, cost-effective tools until premium models demonstrate consistent, measurable superiority.

