ChatGPT vs Gemini

A detailed comparison of AI assistant subscriptions and performance

Feature ChatGPT Gemini
Free Go Plus Free Plus Pro
Reasoning model limit per day Unknown [GPT 5.4 Thinking Mini] 40(10 every 5h) [GPT 5.4 Thinking Mini] 428(3000/week)[GPT 5.5 Thinking] Unknown {Shared}(X every 5h)[All models] Unknown x2 {Shared}(X every 5h)[All models] Unknown x4 {Shared}(X every 5h)[All models]
Non-reasoning model limit per day 40(10 every 5h)[GPT 5.5 Instant] 1280(160 every 3h)[GPT 5.5 Instant] Unknown {Shared}(X every 5h)[All models] Unknown x2 {Shared}(X every 5h)[All models] Unknown x4 {Shared}(X every 5h)[All models]
Context window limit 27k / 256k(Non-reasoning / Reasoning) 54k / 256k(Non-reasoning / Reasoning) 32k 128k 1M
File uploads per day 3 30 640(80 every 3h) Unknown
Memory features Memories Memories, chats history, chat search Memories, chats history, ecosystem memory
Image generation per day Unknown[GPT Image 2 / GPT Image 2 Thinking] Unknown {Shared}(X every 5h)[NB Pro + NB 2 + NB 2 Thinking] Unknown x2 {Shared}(X every 5h)[NB Pro + NB 2 + NB 2 Thinking] Unknown x4 {Shared}(X every 5h)[NB Pro + NB 2 + NB 2 Thinking]
Image resolution (1:1) 1254×1254 (1.2K)[All image models] 2048×2048 (2K)[All image models]
Video generation per day No Unknown[Sora 2] No Unknown x2 {Shared}(X every 5h)[Veo Omni] Unknown x4 {Shared}(X every 5h)[Veo Omni]
Video resolution and length No 480p, 720p(up to 10s)[Sora 2] No 720p(up to 10s)[Veo 3.1 Lite]
Music generation per day No Unknown {Shared}(30s per song)[Lyria 3] Unknown x2 {Shared}(180s per song)[Lyria 3 Pro] Unknown x4 {Shared}(180s per song)[Lyria 3 Pro]
Custom model creation No[Custom GPTs] Yes[Custom GPTs] Yes[Gems]
Folder organization Yes[Projects] Yes[Notebooks]
Maximum active scheduled tasks No 3[Tasks] 5[Tasks] No 10[Scheduled Actions]
Deep research uses per month Unknown[GPT 5.5 Thinking / GPT o3] Unknown {Shared}[Gemini 3.5 Flash Extended] Unknown x2 {Shared}[Gemini 3.1 Pro] Unknown x4 {Shared}[Gemini 3.1 Pro]
Agent mode uses per month No 40[Agent Mode] No
Advertisements Yes No No
Video inputs No Yes(Up to 5 minutes per file) Yes(Up to 60 minutes per file)
Audio inputs No Yes(Up to 10 minutes per file) Yes(Up to 180 minutes per file)
YouTube summaries No Yes
Cloud storage included No 15GB 200GB 5TB
API credit included No No $10/month(Unused credit expires monthly)
Plan sharing No No Up to 6 accounts
Age estimation and verification method AI-based; verification only with ID, through a third party (Persona Identities) Birth date based (prioritized), AI-based; verification with ID or credit card, sent directly to Google
Best option
Sufficient option (subjectively)
Includes USA-exclusive features
Unknown value
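
The per-day figures in the table appear to be derived from the rolling limits shown in parentheses, for example 40 from "10 every 5h" and 428 from "3000/week". Below is a minimal sketch of that arithmetic. It assumes only complete windows inside 24 hours are counted and that weekly quotas are divided by 7 and rounded down; the page does not state the method, and the function names are illustrative.

```python
def daily_from_window(quota: int, window_hours: int) -> int:
    """Messages per day, counting only complete rolling windows in 24 hours."""
    if quota < 0 or window_hours <= 0:
        raise ValueError("quota must be >= 0 and window_hours must be > 0")
    complete_windows = 24 // window_hours
    return complete_windows * quota


def daily_from_week(weekly_quota: int) -> int:
    """Messages per day when the quota resets weekly (rounded down)."""
    if weekly_quota < 0:
        raise ValueError("weekly_quota must be >= 0")
    return weekly_quota // 7


# Figures taken from the table rows above.
reasoning_go = daily_from_window(10, 5)        # "10 every 5h"
non_reasoning_plus = daily_from_window(160, 3)  # "160 every 3h"
uploads_plus = daily_from_window(80, 3)         # "80 every 3h"
reasoning_plus = daily_from_week(3000)          # "3000/week"
```

Under these assumptions the results match the table's 40, 1280, 640 and 428. Counting a partial fifth window for the 5-hour limit would give a higher figure, so treat the daily numbers as a conservative estimate.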

Footnotes

Model Filters
Reasoning models
GPT 5.5 Thinking Extra High
GPT 5.5 Thinking High
GPT 5.5 Thinking Medium
GPT 5.4 Thinking Mini
Gemini 3.1 Pro Extended
Gemini 3.5 Flash Extended
Gemini 3.1 Flash Lite Extended
Claude 5 Fable Adaptive
Claude 4.8 Opus Adaptive
Claude 4.6 Sonnet Adaptive
Claude 4.5 Haiku Extended
Claude 4.6 Opus Adaptive
Grok 4.3 Expert
GLM 5.2 Deep Think Max
GLM 5V Turbo Deep Think
GLM 5 Turbo Deep Think
Qwen 3.7 Max Thinking
Qwen 3.7 Plus Thinking
Qwen 3.6 Plus Thinking
Kimi K2.6 Thinking
Mimo V2.5 Pro
Mimo V2.5
MiniMax M3 Thinking
Nemotron 3 Ultra Reasoning
Non-reasoning models
GPT 5.5 Instant
Gemini 3.5 Flash
Claude 4.6 Sonnet
Claude 4.5 Haiku
Claude 4.6 Opus
Grok 4.3 Fast
Qwen 3.5 Omni Plus Fast
Kimi K2.6 Instant
Extra models Selection of AI models with a low hallucination rate
Legacy models AI models that have been superseded by a newer version but still outperform it in some benchmarks
Benchmark Filters
Accuracy benchmarks
AA-Omniscience
Long context benchmarks
MRCR v2 8-Needle
AA-LCR
Max Context Limit
Multimodal benchmarks
MMMU-Pro
Agentic benchmarks
GDPval-AA v2
Coding benchmarks
Terminal-Bench v2.1

Knowledge Accuracy — AA-Omniscience Benchmark

Shows how often the model is right or wrong when answering. Higher is better for Correct & Abstention. Lower is better for Hallucination.

AA-Omniscience Notes

  • This benchmark evaluates the accuracy of models when answering questions without web search. A model with good web search in its web or app version can therefore partially compensate for a high hallucination rate.
  • Correct answers were calculated with Correct = A.
  • Abstention answers were calculated with Abstention = (1−A)⋅(1−H).
  • Hallucination answers were calculated with Hallucination = (1−A)⋅H.
  • (Where "A" is the percentage scored on the "AA-Omniscience Accuracy" benchmark and "H" is the percentage scored on the "AA-Omniscience Hallucination Rate" benchmark.)
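
The three formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is made up, and the sample inputs (A = 40%, H = 25%) are invented rather than taken from any model's results.

```python
def omniscience_breakdown(accuracy: float, hallucination_rate: float) -> dict[str, float]:
    """Split answers into correct / abstention / hallucination shares.

    Both inputs are fractions between 0 and 1 (e.g. 0.40 for 40%).
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= hallucination_rate <= 1.0):
        raise ValueError("inputs must be fractions between 0 and 1")
    not_correct = 1.0 - accuracy  # questions the model did not answer correctly
    return {
        "correct": accuracy,                                     # Correct = A
        "abstention": not_correct * (1.0 - hallucination_rate),  # (1 - A) * (1 - H)
        "hallucination": not_correct * hallucination_rate,       # (1 - A) * H
    }


# Hypothetical model: 40% accuracy, 25% hallucination rate.
shares = omniscience_breakdown(0.40, 0.25)
for name, value in shares.items():
    print(f"{name}: {value:.0%}")
```

The three shares always add up to 100%, which is why they can be shown as a single stacked bar per model.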

Long Context Accuracy — MRCR v2 8-Needle Benchmark

Shows how accuracy at locating specific details in context degrades as context length increases. Ranges from 8k to 1048k (1M) tokens. Higher is better.

MRCR v2 8-Needle Notes

  • Differs greatly from the next benchmark. While AA-LCR revolves around comprehending very long texts, MRCR v2 8-Needle evaluates the model's ability to find specific information in those texts.
  • There are no benchmarks for 200k tokens.
  • This benchmark doesn't cover all models, so some of the selected models may not appear here.
  • Maximum context window of GPT Instant models is 27k/54k/128k tokens (Free/Plus,Go/Pro).
  • Maximum context window of GPT Thinking models is 256k/400k tokens (Free,Go,Plus/Pro).
  • Maximum context window of all Gemini models is 32k/128k/1M tokens (Free/Plus/Pro,Ultra).
  • Maximum context window of Claude >4.6 models is 200k/500k tokens (Free/Pro,Max x5,Max x20); other Claude models have 200k tokens.
  • Maximum context window of all Grok models is 1M tokens.
  • Maximum context window of GLM 5.2 models is 1M tokens; other GLM models have 200k tokens.
  • Maximum context window of Qwen Omni models is 262k tokens.
  • Maximum context window of Qwen Plus and Max models is 1M tokens.
  • Maximum context window of all Kimi models is 200k tokens.
  • Maximum context window of all Mimo models is 1M tokens.
  • Maximum context window of all MiniMax models is 1M tokens. However, only the results up to 256k tokens have been published for this benchmark.
  • Maximum context window of all Nemotron models is 1M tokens.
  • 1M tokens equals 1048k tokens.
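
Because the plan limits above are often far below 1M, only part of the benchmark's range is reachable on a given subscription. Here is a rough sketch of that lookup. It assumes the benchmark is reported at a doubling series of lengths (8k, 16k, ... 1048k), which is my reading of the range stated above and not something the page spells out. The dictionary holds only a few of the plan limits from the notes, in nominal thousands of tokens.

```python
# Assumed benchmark lengths, in nominal thousands of tokens.
BENCHMARK_LENGTHS_K = [8, 16, 32, 64, 128, 256, 512, 1048]

# A few context windows from the notes above (nominal k tokens).
PLAN_WINDOWS_K = {
    ("GPT Instant", "Free"): 27,
    ("GPT Instant", "Plus"): 54,
    ("GPT Thinking", "Plus"): 256,
    ("Gemini", "Free"): 32,
    ("Gemini", "Plus"): 128,
    ("Gemini", "Pro"): 1048,
}


def reachable_lengths(model_family: str, plan: str) -> list[int]:
    """Benchmark lengths (k tokens) that fit inside the plan's context window."""
    window = PLAN_WINDOWS_K[(model_family, plan)]
    return [length for length in BENCHMARK_LENGTHS_K if length <= window]


print(reachable_lengths("GPT Instant", "Free"))
print(reachable_lengths("Gemini", "Plus"))
```

Scores at lengths beyond the returned list describe what the model can do in principle, not what that plan lets you use.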

Long Context Reasoning — AA-LCR

Shows how well models extract, comprehend and synthesize information from texts ranging from 10k to 100k tokens. Higher is better.

AA-LCR Notes

  • Differs greatly from the previous benchmark. While AA-LCR revolves around comprehending very long texts, MRCR v2 8-Needle evaluates the model's ability to find specific information in those texts.
  • Maximum context window of GPT Instant models is 27k/54k/128k tokens (Free/Plus,Go/Pro).
  • Maximum context window of GPT Thinking models is 256k/400k tokens (Free,Go,Plus/Pro).
  • Maximum context window of all Gemini models is 32k/128k/1M tokens (Free/Plus/Pro,Ultra).
  • Maximum context window of Claude >4.6 models is 200k/500k tokens (Free/Pro,Max x5,Max x20); other Claude models have 200k tokens.
  • Maximum context window of all Grok models is 1M tokens.
  • Maximum context window of GLM 5.2 models is 1M tokens; other GLM models have 200k tokens.
  • Maximum context window of Qwen Omni models is 262k tokens.
  • Maximum context window of Qwen Plus and Max models is 1M tokens.
  • Maximum context window of all Kimi models is 200k tokens.
  • Maximum context window of all Mimo models is 1M tokens.
  • Maximum context window of all MiniMax models is 1M tokens.
  • Maximum context window of all Nemotron models is 1M tokens.
  • 1M tokens equals 1048k tokens.

Max Context Limit

Maximum context window size in tokens. Keep in mind that a higher context limit doesn't always mean good performance at long context lengths.

Visual Reasoning — MMMU-Pro

Shows performance processing visual and textual information at the same time. Higher is better.

MMMU-Pro Notes

  • GPT models are capable of understanding text and images.
  • Gemini models are capable of understanding text, images, video and audio.
  • Claude models are capable of understanding text and images.
  • Grok models are capable of understanding text and images.
  • GLM models are capable of understanding only text. Image and video inputs are likely routed to GLM 5V Turbo Deep Think and GLM 5V Turbo respectively.
  • Qwen 3.7 Max Thinking is capable of understanding only text.
  • Qwen 3.6 Plus Thinking is capable of understanding text, images and video.
  • Qwen 3.5 Omni Plus Fast is capable of understanding text, images, video and audio.
  • Kimi models are capable of understanding text, images and video.
  • Mimo V2.5 Pro is capable of understanding only text.
  • Mimo V2.5 is capable of understanding text, images, video and audio.
  • MiniMax models are capable of understanding only text. Image and video inputs are routed to another unknown model.
  • Models capable of understanding only text are usually able to understand text inside images, but nothing else in them.

Agentic Work — GDPval-AA v2

Shows performance producing outputs like documents, slides, diagrams, and spreadsheets, mirroring actual work across multiple professional domains. Higher is better.

GDPval-AA v2 Notes

  • Human baseline corresponds to an Elo of 1000.
  • Values for non-reasoning models on this benchmark are very scarce, so I've decided against including or estimating them. Reasoning models usually perform better than their non-reasoning counterparts anyway.
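
To get a feel for what a rating above or below the human baseline of 1000 means, here is a sketch using the standard Elo expected-score formula. Whether GDPval-AA v2 uses the conventional 400-point scale is an assumption on my part, so treat the output as a rough intuition aid and not as the benchmark's own definition.

```python
def expected_score(rating: float, baseline: float = 1000.0, scale: float = 400.0) -> float:
    """Expected share of head-to-head comparisons won against the baseline.

    Standard Elo logistic curve; the 400-point scale is assumed, not confirmed
    for this benchmark.
    """
    return 1.0 / (1.0 + 10.0 ** ((baseline - rating) / scale))


# Hypothetical ratings, not taken from the benchmark results.
for rating in (800, 1000, 1200, 1400):
    print(rating, f"{expected_score(rating):.0%}")
```

A rating equal to the baseline gives 50%, and under the assumed scale every additional 400 points multiplies the odds of being preferred by ten.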

Agentic Coding — Terminal-Bench v2.1

Shows performance of agents on terminal tasks, including workflows such as debugging code, configuring systems, and resolving technical problems. Higher is better.

Terminal-Bench v2.1 Notes

  • Values for non-reasoning models on this benchmark are very scarce, so I've decided against including or estimating them. Reasoning models usually perform better than their non-reasoning counterparts anyway.

Notes

  • All models are presented with their respective highest settings.
  • I have not been able to find consistent graphs for the following models: Claude 4.8 Opus, Gemini 3.1 Pro, Gemini 3.5 Flash, Gemini 3.1 Flash Lite, GLM 5.2, GLM 5 Turbo, GLM 5V Turbo, Qwen 3.7 Plus Fast, Qwen 3.6 Plus Fast, MiniMax M3 and Nemotron 3 Ultra. I will add them immediately if I find some.
  • GPT 5.5 Thinking Extra High is available for Plus and Pro ChatGPT users through the Codex app, with limits that reset weekly. It's also available on the web for Pro users.
  • Claude Opus 4.6 Adaptive and Claude Sonnet 4.6 Adaptive are available for free to Gemini users through the Antigravity app, with limits that reset depending on demand.

ChatGPT

  • Web search on thinking models takes significantly longer (it can easily take 7-10 min vs Gemini's 1-3 min)
  • Easy switching between different personalities
  • Can connect to a limited selection of websites, as well as to a web version of Photoshop, though tool calls can be inconsistent

Gemini

  • Nano Banana Pro generates images faster and at higher resolution than GPT Image 2. Realism is on par; GPT Image 2 seems to be better at rendering text.
  • Native Android integration (iOS support is confirmed to be coming soon), though it can be over-eager with tool calls
  • Offers access to Claude Opus 4.6 Adaptive and Claude Sonnet 4.6 Adaptive through the Antigravity app