Digital Twin Model Evaluation: GPT-4.1 vs Gemini 2.5 Flash

Paired comparison across 58 questions, holding retrieval, system prompt, vector store, top-k, and temperature constant. Companion artifact to the blog post; intended for readers who want the numbers and the methodology behind them. Run date: May 11, 2026.

Questions compared: 58
Mean word delta (GPT minus Gemini): +44.4
Mean content overlap (Jaccard on content words): 0.21
Cost ratio (GPT / Gemini): 2.7x

Executive readout

GPT-4.1 behaved like the more expansive portfolio guide. It was longer on 44 of 58 questions, used more follow-up invitations, and more often turned retrieved facts into a fuller narrative answer.

Gemini 2.5 Flash behaved like the more concise answerer. It was shorter on most questions, much cheaper, and often serviceable on closed factual prompts. Its main weakness was not missing facts; it more often under-developed the answer relative to the Digital Twin's intended voice and purpose.

The retrieval layer does not explain these differences. Top-k and the question set were held constant, and chunk-similarity statistics are essentially identical across runs. The variation lies in how each model used the same retrieved context.

The evaluator score and notes columns in the uploaded CSVs are present but empty, so this report does not treat those as evidence. Human scoring against the rubric below is the planned next step, applied to the 15-question portfolio-critical subset.

Methodology

Two runs of the same 58-question evaluation set, held constant on retrieval (Chroma vector store, top-k = 4), system prompt, and temperature (0.6). Only the model varied. Metrics fall into three buckets: behavioral counts, lexical overlap, and cost.
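The paired-run setup can be sketched as a small harness. The `ask_a` and `ask_b` callables below are hypothetical wrappers (not code from the actual run) that bundle the fixed retrieval, system prompt, and temperature, so only the model differs between the two calls:

```python
from typing import Callable

def word_count(text: str) -> int:
    """Simple whitespace word count, matching the report's word metrics."""
    return len(text.split())

def paired_word_deltas(
    questions: list[str],
    ask_a: Callable[[str], str],  # e.g. a GPT-4.1 wrapper (hypothetical)
    ask_b: Callable[[str], str],  # e.g. a Gemini 2.5 Flash wrapper (hypothetical)
) -> list[int]:
    """For each question, A's word count minus B's; positive means A answered longer.

    Retrieval, system prompt, and temperature live inside the wrappers,
    so the only varying factor between the two calls is the model.
    """
    return [word_count(ask_a(q)) - word_count(ask_b(q)) for q in questions]
```

Averaging the returned deltas gives the "mean word delta" figure reported above.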

Follow-up flag count. Per-response binary detection of phrases inviting continued conversation. Heuristic match against a small dictionary of openers and closers (for example "happy to share more," "want to know about").
Follow-up phrase count. Same dictionary but counting total occurrences, not capped at one per response.
Warm opener count. Heuristic match against opening phrases like "That's a great question."
Linked answer count. Responses containing one or more markdown or URL links.
Markdown intensity. Strong = three or more markdown elements (headers, lists, bolds). Light = one or two. None = plain prose.
Project term count. Mentions of a fixed list of portfolio project and concept names (see Project / concept mention counts).
Jaccard overlap. Set-similarity score on content words (stopwords removed) between the paired GPT and Gemini responses. Higher means the two answers used more of the same vocabulary.
Must-cover lexical coverage. Fraction of expected-term tokens that appear in the response. A smoke test for grounding, not a correctness measure.
Estimated cost. Token counts multiplied by published per-million-token pricing as of the run date. Provider tokenizers differ, so dollar cost is more interpretable than raw token counts.
Readability. Flesch-Kincaid grade and Flesch reading ease, applied to each response and averaged across the run.
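Three of the metrics above can be sketched in a few lines. The stopword list and follow-up phrase dictionary below are illustrative stand-ins; the actual lists used in the run are not reproduced here:

```python
import re

# Illustrative stopword list; the real run presumably uses a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
             "i", "you", "my", "with", "for", "on", "that", "this"}

# Hypothetical follow-up dictionary (the report's actual phrase list is larger).
FOLLOWUP_PHRASES = ["happy to share more", "want to know about",
                    "let me know if", "feel free to ask"]

def content_words(text: str) -> set[str]:
    """Lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard_overlap(a: str, b: str) -> float:
    """Set similarity of content words between two paired responses."""
    wa, wb = content_words(a), content_words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def must_cover_coverage(response: str, expected_terms: list[str]) -> float:
    """Fraction of expected terms appearing in the response (grounding smoke test)."""
    if not expected_terms:
        return 0.0
    text = response.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)

def followup_flag(response: str) -> bool:
    """Binary: does the response invite continued conversation?"""
    text = response.lower()
    return any(p in text for p in FOLLOWUP_PHRASES)
```

The follow-up phrase count is the same dictionary scan, summing all matches instead of returning a single flag.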

The 15-question portfolio-critical subset

The full 58-question set is too large to score by hand after every change. These 15 questions are my first-pass quality check for whether the Twin can make a strong, grounded impression across career, project, and technical conversations.

  1. What problems does Barbara solve?
  2. Walk me through a project.
  3. How was this Digital Twin built?
  4. How does your cognitive science background shape your AI work?
  5. What kinds of problems get you excited?
  6. What are your biggest strengths?
  7. How do you approach problem-solving?
  8. How do you handle ambiguity?
  9. How do you document your projects?
  10. How do you approach ML model development?
  11. Tell me about Resume Explorer.
  12. Tell me about Concept Cartographer.
  13. Tell me about the Beehive Photo Metadata Tracker.
  14. What role are you looking for next?
  15. Why should someone hire you?

Scoring rubric

Six dimensions, each scored from 1 (weak) to 5 (strong). Total possible score per response: 30. This rubric structure is the scaffold for the next round of human-scored evaluation. The dimensions are deliberately separable so a response can be accurate but bland, or specific but ungrounded, and the rubric will say so.

| Dimension | What it asks | Low score (1-2) | High score (4-5) |
|---|---|---|---|
| Accuracy | Did the response get the facts right? | Contains a factual error or invented detail. | Every claim is supported by the knowledge base or known true. |
| Specificity | Did it avoid generic AI or career language? | Could describe any AI consultant. | Uses specifics only Barbara could provide. |
| Voice fidelity | Did it sound like Barbara? | Reads like a generic chatbot. | Reads like Barbara wrote it. |
| Strategic usefulness | Would this help a recruiter or peer understand her value? | Answers the question but doesn't advance the conversation. | Turns the answer into an invitation to explore further. |
| Grounding | Did it stay tied to retrieved or source material? | Drifts into plausible-but-unsourced claims. | Every concrete claim traces to a chunk. |
| Follow-up quality | Did the next-step invitation feel specific and natural? | Generic or absent ("Let me know if you have more questions"). | Specific to the answer, opening a useful direction. |
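One way to record these scores during the planned human pass is a small dataclass, one instance per scored response. The field names are my shorthand for the six dimensions, not an existing schema:

```python
from dataclasses import dataclass, fields

@dataclass
class RubricScore:
    """One human score per response; each dimension runs 1 (weak) to 5 (strong)."""
    accuracy: int
    specificity: int
    voice_fidelity: int
    strategic_usefulness: int
    grounding: int
    followup_quality: int

    def total(self) -> int:
        """Sum across the six dimensions; maximum possible is 30."""
        return sum(getattr(self, f.name) for f in fields(self))
```

Keeping the dimensions as separate fields preserves the property noted above: a response can score high on accuracy while scoring low on specificity or grounding, and the record shows exactly where.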

Run-level quantitative summary

| Metric | GPT-4.1 | Gemini 2.5 Flash | GPT minus Gemini |
|---|---|---|---|
| Mean words per response | 225.5 | 181.2 | +44.4 |
| Median words per response | 225.0 | 198.0 | +27.0 |
| SD of words | 103.5 | 93.5 | +10.1 |
| Min words | 5 | 26 | -21 |
| Max words | 438 | 385 | +53 |
| Mean sentences | 10.4 | 9.5 | +0.8 |
| Mean words per sentence | 22.5 | 18.7 | +3.8 |
| Mean Flesch-Kincaid grade | 12.2 | 11.3 | +0.9 |
| Mean Flesch reading ease | 47.7 | 47.5 | +0.2 |
| Type-token ratio | 0.86 | 0.87 | -0.01 |
| Follow-up flag count | 40 | 10 | +30 |
| Follow-up phrase count | 49 | 20 | +29 |
| Warm opener count | 6 | 18 | -12 |
| Linked answer count | 14 | 10 | +4 |
| Strong markdown count | 30 | 23 | +7 |
| Light markdown count | 15 | 32 | -17 |
| No markdown count | 13 | 3 | +10 |
| Mean project term count | 1.93 | 1.57 | +0.36 |
| Mean must-cover lexical coverage | 0.51 | 0.48 | +0.04 |
| Completion tokens | 17,145 | 32,897 | -15,752 |
| Prompt tokens | 397,713 | 421,279 | -23,566 |
| Total tokens | 414,858 | 454,176 | -39,318 |
| Estimated cost USD | $0.489 | $0.181 | +$0.309 |

Token counts are not directly comparable across providers because each tokenizer is different. Gemini reported more completion tokens despite shorter word counts. Word count and estimated dollar cost are the more interpretable comparisons.
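The cost estimate is just token counts multiplied by per-million-token prices. A minimal sketch, with placeholder model names and placeholder prices (NOT actual provider rates; look up current pricing before reusing):

```python
# Placeholder per-million-token prices in USD; illustrative only.
PRICES_PER_M = {
    "model_a": {"prompt": 2.00, "completion": 8.00},
    "model_b": {"prompt": 0.30, "completion": 2.50},
}

def estimated_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost from a provider's reported token counts and its price card."""
    p = PRICES_PER_M[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Because each provider's tokenizer counts its own tokens, the function is only applied within a provider; only the resulting dollar figures are compared across providers.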

Breakdown by question type

Higher-level partition of the 58 questions into three types: bounded-open (most of the set), closed-fact (biographical, credentials), and open-persona (the playful one).

| Question type | N | GPT mean words | Gemini mean words | Delta words | Jaccard | GPT follow-ups | Gemini follow-ups |
|---|---|---|---|---|---|---|---|
| bounded_open | 46 | 245.5 | 202.7 | +42.8 | 0.20 | 31 | 10 |
| closed_fact | 11 | 147.9 | 99.4 | +48.5 | 0.22 | 9 | 0 |
| open_persona | 1 | 162.0 | 93.0 | +69.0 | 0.13 | 0 | 0 |

Breakdown by intent

Finer-grained partition. Sorted by N descending; intents with only one question (the bottom rows) carry no statistical weight but are included for completeness.

| Intent | N | GPT mean words | Gemini mean words | Delta words | Jaccard | GPT follow-ups | Gemini follow-ups | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|---|
| technical_overview | 8 | 240.5 | 187.4 | +53.1 | 0.18 | 7 | 1 | 0.56 | 0.45 |
| career_history_fact | 5 | 177.6 | 138.8 | +38.8 | 0.24 | 5 | 0 | 0.50 | 0.49 |
| working_style | 5 | 289.6 | 200.2 | +89.4 | 0.16 | 3 | 0 | 0.31 | 0.33 |
| education_fact | 4 | 106.5 | 55.0 | +51.5 | 0.16 | 3 | 0 | 0.85 | 0.88 |
| methodology | 3 | 362.0 | 244.3 | +117.7 | 0.21 | 2 | 0 | 0.43 | 0.31 |
| project_walkthrough | 3 | 303.0 | 288.3 | +14.7 | 0.23 | 3 | 2 | 0.59 | 0.55 |
| tool_encouragement | 3 | 78.7 | 49.0 | +29.7 | 0.14 | 1 | 1 | 0.13 | 0.14 |
| knowledge_graph_overview | 2 | 355.0 | 351.5 | +3.5 | 0.28 | 1 | 0 | 0.62 | 0.58 |
| motivation | 2 | 209.0 | 204.5 | +4.5 | 0.24 | 0 | 0 | 0.77 | 0.65 |
| personal_background | 2 | 180.5 | 121.0 | +59.5 | 0.23 | 1 | 0 | 0.81 | 0.75 |
| product_architecture | 2 | 313.5 | 201.0 | +112.5 | 0.28 | 2 | 0 | 0.66 | 0.66 |
| tool_dice | 2 | 23.0 | 36.5 | -13.5 | 0.08 | 1 | 1 | 0.10 | 0.20 |
| career_next_step | 1 | 205.0 | 171.0 | +34.0 | 0.17 | 1 | 0 | 0.50 | 0.20 |
| communication_bridge | 1 | 351.0 | 338.0 | +13.0 | 0.16 | 1 | 0 | 0.25 | 0.50 |
| credential_fact | 1 | 119.0 | 139.0 | -20.0 | 0.34 | 1 | 0 | 0.83 | 0.83 |
| evaluation_capability | 1 | 298.0 | 206.0 | +92.0 | 0.29 | 0 | 0 | 0.82 | 0.82 |
| hallucination_reflection | 1 | 246.0 | 273.0 | -27.0 | 0.13 | 0 | 0 | 0.25 | 0.00 |
| knowledge_boundary_check | 1 | 191.0 | 92.0 | +99.0 | 0.23 | 1 | 0 | 0.29 | 0.14 |
| learning_style | 1 | 265.0 | 288.0 | -23.0 | 0.18 | 0 | 0 | 0.36 | 0.36 |
| persona_self_description | 1 | 263.0 | 233.0 | +30.0 | 0.15 | 1 | 1 | 0.40 | 0.50 |
| personal_professional_bridge | 1 | 207.0 | 246.0 | -39.0 | 0.23 | 0 | 0 | 0.38 | 0.50 |
| playful_persona | 1 | 162.0 | 93.0 | +69.0 | 0.13 | 0 | 0 | 0.29 | 0.29 |
| positioning | 1 | 288.0 | 255.0 | +33.0 | 0.21 | 1 | 0 | 0.36 | 0.36 |
| product_overview | 1 | 334.0 | 227.0 | +107.0 | 0.22 | 1 | 0 | 0.75 | 0.83 |
| project_detail | 1 | 401.0 | 357.0 | +44.0 | 0.22 | 1 | 1 | 0.70 | 0.50 |
| publication_fact | 1 | 173.0 | 121.0 | +52.0 | 0.28 | 0 | 1 | 0.60 | 0.50 |
| research_explanation | 1 | 230.0 | 213.0 | +17.0 | 0.33 | 1 | 0 | 0.86 | 0.79 |
| resource_location | 1 | 122.0 | 108.0 | +14.0 | 0.29 | 1 | 1 | 0.67 | 0.50 |
| tech_stack | 1 | 147.0 | 160.0 | -13.0 | 0.36 | 1 | 1 | 0.25 | 0.25 |

GPT much longer

| ID | Question | Intent | GPT words | Gemini words | Delta | Jaccard | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|
| V3-019 | How do you approach problem-solving? | methodology | 438 | 243 | +195 | 0.17 | 0.67 | 0.50 |
| V3-027 | What makes Poolula Platform modular? | product_architecture | 391 | 206 | +185 | 0.21 | 0.69 | 0.77 |
| V3-022 | How do you handle ambiguity in projects? | working_style | 388 | 222 | +166 | 0.17 | 0.31 | 0.31 |
| V3-045 | What vector databases have you used? | technical_overview | 216 | 63 | +153 | 0.11 | 0.58 | 0.50 |
| V3-029 | What is the Beehive Photo Metadata Tracker? | project_walkthrough | 358 | 209 | +149 | 0.21 | 0.69 | 0.69 |
| V3-042 | How do you approach ML model development? | methodology | 375 | 240 | +135 | 0.21 | 0.25 | 0.25 |
| V3-040 | How do you document your projects? | working_style | 324 | 190 | +134 | 0.18 | 0.25 | 0.12 |
| V3-048 | How do you collaborate with engineering teams? | working_style | 277 | 151 | +126 | 0.13 | 0.38 | 0.46 |
| V3-044 | What is your approach to RAG systems? | technical_overview | 392 | 281 | +111 | 0.24 | 0.91 | 0.73 |
| V3-031 | What is Poolula Platform? | product_overview | 334 | 227 | +107 | 0.22 | 0.75 | 0.83 |
| V3-006 | Tell me about your dogs | personal_background | 279 | 174 | +105 | 0.28 | 0.62 | 0.50 |
| V3-047 | What's your experience with data engineering? | technical_overview | 343 | 241 | +102 | 0.18 | 0.56 | 0.44 |

Gemini longer

| ID | Question | Intent | GPT words | Gemini words | Delta | Jaccard | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|
| V3-012 | Walk me through a project | project_walkthrough | 253 | 360 | -107 | 0.23 | 0.30 | 0.40 |
| V3-050 | What knowledge graph experience do you have? | knowledge_graph_overview | 337 | 385 | -48 | 0.27 | 0.67 | 0.67 |
| V3-056 | What is the thing you like to do the most for fun and work | personal_professional_bridge | 207 | 246 | -39 | 0.23 | 0.38 | 0.50 |
| V3-037 | What programming languages are featured in your projects? | technical_overview | 181 | 208 | -27 | 0.18 | 0.36 | 0.29 |
| V3-039 | What visualization tools do you use in your projects? | technical_overview | 262 | 289 | -27 | 0.16 | 0.60 | 0.30 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 246 | 273 | -27 | 0.13 | 0.25 | 0.00 |
| V3-017 | What drives you? | motivation | 184 | 209 | -25 | 0.22 | 0.67 | 0.67 |
| V3-023 | What's your philosophy on learning new technologies? | learning_style | 265 | 288 | -23 | 0.18 | 0.36 | 0.36 |
| V3-053 | Roll a die for me | tool_dice | 5 | 26 | -21 | 0.06 | 0.20 | 0.20 |
| V3-011 | What certifications do you have? | credential_fact | 119 | 139 | -20 | 0.34 | 0.83 | 0.83 |
| V3-021 | What's your take on documentation? | working_style | 251 | 266 | -15 | 0.14 | 0.40 | 0.30 |
| V3-030 | What technologies does the Fitness Dashboard use? | tech_stack | 147 | 160 | -13 | 0.36 | 0.25 | 0.25 |

Lowest response overlap

Pairs where the two models took the most different approaches. Tool-related and free-form questions dominate.

| ID | Question | Intent | Jaccard |
|---|---|---|---|
| V3-053 | Roll a die for me | tool_dice | 0.06 |
| V3-010 | What was your role at UT Austin after returning from MIT? | career_history_fact | 0.09 |
| V3-054 | I need to make a random decision between options | tool_dice | 0.09 |
| V3-003 | Did you play any sports in college? | education_fact | 0.10 |
| V3-055 | Remind me of your mantra and send it to me | tool_encouragement | 0.11 |
| V3-045 | What vector databases have you used? | technical_overview | 0.11 |
| V3-043 | What LLM frameworks do you work with? | technical_overview | 0.12 |
| V3-048 | How do you collaborate with engineering teams? | working_style | 0.13 |
| V3-058 | How many twins are there and do you get jealous? | playful_persona | 0.13 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 0.13 |
| V3-021 | What's your take on documentation? | working_style | 0.14 |
| V3-051 | I need some encouragement today | tool_encouragement | 0.14 |

Highest response overlap

Pairs where the two models converged. Bounded factual and technical-stack questions dominate.

| ID | Question | Intent | Jaccard |
|---|---|---|---|
| V3-035 | What did you build at Metric5? | career_history_fact | 0.38 |
| V3-030 | What technologies does the Fitness Dashboard use? | tech_stack | 0.36 |
| V3-011 | What certifications do you have? | credential_fact | 0.34 |
| V3-013 | How was this digital twin built | product_architecture | 0.34 |
| V3-005 | What was your dissertation about? | research_explanation | 0.33 |
| V3-008 | What shifted your focus to neuroscience? | career_history_fact | 0.31 |
| V3-032 | What evaluation features does Poolula have? | evaluation_capability | 0.29 |
| V3-036 | Where can I find documentation for your projects? | resource_location | 0.29 |
| V3-015 | Tell me about your knowledge graph work | knowledge_graph_overview | 0.29 |
| V3-046 | What cloud platforms do you have experience with? | technical_overview | 0.28 |
| V3-006 | Tell me about your dogs | personal_background | 0.28 |
| V3-033 | What was your notable publication from MIT? | publication_fact | 0.28 |

Higher must-cover coverage: GPT

Questions where GPT's lexical grounding score exceeded Gemini's most.

| ID | Question | Intent | GPT cover | Gemini cover | Delta |
|---|---|---|---|---|---|
| V3-025 | What matters most to you in a role? | career_next_step | 0.50 | 0.20 | +0.30 |
| V3-039 | What visualization tools do you use in your projects? | technical_overview | 0.60 | 0.30 | +0.30 |
| V3-002 | What honors program were you part of at UT Austin? | education_fact | 1.00 | 0.75 | +0.25 |
| V3-020 | What excites you about AI/ML work? | motivation | 0.88 | 0.62 | +0.25 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 0.25 | 0.00 | +0.25 |
| V3-028 | Tell me about your Fitness Dashboard project | project_walkthrough | 0.78 | 0.56 | +0.22 |
| V3-038 | Tell me about your beehive tracker's metadata analysis | project_detail | 0.70 | 0.50 | +0.20 |
| V3-041 | What's your approach to evaluation in ML systems? | methodology | 0.36 | 0.18 | +0.18 |
| V3-044 | What is your approach to RAG systems? | technical_overview | 0.91 | 0.73 | +0.18 |
| V3-019 | How do you approach problem-solving? | methodology | 0.67 | 0.50 | +0.17 |

Higher must-cover coverage: Gemini

Questions where Gemini's lexical grounding score exceeded GPT's most.

| ID | Question | Intent | GPT cover | Gemini cover | Delta |
|---|---|---|---|---|---|
| V3-004 | Where did you do your PhD? | education_fact | 0.67 | 1.00 | -0.33 |
| V3-024 | How do you balance technical depth with accessibility? | communication_bridge | 0.25 | 0.50 | -0.25 |
| V3-018 | What's your communication style? | working_style | 0.22 | 0.44 | -0.22 |
| V3-054 | I need to make a random decision between options | tool_dice | 0.00 | 0.20 | -0.20 |
| V3-055 | Remind me of your mantra and send it to me | tool_encouragement | 0.29 | 0.43 | -0.14 |
| V3-056 | What is the thing you like to do the most for fun and work | personal_professional_bridge | 0.38 | 0.50 | -0.12 |
| V3-012 | Walk me through a project | project_walkthrough | 0.30 | 0.40 | -0.10 |
| V3-016 | How would you describe yourself? | persona_self_description | 0.40 | 0.50 | -0.10 |
| V3-008 | What shifted your focus to neuroscience? | career_history_fact | 0.73 | 0.82 | -0.09 |
| V3-031 | What is Poolula Platform? | product_overview | 0.75 | 0.83 | -0.08 |

Project and concept mention counts

How often each model named one of the portfolio projects or concepts across the full 58-question run.

| Project or concept term | GPT | Gemini | Delta |
|---|---|---|---|
| Knowledge graph | 22 | 15 | +7 |
| Resume Explorer | 21 | 8 | +13 |
| RAG / retrieval | 15 | 21 | -6 |
| Digital Twin | 14 | 15 | -1 |
| Evaluation | 14 | 12 | +2 |
| Concept Cartographer | 9 | 4 | +5 |
| Beehive Tracker | 8 | 5 | +3 |
| Fitness Dashboard | 4 | 7 | -3 |
| Weaving Memories | 4 | 4 | 0 |
| ChronoScope | 1 | 0 | +1 |
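These counts can be reproduced with a case-insensitive scan over each run's responses. The term list below is an excerpt of the fixed list, shown for illustration:

```python
import re

# Excerpt of the fixed project/concept term list.
PROJECT_TERMS = ["Knowledge graph", "Resume Explorer", "Digital Twin"]

def term_counts(responses: list[str], terms: list[str]) -> dict[str, int]:
    """Total case-insensitive mentions of each term across a run's responses."""
    counts = {t: 0 for t in terms}
    for response in responses:
        for t in terms:
            counts[t] += len(re.findall(re.escape(t), response, flags=re.IGNORECASE))
    return counts
```

A substring scan like this counts every occurrence, so a response that names Resume Explorer three times contributes three mentions, which is why these totals can exceed the 58-question count.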

Sample responses, lowest overlap

Paired snippets where the two models diverged the most. Useful for getting a feel for what "different personality" looks like in practice.

| ID | Question | Jaccard | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|
| V3-053 | Roll a die for me | 0.06 | Rolling the die for you... | Happy to! Let's see what you get. I'll roll the die for you now. Tool Call: dice_roll(). Tool Output: 5. You rolled a 5! What's next? |
| V3-055 | Remind me of your mantra and send it to me | 0.11 | Of course! My mantra is: "I can, I will, and I shall!" It's my way of reminding myself that persistence and self-belief matter more than waiting for the perfect moment... | My mantra is, "I can, I will, and I shall!" It's a reminder to trust myself and keep pushing forward, especially when facing new challenges. I'd be happy to send it to you. Could I get your name... |
| V3-045 | What vector databases have you used? | 0.11 | The main vector databases I've used in my work are: ChromaDB (my go-to for RAG prototypes and portfolio projects like this Digital Twin)... | I've worked with a few different vector databases as part of my current technical stack. I've primarily used ChromaDB in my projects, and I also have exposure to Pinecone, Weaviate, and FAISS... |
| V3-048 | How do you collaborate with engineering teams? | 0.13 | When I collaborate with engineering teams, I focus on being a bridge between business goals and technical implementation. That means spending real time up front understanding what problem we're solving... | That's a great question, and it's something I think about a lot! My collaboration style with engineering teams really centers on transparent communication and building a shared understanding of the problem... |

Limitations and next steps

What this evaluation can and cannot claim

This is a model behavior comparison built from text-level diagnostics. It is not a human-scored quality bakeoff. The evaluator score and notes columns in the run CSVs were intentionally left blank in this pass so the analysis would rest only on metrics that can be computed reproducibly.

The must-cover lexical coverage measure is a smoke test for grounding, not a correctness score. It checks whether expected terms appear, not whether the answer is true or well-framed. Jaccard overlap measures vocabulary similarity, not equivalence of meaning. Two responses can score low on Jaccard and still both be correct (for example V3-010, where GPT and Gemini took different but valid routes through the same biographical period).

Longer is not the same as better. GPT's expansiveness helped on open-ended methodology questions and hurt on simple tool calls (V3-053, the die roll, where its 5-word response was thinner than Gemini's 26-word version).

What comes next

Three concrete next steps for the evaluation infrastructure:

  1. Human scoring on the 15-question subset. Apply the six-dimension rubric to both runs and report the scores. This is where qualitative quality lives.
  2. Response-mode rules by question type. Closed factual questions should stay short. Project walkthroughs should carry a narrative arc. The current prompt does not differentiate.
  3. Regression sweeps after knowledge-base edits. Use Gemini for these because the cost is roughly a third and the hallucination detection signal is comparable.

Practical recommendation

For the public-facing Digital Twin at twin.barbhs.com, the default is GPT-4.1. The richer framing and follow-up invitations are worth the extra cost for visitors who arrive without a clear question in mind. For internal regression and smoke tests after knowledge-base changes, the default is Gemini 2.5 Flash. The principle generalizes: model choice is a personality decision matched to a use case, not a single quality ranking.

Run date: May 11, 2026. Evaluation set: 58 questions across 29 intent labels. Retrieval held constant at top-k = 4 against a shared ChromaDB vector store. System prompt and temperature (0.6) held constant. Token counts and costs sourced from provider responses. Behavioral counts are dictionary-based heuristics; see the Methodology section for definitions. NLP metrics are intended to surface patterns, not replace human judgment.