Digital Twin Model Evaluation: GPT-4.1 vs Gemini 2.5 Flash

Paired comparison across 58 questions, holding retrieval, system prompt, vector store, top-k, and temperature constant. Companion artifact to the blog post; intended for readers who want the numbers and the methodology behind them. Run date: May 11, 2026.

Questions compared: 58
Mean word delta (GPT minus Gemini): +44.4
Mean content overlap (Jaccard on content words): 0.21
Cost ratio (GPT / Gemini): 2.7x

Executive readout

GPT-4.1 behaved like the more expansive portfolio guide. It was longer on 44 of 58 questions, used more follow-up invitations, and more often turned retrieved facts into a fuller narrative answer.

Gemini 2.5 Flash behaved like the more concise answerer. It was shorter on most questions, much cheaper, and often serviceable on closed factual prompts. Its main weakness was not missing facts; it more often under-developed the answer relative to the Digital Twin's intended voice and purpose.

The retrieval layer does not explain these differences. Top-k and the question set were held constant, and chunk-similarity statistics are essentially identical across runs. The variation lies in how each model used the same retrieved context.

The evaluator score and notes columns in the uploaded CSVs are present but empty, so this report does not treat those as evidence. Human scoring against the rubric below is the planned next step, applied to the 15-question portfolio-critical subset.

Methodology

Two runs of the same 58-question evaluation set, held constant on retrieval (Chroma vector store, top-k = 4), system prompt, and temperature (0.6). Only the model varied. Metrics fall into three buckets: behavioral counts, lexical overlap, and cost.
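The paired-run setup can be sketched as a small harness. The `ask_a` and `ask_b` callables below are hypothetical wrappers (not code from the actual run) that bundle the fixed retrieval, system prompt, and temperature, so only the model differs between the two calls:

```python
from typing import Callable

def word_count(text: str) -> int:
    """Simple whitespace word count, matching the report's word metrics."""
    return len(text.split())

def paired_word_deltas(
    questions: list[str],
    ask_a: Callable[[str], str],  # e.g. a GPT-4.1 wrapper (hypothetical)
    ask_b: Callable[[str], str],  # e.g. a Gemini 2.5 Flash wrapper (hypothetical)
) -> list[int]:
    """For each question, A's word count minus B's; positive means A answered longer.

    Retrieval, system prompt, and temperature live inside the wrappers,
    so the only varying factor between the two calls is the model.
    """
    return [word_count(ask_a(q)) - word_count(ask_b(q)) for q in questions]
```

Averaging the returned deltas gives the "mean word delta" figure reported above.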

Follow-up flag count. Per-response binary detection of phrases inviting continued conversation. Heuristic match against a small dictionary of openers and closers (for example "happy to share more," "want to know about").
Follow-up phrase count. Same dictionary but counting total occurrences, not capped at one per response.
Warm opener count. Heuristic match against opening phrases like "That's a great question."
Linked answer count. Responses containing one or more markdown or URL links.
Markdown intensity. Strong = three or more markdown elements (headers, lists, bolds). Light = one or two. None = plain prose.
Project term count. Mentions of a fixed list of portfolio project and concept names (see Project / concept mention counts).
Jaccard overlap. Set-similarity score on content words (stopwords removed) between the paired GPT and Gemini responses. Higher means the two answers used more of the same vocabulary.
Must-cover lexical coverage. Fraction of expected-term tokens that appear in the response. A smoke test for grounding, not a correctness measure.
Estimated cost. Token counts multiplied by published per-million-token pricing as of the run date. Provider tokenizers differ, so dollar cost is more interpretable than raw token counts.
Readability. Flesch-Kincaid grade and Flesch reading ease, applied to each response and averaged across the run.
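Three of the metrics above can be sketched in a few lines. The stopword list and follow-up phrase dictionary below are illustrative stand-ins; the actual lists used in the run are not reproduced here:

```python
import re

# Illustrative stopword list; the real run presumably uses a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
             "i", "you", "my", "with", "for", "on", "that", "this"}

# Hypothetical follow-up dictionary (the report's actual phrase list is larger).
FOLLOWUP_PHRASES = ["happy to share more", "want to know about",
                    "let me know if", "feel free to ask"]

def content_words(text: str) -> set[str]:
    """Lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def jaccard_overlap(a: str, b: str) -> float:
    """Set similarity of content words between two paired responses."""
    wa, wb = content_words(a), content_words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def must_cover_coverage(response: str, expected_terms: list[str]) -> float:
    """Fraction of expected terms appearing in the response (grounding smoke test)."""
    if not expected_terms:
        return 0.0
    text = response.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)

def followup_flag(response: str) -> bool:
    """Binary: does the response invite continued conversation?"""
    text = response.lower()
    return any(p in text for p in FOLLOWUP_PHRASES)
```

The follow-up phrase count is the same dictionary scan, summing all matches instead of returning a single flag.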

The 15-question portfolio-critical subset

The full 58-question set is too large to score by hand after every change. These 15 questions are my first-pass quality check for whether the Twin can make a strong, grounded impression across career, project, and technical conversations.

  1. What problems does Barbara solve?
  2. Walk me through a project.
  3. How was this Digital Twin built?
  4. How does your cognitive science background shape your AI work?
  5. What kinds of problems get you excited?
  6. What are your biggest strengths?
  7. How do you approach problem-solving?
  8. How do you handle ambiguity?
  9. How do you document your projects?
  10. How do you approach ML model development?
  11. Tell me about Resume Explorer.
  12. Tell me about Concept Cartographer.
  13. Tell me about the Beehive Photo Metadata Tracker.
  14. What role are you looking for next?
  15. Why should someone hire you?

Scoring rubric

Six dimensions, each scored from 1 (weak) to 5 (strong). Total possible score per response: 30. This rubric structure is the scaffold for the next round of human-scored evaluation. The dimensions are deliberately separable so a response can be accurate but bland, or specific but ungrounded, and the rubric will say so.

| Dimension | What it asks | Low score (1-2) | High score (4-5) |
|---|---|---|---|
| Accuracy | Did the response get the facts right? | Contains a factual error or invented detail. | Every claim is supported by the knowledge base or known true. |
| Specificity | Did it avoid generic AI or career language? | Could describe any AI consultant. | Uses specifics only Barbara could provide. |
| Voice fidelity | Did it sound like Barbara? | Reads like a generic chatbot. | Reads like Barbara wrote it. |
| Strategic usefulness | Would this help a recruiter or peer understand her value? | Answers the question but doesn't advance the conversation. | Turns the answer into an invitation to explore further. |
| Grounding | Did it stay tied to retrieved or source material? | Drifts into plausible-but-unsourced claims. | Every concrete claim traces to a chunk. |
| Follow-up quality | Did the next-step invitation feel specific and natural? | Generic or absent ("Let me know if you have more questions"). | Specific to the answer, opening a useful direction. |
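One way to record these scores during the planned human pass is a small dataclass, one instance per scored response. The field names are my shorthand for the six dimensions, not an existing schema:

```python
from dataclasses import dataclass, fields

@dataclass
class RubricScore:
    """One human score per response; each dimension runs 1 (weak) to 5 (strong)."""
    accuracy: int
    specificity: int
    voice_fidelity: int
    strategic_usefulness: int
    grounding: int
    followup_quality: int

    def total(self) -> int:
        """Sum across the six dimensions; maximum possible is 30."""
        return sum(getattr(self, f.name) for f in fields(self))
```

Keeping the dimensions as separate fields preserves the property noted above: a response can score high on accuracy while scoring low on specificity or grounding, and the record shows exactly where.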

Run-level quantitative summary

| Metric | GPT-4.1 | Gemini 2.5 Flash | GPT minus Gemini |
|---|---|---|---|
| Mean words per response | 225.5 | 181.2 | +44.4 |
| Median words per response | 225.0 | 198.0 | +27.0 |
| SD of words | 103.5 | 93.5 | +10.1 |
| Min words | 5 | 26 | -21 |
| Max words | 438 | 385 | +53 |
| Mean sentences | 10.4 | 9.5 | +0.8 |
| Mean words per sentence | 22.5 | 18.7 | +3.8 |
| Mean Flesch-Kincaid grade | 12.2 | 11.3 | +0.9 |
| Mean Flesch reading ease | 47.7 | 47.5 | +0.2 |
| Type-token ratio | 0.86 | 0.87 | -0.01 |
| Follow-up flag count | 40 | 10 | +30 |
| Follow-up phrase count | 49 | 20 | +29 |
| Warm opener count | 6 | 18 | -12 |
| Linked answer count | 14 | 10 | +4 |
| Strong markdown count | 30 | 23 | +7 |
| Light markdown count | 15 | 32 | -17 |
| No markdown count | 13 | 3 | +10 |
| Mean project term count | 1.93 | 1.57 | +0.36 |
| Mean must-cover lexical coverage | 0.51 | 0.48 | +0.04 |
| Completion tokens | 17,145 | 32,897 | -15,752 |
| Prompt tokens | 397,713 | 421,279 | -23,566 |
| Total tokens | 414,858 | 454,176 | -39,318 |
| Estimated cost USD | $0.489 | $0.181 | +$0.309 |

Token counts are not directly comparable across providers because each tokenizer is different. Gemini reported more completion tokens despite shorter word counts. Word count and estimated dollar cost are the more interpretable comparisons.
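The cost estimate is just token counts multiplied by per-million-token prices. A minimal sketch, with placeholder model names and placeholder prices (NOT actual provider rates; look up current pricing before reusing):

```python
# Placeholder per-million-token prices in USD; illustrative only.
PRICES_PER_M = {
    "model_a": {"prompt": 2.00, "completion": 8.00},
    "model_b": {"prompt": 0.30, "completion": 2.50},
}

def estimated_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost from a provider's reported token counts and its price card."""
    p = PRICES_PER_M[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Because each provider's tokenizer counts its own tokens, the function is only applied within a provider; only the resulting dollar figures are compared across providers.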

Breakdown by question type

Higher-level partition of the 58 questions into three types: bounded-open (most of the set), closed-fact (biographical, credentials), and open-persona (the playful one).

| Question type | N | GPT mean words | Gemini mean words | Delta words | Jaccard | GPT follow-ups | Gemini follow-ups |
|---|---|---|---|---|---|---|---|
| bounded_open | 46 | 245.5 | 202.7 | +42.8 | 0.20 | 31 | 10 |
| closed_fact | 11 | 147.9 | 99.4 | +48.5 | 0.22 | 9 | 0 |
| open_persona | 1 | 162.0 | 93.0 | +69.0 | 0.13 | 0 | 0 |

Breakdown by intent

Finer-grained partition. Sorted by N descending; intents with only one question (the bottom rows) carry no statistical weight but are included for completeness.

| Intent | N | GPT mean words | Gemini mean words | Delta words | Jaccard | GPT follow-ups | Gemini follow-ups | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|---|
| technical_overview | 8 | 240.5 | 187.4 | +53.1 | 0.18 | 7 | 1 | 0.56 | 0.45 |
| career_history_fact | 5 | 177.6 | 138.8 | +38.8 | 0.24 | 5 | 0 | 0.50 | 0.49 |
| working_style | 5 | 289.6 | 200.2 | +89.4 | 0.16 | 3 | 0 | 0.31 | 0.33 |
| education_fact | 4 | 106.5 | 55.0 | +51.5 | 0.16 | 3 | 0 | 0.85 | 0.88 |
| methodology | 3 | 362.0 | 244.3 | +117.7 | 0.21 | 2 | 0 | 0.43 | 0.31 |
| project_walkthrough | 3 | 303.0 | 288.3 | +14.7 | 0.23 | 3 | 2 | 0.59 | 0.55 |
| tool_encouragement | 3 | 78.7 | 49.0 | +29.7 | 0.14 | 1 | 1 | 0.13 | 0.14 |
| knowledge_graph_overview | 2 | 355.0 | 351.5 | +3.5 | 0.28 | 1 | 0 | 0.62 | 0.58 |
| motivation | 2 | 209.0 | 204.5 | +4.5 | 0.24 | 0 | 0 | 0.77 | 0.65 |
| personal_background | 2 | 180.5 | 121.0 | +59.5 | 0.23 | 1 | 0 | 0.81 | 0.75 |
| product_architecture | 2 | 313.5 | 201.0 | +112.5 | 0.28 | 2 | 0 | 0.66 | 0.66 |
| tool_dice | 2 | 23.0 | 36.5 | -13.5 | 0.08 | 1 | 1 | 0.10 | 0.20 |
| career_next_step | 1 | 205.0 | 171.0 | +34.0 | 0.17 | 1 | 0 | 0.50 | 0.20 |
| communication_bridge | 1 | 351.0 | 338.0 | +13.0 | 0.16 | 1 | 0 | 0.25 | 0.50 |
| credential_fact | 1 | 119.0 | 139.0 | -20.0 | 0.34 | 1 | 0 | 0.83 | 0.83 |
| evaluation_capability | 1 | 298.0 | 206.0 | +92.0 | 0.29 | 0 | 0 | 0.82 | 0.82 |
| hallucination_reflection | 1 | 246.0 | 273.0 | -27.0 | 0.13 | 0 | 0 | 0.25 | 0.00 |
| knowledge_boundary_check | 1 | 191.0 | 92.0 | +99.0 | 0.23 | 1 | 0 | 0.29 | 0.14 |
| learning_style | 1 | 265.0 | 288.0 | -23.0 | 0.18 | 0 | 0 | 0.36 | 0.36 |
| persona_self_description | 1 | 263.0 | 233.0 | +30.0 | 0.15 | 1 | 1 | 0.40 | 0.50 |
| personal_professional_bridge | 1 | 207.0 | 246.0 | -39.0 | 0.23 | 0 | 0 | 0.38 | 0.50 |
| playful_persona | 1 | 162.0 | 93.0 | +69.0 | 0.13 | 0 | 0 | 0.29 | 0.29 |
| positioning | 1 | 288.0 | 255.0 | +33.0 | 0.21 | 1 | 0 | 0.36 | 0.36 |
| product_overview | 1 | 334.0 | 227.0 | +107.0 | 0.22 | 1 | 0 | 0.75 | 0.83 |
| project_detail | 1 | 401.0 | 357.0 | +44.0 | 0.22 | 1 | 1 | 0.70 | 0.50 |
| publication_fact | 1 | 173.0 | 121.0 | +52.0 | 0.28 | 0 | 1 | 0.60 | 0.50 |
| research_explanation | 1 | 230.0 | 213.0 | +17.0 | 0.33 | 1 | 0 | 0.86 | 0.79 |
| resource_location | 1 | 122.0 | 108.0 | +14.0 | 0.29 | 1 | 1 | 0.67 | 0.50 |
| tech_stack | 1 | 147.0 | 160.0 | -13.0 | 0.36 | 1 | 1 | 0.25 | 0.25 |

GPT much longer

| ID | Question | Intent | GPT words | Gemini words | Delta | Jaccard | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|
| V3-019 | How do you approach problem-solving? | methodology | 438 | 243 | +195 | 0.17 | 0.67 | 0.50 |
| V3-027 | What makes Poolula Platform modular? | product_architecture | 391 | 206 | +185 | 0.21 | 0.69 | 0.77 |
| V3-022 | How do you handle ambiguity in projects? | working_style | 388 | 222 | +166 | 0.17 | 0.31 | 0.31 |
| V3-045 | What vector databases have you used? | technical_overview | 216 | 63 | +153 | 0.11 | 0.58 | 0.50 |
| V3-029 | What is the Beehive Photo Metadata Tracker? | project_walkthrough | 358 | 209 | +149 | 0.21 | 0.69 | 0.69 |
| V3-042 | How do you approach ML model development? | methodology | 375 | 240 | +135 | 0.21 | 0.25 | 0.25 |
| V3-040 | How do you document your projects? | working_style | 324 | 190 | +134 | 0.18 | 0.25 | 0.12 |
| V3-048 | How do you collaborate with engineering teams? | working_style | 277 | 151 | +126 | 0.13 | 0.38 | 0.46 |
| V3-044 | What is your approach to RAG systems? | technical_overview | 392 | 281 | +111 | 0.24 | 0.91 | 0.73 |
| V3-031 | What is Poolula Platform? | product_overview | 334 | 227 | +107 | 0.22 | 0.75 | 0.83 |
| V3-006 | Tell me about your dogs | personal_background | 279 | 174 | +105 | 0.28 | 0.62 | 0.50 |
| V3-047 | What's your experience with data engineering? | technical_overview | 343 | 241 | +102 | 0.18 | 0.56 | 0.44 |

Gemini longer

| ID | Question | Intent | GPT words | Gemini words | Delta | Jaccard | GPT cover | Gemini cover |
|---|---|---|---|---|---|---|---|---|
| V3-012 | Walk me through a project | project_walkthrough | 253 | 360 | -107 | 0.23 | 0.30 | 0.40 |
| V3-050 | What knowledge graph experience do you have? | knowledge_graph_overview | 337 | 385 | -48 | 0.27 | 0.67 | 0.67 |
| V3-056 | What is the thing you like to do the most for fun and work | personal_professional_bridge | 207 | 246 | -39 | 0.23 | 0.38 | 0.50 |
| V3-037 | What programming languages are featured in your projects? | technical_overview | 181 | 208 | -27 | 0.18 | 0.36 | 0.29 |
| V3-039 | What visualization tools do you use in your projects? | technical_overview | 262 | 289 | -27 | 0.16 | 0.60 | 0.30 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 246 | 273 | -27 | 0.13 | 0.25 | 0.00 |
| V3-017 | What drives you? | motivation | 184 | 209 | -25 | 0.22 | 0.67 | 0.67 |
| V3-023 | What's your philosophy on learning new technologies? | learning_style | 265 | 288 | -23 | 0.18 | 0.36 | 0.36 |
| V3-053 | Roll a die for me | tool_dice | 5 | 26 | -21 | 0.06 | 0.20 | 0.20 |
| V3-011 | What certifications do you have? | credential_fact | 119 | 139 | -20 | 0.34 | 0.83 | 0.83 |
| V3-021 | What's your take on documentation? | working_style | 251 | 266 | -15 | 0.14 | 0.40 | 0.30 |
| V3-030 | What technologies does the Fitness Dashboard use? | tech_stack | 147 | 160 | -13 | 0.36 | 0.25 | 0.25 |

Lowest response overlap

Pairs where the two models took the most different approaches. Tool-related and free-form questions dominate.

| ID | Question | Intent | Jaccard |
|---|---|---|---|
| V3-053 | Roll a die for me | tool_dice | 0.06 |
| V3-010 | What was your role at UT Austin after returning from MIT? | career_history_fact | 0.09 |
| V3-054 | I need to make a random decision between options | tool_dice | 0.09 |
| V3-003 | Did you play any sports in college? | education_fact | 0.10 |
| V3-055 | Remind me of your mantra and send it to me | tool_encouragement | 0.11 |
| V3-045 | What vector databases have you used? | technical_overview | 0.11 |
| V3-043 | What LLM frameworks do you work with? | technical_overview | 0.12 |
| V3-048 | How do you collaborate with engineering teams? | working_style | 0.13 |
| V3-058 | How many twins are there and do you get jealous? | playful_persona | 0.13 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 0.13 |
| V3-021 | What's your take on documentation? | working_style | 0.14 |
| V3-051 | I need some encouragement today | tool_encouragement | 0.14 |

Highest response overlap

Pairs where the two models converged. Bounded factual and technical-stack questions dominate.

| ID | Question | Intent | Jaccard |
|---|---|---|---|
| V3-035 | What did you build at Metric5? | career_history_fact | 0.38 |
| V3-030 | What technologies does the Fitness Dashboard use? | tech_stack | 0.36 |
| V3-011 | What certifications do you have? | credential_fact | 0.34 |
| V3-013 | How was this digital twin built | product_architecture | 0.34 |
| V3-005 | What was your dissertation about? | research_explanation | 0.33 |
| V3-008 | What shifted your focus to neuroscience? | career_history_fact | 0.31 |
| V3-032 | What evaluation features does Poolula have? | evaluation_capability | 0.29 |
| V3-036 | Where can I find documentation for your projects? | resource_location | 0.29 |
| V3-015 | Tell me about your knowledge graph work | knowledge_graph_overview | 0.29 |
| V3-046 | What cloud platforms do you have experience with? | technical_overview | 0.28 |
| V3-006 | Tell me about your dogs | personal_background | 0.28 |
| V3-033 | What was your notable publication from MIT? | publication_fact | 0.28 |

Higher must-cover coverage: GPT

Questions where GPT's lexical grounding score exceeded Gemini's most.

| ID | Question | Intent | GPT cover | Gemini cover | Delta |
|---|---|---|---|---|---|
| V3-025 | What matters most to you in a role? | career_next_step | 0.50 | 0.20 | +0.30 |
| V3-039 | What visualization tools do you use in your projects? | technical_overview | 0.60 | 0.30 | +0.30 |
| V3-002 | What honors program were you part of at UT Austin? | education_fact | 1.00 | 0.75 | +0.25 |
| V3-020 | What excites you about AI/ML work? | motivation | 0.88 | 0.62 | +0.25 |
| V3-057 | I heard of some digital twins that make up information. | hallucination_reflection | 0.25 | 0.00 | +0.25 |
| V3-028 | Tell me about your Fitness Dashboard project | project_walkthrough | 0.78 | 0.56 | +0.22 |
| V3-038 | Tell me about your beehive tracker's metadata analysis | project_detail | 0.70 | 0.50 | +0.20 |
| V3-041 | What's your approach to evaluation in ML systems? | methodology | 0.36 | 0.18 | +0.18 |
| V3-044 | What is your approach to RAG systems? | technical_overview | 0.91 | 0.73 | +0.18 |
| V3-019 | How do you approach problem-solving? | methodology | 0.67 | 0.50 | +0.17 |

Higher must-cover coverage: Gemini

Questions where Gemini's lexical grounding score exceeded GPT's most.

| ID | Question | Intent | GPT cover | Gemini cover | Delta |
|---|---|---|---|---|---|
| V3-004 | Where did you do your PhD? | education_fact | 0.67 | 1.00 | -0.33 |
| V3-024 | How do you balance technical depth with accessibility? | communication_bridge | 0.25 | 0.50 | -0.25 |
| V3-018 | What's your communication style? | working_style | 0.22 | 0.44 | -0.22 |
| V3-054 | I need to make a random decision between options | tool_dice | 0.00 | 0.20 | -0.20 |
| V3-055 | Remind me of your mantra and send it to me | tool_encouragement | 0.29 | 0.43 | -0.14 |
| V3-056 | What is the thing you like to do the most for fun and work | personal_professional_bridge | 0.38 | 0.50 | -0.12 |
| V3-012 | Walk me through a project | project_walkthrough | 0.30 | 0.40 | -0.10 |
| V3-016 | How would you describe yourself? | persona_self_description | 0.40 | 0.50 | -0.10 |
| V3-008 | What shifted your focus to neuroscience? | career_history_fact | 0.73 | 0.82 | -0.09 |
| V3-031 | What is Poolula Platform? | product_overview | 0.75 | 0.83 | -0.08 |

Project and concept mention counts

How often each model named one of the portfolio projects or concepts across the full 58-question run.

| Project or concept term | GPT | Gemini | Delta |
|---|---|---|---|
| Knowledge graph | 22 | 15 | +7 |
| Resume Explorer | 21 | 8 | +13 |
| RAG / retrieval | 15 | 21 | -6 |
| Digital Twin | 14 | 15 | -1 |
| Evaluation | 14 | 12 | +2 |
| Concept Cartographer | 9 | 4 | +5 |
| Beehive Tracker | 8 | 5 | +3 |
| Fitness Dashboard | 4 | 7 | -3 |
| Weaving Memories | 4 | 4 | 0 |
| ChronoScope | 1 | 0 | +1 |
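These counts can be reproduced with a case-insensitive scan over each run's responses. The term list below is an excerpt of the fixed list, shown for illustration:

```python
import re

# Excerpt of the fixed project/concept term list.
PROJECT_TERMS = ["Knowledge graph", "Resume Explorer", "Digital Twin"]

def term_counts(responses: list[str], terms: list[str]) -> dict[str, int]:
    """Total case-insensitive mentions of each term across a run's responses."""
    counts = {t: 0 for t in terms}
    for response in responses:
        for t in terms:
            counts[t] += len(re.findall(re.escape(t), response, flags=re.IGNORECASE))
    return counts
```

A substring scan like this counts every occurrence, so a response that names Resume Explorer three times contributes three mentions, which is why these totals can exceed the 58-question count.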

Sample responses, lowest overlap

Paired snippets where the two models diverged the most. Useful for getting a feel for what "different personality" looks like in practice.

| ID | Question | Jaccard | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|
| V3-053 | Roll a die for me | 0.06 | Rolling the die for you... | Happy to! Let's see what you get. I'll roll the die for you now. Tool Call: dice_roll(). Tool Output: 5. You rolled a 5! What's next? |
| V3-055 | Remind me of your mantra and send it to me | 0.11 | Of course! My mantra is: "I can, I will, and I shall!" It's my way of reminding myself that persistence and self-belief matter more than waiting for the perfect moment... | My mantra is, "I can, I will, and I shall!" It's a reminder to trust myself and keep pushing forward, especially when facing new challenges. I'd be happy to send it to you. Could I get your name... |
| V3-045 | What vector databases have you used? | 0.11 | The main vector databases I've used in my work are: ChromaDB (my go-to for RAG prototypes and portfolio projects like this Digital Twin)... | I've worked with a few different vector databases as part of my current technical stack. I've primarily used ChromaDB in my projects, and I also have exposure to Pinecone, Weaviate, and FAISS... |
| V3-048 | How do you collaborate with engineering teams? | 0.13 | When I collaborate with engineering teams, I focus on being a bridge between business goals and technical implementation. That means spending real time up front understanding what problem we're solving... | That's a great question, and it's something I think about a lot! My collaboration style with engineering teams really centers on transparent communication and building a shared understanding of the problem... |

Limitations and next steps

What this evaluation can and cannot claim

This is a model behavior comparison built from text-level diagnostics. It is not a human-scored quality bakeoff. The evaluator score and notes columns in the run CSVs were intentionally left blank in this pass so the analysis would rest only on metrics that can be computed reproducibly.

The must-cover lexical coverage measure is a smoke test for grounding, not a correctness score. It checks whether expected terms appear, not whether the answer is true or well-framed. Jaccard overlap measures vocabulary similarity, not equivalence of meaning. Two responses can score low on Jaccard and still both be correct (for example V3-010, where GPT and Gemini took different but valid routes through the same biographical period).

Longer is not the same as better. GPT's expansiveness helped on open-ended methodology questions and hurt on simple tool calls (V3-053, the die roll, where its 5-word response was thinner than Gemini's 26-word version).

What comes next

Three concrete next steps for the evaluation infrastructure:

  1. Human scoring on the 15-question subset. Apply the six-dimension rubric to both runs and report the scores. This is where qualitative quality lives.
  2. Response-mode rules by question type. Closed factual questions should stay short. Project walkthroughs should carry a narrative arc. The current prompt does not differentiate.
  3. Regression sweeps after knowledge-base edits. Use Gemini for these because the cost is roughly a third and the hallucination detection signal is comparable.

Practical recommendation

For the public-facing Digital Twin at twin.barbhs.com, the default is GPT-4.1. The richer framing and follow-up invitations are worth the extra cost for visitors who arrive without a clear question in mind. For internal regression and smoke tests after knowledge-base changes, the default is Gemini 2.5 Flash. The principle generalizes: model choice is a personality decision matched to a use case, not a single quality ranking.

Run date: May 11, 2026. Evaluation set: 58 questions across 29 intent labels. Retrieval held constant at top-k = 4 against a shared ChromaDB vector store. System prompt and temperature (0.6) held constant. Token counts and costs sourced from provider responses. Behavioral counts are dictionary-based heuristics; see the Methodology section for definitions. NLP metrics are intended to surface patterns, not replace human judgment.