Question 1
Difficulty: medium
How do you define success when evaluating a large language model for a business use case?
Sample answer
I define success by tying the model to the actual business outcome, not just generic model quality. First, I clarify the task: is the model summarizing, classifying, answering questions, or generating content? Then I identify the most important dimensions, such as accuracy, relevance, faithfulness, tone, latency, and safety. From there, I work with stakeholders to turn those goals into measurable criteria and a representative evaluation set. For example, if the use case is customer support, I care a lot about resolution quality, refusal behavior on unsupported issues, and consistency across common edge cases. I also make sure there is a baseline so improvements are meaningful, not just subjective. In practice, I look for a balanced evaluation approach: quantitative metrics, human review, and error analysis. A model is successful when it performs well on the real user journeys that matter most and fails safely in the areas where it should not guess.
Question 2
Difficulty: hard
Walk me through how you would build an evaluation dataset for an LLM task.
Sample answer
I start by defining the task clearly and mapping the main user scenarios. The dataset should reflect the real distribution of inputs, including straightforward examples, edge cases, ambiguous prompts, and adversarial cases. I would gather source material from production logs, internal examples, synthetic generation, and subject-matter experts, depending on the use case and privacy constraints. Then I’d create a labeling scheme that is detailed enough to reduce ambiguity but simple enough for reviewers to apply consistently. I usually include rubrics, gold examples, and instructions for edge cases so the dataset is usable by both human evaluators and automated checks. I also pay attention to class balance and coverage; otherwise, the evaluation can look good while missing key failure modes. Finally, I’d validate the dataset by running a pilot, reviewing disagreement patterns, and refining the rubric before scaling. That way, the dataset becomes a reliable benchmark rather than a one-time test file.
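As a rough illustration of the coverage check mentioned in this answer, a minimal Python sketch might look like the following. The category names, the `coverage_gaps` helper, and the `min_count` threshold are all assumptions made up for the example, not part of any standard tool.

```python
from collections import Counter

def coverage_gaps(examples, required_categories, min_count=2):
    """Return required categories with fewer than min_count examples, mapped to their counts."""
    counts = Counter(ex["category"] for ex in examples)
    return {
        cat: counts[cat]
        for cat in required_categories
        if counts[cat] < min_count
    }

# Hypothetical evaluation examples, each tagged with a scenario category.
examples = [
    {"id": 1, "category": "straightforward"},
    {"id": 2, "category": "straightforward"},
    {"id": 3, "category": "edge_case"},
    {"id": 4, "category": "ambiguous"},
    {"id": 5, "category": "ambiguous"},
]
required = ["straightforward", "edge_case", "ambiguous", "adversarial"]

gaps = coverage_gaps(examples, required)
print(gaps)  # categories that still need more examples before the pilot
```

A check like this only confirms that each scenario type is present in sufficient numbers; whether the examples inside each category are realistic still needs human review.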
Question 3
Difficulty: medium
What metrics would you use to evaluate an LLM that generates customer-facing responses?
Sample answer
I would combine task-level metrics with quality and safety checks. For customer-facing generation, I’d look at correctness, completeness, tone, and helpfulness first, because those directly affect user trust. If the model must stay grounded in provided context, I’d also evaluate faithfulness or groundedness to ensure it does not invent details. For high-stakes support workflows, I’d add refusal accuracy, policy adherence, and escalation behavior so the model knows when to hand off to a human. On the operational side, latency, response length, and cost per request matter because they affect deployment viability. I do not rely on one metric alone, since a model can score well on surface similarity while still giving a wrong or unsafe answer. I prefer a rubric with weighted dimensions and human review on a stratified sample. That gives a much clearer picture of actual user experience and helps identify whether the issue is factuality, style, or policy compliance.
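To make the weighted rubric idea concrete, here is a small Python sketch. The dimension names follow the answer above, but the specific weights, the 1 to 5 scale, and the `weighted_rubric_score` helper are assumptions chosen for illustration; real weights would come from stakeholders.

```python
def weighted_rubric_score(scores, weights):
    """Combine per-dimension rubric scores into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

# Hypothetical relative weights: correctness counts most, tone least.
weights = {
    "correctness": 4,
    "completeness": 2,
    "tone": 1,
    "helpfulness": 2,
    "groundedness": 1,
}

# One reviewer's scores for a single response, on an assumed 1-5 scale.
scores = {
    "correctness": 5,
    "completeness": 4,
    "tone": 3,
    "helpfulness": 4,
    "groundedness": 5,
}

score = weighted_rubric_score(scores, weights)
print(round(score, 2))
```

Keeping the per-dimension scores alongside the combined number matters, because the single score alone cannot say whether a weak response failed on factuality, style, or policy compliance.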
Question 4
Difficulty: medium
Tell me about a time you found a problem in an evaluation process and how you fixed it.
Sample answer
In one evaluation cycle, we initially thought a model was improving because the overall score was trending up. But when I broke the results down by prompt type, I found that the model was doing well on easy examples and getting worse on multi-step queries. The evaluation set had become too concentrated on common cases, so it was hiding the failures that mattered most. I proposed a rebalancing of the dataset and added a separate slice for hard reasoning and ambiguous prompts. I also tightened the rubric so reviewers could distinguish between partially correct answers and genuinely reliable ones. After that change, the score dropped at first, which was actually useful because it reflected reality more accurately. Once the team retrained and iterated, the model improved in the right areas. That experience reinforced for me that a good evaluation process has to be honest, stable, and representative, even when the results are uncomfortable.
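The slice breakdown described in this answer can be sketched in a few lines of Python. The data below is invented to reproduce the pattern (overall score up, multi-step slice down); the slice names and the `mean_by_slice` helper are assumptions for illustration only.

```python
from collections import defaultdict

def mean_by_slice(results):
    """Average the pass/fail outcomes separately for each prompt slice."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row["slice"]].append(row["passed"])
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}

def overall_mean(results):
    """Average pass rate across all rows, ignoring slices."""
    return sum(row["passed"] for row in results) / len(results)

def make_rows(easy, multi_step):
    """Build result rows from lists of 0/1 outcomes per slice."""
    return (
        [{"slice": "easy", "passed": p} for p in easy]
        + [{"slice": "multi_step", "passed": p} for p in multi_step]
    )

# Hypothetical results from two evaluation rounds on the same prompts.
round_a = make_rows(easy=[1, 0, 0, 1], multi_step=[1, 0])
round_b = make_rows(easy=[1, 1, 1, 1], multi_step=[0, 0])

print(overall_mean(round_a), overall_mean(round_b))    # aggregate trend
print(mean_by_slice(round_a), mean_by_slice(round_b))  # per-slice trend
```

In this toy data the aggregate pass rate rises from one round to the next while the multi-step slice falls, which is the situation a single headline number would hide.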
Question 5
Difficulty: medium
How do you handle disagreements between human annotators when evaluating LLM outputs?
Sample answer
I treat disagreement as a valuable signal, not just noise. First, I look at whether the disagreement comes from unclear instructions, a genuinely subjective judgment, or an edge case the rubric did not anticipate. If the instructions are vague, I revise the rubric and add better examples so reviewers can apply it consistently. If the task is inherently subjective, I make sure the evaluation framework reflects that by using calibrated guidelines and possibly a graded scale instead of a binary one. I also compare disagreement patterns across annotators to see if one person is consistently stricter or more lenient. In some cases, I would run an adjudication step with a senior reviewer to resolve high-impact cases and create a gold set for future calibration. The goal is not to force perfect agreement on everything, but to make sure we understand why people differ and whether the model output is genuinely ambiguous or the evaluation criteria need work.
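One way to put numbers on annotator disagreement is Cohen's kappa for a pair of reviewers, alongside each reviewer's pass rate to spot a consistently stricter rater. The sketch below is a plain-Python version under the assumption of two annotators and categorical labels; the label values and data are invented for the example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently at their own base rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def pass_rate(labels):
    """Share of items an annotator marked as 'pass'."""
    return labels.count("pass") / len(labels)

# Hypothetical labels from two reviewers on the same eight outputs.
annotator_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
annotator_b = ["pass", "fail", "fail", "pass", "fail", "fail", "fail", "fail"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa, pass_rate(annotator_a), pass_rate(annotator_b))
```

A moderate kappa combined with a clearly lower pass rate for one reviewer suggests a calibration gap rather than random noise, which points toward rubric examples or an adjudication pass rather than simply averaging the labels.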
Question 6
Difficulty: hard
How would you evaluate whether an LLM is hallucinating in a domain like healthcare or finance?
Sample answer
I would focus on groundedness, evidence use, and risk-aware judgment. In a domain like healthcare or finance, hallucination is not just about factual errors; it is also about overconfidence, missing caveats, and giving advice beyond the model’s scope. I’d build an evaluation set with prompts that require the model to use provided documents, follow policies, and recognize when information is insufficient. Then I’d score outputs for supportability: does each key claim trace back to the source, or is the model inventing details? I would also check refusal behavior and escalation quality, because a strong model should know when to say it cannot provide a definitive answer. Human review is especially important here, ideally with domain experts, because subtle mistakes can be high risk. I’d complement that with error analysis by failure type, such as fabrication, partial truth, outdated information, or unsafe recommendation. The main objective is not just accuracy, but controlled, reliable behavior under uncertainty.
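As a very rough first pass at the supportability check, claims can be triaged by how much of their wording appears in the source document. The sketch below is only a crude lexical-overlap heuristic for deciding which claims to send to an expert reviewer first; it is not an entailment check and will miss paraphrases and negations. The `unsupported_claims` helper, the threshold, and the example text are all assumptions.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_claims(claims, source, threshold=0.6):
    """Flag claims whose token overlap with the source falls below the threshold."""
    source_tokens = tokens(source)
    flagged = []
    for claim in claims:
        claim_tokens = tokens(claim)
        if not claim_tokens:
            continue  # nothing to check in an empty claim
        overlap = len(claim_tokens & source_tokens) / len(claim_tokens)
        if overlap < threshold:
            flagged.append(claim)
    return flagged

# Invented source passage and model claims, for illustration only.
source = "The savings account has a 2 percent annual rate and no monthly fee."
claims = [
    "The savings account has no monthly fee",
    "Withdrawals are limited to six per month",
]

flagged = unsupported_claims(claims, source)
print(flagged)  # claims to prioritize for expert review
```

In a high-risk domain this kind of filter is best treated as a way to order the review queue, with domain experts making the actual judgment on each flagged and unflagged claim.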
Question 7
Difficulty: medium
What is your approach to evaluating prompt changes or model iterations over time?
Sample answer
My approach is to make comparisons controlled and repeatable. I would start with a frozen evaluation set or a versioned benchmark that covers the core use cases and failure modes. Every prompt or model change should be tested against the same baseline so we can attribute differences accurately. I usually track both aggregate metrics and slice-level performance, because a small overall improvement can hide regressions in important subgroups. I also like to include a regression suite for known tricky cases, especially if the model has previously failed on them. Beyond scores, I review a sample of outputs side by side to understand qualitative differences in tone, verbosity, and reasoning quality. If the change improves one area but hurts another, I work with the team to decide whether that tradeoff is acceptable for the product goal. The key is to keep evaluation consistent enough to trust trends, while still updating it when the product or user behavior changes materially.
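The regression suite idea can be sketched as a comparison of pass/fail results for the same frozen cases under a baseline and a candidate. The case IDs and the `find_regressions` and `find_fixes` helpers below are hypothetical names used for illustration.

```python
def find_regressions(baseline, candidate):
    """Case IDs that passed under the baseline but fail under the candidate."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same frozen set of cases")
    return sorted(cid for cid, passed in baseline.items() if passed and not candidate[cid])

def find_fixes(baseline, candidate):
    """Case IDs that failed under the baseline but pass under the candidate."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same frozen set of cases")
    return sorted(cid for cid, passed in baseline.items() if not passed and candidate[cid])

# Hypothetical pass/fail results on a small frozen regression suite.
baseline = {"refund-01": True, "refund-02": False, "multi-step-07": True}
candidate = {"refund-01": True, "refund-02": True, "multi-step-07": False}

regressions = find_regressions(baseline, candidate)
fixes = find_fixes(baseline, candidate)
print("regressions:", regressions)
print("fixes:", fixes)
```

Here the pass count is unchanged between the two runs, yet one previously working case now fails, which is the kind of tradeoff that needs an explicit decision rather than a glance at the aggregate.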
Question 8
Difficulty: easy
How do you decide when an automated metric is enough and when you need human evaluation?
Sample answer
I use automated metrics when they are well aligned with the task and can reliably capture the behavior we care about. For example, exact match or structured output validation can be useful for extraction tasks, and some similarity metrics can be helpful as a signal in summarization. But for open-ended generation, automated metrics often miss the important parts, like whether the answer is actually useful, grounded, and safe. In those cases, I lean on human evaluation because people can judge nuance, context, and intent better than a metric can. I also think about the cost of error. If the use case is low risk, automated checks plus spot review may be enough. If the use case affects customers, compliance, or reputation, human evaluation should be part of the core process. In practice, I prefer a hybrid approach: automation for scale, human review for depth, and a clear rubric that connects both. That gives you efficiency without losing judgment.
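For the extraction case mentioned above, automated checks can be as simple as the following Python sketch. It assumes the model is asked to return a JSON object; the field names, expected types, and both helper functions are illustrative assumptions rather than a fixed schema.

```python
import json

def validate_structured_output(raw, required_fields):
    """True if raw is a JSON object containing every required field with the expected type."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(
        isinstance(parsed.get(field), expected_type)
        for field, expected_type in required_fields.items()
    )

def exact_match(prediction, reference):
    """Case-insensitive exact match after trimming surrounding whitespace."""
    return prediction.strip().lower() == reference.strip().lower()

# Hypothetical schema for an order-extraction task.
required = {"order_id": str, "amount": float}

print(validate_structured_output('{"order_id": "A123", "amount": 19.99}', required))
print(validate_structured_output('{"order_id": "A123"}', required))
print(validate_structured_output("not json", required))
print(exact_match(" A123 ", "a123"))
```

Checks like these scale well and are unambiguous, but they only say the output is well formed and matches a reference; they say nothing about whether an open-ended answer is useful, which is where human review comes in.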
Question 9
Difficulty: medium
Imagine a model scores well in offline evaluation but users complain it feels unhelpful. What would you do?
Sample answer
I would assume the evaluation and the user experience are measuring different things, and I’d investigate that gap quickly. First, I’d review the offline rubric and ask whether it rewards the right behavior. A model can be factually correct but still be vague, overly cautious, or poorly structured. Next, I’d compare real user queries with the benchmark set to see whether the evaluation data reflects actual usage. Often the issue is that the benchmark is cleaner than the real world, so it misses messy prompts, short follow-ups, or implicit intent. I’d also sample user conversations and look for patterns such as incomplete answers, bad tone, or failure to ask clarifying questions. If possible, I’d run a small human review focused specifically on usefulness rather than just correctness. From there, I would update the evaluation framework and propose prompt or model changes. The important thing is to treat user feedback as evidence, not anecdote.
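One quick way to test whether the benchmark is cleaner than real traffic is to compare simple prompt statistics between the two. The sketch below compares median length and the share of very short prompts; the five-word cutoff, the `length_profile` helper, and the sample prompts are assumptions for illustration, and length is only one of several signals worth comparing.

```python
from statistics import median

def length_profile(prompts, short_cutoff=5):
    """Median word count and share of prompts at or below the short cutoff."""
    lengths = [len(p.split()) for p in prompts]
    return {
        "median_words": median(lengths),
        "short_share": sum(n <= short_cutoff for n in lengths) / len(lengths),
    }

# Invented benchmark prompts: complete, well-formed requests.
benchmark = [
    "Please summarize the refund policy for annual subscription plans in detail",
    "Explain how to update the billing address on a business account",
]

# Invented production prompts: short follow-ups with implicit intent.
production = [
    "refund?",
    "and for annual plans",
    "still broken",
    "How do I update my billing address",
]

benchmark_profile = length_profile(benchmark)
production_profile = length_profile(production)
print(benchmark_profile)
print(production_profile)
```

A large gap in the share of short prompts is a hint that the benchmark under-represents terse follow-ups, and that sampled production queries (subject to privacy constraints) should be added to the evaluation set.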
Question 10
Difficulty: easy
Why do you want to work as an LLM Evaluation Analyst, and what makes you effective in this role?
Sample answer
I like this role because it sits at the intersection of analysis, product thinking, and model behavior. I’m motivated by problems where the obvious answer is not always the right one, and LLM evaluation is exactly that kind of work. A model can appear strong on a surface metric while still failing users in subtle ways, so the analyst has to look carefully at data, rubrics, and real usage patterns. What makes me effective is that I’m comfortable moving between quantitative analysis and qualitative judgment. I can design a structured evaluation, dig into disagreements, and translate findings into actions the team can actually use. I also care a lot about clarity, which matters when writing rubrics or explaining tradeoffs to stakeholders. I enjoy being the person who helps a team understand not just whether a model is good, but where it is strong, where it breaks, and what to do next. That combination is what draws me to the role.