LLM Engineer

Interview questions for LLM Engineer roles.

10 questions

Question 1

Difficulty: medium

How do you approach building an LLM-powered feature from vague product requirements to a production-ready system?

Sample answer

I start by turning the vague request into a concrete user problem and a measurable success metric. For example, if the ask is “make support responses smarter,” I’d first clarify whether the goal is faster resolution, better deflection, more consistent tone, or all three. Then I’d define the constraints: latency budget, privacy requirements, cost per request, and acceptable failure modes. From there, I’d build a thin prototype using the smallest model and simplest orchestration that can prove value quickly. I usually prefer to validate with real examples early, because LLM behavior can look great in demos but fail on edge cases. Once the prototype is working, I’d add evaluation scripts, logging, guardrails, and fallback paths. My goal is to get to a version that is reliable enough for users, not just impressive in a notebook. I like shipping in stages so the team can learn from actual usage instead of assuming the first design is correct.

Question 2

Difficulty: medium

Describe your experience choosing between prompting, RAG, fine-tuning, and agentic workflows for an LLM use case.

Sample answer

I treat those as tools with different tradeoffs, not as competing philosophies. Prompting is my first choice when the task is narrow and the knowledge is already in the model, because it is fast to iterate and easy to maintain. I use RAG when the model needs current, proprietary, or domain-specific information, especially when accuracy and traceability matter. Fine-tuning makes sense when I want consistent style, formatting, or behavior that prompting alone can’t stabilize, and I have enough high-quality examples. Agentic workflows are useful when the task requires multi-step reasoning, tool use, or conditional branching, but I’m careful because they can become expensive and hard to debug. In practice, I usually start with prompting plus RAG, then add fine-tuning only if I can prove it improves quality enough to justify the effort. I prefer the simplest approach that meets the business goal and can be observed clearly in production.

Question 3

Difficulty: hard

How do you evaluate whether an LLM application is actually good enough to launch?

Sample answer

I don’t rely on a single score, because LLM systems usually fail in more than one way. I define evaluation around the user journey and break it into dimensions like correctness, completeness, tone, safety, latency, and cost. For some tasks, I use a labeled test set with golden answers. For others, I use rubric-based judging with human review, especially when the output is subjective. I also look at failure categories, not just aggregate metrics, because a system that scores well overall can still be unusable if it hallucinates on important edge cases. In addition, I want a small set of adversarial cases that reflect real production risks, such as prompt injection, ambiguous queries, or missing context. Before launch, I compare the system to a baseline, ideally the current manual workflow. If it’s not clearly better on the core business metric, I keep iterating. I care most about whether it helps users reliably in real conditions.
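To make the point about failure categories concrete, here is a minimal sketch of a per-category launch check. The category names, the thresholds, and the `EvalResult` structure are illustrative assumptions, not part of any particular evaluation framework.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    category: str  # hypothetical labels, e.g. "normal" or "prompt_injection"
    passed: bool


def pass_rate_by_category(results):
    """Return {category: pass rate} so a weak category is not hidden by the average."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        if r.passed:
            passes[r.category] += 1
    return {cat: passes[cat] / totals[cat] for cat in totals}


def launch_blockers(results, thresholds):
    """List categories whose pass rate is below the required threshold."""
    rates = pass_rate_by_category(results)
    return sorted(cat for cat, rate in rates.items() if rate < thresholds.get(cat, 0.0))


# Eight ordinary cases pass; one of two adversarial cases fails.
results = [EvalResult(f"n{i}", "normal", True) for i in range(8)]
results += [
    EvalResult("p1", "prompt_injection", True),
    EvalResult("p2", "prompt_injection", False),
]
overall = sum(r.passed for r in results) / len(results)
blockers = launch_blockers(results, {"normal": 0.9, "prompt_injection": 1.0})
```

In this made-up data the overall pass rate is 90%, yet the adversarial category sits at 50% and shows up as a blocker, which is the kind of gap an aggregate score hides.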

Question 4

Difficulty: hard

Tell me about a time you had to reduce hallucinations in an LLM application.

Sample answer

In one project, the model was generating polished but incorrect answers in a domain where factual accuracy mattered. My first step was to isolate where the errors were coming from: retrieval quality, prompt structure, or the model filling gaps too aggressively. I found that the retrieval step was returning relevant documents, but the prompt didn’t strongly enforce citation-based answering, so the model was blending retrieved facts with its own assumptions. I tightened the prompt to require answers only from retrieved context and added an explicit “say you don’t know” behavior when the evidence was weak. I also improved chunking and metadata filters so the retriever brought back more focused context. After that, I built a small evaluation set of tricky queries and monitored unsupported claims over time. The biggest lesson was that hallucination reduction is usually a system problem, not just a model problem. You need better retrieval, clearer instructions, and evaluation that reflects the real risk.
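The two prompt-side changes described above, answering only from retrieved context and declining when evidence is weak, could look roughly like the sketch below. The score threshold, the abstain wording, and the `[n]` citation format are assumptions chosen for illustration, and the uncited-sentence check is only a crude proxy for unsupported claims.

```python
import re

# Hypothetical abstain message; the exact wording is a product decision.
ABSTAIN = "I don't know based on the available documents."


def build_grounded_prompt(question, chunks, min_score=0.5):
    """chunks is a list of (text, retrieval_score) pairs.

    Returns a prompt string, or None when no chunk clears the threshold,
    in which case the caller can return ABSTAIN without calling the model.
    """
    strong = [(text, score) for text, score in chunks if score >= min_score]
    if not strong:
        return None
    context = "\n".join(f"[{i}] {text}" for i, (text, _) in enumerate(strong, start=1))
    return (
        "Answer using only the numbered context below. "
        "Cite sources like [1]. "
        f"If the context does not contain the answer, reply exactly: {ABSTAIN}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def uncited_sentences(answer):
    """Return sentences with no [n] citation, as candidates for manual review."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[\d+\]", s)]


prompt = build_grounded_prompt(
    "How long do refunds take?",
    [("Refunds are processed within 5 business days.", 0.82), ("Office hours are 9 to 5.", 0.12)],
)
flagged = uncited_sentences("Refunds take 5 business days [1]. They are always approved.")
```

Counting flagged sentences across a fixed set of tricky queries gives a rough trend line for unsupported claims, though it does not verify that a cited sentence is actually backed by its source.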

Question 5

Difficulty: hard

How would you design a retrieval-augmented generation system for an internal knowledge base?

Sample answer

I’d design it around accuracy, permissioning, and maintainability. First, I’d identify the document sources and how often they change, because indexing strategy depends on freshness. Then I’d preprocess documents into clean chunks with metadata such as source, date, department, and access level. For retrieval, I’d likely use hybrid search so keyword matches and semantic similarity complement each other. I’d also include reranking, because the top retrieved chunks are often not the best ones without a second pass. On the generation side, I’d constrain the model to answer only from retrieved context and include citations so users can verify the source. Access control is critical: the retrieval layer must respect user permissions before the model ever sees the content. Finally, I’d log queries, retrieved passages, and user feedback so I can improve both retrieval quality and chunking over time. A good RAG system is not just about clever embeddings; it is about data hygiene, retrieval discipline, and safe orchestration.
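As a rough illustration of permission-aware hybrid retrieval, the sketch below filters by access level before fusing a keyword ranking and a semantic ranking. Reciprocal rank fusion is used here as one common way to combine rankings; it is my choice for the example, not something the design above depends on. The `Chunk` fields and the input rankings are hypothetical stand-ins for whatever the search backends return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    access_level: str  # hypothetical metadata field, e.g. "public" or "hr"


def reciprocal_rank_fusion(rankings, k=60):
    """Combine ranked lists of ids; each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_retrieve(chunks, keyword_ranking, semantic_ranking, user_levels, top_n=3):
    """Drop chunks the user may not see BEFORE ranking, then fuse the two rankings."""
    permitted = {c.chunk_id for c in chunks if c.access_level in user_levels}
    keyword = [cid for cid in keyword_ranking if cid in permitted]
    semantic = [cid for cid in semantic_ranking if cid in permitted]
    return reciprocal_rank_fusion([keyword, semantic])[:top_n]


chunks = [
    Chunk("a", "Travel policy overview", "public"),
    Chunk("b", "Salary bands", "hr"),
    Chunk("c", "Expense limits", "public"),
]
public_hits = hybrid_retrieve(chunks, ["b", "a", "c"], ["a", "b", "c"], {"public"})
```

Filtering before fusion means a restricted chunk can never reach the prompt, even if it ranks first in both searches. A reranking pass would then run only over the permitted candidates.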

Question 6

Difficulty: hard

How do you handle prompt injection or malicious user input in an LLM product?

Sample answer

I assume prompt injection will happen, so I design defense in depth. The first layer is architectural: I separate system instructions, user input, and retrieved content as much as possible, and I do not let untrusted content rewrite policy. If the product uses tools or retrieval, I validate inputs before they reach the model and restrict what the model can do through tools with tight permissions. I also make sure the model is not trusted to decide security-sensitive actions on its own. For example, if it can access internal documents, I enforce authorization outside the model rather than hoping the prompt will protect it. On the content side, I test against jailbreaks, instruction overrides, and malicious documents that try to hijack the response. I also monitor for suspicious patterns in production. Most importantly, I treat the model as a potentially manipulable component, not as an agent with judgment. Safety comes from system design, not from optimism about the prompt.
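A minimal sketch of enforcing authorization outside the model: every tool call the model proposes goes through a gate that checks the caller's role against an allowlist. The role names, tool names, and registry are hypothetical.

```python
# Hypothetical role-to-tool allowlist; in practice this would come from the
# product's existing authorization system, not from the prompt.
ALLOWED_TOOLS = {
    "viewer": {"search_docs"},
    "admin": {"search_docs", "delete_doc"},
}


class ToolCallDenied(Exception):
    """Raised when the model requests a tool the current user may not use."""


def execute_tool_call(user_role, tool_name, args, registry):
    """Run a model-requested tool only if the user's role permits it."""
    allowed = ALLOWED_TOOLS.get(user_role, set())
    if tool_name not in allowed or tool_name not in registry:
        raise ToolCallDenied(f"{user_role!r} may not call {tool_name!r}")
    return registry[tool_name](**args)


# Stub tools standing in for real implementations.
registry = {
    "search_docs": lambda query: [f"result for {query}"],
    "delete_doc": lambda doc_id: f"deleted {doc_id}",
}

search_result = execute_tool_call("viewer", "search_docs", {"query": "vpn"}, registry)

try:
    # Simulates a hijacked response asking for a destructive action.
    execute_tool_call("viewer", "delete_doc", {"doc_id": "42"}, registry)
    denied = False
except ToolCallDenied:
    denied = True
```

The check depends only on the authenticated user's role, so no instruction injected into the prompt or a retrieved document can widen what the model is able to do.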

Question 7

Difficulty: medium

How do you balance latency, cost, and quality when deploying LLM features?

Sample answer

I treat it like a product optimization problem, not just an engineering one. First, I identify which quality attributes matter most for the use case. A customer support summary can tolerate a bit more latency than a live autocomplete feature, for example. Then I look at the main cost drivers: model size, context length, number of calls, retrieval overhead, and tool usage. I usually start with a capable but smaller model and measure whether it meets the bar before jumping to a larger one. If quality is close but latency is too high, I’ll try prompt simplification, caching, shorter contexts, or response streaming before changing models. I also like to route requests intelligently: simple queries can go to a cheaper model, while harder ones escalate to a stronger model. The key is to make tradeoffs visible with metrics so we’re not optimizing blindly. Good LLM systems are usually the result of careful orchestration, not just more compute.
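One way to sketch the cheap-first routing idea is an escalation wrapper: try the smaller model, and fall back to the stronger one only when a quality check rejects the draft. The model callables and the `is_good_enough` check are placeholders; a real check might validate a schema or use a confidence signal.

```python
from collections import Counter

# Counts which tier served each request, so the cost/quality tradeoff is visible.
tier_counts = Counter()


def answer_with_escalation(query, small_model, large_model, is_good_enough):
    """Return (answer, tier). Calls the large model only if the small draft fails the check."""
    draft = small_model(query)
    if is_good_enough(draft):
        tier_counts["small"] += 1
        return draft, "small"
    tier_counts["large"] += 1
    return large_model(query), "large"


# Stub models for illustration: the small one only handles one easy query.
def small_model(query):
    return "Paris" if query == "Capital of France?" else ""


def large_model(query):
    return "A longer, more careful answer."


def is_good_enough(answer):
    return len(answer) > 0  # placeholder quality check


easy = answer_with_escalation("Capital of France?", small_model, large_model, is_good_enough)
hard = answer_with_escalation("Compare two contract clauses.", small_model, large_model, is_good_enough)
```

The tradeoff is that escalated requests pay for both calls and wait for both, so this pattern fits best when most traffic is easy. If hard queries are common, classifying up front and routing directly may be cheaper.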

Question 8

Difficulty: medium

Describe a situation where an LLM output was technically correct but still not useful. What did you do?

Sample answer

I’ve seen this happen when the model gives a correct answer but in a format that doesn’t match how users actually work. In one case, a model generated a long, accurate explanation, but the user needed a short decision and a clear next step. The output was technically fine, but it created friction because people had to extract the actionable piece themselves. I addressed it by redesigning the response contract rather than just tweaking the prompt. We changed the system to produce a concise recommendation first, followed by supporting detail only if needed. I also added structure like bullet points, confidence indicators, and specific follow-up actions. After that, the same underlying information became much more usable. That experience reminded me that LLM quality is not only about truthfulness; it is about whether the answer fits the user’s workflow. A perfect answer that is hard to consume is still a weak product.
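A response contract like the one described above could be expressed as a small structure with the decision first and details optional. The field names and confidence labels are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    decision: str
    next_step: str
    confidence: str  # hypothetical labels: "low", "medium", "high"
    details: list = field(default_factory=list)


def render(rec, include_details=False):
    """Put the decision and next step first; show supporting detail only on request."""
    lines = [
        f"Recommendation: {rec.decision}",
        f"Next step: {rec.next_step}",
        f"Confidence: {rec.confidence}",
    ]
    if include_details:
        lines += [f"- {d}" for d in rec.details]
    return "\n".join(lines)


rec = Recommendation(
    decision="Approve the refund",
    next_step="Send the confirmation email",
    confidence="high",
    details=["Order is within the 30-day window", "Item was returned unopened"],
)
short_view = render(rec)
full_view = render(rec, include_details=True)
```

Keeping the contract in code rather than only in the prompt means the display layer, not the model, decides how much detail a user sees.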

Question 9

Difficulty: medium

How do you keep an LLM system maintainable as prompts, models, and requirements change over time?

Sample answer

I try to make the LLM layer as testable and modular as any other production system. Prompts should live in version control, with clear ownership and change history, not scattered across notebooks or ad hoc configs. I also keep the orchestration logic separate from prompt text so we can evolve one without breaking the other. For maintenance, I rely on a regression suite with representative examples, edge cases, and failure scenarios. Every meaningful prompt or model change should run against that suite before release. I also track production metrics like user satisfaction, refusal rates, retrieval precision, and latency so we notice drift early. When possible, I add structured outputs, schema validation, and fallback handling to reduce fragility. Documentation matters too, especially around why a prompt exists and what it is supposed to do. LLM systems change quickly, so maintainability comes from discipline: tests, versioning, observability, and a preference for simple designs that are easy to revise.
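A minimal sketch of schema validation with a fallback, plus a regression loop that runs before a prompt or model change ships. The required fields, the fallback value, and the `generate` callable are hypothetical; a real suite would load its cases from version control.

```python
import json

# Hypothetical output schema: field name -> expected Python type.
REQUIRED_FIELDS = {"summary": str, "priority": str}
FALLBACK = {"summary": "", "priority": "unknown"}


def parse_model_output(raw):
    """Return (data, ok). On any schema violation, return a safe fallback and ok=False."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK), False
    if not isinstance(data, dict):
        return dict(FALLBACK), False
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return dict(FALLBACK), False
    return data, True


def run_regression(cases, generate):
    """cases is a list of (input_text, check) pairs; returns the inputs that failed."""
    failures = []
    for input_text, check in cases:
        output, ok = parse_model_output(generate(input_text))
        if not ok or not check(output):
            failures.append(input_text)
    return failures


# Stub generator standing in for a prompt + model call.
def generate(input_text):
    if input_text == "server down":
        return '{"summary": "Outage reported", "priority": "high"}'
    return "Sorry, I cannot help with that."  # not JSON, so validation fails


cases = [
    ("server down", lambda out: out["priority"] == "high"),
    ("password reset", lambda out: out["priority"] == "low"),
]
failures = run_regression(cases, generate)
```

Here the second case fails because the stub returns free text instead of JSON, which is the kind of regression this suite is meant to catch before release.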

Question 10

Difficulty: easy

Why do you want to work as an LLM Engineer, and what makes you effective in this role?

Sample answer

I like this role because it sits at the intersection of product thinking, machine learning, and software engineering, which is where I do my best work. I enjoy turning ambiguous ideas into systems that solve real problems for users, especially when the answer is not just “train a bigger model.” What makes me effective is that I’m comfortable moving between experimentation and production concerns. I can build quickly enough to learn, but I also care about evaluation, reliability, and cost once something is live. I’m also realistic about what LLMs can and cannot do, which helps me avoid overpromising and build systems that are actually trustworthy. I work well with product teams because I ask about the user outcome first, and I work well with engineering teams because I think in interfaces, failure modes, and operational metrics. For me, the exciting part is making these systems genuinely useful, not just impressive in demos.