Search Relevance Engineer

Interview questions for Search Relevance Engineer roles.

10 questions

Question 1

Difficulty: medium

How would you diagnose a drop in search relevance after a new ranking model is deployed?

Sample answer

I’d start by separating whether the issue is model quality, feature distribution shift, or a bug in the serving pipeline. First I’d compare offline metrics before and after deployment, then validate against live traffic using query-level slices, because overall averages can hide serious regressions. I’d look at high-volume head queries, long-tail queries, navigational queries, and queries with fresh content to see where the drop is concentrated. Next I’d inspect feature values in production to confirm they match training-time assumptions, especially for missing values, timestamp-based features, and any lexical or behavioral signals. If the model itself looks fine offline, I’d check search logs for reranking issues, index freshness, and business rules that may be overriding scores. I also like to sample real results and read them manually, because relevance failures often show up in ways metrics don’t capture. My goal is to isolate the smallest failing component and fix that first.
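
A minimal sketch of the query-level slicing step, assuming each query already carries a slice label and a per-query relevance score (for example NDCG) from both the old and new model. The slice names, scores, and the `slice_deltas` helper are hypothetical and only illustrate how an overall average can hide a concentrated regression.

```python
from collections import defaultdict
from statistics import mean

def slice_deltas(records):
    """records: iterable of (slice_name, baseline_score, candidate_score), one per query.
    Returns {slice_name: mean(candidate) - mean(baseline)}."""
    by_slice = defaultdict(lambda: ([], []))
    for slice_name, base, cand in records:
        by_slice[slice_name][0].append(base)
        by_slice[slice_name][1].append(cand)
    return {name: mean(cand) - mean(base) for name, (base, cand) in by_slice.items()}

# Hypothetical per-query scores before and after the deployment.
records = [
    ("head", 0.80, 0.84),
    ("head", 0.70, 0.74),
    ("long_tail", 0.60, 0.40),
    ("navigational", 0.90, 0.90),
]
deltas = slice_deltas(records)
worst = min(deltas, key=deltas.get)  # slice with the largest drop
print(worst, round(deltas[worst], 3))
```

In this made-up data the head slice improves slightly while the long-tail slice drops sharply, which is the kind of pattern that points the investigation at one part of the traffic.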

Question 2

Difficulty: medium

Describe how you would evaluate whether a change to the search ranking algorithm actually improved relevance.

Sample answer

I’d use a layered evaluation approach rather than trusting a single metric. First, I’d define success with product and search stakeholders: are we optimizing for click-through, satisfied queries, task completion, or reduced reformulations? Then I’d run an offline evaluation using judgment labels or historical interaction data to compare the old and new ranking on a representative query set. I’d break results down by query intent, language, device, and traffic segment so we can see where the change helps or hurts. After that, I’d use an A/B test with guardrails like latency, abandonment, and zero-result rate. I’m careful not to overread short-term click lift, because a change can boost clicks while making results less useful. I also monitor query reformulation rate and successful session completion, since those are often better relevance indicators. If the data is mixed, I’d review query samples manually and decide whether to iterate before broad rollout.
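
One common way to run the offline comparison is NDCG@k over graded judgment labels. The sketch below assumes integer grades (0 = irrelevant, higher = better) and that every judged document for the query appears in the ranked list, so the ideal ordering can be taken from the same grades. The example rankings are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k graded results."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Grades of the results in the order each ranker returned them (hypothetical).
old_ranking = [1, 3, 0, 2]
new_ranking = [3, 2, 1, 0]

print(ndcg_at_k(old_ranking, 4), ndcg_at_k(new_ranking, 4))
```

Averaging this per query, and then per segment, gives the breakdown by intent, language, or device described above.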

Question 3

Difficulty: medium

Tell me about a time you had to improve relevance with incomplete or noisy training data.

Sample answer

In one project, the click logs we relied on were noisy because popular results were getting clicks even when they were not the best answer. Instead of treating clicks as ground truth, I built a more careful labeling strategy. I combined multiple signals: dwell time, reformulation behavior, and explicit human judgments for a smaller but cleaner sample set. I also filtered out ambiguous sessions where the query intent was unclear or where the click was more likely driven by position bias than by relevance. That gave us a much better signal for training and evaluation. On the modeling side, I kept the feature set simple enough to avoid overfitting to artifacts in the logs. We saw a noticeable improvement in ranking stability, especially for mid-tail queries where the old model had been overly dependent on popularity. What I took away from that work is that relevance modeling is often more about data quality and judgment discipline than fancy algorithms. The best model is only as good as the signal behind it.

Question 4

Difficulty: easy

What signals would you consider when building a search relevance model for an e-commerce catalog?

Sample answer

For e-commerce, I’d think in terms of matching, intent, and business usefulness. On the matching side, lexical overlap, synonyms, category alignment, and attributes like brand, size, and color matter a lot. But I would not stop there, because pure matching can surface irrelevant products that happen to share keywords. I’d also include behavioral signals such as historical click-through, add-to-cart rate, and purchase conversion, while being careful to normalize for exposure and position bias. Freshness can matter too, especially for seasonal products or inventory changes. If the business allows it, availability and shipping constraints should be strong signals, because a relevant item that can’t ship is not useful. I’d also pay attention to query intent classification: some queries are highly specific, while others are exploratory, and the ranking strategy should reflect that. Finally, I’d make sure the model doesn’t over-optimize for sales at the expense of user satisfaction, because relevance should solve the query first and support the business second.
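
One way to normalize click-through for exposure and position is clicks over expected clicks (COEC): divide observed clicks by the clicks an average result would have received at the same positions. The sketch below is an illustration under assumptions, not a prescribed method; the per-position baseline CTR values and impression logs are hypothetical.

```python
def coec(impressions, position_ctr):
    """impressions: list of (position, clicked) for one query-product pair.
    position_ctr: {position: baseline CTR of an average result at that position}.
    Returns observed clicks divided by expected clicks."""
    clicks = sum(1 for _, clicked in impressions if clicked)
    expected = sum(position_ctr[pos] for pos, _ in impressions)
    return clicks / expected if expected > 0 else 0.0

# Hypothetical baseline CTR by position.
position_ctr = {1: 0.30, 5: 0.05}

# Product A: shown 10 times at position 1, clicked 3 times (raw CTR 0.30).
product_a = [(1, True)] * 3 + [(1, False)] * 7
# Product B: shown 10 times at position 5, clicked 2 times (raw CTR 0.20).
product_b = [(5, True)] * 2 + [(5, False)] * 8

print(coec(product_a, position_ctr), coec(product_b, position_ctr))
```

Product A has the higher raw CTR, but only performs as expected for its position, while product B earns several times the clicks its position would predict.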

Question 5

Difficulty: medium

How do you handle a situation where product managers want higher click-through, but your relevance analysis shows the change could hurt user satisfaction?

Sample answer

I’d try to reframe the discussion around long-term search quality rather than just short-term clicks. Click-through can be useful, but it’s easy to game with attention-grabbing results that don’t actually solve the query. I’d present evidence from multiple sources: offline judgments, reformulation rate, session completion, and user complaints if we have them. Then I’d explain the tradeoff clearly: a ranking change that increases clicks but also increases backtracking or query reformulation is probably creating friction. From there, I’d suggest testing a compromise, like limiting the new behavior to a subset of queries or using it as a secondary signal rather than a primary one. I’ve found that stakeholders respond well when you offer options instead of just saying no. I’d also ask whether the business goal is truly click-through or something downstream like conversions, because those are not always aligned. My job would be to protect the user experience while still moving the business metrics in a responsible way.

Question 6

Difficulty: medium

How would you design a relevance evaluation set for a new search vertical with little historical data?

Sample answer

I’d start by defining the main query types and user intents in the vertical, because that shapes everything else. With little historical data, I would not wait for logs to do all the work. Instead, I’d work with product experts to sample a broad set of representative queries, including common, ambiguous, and edge-case queries. Then I’d create a labeling guide that explains what counts as relevant, highly relevant, and irrelevant in the context of that vertical. I’d want multiple annotators, ideally with adjudication on disagreements, so the labels are consistent. If possible, I’d supplement the expert-sampled queries with synthetically generated ones to expand coverage, but I’d keep those examples clearly separated from the core evaluation set. I’d also make sure the set includes negative examples, because those are critical for testing precision. Finally, I’d version the dataset carefully so future model changes are compared against the same standard. In a new vertical, the evaluation set becomes the foundation for everything that follows.
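
Label consistency across annotators can be checked with an agreement statistic before adjudication. The sketch below computes Cohen's kappa for two annotators. The grade scale (0 = irrelevant, 1 = relevant, 2 = highly relevant) and the example labels are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators gave the same grade.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's grade frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical grade for every item
    return (observed - expected) / (1.0 - expected)

# Hypothetical grades for six query-document pairs.
annotator_a = [2, 2, 1, 0, 0, 1]
annotator_b = [2, 1, 1, 0, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

A low kappa on a pilot batch usually suggests the labeling guide needs sharper definitions before the full set is annotated.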

Question 7

Difficulty: hard

Explain how you would debug a case where search latency improved but relevance got worse.

Sample answer

That kind of regression often means something in the serving path changed, even though the speed improvement looks good on the surface. I’d first compare the candidate and baseline pipelines to see whether the faster version is skipping a reranking stage, using fewer features, or timing out on some requests. Sometimes latency improvements come from simplifying the request path in ways that reduce relevance quality. I’d inspect which queries got worse and whether the degradation is concentrated in long queries, rare queries, or queries that depend on expensive features. Next, I’d check for feature fetch failures, cache misses, or truncation of result sets that might be changing the candidate pool. I’d also look at ranking depth, because if fewer documents are scored, the final order can suffer. I’d validate the issue with manual query review and offline comparisons. My approach is to treat latency and relevance as a coupled system, not separate goals. If speed improved by removing critical signal, I’d push to reintroduce the signal more efficiently rather than accept the tradeoff.
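
A quick way to find which queries changed is to compare the top-k result sets of the baseline and faster pipelines. The sketch below uses Jaccard overlap of the top k document IDs. The query strings, document IDs, and the 0.5 threshold are hypothetical.

```python
def overlap_at_k(baseline_ids, candidate_ids, k):
    """Jaccard overlap between the top-k document IDs of two result lists."""
    base, cand = set(baseline_ids[:k]), set(candidate_ids[:k])
    if not base and not cand:
        return 1.0
    return len(base & cand) / len(base | cand)

# Hypothetical top results per query from each pipeline.
results = {
    "running shoes": (["a", "b", "c"], ["a", "b", "c"]),
    "waterproof trail running shoes size 11": (["d", "e", "f"], ["d", "x", "y"]),
}

# Flag queries whose top-3 changed substantially (threshold is an assumption).
flagged = [
    query
    for query, (baseline, candidate) in results.items()
    if overlap_at_k(baseline, candidate, 3) < 0.5
]
print(flagged)
```

Low overlap does not prove the new results are worse, so the flagged queries still need manual review or judged labels. It does narrow the search to the queries most affected by the serving change.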

Question 8

Difficulty: medium

Tell me about a time you disagreed with a stakeholder about search quality metrics. How did you handle it?

Sample answer

I once worked with a stakeholder who was focused almost entirely on CTR, while my analysis suggested the proposed change would hurt relevance for informational queries. Rather than arguing abstractly, I brought concrete query examples and showed how the same change behaved differently across intent types. I also walked through session-level metrics, which made the tradeoff more visible: the change increased clicks on some queries but caused more reformulations on others. That helped shift the conversation from a single metric to a more complete view of search quality. I made sure to acknowledge the business concern, because the stakeholder was not wrong to care about engagement. The real issue was that CTR alone was too narrow. We agreed to run a limited experiment with intent-based guardrails and to review results by segment. That approach kept the relationship collaborative and turned a disagreement into a better decision process. I think good search work depends on being able to translate technical findings into language that stakeholders trust.

Question 9

Difficulty: hard

What would you do if a query returns many superficially relevant results, but users still seem dissatisfied?

Sample answer

That’s usually a sign that the system is matching keywords but missing the true intent. I’d look at the query shape first: is it ambiguous, underspecified, or likely to have multiple intents? Then I’d inspect the result set to see whether the pages or products are technically related but not actually useful. User dissatisfaction often shows up as quick bounces, short dwell time, or repeated reformulations even when the top results look reasonable at a glance. In that case, I’d consider adding intent disambiguation, stronger semantic matching, or result diversification if the query naturally spans multiple interpretations. For example, a broad query may need a mix of authoritative, popular, and niche results rather than ten near-duplicates. I’d also review whether ranking is over-weighting popularity or text overlap at the expense of specificity. My goal would be to improve not just relevance in the narrow sense, but usefulness in the user’s context. Good search should make the user feel understood, not just matched.
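
Result diversification can be sketched with maximal marginal relevance (MMR), which greedily picks the next result by trading relevance against similarity to results already chosen. This is one possible technique, not necessarily the right fit for every query. The relevance scores, topic labels, and `lam` weight below are hypothetical.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly select the document that maximizes
    lam * relevance - (1 - lam) * (max similarity to already selected docs)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(doc):
            max_sim = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * max_sim
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: a1 and a2 are near-duplicates, b1 covers another interpretation.
relevance = {"a1": 0.90, "a2": 0.85, "b1": 0.60}
topic = {"a1": "A", "a2": "A", "b1": "B"}

def same_topic(x, y):
    return 1.0 if topic[x] == topic[y] else 0.0

diversified = mmr(["a1", "a2", "b1"], relevance, same_topic, k=3, lam=0.7)
pure_relevance = mmr(["a1", "a2", "b1"], relevance, same_topic, k=3, lam=1.0)
print(diversified, pure_relevance)
```

With `lam=1.0` the order is purely by relevance. Lowering it moves the second interpretation ahead of the near-duplicate.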

Question 10

Difficulty: easy

How do you balance manual judgment labels with behavioral data when training a search relevance system?

Sample answer

I treat them as complementary signals, not competing sources of truth. Manual labels are valuable because they reflect a deliberate assessment of relevance, but they’re expensive and usually limited in scale. Behavioral data gives broad coverage and captures real user behavior, but it’s noisy and influenced by position bias, presentation, and intent ambiguity. In practice, I like to use human judgments to define the target standard and behavioral data to improve scale and generalization. For example, labels can anchor evaluation sets and model selection, while clicks, dwell time, and conversion can help with training or feature engineering if they’re cleaned up carefully. I’m especially cautious with implicit feedback on head queries, where popularity can distort interpretation. I also prefer to use behavioral signals in a weighted or debiased way rather than as raw ground truth. The balance depends on the use case, but I always want a system that learns from users without blindly copying their most visible choices. That usually gives the most reliable relevance gains.
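
One way to use clicks in a debiased rather than raw form is inverse propensity weighting: each click is weighted by the inverse of the estimated probability that its position was examined. The sketch below assumes those propensities have already been estimated elsewhere. The propensity values and click log are hypothetical.

```python
def ipw_click_weights(clicks, propensity):
    """clicks: list of (doc_id, position) for clicked results, position is 1-based.
    propensity: {position: estimated probability that the position is examined}.
    Returns {doc_id: sum of 1 / propensity over that document's clicks}."""
    weights = {}
    for doc_id, position in clicks:
        weights[doc_id] = weights.get(doc_id, 0.0) + 1.0 / propensity[position]
    return weights

# Hypothetical examination propensities by position.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# Document "a" was clicked twice at position 1, document "b" once at position 3.
clicks = [("a", 1), ("a", 1), ("b", 3)]
weights = ipw_click_weights(clicks, propensity)
print(weights)
```

The single click at position 3 ends up weighted more than the two clicks at position 1, because a click on a rarely examined position is stronger evidence of relevance. Human-judged sets remain the standard the resulting model is evaluated against.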