Linux System Administrator

Interview questions for Linux System Administrator roles.

10 questions

Question 1

Difficulty: medium

How do you approach managing a Linux server environment to keep systems stable, secure, and easy to maintain?

Sample answer

I start with consistency. In practice, that means standardizing OS builds, package sources, configuration baselines, and naming conventions so the environment is predictable. From there, I focus on monitoring and automation. I want good visibility into CPU, memory, disk, logs, and service health before problems become outages, and I prefer using automation for repetitive tasks like patching, user provisioning, and config changes. Security is part of the routine, not a separate project, so I keep systems patched, minimize exposed services, enforce least privilege, and review SSH and sudo access regularly. I also document changes and keep rollback steps ready because the best-administered systems are the ones you can safely recover. My goal is to reduce manual drift, make troubleshooting faster, and create an environment where new servers behave like existing ones instead of becoming special cases.

Question 2

Difficulty: hard

Tell me about a time you resolved a Linux server outage or major incident. What was your process?

Sample answer

In a recent incident, one of our application servers became unresponsive during business hours and users were seeing timeouts. I first confirmed the scope by checking whether the issue was isolated to one host or part of a wider problem. Then I looked at system load, disk usage, and the logs because those usually tell you whether the failure is resource-related, storage-related, or application-related. In this case, disk space on the root filesystem had filled up because log rotation had stopped working after a config change. I cleared enough space to restore service, then traced the change that broke logrotate and fixed the configuration. After that, I added monitoring for filesystem thresholds and updated our runbook so the team would know exactly what to check first. I think the key was staying methodical and restoring service first, before digging into the root cause.
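The filesystem-threshold monitoring mentioned in this answer could be sketched roughly as follows. This is a minimal illustration, not the check used in the incident: the function names and the 90% default are assumptions, and the percentage may differ slightly from what df reports because df accounts for reserved blocks.

```python
import shutil


def percent_used(used: int, total: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def check_filesystem(path: str, threshold: float = 90.0) -> tuple:
    """Return (percent_used, breached) for the filesystem containing `path`.

    The 90% default is an assumed threshold, not a value from the incident.
    """
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.used, usage.total)
    return pct, pct >= threshold


if __name__ == "__main__":
    pct, breached = check_filesystem("/")
    note = " (over threshold)" if breached else ""
    print(f"/ is {pct:.1f}% used{note}")
```

In practice a check like this would feed an existing monitoring system rather than print to a terminal.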

Question 3

Difficulty: medium

How do you handle Linux patching and kernel updates in a production environment without causing unnecessary downtime?

Sample answer

I treat patching as a controlled change process, not just a routine command. Before any production update, I verify the maintenance window, check whether the update affects the kernel, libc, or any service-critical package, and confirm rollback options. Where a lower environment mirrors the production build, I test patches there first. For production, I prefer staged rollouts so I can update one group or host at a time and watch for issues before moving on. If a kernel update requires a reboot, I coordinate it with the application owners and make sure backups and failover are in place. I also review changelogs so I know what behavior might change. After patching, I validate service health, log in to confirm the system booted cleanly, and compare monitoring data against the pre-change baseline. That approach keeps security current without turning patch day into an outage.
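A staged rollout of the kind described here can be planned with a few lines of code. The sketch below only computes the order of batches; the host names, the canary count, and the batch size are hypothetical, and the actual patching and health checks between batches are left out.

```python
def rollout_batches(hosts, canary_count=1, batch_size=3):
    """Split hosts into a canary batch followed by fixed-size batches."""
    if canary_count < 0 or batch_size < 1:
        raise ValueError("canary_count must be >= 0 and batch_size >= 1")
    hosts = list(hosts)
    batches = []
    # The first batch is the canary group that gets patched and observed first.
    if canary_count and hosts:
        batches.append(hosts[:canary_count])
    remaining = hosts[canary_count:]
    for start in range(0, len(remaining), batch_size):
        batches.append(remaining[start:start + batch_size])
    return batches


if __name__ == "__main__":
    example_hosts = [f"app{i}" for i in range(1, 7)]  # hypothetical host names
    for number, batch in enumerate(rollout_batches(example_hosts, 1, 2), 1):
        print(f"batch {number}: {', '.join(batch)}")
```

Patching the canary batch first means a bad update surfaces on one host before it reaches the rest of the group.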

Question 4

Difficulty: hard

How do you troubleshoot high CPU or memory usage on a Linux server?

Sample answer

I usually start by identifying whether the issue is persistent or a spike. Tools like top, htop, ps, free, vmstat, and sar help me see if the problem is a single process, a runaway service, or system-wide pressure. If CPU is high, I check which process is consuming it and whether that aligns with expected behavior. If memory is the concern, I look at available memory, swap activity, and whether low free memory simply reflects page cache that the kernel can reclaim or an actual leak. I also check logs and application metrics because Linux symptoms often point to an application issue rather than an OS issue. On one server, a Java service was gradually consuming memory after a deployment, so I confirmed the leak pattern, restarted the service to stabilize production, and escalated with evidence to the app team. My goal is to isolate the cause quickly, restore performance, and collect enough data to prevent the same issue from coming back.
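To make the cache-versus-leak distinction concrete, here is a small sketch that summarizes /proc/meminfo-style text. It assumes a kernel that reports MemAvailable; the sample figures are invented for illustration and the function names are hypothetical.

```python
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:    6000000 kB
Buffers:          100000 kB
Cached:          5000000 kB
SwapTotal:       2000000 kB
SwapFree:        2000000 kB
"""  # invented numbers, not taken from a real host


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields: dict) -> dict:
    """Summarize how much memory is really available versus held as cache."""
    return {
        "available_pct": 100.0 * fields["MemAvailable"] / fields["MemTotal"],
        "cache_kb": fields.get("Cached", 0) + fields.get("Buffers", 0),
        "swap_used_kb": fields.get("SwapTotal", 0) - fields.get("SwapFree", 0),
    }


if __name__ == "__main__":
    print(memory_summary(parse_meminfo(SAMPLE_MEMINFO)))
```

In the sample, MemFree is low but most memory is still available because it is held as cache, which is not by itself a sign of a leak. On a live host the same functions could be fed the contents of /proc/meminfo.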

Question 5

Difficulty: medium

What steps would you take to secure SSH access on Linux servers?

Sample answer

I’d start by reducing the attack surface and tightening authentication. That means disabling direct root login, using key-based authentication, and restricting SSH access to approved users and management networks where possible. I also tighten the default sshd configuration carefully, for example by setting idle timeouts, disabling password authentication when the environment allows it, and using strong ciphers and modern protocol settings. On top of that, I would enforce sudo for privileged actions instead of giving broad root access. Logging and monitoring are important too, because failed logins, unusual source IPs, and repeated attempts can reveal an attack early. If the environment is sensitive, I’d add MFA through a bastion host or identity platform. Just as important, I’d document access procedures so emergency access is still possible without weakening security. The balance I aim for is strong protection without making operations impossible for the team.
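A quick audit of the settings named in this answer might look like the sketch below. It is deliberately simplified: it reads sshd_config-style text, keeps the first value seen for each keyword, stops at the first Match block, and does not follow Include directives. The expected values and function names are assumptions chosen for illustration; on a real host, the output of sshd -T is a more reliable view of the effective configuration.

```python
# Assumed policy for illustration; adjust to the environment's real baseline.
EXPECTED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
    "pubkeyauthentication": "yes",
}


def parse_sshd_config(text: str) -> dict:
    """Return {lowercased keyword: lowercased value} for global settings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if key == "match":
            break  # conditional blocks are not evaluated in this sketch
        settings.setdefault(key, value)  # the first occurrence wins
    return settings


def audit(settings: dict, expected: dict = EXPECTED) -> list:
    """List keywords whose explicit value differs from the expected one."""
    findings = []
    for key, want in expected.items():
        have = settings.get(key)  # None means "not set explicitly"
        if have != want:
            findings.append(f"{key}: expected {want!r}, found {have!r}")
    return findings
```

A keyword reported as None is simply not set in the text that was parsed; whether the built-in default is acceptable depends on the OpenSSH version in use.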

Question 6

Difficulty: medium

How do you use automation in your Linux administration work? Give an example.

Sample answer

Automation is one of the biggest ways I reduce risk and save time. I use it for tasks that are repetitive, error-prone, or need to be consistent across many servers. For example, I’ve used Ansible to manage user accounts, deploy standard configuration files, and apply package updates in a controlled way. That helped us eliminate drift because every server was being configured from the same source of truth instead of by hand. I also use shell scripts for smaller tasks like log checks, backup validation, and quick reporting, but I prefer automation tools when the process needs to scale or be audited. One useful pattern for me is to build idempotent playbooks, test them in staging, and then roll them out gradually. Automation has to be reliable, so I focus on version control, peer review, and clear rollback paths. If a task has to be done more than a few times, I start thinking about automating it.
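The idea of idempotence mentioned above can be shown outside Ansible with a short sketch: an operation that makes a change only when needed, so running it twice leaves the system in the same state as running it once. The helper name and the example setting are hypothetical.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True if it was added."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, so nothing is written
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.conf"  # hypothetical config file
        print(ensure_line(target, "max_connections = 100"))  # first run changes the file
        print(ensure_line(target, "max_connections = 100"))  # second run is a no-op
```

Returning whether anything changed mirrors how configuration management tools report "changed" versus "ok", which makes drift visible in run output.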

Question 7

Difficulty: hard

How would you troubleshoot a Linux server that can ping other systems but cannot reach the internet or a specific application service?

Sample answer

I’d break the problem into layers so I don’t guess. First, I’d confirm whether DNS resolution works, because being able to ping an IP is different from reaching a hostname. Then I’d check the default route, interface status, and firewall rules using tools like ip addr, ip route, ping, traceroute, ss, and firewall-cmd or iptables depending on the system. If the server can reach some systems but not one application service, I’d check port connectivity with curl, telnet, or nc and look for differences in routing, ACLs, or security groups. On the local host, I’d review SELinux status if relevant, because it can block services in ways that look like network issues. I’d also verify whether a proxy is required or whether the application service itself is down. The important part is to separate network, DNS, firewall, and application layers so the fix is targeted instead of random. That usually gets you to the root cause much faster.
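The layered approach in this answer can be sketched as two independent checks, one for name resolution and one for TCP reachability, so a failure points at a specific layer. The function names and the three-second timeout are assumptions; the sketch does not inspect routes, firewall rules, or SELinux.

```python
import socket


def resolve(hostname: str) -> list:
    """DNS layer: return the resolved addresses, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Transport layer: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(hostname: str, port: int) -> str:
    """Report the first layer that fails for hostname:port."""
    addresses = resolve(hostname)
    if not addresses:
        return "dns: name does not resolve"
    if not tcp_reachable(addresses[0], port):
        return "tcp: name resolves but the port is unreachable"
    return "ok: name resolves and the port accepts connections"
```

A "tcp" result still leaves routing, firewalls, security groups, or a stopped service as possible causes, which is where ip route, ss, and the firewall tooling come in.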

Question 8

Difficulty: medium

Describe your experience with Linux file systems, disk management, and logical volume management.

Sample answer

I’m comfortable working with disks, partitions, and LVM because storage issues can affect stability very quickly. In the environments I’ve supported, I’ve used commands like lsblk, df, du, fdisk, parted, pvcreate, vgcreate, lvextend, and resize2fs or xfs_growfs depending on the file system. I like LVM because it gives flexibility when a filesystem needs more space without forcing a redesign. If I see a volume filling up, I first identify whether the growth is expected or caused by logs, caches, or application data. Then I decide whether I can expand the current volume, clean up safely, or move the data to a better location. I’m also careful with mount options, fsck planning, and backups before making changes. A strong admin has to know both how to grow storage and how to avoid creating a mess when capacity runs low. In my experience, regular capacity monitoring prevents most storage emergencies.
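As a rough illustration of the "depending on the file system" step, the sketch below builds the command sequence for growing a logical volume and then its filesystem. It only returns the commands as a dry-run plan and never executes them; the volume path, mount point, and size are hypothetical, and the plan assumes the volume group already has free extents.

```python
def grow_plan(lv_path: str, mount_point: str, fs_type: str, extra: str) -> list:
    """Return the commands to extend an LV by `extra` and grow its filesystem."""
    plan = [["lvextend", "-L", f"+{extra}", lv_path]]
    if fs_type in ("ext2", "ext3", "ext4"):
        # resize2fs operates on the block device
        plan.append(["resize2fs", lv_path])
    elif fs_type == "xfs":
        # xfs_growfs is given the mount point of the mounted filesystem
        plan.append(["xfs_growfs", mount_point])
    else:
        raise ValueError(f"no grow step known for filesystem type {fs_type!r}")
    return plan


if __name__ == "__main__":
    # Hypothetical volume and mount point, printed for review rather than run.
    for command in grow_plan("/dev/vg0/data", "/data", "xfs", "5G"):
        print(" ".join(command))
```

Printing the plan first gives a chance to review it and confirm backups before anything touches the volume.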

Question 9

Difficulty: easy

How do you handle user account management and privilege control on Linux systems?

Sample answer

I try to make access management precise and auditable. For user accounts, I prefer centralized identity where possible, but even on local systems I keep accounts organized, named consistently, and tied to real business needs. I review active users regularly and remove stale accounts quickly. For privilege control, I rely on groups and sudo rather than broad root access. That makes it easier to see who can do what and reduces the chance of someone making a change they don’t fully understand. I also pay attention to home directory permissions, SSH key management, password policies, and account expiration where relevant. If someone needs temporary elevated access, I’d rather grant it for a defined period and log it than leave access open-ended. I’ve found that good privilege management is both a security control and an operations control, because clear ownership makes troubleshooting and audits much easier. It’s one of those areas where small mistakes can create big problems later.
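A periodic account review on local systems could start from something like the sketch below, which reads passwd-format text and flags two things worth a closer look: accounts other than root with UID 0, and accounts with an interactive login shell. The sample entries and the set of non-login shells are assumptions for illustration.

```python
# Shells treated as non-interactive in this sketch (an assumed list).
NOLOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}

SAMPLE_PASSWD = """\
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
toor:x:0:0:extra admin:/root:/bin/bash
"""  # invented entries, not taken from a real system


def parse_passwd(text: str) -> list:
    """Parse passwd-format text into a list of account dictionaries."""
    accounts = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) != 7 or not fields[2].isdigit():
            continue  # skip blank, commented, or malformed lines
        accounts.append({"name": fields[0], "uid": int(fields[2]), "shell": fields[6]})
    return accounts


def review(accounts: list) -> dict:
    """Flag accounts that deserve a manual look during an access review."""
    return {
        "extra_uid0": [a["name"] for a in accounts
                       if a["uid"] == 0 and a["name"] != "root"],
        "interactive": [a["name"] for a in accounts
                        if a["shell"] not in NOLOGIN_SHELLS],
    }


if __name__ == "__main__":
    print(review(parse_passwd(SAMPLE_PASSWD)))
```

With centralized identity, the same review would run against the directory rather than local files.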

Question 10

Difficulty: hard

If a critical service fails to start after a configuration change, how would you respond?

Sample answer

I’d approach it like a controlled recovery problem. First, I would stop making changes and confirm exactly what changed, when it changed, and whether the failure is isolated to one server or repeated across the environment. Then I’d check the service logs, system journal, and configuration syntax to identify whether the issue is a typo, a missing dependency, a permission problem, or an invalid parameter. If I have a known good backup of the previous config, I’d compare the files and roll back if needed to restore service quickly. I’d also verify ownership, SELinux context, ports, and any environment variables that the service depends on. Once the service is back, I’d test the new configuration in a lower environment or with a safer validation process before trying again. I think the key is restoring availability first, then learning from the failure so it doesn’t happen again. That keeps both operations and accountability in balance.
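Comparing the failing configuration against a known good copy, as described above, can be done with the standard diff tool or with a few lines of code. The sketch below uses Python's difflib; the function names and the sample settings are hypothetical.

```python
import difflib


def config_diff(known_good: str, candidate: str) -> list:
    """Return a unified diff between two configuration texts, as lines."""
    return list(difflib.unified_diff(
        known_good.splitlines(),
        candidate.splitlines(),
        fromfile="known_good",
        tofile="candidate",
        lineterm="",
    ))


def changed_lines(known_good: str, candidate: str) -> list:
    """Return only the added and removed lines, without headers or context."""
    diff = config_diff(known_good, candidate)
    # The first two entries are the '---' and '+++' file headers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    good = "port 8080\nworkers 4\n"    # hypothetical previous config
    new = "port 8080\nworkers four\n"  # hypothetical change that broke startup
    for line in changed_lines(good, new):
        print(line)
```

Narrowing the output to changed lines keeps attention on what was actually edited, which is usually where a typo or invalid parameter lives.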