JOB DESCRIPTION:

Site Reliability Engineer

ABOUT LINGO

Lingo is building a cutting‑edge digital health platform that fuses continuous biosensor data, high‑performance backend engineering, and advanced analytics to help people live healthier, longer, fuller lives. Our systems process massive volumes of real‑time data, and maintaining the reliability, scalability, and security of our platform is mission‑critical to delivering value to our users.

THE OPPORTUNITY

We are looking for a Site Reliability Engineer (SRE) to join our Platform team and ensure Lingo’s biosensor platform runs reliably and efficiently at scale. You will be a key partner for Backend, Data, and Mobile teams, driving improvements across infrastructure, observability, incident management, and automation. Your goal is to enable high-velocity development with confidence, maintain multi-region uptime, and embed reliability practices across engineering. You’ll work in production Kubernetes environments, tune service meshes, evolve operational playbooks, and proactively prevent incidents through code, automation, and design.

WHAT YOU’LL DO

  • Establish and improve SLOs, SLIs, and SLAs across services; partner with engineering teams to embed reliability targets into product designs.
  • Build and evolve monitoring, alerting, and tracing systems to ensure rapid detection and resolution of issues.
  • Develop incident response processes, on-call rotations, and postmortem practices that drive continuous improvement.
  • Implement automation for deployment pipelines, failover, scaling, and capacity planning to reduce manual operations and error risk.
  • Champion security- and compliance-driven infrastructure, including secrets management, secure networking, and audit readiness.
  • Collaborate on disaster recovery strategies and resilience testing (chaos engineering, load testing, rolling updates, blue/green deployments).
  • Partner with developers to identify performance bottlenecks, optimize services, and reduce infrastructure costs.
  • Contribute to internal tooling and developer experience to accelerate safe delivery of features in production.

LINGO CULTURE

Customer-first, reliability-obsessed, and team-oriented. At Lingo, SREs are guardians of uptime, performance, and developer velocity. You’ll help us move fast without compromising trust or quality.

REQUIRED QUALIFICATIONS

  • 5+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles supporting distributed systems at scale.
  • Deep expertise with Kubernetes, container orchestration, and service meshes in production environments.
  • Strong skills in observability tooling (Prometheus, Grafana, OpenTelemetry, etc.) and incident management systems.
  • Experience designing HA/DR architectures, managing multi-region deployments, and optimizing for low-latency traffic flows.
  • Proficiency with cloud platforms (AWS/GCP/Azure) and infrastructure as code (Terraform, Helm).
  • A security and compliance mindset, with experience working in regulated environments (HIPAA/GDPR) and meeting audit requirements.
  • Excellent cross-functional communication and collaboration skills.

PREFERRED QUALIFICATIONS

  • Experience with streaming/messaging systems (Kafka, RabbitMQ) in production.
  • Background in digital health, IoT, or other mission-critical data platforms.
  • Familiarity with chaos engineering tools and cost optimization strategies for global cloud services.
  • Development experience in a modern backend language (Java, Kotlin, Go, Python) for tooling and automation.