Job Description
U.S. Work Authorization Requirement:
Candidates must be legally authorized to work in the United States without employer sponsorship. This includes, but is not limited to, U.S. Citizens, Permanent Residents, and other individuals with valid U.S. work authorization.
Job Description:
We are seeking a highly experienced Site Reliability Engineer (SRE) with a strong Java development background to lead reliability initiatives and ensure the stability, scalability, and performance of mission-critical systems. This role blends deep hands-on engineering with leadership, ownership, and a proactive approach to reliability and operations.
The ideal candidate has grown from a strong developer into an SRE/DevOps leader, understands production systems deeply, and partners effectively with development, platform, and operations teams.
Key Responsibilities:
- Design, build, and maintain highly reliable, scalable, and fault-tolerant systems in production environments.
- Embed reliability best practices (SLOs, SLIs, error budgets) into the software development lifecycle.
- Work closely with development teams on Java Spring Boot microservices to improve operability and resilience.
- Automate operational workflows to reduce manual effort and improve system efficiency.
- Monitor system health, performance, and availability; proactively identify risks and bottlenecks.
- Lead incident management, on-call support, and root cause analysis for production issues.
- Drive continuous improvement initiatives focused on availability, scalability, and performance.
- Support and oversee release and deployment activities, including after-hours support when required.
- Champion best practices around CI/CD, infrastructure as code, and cloud-native operations.
- Mentor engineers and provide technical leadership across SRE and development teams.
- Collaborate with stakeholders to align reliability goals with business priorities.
Required Qualifications:
- 12+ years of IT experience in SRE, DevOps, or Production Engineering.
- Strong Java development experience (Java 17+, Spring Boot microservices, Spring Web).
- Hands-on experience with OpenShift (OCP), Kubernetes, and Docker.
- Strong expertise in MongoDB (data modeling, design, optimization).
- Experience with Apache Kafka and event-driven architectures.
- Working knowledge of Oracle Database.
- Familiarity with behavior-driven development (BDD) practices.
- Solid experience with CI/CD, automation, and infrastructure as code (Terraform, Ansible).
- Exposure to AI-assisted development tools (e.g., GitHub Copilot).
- Excellent troubleshooting skills in high-pressure production environments.
- Strong communication and collaboration skills, and an ownership mindset.
Preferred Qualifications:
- Experience with monitoring and observability tools such as Prometheus, Grafana, and the ELK stack.
- Knowledge of security best practices, compliance standards, and production hardening.
- Prior experience leading or mentoring SRE teams or guiding engineers in reliability practices.