
Site Reliability Engineer (LInE)
micro1 | Full-time | Remote | Mid-level
Job Description
Job Title: Site Reliability Engineer
Job Type: Contractor
Location: Remote
Job Summary:
Join our customer's team as an expert Site Reliability Engineer and play a pivotal role in ensuring the performance, reliability, and scalability of mission-critical infrastructure. You'll leverage your deep expertise in Linux, Kubernetes, and Prometheus to architect, monitor, and enhance robust systems supporting innovative applications.
Key Responsibilities:
- Design, implement, and maintain scalable infrastructure using Linux, Kubernetes, and Prometheus.
- Monitor system health, analyze performance metrics, and proactively address bottlenecks or potential failures.
- Automate operational processes to minimize manual intervention and increase system reliability.
- Respond swiftly to incidents, conduct root cause analysis, and drive continuous improvements in incident response procedures.
- Collaborate closely with development and operations teams to deliver seamless deployments and high system availability.
- Create comprehensive documentation and clear runbooks for operational excellence and knowledge sharing.
- Champion best practices in SRE, security, and compliance across the customer's ecosystem.
Required Skills and Qualifications:
- Expert-level hands-on experience with Linux system administration and troubleshooting.
- Advanced proficiency with Kubernetes, including cluster deployment, operations, and management.
- Deep knowledge of Prometheus for monitoring, metrics collection, and alerting.
- Strong scripting abilities (Bash, Python, or similar) for automation and tooling.
- Excellent written and verbal communication skills, with the ability to document and share knowledge effectively.
- Proven track record in site reliability engineering or similar roles in high-availability environments.
- Demonstrated commitment to proactive problem-solving and collaborative teamwork.
Preferred Qualifications:
- Experience with other cloud-native tools (e.g., Grafana, Helm, or Istio).
- Certifications in Kubernetes, Linux, or cloud platforms.
- Background in high-growth or large-scale production environments.