Site Reliability Engineer
Keep our multi-chain infrastructure reliable, fast, and scalable as we grow to 500k+ developers.
About the role
You'll ensure Tokra's infrastructure stays reliable, fast, and scalable as we grow from 50k to 500k+ developers. At a blockchain API company, uptime is everything: you'll own the observability, incident response, and infrastructure automation that keep our systems running 24/7.
What you’ll be doing
Build monitoring and alerting systems across our multi-chain node infrastructure
Optimize blockchain node performance and reliability (Geth, Erigon, archive nodes)
Design and implement disaster recovery and high availability systems
Automate infrastructure provisioning and deployment using Terraform and Kubernetes
Participate in the on-call rotation and lead incident response when things break
Improve system observability with metrics, logs, and distributed tracing
Work with the backend team to identify and resolve performance bottlenecks
About you
5+ years in SRE, DevOps, or infrastructure engineering roles
Expert with Kubernetes, Docker, and container orchestration at scale
Strong experience with AWS (EC2, RDS, S3, CloudWatch) or GCP
Proficient in Infrastructure as Code (Terraform, Ansible)
Comfortable writing automation and tooling in Go or Python
Experience with observability tools (Prometheus, Grafana, Datadog)
You thrive during incidents and can debug production issues under pressure
Benefits
$170k-$230k + equity
Fully remote with flexible hours
$3k/year learning and certification budget
Premium health insurance with dental and vision
4 weeks PTO + paid sick leave
Latest MacBook Pro + home office budget
On-call compensation + quarterly team offsites
Infrastructure
San Francisco
Remote
Full-time
$170k–$230k