Steven Tan — Senior DevOps Engineer

Practice

A tighter view of how I work: clearer systems, better visibility, and infrastructure that feels calm under pressure.

What I Enjoy Building

Calm systems behind fast-moving teams.

I like making complex engineering environments feel easier to operate, easier to debug, and easier to ship with confidence.

Observability

Make the truth visible.

Tracing, logs, metrics, sampling strategy, dashboards, and workflows that help engineers answer questions faster.

Platform

Reduce friction.

Infrastructure, automation, and deployment paths that feel dependable instead of fragile.

AI Workflows

Use tools with intention.

Adopt AI where it meaningfully speeds up engineering work, documentation, debugging, and iteration.

Incidents

Design for recovery.

Operational playbooks, cleaner alerts, and response loops that help teams stay steady under pressure.

Capabilities

Languages

Python
Golang
Bash
GraphQL
PowerShell
Java
C
JavaScript
SQL
NoSQL

Tools

Cursor
Terraform
Kubernetes
Nomad
GitHub Actions
Pulumi
Docker
Ansible
Puppet
MySQL

Observability

OpenTelemetry
Honeycomb
ELK
New Relic

Cloud

AWS
GCP
EC2
ECS
Lambda
S3
RDS
CloudWatch
Cloud SQL
Cloud Functions

Experience

A timeline of platform, reliability, and DevOps work across fast-moving product and infrastructure teams.

2024 — Now

Observability Platform

Senior DevOps Engineer · Scribe · Fully Remote · June 2024 – Current

Led company-wide reliability and observability improvements, while shaping deployment governance and pragmatic AI adoption across engineering workflows.

OpenTelemetry · Honeycomb · Terraform · AWS DMS

Led the "Debug Excellence" initiative, a company-wide observability migration from New Relic to OpenTelemetry and Honeycomb: built full OTel collector infrastructure (traces, logs, metrics), Cloudflare log ingestion, and AWS CloudWatch and pganalyze integrations; deployed Refinery for tail-based trace sampling; delivered demos and training. The work resulted in a published Honeycomb customer case study.
Established SonarQube code quality program from scratch: static analysis in CI/CD across 3 projects, hybrid quality gates for PR review, optimized test coverage reporting, documentation and workflows adopted by the full engineering team.
Designed and deployed a production data replication pipeline from Aurora PostgreSQL to Snowflake using AWS DMS with CDC; built Terraform modules for DMS Serverless, S3, Snowpipe, and monitoring dashboards, enabling analytics on production data without impacting the primary database.
Adopted AI engineering tools early (Cursor, LLMs, MCPs) to accelerate development; mentor engineers on using these tools.
Led incident management overhaul and infrastructure CI/CD governance: PagerDuty–Honeycomb integration, automated Slack incident channels; observability in GitHub workflows for deployments, PRs, and actions.

2020 — 2024

SRE Scale

Senior Site Reliability Engineer · Hasura · Fully Remote · Dec 2020 – June 2024

Built observability and automation systems that improved scale handling, customer support responsiveness, and operational confidence for a fast-growing platform company.

Honeycomb · Vector · Telegraf · GitHub Actions

Implemented Honeycomb.io as the main logging and monitoring tool: shipped application, system, and gateway logs at scale via Vector; shipped infrastructure metrics via Telegraf; created alerts, dashboards, and SLOs.
Automated deployments using GitHub Actions and a GraphQL API backend; remediated security issues through automation across a wide range of infrastructure.
Added autoscaling based on traffic per unique customer; assisted customer support with paid/enterprise issue resolution.

2020

Advisory DevOps

Customer Architect · Chef Software · Fully Remote · June 2020 – Dec 2020

Translated DevOps tooling into business outcomes for customers by guiding automation, modernization, and security decisions.

Chef · Pipelines · Architecture

Provided technical expertise on Chef products to help customers meet business goals: faster time to market via deployment pipeline automation, architecture modernization, and security throughout the deployment process.

2018 — 2020

Migration Operations

Site Reliability Engineer · SAP Concur · Bellevue, WA · Dec 2018 – June 2020

Helped move legacy operational systems toward AWS and modern deployment practices while improving monitoring and on-call resilience.

AWS · Ansible · Jenkins · New Relic

Led design to move on-prem infrastructure code to AWS: restructured designs, converted Puppet to Ansible; collaborated across teams and time zones.
Improved deployment pipelines (Git, Jenkins, Docker); wrote an OS-agnostic New Relic custom plugin for certificate expiry monitoring across hundreds of servers.
Maintained and improved deployment scripts (Jenkins, Puppet, Bash, PowerShell); added monitoring to legacy services (Filebeat, New Relic, Kibana); on-call and root-cause analysis for 24/7 SLA.

2016 — 2018

Logging Automation

Software Engineer (Infrastructure) · Irdeto · North Hollywood, CA · Dec 2016 – Nov 2018

Built the operational foundation early on: centralized logging, infrastructure automation, and container workflows for fast-growing internal systems.

Docker · ELK · Python · Slackbot

Redesigned centralized logging for 10× traffic growth; containerized microservices with Docker, ECS, and Docker Swarm.
Automated legacy ops with Python and Bash (cutting execution time from 30 minutes to seconds); built centralized monitoring and logging (ELK, Splunk, Nagios); developed Slackbot automation for deployments and alerting.