Ana Medina on Chaos Engineering, Game Days, and Learning

02 Oct 2024

Gremlin's Status Checks

  • Gremlin, where Medina works as a Senior Chaos Engineer, has launched a feature called status checks, which verifies the health of a system before a chaos experiment runs. (2m46s)
  • Status checks integrate with tools such as Datadog, New Relic, and PagerDuty, and users can also build their own against API endpoints. (3m50s)
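
The idea of a pre-experiment status check can be sketched as a small gate: probe a health endpoint, and only start the experiment if the response looks healthy. This is a minimal illustration, not Gremlin's API; the `StatusCheck` class, the `probe` and `run_if_healthy` helpers, and the latency threshold are all hypothetical names and values chosen for the example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StatusCheck:
    """Hypothetical pre-experiment gate (not Gremlin's actual API)."""
    url: str                      # health endpoint to probe (assumed)
    expected_status: int = 200    # HTTP status that counts as healthy
    max_latency_s: float = 1.0    # slowest response still considered healthy

    def passes(self, status: int, latency_s: float) -> bool:
        """Healthy only if both the status code and the latency are acceptable."""
        return status == self.expected_status and latency_s <= self.max_latency_s


def probe(url: str, timeout_s: float = 5.0) -> Tuple[int, float]:
    """Return (HTTP status, elapsed seconds); status 0 means unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code         # 4xx/5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        status = 0                # DNS failure, refused connection, timeout
    return status, time.monotonic() - start


def run_if_healthy(
    check: StatusCheck,
    experiment: Callable[[], None],
    probe_fn: Callable[[str], Tuple[int, float]] = probe,
) -> bool:
    """Run `experiment` only when the status check passes; report whether it ran."""
    status, latency_s = probe_fn(check.url)
    if not check.passes(status, latency_s):
        return False              # system already unhealthy: do not add chaos
    experiment()
    return True
```

Passing the probe in as `probe_fn` keeps the gate testable without a network, and leaves room to swap in a monitoring tool's own health signal in place of a raw HTTP request.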

Impact of Complex Systems

  • Complex systems are impacted by many factors, including world events like pandemics. (6m17s)
  • The pandemic highlighted the difference between organizations that were prepared for high traffic and those that were not. (7m30s)

Game Days and Chaos Engineering Workshops

  • Gremlin has resources for running game days, but a fully developed remote game day runbook has not yet been created. (10m33s)
  • Successful virtual game days can be run with proper planning, communication, and collaboration tools like Zoom and Google Docs. (11m40s)
  • Assigning specific roles, such as commander, note-taker, observer, and tester, helps participants focus on their tasks and contributes to a more successful game day experience. (12m30s)
  • Gremlin's chaos engineering workshops incorporate hands-on experiments in a cloud infrastructure environment, using Kubernetes, monitoring tools, and a microservice demo environment, to provide practical experience. (13m14s)

Benefits of Chaos Engineering

  • Chaos engineering can reveal inaccuracies in architecture diagrams by demonstrating how an entire application can break down when traffic to a single service or container is blocked. (16m17s)
  • When adopting chaos engineering, prioritize testing critical, high-impact services (tier zero and tier one) to maximize the return on investment. (18m26s)
  • Past incidents, documented in a blameless postmortem format, provide valuable insights for chaos engineering experiments by highlighting system vulnerabilities and areas for improvement. (19m9s)
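
Two of the points above lend themselves to a small sketch: estimating which services an architecture diagram predicts will break when one service is blocked, and ordering experiment targets by tier. The service names, the `depends_on` map, and both helper functions are hypothetical, and the model assumes every dependency is a hard one with no fallback. A chaos experiment is what shows whether the real system agrees with the prediction.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], blocked: str) -> Set[str]:
    """Return every service that directly or transitively depends on `blocked`.

    `depends_on` maps a service to the services it calls, as drawn in an
    architecture diagram. All dependencies are assumed to be hard.
    """
    # Invert the edges so we can walk from a dependency to its callers.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected: Set[str] = set()
    queue = deque([blocked])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller != blocked and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


def prioritize(tiers: Dict[str, int]) -> List[str]:
    """Order services for testing: tier 0 first, then tier 1, and so on."""
    return sorted(tiers, key=lambda service: (tiers[service], service))


# Hypothetical microservice demo topology.
DEPENDS_ON = {
    "frontend": ["cart", "catalog"],
    "checkout": ["cart", "payments"],
    "cart": ["redis"],
    "catalog": [],
}
```

With this topology, `blast_radius(DEPENDS_ON, "redis")` predicts that `cart`, `frontend`, and `checkout` are affected. If blocking traffic to `redis` in a real experiment takes down something outside that set, the diagram is missing a dependency.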

Importance of Training and Ethics

  • There is a lack of focus on training and ethics in software engineering despite the potential for technology to cause harm. (22m41s)
  • Organizations should ideally begin planning 3-6 months in advance for important dates like Cyber Monday to ensure system resilience. (24m51s)
  • Code freezes are a warning sign that things need to change and that teams may not be equipped to handle changes during incident-heavy periods. (26m23s)

Gremlin's Resources

  • Gremlin offers free monthly training courses on chaos engineering, including fundamentals and automation. (27m31s)
  • Gremlin's Chaos Conf will be held virtually on October 6-8, featuring tracks on reliability practices, completing the DevOps loop, and data-driven reliability culture. (28m10s)
  • The best way to contact Ana Medina is through her Twitter handle, @Ana_M_Medina. (29m23s)
