{"id":1257,"date":"2024-09-24T05:02:21","date_gmt":"2024-09-24T05:02:21","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2024\/09\/24\/how-we-improved-availability-through-iterative-simplification\/"},"modified":"2024-09-24T05:02:21","modified_gmt":"2024-09-24T05:02:21","slug":"how-we-improved-availability-through-iterative-simplification","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2024\/09\/24\/how-we-improved-availability-through-iterative-simplification\/","title":{"rendered":"How we improved availability through iterative simplification"},"content":{"rendered":"<p>Solving and staying ahead of problems when scaling up a system of GitHub\u2019s size is a delicate process. The stack is complex, and even small changes can have a big ripple effect. Here\u2019s a look at some of the tools in GitHub\u2019s toolbox, and how we\u2019ve used them to solve problems. We\u2019ll also share some of our wins and lessons we learned along the way.<\/p>\n<h2>Methods and tools<a href=\"https:\/\/github.blog\/engineering\/#methods-and-tools\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>There are several tools that we use to keep pace with our growing system. While we can\u2019t list them all, here are some that have been instrumental for our growth.<\/p>\n<p>As we serve requests, there is a constant stream of related numbers that we care about. For example, we might want to know how often events are happening or how traffic levels compare to expected use. We can record metrics for each event in Datadog to see patterns over time and break them down across different dimensions, identifying areas that need focus.<br \/>\nEvents also contain context that can help identify details for issues we\u2019re troubleshooting. 
We send all this context to Splunk for further analysis.<br \/>\nMuch of our application data is stored in MySQL, and query performance can degrade over time due to factors like database size and query frequency. We have written custom monitors that detect and report slow and timed-out queries for further investigation and remediation.<br \/>\nWhen we introduce changes, we often need to know how those changes affect performance. We use <a href=\"https:\/\/github.com\/github\/scientist\">Scientist<\/a> to test proposed changes. With this tool, we measure and report results before making the changes permanent.<br \/>\nWhen we\u2019re ready to release a change, we roll it out incrementally to ensure it works as expected for all use cases. We also need to be able to roll back in the event of unexpected behavior. We use <a href=\"https:\/\/github.com\/flippercloud\/flipper\">Flipper<\/a> to limit the rollout to early access users, then to an increasing percentage of users as we build confidence.<\/p>\n<h2>Achieving faster database queries<a href=\"https:\/\/github.blog\/engineering\/#achieving-faster-database-queries\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>We recently observed a SQL query causing a high number of timeouts. Our investigation in Splunk tracked it down to GitHub\u2019s Command Palette feature, which was loading a list of repositories. The code to generate that list looked something like this:<\/p>\n<pre><code>org_repo_ids = Repository.where(owner: org).pluck(:id)\nsuggested_repo_ids = Contribution.where(user: viewer, repository_id: org_repo_ids).pluck(:repository_id)<\/code><\/pre>\n<p>If an org has many active repositories, the second line could generate a SQL query with a large IN (&#8230;) clause, increasing the risk of timing out. While we\u2019d seen this type of problem before, there was something unique about this particular use case. 
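<\/p>\n<p>Before settling on a fix, we can compare two implementations side by side. As a simplified, self-contained illustration of the control\/candidate pattern Scientist gives us (plain Ruby sketch with a hypothetical helper, not the gem itself):<\/p>

```ruby
require 'benchmark'

# Minimal control/candidate runner in the spirit of Scientist
# (an illustration only; the real gem publishes results for analysis).
def run_experiment(name, control:, candidate:)
  control_value = nil
  candidate_value = nil
  control_time = Benchmark.realtime { control_value = control.call }
  candidate_time = Benchmark.realtime { candidate_value = candidate.call }
  # (control_time and candidate_time would be emitted as metrics to Datadog)

  warn name + ': results mismatched' unless control_value == candidate_value

  control_value # the experiment always returns the control result
end

# Hypothetical usage: two ways of computing the same list of ids.
ids = run_experiment('suggested-repos',
                     control:   -> { [3, 1, 2].sort },
                     candidate: -> { [1, 2, 3] })
```

<p>Because the experiment always returns the control result, user-facing behavior is unchanged while we gather data on the candidate.<\/p>
<p>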
We might be able to improve performance by querying the user first, since a given user contributes to a relatively small number of repositories.<\/p>\n<pre><code>contributor_repo_ids = Contribution.where(user: viewer).pluck(:repository_id)\nsuggested_repo_ids = Repository.where(owner: org, id: contributor_repo_ids).pluck(:id)<\/code><\/pre>\n<p>We created a Scientist experiment with a new candidate code block to evaluate performance. The Datadog dashboard for the experiment confirmed two things: the candidate code block returned the same results and improved performance by 80-90%.<\/p>\n<p>We also did a deeper dive into the queries this feature was generating and found a couple of possible additional improvements.<\/p>\n<p>The first involved eliminating a SQL query and sorting results in the application rather than asking the SQL server to sort. We followed the same process with a new experiment and found that the candidate code block performed 40-80% worse than the control. We removed the candidate code block and ended the experiment.<\/p>\n<p>The second was a query that filtered results based on the viewer\u2019s level of access by iterating through the list of results. The access check we needed could be batched, so we started another experiment to do the filtering with a single batched query and confirmed that the candidate code block improved performance by another 20-80%.<\/p>\n<p>While we were wrapping up these experiments, we checked for similar patterns in related code and found a similar filter we could batch. 
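<\/p>\n<p>The batching idea can be sketched in plain Ruby (a hypothetical access checker; the real checks are SQL permission queries, simulated here with a round-trip counter):<\/p>

```ruby
# Hypothetical stand-in for a permission store; each method call
# counts as one round trip to the database.
class AccessChecker
  attr_reader :queries

  def initialize(allowed_ids)
    @allowed_ids = allowed_ids
    @queries = 0
  end

  # Per-item check: one query for every repository id.
  def can_see?(id)
    @queries += 1
    @allowed_ids.include?(id)
  end

  # Batched check: one query for the whole list of ids.
  def visible_ids(ids)
    @queries += 1
    ids.select { |id| @allowed_ids.include?(id) }
  end
end

repo_ids = [1, 2, 3, 4]

per_item = AccessChecker.new([1, 3])
filtered = repo_ids.select { |id| per_item.can_see?(id) } # 4 round trips

batched = AccessChecker.new([1, 3])
filtered_batched = batched.visible_ids(repo_ids)          # 1 round trip
```

<p>Both versions return the same repositories, but the batched version issues a single query regardless of list size.<\/p>
<p>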
We confirmed a <strong>30-40% performance improvement<\/strong> with a final experiment, and left the feature in a better place, making our developers, database administrators, and users happier.<\/p>\n<h2>Removing unused code<a href=\"https:\/\/github.blog\/engineering\/#removing-unused-code\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>While our tooling does surface problem areas to focus on, it\u2019s preferable to get ahead of performance issues and fix problematic areas before they cause a degraded experience. We recently analyzed the busiest request endpoints for one of our teams and found room to improve one of them before it escalated to an urgent problem.<\/p>\n<p>Data for each request to the GitHub Rails application is logged in Splunk and tagged with the associated controller and action. We started by querying Splunk for the top 10 controller\/action pairs in the endpoints owned by the team. We used that list to create a Datadog dashboard with a set of graphs for each controller\/action that showed the total request volume, average and P99 request latency, and max request latency. We found that the busiest endpoint on the dashboard was an action responsible for a simple redirect, and that performance regularly degraded to the timeout threshold.<\/p>\n<p>We needed to know what was slowing these requests down, so we dug into Datadog\u2019s APM feature to view requests for the problematic controller\/action. We sorted those requests by elapsed request time to see the slowest requests first. We identified a pattern where slow requests spent a long time performing an access check that wasn\u2019t required to send the redirect response.<\/p>\n<p>Most requests to the GitHub Rails application generate HTML responses where we need to be careful to ensure that all data in the response is accessible to the viewer. 
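<\/p>\n<p>The shape of these pre-response checks can be sketched in plain Ruby (hypothetical stand-ins for controller filters and a feature flag, not the real controllers):<\/p>

```ruby
# Plain-Ruby stand-in for a controller whose access checks run before
# the response is built; the skip flag plays the role of a feature flag.
class MiniController
  attr_reader :checks_run

  def initialize(skip_access_checks: false)
    @skip_access_checks = skip_access_checks
    @checks_run = []
  end

  def handle(action)
    verify_access unless @skip_access_checks
    action == :redirect ? 302 : 200
  end

  private

  def verify_access
    @checks_run.push(:verify_access) # a real filter would query permissions
  end
end

default = MiniController.new
default.handle(:redirect)                 # still pays for the access check

flagged = MiniController.new(skip_access_checks: true)
flagged.handle(:redirect)                 # serves the redirect without it
```

<p>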
We\u2019re able to simplify the code involved by using shared Rails controller filters, which run before the server renders a response, to verify that the viewer is allowed to see the resources they\u2019re requesting. These checks aren\u2019t required for the redirect, so we wanted to confirm we could serve those requests using a different set of filters and that this approach would improve performance.<\/p>\n<p>Since Rails controller filters are configured when the application boots rather than when each request is processed, we weren\u2019t able to use a Scientist experiment to test a candidate code block. However, filters can be configured to run conditionally, which enabled us to use a Flipper feature flag to change behavior. We identified the set of filters that weren\u2019t required for the redirect, and configured the controller to skip those filters when the feature flag was enabled. The feature flag controls let us ramp up this behavior while monitoring both performance and request status via Datadog and keeping watch for unexpected problems via Splunk.<\/p>\n<p>After confirming that P75\/P99 request latency improved and, more importantly, that max latency became more consistent and much less likely to time out, we graduated the feature and generalized the behavior so other similar controllers can use it.<\/p>\n<h2>What did we learn?<a href=\"https:\/\/github.blog\/engineering\/#what-did-we-learn\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>There are several lessons we learned throughout this process. Here are some of the main points we keep in mind.<\/p>\n<p>The investment in observability is totally worth it! 
We identified and solved problems quickly because of the metric and log information we track.<br \/>\nEven when you\u2019re troubleshooting a problem that\u2019s been traditionally difficult to solve, the use case may be subtly different in a way that presents a new solution.<br \/>\nWhen you\u2019re working on a fix, look around at adjacent code. There may be related issues you can tackle while you\u2019re there.<br \/>\nPerformance problems are a moving target. Keeping an eye open for the next one helps you fix it when it\u2019s gotten slow rather than when it starts causing timeouts and breaking things.<br \/>\nMake small changes in ways that you can control with a gradual rollout and measure results.<\/p>\n<p>The post <a href=\"https:\/\/github.blog\/engineering\/engineering-principles\/how-we-improved-availability-through-iterative-simplification\/\">How we improved availability through iterative simplification<\/a> appeared first on <a href=\"https:\/\/github.blog\/\">The GitHub Blog<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Solving and staying ahead of problems when scaling up a system of GitHub\u2019s size is a delicate process. 
The stack [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[8],"tags":[],"class_list":["post-1257","post","type-post","status-publish","format-standard","hentry","category-github-engineering"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/1257","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=1257"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/1257\/revisions"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=1257"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=1257"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=1257"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}