{"id":908,"date":"2024-06-20T11:13:30","date_gmt":"2024-06-20T11:13:30","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2024\/06\/20\/how-we-improved-push-processing-on-github\/"},"modified":"2024-06-20T11:13:30","modified_gmt":"2024-06-20T11:13:30","slug":"how-we-improved-push-processing-on-github","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2024\/06\/20\/how-we-improved-push-processing-on-github\/","title":{"rendered":"How we improved push processing on GitHub"},"content":{"rendered":"<p>What happens when you <a href=\"https:\/\/github.blog\/2024-06-10-top-12-git-commands-every-developer-must-know\/\">push to GitHub<\/a>? The answer, \u201cMy repository gets my changes\u201d or maybe, \u201cThe refs on my remote get updated\u201d is pretty much right\u2014and that is a really important thing that happens, but there\u2019s a whole lot more that goes on after that. To name a few examples:<\/p>\n<p>Pull requests are synchronized, meaning the diff and commits in your pull request reflect your newly pushed changes.<br \/>\nPush webhooks are dispatched.<br \/>\nWorkflows are triggered.<br \/>\nIf you push an app configuration file (like for Dependabot or GitHub Actions), the app is automatically installed on your repository.<br \/>\nGitHub Pages are published.<br \/>\nCodespaces configuration is updated.<br \/>\nAnd much, much more.<\/p>\n<p>Those are some pretty important things, and this is just a sample of what goes on for every push. In fact, in the GitHub monolith, there are over 60 different pieces of logic owned by 20 different services that run in direct response to a push. That\u2019s actually really cool\u2014we <em>should<\/em> be doing a bunch of interesting things when code gets pushed to GitHub. 
In some sense, that\u2019s a big part of what GitHub is, the place you push code<a href=\"https:\/\/github.blog\/category\/engineering\/#fn-78344-1\" class=\"jetpack-footnote\" title=\"Read footnote.\">1<\/a> and then cool stuff happens.<\/p>\n<h2>The problem<a href=\"https:\/\/github.blog\/category\/engineering\/#the-problem\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>What\u2019s not so cool is that, up until recently, all of these things were the responsibility of a single, enormous background job. Whenever GitHub\u2019s Ruby on Rails monolith was notified of a push, it enqueued a massive job called the RepositoryPushJob. This job was the home for all push processing logic, and its size and complexity led to many problems. The job triggered one thing after another in a long, sequential series of steps, kind of like this:<\/p>\n<p><a href=\"https:\/\/github.blog\/wp-content\/uploads\/2024\/06\/push-processing-original.png\">Diagram: the original RepositoryPushJob, running every push processing task in one long sequential chain<\/a><\/p>\n<p>There are a few things wrong with this picture. Let\u2019s highlight some of them:<\/p>\n<p><strong>This job was huge and hard to retry.<\/strong> The size of the RepositoryPushJob made it very difficult for different push processing tasks to be retried correctly. On a retry, all the logic of the job is repeated from the beginning, which is not always appropriate for individual tasks. For example:<\/p>\n<p>Writing Push records to the database can be retried liberally on errors and reattempted at any time after the push, and will gracefully handle duplicate data.<br \/>\nSending push webhooks, on the other hand, is much more time-sensitive and should not be reattempted too long after the push has occurred. It is also not desirable to dispatch multiples of the <em>same<\/em> webhook. 
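<\/p>
<p>The tension above can be made concrete with a small sketch. This is a hypothetical illustration in the spirit of the monolith\u2019s Ruby, not GitHub\u2019s actual code: a single sequential job, retried as a whole, re-runs every step from the beginning, including the steps that already succeeded:<\/p>

```ruby
# Hypothetical sketch: one big sequential job, retried as a whole.
# Step names are illustrative, not GitHub's real implementation.
class SequentialPushJob
  attr_reader :executions

  def initialize(steps)
    @steps = steps
    @executions = Hash.new(0) # counts how many times each step ran
  end

  # A whole-job retry repeats every step from the top: harmless for an
  # idempotent step, but it duplicates time-sensitive work like webhooks.
  def run_with_retries(max_attempts)
    max_attempts.times do
      @steps.each do |name, step|
        @executions[name] += 1
        step.call
      end
      return true
    rescue StandardError
      next # retry the entire job from the beginning
    end
    false
  end
end

tries = 0
job = SequentialPushJob.new(
  write_push_records: -> {},  # idempotent: safe to repeat
  dispatch_webhooks:  -> {},  # duplicated on every whole-job retry!
  sync_pull_requests: -> { tries += 1; raise 'flaky' if tries < 3 }
)
succeeded = job.run_with_retries(5)
```

<p>By the time the flaky final step succeeds on its third attempt, the webhook step has already run three times, which is exactly why whole-job retries were avoided in practice.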
<\/p>\n<p><strong>Most of these steps were never retried at all.<\/strong> These conflicting retry concerns ultimately meant that retries of the RepositoryPushJob were avoided in most cases. To prevent one step from killing the entire job, however, much of the push handling logic was wrapped in code that caught any and all errors. This lack of retries led to issues where crucial pieces of push processing simply never occurred.<br \/>\n<strong>Tight coupling of many concerns created a huge blast radius for problems.<\/strong> While most of the dozens of tasks in this job rescued all errors, a few pieces of work at the beginning of the job, for historical reasons, did not. This meant that all of the later steps had an implicit dependency on the initial parts of the job. As more concerns are combined within the same job, the likelihood of errors impacting the entire job increases.<\/p>\n<p>For example, writing data to our Pushes MySQL cluster occurred at the beginning of the RepositoryPushJob. This meant that everything running afterward had an implicit dependency on that cluster. This structure led to incidents where errors from this database cluster meant that user pull requests were not synchronized, even though pull requests have no explicit need to connect to this cluster.<\/p>\n<p><strong>A super long sequential process is bad for latency.<\/strong> It\u2019s fine for the first few steps, but what about the things that happen last? They have to wait for every other piece of logic to run before they get a chance. In some cases, this structure led to a second or more of unnecessary latency for user-facing push tasks, including pull request synchronization. <\/p>\n<h2>What did we do about this?<a href=\"https:\/\/github.blog\/category\/engineering\/#what-did-we-do-about-this\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>At a high level, we took this very long sequential process and decoupled it into many isolated, parallel processes. 
We used the following approach:<\/p>\n<p>We added a new Kafka topic to which we publish an event for each push.<br \/>\nWe examined each of the many push processing tasks and grouped them by owning service and\/or logical relationships (for example, order dependency, retry-ability).<br \/>\nFor each coherent group of tasks, we placed them into a new background job with a clear owner and appropriate retry configuration.<br \/>\nFinally, we configured these jobs to be enqueued for each publish of the new Kafka event.<\/p>\n<p>To do this, we used an internal system at GitHub that facilitates enqueueing background jobs in response to Kafka events via independent consumers. <\/p>\n<p>We had to make investments in several areas to support this architecture, including:<\/p>\n<p>Creating a reliable publisher for our Kafka event: one that retries until broker acknowledgement.<br \/>\nSetting up a dedicated pool of job workers to handle the new job queues we\u2019d need for this level of fan-out.<br \/>\nImproving observability to ensure we could carefully monitor the flow of push events throughout this pipeline and detect any bottlenecks or problems.<br \/>\nDevising a system for consistent per-event feature flagging, to ensure that we could gradually roll out (and roll back if needed) the new system without risk of data loss or double processing of events between the old and new pipelines. 
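<\/p>
<p>As a sketch of the first of these investments, here is a minimal, hypothetical reliable publisher that retries with exponential backoff until the broker acknowledges the event. The broker class is a stand-in for a real Kafka producer, and all names are illustrative rather than GitHub\u2019s actual code:<\/p>

```ruby
# Hypothetical sketch: keep retrying a publish until the broker acks.
class ReliablePushPublisher
  def initialize(broker, max_attempts: 10, base_delay: 0.001)
    @broker = broker
    @max_attempts = max_attempts
    @base_delay = base_delay
  end

  # Returns the attempt number that was acknowledged; raises if the
  # broker never acks, so a push event is never silently dropped.
  def publish(event)
    @max_attempts.times do |attempt|
      return attempt + 1 if @broker.publish(event) # truthy return = acked
      sleep(@base_delay * (2**attempt))            # exponential backoff
    end
    raise 'push event lost: no broker acknowledgement'
  end
end

# Stand-in broker that fails to acknowledge the first two attempts.
class FlakyBroker
  def initialize(failures)
    @failures = failures
  end

  def publish(_event)
    (@failures -= 1) < 0
  end
end

attempts = ReliablePushPublisher.new(FlakyBroker.new(2)).publish(ref: 'refs/heads/main')
```

<p>In the real pipeline the acknowledgement comes from the Kafka brokers, and the retry budget and backoff would be tuned to the delivery guarantees the downstream consumers need.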
<\/p>\n<p>Now, things look like this:<\/p>\n<p><a href=\"https:\/\/github.blog\/wp-content\/uploads\/2024\/06\/push-processing-improved.png\">Diagram: a push now publishes a Kafka event that independent consumers fan out to many isolated, parallel jobs<\/a><\/p>\n<p>A push triggers a Kafka event, which is fanned out via independent consumers to many isolated jobs that can process the event without worrying about any <em>other<\/em> consumers.<\/p>\n<h2>Results<a href=\"https:\/\/github.blog\/category\/engineering\/#results\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p><strong>A smaller blast radius for problems.<\/strong><\/p>\n<p>This can be clearly seen from the diagram. Previously, an issue with a single step in the very long push handling process could impact everything downstream. Now, issues with one piece of push handling logic can no longer take down much else.<br \/>\nStructurally, this decreases the risk of dependencies. For example, around 300 million push processing operations executed per day in the new pipeline previously had an implicit dependency on the Pushes MySQL cluster and now have no such dependency, simply as a product of being moved into isolated processes.<br \/>\nDecoupling also means better ownership. In splitting up these jobs, we distributed ownership of the push processing code from one owning team to 15+ more appropriate service owners. New push functionality in our monolith can be added and iterated on by the owning team without unintentional impact on other teams.<\/p>\n<p><strong>Pushes are processed with lower latency.<\/strong><\/p>\n<p>By running these jobs in parallel, no push processing task has to wait for others to complete. 
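<\/p>
<p>The isolation can be sketched in a few lines, again hypothetically rather than as GitHub\u2019s actual code: each handler runs inside its own error boundary, so handlers neither wait on one another\u2019s success nor fail together:<\/p>

```ruby
# Hypothetical sketch: independent consumers of one push event.
# Each handler is isolated; one failure cannot take down the rest.
def fan_out(event, handlers)
  handlers.map do |name, handler|
    result = begin
      handler.call(event)
      :ok
    rescue StandardError => e
      e.message # failure is recorded, but stays contained
    end
    [name, result]
  end.to_h
end

push_event = { repo: 'octo/widgets', ref: 'refs/heads/main' }
results = fan_out(push_event,
  sync_pull_requests: ->(_e) {},
  dispatch_webhooks:  ->(_e) { raise 'webhook endpoint down' },
  trigger_workflows:  ->(_e) {}
)
```

<p>In production each handler corresponds to a separate background job on its own queue with its own retry policy, rather than an in-process lambda; the sketch only shows the isolation boundary.<\/p>
<p>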
This means better latency for just about everything that happens on push.<br \/>\nFor example, we can see a notable decrease in pull request sync time:<\/p>\n<p><a href=\"https:\/\/github.blog\/wp-content\/uploads\/2024\/06\/pull-request-sync-time.png\">Chart: pull request sync time dropped noticeably after the new pipeline rolled out<\/a><\/p>\n<p><strong>Improved observability.<\/strong><\/p>\n<p>By breaking things up into smaller jobs, we get a much clearer picture of what\u2019s going on with each job. This lets us set up observability and monitoring that is much more finely scoped than anything we had before, and helps us to quickly pinpoint any problems with pushes.<\/p>\n<p><strong>Pushes are more reliably processed.<\/strong><\/p>\n<p>By reducing the size and complexity of the jobs that process pushes, we are able to retry <em>more<\/em> things than in the previous system. Each job can have retry configuration that\u2019s appropriate for its own small set of concerns, without having to worry about re-executing other, unrelated logic on retry.<br \/>\nIf we define a \u201cfully processed\u201d push as a push event for which <em>all<\/em> the desired operations are completed with no failures, the old RepositoryPushJob system fully processed about <strong>99.897%<\/strong> of pushes.<br \/>\nEven by our most conservative estimate, the new pipeline fully processes <strong>99.999%<\/strong> of pushes, roughly a hundredfold reduction in the failure rate.<\/p>\n<h2>Conclusion<a href=\"https:\/\/github.blog\/category\/engineering\/#conclusion\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h2>\n<p>Pushing code to GitHub is one of the most fundamental interactions that developers have with GitHub every day. It\u2019s important that our system handles everyone\u2019s pushes reliably and efficiently, and over the past several months we have significantly improved the ability of our monolith to correctly and fully process pushes from our users. Through platform-level investments like this one, we strive to make GitHub the home for all developers (and their many pushes!) 
far into the future.<\/p>\n<p><!-- Footnotes themselves at the bottom. --><\/p>\n<h4>Notes<a href=\"https:\/\/github.blog\/category\/engineering\/#notes\" class=\"heading-link pl-2 text-italic text-bold\"><\/a><\/h4>\n<div class=\"footnotes\">\n<p>People push to GitHub a whole lot, as you can imagine. In the last 30 days, we\u2019ve received around 500 million pushes from 8.5 million users.\u00a0<a href=\"https:\/\/github.blog\/category\/engineering\/#fnref-78344-1\" title=\"Return to main content.\">\u21a9<\/a><\/p>\n<\/div>\n<p>The post <a href=\"https:\/\/github.blog\/2024-06-11-how-we-improved-push-processing-on-github\/\">How we improved push processing on GitHub<\/a> appeared first on <a href=\"https:\/\/github.blog\/\">The GitHub Blog<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>What happens when you push to GitHub? The answer, \u201cMy repository gets my changes\u201d or maybe, \u201cThe refs on my [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":769,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[8],"tags":[],"class_list":["post-908","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-github-engineering"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/908","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=908"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/908\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/769"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=908"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=908"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=908"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}