[{"data":1,"prerenderedAt":107},["ShallowReactive",2],{"blog-\u002Fblog\u002F2023\u002F02\u002Fhighly-available-node-red":3},{"id":4,"title":5,"body":6,"description":12,"extension":96,"meta":97,"navigation":102,"path":103,"seo":104,"stem":105,"__hash__":106},"blog\u002Fblog\u002F2023\u002F02\u002Fhighly-available-node-red.md","Toward Highly Available Node-RED",{"type":7,"value":8,"toc":87},"minimark",[9,13,16,27,32,40,43,51,55,58,62,71,74,78],[10,11,12],"p",{},"Over the past few months we've held a lot of product discovery sessions and a topic\nwhich keeps coming up is \"HA Node-RED\". All software will have failures; with\nHA (high availability) the intent is to allow the workload to be processed\nregardless. There are quite a few considerations which are often not covered\nduring product discovery calls; I'm going to discuss some of those points in this article.",[10,14,15],{},"When my job title was software engineer I was fortunate to design and\nimplement an HA system. It's an incredibly challenging and rewarding task for\nany software engineer. As a topic, it's studied in Computer Science at the\nbachelor's, master's, and even PhD level. When I was tasked\nwith building an HA system, it took me a good month to define what our goals\nwere, and what we were willing to exchange for the properties sought. This\nmight be extra hardware, engineering hours, or organizational challenges.\nFor now, let's focus on the first two.",[10,17,18,19,26],{},"Let's start with defining the goal: reduce the impact of a Node-RED instance\nbeing unresponsive for an arbitrary reason. In many use-cases the MTTR (Mean Time To\nRecovery) is what's measured. When, for example, a hardware failure takes down the instance, even if the time to\ndetection is zero, it will likely still take a few hours to recover. 
Most of\nthe recovery work is also manual, and knowledge of how to recover is usually\n",[20,21,25],"a",{"href":22,"rel":23},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTribal_knowledge",[24],"nofollow","tribal knowledge",". If the right\nperson is on-site, the right hardware is available, you kept great backups,\nand you can deploy the new hardware right away without support from other\nfunctions, you might just achieve an MTTR of 120 minutes!",[28,29,31],"h3",{"id":30},"_5-10-minute-mean-time-to-recovery","5-10 minute Mean Time to Recovery",[10,33,34,35,39],{},"What's needed to bring this down to, say, 10 minutes? First, adopting FlowFuse will\nhelp massively here. FlowFuse can be installed on-premise, or you can use our\nmanaged Cloud offering. The software is the same, provided the on-premise\ninstall uses our ",[20,36,38],{"href":37},"\u002Fdocs\u002Finstall\u002Fkubernetes\u002F","Kubernetes install"," method.",[10,41,42],{},"The key to the installation is that the hardware layer is generalized\nas a fleet. Failure detection is included in the install, and it is very fast. Compared\nto most alerting systems in use today, the difference is usually night\nand day. Furthermore, to decrease the recovery time significantly,\nsoftware must be responsible for the whole procedure. Human intervention is much too slow.",[10,44,45,46,50],{},"To get the MTTR down to 5 minutes, hardware must either be made\nautomatically available to the fleet, or be over-provisioned (more hardware is\navailable than is needed at any given moment). When a hardware failure occurs\nFlowFuse is configured to ensure all Node-RED instances that went down are\nreplaced. 
This brings the time to recovery down to about 5 minutes.\nFor many use-cases an MTTR of 5 minutes is ",[47,48,49],"em",{},"good enough",".",[28,52,54],{"id":53},"sub-minute-mttr","Sub-minute MTTR",[10,56,57],{},"To go below a minute, or dare I say below 10 seconds, we'll need to increase\nthe number of running Node-RED instances. Let's start with a hot-spare: a running\nNode-RED instance with exactly the same flows as another,\nready to pick up the work when the first fails. Note this isn't like\na relay race; there's no baton being passed from one Node-RED to the other. While\nsome data and messages might be lost, it's possible to redirect all workload from\nthe failing Node-RED to the hot-spare in a matter of seconds. Humans usually only\nnotice a hot-spare takeover a good few minutes after it has replaced a failed instance.",[28,59,61],{"id":60},"sub-second","Sub-second!",[10,63,64,65,70],{},"Before this post turns into a theoretical exercise, we really need to understand\nwhich trade-offs are acceptable to you. There's the ",[20,66,69],{"href":67,"rel":68},"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCAP_theorem",[24],"CAP Theorem","\nwhich names three desirable guarantees: Consistency, Availability, and Partition Tolerance. You get to pick\nonly two. In manufacturing, a software failure should never stop the line if it can be avoided,\nso Availability is the most important. The question is, what comes next? Is it Consistency, meaning\nall instances have the same view of the global state? 
Or maybe Partition Tolerance, where it's vital that each instance can predict the intended action even if it can't communicate with the others?",[10,72,73],{},"Whichever 2 you choose will dictate engineering choices in the pursuit of a great HA solution.",[28,75,77],{"id":76},"the-roadmap","The roadmap",[10,79,80,81,86],{},"With FlowFuse v1.4, released February 2023, a 5-minute mean time to recovery is\nachieved for all flows running locally, that is: in the cluster. Going beyond this\nmilestone requires your input! I'd love to chat about your challenges; please\n",[20,82,85],{"href":83,"rel":84},"https:\u002F\u002Fmeetings-eu1.hubspot.com\u002Fzeger-jan",[24],"pick a timeslot to discuss your requirements","!",{"title":88,"searchDepth":89,"depth":89,"links":90},"",2,[91,93,94,95],{"id":30,"depth":92,"text":31},3,{"id":53,"depth":92,"text":54},{"id":60,"depth":92,"text":61},{"id":76,"depth":92,"text":77},"md",{"navTitle":5,"excerpt":98},{"type":7,"value":99},[100],[10,101,12],{},true,"\u002Fblog\u002F2023\u002F02\u002Fhighly-available-node-red",{"title":5,"description":12},"blog\u002F2023\u002F02\u002Fhighly-available-node-red","Y-5-8u8UTfCM6Uf20P0LH4QeaiyRRmkXuTlZ09GQVGw",1780070550239]