[{"data":1,"prerenderedAt":193},["ShallowReactive",2],{"blog-\u002Fblog\u002F2023\u002F05\u002Fbringing-high-availability-to-node-red":3},{"id":4,"title":5,"body":6,"description":179,"extension":180,"meta":181,"navigation":188,"path":189,"seo":190,"stem":191,"__hash__":192},"blog\u002Fblog\u002F2023\u002F05\u002Fbringing-high-availability-to-node-red.md","Bringing High Availability to Node-RED",{"type":7,"value":8,"toc":169},"minimark",[9,19,22,40,43,48,59,62,65,68,71,74,77,80,84,87,90,93,96,99,102,105,108,111,115,118,121,124,128,131,134,137,140,143,147,150,153,156,166],[10,11,12,13,18],"p",{},"Many companies look to deploy Node-RED into use cases that require the application\nto have a high degree of availability, reliability, and scalability. Following up\nour ",[14,15,17],"a",{"href":16},"\u002Fblog\u002F2023\u002F02\u002Fhighly-available-node-red\u002F","previous post on the subject",", in this\npost I’m going to look at some of the technical details of achieving HA, the\napproaches available and what that means for the work we’re doing at FlowFuse\nand upstream in Node-RED.",[10,20,21],{},"Everyone we speak to has a different set of requirements for this topic. To help\nwith the discussion, I’m going to look at two ways of approaching it:",[23,24,25,34],"ul",{},[26,27,28,29,33],"li",{},"The ",[30,31,32],"strong",{},"hot-spare approach"," where you have a second instance of the application\nready to take over when the primary fails. This achieves availability but\ndoesn’t contribute to scalability.",[26,35,28,36,39],{},[30,37,38],{},"load-balanced approach"," where you have a second active instance of the\napplication and work is shared between them. If either fails, the other\ncontinues running. 
A side-effect of this approach is a higher potential\nthroughput and scalability, although in practice you need to ensure capacity\nto tolerate an instance failing.",[10,41,42],{},"To consider which approach is most appropriate in the context of Node-RED, we\nneed to look at the benefits and complications of each approach. It comes down\nto two factors: statefulness and how work is routed.",[44,45,47],"h3",{"id":46},"statefulness","Statefulness",[10,49,50,51,54,55,58],{},"There are two types of state to consider when thinking about a Node-RED flow:\n",[30,52,53],{},"explicit"," and ",[30,56,57],{},"implicit"," state.",[10,60,61],{},"Explicit state is what is programmed into the flow. For example, a flow may store\nstate in Context or use an external database service. Within FlowFuse we provide\ntwo types of context - the default in-memory context store and a database-backed\npersistent store. Currently the database-backed store includes a memory-caching\nlayer to provide better performance and interoperability. That gets tricky when\nyou want to have multiple instances sharing the same store. The context API\ndoesn’t provide a way to atomically update values - so you can get into classic\nconcurrency issues around two applications trying to update the same value.",[10,63,64],{},"The other type of state is that which is implicitly maintained in a flow - even\nif the user hasn’t explicitly configured it. For example, the Smooth node can be\nused to calculate a running average value of messages passing through it. The\nnode does that by keeping in memory the recent values so it can recalculate the\naverage with each update. If you have multiple instances, then the node will be\ncalculating the average for just the messages its own instance sees.",[10,66,67],{},"Another example of implicit state is the Batch node that can be used to group\nmessages into batches. 
Again - it will only be able to do that for the selection\nof messages the instance receives.",[10,69,70],{},"How the state can be handled depends very much on the requirements of a flow\nand the nodes it uses.",[10,72,73],{},"In the hot-spare approach, as only one instance is active at any time, a lot of\nthe explicit state handling will work as expected. However, the implicit state\nremains bound to the individual Node-RED instances.",[10,75,76],{},"In the load-balanced approach, care has to be taken to ensure any state generated\nby the flow is stored in a way that copes with multiple instances accessing it at\nthe same time.",[10,78,79],{},"The key takeaway is that a flow has to be created with HA and\u002For scaling\nin mind.",[44,81,83],{"id":82},"routing-work","Routing work",[10,85,86],{},"Node-RED makes it easy to integrate with lots of different sources of events.\nTwo of the most common are HTTP and MQTT. When considering how to handle\nmultiple instances of an application we need to think about how work is routed\nto those instances.",[10,88,89],{},"HTTP is the most well understood; you put a load-balancing proxy in front of the\nNode-RED instances and it takes care of sharing out the incoming requests. In\nthe hot-spare scenario, the proxy needs to know which instance is active - that\nrequires some coordination within the platform to track that properly.",[10,91,92],{},"MQTT is commonly used with Node-RED, but unlike HTTP, which is in-bound, MQTT\nworks by having Node-RED create an out-bound connection to a broker and then\nsubscribe to the topics of interest. 
In the early days of MQTT that would mean\neach instance would subscribe to the same set of topics and receive every message.\nThat doesn’t really fit any HA model.",[10,94,95],{},"With the publication of MQTTv5, the concept of Shared Subscriptions was added:\nthe ability for a group of clients to connect, subscribe to the same topic and\nhave the broker distribute messages between them. At this point you do get load\nbalancing across your Node-RED instances - as long as the MQTT nodes are suitably\nconfigured.",[10,97,98],{},"There are lots of other nodes that can be used to trigger flows, whether by\nlistening for events on an API, connecting to locally attached hardware or many\nthings in between. Typically, those that are more cloud-aligned, such as messaging\nsystems like Kafka and AMQP, will have very well established ways of doing load\nbalancing.",[10,100,101],{},"Managing out-bound connections gets more complicated in the hot-spare scenario.",[10,103,104],{},"If we only had to deal with in-bound connections, the hot-spare instance could just\nsit there waiting for work to be passed its way. But once you have out-bound\nconnections, then you have a problem. The hot-spare instance should only create\nits out-bound connections when it becomes the active instance. In real terms,\nthat means the Node-RED flows should only be started when the instance becomes\nactive.",[10,106,107],{},"With our goal to minimize the Mean Time To Recovery (MTTR), we need to find a\nway to get that spare instance running as quickly as possible; if it takes just\nas long to start the spare instance as it does to restart the failed primary\ninstance, then it isn’t much of an improvement.",[10,109,110],{},"The key here is that Node-RED allows you to start the runtime without the flows\nrunning. That gets everything loaded and the runtime ready ahead of time. 
It can\nthen start the flows at a moment's notice with a simple call to the runtime admin\nAPI.",[44,112,114],{"id":113},"detecting-failure","Detecting failure",[10,116,117],{},"A key requirement of the hot-spare approach to HA is knowing when to failover to\nthe spare.",[10,119,120],{},"This requires close monitoring of the active instance to know whether it's still\nworking. How quickly you can detect failure is key to reducing the time to recovery.\nThis is where you have to think about the different ways an instance could fail -\nhas it crashed, has it hung, has it got ‘stuck’?",[10,122,123],{},"Detecting failure usually involves some combination of heartbeat ‘pings’ between\nthe instances to check each is able to respond to requests. The spare instance\nthen needs to be able to decide for itself whether it should become the active\ninstance - and do so safely. You do not want to accidentally have two instances\nactive at the same time. This can get quite complicated to achieve safely, but\nthere are a number of approaches that can be used. We’ll be exploring them as we\ncontinue our journey towards HA.",[44,125,127],{"id":126},"editing-flows","Editing Flows",[10,129,130],{},"Within the Node-RED architecture, each instance also serves up its own editor.\nThis is what you get when you point your web browser at it.",[10,132,133],{},"In a HA world, once you have multiple instances running behind an HTTP load\nbalancer, there is a tricky question of how you edit the flows. If each request\nhits a different instance, just loading the editor will result in different bits\ncoming from different instances. That can typically be solved at the load balancer\nlevel by creating sticky-sessions; ensuring for a given client, each request is\nrouted to a consistent instance. That solves part of the issue, but the next\nchallenge is what to do when the Deploy button is pressed. That is how new flows\nare passed from the editor to the runtime. 
When you have multiple instances, we\nneed to make sure that they all get updated. That is quite a tricky problem to\nsolve with the current Node-RED APIs - and something we’ll be working on both in\nFlowFuse and in the upstream Node-RED project to resolve.",[10,135,136],{},"That said, a more immediate solution could well be to take advantage of separate\ndevelopment\u002Fproduction instances. You develop in a single instance and, when happy\nwith what you’ve got, roll it out to your HA-ready production instance. This\nbypasses the need to edit the flows in the HA environment at all.",[10,138,139],{},"Whichever method is used, there is a question of how you minimize downtime whilst\ndeploying an update. In a purely in-bound environment, solutions can be built\nwhere the new application is deployed alongside the old version and, when everything\nis ready, the in-bound events are redirected to the new version. But that isn’t\nfeasible when you have out-bound connections to deal with as well. For some users,\nhaving a scheduled maintenance window for doing updates will be completely acceptable.",[10,141,142],{},"As with the hot-spare approach to failover, a similar method could be used that\nstarts new instances of Node-RED alongside the old, but with the flows all stopped.\nThen, once everything is ready, the old instances are stopped and the new instances\nstarted - minimizing the downtime, although not completely removing it.",[44,144,146],{"id":145},"continuing-the-ha-journey-at-flowfuse","Continuing the HA journey at FlowFuse",[10,148,149],{},"So the question is how are we going to apply all of this to what we’re building\nat FlowFuse. We cannot do everything at once, so we have to prioritize which\nscenarios we’re going to address first. 
Consequently, drawing from customer\nfeedback, we have chosen to start with the scaling side of high availability -\nallowing multiple copies of an instance to be run with appropriate load\nbalancing put in front of it.",[10,151,152],{},"We are building FlowFuse as an open platform with the ability to run on top of\nDocker Compose and Kubernetes. As we get into some of these HA features, we will\nneed to look carefully at where we can lean on these underlying technologies -\nwe don’t want to reinvent the wheel here.",[10,154,155],{},"Our initial focus is going to be on running in a Kubernetes environment - just\nas we do with our hosted FlowFuse Cloud platform. Kubernetes provides lots of\nthe building blocks for creating a scalable and highly available solution, but\nit certainly doesn’t do all of the work for you.",[10,157,158,159,165],{},"We've identified our initial set of tasks and changes to how we'll run Node-RED\ninstances within the k8s environment. You can follow our progress with this\n",[14,160,164],{"href":161,"rel":162},"https:\u002F\u002Fgithub.com\u002FFlowFuse\u002Fflowfuse\u002Fissues\u002F2156",[163],"nofollow","issue"," on our backlog.",[10,167,168],{},"I hope this post has given some useful insight into the problems we’re looking\nto solve at FlowFuse. As it's such an important requirement for many users, we’ll\nkeep you updated as we make progress.",{"title":170,"searchDepth":171,"depth":171,"links":172},"",2,[173,175,176,177,178],{"id":46,"depth":174,"text":47},3,{"id":82,"depth":174,"text":83},{"id":113,"depth":174,"text":114},{"id":126,"depth":174,"text":127},{"id":145,"depth":174,"text":146},"Many companies look to deploy Node-RED into use cases that require the application\nto have a high degree of availability, reliability, and scalability. 
Following up\nour previous post on the subject, in this\npost I’m going to look at some of the technical details of achieving HA, the\napproaches available and what that means for the work we’re doing at FlowFuse\nand upstream in Node-RED.","md",{"navTitle":5,"excerpt":182},{"type":7,"value":183},[184],[10,185,12,186,18],{},[14,187,17],{"href":16},true,"\u002Fblog\u002F2023\u002F05\u002Fbringing-high-availability-to-node-red",{"title":5,"description":179},"blog\u002F2023\u002F05\u002Fbringing-high-availability-to-node-red","lylMrPnVi8mZv21J5uoUjmnQH9qiKCHTYVZy7tdxd3U",1780070550503]