Postcards from the edge | Ultra ‘six-nines’ reliability – and why it’s madness (Reader Forum)


Four nines, five nines, six nines – everyone wants more nines. Every enterprise wants ultra reliability, with guaranteed uptime of 99.99 percent (or 99.999 percent, or 99.9999 percent). But here’s the thing: a flippant rule of thumb says every extra nine in pursuit of total reliability costs an extra zero. And when it comes to the crunch, when signing off infrastructure investments, most enterprises will settle for something less. They will hedge their bets and ride their luck. Which is sometimes their best shot at ultra reliability, anyway. Let me explain.

The idea that an extra nine (to go from 99.99 percent ‘reliability’ to 99.999 percent, for example) costs an extra zero leads directly to two questions. The first is whether your business case justifies the cost. Because we all want cool stuff, but building systems which are fundamentally uneconomic does not make sense – ever, in any circumstances – and cannot work out in the long term. The second question, then, is how to reconcile the desire for extremely high availability, as advertised with things like 5G, with real-world operational requirements.

Rolfe – a tangled mess of interdependent systems, which needs to be ordered, controlled, and replicated

Let us consider this by splitting out some of the different drivers in a critical-edge system – and some of the possible points of failure. Because Industry 4.0 is a tangled mess of interdependent systems, which needs to be ordered and controlled and replicated – if, for instance, performance is to be guaranteed in writing as part of a service level agreement (SLA) with the supplier. So how, exactly, should we define system ‘availability’? Because SLA requirements very quickly multiply infrastructure complexity and escalate financial investments.

Should latency come into it, for example? If an SLA stipulates 10ms, and the system responds in 70ms, then are we ‘down’? Is it okay to send a ‘busy signal’ when the network is not operating properly – as the phone system does? What about downtime? Google calculates downtime by the percentage of users affected – on the assumption that, if the system is big enough, something somewhere is always down. What about planned downtime for patching security holes, say? Being fully patched, all the time, messes with your availability targets.
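To make those targets concrete, it helps to translate each extra nine into an annual downtime budget. The sketch below is a minimal illustration (Python, not from the original piece; it assumes a flat 365-day year and no carve-outs for planned maintenance) of how little room each nine leaves for patch windows and everything else.

```python
# Rough annual downtime budget for each availability target.
# Illustrative only: assumes a flat 365-day year and no planned-maintenance carve-outs.
MINUTES_PER_YEAR = 365 * 24 * 60

targets = [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
    ("six nines", 0.999999),
]

for label, availability in targets:
    budget_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {budget_minutes:.1f} minutes of downtime per year")

# Five nines works out to roughly 5.3 minutes a year; six nines to roughly 32 seconds.
```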

What about networking? Does a system which is running, but not visible, count as downtime? What about the public cloud? AWS offers 99.9 percent (three nines) availability, assuming you run in three zones. So how do you turn three nines in a cloud-attached critical edge system into five or six nines? How do you do that if any part of your solution runs in a public cloud? What about APIs to third-party services? Are all of these functions, on which a whole Industry 4.0 system relies, included in the SLAs on system availability?
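And the arithmetic of chained dependencies is unforgiving: when components all have to be up at once, their availabilities multiply, so the composite figure is always worse than the weakest link. A minimal sketch, using assumed figures rather than anyone’s actual SLA:

```python
# Composite availability of components that all have to be up at the same time:
# the individual figures multiply, so every extra dependency drags the total down.
# All numbers below are assumptions for illustration only.
dependencies = {
    "public cloud region": 0.999,   # the 'three nines' figure quoted above
    "edge application": 0.9999,
    "third-party API": 0.999,
    "network path": 0.9995,
}

composite = 1.0
for name, availability in dependencies.items():
    composite *= availability

print(f"Composite availability: {composite:.5f}")  # roughly 0.99740 - not even three nines
```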

What gives? How can we rationalise all of that against the promise of ultra-reliability currently being made for incoming industrial 5G systems? And actually, we are only getting started. We have walked through the entry requirements for a critical Industry 4.0 system, and already it is a wonder it is even up and running – let alone that it is doing so with a respectable number of nines. But now the fun really begins. Because almost every trend in software is increasing system complexity and making it harder to achieve five- or six-nines reliability.

Take microservices, loved by developers, but heavy going in terms of footprint, execution time, and network traffic. If your definition of ‘up’ includes an aggressive SLA for latency and response time, then microservices may not be your friend. Because all those calls add up – and bottlenecks, stalls, and Java GC pauses can appear at odd moments, as ‘emergent behaviour’ that is nearly impossible to debug. There is also a degree of opacity that is an intentional aspect of microservices. How do you know you won’t need sudden patches if you don’t know the code you’re running?

At a fundamental level, every time you make a synchronous API call to anything, you are handing control of time, which is precious, to Someone Else’s System – run by someone who may not share your priorities. And given such systems spawn giant recursive trees of API calls, you need to understand the overhead you are introducing. Consider the famous (probably fictional) IBM ‘empty box’ experiment, in which the company found it would take nine months to make and ship a product, even if that product was an empty box. If it takes 50 calls to solve a business problem, then even if each call takes only 100 microseconds (0.1ms), it is still going to take 5ms just to make the calls, never mind to do any work.
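That back-of-the-envelope sum, written out as a sketch (the numbers are the article’s hypotheticals, not measurements), shows how quickly call overhead alone eats into a 10ms latency SLA:

```python
# Back-of-the-envelope cost of chained synchronous calls.
# All numbers are hypotheticals, not measurements.
calls_per_request = 50
overhead_per_call_ms = 0.1   # 100 microseconds just to make each call
latency_sla_ms = 10.0        # the SLA figure used earlier in the piece

call_overhead_ms = calls_per_request * overhead_per_call_ms
budget_left_ms = latency_sla_ms - call_overhead_ms

print(f"Call overhead alone: {call_overhead_ms:.1f}ms")      # 5.0ms
print(f"Budget left for real work: {budget_left_ms:.1f}ms")  # 5.0ms of a 10ms SLA
```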

DevOps is another risk factor. The software industry has gone from the despair of annual releases to the excitement of daily ones. Which is great, until you realise you’ve just signed up for six nines of availability and six releases a week. Accurate, reliable data on outages is hard to find, but by some measures 50 percent are caused by human error – either directly, by fat-fingering an IP address, say, or indirectly, by failing to renew security certificates or pay hosting bills. Whatever the real figure, we can agree the risk of human error scales roughly linearly with the number of deployments.

And just do the maths: with daily releases, a one percent chance of a five-minute outage per release works out at roughly 18 minutes of expected downtime a year – more than three times the five-nines budget of about 5.3 minutes – so you would struggle to maintain five nines even if everything else was bulletproof. Which it isn’t. Lifecycles are another factor. A vendor might refresh a product every year and stop support after three; venture firms want their money back in seven years, or less. But customers expect 10 or 20 years of life out of these things. Are you really comfortable making promises about availability when some of the stuff your system interacts with might be unsupported within four years?
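Spelled out, with those hypothetical figures rather than real outage data:

```python
# Expected downtime from releases alone, versus the five-nines budget.
# The probability and outage length are hypothetical figures, not outage data.
releases_per_year = 365
outage_probability = 0.01    # a one percent chance that a release causes an outage
outage_minutes = 5.0

expected_downtime = releases_per_year * outage_probability * outage_minutes
five_nines_budget = 365 * 24 * 60 * (1 - 0.99999)

print(f"Expected release-related downtime: {expected_downtime:.1f} minutes/year")  # ~18.3
print(f"Five-nines downtime budget:        {five_nines_budget:.1f} minutes/year")  # ~5.3
```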

Have I ruined your afternoon yet? What is the magic fix, then? How do we stop the Industry 4.0 vendor community from quitting their jobs and joining the French Foreign Legion? There is no magic fix; there never is. But… if we make a conscious effort in the high-availability, low-latency edge space to do things differently, we can probably prevail. It takes a number of steps; here are some simple rules:

1 | Avoid gratuitous complexity – don’t add components if you don’t need them.

2 | Build a test environment – with a copy of every piece of equipment you own, in a vaguely realistic configuration. Telstra, for example, runs a copy of the entire Australian phone system, complete with a Faraday cage in which to make calls using new software it intends to inflict on the general public. Sound expensive? It is; but it is not as expensive as bringing the national phone system down.

3 | Choose vertically-integrated apps – which sounds like heretical advice. But if you’re going to own the latency/availability SLA, then you need to own as much of the call path as possible. What you own, you control. Time spent implementing minimal versions of stuff will be repaid when the next patch-panic happens and you’re exempt because you did it yourself. Again, this is heretical. But what is the alternative?

4 | Adopt modern DevOps practices – follow the Google model on SLAs, for example, where developers have a ‘downtime budget’; it makes sense when your SLAs are so stretched and multi-dependent.

5 | Track and report response times – especially where you have to surrender control to someone else’s API or web service. Because sooner or later one of your dependencies will misbehave, and you will become the visible point of failure – so you will need to show that the failure is elsewhere. The time to write that code is when you write the app, not during an outage (see the sketch below). Which may seem cynical, but others will point the finger at you if they can. I speak from experience.
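By way of illustration of rule 5, here is a minimal sketch of wrapping an outbound call so the latency evidence exists before the finger-pointing starts – the endpoint, threshold, and logger name are placeholders, not anything from the article:

```python
import logging
import time
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dependency-timing")

# Hypothetical threshold and endpoint, chosen only for illustration.
SLOW_THRESHOLD_MS = 10.0
DEPENDENCY_URL = "https://example.com/api/health"

def timed_call(url: str) -> bytes:
    """Call a dependency and record how long it took, so the evidence of
    who was slow exists before the outage post-mortem, not after it."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.read()
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        level = logging.WARNING if elapsed_ms > SLOW_THRESHOLD_MS else logging.INFO
        log.log(level, "call to %s took %.1fms", url, elapsed_ms)

if __name__ == "__main__":
    try:
        timed_call(DEPENDENCY_URL)
    except OSError as exc:
        log.error("dependency call failed: %s", exc)
```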

David Rolfe is a senior technologist and head of product marketing at Volt Active Data. His 30-year career has been spent working with data, especially in and around the telecoms industry.

For more from David Rolfe, tune in to the upcoming webinar on Critical 5G Edge Workloads on September 27 – also with ABI Research, Kyndryl, and Southern California Edison.

All entries in the Postcards from the Edge series are available below.

Postcards from the edge | Compute is critical, 5G is useful (sometimes) – says NTT
Postcards from the edge | Cloud is (quite) secure, edge is not (always) – says Factry
Postcards from the edge | Rules-of-thumb for critical Industry 4.0 workloads – by Kyndryl
Postcards from the edge | No single recipe for Industry 4.0 success – says PwC
Postcards from the edge | Ultra (‘six nines’) reliability – and why it’s madness (Reader Forum)


