Daily Bulletin

Server down: what caused the ATO systems to crash

Written by: Robert Merkel, Lecturer in Software Engineering, Monash University

Many Australian Taxation Office (ATO) IT systems have been unavailable for days after a major fault, apparently caused by a problem with a large-scale storage server.

The ATO’s online systems, including its public website and portals for taxation agents, were down for several days. At the time of writing, the ATO reports that most services are now operational but may experience slowdowns.

There were also reports that up to one petabyte of data was affected by the fault. The ATO has reported that no taxpayer data have been lost, although it is unclear whether any internal data have been lost.

Outage in a SAN

According to the ATO and media reports, the system outage was caused by a failure in a 3PAR StoreServ storage area network (SAN) made by Hewlett Packard Enterprise (HPE).

These devices contain racks full of hard disks and/or solid-state storage devices to store data on a gargantuan scale, and fast network interfaces to provide that data to the various “application servers” that provide the ATO’s online systems.

The two units purchased by the ATO were reportedly capable of storing up to a petabyte – that’s 1,000 terabytes or 1 million gigabytes – of data each. They would have cost hundreds of thousands of dollars.

While these devices are expensive, they allow IT staff to allocate storage efficiently and flexibly to where it is needed, and thus (in theory) can improve reliability.

Image: Even Hewlett Packard Enterprise’s state-of-the-art storage system was vulnerable to data corruption. (Credit: Hewlett Packard Enterprise)

Multiple levels of redundancy, made redundant

Entrusting so much of the IT operations of a large organisation like the ATO to a single storage server requires a high degree of confidence that it will function reliably. As such, a number of levels of redundancy are incorporated into this kind of storage system.

As a first protection against a failure of a single disk (or solid-state storage device), data are “mirrored” across multiple physical disks. If monitoring systems detect a failure, operations can fall back on the mirrored data.

The faulty disk can be replaced and the full mirror restored, all without interrupting user operations. High-end systems such as these also incorporate redundancy into their controller electronics.

Many such systems also have a second level of redundancy to cover a major hardware failure, such as a power failure that is not covered by a backup power supply. The entire contents of the SAN are “mirrored” to a second system, often in another physical location, and systems switch over to the backup automatically.

According to iTnews, all of this redundancy was made moot by the nature of the problem: corrupted data were being written to the SAN for some reason, and these corrupted data were then mirrored to the backup SAN.

In this situation, all the redundancy within and between the SANs does not help, as the bad data were replicated across the entire system. This is why keeping traditional backup snapshots – copies of data as they previously existed in the system – is so important, regardless of any amount of mirroring.
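The difference between mirroring and snapshots can be illustrated with a toy model. The sketch below is purely illustrative and assumes nothing about how the HPE system actually works: the `MirroredStore` class, its method names and the record keys are all hypothetical. It shows only that a mirror faithfully copies whatever is written, corrupt or not, while a point-in-time snapshot preserves an earlier good state.

```python
import copy


class MirroredStore:
    """Toy model (hypothetical): every write goes to a primary and its mirror."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.snapshots = []  # point-in-time copies, kept apart from the mirror

    def write(self, key, value):
        self.primary[key] = value
        # Mirroring copies whatever was written, whether it is good or bad.
        self.mirror[key] = value

    def take_snapshot(self):
        # A snapshot is an independent copy of the data as it exists right now.
        self.snapshots.append(copy.deepcopy(self.primary))

    def restore_snapshot(self, index=-1):
        # Roll both copies back to the chosen snapshot (the latest by default).
        self.primary = copy.deepcopy(self.snapshots[index])
        self.mirror = copy.deepcopy(self.snapshots[index])


store = MirroredStore()
store.write("record-1", "valid data")
store.take_snapshot()

# A fault writes corrupt data; the mirror dutifully receives the same bytes.
store.write("record-1", "\x00corrupt\x00")
mirror_after_corruption = store.mirror["record-1"]

# Only the snapshot still holds the earlier, good version.
store.restore_snapshot()
restored = store.primary["record-1"]
```

In this model the mirror protects against losing one copy, but not against both copies being wrong. Recovery depends entirely on the snapshot having been taken before the corruption began.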

The ATO appears to have comprehensive backups of the stored data; however, restoring all of it and returning the SANs to an operational configuration has had to be done manually. It is not surprising that this has taken several days to complete.

Assessing the ATO’s response

While it is tempting to pile on to another large-scale government IT failure, a fair assessment should take into consideration the nature of the failure and the ATO’s response.

Firstly, it appears that the ATO heeded one of the key lessons from the Census website meltdown and communicated what was going on to the public effectively. It responded to the failures by providing informative updates on social media and more comprehensive information on a functioning part of its website.

Secondly, it appears that its backup strategy was sufficient to get all systems back up and running without data loss, despite a nearly worst-case failure in its primary storage system.

If the ATO’s incident response can be criticised, it is on the grounds that services might have been restored much faster had more of the process been automated. However, this appears to be a highly unusual incident.

Restoring one set of application data due to corruption caused by the application itself is a relatively common situation. Restoring many different sets of data because of an apparent bug in the storage server is extremely rare.

Furthermore, while few people ever see them, SANs like this are very common devices in data centres. They provide a generic low-level storage service and are expected to provide it highly reliably.

Indeed, HPE markets its enterprise storage systems with a “99.9999% uptime guarantee”, which requires that a device is non-operational for no more than about 31.5 seconds per year.
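The downtime implied by an availability percentage is simple arithmetic. As a minimal sketch, assuming a year of 365.25 days (the function name and that year length are choices made for this illustration, not part of any HPE definition), “six nines” works out to roughly 30 seconds, or about 31.5 seconds to be more exact.

```python
# Assumption for this sketch: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def allowed_downtime_seconds(availability_percent):
    """Return the seconds of downtime per year permitted at a given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


six_nines = allowed_downtime_seconds(99.9999)   # about 31.6 seconds
three_nines = allowed_downtime_seconds(99.9)    # about 8.8 hours, in seconds

print(f"99.9999% uptime allows about {six_nines:.1f} seconds of downtime per year")
print(f"99.9% uptime allows about {three_nines / 3600:.1f} hours of downtime per year")
```

By this measure, an outage lasting several days exceeds the yearly allowance many thousands of times over.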

Over the past few days, the IT staff at the Australian Taxation Office have probably had a few sleepless nights. It’s likely that engineers at HPE will have a few more trying to get to the bottom of why their enterprise storage system seems to have failed so comprehensively.

Read more http://theconversation.com/server-down-what-caused-the-ato-systems-to-crash-70396
