Daily Bulletin

Big Data analyses depend on starting with clean data points

  • Written by: The Conversation
Image: What you get out is what you put in. (Keys image via www.shutterstock.com)

Popularly referred to as “Big Data,” mammoth sets of information about almost every aspect of our lives have triggered great excitement about what we can glean from analyzing these diverse data sets. Benefits range from better investment of resources, whether for government services or for sales promotions, to more effective medical treatments. However, real insights can be obtained only from data that are accurate and complete, so it’s critical to keep in mind how the data were collected.

Data scientists know the importance of accurate and complete data. After all, if the data themselves are unreliable, you’ll wind up drawing invalid conclusions from your analysis.

Image: Oh, did I press that? (Marcin Wichary, CC BY)

To avoid that pitfall, most data analysis projects spend heavily on data preparation and cleaning – that is, finding and correcting errors in the data. These errors include incorrect values, missing entries, aliasing (where information about two distinct entities has been merged in error, for example, because two people have the same name) and multiple entry (where information about the same entity is split up, for example, because the same person’s name has been spelled in different ways). When data sets are small, the analyst can manually examine and validate each entry. With large data sets, we have to rely on computer-executed algorithms. The development of such algorithms is now a subfield in its own right.

The old truism “garbage in, garbage out” is more apt than ever in this era of complex and gargantuan data sets – and the sometimes weighty consequences of trusting what they seem to imply.

How inaccuracies creep in

Errors in data can arise for a variety of reasons. For example, users often make mistakes when filling in web forms. Data cleaning software can verify that the zip code matches the street address, and possibly even correct it. If the state has been entered along with the town in the city field (for example, “Plainfield, NJ” for city), data cleaning can move the state entry to the correct field. Or if a street has only house numbers 1–80, data cleaning software can flag a house number entered as “125” as erroneous. Many inadvertent errors can be caught, and possibly fixed, by clever software.
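To make these form-cleaning rules concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular cleaning product: the field names, the street name "Elm Street", and the small lookup tables are hypothetical, and real tools would draw on full postal reference data.

```python
import re

# Hypothetical reference data: the highest valid house number on each street.
MAX_HOUSE_NUMBER = {"Elm Street": 80}

# A small illustrative set of state abbreviations, not a complete list.
STATE_CODES = {"NJ", "NY", "PA"}


def clean_address(record):
    """Return a cleaned copy of the record and a list of warnings."""
    cleaned = dict(record)
    warnings = []

    # Rule 1: if the city field looks like "Town, ST", move the state
    # abbreviation into the state field (only when that field is empty).
    match = re.fullmatch(r"\s*(.+?),\s*([A-Za-z]{2})\s*", cleaned.get("city", ""))
    if match and match.group(2).upper() in STATE_CODES:
        cleaned["city"] = match.group(1)
        if not cleaned.get("state"):
            cleaned["state"] = match.group(2).upper()

    # Rule 2: flag a house number beyond the known range for the street.
    # The value is flagged rather than changed, since the right fix is unknown.
    limit = MAX_HOUSE_NUMBER.get(cleaned.get("street"))
    number = str(cleaned.get("house_number", ""))
    if limit is not None and number.isdigit() and int(number) > limit:
        warnings.append(f"house number {number} exceeds {limit} on {cleaned['street']}")

    return cleaned, warnings


form_entry = {
    "house_number": "125",
    "street": "Elm Street",
    "city": "Plainfield, NJ",
    "state": "",
}
cleaned_entry, entry_warnings = clean_address(form_entry)
```

In this sketch the misplaced state is repaired automatically, while the out-of-range house number is only flagged, because the software can tell the value is wrong but not what the right value is.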

Bad data entry isn’t the only source of inaccuracies. One common place where errors arise is in linking data across data sets. Unless both data sets use a unique identifier – such as a social security number – with each entry, it is challenging to match entries across data sets: there are likely to be entries that wind up linked even though they should be distinct, and entries that are not linked even though they correspond.
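The two kinds of linking error can be seen in a small Python sketch. It assumes the simplest possible matching rule (equal names after normalizing case and spacing), and the records and field names are invented for illustration; practical record linkage uses more fields and fuzzier comparisons, but faces the same trade-off.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def link_by_name(left, right):
    """Link records whose normalized names are equal; return (left_id, right_id) pairs."""
    index = {}
    for rec in right:
        index.setdefault(normalize(rec["name"]), []).append(rec["id"])

    links = []
    for rec in left:
        for right_id in index.get(normalize(rec["name"]), []):
            links.append((rec["id"], right_id))
    return links


# Hypothetical data sets with no shared unique identifier.
store_records = [
    {"id": "s1", "name": "John Smith", "city": "Plainfield"},
    {"id": "s2", "name": "Kathryn Lee", "city": "Newark"},
]
survey_records = [
    {"id": "v1", "name": "john  smith", "city": "Trenton"},   # a different John Smith
    {"id": "v2", "name": "Katherine Lee", "city": "Newark"},  # same person, spelled differently
]

links = link_by_name(store_records, survey_records)
```

Here `links` contains only the pair `("s1", "v1")`: two different people are linked because they share a name, while the two records for the same person go unlinked because of a spelling difference. Loosening the rule to catch the second case would tend to produce more of the first.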

Another frequent source of mistakes is software that creates table entries from other, more complex data. For example, if you write a review of a restaurant, it may be condensed into one of a few buckets (e.g., loved/liked/hated) along a few simple axes (e.g., ambiance, food taste, service, value for money). The condensed form is amenable to quantitative analysis, which the original text is not. But errors can be made in the process of condensing.
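As a rough illustration of how condensing can go wrong, the Python sketch below reduces a review to a single bucket by counting keywords. The word lists and the scoring rule are assumptions made for this example, and it ignores the separate axes entirely; it is not how any particular review site works.

```python
import re

# Hypothetical keyword lists; real systems use far richer language models.
POSITIVE = {"great", "delicious", "friendly", "excellent"}
NEGATIVE = {"bland", "slow", "rude", "terrible"}


def condense(review):
    """Condense free text into one of three buckets by counting keywords."""
    words = re.findall(r"[a-z']+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "loved"
    if score < 0:
        return "hated"
    return "liked"


good_review = condense("The food was delicious and the staff were friendly.")
bad_review = condense("Service was slow and the soup was bland.")
negated_review = condense("The food was not great.")
```

The first two reviews land in sensible buckets ("loved" and "hated"), but the third is also recorded as "loved" because the rule sees "great" and misses "not". Once that value is in the table, nothing about it looks wrong to a later analysis.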

Image: If the data aren’t good, neither are the interpretations. (Pete Birkinshaw, CC BY)

At least don’t motivate people to lie

Dirty data are almost impossible to clean when errors are due to intentional user choice as opposed to inadvertent causes. Suppose you enter your neighbor’s address as yours: clever software cannot catch this lie without knowing more about you – after all, the address entered is technically a valid entry, it’s just not correct.

If we are to trust the results of analysis, we must ensure that the data collection procedures at least don’t give users incentive to cheat.

Consider web forms that routinely ask us to fill out information about ourselves. Many users enter a bogus email address in these forms, perhaps for fear of possible spam mail. Some websites confirm the email address entered, for instance, by sending a verification link that the user has to click. But such verification is expensive and unfriendly. The complementary approach is for the website to develop a reputation for trustworthiness so that users are willing to share their email addresses without worrying about the potential for misuse.

In fact, people (and businesses and other entities) will provide correct and complete data only if they feel they can trust the data collection. The US Census Bureau is able to collect high-quality data because it can assure citizens that what they report in the census will not be used for tax collection or any other government purpose besides statistical reporting. It might be desirable to catch tax cheats, and census data could obviously enhance the government’s ability to identify them. Yet laws in most countries prevent such use of census data, because the moment citizens know census data can be used for tax computation, they will be motivated to lie to the census-taker.

Image: Could big data have helped prevent the Germanwings plane crash? (Emmanuel Foudrot/Reuters)

Big data can’t outsmart high-stakes incentives to lie

Maybe you don’t really care whether or not you get the right targeted weekly email highlighting sales of possible interest to you at a local chain store. But there are certainly other instances where the stakes for big data accuracy are much higher.

For instance, take the current spotlight on German privacy laws, which centers on the mental health of pilot Andreas Lubitz. He allegedly crashed a plane intentionally into the Alps and killed 150 people in March 2015. Given his mental health, he probably should not have been flying an airplane. Some people argue that his employer, Lufthansa, parent company of Germanwings, should have had complete access to Lubitz’s mental health record and thus been able to keep him out of the cockpit before he had a chance to bring down a flight.

But weakening privacy laws would not reveal to authorities the true mental health of people like Lubitz. Rather, it would make it less likely that the official health record is a reliable record of fact. Someone like Lubitz, who is keen to fly and dreams of becoming a pilot, would likely do everything possible to hide any disqualifying condition from his official medical record if he knew it could be used against him. The incentive for omission and falsehood would undermine the ability to collect and use a reliable data set. In this case, privacy would be sacrificed without any safety payoff. Much better to keep the medical record data clean, and qualify pilots through tests run outside the formal medical system.

It’s great for us as a society to make use of all the data resources we have. But it’s important not to ruin the quality of these resources in our enthusiasm to use them, even with good intentions. Unless we are careful about how we deploy these big data sets, we’ll collect data of poor quality – particularly where there are individual points of concern, such as Lubitz’s health record. The inferences we draw from big data are only as good as the individual data points we feed in.

H V Jagadish's research on Big Data is funded in part by the National Science Foundation and the National Institutes of Health.

Read more http://theconversation.com/big-data-analyses-depend-on-starting-with-clean-data-points-43687
