The rise of Big Data has been one of the most important trends for business in the last decade, as digital transformation has reshaped entire industries – but this shift has also driven unprecedented competition between companies as they seek to access and control data.
“Data is the new oil. Like oil, data is valuable, but if unrefined it cannot really be used,” British mathematician and entrepreneur Clive Humby predicted as far back as 2006. “It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity.”
In recent years, the process of ‘refining’ data has been supercharged by the latest advances in machine learning – artificial intelligence (AI) algorithms that require huge amounts of data to model the real world. Offering unique advantages in pattern recognition and operational efficiency, machine learning is predicated on the idea of big data sets, with each model traditionally trained using real-world data.
Despite global data production rising by around 30 per cent year-on-year, the availability of high-quality data is still limited in many sectors. This is causing many businesses and researchers to explore an alternative: synthetic data, information that is created artificially by computers rather than generated by real-world events.
“Organizations today face a challenging environment where it’s easy to fall behind. Many enterprises are looking for ways to gain a competitive advantage via technology, synthetic data being one of the hot topics on the table,” Tobias Hann, CEO of synthetic data specialist Mostly AI, comments.
Erick Brethenoux, head of AI research at tech consultancy Gartner, predicts that some 60% of the data used for the development of AI and analytics projects will be synthetically generated by 2024 and notes that his firm is receiving “an increasing number of questions” regarding synthetic data. Forecasts suggest global data consumption will top 463 million terabytes per day by 2025.
Cost savings
Unlike real-world data, which can be laborious and costly to obtain, synthetic data is generated programmatically, based on parameters defined by the subject matter that the algorithm is modelling. It is designed to reflect the important statistical properties of real-world data without the need to collect new information.
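As a rough illustration of what "generated programmatically" can mean, the Python sketch below estimates two statistical properties (mean and standard deviation) from a small sample and then draws new, artificial values that share them. The sample values, the function names and the choice of a normal distribution are all assumptions made for this example; commercial synthetic data tools use far richer models.

```python
import random
import statistics


def fit_parameters(real_values):
    """Estimate the mean and standard deviation of a real-world sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate_synthetic(mean, stdev, n, seed=0):
    """Draw n artificial values from a normal distribution with the given parameters."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements, invented for this example.
real_values = [23.5, 31.0, 27.8, 19.9, 35.2, 29.4, 24.1, 30.6]

mean, stdev = fit_parameters(real_values)
synthetic_values = generate_synthetic(mean, stdev, n=10_000)

print(f"real:      mean={mean:.2f}, stdev={stdev:.2f}")
print(f"synthetic: mean={statistics.mean(synthetic_values):.2f}, "
      f"stdev={statistics.stdev(synthetic_values):.2f}")
```

The synthetic values are not copies of the original records, yet their summary statistics stay close to those of the sample, which is the property the generated data is meant to preserve.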
“Without getting too technical, algorithms need a training process, where they go through incredible amounts of annotated data, data that has been marked with different identifiers. And this is where synthetic data comes in,” says Dor Herman, CEO of Tel Aviv startup OneView.
The main drivers for most businesses to generate synthetic data are the relatively low cost and the speed with which it allows new AI models to be built. Whereas real-world data gathering can take months and can be prone to errors or unforeseen biases, the process of generating synthetic data is relatively straightforward from a programmatic perspective.
Research firm AIMultiple forecasts that the synthetic data market will continue to grow at more than 10 per cent per annum over the next decade and will help eliminate security gaps that traditional anonymization techniques cannot prevent.
Sensitive data
Alongside the cost of gathering real-world data, many companies also face significant compliance costs when dealing with personal customer information, as well as risks to brand and social networks if they employ weak privacy-preservation mechanisms. With growing awareness of the value of personal data and of the ways it can be misused, businesses must jump through a number of hoops before they can usefully build on customer data.
“The beauty of synthetic data is that it is highly flexible; you can create, share, and discard this data at will,” Hann of Mostly AI says. “It’s as good as your production data, yet it is exempt from the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act. It’s capable of improving data quality for AI and can be used to modify existing datasets, e.g., to correct for present biases.”
Since synthetic data does not contain information about real individuals, it can be used in any way you want, allowing AI models to be built rapidly with no need for costly compliance.
Mostly AI estimates that as much as 99% of the information and value of a real-world dataset can be preserved by using synthetic data, while also protecting sensitive data from re-identification.