A major challenge in AI development is the effort required to obtain and label real-world data. A 2023 Gartner survey identified data availability as one of the top five barriers to implementing generative AI. Synthetic data can help address this issue.
With orders of magnitude less privacy risk than real data, synthetic data opens up opportunities to train machine learning (ML) models and analyze data that would otherwise be unavailable if real data were the only option.
We sat down with Alys Woodward, Sr Director Analyst at Gartner, to understand how synthetic data can overcome privacy, compliance, and data anonymization challenges, while also delving into the issues impeding its widespread adoption.
Q: How can synthetic data help organizations address privacy challenges while training their AI/ML or computer vision (CV) models?
A: Synthetic data can bridge information silos by acting as a substitute for real data and not revealing sensitive information, such as personal details and intellectual property. Since synthetic datasets maintain statistical properties that closely resemble the original data, they can produce precise training and testing data that is crucial for model development.
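The core idea, a generator learning the distribution of the real data and then sampling fresh records from it, can be sketched in a few lines. This is a minimal illustration using a multivariate Gaussian fit over two hypothetical numeric columns; production generators (copulas, GANs, diffusion models) are far more sophisticated, and all data and column roles here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" dataset: two numeric columns (say, age and income).
real = np.column_stack([
    rng.normal(40, 10, 1000),       # age-like column
    rng.normal(55_000, 8_000, 1000) # income-like column
])

# Fit a simple parametric model (a multivariate Gaussian) to the real
# data, then sample synthetic records from it. The principle is the
# same as in more sophisticated generators: learn the distribution,
# then sample new records from it.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

# The synthetic table preserves the statistical structure (means,
# variances, correlations) without reproducing any actual record.
print("real means:     ", real.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```

Because the synthetic records are drawn from the fitted model rather than copied, they can feed model training and testing without exposing any individual's actual values.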
Training CV models often requires a large and diverse set of labeled data to build highly accurate models. Obtaining and using real data for this purpose can be challenging, especially when it involves personally identifiable information (PII).
Two common use cases that require PII data are ID verification and advanced driver assistance systems (ADAS), which monitor movements and actions in the driver’s area. In these situations, synthetic data can be useful for generating a range of facial expressions, skin tones and textures, as well as additional objects like hats, masks, and sunglasses. ADAS also requires AI to be trained for low-light conditions, such as driving in the dark.
Q: How can synthetic data reduce the challenges associated with data anonymization?
A: Efforts to manually anonymize and deidentify datasets (that is, to remove information that links a data record to a specific individual) are often time-consuming, labor-intensive, and prone to errors. Ultimately, this can delay projects and lengthen the iteration cycle for developing ML algorithms and models. Synthetic data can overcome many of these pitfalls by providing faster, cheaper, and easier access to data that is similar to the original source, suitable for use, and privacy-protecting.
Furthermore, if manually anonymized data is combined with other publicly available data sources, there’s a risk of inadvertently revealing information that leads to data reidentification, thus breaching data privacy. Leaders can use techniques such as differential privacy to ensure any synthetic data generated from real data carries very low risk of deanonymization.
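The reidentification risk described above can be made concrete with a toy linkage attack. Every table, field, and name here is hypothetical; the point is that quasi-identifiers (such as zip code and birth year) surviving manual anonymization can be joined against a public dataset to put a name back on a sensitive record:

```python
# An "anonymized" medical table: direct identifiers removed, but
# quasi-identifiers (zip code, birth year) remain.
anonymized = [
    {"zip": "02138", "birth_year": 1954, "diagnosis": "hypertension"},
    {"zip": "90210", "birth_year": 1987, "diagnosis": "asthma"},
]

# A hypothetical public dataset (e.g. a voter roll) with names attached
# to the same quasi-identifiers.
public_roll = [
    {"name": "A. Smith", "zip": "02138", "birth_year": 1954},
    {"name": "B. Jones", "zip": "60601", "birth_year": 1990},
]

# Join the two tables on the quasi-identifiers: any unique match
# reattaches a name to a supposedly anonymous medical record.
reidentified = [
    (p["name"], a["diagnosis"])
    for a in anonymized
    for p in public_roll
    if (a["zip"], a["birth_year"]) == (p["zip"], p["birth_year"])
]
print(reidentified)
```

Synthetic records generated under differential privacy do not correspond to real individuals, so this kind of join yields no meaningful matches.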
Q: Despite the clear benefits of using synthetic data, what are some of the challenges hindering its widespread adoption?
A: Creating a synthetic tabular dataset involves striking a balance between privacy and utility, ensuring the data remains useful and accurately represents the original dataset. If the utility is too high, privacy may be compromised, especially for unique or distinctive records, as the synthetic dataset could be matched with other data sources. Conversely, methods to enhance privacy, such as disconnecting certain attributes or introducing ‘noise’ via differential privacy, can inherently diminish the dataset’s utility.
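The privacy/utility tension with differential privacy can be illustrated with the classic Laplace mechanism on a single aggregate query. This is a hedged sketch with made-up salary data: values are clipped to a known range so the query's sensitivity is bounded, and shrinking epsilon (stronger privacy) visibly increases the error, i.e. reduces utility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive attribute: 10,000 individual salaries,
# clipped to a known range so the mean query's sensitivity is bounded.
salaries = np.clip(rng.lognormal(mean=11, sigma=0.4, size=10_000), 0, 200_000)

def dp_mean(values, lo, hi, epsilon, rng):
    """Differentially private mean via the Laplace mechanism.

    With values clipped to [lo, hi], one individual can shift the mean
    by at most (hi - lo) / n, so Laplace noise with scale
    sensitivity / epsilon gives epsilon-DP for this single query.
    """
    n = len(values)
    sensitivity = (hi - lo) / n
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return values.mean() + noise

true_mean = salaries.mean()

# Stronger privacy (smaller epsilon) means more noise, hence less utility.
for eps in (10.0, 1.0, 0.01):
    estimate = dp_mean(salaries, 0, 200_000, eps, rng)
    print(f"epsilon={eps:>5}: absolute error = {abs(estimate - true_mean):.2f}")
```

The same tradeoff appears when a full synthetic dataset is generated under a privacy budget: the noise that makes records unlinkable to individuals also blurs the statistics the dataset is meant to preserve.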
Throughout the past decades of data management, poor transaction data quality has been an ongoing challenge. For example, call center agents might fail to capture complete address or customer information, and this missing data can block analysis. To counteract this, IT organizations had to educate business users on how important good data quality is to both applications and analytics; “garbage in means garbage out” was the commonly accepted principle. That history now shapes attitudes toward synthetic data: people assume it must be inferior because it is not real data, which delays adoption. In reality, synthetic data can be better than real data, not in how it represents the current world, but in how it can train AI models to work with the ideal or future world.
A synthetic dataset mirrors the original dataset, so if the original does not include unusual occurrences or “edge cases,” these won’t appear in the synthetic dataset either. This is particularly important for synthetic image and video data in areas like autonomous driving, where many hours of driving footage are used to train the AI. Unusual situations, such as emergency vehicles, driving in snow, or animals on the road, still need to be created deliberately.