Businesses count the cost of using questionable AI datasets

Jun 07, 2023

By Adonis Celestine, Director of Automation at Applause

Generative AI tools like ChatGPT represent a significant opportunity for businesses because they can perform a variety of tasks faster than people. But they also pose significant ethical and compliance risks if the data sets they’re built on are left unchecked. Adonis Celestine, Director of Automation at digital quality and testing company Applause, explores the risks and how to avoid them.

It has been a whirlwind six months since ChatGPT burst onto the scene. Its ability to mimic writing styles and produce content on request, thanks to natural language processing, has given many people a real taste of what AI-powered services are capable of. Generative AI tools are powered by sophisticated algorithms trained on huge amounts of data in the form of text, images, video and audio files scraped off the internet. But OpenAI has not revealed what data sets GPT-4, the latest iteration of ChatGPT’s underlying algorithm, was trained on, citing competitive reasons. This raises serious ethical questions and concerns about data bias and privacy.

Regulators in Italy have gone as far as limiting the processing of Italian users’ data until they have reassurances about policies surrounding ChatGPT’s use of personal data. OpenAI, the company behind ChatGPT, has insisted that it complies with GDPR and other privacy laws. Trade unions in Germany that represent the creative industries have also expressed concerns about potential copyright infringement, demanding new rules to restrict ChatGPT’s use of copyrighted material. It’s unclear who owns the copyright in content generated by AI. Legally, only humans can hold copyrights, which raises the question: who is liable for copyright infringement, the AI or its creator?

Examining the ethics of AI

Advances in AI and machine learning (ML) are happening so quickly that regulators are playing catch-up. But this isn’t the first time generative AI has been scrutinised. Last year a user discovered a sensitive personal photo in the data sets used to train an image-generating AI tool. The user had not given consent for the image to be used. This exposed a massive flaw in the way data is extracted. Businesses and public organisations need to be mindful of what is contained in the data sets used to train AI services, whether that data complies with data privacy and copyright laws, and whether it has been ethically sourced. The European AI Act, designed to address these matters with punitive fines, will come into effect in a few years. Much like GDPR, it will apply to businesses that operate in the EU and is expected to influence similar legislation around the world.

Consumers give their verdict on AI

A survey conducted in February found that a large majority of people believe AI should be regulated. Only 6% of 4,000 respondents said they did not think AI should be regulated at all.

More than half (53%) said AI should be regulated depending on its use and 35% said it should always be regulated.

The same study also looked at sentiment around the inherent biases that can affect interactions with generative AI services. Bias occurs when the underlying algorithm has been trained with poor or insufficient data. When asked about bias in generative AI tools, 86% of survey respondents expressed concern.

Up to one third said they were dissatisfied with the AI experience, and 32% agreed with the statement “I would use chatbots more if they responded more accurately to the way I phrase things.” Natural language processing failures can reflect gaps in training data, including limited data from various regional, generational and ethnic groups. As consumer and regulatory scrutiny intensifies, businesses developing AI tools need to ensure they’re collecting training data legally and ethically.

Building valid and ethical data sets

To build high-quality data sets, companies should focus on these four key points:

Make sure your organisation’s terms and conditions/privacy policy cover AI training use cases. If you’re planning to use customer data to train AI, then make sure customers understand how the data will be used and how it will benefit them (e.g., improved product and service offerings).
Make sure participants have opted in. Businesses are on solid ground when participants have explicitly agreed to provide data that may be used to train AI algorithms.
Actively work to eliminate bias. Look at the data and make sure it accurately reflects the diversity of your customer base and target audience.
Consider creating synthetic balanced data based on patterns and abstractions.
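
The opt-in and bias checks above can be sketched in a few lines of Python. This is a minimal illustration rather than a compliance tool: the record fields (consent_ai_training, age_group), the sample records and the target shares are all hypothetical, invented here to show the idea. It keeps only records whose contributors have opted in to AI training, then compares each group’s share of the remaining data with the share the business expects in its audience.

```python
from collections import Counter

# Hypothetical records: the field names and values are assumptions made
# for illustration, not a schema taken from any real system.
records = [
    {"id": 1, "consent_ai_training": True, "age_group": "18-34"},
    {"id": 2, "consent_ai_training": True, "age_group": "35-54"},
    {"id": 3, "consent_ai_training": False, "age_group": "55+"},
    {"id": 4, "consent_ai_training": True, "age_group": "18-34"},
    {"id": 5, "consent_ai_training": True, "age_group": "55+"},
]

# Hypothetical target shares, e.g. taken from the known customer base.
target_shares = {"18-34": 0.40, "35-54": 0.35, "55+": 0.25}


def filter_opted_in(rows):
    """Keep only records whose contributor explicitly opted in."""
    return [r for r in rows if r.get("consent_ai_training") is True]


def representation_gaps(rows, attribute, targets):
    """Return observed share minus target share for each target group.

    A negative value means the group is under-represented in the data.
    """
    counts = Counter(r[attribute] for r in rows)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records left after filtering")
    return {
        group: counts.get(group, 0) / total - share
        for group, share in targets.items()
    }


usable = filter_opted_in(records)
gaps = representation_gaps(usable, "age_group", target_shares)

for group, gap in sorted(gaps.items()):
    print(f"{group}: {gap:+.2f}")
```

A gap report like this only flags imbalance on the attributes you choose to measure. Deciding which attributes matter, and what the target shares should be, remains a judgment call for the business.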

Similarly, while data warehouses can provide artefacts at scale, it’s important for buyers to ensure the data may be expressly used to train AI algorithms without risk or repercussions. It’s important to ask if contributors have granted permission to have their biometrics used to train body or facial recognition technology, voice applications or other AI products.

Diverse data sets reflect authentic experiences

While consent is key, diversity of data and experience is also essential for training AI algorithms. Businesses need to ensure their data sets include samples from people with disabilities and from a range of ages, genders, races and other key demographics. One international fitness company did this well, sourcing AI training data from 1,500 users with a variety of body types and fitness levels. The project produced 36,000 workout videos that were vetted to ensure relevance and quality, and finally approved by the fitness company’s Digital Quality Analyst (DQA) team. The videos had the required diversity of data, including BMI, fitness abilities and varied workout clothing, and zero data bias.

As AI is used for an ever-growing number of new and different use cases, it is essential to ensure the quality and integrity of those experiences. Companies that focus on ethically collecting diverse, quality data and training their algorithms with it will see the biggest successes, both in releasing great AI experiences and in doing what’s right for their customers.