ChatGPT, beyond the hype
February 2023
That is pretty much the universal reaction I get when I ask people about Josh Dzieza’s recent article in The Verge. The article examines the origins of the large data sets that feed AI systems and the way data is procured for AI algorithms to be trained, tuned, and tailored for particular tasks. If you haven’t done so already, please read it. It is an eye-opening description of the enormous amount of manual effort that is required to optimize the algorithms at the heart of many kinds of AI approaches and build the Large Language Models (LLMs) that drive generative AI tools such as ChatGPT and Bard.
It turns out that making machines appear to be human actually takes a remarkable number of people to create the data sources that drive the AI algorithms and fuel the analytics used in decision making. And as Dzieza says:
“You might miss this if you believe AI is a brilliant, thinking machine. But if you pull back the curtain even a little, it looks more familiar, the latest iteration of a particularly Silicon Valley division of labor, in which the futuristic gleam of new technologies hides a sprawling manufacturing apparatus and the people who make it run.”
The article is a wake-up call and a reminder that in times of massive technological change, people suffer. Sometimes a lot of people. It’s a troubling aspect of the digital transformation of business and society that all of us must face up to. With the recent acceleration of AI adoption, taking time to reflect on the role of data in driving AI and the dilemmas raised by advances in this technology is essential.
It is worth repeating that the secret to AI is people: Humans and machines working together and making the most of one another. However, we also need to beware of the potential negative impacts of this relationship. With increasing adoption of AI technologies, it is becoming clear that such interaction comes with a variety of troubling human costs:
Job Displacement and Reskilling: Automation driven by AI can lead to the displacement of certain jobs, particularly those involving repetitive and routine tasks. While AI creates new job opportunities in areas such as AI development, data analysis, and AI ethics, the transition is hard on individuals whose skills become obsolete. Many people will struggle to adjust.
Bias and Fairness Concerns: Biases and influences from many areas place pressure on the ways that AI systems are built and evolve. This can exacerbate existing inequalities and lead to discriminatory outcomes in areas like hiring, lending, and law enforcement.
Privacy and Security: AI technologies, particularly in the collection and analysis of personal data, raise significant privacy concerns. The extensive collection and analysis of personal data for profiling and decision-making can erode individual privacy rights and lead to unintended consequences, such as inappropriate monitoring and profiling.
Ethical Dilemmas: Deploying AI systems raises many kinds of ethical dilemmas about how they make decisions. For instance, there are well-known case studies that highlight the issues faced by self-driving cars as they make split-second decisions that involve weighing different priorities to choose a “least bad action”. Determining the “right” course of action in such situations is complex, ambiguous, and open to ethical challenges.
Depersonalization of Customer Service: The use of AI-powered chatbots and automated customer service systems can result in a depersonalized customer experience. While these technologies offer efficiency, they can be viewed as “dehumanizing” and lack the empathy and nuanced understanding that human interactions provide.
Mental Health Impact: Constant connectivity, social media algorithms, and AI-driven content recommendations have been linked to negative impacts on mental health. These technologies can contribute to feelings of social isolation, lack of self-worth, and addiction.
Loss of Human Judgment: Overreliance on AI systems can lead to a decline in human judgment and critical thinking. From Nicholas Carr’s warnings about “Google making us stupid” to more recent comments on the need for “explainable AI”, blindly following technology-driven AI recommendations reduces individual participation and understanding of complex situations, and removes the need for people to learn how decisions are made.
Beneath each of these human dilemmas is a story about data. The way that AI systems procure, manage, and apply data is a determining factor in the human-machine relationship. This is most obvious in the way AI systems are trained.
The quality and effectiveness of AI systems are intricately tied to the source and calibre of their training data. Good training data is the foundation upon which AI models are built, shaping their capabilities, accuracy, and real-world applicability. It serves as the essential raw material that allows AI algorithms to recognize patterns, make predictions, and perform tasks with accuracy and relevance.
In a rapidly advancing AI landscape, the role of high-quality training data cannot be overstated. It is the cornerstone on which the entire AI infrastructure rests. Investments in obtaining and maintaining good training data pay off by yielding AI systems that provide accurate, reliable, and valuable insights, ultimately determining their success and impact across various industries and applications.
Training data essentially guides AI models in understanding the complexities of the world. When the data is comprehensive, diverse, and representative, the AI system can generalize from the examples it has seen during training to make informed decisions on new, unseen data. This capacity for generalization is what makes AI systems valuable and adaptable to different scenarios. Conversely, poor-quality or biased training data can lead to skewed outcomes and unreliable predictions.
Good training data can be costly, and it is harder to come by than many people think. Ensuring good training data involves meticulous curation, validation, and augmentation. Data then needs to be cleaned, verified, and balanced to mitigate biases and inaccuracies. Moreover, the continuous refinement of training data is vital to keep AI models up-to-date and relevant as trends and contexts evolve. As Dzieza reminds us, this takes people, and a lot of them.
The latest AI advances illustrate just how much data is required. GPT-3.5, the LLM underlying OpenAI’s ChatGPT, was estimated to have been trained on 570GB of text data from the internet, which OpenAI says included books, articles, websites, and social media. The importance of good training data is particularly evident in supervised learning, where AI models learn from labelled examples. If the labels are incorrect or inconsistent, the AI’s understanding becomes flawed. In addition, the absence of specific examples can hinder the AI’s ability to grasp the full scope of a task, limiting its performance.
Regardless of the collection or generation process, the quality and volume of training data are critical factors that significantly influence the performance and reliability of AI models. There are three major concerns related to these aspects that we all must try to avoid.
The first is Bias and Unfairness. One of the foremost concerns is the presence of bias in training data. If the training data is biased, the AI model will learn and perpetuate those biases, potentially leading to discriminatory or unfair outcomes. Biases can arise from historical inequalities present in the data or from sampling that does not accurately represent the diversity of the real world. For instance, if a facial recognition system is trained predominantly on one demographic group, it might perform poorly on other groups, exacerbating existing societal biases. Ensuring a diverse and representative dataset is crucial to mitigate bias and promote fairness.
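One simple way to make this concern visible is to measure accuracy separately for each group instead of reporting a single overall figure. The following is a minimal sketch in Python, not a full fairness audit. The function name accuracy_by_group, the group names, and the handful of records are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Map each group to the fraction of its records predicted correctly.

    Each record is a (group, actual_label, predicted_label) tuple.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, actual, predicted in records:
        totals[group] += 1
        if actual == predicted:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation records: (group, actual, predicted).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 1),
]

per_group = accuracy_by_group(records)
# Gap between the best-served and worst-served group.
disparity = max(per_group.values()) - min(per_group.values())
print(per_group, disparity)
```

In this made-up data the overall accuracy is 62.5%, which hides the fact that every error falls on group_b. A large gap between groups is a signal to look again at how the training data was sampled.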
The second is Data Quality and Labelling. The accuracy and reliability of the training data labels are paramount. Incorrectly labelled data or noisy data can mislead the AI model and result in poor performance. In supervised learning, where models learn from labelled examples, even a small percentage of mislabelled data can have a significant negative impact. Maintaining data quality requires careful validation, error correction, and constant monitoring. In domains like medical diagnosis or autonomous driving, unreliable labels can lead to serious consequences, making data quality a critical concern.
The third is Data Volume and Generalization. The volume of training data plays a crucial role in the generalization ability of AI models. Too little data might result in overfitting, where the model memorizes the training data but fails to perform well on new, unseen data. Insufficient data can also limit the model’s ability to grasp the complexities of a task. Yet while deep learning models thrive on large datasets, collecting and annotating massive amounts of data can be time-consuming and resource-intensive.
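The gap between memorizing and generalizing can be shown with a deliberately extreme toy model. The sketch below, in Python, assumes a hypothetical "memorizer" that stores its training examples in a lookup table and guesses the most common training label for anything it has not seen. It stands in for an overfitted model rather than for any real learning algorithm, and the tiny even/odd dataset is invented for illustration.

```python
from collections import Counter


def fit_memorizer(train):
    """Return a predictor that memorizes the training pairs exactly.

    Unseen inputs fall back to the most common training label.
    """
    table = dict(train)
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


def accuracy(predict, examples):
    """Fraction of (input, label) examples the predictor gets right."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


# Hypothetical task: label an integer as "even" or "odd".
train = [(1, "odd"), (3, "odd"), (4, "even")]
held_out = [(6, "even"), (8, "even"), (10, "even"), (11, "odd")]

predict = fit_memorizer(train)
train_acc = accuracy(predict, train)        # perfect on what it has seen
held_out_acc = accuracy(predict, held_out)  # poor on new inputs
gap = train_acc - held_out_acc
print(train_acc, held_out_acc, gap)
```

Comparing accuracy on the training set with accuracy on held-out data is the usual first check for overfitting: a model that scores well only on examples it has already seen has not learned the task.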
Addressing these concerns requires a multi-faceted approach. It involves careful data collection, pre-processing, and augmentation to ensure a diverse and representative dataset. Implementing techniques to detect and mitigate bias, both in data collection and model training, is crucial. Data quality control measures, such as crowd-sourced validation or expert reviews, can help maintain accurate labels. Additionally, techniques like transfer learning can enable models to leverage knowledge from one domain to improve performance in another, even when data is limited.
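The crowd-sourced validation and expert review mentioned above can be combined in a simple way: several annotators label each item, a majority vote decides the label, and items where the annotators disagree are passed to an expert. The sketch below is one possible arrangement in Python, not a description of how any particular labelling platform works. The function aggregate_labels, the 0.75 agreement threshold, and the item identifiers are assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(votes_by_item, min_agreement=0.75):
    """Majority-vote each item and flag low-agreement items for expert review.

    votes_by_item maps an item id to the list of labels given by annotators.
    Returns (accepted, needs_review): a dict of item id -> agreed label, and
    a list of item ids whose top label fell below min_agreement.
    """
    accepted = {}
    needs_review = []
    for item_id, votes in votes_by_item.items():
        label, count = Counter(votes).most_common(1)[0]
        agreement = count / len(votes)
        if agreement >= min_agreement:
            accepted[item_id] = label
        else:
            needs_review.append(item_id)
    return accepted, needs_review


# Hypothetical votes from four annotators per image.
votes_by_item = {
    "img_001": ["cat", "cat", "cat", "cat"],
    "img_002": ["dog", "dog", "dog", "cat"],
    "img_003": ["cat", "dog", "dog", "cat"],
}

accepted, needs_review = aggregate_labels(votes_by_item)
print(accepted, needs_review)
```

The threshold is a trade-off. A higher value sends more items to experts and costs more, while a lower value accepts more labels that the annotators were not sure about.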
A recent eye-opening article by Josh Dzieza in The Verge shatters the illusion of AI’s autonomous brilliance. Behind the scenes, the success of AI hinges on something often overlooked: high-quality training data. This is often expensive and difficult to create and manage, and requires a lot of people doing challenging work. In this era of rapid technological change, it’s crucial to recognize this as part of the interplay between humans and AI. While AI offers tremendous benefits, it also presents significant challenges that demand our attention. Job displacement, bias, privacy concerns, and ethical dilemmas are real issues that need careful consideration.
But above all, quality training data is the bedrock of AI’s capabilities. It’s the raw material that shapes AI models’ accuracy, adaptability, and real-world performance. As we embrace AI’s potential, it is essential that we emphasize the importance of meticulous data curation, diverse representation, and bias mitigation. By doing so, we can pave the way for AI systems that enhance our lives while upholding ethical and societal standards.