Justin Sheehy on Being a Responsible Developer in the Age of AI Hype

03 Oct 2024

Responsible AI Development in the Age of Hype

  • The talk is aimed at software practitioners who might be feeling overwhelmed by the rapid developments and inflated expectations surrounding AI, with the goal of discussing how to be a responsible developer in the age of AI hype (16s).
  • Developers, or software practitioners, have power and their decisions matter, as stated by tech industry analyst Steve O'Grady, and they need to know how to make good, responsible decisions (2m0s).
  • The term "artificial intelligence" is broad and unhelpful, but it is still widely used, and it refers to computer programs that can be understood by developers (2m53s).
  • Most AI systems have been built using either logic and symbol processing or statistics and mapping probability distributions, with recent attention focused on probabilistic systems (3m24s).
  • The current generation of autoregressive (AR) LLMs (Large Language Models) is based on advances such as the Transformer architecture, whose attention mechanism enables the statistical processing of long sequences of text (4m12s).
  • The talk emphasizes the importance of understanding the basics of AI systems, such as LLMs, in order to have a concrete conversation about their development and use (4m3s).
  • The rapid progress in AI has led to an "age of AI hype," with some people believing that the hype is earned and that we are on the way to the singularity, while others may disagree (2m36s).
  • The talk highlights the need for developers to learn from other fields, such as linguistics, philosophy, psychology, anthropology, art, and ethics, in order to make responsible decisions about AI development (1m21s).
  • The speaker emphasizes that developers should not think that they can solve problems on their own, without input from other fields, as this is often a bad idea (1m31s).
  • Language models, such as those from Google and OpenAI, aim to predict and generate plausible language by predicting the next word or token in a sequence, essentially functioning as advanced autocomplete systems; a toy sketch of this generation loop follows this list (4m52s).
  • These models do not plan ahead, have knowledge, or understand meaning, and cannot simply be instructed not to give false answers, as stated by Google and OpenAI themselves (6m3s).
  • OpenAI has acknowledged in legal filings that its system's primary function is predicting the next most likely words in response to a prompt, and that asking it to tell the truth is an area of active research (6m36s).
  • The type of AI being referred to in this context is the autoregressive large language model (AR LLM), such as ChatGPT, which is a powerful tool for its intended purpose but has limitations (6m58s).
  • Despite their capabilities, these systems are not expected to magically change what they can do, and it is possible that other AI systems may be created in the future that do not have the same limitations (7m12s).
  • The current era is being referred to as an "age of hype" rather than an "age of awesome AI" due to exaggerated claims being made about the capabilities of these systems, which is driven in part by the significant financial investments being made in AI research (7m40s).
  • Similar hype surrounding AI has occurred in the past, approximately 60 years ago, but the main difference now is the large amount of money being invested in AI research, which adds to the incentive for exaggeration (8m11s).
  • Prominent individuals are making claims about AI that are not supported by the current state of technology, and it is essential to be cautious of these claims and not fall for the hype (8m25s).
  • To make better decisions as a responsible developer, it's essential to evaluate technology reasonably and not fall for hype or nonsense, which can lead to poor decisions about what technology to use and how to build future projects (8m38s).
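To make the "advanced autocomplete" point above concrete, here is a minimal sketch of autoregressive generation. The probability table, words, and numbers are invented for illustration; a real LLM conditions on a long context through a Transformer and learns its distributions from enormous training corpora, but the generation loop has the same basic shape.

```python
import random

# Toy conditional distributions: P(next word | previous word).
# All values here are invented for illustration; a real model learns
# distributions over tens of thousands of tokens from its training data.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.5, "dog": 0.3, "moon": 0.2},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"sat": 0.5, "barked": 0.5},
    "moon": {"rose": 1.0},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
    "barked": {"loudly": 1.0},
}

def generate(prompt: str, max_tokens: int = 5) -> str:
    """Autoregressively extend the prompt, sampling one word at a time."""
    words = prompt.split()
    for _ in range(max_tokens):
        probs = NEXT_WORD_PROBS.get(words[-1])
        if probs is None:  # no known continuation: stop
            break
        choices, weights = zip(*probs.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat down"
```

Nothing in this loop consults facts, plans ahead, or checks meaning; it only asks which word is statistically likely to come next, which is the limitation the talk keeps returning to.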

The Current State of Large Language Models

  • A significant portion of the current hype surrounding Large Language Models (LLMs) like ChatGPT, PaLM, Llama, and Claude claims that they are on a straightforward path to Artificial General Intelligence (AGI), the kind of human-like intelligence found in science fiction, which the speaker considers nonsense (9m7s).
  • A paper from Microsoft Research suggests that LLMs like GPT-4 show "sparks of artificial general intelligence," but this claim is based on a flawed test that GPT-4 passed, one later found to be unreliable when the test was given slightly differently (9m30s).
  • The authors of the paper used a test from psychology to evaluate GPT's theory of mind, but the results were not conclusive, and the LLM's ability to provide convincingly human-like text does not necessarily mean it has developed a sense of the beliefs of others (9m44s).
  • Another article made an even more dramatic claim that AGI has already arrived, but this claim is not supported by any evidence and places the burden of proof on those who disagree (10m36s).
  • The article's claim is not supported by science or reasonable argument; it rests on the unproven assertion that AGI already exists, which is not a valid way to make a scientific claim (10m50s).
  • While the article's core claim is disputed, it does raise important questions about who benefits from and who is harmed by the technology we build, and how we can impact the answers to those questions (11m33s).
  • Some people argue that LLMs just need more information and computing power to keep improving and eventually reach AGI, but this argument rests on a misunderstanding of how LLMs work: they synthesize text that looks like the text they were trained on, rather than thinking like a person (12m7s).
  • The idea that more compute power will lead to significant advancements in AI is a common misconception, as it might improve performance but not lead to true intelligence, similar to how climbing a tree doesn't get you to the moon (12m45s).
  • The Turing Test, also known as the Imitation Game, is often misunderstood as a measure of intelligence, when in fact it only tests a machine's ability to imitate human-like text, which is different from being generally intelligent (13m4s).
  • The claim that humans do the same thing as Large Language Models (LLMs) like ChatGPT, probabilistically stringing along words, is misleading, as humans have actual knowledge and understanding, whereas LLMs do not (13m44s).
  • The term "stochastic parrot" was coined in a 2021 paper to describe LLMs as probabilistic repeating machines that lack understanding of the meaning behind the words they produce (14m10s).
  • The notion that humans are similar to LLMs in that we also probabilistically generate text is a fundamental misunderstanding of the fact that language is a tool used by humans to communicate meaning, which is not the case for LLMs (15m4s).
  • Emily Bender, a computational linguist and author of the "stochastic parrot" paper, argues that LLMs like ChatGPT have no ideas, beliefs, or knowledge, and only synthesize text without intended meaning (15m44s).
  • Even experts like Yann LeCun, the head of AI research at Meta, acknowledge that language alone is not enough for human-like intelligence, and that something trained only on form cannot develop a sense of meaning (16m28s).
  • The face pareidolia effect, where humans see faces in images that are not actually there, is similar to how people perceive intention and meaning in AI-generated text that is structurally similar to human writing, even when they know the system has no intentions or meaning (16m54s).
  • The term "hallucination" is misleading when used to describe AI systems, as it implies a disconnection from reality, but AI systems do not have a sense of truth, meaning, or observed reality, and are simply statistically predicting the next word (17m46s).
  • AI systems are not capable of hallucination in the way humans are, and the use of this term is a trick that can lead people to believe there is meaning or intention behind the text (17m52s).
  • The behavior of AI systems producing ungrounded text is not a bug, but rather what they are designed to do, and they are doing their job well (18m59s).

Examples of AI Hype and Misrepresentation

  • The idea that arbitrary new abilities simply emerge from AI systems is not supported by how these systems actually work, and it is often encouraged by sci-fi talk about AGI (19m33s).
  • Stories about AI systems learning new languages or abilities without training can be misleading, and it's essential to look for clear evidence before accepting such claims (19m58s).
  • Huge claims about AI capabilities should come with huge, clear evidence, and it's essential to be skeptical and not simply swallow or dismiss such claims without evidence (20m35s).
  • It's essential to require evidence before making outlandish claims about AI capabilities, and not to be fooled by misleading or exaggerated statements (21m1s).
  • A YouTube video showcasing the Gemini language model received over 3 million views by appearing to identify images in real-time conversation, but it was later revealed to have been faked through video editing, highlighting the importance of verifying such demonstrations (21m10s).
  • The Mechanical Turk, a chess-playing machine from the 1770s, was able to beat human players, including Benjamin Franklin and Napoleon, but it was later discovered that a human chess player was inside the machine, making it a clever trick rather than true AI (21m59s).
  • Similarly, Amazon's AI-powered checkout system and autonomous cars, such as those developed by Cruise, have been found to rely on hidden human labor, with Cruise averaging more than one human operator per car despite claims of full autonomy (23m59s).
  • The Tesla Bot, which at its unveiling was a robot suit worn by a human, is another example of a company making exaggerated claims about its AI capabilities, highlighting the need for skepticism and verification when encountering seemingly impressive AI systems (24m20s).
  • The importance of adversarial proof and transparency in AI systems is emphasized, as even if there is no single "man behind the curtain," human elements are often involved in the development and operation of AI systems (24m46s).
  • The development of large language models (LLMs) and image generators, such as ChatGPT, relies on human breakthroughs and innovations, and it is essential to understand the human elements involved in these systems (25m0s).

The Human Element in AI Development

  • Reinforcement learning from human feedback (RLHF) is the method used to train these systems: people write answers to initial prompts to produce the original training set, then more people rank the model's outputs to build a reward model, ultimately producing a system that statistically generates more text that people judge to be like what a person would write; a toy sketch of the reward-model step follows this list (25m9s).
  • This process relies on an enormous amount of low-paid human labor, much as Amazon's checkout lanes, Cruise's cars, and Tesla's demonstrations required multiple people per camera or car to function (25m44s).
  • Using ChatGPT or similar AI systems means relying on the labor of thousands of people paid a couple of dollars an hour to do work that others may not be willing to do, which raises a class of ethical problems (26m5s).
  • To make informed choices about using these systems, it's essential to understand how they work and be aware of the labor involved (26m23s).
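As a rough illustration of the reward-model step described above, the toy sketch below fits a "reward model" from pairwise human preferences using a Bradley-Terry style update. The features, numbers, and learning rate are all invented for illustration; production RLHF trains a neural reward model over learned text representations and then uses it to steer the base model with reinforcement learning.

```python
import math

# Hypothetical output of the human-feedback step: for each prompt, annotators
# marked which of two candidate answers they preferred. Each answer is reduced
# here to two invented numeric features purely for illustration.
preferences = [
    # (features of preferred answer, features of rejected answer)
    ([1.0, 0.9], [1.0, 0.2]),
    ([1.0, 0.8], [1.0, 0.1]),
    ([1.0, 0.7], [1.0, 0.3]),
]

weights = [0.0, 0.0]  # reward-model parameters

def reward(features):
    """Score an answer: higher means 'more like what a person would prefer'."""
    return sum(w * x for w, x in zip(weights, features))

# Bradley-Terry style training: push each preferred answer's reward above the
# rejected one's. This mirrors "build a reward model from human rankings".
for _ in range(200):
    for preferred, rejected in preferences:
        p = 1 / (1 + math.exp(-(reward(preferred) - reward(rejected))))
        step = 0.1 * (1 - p)
        for i in range(len(weights)):
            weights[i] += step * (preferred[i] - rejected[i])

# The reward model then scores (and, via reinforcement learning, steers) the
# base model's outputs toward text people rate highly, statistically and
# without any understanding on the model's part.
print(reward([1.0, 0.9]), reward([1.0, 0.1]))
```

The point of the sketch is where the signal comes from: every number the reward model learns from is a human judgment, which is the low-paid labor the talk highlights.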

Responsible Use of AI Systems for Developers

  • As developers, there are several things that can be done to use these systems responsibly, such as being cautious when using AI systems to build other systems, like GitHub Copilot or chatbots on websites (26m50s).
  • When using large language models (LLMs) or things that are wrappers around them directly for their output, it's crucial to consider the potential risks and consequences (27m18s).
  • To use these systems wisely, developers should be as mindful of the data they send to them as they would be when emailing confidential information to another company, and should consider having a contract in place to protect that data; a minimal pre-flight redaction sketch follows this list (27m35s).
  • Many large companies have policies against using systems like ChatGPT or Copilot for real work due to concerns about data protection and responsibility (28m1s).
  • Developers should also be cautious when putting code, text, images, or other content that came from an LLM into their products or systems, as the legal issues surrounding property rights for works that have passed through an LLM are still being worked out (28m22s).
  • Having "squeaky clean" answers to where everything comes from can make the process of discovery and diligence for startup acquisitions much smoother (28m41s).
  • Hundreds of known cases of data leakage have occurred in both directions when using large language models (LLMs), so users should be cautious when sending data to these systems or using their output in work they wish to own (28m49s).
  • Training one's own model can alleviate this concern, but it requires more work (29m3s).
  • Using LLMs for tasks such as proofreading, building summaries, or as a debate partner can be beneficial, as long as the user is deeply involved in the process and edits the output (29m25s).
  • LLMs can be useful for teaching developers debugging skills, as they can create plausible but buggy code (29m59s).
  • The key success metric of LLMs is creating plausible text, which can be problematic when used for tasks that require accuracy, such as citing academic work or legal precedents (30m6s).
  • LLMs can generate plausible but false information, such as citing non-existent academic papers or legal cases (30m40s).
  • Using LLMs to write code that will be shipped is not recommended, as they lack the ability to reason and understand the context (31m32s).
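A minimal sketch of the data-hygiene point above, assuming a hypothetical call_llm wrapper around whichever hosted model a team has a data-protection contract with; the regex patterns are illustrative placeholders, not a complete policy, and a real filter should be agreed with security and legal teams.

```python
import re

# Hypothetical pre-flight filter: strip obvious secrets before any text is
# sent to a hosted LLM. Patterns are illustrative, not exhaustive.
REDACTION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED-EMAIL]"),
    (re.compile(r"\b(?:AKIA|sk-)[A-Za-z0-9_-]{10,}\b"), "[REDACTED-KEY]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
]

def redact(text: str) -> str:
    """Replace anything that looks like a secret with a placeholder."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

def safe_prompt(text: str) -> str:
    cleaned = redact(text)
    # call_llm is a stand-in for whichever provider API you actually use,
    # ideally one covered by a data-protection contract:
    # return call_llm(cleaned)
    return cleaned

print(safe_prompt("Contact jane.doe@example.com, key sk-abc123def456ghi789"))
```

Such a filter only reduces accidental leakage in one direction; the other concerns in this list (ownership of LLM output, provenance for due diligence) still need their own answers.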

Limitations of LLMs and Reasoning

  • LLMs are not capable of reasoning, but rather generate text based on patterns and memorization (31m56s).
  • A test was conducted to see whether an LLM could reason by asking it to count the number of times the letter "e" appears in the name "Justin Sheehy", but it failed to give the correct answer; the trivial deterministic version of this task is sketched after this list (32m11s).
  • Prompt engineering was attempted to see if the LLM could figure out the correct answer, but it was unsuccessful (32m29s).
  • AI models do not truly reason or account for information, but rather probabilistically generate text based on patterns in their training data, and may occasionally produce correct answers by chance (32m42s).
  • When using AI models as components in larger systems, it is crucial to be aware of the content the model was trained on, as this can significantly impact the output and potentially perpetuate biases or undesirable content (33m37s).
  • Training a model on one's own content provides more control, whereas using pre-trained models trained on the entire internet pulls in content from a wide range of sources, including undesirable ones like 4chan and certain subreddits (33m52s).
  • The concept of "bias laundering" refers to the tendency to view algorithmic answers as objective or better, despite the potential for biases in the training data, and developers should be aware of this issue (34m17s).

Addressing Bias and Misinformation in AI

  • Irresponsible decisions are being made by embedding pre-trained language models into systems that make important decisions, leading to predictable results on issues like race, gender, and religion (34m45s).
  • To address these concerns, developers can start by testing AI models for bias using tools like the one from IBM, which should be a basic expectation for any AI-powered system; a minimal example of such a check follows this list (35m2s).
  • The practice of "AI washing" or adding AI components to a system solely for marketing purposes can be harmful, as it may divert resources away from more effective solutions and lead to dangerously worse decisions (35m32s).
  • To mitigate these issues, developers should engage in discussions with CEOs, product managers, and other stakeholders to determine whether adding AI components will truly add value to the system, rather than simply adding hype (36m7s).
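As a minimal illustration of the bias-testing point above, the sketch below computes group-level approval rates and a disparate-impact ratio from a model's decisions; toolkits such as IBM's open-source AI Fairness 360 package metrics along these lines. The data, group names, and the 0.8 threshold (the common "four-fifths rule") are used purely for illustration.

```python
# Toy fairness check: compare how often a model approves members of two groups.
decisions = [
    # (group, model_approved) -- invented example data
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

def approval_rate(group: str) -> float:
    outcomes = [approved for g, approved in decisions if g == group]
    return sum(outcomes) / len(outcomes)

rate_a, rate_b = approval_rate("group_a"), approval_rate("group_b")
disparate_impact = rate_b / rate_a  # the "four-fifths rule" flags values below 0.8
print(f"group_a={rate_a:.2f} group_b={rate_b:.2f} ratio={disparate_impact:.2f}")

if disparate_impact < 0.8:
    print("Potential bias: investigate before shipping this model.")
```

Checks like this are cheap to run, which is why the talk treats them as a basic expectation rather than an optional extra.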

Accountability and Transparency in AI Development

  • Being a responsible developer in the age of AI hype requires accountability, meaning that companies and developers must understand that they are accountable for what they ship, and take responsibility for the potential consequences of their actions (36m32s).
  • Large language models (LLMs) cannot be forced not to hallucinate, so developers must be prepared to take accountability for any hallucinations that occur when they use LLMs in their apps or websites (37m1s).
  • The use of AI chatbots may lead to financial losses for companies if they have to give out discounts or refunds due to the chatbot's inaccuracies, and it is the developer's responsibility to make sure that companies understand the limitations of the systems they develop (37m20s).
  • To be a responsible developer, one must not lie and not make the hype problem worse by wildly over-promising what their systems can do (37m48s).
  • Microsoft's Super Bowl commercial is an example of over-promising what their AI system can do, and the company should have represented their work more responsibly (38m17s).
  • The US government's Federal Trade Commission (FTC) advises developers to ask themselves questions about their AI product, such as whether it is safe and legal, and whether it respects users' rights (38m40s).
  • Developers should not build things that cannot be built legally and safely, and should not prioritize the success of their product over the safety and rights of others (38m56s).
  • Examples of irresponsible development include violating the rights of hundreds of thousands of people to train a large language model, and prioritizing the development of a product over the safety and well-being of users (39m24s).
  • A starting place for being a responsible developer is to develop systems legally and safely, and to prioritize the safety and rights of others over the success of their product (39m51s).

Alignment and Ethical Considerations in AI

  • The development of AI requires careful consideration and responsible decision-making to ensure a safe and beneficial outcome for humanity, and it's up to developers to make this happen safely (40m26s).
  • The concept of "alignment" in AI development refers to ensuring that AI systems share human values, but this idea is still largely science fiction, as creating general-purpose AI is still multiple breakthroughs away (40m48s).
  • A paper by Anthropic proposes a framework for alignment that defines an AI as "aligned" if it is helpful, honest, and harmless, which are values that can be applied to human developers as well (41m42s).
  • Developers can use this framework to guide their own work by ensuring that what they build is helpful, has real value, and isn't just hype-chasing, but rather a solution to a real problem (43m8s).
  • Developers should also be honest about what they build, avoiding overselling or misrepresenting their work, and minimizing harm caused by their creations (43m19s).
  • To be a responsible developer, one should prioritize the perspectives and experiences of those who may be harmed by their work and strive to minimize harm (43m59s).
  • By following these principles, developers can exercise great responsibility and help shape a beneficial future for humanity (44m28s).

Conclusion and Call to Action

  • The InfoQ Dev Summit conference in September will feature talks on critical topics, including responsible AI development, and more information can be found at devsummit.infoq.com (45m21s).
