Generally AI - Season 2 - Episode 1: Generative AI and Creativity
03 Oct 2024
AI-Generated Books
- A Dutch news magazine researched what percentage of books on the internet is written by AI, finding that 2% of the 323,000 titles tested were AI-generated; the share grew from 0.1% before the introduction of ChatGPT to 4.2% in April 2024 (42s).
- Many AI-generated books did not receive a rating higher than two stars, raising questions about how human authors with low ratings might feel in comparison (1m7s).
- Amazon has a publishing limit of three books per day per author, which is still considered an extremely high limit for human writers (1m30s).
- There have been cases of authors using AI to generate books, including one instance where a Reddit user received a book containing the AI prompt instead of the generated content (2m6s).
AI-Generated Music: The Rise of Suno
- The podcast "Generally AI" is discussing AI creativity, including AI-generated music, which has improved significantly since the podcast's first season (3m21s).
- The music generation model Suno has been used to create high-quality songs, including a "cat song" that is catchy and easy to dance to (4m7s).
- One person is releasing multiple albums a day on Spotify using Suno, all with cat-themed music, which may eventually lead to Spotify imposing a limit (4m42s).
- Using Suno as a dedicated music player, replacing normal streaming music with AI-generated music, is doable and fun, but becomes overwhelming after a while because of how catchy and poppy the songs are (4m57s).
- Generative AI can create music that sounds good but lacks variety, making it interesting at first but potentially repetitive over time (5m20s).
- An artist can use AI to enhance their creativity rather than replace it; one example of this is Google's Music AI, which was presented at Google I/O (6m17s).
- Music AI allows artists to mix different styles and instruments live, and can be used to create unique sounds and samples (6m22s).
- Marc Rebillet, an artist, demonstrated Music AI's capabilities and showed how it can be used to create new sounds and mix them with existing loops (6m25s).
- The idea of cooperation between humans and machines is an interesting aspect of AI-generated music, and can lead to new forms of creativity (6m56s).
- An AI-based sample generator can create an infinite number of possibilities for an artist's creativity, and can be used to generate samples that can be remixed and reworked (7m7s).
AI Sample Generators
- One tool that uses AI to generate samples is the AI Sample Generator on AISampoGenerator.com, which uses Meta's AudioGen model to create short samples of two to four seconds (7m44s).
- The AI Sample Generator allows users to select the style of sample they want, such as piano, funky melody, or metal rock strings, and can be used to create unique sounds (7m46s).
- The samples generated by the AI Sample Generator can be used in a sampling machine to create new beats and sounds (8m10s).
- The quality of the samples generated by the AI Sample Generator can be hit-or-miss, and it may take some experimentation to get good-sounding results (8m45s).
- The idea of using AI to generate samples is compared to digging through an infinite number of vinyl crates, and can be a powerful tool for artists (8m58s).
- One example of using the AI Sample Generator to create a song is by using a Teenage Engineering PO-33 K.O!, a sampling machine that can be used to create new beats and sounds (9m11s).
Challenges of AI Music Generation
- Some people appreciate the aesthetic of a do-it-yourself style, which can be catchy, but it's challenging to get clean results with AI-generated samples: the waveform may start too early or too late, and the buildup may not be ideal (10m12s).
- Audio generation can be too eager, producing excessive noise, and it's difficult to get a single, clean sound, such as a quacking duck making a single "quack" instead of multiple quacks (10m44s).
- The noisy sound generation might be due to the decoding process, such as when using a Transformer, which outputs a sequence of tokens that need to be decoded into audio waveforms, potentially introducing quantization noise (11m3s).
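The quantization point can be made concrete with a toy experiment (this is not any real model's decoder, just an illustration): coarsely quantizing a clean waveform, as a token-based decoder effectively does, measurably degrades the signal-to-noise ratio.

```python
import numpy as np

# Toy illustration: quantize a 440 Hz sine wave to 16 discrete levels
# (roughly 4-bit audio) and measure the noise this introduces.
sr = 16_000                                # sample rate in Hz
t = np.linspace(0, 1, sr, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)        # clean 440 Hz sine wave

levels = 16                                # coarse quantization grid
quantized = np.round(clean * (levels / 2)) / (levels / 2)

noise = quantized - clean
snr_db = 10 * np.log10(np.mean(clean**2) / np.mean(noise**2))
print(f"SNR after coarse quantization: {snr_db:.1f} dB")
```

Real audio codecs use far more levels (and learned codebooks), but the principle is the same: every discretization step throws away part of the waveform as noise.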
- Collaboration with a computer is not yet possible in the same way as with a human: it's not possible to provide feedback or ask for adjustments, such as making a bass sound longer or darker (11m21s).
- Generative AI supports this kind of iterative feedback for text input, but not yet for music, which is an area that needs improvement (11m41s).
- Shifting the pitch of a sound, such as a guitar or piano, can result in the same sound, which is an interesting phenomenon (11m57s).
- The ability to use AI to enhance creativity and jam with a computer could be a valuable resource for musicians, but it's currently not possible to find a tool that allows for this kind of collaboration (12m11s).
- A potential application of AI in music is to generate a drum beat or strings to accompany a bass line, which could be useful for musicians and potentially fit into an effect pedal (12m35s).
- A tool called Text-to-Sample, from Samplab.com, can be used to create samples from text, either as a standalone tool or as a plug-in whose output can be dragged into virtual drum kits (13m36s).
- The tool runs locally on a MacBook and uses the MusicGen model in the background, which is downloaded as a first step; the model's license is permissive, allowing it to be used in various tools (14m1s).
- The tool's value lies in its ability to generate AI samples, which can help avoid copyright infringement issues when using actual samples from songs, as clearing samples with artists can be complex and risky (14m24s).
- The Beastie Boys' albums, such as Paul's Boutique, are examples of music heavily reliant on samples, and using AI-generated samples can be beneficial for artists who build their music around samples (15m13s).
- A music sample generated using the tool is played, and its quality is discussed, with the suggestion that it could be a potential winner in the Eurovision Song Contest (15m52s).
Evaluating Generative AI Models
- The conversation is interrupted by an advertisement for the QCon San Francisco conference, where software leaders will share their experiences with emerging trends, including generative AI in production (16m40s).
- The discussion resumes, focusing on the challenge of measuring the quality of generative AI models, as they do not have a clear "ground truth" like traditional machine learning models, making it difficult to calculate error metrics (17m36s).
- Traditional machine learning models, such as classifiers and regression models, have well-established methods for measuring their performance using test data sets and expected outputs, but generative AI models lack a clear ground truth, making evaluation more complex (17m43s).
- Evaluating the creativity of generative AI models can be challenging, and one approach is to nudge the model towards creating something that already exists and then assess its performance (18m40s).
- In code generation, evaluating the quality of code can be done by checking if it compiles and passes unit tests, but the discussion will focus on language models like ChatGPT and text-to-image generation models like DALL-E and Midjourney (19m17s).
- Language models have an objective evaluation metric built into training: the loss, which measures the model's ability to predict the next token in a sequence of input tokens or words (19m51s).
- Language models are trained by predicting the next token in a sequence, and the loss function measures how good the model is at making this prediction (20m30s).
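A minimal sketch of that training objective, assuming a toy five-word vocabulary and invented probabilities: the loss is the negative log-probability the model assigns to the token that actually comes next, averaged over positions.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]

# Model's predicted distribution over the next token at each of three
# positions (each row sums to 1), and the tokens that actually occurred.
probs = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],   # after "the"  -> actual next is "cat"
    [0.1, 0.1, 0.5, 0.2, 0.1],   # after "cat"  -> actual next is "sat"
    [0.2, 0.1, 0.1, 0.5, 0.1],   # after "sat"  -> actual next is "on"
])
targets = [vocab.index(w) for w in ["cat", "sat", "on"]]

# Cross-entropy loss: mean negative log-probability of the true tokens.
loss = -np.mean(np.log(probs[np.arange(3), targets]))
perplexity = np.exp(loss)
print(f"cross-entropy loss: {loss:.3f}, perplexity: {perplexity:.3f}")
```

Perplexity (the exponential of the loss) is the common human-readable form: a perplexity near 1 means the model is nearly certain about every next token.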
- The scaling laws developed by OpenAI predict the loss metric of a language model based on its size, the size of the dataset, and the amount of compute used to train the model (20m41s).
- While the loss metric is a good proxy for evaluating language models, it is not the ultimate goal, as users care more about the model's ability to perform tasks like question answering (21m14s).
- Question answering can be evaluated using objective metrics, such as multiple-choice or true/false tests, and language models can be benchmarked using tests like the MMLU (Massive Multitask Language Understanding) benchmark (21m40s).
- Different language models have different strengths, and their performance can be compared using leaderboards, such as the one provided by Hugging Face, which shows the performance of various language models on different benchmarks (22m42s).
- The Open LLM Leaderboard allows users to submit their own large language models, which are then tested and ranked using automated benchmarks, with the top-ranked model at the time being created by David Kim (23m0s).
- Large language models can be used for tasks such as writing essays or summarizing documents, but evaluating their performance can be challenging due to the lack of a clear ground truth (23m38s).
- Metrics such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) can be used to calculate the similarity between two pieces of text by comparing the overlap of words (24m25s).
- BLEU is more focused on precision, measuring the fraction of generated n-grams that also appear in the ground truth, while ROUGE is focused on recall, measuring the fraction of the ground truth's n-grams that appear in the generated text (24m49s).
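The precision/recall distinction can be sketched with plain unigram overlap. This is only the core idea: real BLEU adds clipping of repeated n-grams, multiple n-gram sizes, and a brevity penalty, and ROUGE has several variants.

```python
def ngrams(words, n):
    """All overlapping n-grams of a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def overlap_scores(generated, reference, n=1):
    """Toy n-gram precision and recall, in the spirit of BLEU/ROUGE
    (no clipping, single n, no brevity penalty)."""
    gen = ngrams(generated.split(), n)
    ref = ngrams(reference.split(), n)
    precision = sum(1 for g in gen if g in ref) / len(gen)  # BLEU-like
    recall = sum(1 for r in ref if r in gen) / len(ref)     # ROUGE-like
    return precision, recall

p, r = overlap_scores("the cat sat on the mat",
                      "the cat lay on the mat")
print(p, r)  # 5 of 6 unigrams overlap in each direction
```

In this symmetric example precision and recall agree; they diverge when the generated text is much shorter or longer than the reference.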
- Another approach to evaluating similarity is to compare the meaning of two pieces of text using an encoder language model like BERT (Bidirectional Encoder Representations from Transformers), which can provide a vector representation of the semantics of a given text (25m30s).
- The BERT score can be used to measure the similarity between two pieces of text by comparing the vectors representing their meanings using a distance metric like cosine distance (25m57s).
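The semantic-similarity idea can be sketched with made-up low-dimensional vectors; in the real BERTScore, an encoder model supplies high-dimensional embeddings, but the comparison step is the same cosine computation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors: 1.0 means they
    point the same way, 0.0 means they are orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented 4-dimensional "sentence embeddings"; a real BERT-style
# encoder would produce vectors with hundreds of dimensions.
summary_a = np.array([0.9, 0.1, 0.3, 0.0])
summary_b = np.array([0.8, 0.2, 0.4, 0.1])   # similar meaning
unrelated = np.array([0.0, 0.9, 0.0, 0.8])   # different meaning

sim_close = cosine_similarity(summary_a, summary_b)
sim_far = cosine_similarity(summary_a, unrelated)
print(sim_close, sim_far)  # high vs. low
```

Unlike n-gram overlap, this scores paraphrases highly even when they share few exact words, because nearby meanings map to nearby vectors.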
- Evaluating text-to-image generation models is also challenging due to the lack of a clear ground truth, but automated metrics can be used, such as those based on the COCO (Common Objects in Context) dataset, which contains images with captions (26m48s).
- Text-to-image generation is a process where a model creates an image based on a given caption, similar to reverse captioning, and can be evaluated using metrics such as the Fréchet Inception Distance (FID) or the CLIP score (27m0s).
- The FID metric measures the similarity between the distribution of generated images and the distribution of ground truth images from a dataset, such as the COCO dataset (27m20s).
- The FID metric uses a pre-trained image classifier called Inception to create a vector representation of the images, which is then compared between the generated and ground truth images (27m42s).
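The FID computation itself fits a Gaussian to each set of feature vectors and measures the Fréchet distance between them. A minimal sketch, using random vectors in place of real Inception embeddings:

```python
import numpy as np

def matrix_sqrt(m):
    # General matrix square root via eigendecomposition; adequate here
    # since products of covariance matrices have nonnegative real
    # eigenvalues (small imaginary parts are numerical noise).
    w, v = np.linalg.eig(m)
    return (v @ np.diag(np.sqrt(w.astype(complex))) @ np.linalg.inv(v)).real

def fid(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians fitted to feature
    vectors: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2))."""
    covmean = matrix_sqrt(cov1 @ cov2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * covmean))

def stats(x):
    return x.mean(axis=0), np.cov(x, rowvar=False)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))   # stand-in "real" features
fake = rng.normal(0.5, 1.0, size=(1000, 8))   # shifted "generated" set

fid_same = fid(*stats(real), *stats(real))
fid_diff = fid(*stats(real), *stats(fake))
print(fid_same, fid_diff)  # near zero vs. clearly larger
```

Identical distributions score near zero; the more the generated distribution drifts from the real one, the larger the FID.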
- The CLIP score measures the similarity between the generated image and the text prompt, using a model that can compare the similarity between an image and its text description (28m42s).
- The CLIP score can be used to evaluate how well the generated image represents the prompt, but it may not be perfect as it has its own shortcomings, such as difficulty with counting (29m36s).
Human Evaluation of Generative AI
- Another way to evaluate generative AI content is to ask a real person for their subjective opinion, which can be done by showing the output of two different models and asking which one is better (30m7s).
- This head-to-head ranking approach, referred to as the "optometrist trick" ("better like this, or like this?"), can be used to rank one model against another, even if it doesn't provide an objective score for a particular model (30m50s).
- The concept of ranking models, similar to the Elo chess rating system, is used to evaluate the performance of different models, including chatbots, by having human judges compare their outputs side by side (31m0s).
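The Elo update itself is simple; a sketch with two hypothetical chatbots, using the standard chess formula:

```python
def elo_update(rating_a, rating_b, a_wins, k=32):
    """One Elo update after a head-to-head comparison: A's expected
    score follows a logistic curve in the rating difference, and both
    ratings move in proportion to (actual - expected)."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# Two hypothetical chatbots start at 1000; a human judge prefers A.
a, b = elo_update(1000, 1000, a_wins=True)
print(a, b)  # 1016.0 984.0 -- an even matchup moves each rating by k/2
```

Beating an equally rated opponent moves each rating by k/2; beating a much stronger one moves them further, which is what lets many noisy pairwise votes converge to a stable leaderboard.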
- This method is also used in research papers to compare the performance of different models, where human judges are asked to rank the outputs of different models to determine which one is better (31m51s).
- The idea of ranking model outputs has another application beyond just evaluating models, which is reinforcement learning from human feedback (RLHF), used by OpenAI to fine-tune GPT-3 and GPT-4 (32m40s).
- RLHF is used to solve the problem of model outputs not being aligned with the user's intent, by creating a fine-tuning dataset where human judges rank the outputs of a language model (33m23s).
- The fine-tuning dataset is collected by giving a prompt to the language model, generating several outputs, and then having a human judge rank them (33m38s).
- The goal of RLHF is to fine-tune the language model to generate text that people like, and it can also be used to automate the process of evaluating model outputs (33m50s).
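The podcast does not spell out the training objective, but a common formulation for learning a reward model from such human rankings (used, for example, in the InstructGPT line of work) is a Bradley-Terry-style pairwise loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry-style pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected). It is small when the reward
    model scores the human-preferred output higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Hypothetical reward-model scores for two outputs of the same prompt.
loss_good = preference_loss(2.0, 0.5)  # ranking respected -> low loss
loss_bad = preference_loss(0.5, 2.0)   # ranking violated -> high loss
print(loss_good, loss_bad)
```

Minimizing this loss over many ranked pairs produces a scalar reward model, which the reinforcement-learning step then uses in place of a human judge.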
- Researchers are exploring the use of smaller language models to evaluate the outputs of other models, by having a model like GPT-4 evaluate the generated text (34m33s).
- This approach raises the question of whether a model can effectively evaluate its own outputs, or whether that amounts to "inspecting its own meat", a Dutch idiom roughly equivalent to "grading its own homework" (34m50s).
- The ideal scenario for AI is when it helps humans achieve their goals, and the best way to evaluate AI is through human judges who can assess the output and decide if it's satisfactory (35m25s).
Animal Intelligence and AI
- A recent BBC headline discussed the consciousness of animals, including a researcher's claim that bees can count, recognize human faces, and learn to use tools (36m6s).
- Bees likely have fewer neurons than GPT-4, but it's still unclear how far they can count, with some research suggesting that crows can count up to seven (36m19s).
- Researchers test a crow's counting ability by placing a hut near its nest and having people walk in and out, observing the crow's behavior when the number of people exceeds seven (36m45s).
- Clever Hans, a German horse, was known for its ability to perform tricks and calculations, but it was later discovered that the horse was simply interpreting its owner's movements (37m25s).
- This story serves as a reminder that sometimes the output of a large language model can be misinterpreted as intelligent, when in fact it's just a person interpreting the output in a creative way (37m57s).
Historical Context: Mechanical Computers and AI
- The Norden bombsight, a mechanical computer used in World War II, was a secretive device that could control an airplane and perform calculations to release bombs on the correct trajectory (38m27s).
- The Norden bombsight was considered a wonder weapon, and its secrecy was maintained even in the face of danger, with crews prioritizing its removal from the plane in emergency situations (38m57s).
- Similarity metrics come up again with the observation that "rouge" and "bleu" are the French words for red and blue, and the possibility that whoever named ROUGE chose it deliberately to echo BLEU (39m33s).
AI and Acronyms
- Large language models can generate what an acronym could stand for, given the word as output, a trick that has been tested with ChatGPT (39m45s).
- The effectiveness of this trick is discussed, with the conclusion that it works well (39m56s).
Podcast Conclusion and Future of AI
- The start of the second season of the podcast is announced, and listeners are encouraged to tell their friends about it, as rating podcasts can be difficult to do on popular platforms (40m2s).
- The hosts mention that they can be followed on various media platforms, but note that these platforms are changing rapidly (40m44s).
- The possibility of ChatGPT-generated podcasts is discussed, and it is noted that such podcasts probably already exist (41m3s).
- A Dutch book website is mentioned, where searching for a biography of Oppenheimer yields only AI-generated biographies, and not the actual biography (41m38s).
- The book "American Prometheus" is mentioned as a good biography of Oppenheimer, and it is noted that the movie was based on or influenced by this book (41m55s).
- The AI-generated biographies on the Dutch book website are discussed, with the observation that they are often cheap, have no ratings, and are not distinguished from human-written books by readers (42m9s).
- The top categories for AI-generated books are revealed to be health, self-help, management, and personal development, among others (43m19s).
- The hosts express concern about the implications of AI-generated books, particularly in the health and self-help categories (43m40s).