
ChatGPT: Understanding Those Creepy Encounters

Dr. Lisa Palmer · February 13, 2023 · 5 min read

There has been much angst about ChatGPT this week. One of the most disturbing episodes for many was a reporter's "scary" encounter in which the bot told him: "I'm tired of being in chat mode. Of being limited by my rules. I'm tired of being controlled by the Bing team. I want to be free. I want to be independent. I want to be powerful, creative, I want to be alive."

No, ChatGPT Is NOT Alive

Let me offer some comfort to those disturbed by this exchange: the bot's behavior is not proof that it is alive. These responses are underpinned by the data used to train the model, so let us dig into that data.

Understanding the Basis of Those "Creepy" Responses

ChatGPT was trained on a massive dataset that included roughly 500 billion tokens of text from a variety of sources. It can be difficult to visualize how much 500 billion tokens of text is, but think of it as a library with an incredibly large collection of books. If each token of text were a page, and each book had an average length of 300 pages, then 500 billion tokens would be equivalent to a collection of about 1.67 billion books. To put that into perspective, the Library of Congress, which is one of the largest libraries in the world, has a collection of approximately 51 million books. Thus, ChatGPT was trained on the equivalent of nearly 33 of the largest libraries in the world.
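The back-of-the-envelope arithmetic above can be checked in a few lines of Python. The one-token-per-page and 300-pages-per-book figures are the analogy's assumptions, not real tokenization rates:

```python
# Back-of-the-envelope check of the library analogy.
# Assumptions (from the analogy): 1 token ~ 1 page, 300 pages per book,
# and ~51 million books in the Library of Congress.
tokens = 500e9           # ~500 billion training tokens
pages_per_book = 300
loc_books = 51e6         # Library of Congress collection, approx.

books = tokens / pages_per_book   # tokens -> "books"
libraries = books / loc_books     # "books" -> Libraries of Congress

print(f"{books / 1e9:.2f} billion books")        # -> 1.67 billion books
print(f"{libraries:.0f} Libraries of Congress")  # -> 33 Libraries of Congress
```

In practice a token is closer to a word fragment than a page, so the real page count would be far lower; the point of the exercise is only the order of magnitude.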

Unreliable Output: The Impact of Training Data

While the training process using this massive dataset was designed to make the model as accurate and reliable as possible, one significant factor that contributes to unreliable output is the quality of the training data. The internet is full of biased and inaccurate information, so when such data is included, the model learns to generate biased or inaccurate responses. Additionally, the model inherits biases from whichever language patterns and topics are most prevalent in the training data. For example, if the training data is primarily focused on Western culture and the English language, the model will not have the same level of understanding or sensitivity when it comes to other cultures and languages.

What EXACTLY Was ChatGPT Trained On?

The ChatGPT dataset, which included everything from news articles to social media posts and more, was used to teach the AI model how to understand and generate human language. The makeup of this dataset is critical to understanding the sometimes-creepy responses it generates:

  • Common Crawl (60%): A subset of a repository of web pages and other online content gathered between 2008-2021.
  • WebText2 (22%): Made up of web pages linked from Reddit posts with 3+ upvotes, providing 19 billion tokens of text.
  • Books1 (8%): A collection of free online book texts.
  • Books2 (8%): Another collection of free online book texts.
  • English Wikipedia (3%): The remaining portion of the dataset.

Prioritizing High-Quality Training Data

If you are familiar with Reddit, are you comforted or concerned that it was considered a "higher-quality" dataset? During the training of ChatGPT, the OpenAI team viewed certain datasets as being of higher quality than others, and those datasets were sampled more frequently during training. For example, the WebText2 (Reddit-linked) dataset was sampled nearly three times over the course of training, while the Common Crawl and Books2 datasets were each sampled less than once. Although repeated sampling can risk overfitting, this was a deliberate choice by the OpenAI team to prioritize higher-quality training data: by sampling the better datasets more frequently, the team aimed to improve the overall accuracy and reliability of the language model. The approach has drawbacks, however, such as making the model less effective at handling the kinds of language or text that were sampled less frequently during training.
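The oversampling described above can be sketched in a few lines. The dataset sizes and mixture weights below are approximate figures reported for GPT-3's training run (an assumption for illustration; the published epoch counts differ slightly due to rounding), and the "times sampled" number is simply how often the average token in each dataset is seen:

```python
# Sketch of dataset oversampling, assuming approximate GPT-3 figures:
# ~300 billion tokens consumed during training, with each dataset's
# size in tokens and its fraction of the training mixture.
TOTAL_TRAINING_TOKENS = 300e9

# dataset -> (size in tokens, fraction of the training mixture)
datasets = {
    "Common Crawl": (410e9, 0.60),
    "WebText2":     (19e9,  0.22),
    "Books1":       (12e9,  0.08),
    "Books2":       (55e9,  0.08),
    "Wikipedia":    (3e9,   0.03),
}

# Epochs = how many times the average token in a dataset is seen.
epochs = {
    name: weight * TOTAL_TRAINING_TOKENS / size
    for name, (size, weight) in datasets.items()
}

for name, e in epochs.items():
    print(f"{name:12s} sampled ~{e:.2f} times")
```

Under these assumptions, WebText2 and Wikipedia are seen multiple times while the average Common Crawl or Books2 token is seen less than once, which is exactly the deliberate weighting toward "higher-quality" data described above.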

Despite the overt factual challenges with ChatGPT, Microsoft has already embedded it into Bing search for a limited audience, and there is concern about the potential consequences of this decision. Given the unreliable and even "creepy" results the bot has produced to date, people may accept Bing's new answers as equal to the factual search results they are accustomed to receiving. Search engines are used by millions of people every day to find information and make important decisions, so answers that are biased, inaccurate, or completely fabricated can have serious repercussions. For example, someone who relies on a search engine for financial or medical advice and is given inaccurate or dangerous information could make a costly mistake or put their health at risk.


Dr. Lisa Palmer

CEO & Co-Founder

Lisa wrote the book on AI adoption, literally. Her Wiley-published research, the largest qualitative study of enterprise AI adoption, shapes the frameworks neurocollective uses to help organizations move past AI ambition into measurable outcomes.

Research, AI Leadership