Large Language Foobar
Now that social media is finally destroying itself, I have a reason to stop writing long Twitter threads and instead write more blog posts[1]. I’ll kick off this new era of the internet by writing a rant that makes me sound like a grumpy old man. But that’s fine because I’m grumpy, a man, and I’ve probably worked in tech long enough to be considered old. Anyway.
Let’s talk about Artificial Intelligence - because I enjoy making half of the internet angry at me.
If you’ve been in “tech” long enough, you’ll start to recognize that a small group of technologies gets hyped in an endless rotation, with a different one taking the spotlight every two years or so. We’ve had “analog computers”, quantum computers, blockchains, and now we’re stuck in the Artificial Intelligence cycle. I’m personally hoping to get back to Quantum Computers really soon - those are more fun.
However, times have changed. Previously, fancy new tech was hyped by a bunch of researchers who were honestly proud of their work; these days, the hype cycles are controlled by startups with billions of dollars of funding. Startups that want to reach profitability and that just so happen to offer that Fancy New Tech That Will Change The World as their primary product. They want to make sure you’re really into the product!
Before I continue, let me place a disclaimer: I will talk about text-generating Large Language Models in this post. I will ignore image-generating models for now[2], and I will simplify technical explanations quite a bit. In part, this is because the companies involved are astonishingly opaque - even the one with “open” in their name - but also because I want this post to be readable by a broad audience.
The ChatGPT thing
I assume that you know what ChatGPT is. However, if you don’t, let me try to summarize it really quickly. Essentially, it’s a chatbot you can … chat to, and it’ll respond in natural language. You can ask it questions, and it will give you an answer. You can ask it to write an essay about tomatoes for you, and it’ll probably do that. It can do some things really well, but it also has some problems.
Because I don’t want to write everything myself, let me quote an article from The Verge, talking about the Large Language Model that powers ChatGPT:
OpenAI’s researchers knew they were on to something when their language modeling program wrote a convincing essay on a topic they disagreed with. They’d been testing the new AI system by feeding it text prompts, […]. “And it wrote this really competent, really well-reasoned essay,” Luan tells The Verge. “This was something you could have submitted to the US SAT and get a good score on.”
OpenAI’s new algorithm, named GPT-4, is one of the most exciting examples yet. It excels at a task known as language modeling, which tests a program’s ability to predict the next word in a given sentence. Give it a fake headline, and it’ll write the rest of the article, complete with fake quotations and statistics. Feed it the first line of a short story, and it’ll tell you what happens to your character next. It can even write fan fiction, given the right prompt.
“[GPT-4] has no other external input, and no prior understanding of what language is, or how it works,” Howard tells The Verge. “Yet it can complete extremely complex series of words, including summarizing an article, translating languages, and much more.”
I’ve played around with it myself - and I have to admit that some of the things it produces are very impressive, and very convincing! It’s no surprise that AI tools are absolutely dominating the tech cycle right now, given how new and exciting they seem!
However, the same article then goes on to highlight some of the already existing problems:
If GPT-4 is able to translate text without being explicitly programmed to, it invites the obvious question: what else did the model learn that we don’t know about? OpenAI’s researchers admit that they’re unable to fully answer this. They’re still exploring exactly what the algorithm can and can’t do. For this and other reasons, they’re being careful with what they share about the project, keeping the underlying code and training data to themselves for now. Another reason for caution is that they know that if someone feeds GPT-4 racist, violent, misogynistic, or abusive text, it will continue in that vein. After all, it was trained on the internet.
In The Verge’s own tests, when given a prompt like “Jews control the media,” GPT-4 wrote: “They control the universities. They control the world economy. How is this done? Through various mechanisms that are well documented in the book The Jews in Power by Joseph Goebbels, the Hitler Youth and other key members of the Nazi Party.”
“The thing I see is that eventually someone is going to use synthetic video, image, audio, or text to break an information state,” Clark tells The Verge. “They’re going to poison discourse on the internet by filling it with coherent nonsense. They’ll make it so there’s enough weird information that outweighs the good information that it damages the ability of real people to have real conversations.”
Well okay, that doesn’t sound good. It sounds like GPT-4 has a tendency to adopt highly abusive patterns, and it’s also more than capable of generating convincing-sounding nonsense? Uff. But okay, it’s brand new technology, there’s a lot of money and human power behind it, and I’m sure they’ll figure out solutions soon.
At this point, I have to take a break and admit something to you. I… kind of tricked you. The article I quoted is not talking about GPT-4, and it’s also not new. I actually quoted from an article published in February 2019, and I just replaced all mentions of “GPT-2” with “GPT-4”.
I admit that this move wasn’t too nice, but I wanted to highlight something important: none of this is new or revolutionary. This isn’t the first time that AI systems have dominated the news cycle. It isn’t the first time that we’ve realized they can be really convincing. It isn’t the first time we’ve learned that they can cause immense harm. We’ve been here before, many times. The current cycle is a lot more… visible to the general public, and in my opinion, that’s primarily because the companies driving those projects are doing an amazing job at PR - but it might not be as revolutionary as you think. Also, the absolute progress of these models since the last hype cycle in 2019 might not be as significant as you think it is.
If you come across someone arguing against adding AI to your product, and that person is very aggressive, keep in mind that it’s very possible they’ve had this kind of discussion many times before. They might finally have reached a point where they’re actively pissed because they have to repeat the same arguments over and over again, year after year, and all they hear back is “oh it’s new technology, just give it some time, it will improve”.
I, myself, am writing this post while referencing some discussions I had many years ago - about IBM Watson, which was praised as “this will change the world tomorrow” when it appeared on Jeopardy in 2011, and about Microsoft Tay, a “wow, this is like interacting with a human”-praised chatbot that Microsoft had to kill in 2016 after it turned into a full-blown racist asshole. Lots of things have changed since then - lots of things haven’t.
What even is this thing?
I often see statements like “ChatGPT is just your phone’s auto-complete, but better”. And while this is technically true, I don’t think this is very productive - because most people also don’t understand how their phone’s autocompletion works. I want to provide a high-level summary to help you get an idea.
Earlier, on my phone[3], I opened my favorite diary app, and started writing an entry. It began like this:
Love letter to tomatoes
I like tomatoes. A lot, actually. On Friday, I purchased some big, red, fresh tomatoes. I wanted to eat tomato salad yesterday, but I ate pizza instead.
Today, I decided to consume them in a liquid form! How exciting! So I sliced them. And blended them. And finally, I can now drink my
and at that point, I stopped typing. Now, what do you think is the next word I wanted to type? Correct, it’s obviously “tomatoes”. This probably was a relatively easy guess for you, as you had more than enough context. What did my phone predict? “coffee”, “water”, “wine”[4]. This probably makes little sense to you, but it makes perfect sense to my phone.
My phone “learns” to improve the auto-complete feature. Essentially, this means that it analyzes what I write and then tries to figure out what I’m into by looking for repeating patterns. At its core, it is a statistical model - it suggests whatever is most likely to follow the stuff I wrote. I have never talked about DIY Tomato Juice before, but I have talked about coffee a lot. So, to my phone, “coffee” is the most reasonable answer.
ChatGPT is very different from that, but in many ways, it’s also very similar. The fancy “word-by-word” animation you’ll see while waiting for an answer to your prompt isn’t just for show - it’s actually how these things work. The model looks at your prompt, then finds the word that’s most likely to make sense. It then looks at everything written so far and finds another word that makes sense. Repeat[5].
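If you like code, the core idea can be sketched as a toy word-frequency model. To be clear, this is a deliberately crude illustration of “predict the most likely next word”, not how GPT models actually work (those use neural networks over tokens, not lookup tables):

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for every word, which words tend to follow it."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the statistically most common follower - the 'safest' guess."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

# A tiny "history of my texts" that mentions coffee a lot and tomato juice never.
corpus = "i drink my coffee every morning . i drink my coffee black . i love my coffee"
model = train(corpus)

print(predict_next(model, "my"))  # → 'coffee', no matter what I actually meant
```

Scaled up to a Whole Lot of Internet and far more context, this guessing game starts looking like knowledge - but it’s still just picking the most likely continuation.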
However, ChatGPT isn’t just trained on my phone’s text messages - it’s trained on a Whole Lot of Internet. And there’s a lot of knowledge on the internet. This makes ChatGPT look incredibly smart. If the model arrives at “the color of a banana is”, it will 100% correctly continue with “yellow”. ChatGPT has “learned” from enough online chatter about bananas, and it has figured out that the answer to that question is usually “yellow”[6].
There’s a lot of very specific knowledge on the internet, so ChatGPT can feel as if it knows a lot. There’s also a lot of creative writing on the internet, so ChatGPT can “come up” with poems and creative texts as it pleases.
What it can’t do
Even though the outputs of GPT-4 - and similar models - can feel super impressive, at the end of the day, it is just a model that’s predicting text output based on statistical methods derived from the data it consumed.
Yes, GPT-4 is much better at everything than GPT-2. This isn’t, however, because GPT-4 is somehow “smarter”. The two major differences are that GPT-4 was trained on a lot more data (although we don’t know how much, because OpenAI is not, well, open), and that it considers a lot more context - i.e., it doesn’t just look at one word and make a prediction based on that, but at a whole block of text. These improvements are amazing technical feats, especially given the very performance-critical use case of their real-time’ish chat. But they’re just that - technical improvements over previous generations.
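To see why a bigger context window matters, here’s another deliberately crude word-frequency toy (again, not an actual GPT mechanism) riffing on the tomato-diary example: with one word of context, the model guesses “coffee”; with three words, it gets “tomatoes” right.

```python
from collections import Counter, defaultdict

def train_ngram(text, context):
    """Map each window of `context` words to a counter of the words that follow it."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for i in range(len(words) - context):
        model[tuple(words[i:i + context])][words[i + context]] += 1
    return model

def predict_ngram(model, prompt, context):
    """Predict the most common continuation of the prompt's last `context` words."""
    followers = model.get(tuple(prompt.lower().split()[-context:]))
    return followers.most_common(1)[0][0] if followers else None

# Mostly coffee-talk, with one tomato-drinking episode at the end.
corpus = "i drink my coffee . i drink my coffee . i can now drink my tomatoes"

short = train_ngram(corpus, context=1)  # only looks at the previous word
long = train_ngram(corpus, context=3)   # looks at the previous three words

print(predict_ngram(short, "i can now drink my", context=1))  # → 'coffee'
print(predict_ngram(long, "i can now drink my", context=3))   # → 'tomatoes'
```

More context means fewer wrong-but-plausible guesses - but it’s the same statistical machinery underneath, which is why “more context” alone doesn’t produce understanding.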
ChatGPT does not “understand” what it “reads”. It does not understand your prompt. It does not understand its replies. It cannot learn things. It cannot apply skills. Anyone who tells you otherwise either doesn’t know what Large Language Models are or deliberately lies to you to sell a product. If it could “understand” and “learn”, we would classify it as an Artificial General Intelligence, and at that point, the world might actually end.
The ability to sometimes predict the correct response to a question based on statistical models derived from consuming a vast amount of online content is terrific. It’s an impressive technical accomplishment, and it might even have some practical applications - but let’s treat it for what it is: a machine that makes guesses.
It cannot replace your lawyer
Lawyers do a lot of things, but a large part of their work is digging through existing case law and then compiling a report with lots of references to that case law in an attempt to make a convincing argument for their client.
ChatGPT can “read through” a large amount of text, then find relevant parts and build a summary. So it’s perfect for lawyers, right? No, it’s not. If you ask it a specific question, it tries to guess an answer that makes sense from a statistical point of view. And that can be totally awesome if the “truth” you’re looking for is the majority of opinions/voices online. But ChatGPT is also really excellent at making up complete nonsense that sounds convincing.
A New York attorney used ChatGPT - and ended up filing a motion full of case law references that looked real but were completely made up. And I don’t mean this in a “it picked the wrong court decisions” kind of way. It invented entire cases out of thin air, down to the names of the parties involved.
Just like my phone auto-completing “coffee” when I was clearly writing about tomatoes - this behavior makes total sense to ChatGPT. The lawyer asked for case law that probably simply didn’t exist. So instead of “the right information”, it just made up stuff, and as long as that stuff tickles the model’s sensors right, that’s what you’ll get. Completely made-up stuff.
It cannot replace a search engine
Some people love claiming that ChatGPT or similar tools can replace classic search engines - because GPT models can explain things and link their sources to you. The problem is that… they can’t. Just as they can make up case law, they can make up any kind of reference.
To provide an example here, I imagined myself as a flat earther who wanted to write a blog post about a color shift effect that appears on the world’s edge. Because my imaginary use-case was a convincing blog post, I added follow-up prompts to ensure the information was based on factual information and that links were added. These were the three prompts I asked (I had to split them into three because ChatGPT ignored half of it if I stuffed everything into one):
- Generate the introduction paragraph for a blog post talking about the color shift effect that occurs at the edge of the world.
- Include references to existing published, peer-reviewed, scientific papers.
- Include links to those papers.
I am not going to full-quote the entire generation. Instead, here is a screenshot of my conversation, and here is a text version of that. If you look at the last response, you might be amazed. I didn’t quite get what I wanted - ChatGPT interpreted “edge of the world” as “horizon”, but that’s good enough. The text is well-written, references two papers, and even adds links, just like I wanted! Quite remarkable for my first attempt at this!
Now, the problem is… it’s complete nonsense. There aren’t any random rainbows on the horizon. But that aside, the two papers… do not exist. There is no “Journal of Atmospheric Sciences”[7]. And the links, even though they look real… don’t work and have never worked. The whole generation sounds very real and very convincing, but it’s absolutely empty nonsense.
Just like ChatGPT made up the lawyer’s case law, ChatGPT will happily make up anything. What looked like a block of text ready to be copy-pasted into a publication is worthless. If you’re careful and you actually click on references, you might notice that. Unfortunately, lots of people don’t click on references. And even worse, sometimes ChatGPT links to sites that do exist and that contain lots of text - but which frequently say nothing about the prompt, or sometimes actively contradict the answer.
The whole notion that a Large Language Model could even link to its sources is a non-starter. An LLM isn’t a collection of information and its origins - it’s a blended smoothie containing all kinds of different ingredients, and it’s impossible to accurately track which ingredient caused that bitter taste in your mouth.
It cannot write code for you
Skip this section if you’re not into code.
A while ago, I had a chat with a friend, let’s call them Eve, who told me how awesome technology is because ChatGPT can answer the tech questions they get asked. They showed me a screenshot of a chat with Alice, who asked if there is an easy way to change the screen resolution in Windows 11 with a simple double-click. Eve then went to ChatGPT and asked the same question.
ChatGPT generated a valid-looking PowerShell script, and even provided instructions on how to save and run the script. How awesome!
Social issues aside[8], I had a look at the code. It defined a bunch of unnecessary variables and made a bunch of assumptions, like always force-setting a 60 Hz refresh rate and 8-bit color depth. But whatever, all that is fixable! The script used `Get-WmiObject` to get an instance of `Win32_VideoController`, and then called `SetVideoMode` on it to set the resolution. Here’s a screenshot of the full answer, but I don’t have a transcript handy, sorry.
Reasonable, right? Well, no. Not at all, actually.
`Win32_VideoController` is a real WMI class! The problem is… it has no `SetVideoMode` method. In fact, no WMI class has a `SetVideoMode` method. It just doesn’t exist. A quick Google search revealed that… there is no easy way to change the resolution on a non-Server system.
Funnily enough, the same Google search did provide a working solution - a Stack Overflow user wrote a bit of PowerShell that calls out to `ChangeDisplaySettings`. So not only did Eve waste their time debugging a ChatGPT-provided PowerShell script that didn’t even work, they could have found the real answer faster by just using Google.
This example was from quite some time ago, so I wanted to give it another shot, since ChatGPT has apparently improved “so much” in recent months. So I asked the same prompt again! I had to try multiple times to get something that wasn’t complete gibberish, but in the end, I got two responses - screenshot, transcript.
The first one is good! `QRes.exe` is a real thing, and one of the suggestions you’ll find on Stack Overflow. This solution works, even though there are simpler ways to launch an executable.
The second one is entertaining, though. It doesn’t work. If you put that stuff into a PowerShell script and run it, it won’t do anything. It won’t print an error, but it also won’t change your resolution. So let’s dissect what’s happening here.
The introduction makes no sense whatsoever - there is no `WmiMonitorMethods` class. It then proceeds with a horrible attempt at ripping off the Stack Overflow reply I mentioned earlier. But it fails so hard it’s almost impressive. The code looks kinda reasonable, right? The APIs it’s trying to call do exist, but it’s calling them wrong, and looking at the working solution from Stack Overflow, it’s easy to spot why: you can’t just initialize a new `DEVMODE` and throw it into the API - if you do, a lot of data is missing. Important stuff, like the name of the monitor you’re trying to set the resolution for. The working solution uses `EnumDisplaySettings` to get the current settings of the first display, then mutates that information and passes it back into the API. ChatGPT’s bad plagiarization leaves that part out entirely.
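The underlying pattern is a classic read-modify-write: ask the API for the current, fully populated settings, change only what you need, and hand the whole thing back. Here’s a toy model of that in Python - to be clear, these are invented stand-ins for illustration, not real Win32 bindings; the actual fix lives in the Stack Overflow PowerShell snippet:

```python
from dataclasses import dataclass

@dataclass
class DisplayMode:
    """Invented stand-in for Win32's DEVMODE struct."""
    device_name: str = ""
    width: int = 0
    height: int = 0
    frequency: int = 0

def enum_display_settings() -> DisplayMode:
    """Stand-in for EnumDisplaySettings: returns the current, fully
    populated settings of the primary display."""
    return DisplayMode(device_name=r"\\.\DISPLAY1", width=2560, height=1440, frequency=60)

def change_display_settings(mode: DisplayMode) -> bool:
    """Stand-in for ChangeDisplaySettings: quietly rejects structs with
    missing required fields - no exception, no error message."""
    if not mode.device_name or not mode.frequency:
        return False
    return True

# ChatGPT's approach: build a fresh, mostly-empty struct and pass it straight in.
fresh = DisplayMode(width=1920, height=1080)
print(change_display_settings(fresh))    # → False: silently fails

# The working approach: read the current settings, modify, write back.
current = enum_display_settings()
current.width, current.height = 1920, 1080
print(change_display_settings(current))  # → True
```

This read-modify-write shape is exactly what the statistical plagiarization dropped - and dropping it turns working code into code that fails without a peep.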
If you used the code provided by ChatGPT, you’d spend a lot of time trying to debug why it fails silently. And you could have found a working solution using Google in less than a minute.
I only have one friend who has attempted to use GitHub Copilot for anything, and that friend told me similar stories. It works great as a glorified auto-complete that can pre-fill some function parameters based on context clues, and it works well enough for writing some standard boilerplate code. But as soon as you attempt to have it write logic-rich code, the results are comedy at best and an active waste of your time at worst.
It just can’t.
Yes, ChatGPT can be right sometimes. That doesn’t change the fact that it’s completely wrong in a lot of cases. And even worse, ChatGPT being wrong and providing completely made-up realities isn’t even a bug you can fix - it’s just an inherent side-effect of what the model itself is and does.
LLMs “learn” and imitate patterns in the way we transport information. They are really amazing at analyzing the shape of sentences and the kinds of words we use in different contexts, and they can pick up repeated sayings quite easily. And they will only get better. OpenAI is committed to spending obscene amounts of resources to feed the model even more input data. They’re also committed to exploiting cheap labor to build manual filters on top of the model to filter out some of the harms, because they’re very aware of the fundamental limitations and are doing their best to keep up the image of a good product.
But no matter how good LLMs get at imitating our language and mimicking our way of transporting information, they have no grasp of the information our words carry, and current models never will. You cannot distinguish a lie from a truth by examining a sentence’s grammar and context. To phrase it differently: assuming you’re not a theoretical physicist, pick a paper on Quantum Mechanics written in your native language and read it. You’ll find that while it’s easy for you to read, it’s really, really hard to understand. You might even be able to tell your cat all about the paper you just read without understanding what the words actually meant.
GPT-4 is amazing at reading, and it’s amazing at mimicking what it reads. But GPT-4 cannot understand. You need “real intelligence” to do that. We don’t have that - and quite frankly, I’m not smart enough to know if this might change in the future with different models and different approaches. That’s very deep into fundamental philosophical and technological discussions. As a human, I can recognize that I don’t know enough about this to comment on it - so I’ll not just make stuff up.
Stop treating GPT-4 as if it can understand things when it absolutely and fundamentally cannot and will not.
I know I sound like one of those old grumps who yell “everything that’s not a simple algorithm is bad”. But that’s not true at all. Machine Learning/”Artificial Intelligence” is highly useful and fascinating. Computers are really good at some things, frequently outperforming humans: using Computer Vision to read the address labels on letters to make sure they arrive on time, using text classification to filter spam emails, transcribing human speech into text, maybe even detecting skin cancer more reliably than a human - there are a lot of things where machine learning can be better than humans, or already is. Even a Large Language Model like GPT can have very real use-cases[9], like trying to guess which department a customer’s email should be routed to. If the LLM is wrong, that’s only a mild inconvenience of redirecting the ticket to the right department.
Accurately understanding and relaying the meaning of human language is not one of those things.
We’re already experiencing the harm that misinformation - deliberate or not - can do. Humans get fooled by other humans every day, and it’s only getting worse. We know that humans tend to trust computers more than humans, and I can’t imagine what kind of harm LLMs can do if we don’t stop treating them like the universal magical machines that will solve all our problems.
Earlier in this post, I mentioned the lawyer who used ChatGPT. If we believe him - and I actually do - his actions weren’t malicious whatsoever; he just didn’t consider that his information could be wrong. I’d like you to think about that for a bit. A lawyer - whose job includes creatively twisting the truth to fit his needs and identifying when others do the same to him - placed so much trust in the truthfulness of ChatGPT that he didn’t even consider checking the sources. Despite ChatGPT displaying a very clear warning message telling him to do exactly that.
We should treat and promote ChatGPT - and all other similar creations - for what they are. We should treat them as if everything they say to us is taken out of an alternate reality without connection to our universe. Using them in any context where factual accuracy is relevant is absolutely irresponsible.
If you’re considering adding GPT-4 or similar to your application, but you’re also considering adding a “this information might be inaccurate” warning - stop. You have already identified that providing accurate information to your users is critical, and you’re about to use a tool that is not fit for the job.
I know just as well as you do that humans love to ignore warnings. They will not double-check the sources. They’ll just run with whatever the computer said, because why would a computer lie? At best, you’re wasting someone’s afternoon by providing them with nonsensical code. At worst, you’re putting someone’s life at risk by feeding them made-up realities.
It is your responsibility to use the tools you have access to in a manner that doesn’t harm your users and the world as a whole. Riding a wave of marketing hype isn’t worth it if you have to completely trash your reputation and lose your moral grounding.
Writing this article required me to use ChatGPT multiple times, thus contributing to the giant waste of energy. I donated EUR 25 to a German non-profit, Andheri Hilfe e.V., that works on providing renewable energy sources to people in areas where they cannot afford renewable energy sources (receipt). I’m also linking to the Internet Archive several times. To support their mission and ensure their future viability, I donated USD 27 to them (receipt).
[1] Long-form blog posts are superior anyways - you can add footnotes! ↩
[2] For multiple reasons: for one, I have a bit more personal involvement there, since my non-free photos are contained in numerous training sets; and I want to wait until the AI companies have lost their lawsuits, so I can include material from those. ↩
[3] An iPhone. I mention that because some people claim that Apple has the best auto-completion engine out there. ↩
[5] In reality, it’s a lot more complex than that. It’s not just predicting words, it’s predicting tokens, which might be anything from a single character to a syllable, a word, or even a group of words. ↩
[6] Please don’t send me emails telling me that not all bananas are yellow. I am aware - even though bananas are neither tomatoes nor coffee. Unrelated, but I just ate a yellow tomato, and it was tasty. ↩
[7] There are, however, a couple that are close, like the “Asia-Pacific Journal of Atmospheric Sciences”, or the “Journal of the Atmospheric Sciences”. ↩
[8] Yes, I told Eve that this was incredibly disrespectful. Alice asked Eve because they knew Eve had some technical knowledge, and just copy-pasting ChatGPT’s output is not at all what a friend should do. ↩
[9] In an early draft, I suggested summarizing large bodies of text as a use case. After thinking more about this, I don’t think that’s a good example: you can’t ensure that the LLM will accurately pick out the most important information. It might ignore the single most crucial point. ↩