ChatGPT, Gato, Flamingo… has this year shown us the building blocks for the future of machine intelligence?

For data science nerds, 2022 has been an incredible year. The recent release of ChatGPT, which we discussed in last week's blog, was just the latest in a long line of stunning achievements. These have included breakthroughs in retrieval-augmented reinforcement learning, cross-domain multimodal AI, autoregressive visual language modelling, and much more.

On the other hand, as lab-level progress accelerates, the gulf between research breakthroughs and the applications we experience in daily life is only widening. In my home world of clinical medicine - not exactly famous for early adoption of exciting new tech - this is even truer than in most industries. One day I can be asking ChatGPT to summarise the subtler elements of quantum mechanics, the next it’ll take me 10 minutes to log into one of our hospital’s Windows 7 PCs and look up some bloods.

Given this disconnect, it’s little wonder that when next-gen technology like ChatGPT cuts through to the public consciousness, it tends to be met by a circus of wild optimism and doomsday prophesying. The Washington Post, for example, ran a piece within the last fortnight entitled, “How to Save Your Job from ChatGPT”. To those of us in the field, ChatGPT is a culmination of progressive technological advances over the last decade. To an outsider, though, it might easily seem like a visitor from another age.

One of the biggest downsides to all this hype surrounding AI is that it can obscure the true lay of the land. So, as we see out 2022, where do things really stand? When I stare hard enough at some of the key AI moments of the year, I can't help but wonder: do they start to look a lot like the pieces of a rather extraordinary jigsaw?

Three AI highlights from 2022

1. Gato

Although ChatGPT made the mainstream headlines more than any other AI this year, DeepMind’s Gato was probably the data scientists’ breakthrough of 2022.

What sets Gato apart is its ability to function across multiple data modalities and very different domains. Whereas most AI models over the last decade have been limited to one type of input data - e.g. images, text, sound - Gato can handle natural language, photographs, and even video feeds from Atari games. What’s more, it produces different types of output appropriate to different input types. If it’s asked a question in plain English, for example, its output will be in English too. If it’s given the video feed of an Atari game, it will output commands for an Atari joystick.

Under the hood, Gato uses the same type of neural network that powers ChatGPT: an “autoregressive transformer”. And if you read last week's blog on ChatGPT or have interacted with it directly, you’ll already know that autoregressive transformers can be extremely powerful. To take that technology and show that a single AI can learn two tasks as profoundly different as “speaking” English and playing Atari - that’s a huge leap forwards.
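To make “autoregressive” a little more concrete, here is a minimal sketch in Python. It is not a transformer, and it has nothing to do with Gato’s or ChatGPT’s real code: the lookup table NEXT_TOKEN is a made-up stand-in for a trained network. All it shows is the loop itself, in which each new token is predicted from what came before and then fed back in as input.

```python
# Toy illustration of autoregressive generation.
# NEXT_TOKEN is a hypothetical stand-in for a trained model: it maps the
# most recent token to the next one. A real transformer would condition on
# the whole sequence and output a probability distribution instead.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "patient",
    "patient": "is",
    "is": "stable",
    "stable": "<end>",
}


def generate(prompt, max_new_tokens=10):
    """Extend `prompt` one token at a time until <end> or the length cap."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(sequence[-1], "<end>")
        if next_token == "<end>":
            break
        # The prediction is appended and becomes input for the next step.
        sequence.append(next_token)
    return sequence


print(" ".join(generate(["<start>"])[1:]))
```

Swap the lookup table for a model that outputs joystick commands instead of words and the loop stays the same, which is roughly why one architecture can serve such different tasks.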


It’s also worth noting that since Gato was announced, DeepMind have created another AI called Flamingo. This is also an autoregressive transformer model, but Flamingo’s unique selling point is that it handles multiple data types interleaved with one another. Which is to say that a single input sequence can contain a mix of text, images and even video. Flamingo will reason over all of these data types together, and produce a natural language output that takes each into account. That might seem like just a natural extension of Gato, but it’s an important one.

2. ChatGPT

ChatGPT may not be as fundamentally novel as Gato - as far as we can tell, most of the techniques it uses have been pioneered by other groups - but it’s an astounding feat of engineering with potential applications in many walks of life. If you want to learn more then check out last week's blog on what ChatGPT means for healthcare. For today, though, we’ll just highlight three key reasons that ChatGPT is a significant landmark for progress in AI.

Firstly, it is by far the most compelling application of retrieval-augmented language modelling to date. As a bit of context for this: since transformers took off in late 2017, there has been a trend towards mammoth language generation (“autoregressive”) models. These run into trillions of parameters - over 16,000 times bigger than the neural networks that kicked off the “deep learning revolution” a decade ago. A lot of the capacity within these giant neural networks appears to be consumed by memorising (or, more accurately, “amortising”) factual information. But it turns out that you can cut way down on the number of neural parameters if you allow your AI to retrieve information from external sources at runtime instead. Not only does this make the neural network cheaper and faster to execute; it also ensures the AI has access to up-to-date information, and probably increases the total amount of information the system can access.
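For the curious, the general idea of retrieval can be sketched in a few lines of Python. This is a generic illustration under loud assumptions, not a description of how ChatGPT or any other real system works: the three-sentence DOCUMENTS store, the word-count similarity and the prompt layout are all invented for the example. Real systems typically use learned embeddings and far larger indexes.

```python
import math
from collections import Counter

# Hypothetical external knowledge store; in practice this could be a search
# index or vector database holding far more text than a model can memorise.
DOCUMENTS = [
    "Aspirin is recommended for suspected acute coronary syndrome.",
    "COPD exacerbations are often treated with bronchodilators and steroids.",
    "A chest X-ray can help identify pneumonia.",
]


def bag_of_words(text):
    """Lower-case word counts, with basic punctuation stripped."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents):
    """Prepend retrieved text so a model can read facts rather than recall them."""
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


print(build_prompt("How are COPD exacerbations treated?", DOCUMENTS))
```

The language model only ever sees the assembled prompt, so updating what the system “knows” becomes a matter of editing the document store rather than retraining the network.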

Secondly, ChatGPT convincingly demonstrates an ability to reason across different domains. For example, as we saw last week, it can take a description of chest pain expressed in lay language, combine that with medical guidelines, propose next steps and even explain its reasoning in precise terms of computer code. This implies that the conceptual knowledge it acquires during training is much more abstract and multi-faceted - dare we say, “human-like”? - than state-of-the-art AI from even a few years ago. We’ve seen other AI systems like GPT-3 demonstrate similar capabilities before, but ChatGPT seems to be next-level.

Lastly, ChatGPT was fine-tuned using reinforcement learning techniques. This means that after its initial training on huge bodies of text, it was able to refine its language generation capabilities by trial-and-error learning directly from human conversation. That’s pretty significant: reinforcement learning is difficult to do at the best of times, let alone with complex, autoregressive language models. But if you can crack the problem, it means that your AI can continuously learn through experience, getting better and better with every conversation it has. ChatGPT wasn’t the first example of reinforcement learning with large language models (in fact, Gato’s Atari-playing skills are another great example of this paradigm), but it has provided the most masterful example in the context of natural language generation to date.
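Here is a toy sketch of what “trial-and-error learning from feedback” can look like in Python. It is emphatically not OpenAI’s training pipeline: the two canned responses, the feedback list and the learning rate are all made up, and the “policy” is just a pair of numbers rather than a language model. It uses a simple policy-gradient (REINFORCE) update, in which rewarded responses become more likely and penalised ones less so.

```python
import math


def softmax(preferences):
    """Turn raw preference scores into a probability for each response."""
    exps = {name: math.exp(score) for name, score in preferences.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def reinforce_update(preferences, chosen, reward, learning_rate=0.5):
    """One REINFORCE step for a softmax policy over a fixed set of responses."""
    probs = softmax(preferences)
    updated = {}
    for name, score in preferences.items():
        # Gradient of log pi(chosen) with respect to each preference score.
        grad = (1.0 if name == chosen else 0.0) - probs[name]
        updated[name] = score + learning_rate * reward * grad
    return updated


preferences = {"curt reply": 0.0, "helpful reply": 0.0}
# Hypothetical human feedback: (response shown, reward given).
feedback = [("helpful reply", 1.0), ("curt reply", -1.0), ("helpful reply", 1.0)]
for chosen, reward in feedback:
    preferences = reinforce_update(preferences, chosen, reward)

# After three rounds the policy favours the response people rewarded.
print(softmax(preferences))
```

Doing this with a real language model is far harder, not least because the “actions” are whole passages of text rather than two fixed options.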


3. Retrieval-augmented reinforcement learning

Language generation isn’t the only area where retrieval is taking hold. In a paper presented at this year’s NeurIPS conference, researchers from DeepMind describe large-scale retrieval for reinforcement learning. (To avoid confusion: ChatGPT was fine-tuned using reinforcement learning techniques, but is not a “pure” reinforcement learning AI.) This didn’t receive anywhere near as much attention as Gato or ChatGPT, but it might turn out to be a very big deal.

We’ve already mentioned reinforcement learning in this blog, but as a belated primer: this type of AI is generally used for multi-step strategic planning in dynamic environments. Reinforcement learning systems can be used for things like self-driving cars and autonomous robotics. Or in the case of AlphaGo (probably the world’s most famous AI after it was the star of a Netflix show), playing board games.

Real-time information retrieval has similar benefits for reinforcement learning as it does for language generation. But where language generators like ChatGPT retrieve factual information in order to answer questions, reinforcement learning AIs need to access information about their past experience… “memories”, if you will. For example, a self-driving car approaching a red light needs to access data about its previous similar experiences in order to ascertain which actions worked well (stopping, presumably). And, of course, which actions didn’t.

Most reinforcement learning systems currently amortise this past experience within their central decision-making neural network (the “policy network”). But by outsourcing information about past experience to external storage and retrieving it as needed, freed-up capacity within the policy network can be used to improve its reasoning and decision-making abilities. Perhaps even more importantly, multiple instances of a reinforcement learning system that share a common conceptual language (or, in technical terms, use the same “latent space embeddings”) can leverage their collective experience this way.
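A rough Python sketch of that shared-memory idea follows. It is not the method from the DeepMind paper: the SharedMemory class, the two-number “embeddings” and the nearest-neighbour lookup are simplifying assumptions of mine, chosen to show how several agents could pool experience in one external store and consult it before acting.

```python
import math


class SharedMemory:
    """Hypothetical external store of past experience, shared by many agents."""

    def __init__(self):
        self.records = []  # each record is (embedding, action, reward)

    def add(self, embedding, action, reward):
        self.records.append((tuple(embedding), action, reward))

    def nearest(self, embedding, k=2):
        """Return the k stored records closest to `embedding` (Euclidean distance)."""
        return sorted(self.records, key=lambda r: math.dist(r[0], embedding))[:k]


def choose_action(memory, embedding, default="stop"):
    """Pick the action with the best average reward among similar past situations."""
    neighbours = memory.nearest(embedding)
    if not neighbours:
        return default
    rewards_by_action = {}
    for _, action, reward in neighbours:
        rewards_by_action.setdefault(action, []).append(reward)
    return max(
        rewards_by_action,
        key=lambda a: sum(rewards_by_action[a]) / len(rewards_by_action[a]),
    )


memory = SharedMemory()
# Made-up embeddings: (light_is_red, speed). Two different cars write to one store.
memory.add((1.0, 0.9), "stop", 1.0)        # car A stopped at a red light
memory.add((1.0, 0.8), "continue", -1.0)   # car B ran a red light
memory.add((0.0, 0.9), "continue", 1.0)    # car B drove through a green light

# A third car approaching a red light benefits from both cars' experience.
print(choose_action(memory, (1.0, 0.85)))
```

The lookup only works because every car describes the world with the same embedding scheme, which is the “common conceptual language” point made above.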

Bringing it all together

OK, time to take the 30,000 foot view and ask ourselves: what does this all mean?

The answer to that depends on what you’re trying to achieve. Ultimately, all of this technology is useless unless you have a problem to solve. So to frame a little mental exercise, here’s the vision that originally got me into AI:

Imagine an AI broadly similar to Samantha from the sci-fi hit Her. In case you haven’t seen it, Samantha is a disembodied, cloud-based virtual assistant AI. She has access to an audiovisual feed of a user’s environment via their smartphone, and communicates with the user via a set of wireless earbuds. If memory serves, she can operate smart devices - but in any case, we’ll say that our AI can do that. Unlike Samantha though, who has a love affair with one of her users, our AI isn’t designed to be a personal companion. Rather, we’re going to build a professional assistant for healthcare staff. We’ll call her “Dr Sam”.

Given what we’ve covered so far today, then, how would we go about building Dr Sam? Well, perhaps a bit like this…

Going up

In the bottom left corner, a real-world clinical consultation is taking place. This is being captured digitally via video and audio feeds, plus there’s a data stream for the electronic health record (EHR). All this information is being sent to the cloud, where it’s available to Dr Sam.

The fourth data stream going Dr Sam’s way will contain commands and requests from a user - in this case, a human clinician. Some examples might be, “Hey Sam, I think Mr Smith may have an acute exacerbation of COPD. Can you take some bloods, book a chest X-ray, complete the admissions protocol and send a message to the bed manager please?” Or perhaps, “Hey Sam, could you summarise this consultation into a clinical note and post it to Mr Smith’s EHR record?” Maybe even, “Hey Sam, I’m not completely sure about the diagnosis here. Any thoughts?”

To transform the raw data inputs into high level conceptual information that Dr Sam can make use of (in technical terms: “latent space embeddings”), we’re going to use technology very much like we’ve already seen from ChatGPT - only, with Gato and Flamingo’s ability to incorporate multiple data types. We can also make use of separate specialised AIs to help with this process. For example, a speech-to-text AI will save Dr Sam from having to learn how to handle raw audio input, sparing her neural capacity for other tasks. And, as with the original ChatGPT, this part of Dr Sam’s “brain” will retrieve information from external sources (e.g. medical textbooks, clinical guidelines) to enrich the data it receives directly from the consultation.

Having “embedded” the raw data into retrieval-enriched conceptual information, Dr Sam will need to create and execute a strategy to complete the task set by our human clinician. To do that, she’ll use retrieval-augmented reinforcement learning techniques. As we noted earlier, her memory banks will include records from all the Dr Sam instances across the world, so she’ll be able to learn from a huge base of real-world experience, very quickly.

But before we move over to the “down” side of our diagram and talk about how Dr Sam will execute on her plans, it’s worth noting that her sensory processing and decision-making centres - currently shown as separate components in the diagram - could easily comprise a single neural network (as is the case with Gato). This way, all of Dr Sam’s faculties could be constantly improved by experience, not just the reinforcement learning-based decision centre. She could get better at understanding regional accents, spotting subtle physical signs of disease from video feeds, engaging in diagnostic reasoning… even reading human emotion and exercising “empathy”, if the reward system we define for her reinforcement learning system encourages this.

Going down

On the right of the picture, Dr Sam’s decision-making centre creates a high level plan and passes this to another part of her digital brain for execution. Again, these don’t need to be totally separate neural networks as shown in the diagram. To stray into the risky world of analogies with human intelligence, these different processing centres could be something akin to the specialised cortical areas of the biological brain.

However, just as we used a speech-to-text model to help process Dr Sam’s sensory input, it’s likely that she will want to delegate at least some output tasks to other AI systems. This will allow her to focus on the big picture and let them worry about the granular details of execution. For example, if our human clinician asked Dr Sam to draw blood, Dr Sam may well pass this off to a highly specialised AI that controls the robotics of an automatic blood-drawing machine. The two AIs could continue to communicate at a high level - for example, the blood-drawing machine could alert Dr Sam if it accidentally hit an artery, so that Dr Sam could translate this into plain English and let a human physician know. But not having to learn how to directly operate the robotics apparatus of a blood drawing machine will save Dr Sam from dedicating valuable neural capacity to learning a very niche task that is unlikely to change much over time.
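That division of labour can be sketched in Python as a simple dispatcher. Everything here is hypothetical: the step names, the Result structure and the canned specialist functions are invented for illustration, and a real system would be calling separate AI services or device APIs rather than local functions. The point is the shape of the design, in which the planner deals only in high-level steps and plain-English messages.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    task: str
    ok: bool
    alerts: list = field(default_factory=list)


# Hypothetical specialist systems. In reality each would be a separate AI or
# software service; here they are plain functions returning canned results.
def draw_blood(patient):
    return Result(task=f"draw blood from {patient}", ok=True)


def book_xray(patient):
    return Result(task=f"book chest X-ray for {patient}", ok=True)


SPECIALISTS = {"draw_blood": draw_blood, "book_xray": book_xray}


def execute_plan(plan, patient):
    """Route each high-level step to its specialist and collect plain-English messages."""
    messages = []
    for step in plan:
        specialist = SPECIALISTS.get(step)
        if specialist is None:
            messages.append(f"No specialist available for '{step}'; please do this manually.")
            continue
        result = specialist(patient)
        status = "done" if result.ok else "failed"
        messages.append(f"{result.task}: {status}")
        # Anything the specialist flags is passed on to the human clinician.
        messages.extend(f"Alert: {alert}" for alert in result.alerts)
    return messages


for message in execute_plan(["draw_blood", "book_xray", "message_bed_manager"], "Mr Smith"):
    print(message)
```

Adding a new capability then means registering another specialist, with no need to retrain the planner on the mechanics of the task.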

Of course, we’re making this all sound very easy by ignoring a plethora of technical challenges that crop up from a blueprint like this. Plus there are the sticky ethical and regulatory issues that surround the use of AI in a clinical setting. Nonetheless, most of Dr Sam’s fundamental building blocks - multimodal AI capable of retrieval-augmented, cross-domain reasoning; reinforcement learning AI capable of leveraging external memory for complex strategic planning; generative models capable of multimodal outputs, including interaction with other software systems via API - have been shown to work in a lab setting. To me, the question is less whether there's a there there. Rather, it's...

How far off is an AI like Dr Sam?

Bear in mind that “When?” can be a really difficult question in AI. Shortly before AlphaGo beat Lee Sedol, the reigning world Go champion, most experts were predicting that such a feat would be impossible for at least a decade. Back in 2016, on the other hand, Professor Geoffrey Hinton - widely regarded as one of the forefathers of deep learning - predicted that radiologists would be obsolete within five years.

In the case of building Dr Sam, access to domain-specific training data is likely to be a blocker. Plus, refining her skills using reinforcement learning in a live clinical setting could be a big challenge. However, having seen what ChatGPT can do, it seems plausible that a convincing prototype - something that could make clinical recommendations based on the analysis of natural language consultations, EHR data, blood tests and static medical images - could be demoed in a lab environment within this decade. A production version capable of handling all the types of data a human clinician might encounter, on the other hand… who knows?

The point I want to leave you with here, though, is that this discussion would have gone a lot differently as recently as a year ago. With GPT-3 as the state of the art in language modelling back then, and multimodal cross-domain transformers still largely an interesting idea, the question wouldn’t have been when we might build something like Dr Sam, but if. And that, more than any individual breakthrough, is what I will remember 2022 for: the year that might just have shown us the building blocks from which AI’s future will be built.
