Previously on bedside cyborgs... Part 2
Welcome back to this three-part blog series on intelligent search in a healthcare context.
In the last post, we talked about how industry-leading internet search engines - which have already changed the way we live our lives since the turn of the millennium - have undergone rapid evolution over the last few years.
We compared a couple of well established, technically capable search engines against Google and Microsoft’s bleeding edge capabilities to illustrate this difference. We highlighted the fact that next-generation search engines are not just able to suggest a list of relevant web pages, but are able to “read” the contents of those pages and directly answer our medical questions. That’s a huge leap forwards, and something we’ll take a closer look at today.
We also tried to quantify the impact of advanced internet search capabilities in the medical domain with a quick-and-dirty experiment. We concluded that applying industry-leading search technology to all the information healthcare professionals need on a daily basis - including the information that doesn't live on the internet - could create efficiency gains worth millions of care hours or billions of pounds. But we also noted that there are some important challenges to overcome before we realise these.
In the second half of the post, we started to look at how internet search engines work. We made particular note of the importance of good indexing. We talked about how, until about 10 years ago, even the almighty Google relied on lexical search and indexing methods, which are essentially fancy ways of counting letters.
The last thing we said was that since about 2015, things have started to change very quickly. In this article, we’ll discuss the advances in machine intelligence that underpin these changes, and why they are going to drive some dramatic updates to the way we manage information in healthcare.
How do machines make sense of the web?
Our main focus here will be the science of natural language processing. NLP, as it's known in tech, is about getting machines to understand the messy world of human communication, as opposed to highly structured coding languages. It’s the key to unlocking a wealth of applications that need to make sense of data from electronic health records, medical textbooks, guidelines, medicines formularies, research articles, health websites, and so on. It’s also the core technology behind the successor to lexical search, known as “semantic” search.
If you haven’t already done so, take a couple of minutes to check out this article in the UK’s Guardian newspaper. The article was written by GPT-3, a ground-breaking NLP algorithm from the Elon Musk-backed research organisation OpenAI, which had famously declared GPT-3's predecessor “too dangerous to release”.
No doubt OpenAI’s PR department had more than a little to do with that announcement, but when you read the article you can certainly see why GPT-3 got people feeling uncomfortable. And if that article isn’t quite eerie enough for your tastes, you might want to check out the story of Google NLP engineer Blake Lemoine. He was working on a similar algorithm and recently became so convinced that his subject had become sentient that he jeopardised his own career to go public.
So are we starting to live out the plot of an Arthur C. Clarke story, or is this all sleight of hand? To answer that, we’ll need to cover two key concepts: machine learning and words as numbers.
When did machines start to learn?
Alan Turing was possibly the first person to moot the idea of “a machine that can learn from experience” during a lecture he gave in 1947. This could, he proposed, be achieved by “letting the machine alter its own instructions”. As an interesting aside, he expanded on this idea in a 1948 report entitled Intelligent Machinery, which many experts now recognise as the earliest description of the central concepts behind today’s AI. However, Intelligent Machinery wasn't published until 1968 - 14 years after Turing's death.
Over the course of the 20th century, AI evolved in fits and starts. There were long periods known as ‘AI winters’, where machine learning theory would outpace the capabilities of the day's best hardware, and everything would stall until Moore's Law redressed the balance. But by about 2010, AI theory was in pretty good shape, a new type of computer chip (the GPU) was fuelling the next generation of super-fast computers, and things were starting to take off like they never had before.
How do machines learn?
The basic premise of machine learning is that computers work really fast, so many problems that would be impractically slow for humans to solve by trial-and-error can be solved by computers. However, there are a couple of caveats.
Firstly, even the most advanced computers are just number-crunching machines, so you need to be able to frame your task as a mathematical problem if you want your computer to solve it. Happily, Darwinian evolution has provided us with the ultimate blueprint for turning real-world tasks into computing problems, solving them, and doing useful stuff with the output. That is, of course, the biological brain.
Nature has shown us that if you stick enough neurons together in a densely connected network, you can encode some incredibly complex logic processes by tweaking the strength of inter-neuronal connections. Mathematicians have been working on mathematical models of artificial neurons since 1943, so we’re well covered for caveat number one.
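To make that concrete, here's a minimal sketch of a single artificial neuron in its modern form: a weighted sum of inputs squashed by an activation function. The inputs, weights and bias below are made-up illustrative values, not trained ones.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of inputs, passed
    through a sigmoid activation that squashes the output into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

# Example: two inputs with arbitrary connection strengths.
# "Learning" means adjusting the weights and bias, not the code.
output = neuron([0.5, 0.8], [0.4, -0.2], bias=0.1)
print(round(output, 3))  # 0.535
```

Stack thousands of these together, feed the outputs of one layer into the inputs of the next, and you have the densely connected network described above.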
The second caveat is that if your problem is complex enough, you’ll need to be smart about how you manage the trial-and-error process. Modern AI algorithms have multiple, stacked layers of artificial neurons (known as “deep neural networks”, or DNNs), which can have billions of adjustable connections. Blindly tweaking the strength of those connections until your DNN can understand natural language would be a bit like putting the infinite monkey theorem to the test.
This challenge is usually overcome by labelling up loads of data with the ‘ideal’ output you would like from your DNN. The DNN then compares its actual output against the ideal one, and the size of that mismatch (the “error”) tells it which neuronal connections to nudge, in which direction, and by how much. Repeat over millions of labelled examples, and the trial-and-error process transforms from a random, shooting-in-the-dark exercise into a directed learning activity. The theory that underpins all this is called stochastic gradient descent.
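The directed trial-and-error described above can be sketched with a toy example: learning a single “connection strength” from labelled data. This is an illustrative sketch of the idea only; real DNN training code does the same thing for billions of weights at once.

```python
# Labelled data: for each input x, the 'ideal' output is 2 * x.
# The learner doesn't know that; it must discover w = 2 by itself.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0              # initial guess for the connection strength
learning_rate = 0.05 # how big each nudge is

for step in range(200):
    for x, target in data:
        prediction = w * x
        error = prediction - target     # how wrong was this guess?
        w -= learning_rate * error * x  # nudge w in the direction that shrinks the error

print(round(w, 2))  # converges towards 2.0
```

Because the error signal points each update in a sensible direction, the learner homes in on the right answer in a few hundred steps rather than wandering randomly forever.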
And those are pretty much the key ingredients for machines that learn. It’s worth noting that there are other types of machine learning algorithms aside from artificial neural networks, but we won’t get into that during this series. The key take-home for today is that with modern AI theory and really fast processors, modern computers can discover ways to perform even very complicated tasks.
How do you represent words as numbers?
We’ve already mentioned that you need to be able to frame your problem as a maths exercise to use machine learning. So how does that work with language?
The easy answer is that you use a key for every word in the dictionary:

aardvark = 1
abacus = 2
abandon = 3

And so on.
But words mapped to numbers using a simple key do not contain any semantic information. In other words, these numbers contain no clue as to the words' deeper meaning. Yet the AI algorithm that wrote the Guardian article demonstrated a clear understanding of the meaning behind the words it used. So how on earth do you get from “aardvark = 1” to that?
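As a concrete sketch, here's that dictionary-key approach in a few lines of Python. The tiny vocabulary is made up for illustration; note that the resulting numbers say nothing about what the words mean.

```python
# A minimal sketch of turning words into tokens with a dictionary key.
vocabulary = ["aardvark", "abacus", "cheetah", "express", "zippy"]
token_ids = {word: i + 1 for i, word in enumerate(vocabulary)}

sentence = "zippy cheetah"
tokens = [token_ids[word] for word in sentence.split()]
print(tokens)  # [5, 3] -- the IDs carry no hint that these words are related
```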
A basic representation of a word as a number, as in the table above, is called a “token”. The magic happens when your AI algorithm learns to take that token, process it through an artificial neural network and spit out something called a “word embedding”. In tech-speak, a word embedding is a numerical vector representing a position in some arbitrary semantic (or “latent”) space. But unless you’ve studied AI before, that will sound like complete gibberish. So let’s illustrate with an example.
As humans, we most often encounter numerical vectors as coordinates on a graph or map. For example, the (X, Y) coordinates of a point on a graph make up a 2-dimensional vector. If you’re fancypants and your graph has three axes, the (X, Y, Z) coordinates of any point on that graph will be a 3D vector. But for our example, let's keep it simple and create a 2D graph. We'll say the X axis represents “rate of acceleration” and the Y axis is “top speed”. We'll assign a relative scale of 1-10 to each axis, rather than absolute units like "miles per hour".
Now we’ll think of some random words to place on this graph. Let’s go for “cheetah” first. It can accelerate very fast, but its top speed is pretty low compared to, say, a Ferrari. So we’ll place it at (9, 3). Now let's pick some other words and assign them each a location on our graph: we'll put “jet plane” at (6, 10) and “express train” at (1, 7). And let’s pick a few non-nouns: “zippy” might go at (8, 2), “meander” at (1, 1), “sprinting” at (7, 3). Et voilà:
Because we’ve assigned meaning to the axes of our graph, we can refer to the graph area as a “semantic space”. A vector that represents coordinates within a semantic space can be called a “semantic vector”. But in the case of language processing, we usually call it a “word embedding”. So the word embedding for "cheetah" in this example is (9, 3).
Once we’ve generated our word embeddings, we can use them in all sorts of ways. For example, we might want to search through all the nouns on our graph (“cheetah”, “jet plane” and “express train”) to find the one that is best described by the adjective “zippy”. All we need to do is find the noun that is semantically closest: i.e. the nearest noun on our graph. As you can see, a semantic search algorithm would return the word “cheetah”, which most of us would agree is the best answer. A lexical search, on the other hand, would have been useless - “zippy” and “cheetah” have no letters in common.
If we wanted to get really clever, we could also search for the word that is semantically farthest from “zippy”, which would be “express train”. Or we could ask which is most likely to "meander": a "jet plane" or an "express train"? Again, the answer would be "express train". In each case, just by transforming each word into a 2-dimensional word embedding, we've created a situation where a semantic search engine can answer our queries based on underlying meaning. Pretty cool, right?
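The whole example can be sketched in a few lines of Python, using the coordinates above as word embeddings and straight-line distance as a stand-in for semantic closeness:

```python
import math

# 2D word embeddings from the graph: (rate of acceleration, top speed)
embeddings = {
    "cheetah":       (9, 3),
    "jet plane":     (6, 10),
    "express train": (1, 7),
    "zippy":         (8, 2),
    "meander":       (1, 1),
    "sprinting":     (7, 3),
}
nouns = ["cheetah", "jet plane", "express train"]

def distance(a, b):
    """Euclidean distance between two words in the semantic space."""
    return math.dist(embeddings[a], embeddings[b])

# Nearest noun to "zippy" -> the best-described noun
print(min(nouns, key=lambda n: distance(n, "zippy")))  # cheetah

# Farthest noun from "zippy"
print(max(nouns, key=lambda n: distance(n, "zippy")))  # express train

# Which is more likely to "meander"?
print(min(["jet plane", "express train"],
          key=lambda n: distance(n, "meander")))       # express train
```

Real semantic search engines work the same way in principle, just with far more dimensions and cleverer distance measures.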
Hopefully, that’s given you an intuitive sense of how words can be represented as numbers, and how good word embeddings can open up all sorts of applications that would be impossible with conventional lexical indexing.
How do you train NLP algorithms at scale?
If that felt like tough going, you’ll be pleased to hear the hard bit is over. The main difference between this simple example and the real world is scale. AI algorithms usually learn to generate embeddings that have hundreds or even thousands of dimensions, with each one representing a different semantic descriptor (i.e. the equivalent to the labels on our graph’s axes). In reality, AI algorithms learn their own semantic descriptors rather than using terms we humans would understand, but let’s not worry about that here.
Some modern algorithms - GPT-3 included - generate embeddings that are informed by multiple words at the same time, which allows them to capture the meaning of words in context. This contextual approach only came into its own in 2018, but it drove huge performance improvements that really ignited the whole field, and things have been moving at an incredible pace since.
Which leaves us with one final question: how do you make an AI algorithm learn to translate word tokens into word embeddings? I love this one, because the answer is so beautifully simple.
Most AI algorithms today use some variation of an approach called “masked language model” training. This entails feeding the algorithm as much of the internet as you can get your hands on and, as you do so, blanking out some of the words in each sentence. For example:
Buckingham Palace is located in London, the capital city of [BLANK].
The algorithm’s job is to guess the missing words. Incredibly, all you need to do is set the whole thing up with the right starting parameters and let it run for a few weeks on a multi-million-dollar supercomputer. Hey presto! You’ve got an intelligent machine.
What’s particularly cool about this approach is that it doesn’t require any manual data labelling. And yet it’s amazingly effective. In order to reliably guess the missing words in a sentence, the AI algorithm has to learn word embeddings that contain loads of semantic information. The result, as you saw in the Guardian article, is something that can appear remarkably like “true” intelligence.
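To make the masking step concrete, here's a minimal sketch of how training examples like the one above can be generated from raw text. The helper function and sentence are illustrative only, not taken from any real training pipeline.

```python
import random

def mask_sentence(sentence, rng):
    """Blank out one random word; return the masked text and the answer.
    The hidden word becomes the training label -- no manual labelling needed."""
    words = sentence.split()
    i = rng.randrange(len(words))
    answer = words[i]
    words[i] = "[BLANK]"
    return " ".join(words), answer

rng = random.Random(0)
masked, answer = mask_sentence(
    "Buckingham Palace is located in London the capital city of England", rng
)
print(masked)  # the sentence with one word replaced by [BLANK]
print(answer)  # the word the model must learn to guess
```

Run this over billions of sentences scraped from the web and you have, in effect, an endless supply of free exam questions for your algorithm.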
How do you deploy fully trained NLP algorithms?
We’ve already mentioned that training a large AI language algorithm needs a multi-million dollar supercomputer. The good news is that using a fully trained algorithm to process data (known as "running inference") is far less computationally expensive than that. The bad news is that it's still not something you can do on your hospital's existing IT infrastructure.
We definitely won’t get into too much detail on this one, but suffice to say that AI is so fundamentally different from conventional computer programs that it needs a special type of processor to run (called a GPU). Really big algorithms like GPT-3 need a lot of these processors linked together with super-high-speed networking. It all gets very complicated very quickly, and expecting hospital IT teams to manage this type of infrastructure is a bit like asking a GP to manage a complex post-operative neurosurgical patient.
This creates a particular problem in clinical settings, because the central doctrine of traditional health IT is that you manage everything in-house. This means owning and managing your own servers on hospital-owned sites, employing your own systems administrators, keeping everything linked over a local network, and only allowing heavily restricted connections to the wider internet.
Back in the days when most internet access was via dial-up connections and cybersecurity was a term that only existed in science fiction, that made a lot of sense. Over the last couple of decades, however, three things have changed:
Firstly, there has been huge investment in making the internet fast, secure and robust. Secondly, as we’ve already discussed, some of today’s most commonly used software applications require highly specialised infrastructure that cannot be managed at a local level. Thirdly, personal smartphones have overtaken desktop computers to become the most common method of accessing digital information. Forcing staff to restrict themselves to intranet-connected workstations flies in the face of the way the rest of the world is working.
And that’s where ‘the cloud’ comes in. Exotic as it sounds, using the cloud just means using someone else’s computer over the internet. Usually, of course, that ‘someone’ is a big tech corporation (the three largest public cloud providers are Google, Amazon and Microsoft) that owns lots of computers and hosts them all together in a huge, specialised data centre.
From a healthcare perspective, the cloud is essentially IT’s equivalent to a major tertiary hospital: a place to concentrate expertise (in terms of both infrastructure and security), keep costs down by bulk-buying, and operate more efficiently (which, in many cases, includes being more environmentally friendly). It also allows you to run heavy duty applications like NLP algorithms remotely, then access the results on your smartphone. The cloud has enough advantages that the NHS, the wider UK government, the US Federal Government and many other major institutions now operate cloud-first policies. And it’s forecast to grow substantially over the coming decade.
However, it’s not quite a panacea for our digital woes. In the final post of this series, we’ll talk a bit more about the challenges of using the cloud in healthcare, along with some of the other blockers to applying advanced information management technologies to health data. We’ll also discuss some strategies that might help to overcome these.
See you next time!