Applied research: A collaboration with Queen’s University Belfast
Here at Eolas Medical, our mission is to put medical knowledge at clinical fingertips. One of the ways we are doing this is by using artificial intelligence to make sense of the clinical information hosted on our platform. By taking us beyond ‘simple’ search techniques and into the world of meaning-driven information management, AI helps us connect frontline healthcare professionals with the right information at the right time.
AI approaches to making sense of human language have evolved in leaps and bounds over the last few years. Modern “natural language processing” (NLP) approaches involve huge digital brains with billions of artificial neurons. As we discussed in a recent blog on how machines make sense of the internet, these “neural networks” encode the meaning of words, sentences and even paragraphs as lists of numbers.
Those numbers - or, more accurately, “embeddings” - can be used for a range of different tasks, including search and even natural language question answering. Both search and question answering can be hugely impactful at the point of care and provide the focus for our collaboration with researchers at QUB.
A motivating use case - Wells score at your fingertips
To look at this work through the lens of a concrete example, imagine I’m the clinical director of an emergency department. We’ve just updated our local DVT pathway, and I’m uploading it as a PDF to the My Emergency Department app.
The next day, a junior doctor in my emergency department is seeing a patient with a suspected DVT. She opens the My Emergency Department app and runs the following search:
DVT risk assessment tool
It just so happens that our DVT pathway contains exactly the sentence she’s looking for:
The likelihood of venous thrombosis of the leg is evaluated using the Wells score, which can be calculated with MDCalc.
But we’ve got a problem. None of the search terms appear in our target sentence. To a conventional search algorithm that looks for matching words, the two appear completely unrelated.
And yet this is a really important challenge to solve. There might be 50 people waiting to be seen in our department, and if this is an NHS hospital we’re almost certainly short-staffed. If our junior doctor can’t find the information she needs via the app, she’ll probably waste valuable minutes hunting out a paper copy of the DVT pathway. With emergency department waiting times at an all-time high and clear evidence that long waiting times put patients at risk, that’s a problem.
To connect our doctor with the information she needs and let her get back to caring for patients, we need a search algorithm that understands the meaning behind the words.
In other words, we need intelligent search.
The current approach to intelligent medical search
Here’s how we can implement intelligent search in this situation:
When I upload the DVT pathway document to the My Emergency Department app, the app’s ‘indexing’ algorithm breaks it down into individual sentences and passes each sentence through an NLP algorithm. The algorithm generates an embedding - a list of numbers - for each sentence. Thanks to some clever science (find out more in our previous blog post), those numbers contain a lot of information about the underlying meaning of each sentence. They’ll be stored alongside the original document on the Eolas platform.
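To make this concrete, here’s a minimal indexing sketch in Python. It assumes the open-source sentence-transformers library, and the model name is a placeholder choice; the naive full-stop sentence splitting and the in-memory list are illustrative stand-ins, not a description of the actual Eolas pipeline:

```python
# A minimal indexing sketch - not the actual Eolas pipeline.
# Assumes the open-source sentence-transformers library; the model
# name below is a placeholder, not the one used in production.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def index_document(text):
    """Split a document into sentences and embed each one."""
    # Naive full-stop splitting for illustration; a real system would
    # use a proper sentence tokeniser (e.g. spaCy or NLTK).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences)  # one vector per sentence
    return [{"sentence": s, "embedding": e}
            for s, e in zip(sentences, embeddings)]
```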
The next day, when our junior doctor runs a search on the app, the search terms are passed through that same NLP algorithm. Just like it did for the sentences in our DVT pathway, the algorithm will generate an embedding for our search terms. Once we have embeddings for our documents and our search terms, we can calculate the distance between the search term embedding and every sentence embedding in our document store using some simple maths.
Finally, we show our junior doctor the top few sentences whose embeddings are closest to our search term embedding. If our NLP algorithm did a good job of generating embeddings, these sentences will be very relevant. Our target sentence above - the one that suggests calculating the Wells score using MDCalc - should be in the top couple of hits. Our junior doctor can follow the link to MDCalc and have the patient’s Wells score calculated in just a few seconds.
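Continuing the sketch above, the search step embeds the query with the same model and ranks sentences by cosine similarity (one common choice of ‘distance’). Again, this is an illustration under those assumptions rather than our production code:

```python
import numpy as np

def search(query, index, top_k=5):
    """Embed the query and return the top_k closest indexed sentences."""
    q = model.encode([query])[0]
    # Cosine similarity between the query and every indexed sentence
    sims = [np.dot(q, item["embedding"]) /
            (np.linalg.norm(q) * np.linalg.norm(item["embedding"]))
            for item in index]
    ranked = np.argsort(sims)[::-1][:top_k]  # highest similarity first
    return [index[i]["sentence"] for i in ranked]

# Usage:
# index = index_document(dvt_pathway_text)
# search("DVT risk assessment tool", index)
```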
The research question - How to make sense of medical knowledge
Sounds easy, right? Sure, we’ve skimmed over a few technical points. For example, running embedding distance calculations gets pretty computationally expensive if you’re searching through thousands of documents, so we generally use more conventional approaches to shortlist the most promising documents first. But at a high level, we’ve described a pretty reasonable approach to intelligent search.
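For the curious, here’s one common way to do that shortlisting, sketched with the open-source rank_bm25 library (our choice here purely for illustration; any conventional keyword scorer would do). Cheap keyword scoring whittles thousands of documents down to a shortlist, and the expensive embedding comparisons only run on that shortlist:

```python
from rank_bm25 import BM25Okapi

def shortlist(query, documents, n=100):
    """Use cheap keyword scoring (BM25) to pre-filter candidate documents,
    reserving the embedding distance calculations for the shortlist."""
    tokenised = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenised)
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in top[:n]]
```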
The big issue here is that not all NLP algorithms are created equal. The performance of your intelligent search engine is only as good as the quality of your embeddings, so choosing the right algorithm really matters. This is a particular challenge in the biomedical field.
In a previous blog on the challenges of bringing AI to healthcare, we talked about something called “domain shift”. In the context of NLP, an example of domain shift might be training an algorithm using data from Wikipedia, Twitter and Reddit, and then asking it to make sense of PubMed articles. Just as a human who has no relevant training is unlikely to make much of academic biomedical literature, a general-purpose NLP algorithm is unlikely to fare well either. More specifically, the embeddings it generates will not capture the salient meaning of sentences about biomedical topics.
To illustrate this using our example above, imagine our NLP algorithm has not been adequately exposed to biomedical literature and doesn't know what a DVT is. Faced with the search term “DVT risk assessment tool”, its embedding will only capture the meaning of the phrase “risk assessment tool”. So an embedding similarity search is more likely to return irrelevant sentences like this:
The MDCalc HIV Needle Stick Risk Assessment Stratification Protocol (RASP) can be found here.
than the one that actually answers our junior doctor’s question:
The likelihood of venous thrombosis of the leg is evaluated using the Wells score, which can be calculated with MDCalc.
As we’ve discussed in previous blog posts, training huge NLP algorithms is really hard and super expensive. Happily, the organisations that have the resources and expertise to train these models - Microsoft, Google, NVIDIA, Facebook, etc. - often make them available to the wider public. So as an AI application developer, the main question you need to answer is which of these “off-the-shelf” models generates the best embeddings for your purposes.
Our approach
To answer this question for our use case, we teamed up with a fantastic group at the ECIT Institute at Queen's University Belfast and asked them to evaluate a range of state-of-the-art NLP algorithms for biomedical question answering. The evaluation used the BioASQ dataset: a large collection of biomedical documents, along with a bank of questions created by a team of biomedical experts. For each question, those same experts also highlighted the ‘ideal’ answer from within the documents.
The ECIT team used a series of trained NLP algorithms to generate embeddings for each sentence of every document in the BioASQ dataset, along with each question in the BioASQ question bank. Then, for each question, they used embedding distance calculations to retrieve the 1, 5 and 10 closest sentences.
If the sentence containing the ‘ideal’ answer (as defined by BioASQ’s panel of biomedical experts) appeared within those top-ranked sentences, that NLP algorithm would score a point. Thus, each algorithm was assigned a “top 1” score, a “top 5” score and a “top 10” score.
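In code, the scoring scheme looks something like this (a sketch of our understanding of the evaluation, with hypothetical variable names):

```python
def top_k_score(rankings, ideal_answers, k):
    """Fraction of questions whose 'ideal' sentence appears in the
    top k retrieved sentences (i.e. recall@k)."""
    hits = sum(1 for ranked, answer in zip(rankings, ideal_answers)
               if answer in ranked[:k])
    return hits / len(ideal_answers)

# e.g. top_k_score(all_rankings, answers, k=5) gives an algorithm's "top 5" score
```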
The results
What was really striking about this exercise was how much NLP algorithms have improved over just a few years. The BioASQ data was first released in 2019. Back then, a number of research groups ran exactly the same experiment as I’ve just described. When we compared those results with the updated results from the ECIT team’s work, the 2022 algorithms were far better. That’s not exactly a huge surprise - a group of researchers at Google made something of a breakthrough in 2017 with a new type of NLP algorithm called the transformer, and progress has been steaming ahead since then. But it was still great to see that progress translate into the biomedical world.
Nonetheless, top 5 and top 10 scores were significantly better than top 1 scores. This is partly because even state-of-the-art NLP algorithms are far from perfect, and partly because the ‘ideal answers’ assigned by the BioASQ experts were not always clear-cut. When we reviewed some of the results manually as part of an error analysis exercise, it was clear that in some cases the NLP algorithm had actually provided a better top answer than the human experts.
From our perspective at Eolas, great top 5 results are fine. Our aim is not to provide a single definitive answer to users’ questions the way that Alexa or Siri might - it’s to provide healthcare professionals with a shortlist of relevant information sources they can choose from. For us, the most important outcome from this exercise was that a particular algorithm known as a generalisable T5-based dense retriever, or GTR (catchy name, right?), obtained the best results. This is another Google-designed algorithm that was first described at the very end of 2021 and only made public earlier this year.
The impact - medical knowledge at clinical fingertips
As the Eolas platform continues to scale globally, we’re hosting more and more essential clinical information for some of the world’s leading medical centres. The amazing feedback from users about the impact we’re having at the point of care is what gets us up and at ‘em every day.
But the more we grow, the more important high-quality search becomes. After all, it’s a lot easier to pick out a few really good hits from a thousand documents than from a hundred thousand. Following the collaboration with QUB, we’re in the process of integrating GTR into our own search engine, which we hope will make the platform even more effective at delivering great search results from a very large body of clinical information.
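For readers who want to experiment, GTR checkpoints are publicly available; here’s a minimal sketch of swapping one into the pipeline above, assuming the sentence-transformers packaging of the model (our own integration work involves rather more than this!):

```python
from sentence_transformers import SentenceTransformer

# "sentence-transformers/gtr-t5-base" is a publicly available GTR
# checkpoint on the Hugging Face hub; larger variants also exist.
gtr = SentenceTransformer("sentence-transformers/gtr-t5-base")
query_embedding = gtr.encode("DVT risk assessment tool")
```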
All in all, applied research in collaboration with thought leaders like the team at Queen’s helps us to constantly evolve our technical capabilities. In turn, that allows us to continue delivering a great user experience that improves the efficiency of the clinical workforce and ultimately contributes to better patient care.