AI and the problem of identifying concepts from keywords in scholarly article search

Searching the scholarly literature has long presented challenges to researchers, who grapple with keyword-based online searching when what they really want to find are concepts. Peter Webster, Technology Services Librarian at Saint Mary's University, Patrick Power Library, in Canada, became interested in how AI tools could make academic searching easier and solve the problems of concept identification. Here are his thoughts.

Online searchers want online search tools to find the concepts they seek based on a few simple keywords, and many AI search tools promise they can do this. Take this quote from gaming search engine Splore:

“With (AI) the search engine can understand your intent and the meaning behind your search, not just the specific words you enter.”

The popular resource Semantic Scholar makes a similar, but more measured claim:

“Our system extracts meaning and identifies connections from within papers, then surfaces these insights.”

However, it seems that the potential of AI search tools has yet to be fully realized. It is important for searchers to understand the capabilities and the limitations of AI search.

To illustrate the complexity of the process that we are hoping AI methods can accomplish for us, I chose, as an example, an evidence synthesis paper by Luong Thanh BY, et al., titled “Behavioural Interventions to Promote Workers' Use of Respiratory Protective Equipment”. These researchers needed to use keywords to sum up the concepts of “behavioural intervention”, “workers” and “respiratory protection”, as shown in this Venn diagram.

But here is the Cochrane Review record for this paper, showing the dozens of carefully developed and interconnected keywords needed to effectively address these concepts.

This is a good illustration of the complexity of the keyword-to-concept determination process that we are hoping artificial intelligence will do for us.
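The structure of such a search strategy can be sketched in a few lines: synonyms for each concept are ORed together into a block, and the blocks are ANDed. The synonym lists below are made-up stand-ins, not the actual Cochrane search blocks.

```python
# Illustrative sketch of an evidence-synthesis keyword strategy:
# OR synonyms within each concept block, AND the blocks together.
# These synonym lists are invented for illustration only.
CONCEPT_BLOCKS = {
    "behavioural intervention": [
        "behavioural intervention", "behavior change", "safety training",
    ],
    "workers": ["worker", "employee", "occupational"],
    "respiratory protection": [
        "respirator", "respiratory protective equipment", "dust mask",
    ],
}

def build_query(blocks):
    """OR together the synonyms within a concept; AND the concepts together."""
    groups = []
    for terms in blocks.values():
        groups.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return " AND ".join(groups)

print(build_query(CONCEPT_BLOCKS))
```

Even this toy version shows why a real strategy balloons to dozens of interconnected keywords: every missing synonym in a block silently excludes relevant papers.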

There is no doubt that AI methods such as Natural Language Processing (NLP) and semantic machine learning, combined with traditional keyword methods, can effectively derive concepts from search keywords. This is a complex automated process, or set of processes, that relies on having sufficient information available about each article. These AI methods will be a game changer for scholarly literature searching in the very near future.

BUT... the success of AI methods depends on consistent and sufficient metadata. Subject-descriptive titles, detailed abstracts, or preferably access to full text are essential for reliable AI concept determination.

Consistent article subject headings or journal classifications (ontologies) are a key element in improving AI search success. For example, AI methods might easily determine that articles with the common, consistent subject headings “Vapor Hazards” or “Dust Abatement” concern the concept of “Respiratory Protection”. Likewise, articles about “air quality” in an “Industrial Safety” journal address the concept of “Respiratory Protection”.

Limited and inconsistent metadata limits AI search

Limited and inconsistent metadata constrains what AI can do to determine article concepts, and today there are considerable limitations to the metadata available to AI search tools.

The Semantic Scholar database offers metadata for over 200 million articles drawn from more than 60 sources, including OA resources like PubMed as well as many private publishers. Semantic Scholar is the metadata source used by a number of well-known AI search tools, including Research Rabbit and Elicit’s AI Research Assistant.

Semantic Scholar is a remarkable resource. But it relies on metadata from a wide array of sources of greatly divergent detail and quality. There are no consistent subject headings or journal classifications, which places the burden of concept determination on titles and descriptive abstracts. In my limited search testing of Semantic Scholar, I find that 25% to 40% of article records do not even have an abstract, so AI concept determination can be based only on title words.
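The kind of coverage audit described above is straightforward to run on any batch of paper records, such as those returned by a metadata API like Semantic Scholar's, which includes an optional abstract field per record. The sample records below are invented for illustration.

```python
# Sketch: measuring what fraction of metadata records carry a usable
# abstract. The sample records are made up; in practice they would come
# from a metadata source such as the Semantic Scholar API.
sample_records = [
    {"title": "Respirator fit testing in factories", "abstract": "We tested..."},
    {"title": "Dust abatement methods", "abstract": None},
    {"title": "Behavioural safety training", "abstract": "This trial..."},
    {"title": "Air quality monitoring", "abstract": None},
]

def abstract_coverage(records):
    """Return the fraction of records that have a non-empty abstract."""
    with_abstract = sum(1 for r in records if r.get("abstract"))
    return with_abstract / len(records)

coverage = abstract_coverage(sample_records)
print(f"{coverage:.0%} of records have abstracts")
```

Records that fail this check leave an AI tool with only title words to infer the article's concepts from.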

Because detailed and consistent metadata from open resources like PubMed and ERIC is freely available, search results from many current AI search tools are biased toward results in these OA sources.

The for-profit search indexes Scopus and Web of Science are also rapidly developing AI methods to enhance their search capabilities. These resources have excellent curation, journal subject classification and citation context. But they too depend on variable metadata provided by publishers, and on author-assigned keywords rather than consistently assigned subject headings.

Changes needed to the overall scholarly metadata landscape

Larger changes to the overall scholarly content landscape are needed for the potential of AI methods to be realized. The effort for better AI searching coincides with several other efforts, including those of Crossref and OpenAlex, to create a more open and comprehensive metadata record of all scholarly publications.

Currently the overall body of metadata about scholarly articles remains siloed and non-interoperable. There is no comprehensive source of scholarly metadata on which to build AI search resources.

For-profit publishers, as well as indexing databases, continue to restrict access to their full metadata, an increasingly valuable commercial commodity. So business models are one of the barriers to better AI search.

Thankfully the overall scholarly metadata landscape is changing rapidly. There are several developments that will make AI search capabilities better.

Interchange and cross-comparison between different scholarly metadata sources are needed. Metadata resources like Crossref, OpenAlex and ORCID are working to exchange information with OA resources and with many publishers. Google and Microsoft remain largely holdouts.
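The core of such interchange can be sketched as a merge keyed on a shared identifier like the DOI: take the union of two sources' records and fill the gaps in one from the other. The record shapes and field names below are simplified stand-ins, not the actual Crossref or OpenAlex schemas.

```python
# Sketch: cross-comparing two metadata sources keyed by DOI and filling
# missing fields. Record shapes are invented for illustration; real
# sources (e.g. Crossref, OpenAlex) use richer schemas.
source_a = {
    "10.1000/x1": {"title": "Dust abatement methods", "abstract": None},
    "10.1000/x2": {"title": "Respirator fit testing", "abstract": "We tested..."},
}
source_b = {
    "10.1000/x1": {"title": "Dust abatement methods", "abstract": "This review..."},
    "10.1000/x3": {"title": "Air quality monitoring", "abstract": None},
}

def merge_by_doi(primary, secondary):
    """Union of both sources; fill None fields in primary from secondary."""
    merged = {doi: dict(rec) for doi, rec in primary.items()}
    for doi, rec in secondary.items():
        if doi not in merged:
            merged[doi] = dict(rec)
        else:
            for field, value in rec.items():
                if merged[doi].get(field) is None and value is not None:
                    merged[doi][field] = value
    return merged

merged = merge_by_doi(source_a, source_b)
```

Even this simple merge yields a record set with broader coverage and fewer missing abstracts than either source alone, which is exactly the gain open interchange promises.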

Automated methods for using AI to enhance metadata are developing rapidly. These methods add subject information drawn from article citations and references to improve the available metadata. Pre-search mining of information from networks of associated papers is an active area of research.

Using AI methods to build enhanced metadata from for-profit publisher metadata, while restricting access to the proprietary subject headings and descriptions themselves, appears to be another approach under development.

This is a brief summary of a large and quickly changing area. But as we embrace exciting advancements like artificial intelligence in academic searching, it is important for search professionals and researchers to gain a better understanding of these resources: their great potential, but also where they currently fall short. It is also important for us to be aware of the overall state of the body of scholarly research information that we have to work with, and the changes and advances that are needed there.

This article is based on a talk, “Artificial Intelligence for Scholarly Literature Searching: Magic Bullet or Missing the Mark??”, that Webster presented at the Library 2.0 online conference on 18 April 2024. The video of the talk is available on the Library 2.0 channel on YouTube.