Google Dataset Search – first impressions

The launch of Google Dataset Search has been big news. Aaron Tay considers the implications for librarians and researchers.

Page 1 of 2 next >>

A new library challenge

Google has described Google Dataset Search as being similar to Google Scholar, allowing users to "…find datasets wherever they’re hosted, whether it’s a publisher's site, a digital library, or an author's personal web page."

In a sense this move by Google isn't surprising. As I noted in July 2018, general dataset discovery is a ‘new’ library challenge. 

Google Dataset Search feels raw even for a "beta" product

Google's UI style is well known (limited filters and facets, less focus on advanced search), but currently Dataset Search is bare-bones even by Google’s standards.

You get a nice auto-complete, but as it stands a very basic function is missing: the number of results returned by a search is not shown.

I was trying to check whether every dataset in a data repository was included by running a site:Domainname search, and without a result count it is much harder to determine whether all records are there.

In terms of search syntax, the obvious assumption is that what works for Google, or perhaps Google Scholar, would also work for Google Dataset Search, but that would not be a safe assumption.

It would be nice to have search syntax for restricting results by date or by file type (e.g. csv).

In any case, it's probably pointless to try to figure out what works for searching when everything is so raw.

In terms of coverage, it's hard to do a comprehensive comparison, but I do notice that it is not just open data that appears. For example, closed/proprietary data from CEIC is findable too! There is no link, though. This makes me wonder whether it will also surface other closed datasets that have Schema.org metadata.

How does Google Dataset Search actually work?

Like Google Scholar, Dataset Search crawls the web for content to index. But unlike Google Scholar, metadata is critical for the dataset content to appear.

DataCite, which seems to be a major partner of this initiative, states there are two main requirements for datasets to appear in the search.

Firstly, Schema.org metadata needs to be embedded in each dataset landing page so that the Google indexer can find it. Secondly, the data repository needs to provide a sitemap file listing the URLs of all dataset landing pages.
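To make these two requirements concrete, here is a minimal sketch of what a repository might generate for each dataset. It assumes the metadata is embedded as JSON-LD in a script element, which is one common way of embedding Schema.org markup, and it uses only a handful of Schema.org Dataset properties (name, description, url, identifier). The record fields, the example.org URLs and the DOI are hypothetical and not taken from any real repository.

```python
import json
from xml.sax.saxutils import escape


def dataset_jsonld(record):
    """Build a Schema.org Dataset description from a repository record (a plain dict)."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "url": record["landing_page"],
    }
    if record.get("doi"):
        # Expose the DOI as a resolvable identifier.
        doc["identifier"] = "https://doi.org/" + record["doi"]
    return doc


def jsonld_script_tag(doc):
    """Wrap the JSON-LD in the script element that goes into the landing page HTML."""
    # Escape "</" so a stray "</script>" inside the metadata cannot end the element early.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


def sitemap_xml(urls):
    """List every dataset landing page URL in a sitemap file."""
    entries = "".join("  <url><loc>" + escape(u) + "</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries
        + "</urlset>\n"
    )


# Hypothetical record, for illustration only.
record = {
    "title": "Example survey data",
    "description": "Responses to a hypothetical survey.",
    "landing_page": "https://repo.example.org/dataset/1",
    "doi": "10.1234/example",
}
print(jsonld_script_tag(dataset_jsonld(record)))
print(sitemap_xml([record["landing_page"]]))
```

A real repository would populate many more properties (creators, licence, dates and so on) from its own metadata schema. The sketch only shows the shape of the two artefacts the crawler looks for.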

As the Google blog post states, "A search tool like this one is only as good as the metadata that data publishers are willing to provide".

Given that Google is relying on Schema.org metadata, I confess I still don't understand how it detects citations to datasets. Surely there aren't that many articles or datasets floating around with Schema.org metadata that uses the citation property?

Implications for libraries for data repositories

The first obvious implication for libraries is to ensure the datasets they host in their repositories are visible in Google Dataset Search. Given the name recognition Google and Google Scholar have with faculty, I predict a very common strategy will be for librarians to show faculty that their deposited datasets appear in Google Dataset Search.

The main requirement is that Schema.org metadata is embedded into your data repository pages.

Unfortunately, while many data repositories such as Figshare, Dryad, Dataverse and Mendeley Data support Schema.org (see list here), many more do not.

In particular, I suspect many institutions using open source data repositories such as DSpace or EPrints will need to do some extra work to add in this feature with a plugin (if it exists), or code up something to handle this. 
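If you want a quick check of whether a given landing page already carries this metadata, something like the following could be a starting point. It is only a sketch: it assumes the metadata is embedded as JSON-LD in a script element, so it will miss Microdata or RDFa markup and metadata nested inside an @graph wrapper, and it works on HTML you have already saved or fetched rather than retrieving pages itself. The class and function names are my own.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def find_datasets(html):
    """Return all top-level JSON-LD objects in the page whose @type includes Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    found = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed JSON-LD rather than fail
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                found.append(item)
    return found
```

An empty list from find_datasets suggests the page has no JSON-LD Dataset markup, which is a hint (not proof) that the repository needs the extra work described above.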

I have no doubt that a lot more repositories will start prioritising this.

What if you have a data repository but do not have the capability to add Schema.org to the landing pages? Are you totally invisible? Not quite. Chances are you registered your dataset with a DataCite DOI, and DataCite has done some work to ensure that its own entry for the dataset is indexed in Google.

In effect, even if your data repository is invisible to the search, DataCite will ensure its record is indexable, so a searcher can find their way to your repository via that record at the cost of an additional click.

The technical details can be very complicated (see for example this case) but this gives you a taste of how it works. 

 
