
Identify and address Google search and Dataset search issues #12

Open
1 task
amoeba opened this issue Aug 11, 2021 · 9 comments
amoeba commented Aug 11, 2021

Being comprehensively indexed by search engines such as Google is a substantial benefit for DataONE and DataONE Members. Ideally, the whole variety of information (datasets, people, portals, metrics, etc.) housed within DataONE would be findable through traditional search engines and, for datasets, also Google Dataset Search.

As of 2021, our primary tool for knowing whether we are comprehensively indexed is the Google Search Console, which provides a whole suite of tools for diagnosing indexing issues.

Some of the problems we've addressed in the past include:

Issues we have ahead of us include:

  • There's a separation between how we're doing SEO on dataone.org and search.dataone.org. We should probably integrate the two under dataone.org
  • For specific types of content (eg portals), our search index presence is very far from complete. Portal users would find it very valuable to show up in Google search. We need to find a way to include all of this stuff, not just datasets
  • Our dataset URLs aren't being fully indexed (see below)

Dataset coverage

Summary

At the time of writing, we have 846,622 dataset and portal URLs listed in our sitemaps and Google has discovered them all correctly. 812k of these are marked as "Excluded". When we drill down into the index coverage of those URLs, we get this breakdown:

| Type | # Pages |
| --- | ---: |
| Discovered - not currently indexed | 770,302 |
| Duplicate, submitted URL not selected as canonical | 30,977 |
| Duplicate without user-selected canonical | 12,265 |
| Crawled - currently not indexed | 7,986 |
| Pages with redirect | 424 |
| Blocked due to other 4xx issue | 6 |

Discovered - not currently indexed

The majority of these are "Discovered - not currently indexed", which is defined as:

> Discovered - currently not indexed: The page was found by Google, but not crawled yet. Typically, Google wanted to crawl the URL but this was expected to overload the site; therefore Google rescheduled the crawl. This is why the last crawl date is empty on the report.

This makes sense to me, as our individual dataset landing pages are very slow relative to what Google expects. I'm hoping to hear more back from the Google team about whether this is truly what's going on or whether it's something else. We know from unofficial sources that Google's crawl infrastructure has two queues: one for fast sites, and a lower-priority one for sites it had hoped were fast but that turned out not to be. I'd guess we're in the latter.

Duplicate, submitted URL not selected as canonical

What we see here is that Google is selecting a URL like https://dataone.org/datasets/R1-x138-079-0042-010 as our canonical. These URLs are reported in our embedded JSON-LD and they really are our canonical URLs. I think we should consider switching our sitemap implementation on the CNs to use https://dataone.org/datasets URLs instead.
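If we did switch, each sitemap `<url>` entry would simply point at the canonical landing page. A fragment is sketched below; it's illustrative only, reusing the identifier from the example above:

```xml
<url>
  <loc>https://dataone.org/datasets/R1-x138-079-0042-010</loc>
</url>
```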

Other things

Aside from the above categories, we've also seen some odd cases:

[Note: Please feel free to edit this issue to be more complete]

Next steps

  • Correctly configure robots.txt and web server configs on various hosts
    • Production
      • Verify robots.txt Allow and Sitemap directives; Disallow API routes (i.e., /metacat)
    • Non-production
      • Header Set X-Robots-Tag "noindex, nofollow"
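For the production item, the robots.txt directives might look something like the sketch below. This is an assumption about the eventual config, not a description of what's deployed; the exact paths and the Sitemap URL would need to be verified:

```
# robots.txt on the production host (hypothetical sketch)
User-agent: *
Disallow: /metacat/    # keep crawlers off the API
Allow: /

Sitemap: https://search.dataone.org/sitemap1.xml
```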
@amoeba amoeba self-assigned this Aug 11, 2021

mbjones commented Aug 11, 2021

Thanks for the summary, Bryce.

One thing we can do right off is to eliminate crawling of our stage, dev, and sandbox hosts, to eliminate some of the errors we see from those hosts. I suggest that we add:

    ## Disallow Robots for all content
    Header Set X-Robots-Tag "noindex, nofollow"

to the Apache config for all non-production search and API servers. This is the Google-recommended way to indicate robot preferences (instead of robots.txt), and it is really easy to implement. I tested it on cn-stage.test.dataone.org and it seems to deliver the expected headers.
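One way to spot-check that a host is sending the header is a few lines of Python. This is a minimal sketch; the helper name and the commented-out live request are ours, not an existing tool:

```python
from urllib.request import urlopen  # only needed for the live check below

def noindex_enabled(headers):
    """True when an X-Robots-Tag header asks robots to skip the page.

    `headers` is any mapping of header name -> value, e.g. the
    `resp.headers` object urllib returns.
    """
    tag = headers.get("X-Robots-Tag", "").lower()
    return "noindex" in tag and "nofollow" in tag

# Live check against a host (requires network):
# with urlopen("https://cn-stage.test.dataone.org/") as resp:
#     print(noindex_enabled(resp.headers))
```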


amoeba commented Aug 14, 2021

Thanks @mbjones, that sounds like a good step.

As an update on our coverage in Google Search: since writing this issue and talking with folks at Google, our coverage has gone up a bit, by maybe 10k records. So we can track progress here: I'm currently counting 48,434 sitemap entries covered and 812k excluded.


mbjones commented Aug 16, 2021

I also see a similar exclusion pattern for the Arctic Data Center (18.3K Valid, 11.6K Excluded). The following is the breakdown of the excluded URIs by reason on the ADC site:

[Screenshot: adc-excluded-2021-08-16 — breakdown of excluded URIs by reason for the ADC site]


amoeba commented Sep 14, 2021

Just updating on status here: I've had a few back and forths with folks at Google and they're planning to take a closer look at things. The vast majority of our dataset landing pages (~660k) aren't indexed at all, despite being discovered by Google via our sitemaps. At this point, I'm guessing they'll come back and say it's because our individual pages are slow enough to cause their crawlers to think they're overloading us (they aren't) and for them to stop crawling.

We have at least one ticket on this topic but I can't seem to find it right now.

I'll update here when I hear back from Google.


amoeba commented Oct 12, 2021

Ran into a new twist today. It looks like a Dataset record will fail Google's validation if it doesn't have a description between 50 and 5,000 characters. @mbjones found this out while looking at some of the invalid datasets in our Search Console and checking Google's guidelines.

@mbjones said on Slack:

> I pulled our abstracts from SOLR for all non-obsoleted metadata in DataONE; we have 857 datasets with abstracts > 5000 chars, and 434,422 with abstracts < 50 chars. Of the latter, 409,695 are missing abstracts.

So about half of DataONE's content may be failing validation for this reason. Adding this note here just for the paper trail.
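The check Google appears to apply can be mimicked in a few lines, which could be run over the SOLR export above. The thresholds come from the guidelines quoted in this thread; the function name and labels are ours:

```python
MIN_LEN, MAX_LEN = 50, 5000  # Google's stated bounds for dataset descriptions

def description_status(abstract):
    """Classify a dataset description against Google's Dataset guidelines:
    descriptions must be between 50 and 5000 characters."""
    n = len(abstract or "")
    if n < MIN_LEN:
        return "too_short"   # includes missing abstracts entirely
    if n > MAX_LEN:
        return "too_long"
    return "ok"
```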

@laurenwalker (Contributor) commented:

Can we create a description from other parts of the metadata when there is no abstract? For example, piecing together the creator names, title, location, etc. to create a pseudo-abstract?


amoeba commented Oct 12, 2021

Yeah @laurenwalker, that's a good idea. I think an easy thing to do would be to detect too-short-for-Google abstracts and pad them with a phrase like "For complete metadata, visit https://arcticdata.io/catalog/view/doi:1234/AA/5678.", which is, by itself, 81 characters.
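A sketch of that padding (together with truncating over-long abstracts, per the follow-up suggestion): the function name is ours, and the padding phrase mirrors the example above:

```python
MIN_LEN, MAX_LEN = 50, 5000  # Google's stated bounds for descriptions

def fix_description(abstract, landing_url):
    """Pad a too-short (or missing) abstract with a pointer to the dataset
    landing page, and truncate over-long ones, so the result fits Google's
    50-5000 character window."""
    text = (abstract or "").strip()
    if len(text) < MIN_LEN:
        suffix = f"For complete metadata, visit {landing_url}."
        text = (text + " " + suffix).strip()
    return text[:MAX_LEN]
```

With a missing abstract and the example URL above, the result is exactly the 81-character phrase, which clears the 50-character floor on its own.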


mbjones commented Oct 13, 2021

That sounds like a great idea!


mbjones commented Oct 13, 2021

And I guess truncate the ones that are too long.
