Identify and address Google search and Dataset search issues #12
Comments
Thanks for the summary, Bryce. One thing we can do right away is stop crawling of our stage, dev, and sandbox hosts, which would eliminate some of the errors we see from those hosts. I suggest that we add:
Header Set X-Robots-Tag "noindex, nofollow"
to the Apache config for all non-production search and API servers. This is the Google-recommended way to indicate robot preferences (instead of robots.txt), and it is really easy to implement. I tested it on cn-stage.test.dataone.org and it seems to deliver the expected headers.
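A quick way to confirm that a non-production host is actually sending the header is to inspect its response headers. The following is a minimal sketch, not an existing DataONE tool: the function names `blocks_indexing` and `fetch_headers` are hypothetical, and the check ignores user-agent-scoped values such as "googlebot: noindex".

```python
import urllib.request


def blocks_indexing(headers):
    """Return True if the response headers ask robots not to index the page."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    value = normalised.get("x-robots-tag", "")
    directives = {part.strip().lower() for part in value.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives


def fetch_headers(url):
    """Issue a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers.items())
```

For example, `blocks_indexing(fetch_headers("https://cn-stage.test.dataone.org/"))` should return True once the Apache directive is in place on that host, assuming the host answers HEAD requests.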
Thanks @mbjones, that sounds like a good step. As an update about our coverage in Google Search, since writing this issue and talking with folks at Google, our coverage has gone up a bit, by maybe 10k records. So we can track progress here, I'm counting 48,434 sitemap entries covered, 812k excluded.
Just updating on status here: I've had a few back and forths with folks at Google and they're planning to take a closer look at things. The vast majority of our dataset landing pages (~660k) aren't indexed at all, despite being discovered by Google via our sitemaps. At this point, I'm guessing they'll come back and say it's because our individual pages are slow enough to cause their crawlers to think they're overloading us (they aren't) and for them to stop crawling. We have at least one ticket on this topic but I can't seem to find it right now. I'll update here when I hear back from Google.
Ran into a new twist today. It looks like a Dataset record will fail Google's validation if it doesn't have a @mbjones said on Slack:
So about half of DataONE's content may be failing validation for this reason. Adding this note here just for the paper trail.
Can we create a
Yeah @laurenwalker, that's a good idea. I think an easy thing to do would be to detect too-short-for-Google abstracts and pad on a phrase like "For complete metadata, visit https://arcticdata.io/catalog/view/doi:1234/AA/5678." which is, by itself, 81 characters.
That sounds like a great idea!
And I guess truncate the ones that are too long.
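The pad-and-truncate idea from the comments above could look roughly like the sketch below. The 50 and 5000 character bounds are an assumption based on Google's published guidance for the Dataset description property and should be checked against the current documentation; the function name `fit_description` is hypothetical.

```python
MIN_LEN = 50    # assumed lower bound for a Dataset description
MAX_LEN = 5000  # assumed upper bound for a Dataset description


def fit_description(abstract, landing_url, min_len=MIN_LEN, max_len=MAX_LEN):
    """Pad a too-short abstract or truncate a too-long one."""
    # Collapse runs of whitespace so the length check reflects real content.
    text = " ".join(abstract.split())
    if len(text) < min_len:
        # Pad with a pointer to the landing page, as suggested in the thread.
        # A very short URL could still leave the result under min_len.
        pointer = f"For complete metadata, visit {landing_url}."
        text = f"{text} {pointer}".strip()
    if len(text) > max_len:
        # Cut at a word boundary and leave room for a trailing ellipsis.
        cut = text[: max_len - 3]
        head = cut.rsplit(" ", 1)[0]
        text = (head or cut) + "..."
    return text
```

Abstracts that already fall within the bounds pass through unchanged apart from whitespace normalisation, so the helper can be applied to every record when the JSON-LD is generated.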
Being comprehensively indexed by search engines such as Google is a substantial benefit for DataONE and DataONE Members. Ideally, the whole variety of information (datasets, people, portals, metrics, etc.) housed within DataONE would be findable through traditional search engines and, for datasets, also Google Dataset Search.
As of 2021, our primary tool for knowing whether or not we are comprehensively indexed is the Google Search Console, which provides a whole suite of tools for diagnosing issues.
Some of the problems we've addressed in the past include:
Issues we have ahead of us include:
Dataset coverage
Summary
At the time of writing, we have 846,622 dataset and portal URLs listed in our sitemaps and Google has discovered them all correctly. 812k of these are marked as "Excluded". When we drill down into the index coverage of those URLs, we get this breakdown:
Discovered - not currently indexed
The majority of these are "Discovered - not currently indexed", which is defined as:
This makes sense to me, as our individual dataset landing pages are very slow relative to what Google expects. I'm hoping we hear back from the Google team about whether this is truly what's going on or if it's something else. We know from unofficial sources that Google's crawl infrastructure has two queues: one for fast sites, and a separate, lower-priority one for sites it had hoped were fast but turned out not to be. I'd guess we're in the latter.
Duplicate, submitted URL not selected as canonical
What we see here is that Google is selecting a URL like https://dataone.org/datasets/R1-x138-079-0042-010 as our canonical. These URLs are reported in our embedded JSON-LD and they really are our canonical URLs. I think we should consider switching our sitemap implementation on the CNs to use https://dataone.org/datasets URLs instead.
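To illustrate the proposed switch, here is a minimal sketch of emitting sitemap entries that use the https://dataone.org/datasets URL form. It is not the CN sitemap implementation: the function names are hypothetical, and percent-encoding the identifier is an assumption about how identifiers containing characters such as ":" or "/" should appear in the URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CANONICAL_BASE = "https://dataone.org/datasets/"


def canonical_url(pid):
    """Build the canonical landing page URL for an identifier."""
    # Assumption: identifiers are percent-encoded in full, including "/".
    return CANONICAL_BASE + quote(pid, safe="")


def build_sitemap(pids):
    """Return a sitemap XML string with one <url><loc> entry per identifier."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for pid in pids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = canonical_url(pid)
    return ET.tostring(urlset, encoding="unicode")
```

With this, the identifier R1-x138-079-0042-010 produces the same URL that Google already selects as canonical, so the sitemap and the embedded JSON-LD would agree.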
Other things
Aside from the above categories, some of the odd stuff we've seen is:
[Note: Please feel free to edit this issue to be more complete]
Next steps
Header Set X-Robots-Tag "noindex, nofollow"