GitHub prevents crawling of repository's Wiki pages - no Google search #1683
Got an update from GH today:
I urge the GitHub team to remove this restriction on robots scanning the GitHub wiki pages. I put a great deal of effort into providing wiki pages that would assist users of my open source software. I also hoped that they would help potential users find the project by providing meaningful content related to the problems my software addresses. The fact that Google cannot index my pages seriously limits the effectiveness of that content. For example, I've written an article on Natural Neighbor Interpolation, which is a function my software supports. It's a specialty topic and the information I supply is not well-covered elsewhere. Enough people have linked to my article that if you run a Google search on "Natural Neighbor Interpolation" my wiki page comes up as the fourth item in the search. But, disappointingly, the description line on Google's search page reads "No information is available for this page". Therefore I respectfully request that GitHub reconsider its position on restricting web crawlers from indexing wiki pages.
Same problem here. Any updates on GitHub removing this entry? (It does more harm than good.)
Still blocked from crawlers.
Please can GitHub remove the restriction on Google and other search engines crawling Wiki pages? I want my Wiki to be seen!
GitHub should remove this entry in the robots.txt file and let the repo owner decide. The default setting for a wiki page could be "noindex, nofollow" set in a meta tag, but it should be possible to unset it.
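For illustration, here is a minimal sketch (Python, standard library only) of how a crawler could honour that kind of per-page opt-out; the example URL and the crude regex check are assumptions, not anything GitHub actually serves today.

```python
# Sketch: decide whether a page may be indexed under the proposed scheme, where a
# per-page <meta name="robots"> tag (default "noindex, nofollow") controls indexing.
# The URL is only an example and the regex is a crude check, not a real HTML parser.
import re
import urllib.request

def may_index(url: str) -> bool:
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html,
        re.IGNORECASE,
    )
    # If the page carries "noindex", a well-behaved crawler would skip it.
    return not (meta and "noindex" in meta.group(1).lower())

print(may_index("https://github.com/commaai/openpilot/wiki"))
```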
They appear to have shifted to a custom crawling process. The first two lines of the current robots.txt: …
Google and DuckDuckGo still aren't indexing the GitHub wiki pages. I found another obscure search engine called Bing, which also gives no results for wiki pages. I've not had any further updates from GitHub support. I'll prod them again to see why they've ignored this request and persist in a partial crawl of GitHub. For reference: github.com.robots.20210120.txt
I just put in a new support ticket for GitHub to review this fiasco, and I mentioned this issue for detail and support for the fix.
GitHub support says:
Kevin responds:
Wikis not being crawlable is nonsense.
The comma.ai community put a lot of work into the FAQ and many other pages. It's a bummer that it isn't indexed. I'm sure a few other projects have similar wikis with lots of content in them that are pretty much invisible. Maybe there should be a warning put on the Wiki functionality that the content in Wikis is generally invisible to search engines.
The suggestion that a 'closed' Wiki that does not allow comments should be eligible to be crawled sounds sensible to me. This would stop people spamming GitHub, and would allow each project to decide if they wanted to make their Wiki searchable. In any event, if someone wanted to spam GitHub, most projects allow issues to be raised; the argument that Wikis are blocked from crawling to stop spamming is a bit thin, because Issues could just as easily be used as a vector for trolling. Please allow projects to make their Wiki crawlable.
A wiki's whole purpose is to share useful information, and that purpose is defeated if it cannot be shared as widely as its creators see fit. Sure, there should be a way to allow "private" wikis, but there should also be a way to have public ones. Otherwise projects will use other services to host such things (which I've seen in the past and not understood until now). Setting non-crawlable as a default seems reasonable, but not allowing projects to choose otherwise does not. Please reconsider.
Referenced from github#4115 ("…ines and are excluded by `robots.txt`"); archived robots.txt: https://web.archive.org/web/20210403000950/github.com/robots.txt
I think that the URL is visible to Google and other search engines. When I search for terms that match it, the URLs are bolded with the search terms and they do come up in the search. I am not sure if the content is used though. If you've ever searched for something that exists on StackOverflow, you may have noticed some mirrors of StackOverflow content also ranking highly. I don't particularly like these operations, but maybe what they're doing can help here.

I hastily made this service to try to get the comma.ai openpilot wiki content indexed: https://github-wiki-see.page/m/commaai/openpilot/wiki It's quite sloppy but it should work for other wikis too if a relevant link is placed in a crawlable place. I'm no SEO expert so this experiment may very well crater, but I figured I'd try something for not a lot of money. I doubt it'll rank highly since there are no links to it and it is in no way canonical.

I've also made some PRs, as you can see in the issue reference alerts, to update the GitHub documentation. In them, I've also suggested adding that users who want content that is crawlable and accepting of public contributions produce a GitHub Pages site backed by a public repository. To be honest though, that kind of setup is a pain in the ass for all parties and we're all lazy bastards.
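Judging from the example link above, the mirror seems to follow a simple URL scheme: the original wiki path appended after /m/. A small, hypothetical helper along those lines could generate crawlable links to drop into READMEs or other indexed pages:

```python
# Hypothetical helper: rewrite a GitHub wiki URL into its github-wiki-see.page
# mirror URL, following the /m/<owner>/<repo>/wiki/... pattern shown above.
from urllib.parse import urlparse

def mirror_url(wiki_url: str) -> str:
    parts = urlparse(wiki_url)
    if parts.netloc != "github.com" or "/wiki" not in parts.path:
        raise ValueError("not a GitHub wiki URL")
    return "https://github-wiki-see.page/m" + parts.path

print(mirror_url("https://github.com/commaai/openpilot/wiki/FAQ"))
# -> https://github-wiki-see.page/m/commaai/openpilot/wiki/FAQ
```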
💸 I ran this big boy of a query in BigQuery as part of my project to generate sitemaps for my workaround:

```sql
#standardSQL
CREATE TEMPORARY FUNCTION
  parsePayload(payload STRING)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS """
    try {
      return JSON.parse(payload).pages.reduce((a, s) => { a.push(s.html_url); return a; }, []);
    } catch (e) {
      return [];
    }
  """;
SELECT
  *
FROM (
  WITH
    parsed_payloads AS (
      SELECT
        parsePayload(payload) AS html_urls,
        created_at
      FROM
        `githubarchive.month.*`
      WHERE
        type = "GollumEvent")
  SELECT
    DISTINCT html_url,
    created_at,
    ROW_NUMBER() OVER(PARTITION BY html_url ORDER BY created_at DESC) AS rn
  FROM
    parsed_payloads
  CROSS JOIN
    UNNEST(parsed_payloads.html_urls) AS html_url)
WHERE
  rn = 1
  AND html_url NOT LIKE "%/wiki/Home"
  AND html_url NOT LIKE "%/wiki/_Sidebar"
  AND html_url NOT LIKE "%/wiki/_Footer"
  AND html_url NOT LIKE "%/wiki/_Header"
```

$45 later, I had a list of 4,566,331 wiki pages that have been touched over the last decade, excluding Home and the trimmings. That's a lot of content being excluded from search.

I've saved the results into the publicly accessible … I've also been using the litmus test of … If you searched for … I think search engines don't index the content if …
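For anyone wondering what "generate sitemaps" looks like in practice, here is a rough sketch; the input file name is an assumption (a plain-text export of the query results, one URL per line), while the 50,000-URLs-per-file cap comes from the sitemaps.org protocol:

```python
# Sketch: turn a plain-text list of wiki URLs (one per line, e.g. exported from the
# BigQuery results above) into sitemap XML files, 50,000 URLs per file.
from pathlib import Path
from xml.sax.saxutils import escape

URLS_FILE = Path("wiki_urls.txt")   # assumed export of the query results
CHUNK = 50_000                      # sitemaps.org cap on URLs per sitemap file

urls = [line.strip() for line in URLS_FILE.read_text().splitlines() if line.strip()]

for i in range(0, len(urls), CHUNK):
    body = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls[i:i + CHUNK])
    Path(f"sitemap-{i // CHUNK:04d}.xml").write_text(
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{body}\n</urlset>\n"
    )
```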
I've since produced a new BigQuery table, and a new bundle of sitemaps from it that has checked all the links and only includes …
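The filtering step is presumably a liveness check of some kind; a minimal sketch might look like this, assuming the criterion is simply "the URL still answers with HTTP 200" (the comment above is truncated, so that condition is a guess):

```python
# Sketch: keep only wiki URLs that still respond successfully, assuming the filter
# criterion is a plain HTTP status check (the original comment is cut off here).
import urllib.error
import urllib.request

def is_live(url: str) -> bool:
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

live_urls = [u for u in ["https://github.com/commaai/openpilot/wiki/FAQ"] if is_live(u)]
```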
FWIW, I've made my mirroring tool append the … attribute.
It turns out they already attach …
GitHub currently has a `robots.txt` which is preventing crawling of the paths associated with the Wiki area for each and every repository. This is explicit and looks very intentional. I've asked about this (19-Oct-2019) and got no response; the ticket number is 430217.

I've attached the current (27-Oct-2019) robots.txt file: github.com.robots.20191027.txt
The gist of it:
I would like this to change to make the Wiki areas searchable using popular search engines.
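For anyone who wants to verify the behaviour themselves, Python's standard urllib.robotparser applies robots.txt the same way a well-behaved crawler would; the wiki URL below is just an example:

```python
# Check whether GitHub's live robots.txt allows a generic crawler to fetch a wiki page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://github.com/robots.txt")
rp.read()

# Prints False while the wiki paths are disallowed for generic user agents.
print(rp.can_fetch("*", "https://github.com/commaai/openpilot/wiki"))
```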