- A search engine spider is also known as a crawler, robot, searchbot, or simply a bot.
- It is a program that search engines use to discover new and updated content on the Internet (e.g., GoogleBot).
- It “crawls” the web and collects documents to build a searchable index for its search engine. The program starts at a website and follows every hyperlink on each page (see the crawler sketch after this list).
- In principle, every publicly linked page on the web will eventually be found and spidered as the “spider” crawls from one website to another.
- Search engines may run thousands of instances of their web crawling programs simultaneously, on multiple servers.
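The crawl-and-follow-links loop described above can be sketched in a few lines of Python. This is a minimal, illustrative toy, not any real search engine's code: the seed URL, the page limit, and the `LinkExtractor` helper are all assumptions made for the example. A production crawler would also consult robots.txt before fetching (covered later in this section).

```python
# Minimal breadth-first crawler sketch using only the standard library.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])   # queue of URLs waiting to be visited
    seen = {seed}              # never fetch the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue           # skip unreachable pages
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        print(f"fetched {url}: {len(parser.links)} links found")

crawl("https://example.com")
```

The `seen` set is what keeps the crawler from looping forever on pages that link to each other; real crawlers use the same idea at a much larger scale.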
- First, the search bot crawls the pages of your site.
- Second, it indexes the words and content of the site.
- Lastly, it visits the links (web page addresses, or URLs) found on your site. When the spider can no longer find a page, that page will eventually be deleted from the index.
- Some spiders will check a second time to verify that the page really is offline before removing it (see the sketch below).
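The "verify before delete" step might look like the following sketch. Everything here is a hedged assumption: `index` is a hypothetical dictionary mapping URLs to stored page text, and the re-check delay is arbitrary.

```python
# Sketch: remove a page from the index only after two failed fetches.
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def is_reachable(url):
    try:
        with urlopen(url, timeout=5) as resp:
            return resp.status < 400
    except (HTTPError, URLError):
        return False

def prune(index, url, recheck_delay=60):
    if is_reachable(url):
        return                      # page is fine; keep it indexed
    time.sleep(recheck_delay)       # wait, then check a second time
    if not is_reachable(url):
        index.pop(url, None)        # page really is gone: drop it
```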
- When a web crawler visits one of your pages, it fetches the page's content and stores it. Once a page has been fetched, its text is loaded into the search engine's index: a massive database of words and the web pages on which each word occurs (a toy version appears below).
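A data structure of "words and where they occur" is called an inverted index. The toy version below assumes the pages have already been fetched as plain strings; real indexes also store word positions, rankings, and much more.

```python
# Toy inverted index: each word maps to the set of URLs containing it.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping url -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "https://example.com/a": "Spiders crawl the web",
    "https://example.com/b": "The web is indexed by spiders",
}
index = build_index(pages)
print(index["web"])   # both URLs contain the word "web"
```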
- The first thing a spider is supposed to do when it visits your website is look for a file called “robots.txt”.
- This file contains instructions for the spider on which parts of the website to index, and which parts to ignore.
- The robots.txt file is the primary way to control which parts of your site a spider crawls.
- All spiders are expected to follow these rules, and the major search engines' crawlers do for the most part (compliance is voluntary, not enforced).
- A robots.txt file can request that bots crawl only parts of a website, or none of it at all (see the example below).
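Python's standard library can parse robots.txt rules directly. The rules below are an illustrative example (the "BadBot" user agent is made up), but `urllib.robotparser` is the real module a well-behaved crawler could use:

```python
# Checking robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GoogleBot", "https://example.com/index.html"))  # True
print(rp.can_fetch("GoogleBot", "https://example.com/private/x"))   # False
print(rp.can_fetch("BadBot", "https://example.com/index.html"))     # False
```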
- Web crawlers copy the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search them efficiently.
- Crawlers can also validate hyperlinks and HTML code, and they are used for web scraping and data-driven programming (a link-validation sketch follows).
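Hyperlink validation, one of the crawler uses listed above, amounts to fetching each URL and reporting the ones that fail. A minimal sketch, assuming the list of URLs to check is already collected:

```python
# Report URLs that are unreachable or return an HTTP error code.
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def check_links(urls):
    broken = []
    for url in urls:
        try:
            with urlopen(url, timeout=5) as resp:
                if resp.status >= 400:
                    broken.append((url, resp.status))
        except HTTPError as err:
            broken.append((url, err.code))        # e.g. 404 Not Found
        except URLError as err:
            broken.append((url, str(err.reason))) # e.g. DNS failure
    return broken

print(check_links(["https://example.com", "https://example.com/missing"]))
```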