Find Crawl Index focuses on Google, but can generally be applied to most search engines. If you understand what search engines want to do with their resources and what they want to surface to their users, you can start to get a handle on delivering it to them, so that Google can link to your pages in response to its users’ search queries.

Find Crawl Index: The Basics

Google wants to:

Find
Crawl
Index

Content which is:

Relevant;
Created by an Expert;
Who is viewed as an Authority.

So that it can rank that content and link to it in response to a user search query.

I’ve put together a brief PDF as a takeaway. You can grab the Find Crawl Index PDF here.

Find: Finding URLs

Google wants to Find content. It does this via the following means:

Links from existing URLs which it crawls. This is a primary method for finding new URLs to crawl. When Google retrieves a URL it reads the content, and any links it finds may be added to its crawl queue. Although there are examples of URLs ranking which do not have any other links pointing to them, it is always recommended to have some external links pointing to a URL.

URLs within XML Sitemaps submitted via Google Search Console. This is a primary method for finding new or updated URLs to crawl. It is the best way to directly alert Google’s crawlers to come and request a page, and download its contents. If you want a URL indexed, it is always advised to put it into an XML Sitemap. Sign up for Google Search Console here and find out about XML Sitemaps at sitemaps.org.

URLs within XML Sitemaps submitted via pinging Google. This is a secondary method for finding new or updated URLs to crawl. It is primarily used for alerting Google to the fact that an XML Sitemap has been updated, but it may also be used to alert Google to a new XML Sitemap. The XML Sitemap should be in Google Search Console, but that does not seem to be a prerequisite.
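To make the XML Sitemap methods above a little more concrete, here is a minimal sketch of generating a sitemap file with Python's standard library. The example.com URLs, the dates and the build_sitemap helper are all made up for illustration; only the urlset / url / loc / lastmod element names and the namespace come from the protocol published at sitemaps.org.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    last_modified is a date string such as "2024-01-15", or None to omit it.
    This helper is hypothetical, not part of any Google tooling.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Illustrative pages only: one with a last-modified date, one without.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/red-widgets", None),
])
print(sitemap_xml)
```

The resulting file would typically be saved as sitemap.xml on the site and then submitted through Google Search Console.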
URLs submitted directly to Google. This is a secondary method for finding new or updated URLs to crawl. It is likely still available for historical reasons, but with the plethora of new URLs to crawl as the web expands exponentially, it is not recommended as a reliable method. See Google Submit URL.

Other methods. Most other methods, such as URLs within Apps or Social Media URLs, are effectively variants of links from existing URLs. Despite suspicions, there is no evidence that Google uses Google Analytics data, general Google Chrome browsing information, or other data from Google’s services to feed discovered URLs to its crawlers. (It did appear to use Google Toolbar information, but that has effectively gone the way of the dodo.) The reputational risk of using data broken out of those walled gardens is probably too great.

Crawl: Crawling URLs

Google wants to crawl URLs it has found or which have been submitted to it. It does this by working through a queue of URLs using a script referred to as googlebot (and other variants of the same basic script):

Googlebot checks a domain’s robots.txt file to see if it is allowed to crawl the URL in its list. See the robots.txt standard, Google’s basic robots.txt help area, and Google’s much more detailed developer guide for robots.txt. NB: excluding a URL using robots.txt will not prevent it from being indexed. Google will construct indexing information from links pointing to the URL.

Googlebot sends out an HTTP HEAD request for each URL in its queue. The response contains basic information about the URL, such as its HTTP STATUS (eg 200 OK, 404 Not Found, 301 Redirect etc), the date it was Last Modified, what type of content it is, eg text or image, and how long the content is in bytes.

The information in the HEAD defines what happens next. If the HTTP STATUS is 200 OK, googlebot will perform a GET request on the URL. This retrieves the code and text content of the URL.
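The first crawl steps described above (check robots.txt, send a HEAD request, follow up with a GET on 200 OK) can be sketched with Python's standard library. This is only an illustration of the sequence as this article describes it, not a claim about how googlebot is actually built. The crawler name "examplebot", the helper functions and the sample robots.txt rules are all assumptions made up for the example.

```python
import urllib.error
import urllib.request
from urllib import robotparser

USER_AGENT = "examplebot"  # hypothetical crawler name, for illustration only


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits crawling the URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def head_status(url, user_agent=USER_AGENT, timeout=10):
    """Send a HEAD request and return the HTTP status code.

    Needs network access. urlopen follows redirects itself, so 3xx codes
    are not normally seen here; 4xx and 5xx arrive as HTTPError.
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def plan_request(robots_txt, url, status):
    """Decide the next step for a queued URL, given its HEAD status."""
    if not allowed_by_robots(robots_txt, url):
        return "skip: disallowed by robots.txt"
    if status == 200:
        return "GET"
    return "handle status %d" % status


# Made-up robots.txt rules for the example.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(plan_request(SAMPLE_ROBOTS, "https://example.com/red-widgets", 200))
print(plan_request(SAMPLE_ROBOTS, "https://example.com/private/notes", 200))
print(plan_request(SAMPLE_ROBOTS, "https://example.com/old-page", 404))
```

The status is passed in as an argument so the decision logic can be tried without touching the network; in a real run it would come from head_status(url).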
A backend part of googlebot will attempt to read / parse the retrieved content, based on the Content-Type identified in the HEAD request.

Images, scripts, CSS and other rich content referenced with SRC (source) in the code will also be requested and retrieved. This may not happen sequentially; they may be added to the queue instead.

HREF links to other content will be parsed and added to googlebot’s queue.

If the Content-Type is incorrect for what the content actually is, googlebot may attempt to use other methods to read the content.

If the content is unreadable to it, googlebot will stop. That content will not be indexed, although the URL may be, based on links pointing to the URL.

For other HTTP STATUS responses, googlebot will take other action. Here is a brief, and definitely not exhaustive, list. The full list is in Google’s developer guide:

404 Not Found: Update its index, including removal. De-rank from search results. May or may not re-crawl.
301 Redirect (Permanent): Add the destination URL to its crawl queue. Update its index to reflect the new URL. May or may not re-crawl.
302 Redirect (Temporary): Add the destination URL to its crawl queue. Update its index to reflect the content of the new URL. Will re-crawl.
5xx Server Errors: Note the error in Search Console. Update its index. May de-rank from search results. Likely to re-crawl.

Index: Indexing URLs and Content

Google wants to maintain a list of URLs in its index which:

Are available in a web browser and return a 200 OK HTTP Status.
Have content which Google has read and has been able to cache.
Google has visited recently, so it is using the latest information to power its search results.
Do not contain malicious code likely to hijack a user’s information or corrupt their system.

NB: Not all URLs crawled will make it into the index.

Google claims it maintains only one index. At various points it has maintained a Supplementary index, a mobile index and others.
It is likely these still exist in some form, even if they are not named.

Content: Relevance, Expertise, Authority

Google wants to rank content and link to it in response to a user search query. For that content to rank, it should be:

Relevant

Textual. Think keywords, and other words that people use when referring to a subject. These words should appear in text on-page, on-site, in links or mentions of the URL, or on-page / on-site of an external site mentioning the URL. This is logical and lexical semantics, linguistics and semiotics.

Temporal. This is where something is relevant to a specific moment in time. It may be explicit, eg “movies being released this week”, or implicit, eg searches for “Olympics” around the time of the Olympics are likely to be looking for information relating specifically to that Olympiad. Outside that time they are likely to be after more generic information.

Geographical. This is where something is related to a specific location, which may be explicit, eg “hairdressers in Coogee”, or implicit, eg a search for “bars” on a mobile device is likely to surface bars which are local to my current location.

Contextual. Knowledge of the user’s context may influence the search results surfaced, eg on a mobile device at home, I may have more time to consume information, so I get more detailed results. On a mobile, on a bus, on my way home from work, I may need a much snappier type of information. Google has the information to surface this type of result, but doesn’t (yet) do it terribly well.

History Influenced. This is where the results presented are influenced by my previous searches on the topic, or by related searches and sites clicked on in previous search sessions.

Cohort Influenced. This is where search results are influenced by the cohort, where there is limited individualisation and where previous users from the same group have behaved in similar ways.
This may be grouped by things such as web browser, operating system, browser language, location, time of day, IP address, ISP, device used, and so on.

Expert

Generally, unless you’re watching certain news channels, experts have a depth and breadth of knowledge about a subject.
Depth comes from the fact that they display lots of knowledge about a topic, eg they know all about red widgets, their types, sizes, designs, etc.
Breadth comes from the fact that they know a fair amount about the topic and surrounding areas, eg they know about red widgets, but also know about green widgets, blue widgets and widget substitutes.
Expert content tends to be well-organised, well-structured and detailed.
Experts also have a tendency to link to and reference other sources of information on a topic, experts and otherwise.

Authority

Comes when an expert on a topic is regarded as an authority by other experts.
Comes from being extensively linked and referenced by those other experts.
Comes from being mentioned or referenced by non-experts in the context of general conversations around a subject.
Generally, an authority tends to have well-designed, easily accessible and digestible content which is not dominated by advertising or referral links.
Authority content tends to exist for its own sake, not for the sake of passing on the click.