Conclusion: yes, Web search engines can index wikis. Google Loves Wiki.
Therefore, an HTTP search engine is a useful tool to provide increased search capability for any wiki.
However, the Robots.txt file should be properly configured.
And the appropriate META tags should be added to "Edit Page" and "Edit Copy" to exclude all robots from indexing obsolete text.
Robots.txt can be configured to allow a local search engine to index the site while excluding public Internet search engines, something like this:
# To allow a single robot
User-agent: FriendlyWebCrawler
Disallow:

# To exclude all other robots
User-agent: *
Disallow: /
See Robots Dot Txt for a link to the file format spec.
Discussion
Any HTTP search engine can index any wiki. Whether it is possible in a given case is just a matter of configuration, URL space, and page content (META tags). There are no inherent technical limits. (copied from Wiki Navigation Pattern)
Note the following:
Some wikis use robots.txt to avoid being indexed.
As with any other Web site, the searchbot increases traffic on the wiki, but hopefully the searchbot is well behaved and the increased traffic is negligible.
As with any other Web site, the index is not continually updated, and thus not always up to date.
When the term "caching" of wiki pages is used, I believe it describes a sort of virtual caching: the searchbot indexes a (dynamic) Web page that "disappears" between "instantiations", and the index is all that remains between instantiations of the dynamic page.
I don't know if there is a way for a wiki (or other Web site) to force a searchbot to update its index.
References
A sample robots.txt (for the Portland Pattern Repository):
# Prevent all robots from getting lost in my databases
User-agent: *
Disallow: /cgi/
Disallow: /cgi-bin/
usemod.com: A useful link to learn more about preventing a wiki site from being indexed, with "sublinks" worth following.
Comments
Sure, why not? Meatball Wiki is indexed by Google. Some searches are bizarre. We got one trolling for kiddie porn once. Recently, Cliff decided to exclude bots from hitting the edit pages, but they are greatly encouraged to index the whole site there. On the other hand, Wiki is not indexed, nor will it likely ever be directly, although it is indexed through the mirrors that aren't blocked. -- Sunir Shah
I wrote up a document about how to give the wiki pages some "fake" URLs that end in .html. This doc is at www.riceball.com
-- John Kawakami
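A minimal sketch of that idea, assuming the wiki runs as a CGI script under Apache with mod_rewrite (the rule, path, and script name below are hypothetical and not taken from John's document):

# Present /wiki/SomePage.html to crawlers, but serve it from the wiki CGI script
RewriteEngine On
RewriteRule ^/wiki/([A-Za-z0-9]+)\.html$ /cgi-bin/wiki.pl?$1 [PT,L]

Links emitted by the wiki would then use the .html form, so searchbots see ordinary-looking static URLs instead of query strings.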
Robots can be excluded from the edit pages by putting
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
in the HTML header of the Edit Page and the Edit Copy page. Google, like all properly-written robots, ignores such pages. See Wiki Spam for details.
Feb 2, 2005: But, using the above implementation, the spider must still access the Edit Page or Edit Copy links to find out they aren't supposed to be indexed. Now Google and blog software are implementing a rel="nofollow" attribute on links. This could be added to the Edit Page link on each page to give robots a heads-up not to even bother accessing the Edit Page. What is still unclear is whether rel="nofollow" really means "do not follow this link" to robots, or whether it means "follow this link, but don't give it any credit in your page-ranking scheme".
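For illustration, a minimal sketch of such an edit link (the URL format here is hypothetical, not any particular wiki's):

<a href="wiki.cgi?action=edit&amp;id=SomePage" rel="nofollow">Edit Page</a>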
---
May 02, 2005: Meta Tag Tutorial Available
www.rahsoftware.com
See original on c2.com