Thursday, February 22, 2007

Department of Redundancy Department


This e-mail came in to our news reporter. I think it's kind of dangerous to be tossing the word URL around like that. I bet it's one of the acronyms that the National Security Agency looks for in terrorist dragnets.

Thank you for your note. After some investigation, we've found that our
system cannot crawl your articles because of the format of their URLs.
Following the general technical guidelines below should help our crawler
find and index articles from your site correctly:

1. In order for our crawler to correctly gather articles, each page that
displays an article's full text needs to have a unique URL that doesn't
change. We can't include sites in Google News that display multiple
articles at the same URL.

2. The URL for each article must contain a unique number consisting of at
least three digits.

For example, our news crawler wouldn't crawl articles with the following
URLs:
www.google.com/news/article23.html
www.google.com/lemurs_in_the_mist.html

It would crawl these pages:
www.google.com/news/08112003/article.html
www.google.com/news/lemurs_in_the_mist/23467.html

3. Keep in mind that we are unable to include sites for which the URL of
the main page includes a date. URLs with dates in them often change on a
daily or weekly basis. This prevents us from crawling the site for new
content, as we're unable to detect the most current URL to be crawled.

For example, if a URL changes from /novembernews.html to
/decembernews.html, Google will continue to crawl the novembernews.html
page, and thus not find any new content.

An example of a site that we're able to crawl successfully is
http://english.chosun.com. Please note that each article on this site has
a unique and unchanging URL.

If you're able to make changes on your end that would allow us to crawl
your content, please let us know.

Regards,
The Google Team
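
For the curious, rule 2 is easy enough to test against your own URLs. What follows is a minimal sketch in Python, not Google's actual crawler logic: it assumes that "a unique number consisting of at least three digits" means any run of three or more consecutive digits in the URL path. The function name has_article_number is made up for illustration, and the sketch does not check that the number is unique across the site.

```python
import re
from urllib.parse import urlsplit

# Assumption: any run of three or more consecutive digits in the path counts
# as the "number consisting of at least three digits" from the e-mail.
_THREE_DIGITS = re.compile(r"\d{3,}")


def has_article_number(url: str) -> bool:
    """Return True if the URL path contains a run of three or more digits."""
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so prefix bare URLs such as "www.google.com/..." before splitting.
    if "//" not in url:
        url = "//" + url
    path = urlsplit(url).path
    return _THREE_DIGITS.search(path) is not None


if __name__ == "__main__":
    # The four example URLs quoted in the e-mail above.
    examples = [
        "www.google.com/news/article23.html",
        "www.google.com/lemurs_in_the_mist.html",
        "www.google.com/news/08112003/article.html",
        "www.google.com/news/lemurs_in_the_mist/23467.html",
    ]
    for url in examples:
        print(url, has_article_number(url))
```

Under that reading, the check rejects the first two example URLs from the e-mail and accepts the last two, which matches what the e-mail says the crawler would do.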
