Re: SETI Helping outsiders learn, while focussing ourselves.


David Woolley (david@djwhome.demon.co.uk)
Thu, 29 Jul 1999 08:27:28 +0100 (BST)


>
> The two mechanisms do seem redundant but the differences are subtle.
> According to the Robot Exclusion Protocol, see:

I've updated myself on the background and some of the reasons for
the META tag, but those documents also point out that many crawlers
are not aware of the new standard.
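
To make the two mechanisms concrete, here is a minimal sketch of how a well-behaved crawler might consult both of them, using Python's standard library. The names RobotsMetaParser, may_index, ExampleBot and the example.org URLs are my own illustrative assumptions, not taken from any real crawler. A crawler that predates the META tag would simply stop after the robots.txt check.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for word in attr.get("content", "").split(","):
                self.directives.add(word.strip().lower())


def may_index(robots_txt, page_html, agent, url):
    """Return True only if neither mechanism forbids indexing the page."""
    # Site-wide mechanism: the /robots.txt file.
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        return False
    # Per-page mechanism: the robots META tag inside the document itself.
    meta = RobotsMetaParser()
    meta.feed(page_html)
    return "noindex" not in meta.directives
```

The site-wide file needs access to the server root, while the META tag can be set by whoever writes the individual page, which is one of the subtle differences between them.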

> Interesting. I imagine the ability to parse HTML/Javascript varies spider
> to spider. But it seems to me that a spider that can't digest
> HTML/Javascript is inherently defective because Javascript code generally
> resides within HTML comment blocks and hence should be ignored by anything

They are only comment blocks if you are using a DTD that doesn't include
SCRIPT. For anything aware of the DTD they are significant text in which
only the specific closing tag </script> is recognized. Putting them in
comments is a guideline to allow for older browsers. Lynx does not handle
Javascript, but does parse script elements; the same may be true of wget.
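
As a rough illustration, the sketch below feeds a small made-up page to Python's html.parser, which knows about SCRIPT in roughly the way described above. The page, the LinkAudit class and the file names are assumptions for the example only. The "comment" inside the script element is delivered as script text, not as a comment, and the link written by document.write is never seen as an anchor.

```python
from html.parser import HTMLParser

# Hypothetical page: one plain link, one link only reachable via script.
PAGE = """<html><body>
<a href="real.html">A plain link</a>
<script type="text/javascript">
<!-- hide from old browsers
document.write('<a href="hidden.html">A scripted link</a>');
// -->
</script>
</body></html>"""


class LinkAudit(HTMLParser):
    """Records anchors, comments and raw script text separately."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.comments = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.script_text.append(data)


audit = LinkAudit()
audit.feed(PAGE)
audit.close()
script_source = "".join(audit.script_text)
```

After this runs, audit.links holds only the plain link, audit.comments is empty, and script_source contains the text mentioning hidden.html. A crawler would have to execute or at least pattern-match that text to find the second link.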

> but a Javascript compliant browser. It would behoove the spider developers
> to include Javascript capability for there are additional links to search
> for hidden within the Javascript code. No doubt the larger, deeper
> pocketed, search engines have accounted for this in their robots.

I suspect not. In the general case, it is an impossible problem, because
it is equivalent to the halting problem in computer science:
you cannot create an algorithm that will determine, for an arbitrary
algorithm presented to it, whether that algorithm will ever halt. Any
JS-interpreting crawler would therefore need to abandon the script at
an arbitrary point.
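
A toy sketch of what "abandon at an arbitrary point" might look like follows. Scripts are modelled as Python generators that yield once per step, either a discovered link or None; this model, the function run_with_budget and the three sample scripts are purely my assumptions for illustration, not how any real crawler works.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A "script" is modelled as a generator: each yield is one execution step,
# carrying either a discovered link (str) or None.
Script = Callable[[], Iterator[Optional[str]]]


def run_with_budget(script: Script, max_steps: int) -> Tuple[List[str], bool]:
    """Run at most max_steps steps; return (links found, finished?)."""
    found: List[str] = []
    steps = script()
    for _ in range(max_steps):
        try:
            item = next(steps)
        except StopIteration:
            return found, True   # the script was seen to halt
        if item is not None:
            found.append(item)
    return found, False          # budget exhausted; halting is unknown


def short_script():
    yield "menu1.html"
    yield None
    yield "menu2.html"


def endless_script():
    yield "first.html"
    while True:                  # never halts
        yield None


def late_script():
    for _ in range(500):         # halts, but only after a long delay
        yield None
    yield "late.html"
```

With a budget of 100 steps, endless_script and late_script both come back as unfinished, and the crawler has no way to tell them apart: raising the budget finds late.html but can never prove that endless_script will not eventually produce another link.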

In any case, they may well take the view, reasonable in my opinion, that
the more the script content, the less the useful content, and those
with heavy script content who really want to be crawled effectively will
use META/LINK elements (don't have the details to hand, but one of these
exists for this purpose) to reveal the structure that is hidden in the
scripting. (Again, I think there is some correlation between useful
content and understanding HTML, although cook book rules for improving
indexability do seem to abound.)

There is also the issue that the Javascript object models of even the
big 2 browsers are not tightly specified, and some pages use VBScript.

Fortunately I got Javascript bypasses put into the SETI League pages,
although I don't think they are well maintained (even so, I think they
may actually be more complete than the JS pulldowns!). The page would
have been smaller, while containing the same information, if it had
used fragment links to submenus on the same page.
>
> As for potential theft of service, this is certainly a possibility and of
> course it is always good form to check with the owners of the site before
> proceeding to use their links. As for this specific situation, with the SL
> using Altavista, it seems a safe bet that Altavista(Compaq) will allow the
> symbiosis since the more web page hits translate into more ad profits.

The Altavista Ts and Cs are not as clearly written as they might be, but
their definition of "commercial use" would require permission, and I think
that the SETI League is sufficiently borderline that it ought to get
that permission. Disclaimer is the relevant link, at the bottom of the
home page. (I suspect one of the things you get with the commercial service
is the ability to do site selective searches from clean HTML.)

Sorry for wandering so far off SETI, and yes, I know that "people asked
for the Javascript".



This archive was generated by hypermail 2.0b3 on Sun Aug 01 1999 - 16:28:47 PDT