9. Appendices
This chapter contains some references I consider useful.
9.1 Robots                      Wget as a WWW robot.
9.2 Security Considerations     Security with Wget.
9.3 Contributors                People who helped.
9.1 Robots
It is extremely easy to make Wget wander aimlessly around a web site, sucking up all the available data as it goes. `wget -r site', and you're set. Great? Not for the server admin.
While Wget is retrieving static pages, there's not much of a problem. But for Wget, there is no real difference between the smallest static page and the hardest, most demanding CGI or dynamic page. For instance, a site I know has a section handled by an, uh, bitchin' CGI script that converts all the Info files to HTML. The script can and does bring the machine to its knees without providing anything useful to the downloader.
For cases like this, various robot exclusion schemes have been devised as a means for server administrators and document authors to protect chosen portions of their sites from the wandering of robots.
The more popular mechanism is the Robots Exclusion Standard (RES), written by Martijn Koster et al. in 1994. It works by placing a file named `/robots.txt' in the server root, which the robots are supposed to download and parse for directives telling them which parts of the site to avoid. Wget supports this specification.
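For illustration, a minimal `/robots.txt' could look like this; the paths and the Wget-specific record are made up for the example:

     # Keep all robots out of the CGI area.
     User-agent: *
     Disallow: /cgi-bin/

     # Keep Wget out of the mirror area.
     User-agent: Wget
     Disallow: /mirror/

Each record names the robots it applies to with `User-agent' and lists the URL path prefixes they should avoid with `Disallow'; an empty `Disallow' line means that robot may fetch everything.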
Norobots support is turned on only when retrieving recursively, and never for the first page. Thus, you may issue:
     wget -r http://fly.srk.fer.hr/
First, the index of fly.srk.fer.hr will be downloaded. If Wget finds anything worth downloading on the same host, only then will it load `/robots.txt' and decide whether or not to follow the links after all. `/robots.txt' is loaded only once per host.
Note that the exclusion standard discussed here has undergone some revisions. However, Wget supports only the first version of RES, the one written by Martijn Koster in 1994, available at http://info.webcrawler.com/mak/projects/robots/norobots.html. A later version exists in the form of an Internet draft <draft-koster-robots-00.txt> titled "A Method for Web Robots Control", which expired on June 4, 1997. I am not aware whether it ever made it to an RFC. The text of the draft is available at http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html. Wget does not yet support the new directives specified by this draft, but we plan to add them.
This manual no longer includes the text of the old standard.
The second, less known mechanism enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:
     <meta name="robots" content="nofollow">
This is explained in some detail at http://info.webcrawler.com/mak/projects/robots/meta-user.html. Wget supports this method of robot exclusion in addition to the usual `/robots.txt' exclusion.
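The directives defined in that document can be combined; for instance, a page that asks robots neither to index it nor to follow its links would carry:

     <meta name="robots" content="noindex,nofollow">

Since Wget is not an indexing robot, only the `nofollow' part is relevant to it.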
9.2 Security Considerations
When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.
1. The passwords on the command line are visible using `ps'. If this is a problem, avoid putting passwords on the command line--e.g. you can use `.netrc' for this (see the example after this list).

2. Using the insecure basic authentication scheme, unencrypted passwords are transmitted through the network routers and gateways.

3. The FTP passwords are also in no way encrypted. There is no good solution for this at the moment.

4. Although the "normal" output of Wget tries to hide the passwords, debugging logs show them, in all forms. This problem is avoided by being careful when you send debug logs (yes, even when you send them to me).
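To expand on the first point, a `.netrc' entry for a hypothetical FTP server could look like this (the host name, user name and password are made up):

     machine ftp.example.com
     login myuser
     password mysecret

With such an entry in `~/.netrc', `wget ftp://ftp.example.com/somefile' picks up the credentials without them ever appearing on the command line or in `ps' output. The password is still stored in plain text, so keep the file readable only by yourself, e.g. with `chmod 600 ~/.netrc'.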
9.3 Contributors
GNU Wget was written by Hrvoje Niksic <hniksic@arsdigita.com>. However, its development could never have gone as far as it has, were it not for the help of many people, either with bug reports, feature proposals, patches, or letters saying "Thanks!".
Special thanks goes to the people (in no particular order) whose contributions include, among other things, on-the-fly ansi2knr-ization, lots of portability fixes, and Digest authentication support.
The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:
Tim Adam, Adrian Aichner, Martin Baehr, Dieter Baron, Roger Beeman and the Gurus at Cisco, Dan Berger, Paul Bludov, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Tim Charron, Noel Cragg, Kristijan Conkas, John Daily, Andrew Davison, Andrew Deryabin, Ulrich Drepper, Marc Duponcheel, Damir Dzeko, Aleksandar Erkalovic, Andy Eskilsson, Christian Fraenkel, Masashi Fujita, Howard Gayle, Marcel Gerrits, Hans Grobler, Mathieu Guillaume, Dan Harkless, Heiko Herold, Karl Heuer, HIROSE Masaaki, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Jonas Jensen, Simon Josefsson, Mario Juric, Hack Kampbjorn, Const Kaplinsky, Goran Kezunovic, Robert Kleine, KOJIMA Haime, Fila Kolodny, Alexander Kourakos, Martin Kraemer, Simos KSenitellis, Hrvoje Lacko, Daniel S. Lewart, Nicolas Lichtmeier, Dave Love, Alexander V. Lukyanov, Jordan Mendelson, Lin Zhe Min, Tim Mooney, Simon Munton, Charlie Negyesi, R. K. Owen, Andrew Pollock, Steve Pothier, Jan Prikryl, Marin Purgar, Csaba Raduly, Keith Refson, Tyler Riddle, Tobias Ringstrom, Juan Jose Rodrigues, Edward J. Sabol, Heinz Salzmann, Robert Schmidt, Andreas Schwab, Toomas Soome, Tage Stabell-Kulo, Sven Sternberger, Markus Strasser, John Summerfield, Szakacsits Szabolcs, Mike Thomas, Philipp Thomas, Russell Vincent, Charles G Waldman, Douglas E. Wegscheid, Jasmin Zainul, Bojan Zdrnja, Kristijan Zimmer.
Apologies to all whom I accidentally left out, and many thanks to all the subscribers of the Wget mailing list.