Archive of UserLand's first discussion group, started October 5, 1998.

Re: I Require Permission

Author: Mark Nottingham
Posted: 9/2/1999; 5:50:57 PM
Topic: Automated deep linking
Msg #: 10479 (In response to 10452)
Prev/Next: 10478 / 10480

Another factor here is how the scraping is used/presented to the end user; once again, it's a matter of degrees.

I imagine almost all content owners would strongly object to their content being *repurposed*; i.e., scraped and then presented as someone else's. Many (if not most) would not want it scraped and linked to from a pay-for-access or heavily commercial site. I *think* relatively few would be averse to scraping for a completely personal use page.

Now, the interesting distinction here is where that personal page lives. If it's on my personal site, and it's just for me, aggregating all of the headline sites that are out there, is it different if this facility is provided to me by someone else? (I'm not sure what the answer is here.)

As has been pointed out, there are many more complications: the fact that automated access (spidering) is generally considered allowable unless specifically prohibited by robots.txt, and the concepts of browsing, public content, and fair use.
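As an illustration of the robots.txt convention mentioned above, Python's standard library can evaluate an exclusion file directly. This is a minimal sketch; the bot name, site, and rules are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is allowed except /private/.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved spider (here a made-up "HeadlineBot") checks before fetching.
print(rp.can_fetch("HeadlineBot", "http://example.com/headlines.html"))     # True
print(rp.can_fetch("HeadlineBot", "http://example.com/private/page.html"))  # False
```

In practice the parser would be pointed at the live site's /robots.txt rather than fed lines by hand, but the allow/disallow logic is the same.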

Another thing: legitimate (well, at least relatively legitimate) scraping captures *summaries*, not the content itself. The end user still has to go to the scraped site to get the actual content.
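A rough sketch of that kind of summary-only scraping, using Python's standard HTML parser to collect headline links rather than article bodies (the sample page markup is hypothetical):

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collects (link text, href) pairs -- summaries, not the articles."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.current_href = None
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.headlines.append((data.strip(), self.current_href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

# Hypothetical headline page.
page = '<ul><li><a href="/story1">Big News</a></li><li><a href="/story2">More News</a></li></ul>'
scraper = HeadlineScraper()
scraper.feed(page)
print(scraper.headlines)  # [('Big News', '/story1'), ('More News', '/story2')]
```

The scraper ends up with pointers back to the source site, so readers still have to visit it for the actual stories.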

I'd hate for scraping to be labeled as 'bad' top-to-bottom; there are some legitimate uses for it. In the long run, though, it should be replaced by the active creation of an RSS (or whatever) file by the content owner.
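A minimal sketch of what the content owner's side of that could look like: building a small RSS-style channel of headline summaries for publication. The site name, URLs, and headlines are hypothetical, and RSS 0.91 is just one possible flavour:

```python
import xml.etree.ElementTree as ET

# Hypothetical channel metadata for a made-up site.
rss = ET.Element("rss", version="0.91")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example News"
ET.SubElement(channel, "link").text = "http://news.example.com/"
ET.SubElement(channel, "description").text = "Headlines from Example News"

# Each item is a summary pointing back at the full story.
for headline, url in [
    ("First story", "http://news.example.com/1"),
    ("Second story", "http://news.example.com/2"),
]:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = headline
    ET.SubElement(item, "link").text = url

print(ET.tostring(rss, encoding="unicode"))
```

If the owner publishes a file like this, aggregators get clean summaries with the owner's blessing and nobody has to scrape at all.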

Unfortunately, RSS needs to be much more widely utilised for scraping to be unnecessary; as Dave points out, there are things we can do to shorten this process.

Two of the things that IMHO can help make it more ubiquitous are:

* Standardise on ONE aggregation XML format, so there's no confusion in the Webmaster's mind about what format to publish in. I still think that the format needs to be more open/public than Netscape has been with RSS development.

* Have a greater variety of non-commercial, non-specific clients for use of RSS, to get more people interested in it. If every intranet and special interest group has an aggregation engine, there will be more user demand for publication of RSS files.

In the meantime, content owners who explicitly don't want any scraping at all will probably have to state exactly that in their copyright notice. Scraping, after all, isn't an *entirely* automatic process; the scraper has to configure the software, and that means a human has to look at the site. The onus is on them, when they do, to understand the site's conditions of use.

I'm done rambling now.


This page was archived on 6/13/2001; 4:52:22 PM.

© Copyright 1998-2001 UserLand Software, Inc.