Archive of UserLand's first discussion group, started October 5, 1998.

Re: An RSS categorization proposal

Author: Jeremy Bowers
Posted: 9/9/1999; 9:37:07 PM
Topic: rss channels via email
Msg #: 10850 (In response to 10840)

The problem with those controlled vocabularies produced by library science is that it more or less takes library scientists to classify a given item accurately. I doubt anybody is really going to deny that such a thing is ideal, but personal experience says that theory won't fly in the real world all of the time.

(Approximate real-world conversation: (keeping in mind that with a web staff of 1.5, where I'm the .5, I'm a little bit of everything... site designer, content organizer, primary Frontier guy, web theorist...)

"Them": "Here's a document, complete with classifications, title, and the place we want it."

Me: (actually, I never say this, only think it) "Your document should be in three pieces, one of which should be integrated into the current document that sort of already does that, your title is garbage, and your place is all wrong; it doesn't deserve its own category, it belongs under X."

"Them": "Too bad, do it anyhow."

This more or less happened last week; I've only expanded... say... 25%. (I love statistics like that, don't you?) )

The Library of Congress (LOC) and Dewey Decimal (DDS) systems work great, but only when the sorters are intelligent (in this limited definition of the term :-) ). I think the systems will crash and burn when the general public has to do it. Considering how many years we are into the web revolution, with the majority of sites still ignoring some of the original Top Ten Bad Things On The Web from 4 years ago, I think it will be a long time before we can count on Web Librarian showing up in the Want Ads (I feel safe in saying it will be another web-year or two before this even attracts preliminary mainstream attention).

It seems to me that the problem of classification, assuming a sufficiently large (and, virtually synonymously, unorganized) population of RSS users, is equivalent to the problem of sorting static web pages via a search engine.

Forcing the DDS or the LOC system onto RSS collections would be roughly equivalent to the Yahoo method of operation. We need something smart enough to really look at the content and help the user classify.

I like the idea of simply letting the user add some categories, as they feel it should be classified. That achieves the "churn" Jon wants (and I agree is vital; libraries may move a lot of material, but it tends towards the static, updating at most once a week for certain journals, which tend to be in the same class anyhow). Then use 2002 or 2003 search engine technology to filter the information (I assume it will be that long until it's an issue; you know the problem with assumptions), treating the user-provided parameters as hints about how to classify it, and using mechanisms already being developed to verify that it is indeed relevant (as there will be liars).

One could even use the system iteratively; here's what my engine thinks of your content, please help it using the content-type tags. Keep trying until everybody agrees enough to call it quits.
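To make the hint-plus-verification loop concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the keyword sets stand in for whatever a real search engine would use to judge relevance, and the names `KEYWORDS`, `classify`, `negotiate`, and the two thresholds are made up. The idea is only that a user-supplied tag lowers the bar for a category but never removes it, and that the engine and the publisher go back and forth until they agree.

```python
# Hypothetical keyword sets; a real engine would use a far richer relevance model.
KEYWORDS = {
    "programming": {"code", "compiler", "scripting", "bug"},
    "cooking": {"recipe", "oven", "bake"},
    "sports": {"score", "team", "league"},
}

def content_score(text, category):
    """Fraction of the category's keywords that appear in the text."""
    words = set(text.lower().split())
    keywords = KEYWORDS[category]
    return len(words & keywords) / len(keywords)

def classify(text, user_tags, hinted=0.2, unhinted=0.5):
    """Accept a category only if the content supports it.

    A user tag lowers the threshold (the hint) but does not remove it,
    so a tag with no supporting content (a "liar") is still rejected.
    """
    accepted = set()
    for category in KEYWORDS:
        threshold = hinted if category in user_tags else unhinted
        if content_score(text, category) >= threshold:
            accepted.add(category)
    return accepted

def negotiate(text, user_tags, revise, max_rounds=5):
    """Show the engine's view to the publisher, let them revise their tags,
    and repeat until both sides agree or the round limit is reached."""
    tags = set(user_tags)
    for _ in range(max_rounds):
        engine_view = classify(text, tags)
        if engine_view == tags:
            break
        tags = revise(tags, engine_view)
    return tags, classify(text, tags)

# A publisher who tags honestly once and optimistically once, then
# drops whatever the engine could not verify.
final_tags, engine_view = negotiate(
    "a scripting bug in the compiler",
    {"programming", "sports"},
    revise=lambda tags, view: tags & view,
)
```

In this toy run the "sports" tag has no support in the text, so it is dropped after the first round and both sides settle on "programming". A less cooperative `revise` function would simply run out of rounds, which is where the humans come in.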

It won't be perfect... but there is a strong incentive to make sure your listing is correct (and hopefully, if the tools are around more or less from day one, more people will take advantage of them), and we can always use some humans to balance things out. So we should be able to hit accuracy in the high 90s (percent) with reasonable effort on any number of RSS channels, since we can use the research in search engines.

It also won't be easy, for obvious reasons.

This would produce some sort of computer classification system, probably expressed as "degree of relationship" to a large number of baselines (chosen by expert humans at first, and maybe eventually selected by computer). A more human classification system would then need to be projected onto it, but that's mostly sweat-of-the-brow work, not an insoluble problem.
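As one guess at what "degree of relationship" could look like, the Python sketch below scores a channel's text against each baseline with plain cosine similarity over word counts. This is a swapped-in, textbook technique, not something proposed in this thread, and the baseline names and sample texts are invented for the example.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical baselines: an expert human supplies sample text for each.
BASELINES = {
    "programming": "python code compiler software bug release scripting",
    "cooking": "recipe oven flour butter bake dinner kitchen",
}

def degrees_of_relationship(channel_text, baselines=BASELINES):
    """Return one similarity score per baseline for the given channel text."""
    doc = term_counts(channel_text)
    return {name: cosine(doc, term_counts(sample))
            for name, sample in baselines.items()}

scores = degrees_of_relationship("New scripting release fixes a compiler bug")
```

The result is a vector of scores rather than a single shelf number, which is what leaves room to project a more human classification on top of it afterwards.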



This page was archived on 6/13/2001; 4:52:32 PM.

© Copyright 1998-2001 UserLand Software, Inc.