roundtable: Re: New way to label information on the Internet(A Primer)
W. Curtiss Priest (BMSLIB@mitvma.mit.edu)
Thu, 21 Dec 95 09:18:32 EST
Message-Id: <9512211524.AA18910@a.cni.org>
Date: Thu, 21 Dec 95 09:18:32 EST
From: "W. Curtiss Priest" <BMSLIB@mitvma.mit.edu>
Subject: Re: New way to label information on the Internet(A Primer)
To: Telecommunications Policy Roundtable <ROUNDTABLE@CNI.ORG>
In-Reply-To: Message of Wed, 20 Dec 1995 19:28:13 -0500 (EST) from
Center for Information, Technology & Society
W. Curtiss Priest, Director
December 21, 1995
An Impromptu Primer on Information Search
Strategies
>re: web as source of information
The status of the web as a source of information is a complex one.
To understand what you can and can't ask of a webcrawler (and thus
the web) one must think in terms of what information on the Internet
reaches web pages (much ftp and gopher information does not) and what
propels information onto the web (mostly free of charge).
Let me illustrate with materials I have posted. Some of my posts I
take the time to send to eff.org, and Stanton kindly puts them under
his gopher, which is accessible from his web page. This means that
I can use the DEC webcrawler and actually find my papers on the
economic character of information or on LINCT, or my occasional newsletter
on "The Will to Create the Future."
Of the many lists I post to, the Media Forum appears to be the only
list that connects its archive to the web. So all of my posts that
have included the Media Forum can be found via the DEC webcrawler.
To what extent will my web page (via EFF) and my posts to the Media Forum
meet any one search for information on the net? Haphazardly so.
If you are interested in the economics of information as a commodity, you
will find my material and, perhaps, Hal Varian's material (provided he
has upgraded his gopher site on economics and information to be web
accessible).
Then there are the news services -- Infoseek, Newsbytes, Individual, etc. --
which post web accessible stories in various categories. For example,
Individual's www.newspage.com carries mostly computer and multimedia
related stories. A story is available for free only on the day it is
posted, so by the time it is inventoried by a webcrawler, it isn't
available for free anymore. (I don't know how the designers of
webcrawlers deal with transient pages such as these; after 10 days
these news stories disappear altogether from that page.
Yet while they are there, they are a remarkable source of information
for current events in computers and media.)
Give me an assignment to study an area and I will selectively use
web pages and web crawlers. For example, I am finishing a K-12 networking
study. I have pulled wonderful things from the ISTE (International
Society for Technology in Education) gopher site (not web accessible --
and keep in mind a browser set to gopher:// or ftp:// is no longer a
web browser).
I was given a number of sites to look at via email suggestion, and
found that the Smart Valley Inc. group in CA (http://www.svi.org) had a
great page that took me to a NASA K-12 page, which took me to an
informative Cisco Systems page, etc.
This was too haphazard for my liking, but it produced a number of important
items (pushing me to go get Ghostscript to read the PostScript .PS files
and to dig out the Acrobat reader for the .PDF files).
But I didn't start my literature search on the web! I started by
going to ERIC. This happens to be accessible via FirstSearch or Dialog.
(While ERIC digests and AskERIC are available on the net for free,
I haven't seen full boolean searching of ERIC except through a
database provider.)
On other topics I go to the Knowledge Index (about 30 databases available
through CompuServe for $24/hr) or to FirstSearch through a university
account.
I prefer Dialog to OCLC's FirstSearch. Dialog provides much better
searching with proximity. For most folks, though, Dialog is too complex,
and an online search professional -- there is one at most libraries --
will help select from over 600 databases and work out a search strategy.
Remember, the search capability of webcrawlers is based on WAIS
technology. This is a fuzzy, inference searching approach.
You type in a bunch of keywords, and it computes a score for each
item that contains some of them. If the words all occur close
together, it gets a great score.
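
As a rough illustration of the scoring just described, here is a minimal
Python sketch. It is not how WAIS or any particular webcrawler computes its
scores; the function names and the weighting (one point per matched keyword,
plus a bonus that grows as the matched words sit closer together) are
assumptions made up for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def smallest_window(words, terms):
    """Length of the shortest run of words that contains every term."""
    best = len(words)
    counts = {}
    left = 0
    for right, word in enumerate(words):
        if word in terms:
            counts[word] = counts.get(word, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(terms):
            best = min(best, right - left + 1)
            left_word = words[left]
            if left_word in counts:
                counts[left_word] -= 1
                if counts[left_word] == 0:
                    del counts[left_word]
            left += 1
    return best


def relevance_score(query, document):
    """Toy score: matched keywords count, plus a closeness bonus."""
    words = tokenize(document)
    matched = set(tokenize(query)) & set(words)
    if not matched:
        return 0.0
    score = float(len(matched))
    if len(matched) > 1:
        # Hypothetical bonus: tighter clusters of keywords score higher.
        score += len(matched) / smallest_window(words, matched)
    return score
```

With this toy scoring, a document containing only some of the keywords still
gets a score, and two documents containing the same keywords are ranked by
how tightly those words cluster. You get a ranking, not a precise answer.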
But to the information specialist, this is much too imprecise. You
can't pick selective databases (narrowing the search coherently) and
you can't do complex boolean searches that state more clearly
how close you want word y to be to word z before you want to see it.
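
By contrast, a boolean search with proximity either matches or it doesn't.
The sketch below shows the idea in Python. The `near` and `search` helpers
are hypothetical names invented for this illustration and do not reproduce
the command syntax of Dialog or any other service.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(words, y, z, max_gap):
    """True if y and z occur with at most max_gap words between them."""
    y_positions = [i for i, w in enumerate(words) if w == y]
    z_positions = [i for i, w in enumerate(words) if w == z]
    return any(abs(i - j) - 1 <= max_gap
               for i in y_positions for j in z_positions)


def search(records, y, z, max_gap, exclude=None):
    """Titles of records where y is near z, optionally NOT containing exclude."""
    hits = []
    for title, text in records:
        words = tokenize(text)
        if exclude is not None and exclude in words:
            continue  # boolean NOT
        if near(words, y, z, max_gap):
            hits.append(title)
    return hits


# Made-up records, for illustration only.
records = [
    ("A", "school networking costs are rising"),
    ("B", "the school board met and much later the topic of networking arose"),
    ("C", "school networking for gopher users"),
]
print(search(records, "school", "networking", max_gap=1))
print(search(records, "school", "networking", max_gap=1, exclude="gopher"))
```

The searcher, not the software, decides how close is close enough, and a
record is either in the result set or out of it.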
And doing a good search is iterative. You do a search strategy and
see what happens. You examine some output and realize that (in a
good database) -- the abstracting company assigned a critical
descriptor (as opposed to just a keyword) to your subject area.
You include that descriptor and BINGO, you have a much better
search.
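
The iteration described above can be sketched in a few lines of Python. The
records, titles, and descriptor names below are invented for illustration,
and the helper functions are hypothetical; a real abstracting service has its
own thesaurus and its own search commands.

```python
from collections import Counter

# Made-up records: free-text abstract plus assigned descriptors.
RECORDS = [
    {"title": "Wiring schools",
     "abstract": "networking costs for schools",
     "descriptors": ["Computer Networks", "Costs"]},
    {"title": "Teacher training online",
     "abstract": "networking among teachers",
     "descriptors": ["Computer Networks", "Teacher Education"]},
    {"title": "LAN planning guide",
     "abstract": "planning a local area installation",
     "descriptors": ["Computer Networks"]},
    {"title": "Career tips",
     "abstract": "networking at conferences",
     "descriptors": ["Careers"]},
]


def keyword_search(records, word):
    """First pass: plain keyword match against the abstract."""
    return [r for r in records if word in r["abstract"].lower().split()]


def top_descriptor(hits):
    """Examine the output: which descriptor shows up most often?"""
    counts = Counter(d for r in hits for d in r["descriptors"])
    return counts.most_common(1)[0][0]


def descriptor_search(records, descriptor):
    """Second pass: search on the assigned descriptor instead."""
    return [r for r in records if descriptor in r["descriptors"]]


first_pass = keyword_search(RECORDS, "networking")
descriptor = top_descriptor(first_pass)
second_pass = descriptor_search(RECORDS, descriptor)
print(descriptor, [r["title"] for r in second_pass])
```

In this made-up data the second pass drops the false hit on career
"networking" and picks up a relevant record that never used the keyword.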
So web sites clearly could use some classification. This is why
I posted the piece about PICS classification for web content selection.
Without classification it is like walking into library stacks where
books are randomly placed on the shelves.
Who is going to do the work of classification? Remember that
the way it works today, much of that work is done by many people
employed at each of the abstracting services for documents. You
may take for granted that you find useful stuff when you
go to Sociological Abstracts -- but that took a lot of work. And
when I search Soc Abstracts on Dialog, part of the sizeable
online search cost goes back to Soc Abstracts to pay for all this
work.
Let's look at the costs. Dialog, for a subset of its databases during off
hours (the Knowledge Index), is $24 an hour. I can do a pretty good search
for $3-4 because I'm fast and I have a stack of materials that tell me what
is in each database there. But the other 550 files are not available through
Knowledge Index, and they can cost $60-$80-$120-$200 per hour depending
on the rarity of the information. For example, I can search for
trademarks at around $150 an hour. Someone diligently records both
federal and state trademarks.
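
To put those rates in perspective, here is the arithmetic as a small Python
sketch. It assumes billing purely by connect time at the hourly rates quoted
above and ignores any other charges; the function names are made up for this
example.

```python
def search_cost(minutes_online, rate_per_hour):
    """Cost in dollars of a session billed purely by connect time."""
    return rate_per_hour * minutes_online / 60.0


def minutes_for_budget(budget, rate_per_hour):
    """How many minutes online a given budget buys at an hourly rate."""
    return 60.0 * budget / rate_per_hour


# A $3-4 search at $24/hr means roughly 7.5 to 10 minutes online.
print(minutes_for_budget(3, 24), minutes_for_budget(4, 24))

# The same 10 minutes at the $150/hr trademark rate.
print(search_cost(10, 150))
```

So a search that costs $4 on the $24/hr subset would run about $25 at the
$150/hr rate, assuming the same ten minutes online.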
What I don't see is a transition strategy from the $150/hr database
to something most of us can afford. If Dialog suddenly got 100 times
the volume, for trademarks, they could reduce their fees a lot.
The most likely way we are going to see this transition is via the
"new entrant" -- someone who sees the possibility for the greater
volume, and duplicates all the work, and puts it on the net -- typically
wanting a credit card to access it.
Witness what ProCD did with the yellow pages. NYNEX and others wanted
too much money, so ProCD shipped all the yellow pages to Taiwan and
had low paid workers retype them. As a result you can get the whole
country's yellow and white pages on CD-ROM for around $100 -- all six
650 meg disks.
In conclusion, web searches are only as good as:
1. The sources -- quite variable, all "free"
2. The tool -- fuzzy, inference searching
3. The links -- once you have found one useful site, there is
intelligent linking to others (to some extent)
Whatever the case, don't advise students or others to look to the Web as a
place to start. It is an "interesting supplement" to a well thought
out search for information.
_______________________________________________________________________________
| W. Curtiss Priest, Ph.D., Director *********************** |
| Center for Information, Technology, & Society * Improving humanity * |
| * through technology * |
| 466 Pleasant Street *********************** |
| Melrose, MA 02176-4522 BMSLIB@MITVMA.MIT.EDU |
| Voice: 617-662-4044 |
| Fax: 617-662-6882 *** Gopher or WWW to our publications: gopher.eff.org |
| (under Groups & Organizations Supporting the Online Community, CITS) |
| WWW: gopher://gopher.eff.org/hh/Groups/CITS |
|Policy & Systems Division, Educational Products Information Exchange (EPIE) |
| Over 16,000 K12 educational software programs catalogued on CD-ROM |
|Dean of Computer & Information Sciences, Athena Virtual Online University |
| Visit Athena(VOU):Telnet to athena.edu 8888 | http://www.athena.edu|
|A member of LINCT (Learning and Information Networks for Community |
| Telecomputing) |
_____________________________________________________________________________|