Re: GILS Report available


Carl Hage (carl@chage.com)
Date: Sun, 9 Aug 1998 14:47:28 -0800


Message-Id: <199808092230.RAA13635@rgate2.ricochet.net>
From: "Carl Hage" <carl@chage.com>
To: gils@cni.org
In-Reply-To: <199808090816.KAA10290@bagel.indexdata.dk>
References: <199808071958.OAA15272@rgate2.ricochet.net>

On 09 Aug 1998, Sebastian Hammer <quinn@indexdata.dk> wrote:
>
> On Fri, 7 Aug 1998, David Landsbergen <landsbergen.1@osu.edu> wrote:
> >
> > ...
> > the tools must be available to these agencies to allow them to
> > implement.
>
> I think the major challenge here is that the "tools" needed to manage and
> manipulate GILS records depend heavily on the information management
> systems already in place at the institution.

I agree, but the tools in place are often WordPerfect or MS-Word. If
an agency has something like MARC/Z39 software, it may be relatively
easy. Looking at the GILS record counts, a number like 14 for an
agency means an IM system probably isn't being used. We also don't see
existing data being converted (in US Federal GILS), e.g. the huge MOCAT
database or the US Government Manual listing agencies.

IM systems, in my opinion, are too much like isolated cathedrals run
by high priests rather than like a bazaar of interacting common-folk.
There's little sharing or exchange. Look at the US Federal IM, for
example. Within access.gpo.gov the content of MOCAT, browse, and
GILS@GPO are different. Even within GILS, there's no sharing of data.
You can search GILS records at GPO, but only those hosted by GPO, and
you need to follow links to other sites to search elsewhere. With disk
drives costing $50/GB, it's practical for each GILS site to have a
private copy of the entire US Federal IM holdings. Instead of building
another set of more grandiose cathedrals, we need to build a bazaar.

Instead of implementing a solution to the problem of distributed
searches, we should be addressing the problem of duplicating and
synchronizing metadata across parallel databases, each offering
different access methods and reuse of the data. This isn't
particularly difficult: Usenet/NNTP has been doing this on a massive
scale since before the Internet existed.
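
A minimal sketch of the synchronization idea, in Python. It assumes every
record carries a unique identifier and a last-modified date; the names
Record, record_id, modified and merge are hypothetical and not part of any
GILS specification. Each site applies the same rule (keep the newest copy
of each record), so replicas converge regardless of the order in which
updates arrive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str  # hypothetical unique identifier for the record
    modified: str   # ISO 8601 date of last change, e.g. "1998-08-09"
    title: str


def merge(local, incoming):
    """Return a new store holding the newest version of every record."""
    merged = dict(local)
    for rec in incoming:
        current = merged.get(rec.record_id)
        # ISO 8601 dates sort correctly when compared as strings.
        if current is None or rec.modified > current.modified:
            merged[rec.record_id] = rec
    return merged


# Illustrative data only.
site_a = {"epa-001": Record("epa-001", "1998-07-01", "Air quality data")}
feed = [
    Record("epa-001", "1998-08-01", "Air quality data (revised)"),
    Record("gpo-042", "1998-06-15", "Monthly Catalog"),
]
site_a = merge(site_a, feed)
```

Applying the same feed twice leaves the store unchanged, which is what
makes it safe to receive the same update from more than one peer.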

> Of course, it is possible to maintain the GILS records in a completely
> separate information management framework - divorced from the goings
> and comings of the organisation. However, that approach frequently
> leads to a database which is out-of-date, and/or treated as an inferior
> source of information.

A good approach is building parallel databases linked by data exchange
software so content is identical. One database might be a legacy
mainframe, another might be a simple database linked to HTML generation,
another might be derived from a web crawler, perhaps with tools to
annotate index terms.

The key is to change the focus of GILS standards and documentation
to cater to optimizing the exchange of data between systems. Yes,
information is there, and we need easy ways to perform the collection
and conversion.

A human/machine readable format, easy to type and easy to process with
software, is needed. XML is one such format, but other even simpler
equivalent ASCII formats are possible. In any case, SUTRS-like ASCII,
basic XML, XML-RDF, or other similar formats are trivially converted.
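
To illustrate how small such a conversion can be, here is a sketch in
Python that turns a "Name: value" text record into XML and back. The
layout (one field per line, indented lines continuing the previous field)
and the sample field names are assumptions made for the example, not a
format defined by GILS or SUTRS.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text, root_tag="Record"):
    """Convert 'Name: value' lines into child elements of one root element."""
    root = ET.Element(root_tag)
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and last is not None:
            # An indented line continues the previous field's value.
            last.text += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last = ET.SubElement(root, name.strip().replace(" ", "-"))
        last.text = value.strip()
    return root


def xml_to_text(root):
    """Convert the element tree back into 'Name: value' lines."""
    return "\n".join(f"{child.tag}: {child.text}" for child in root)


# Illustrative record only.
sample = """Title: Monthly Catalog of Government Publications
Originator: Government Printing Office
Abstract: Index of publications
  issued by federal agencies."""

record = text_to_xml(sample)
print(ET.tostring(record, encoding="unicode"))
print(xml_to_text(record))
```

The sketch assumes field names are valid XML element names once spaces
are replaced by hyphens; a real converter would need to check that.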

> Rather than producing software tools and/or strict guidelines, I would
> recommend that a more introductory text is produced, to supplement the
> standard, and focusing on local management issues.

More _text_ is needed in the standards -- a typical four-word definition
should be a page of detailed explanation. Instead of having an
incomplete vague spec augmented by lots of tutorials and side documents,
the source spec itself should include sufficient detail to minimize the
need for separate guidelines books.

Even more important is the need to represent the GILS standards as
machine readable data itself, e.g. as an XML file, from which printed
standards documents are synthesized, as well as feeding a variety of
tools. In my opinion XML as it exists today is of little significance
(yet another record format), but the emerging XML-Data and XML Schema
(XSC) standards are changing that. Unfortunately, both XML-Data and
XSC are woefully inadequate in the DTDs for documentation.
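
As a rough sketch of what "the standard as data" could mean, the Python
below keeps field definitions in one structure and derives both printable
documentation and a record checker from it. The field names, definitions
and required/optional flags are placeholders written for this example;
they are not taken from the GILS specification.

```python
# Hypothetical schema data; in practice this would be loaded from a file.
SCHEMA = [
    {"name": "Title", "required": True,
     "definition": "The name by which the resource is known."},
    {"name": "Originator", "required": True,
     "definition": "The organization responsible for the resource."},
    {"name": "Abstract", "required": False,
     "definition": "A narrative description of the resource."},
]


def render_docs(schema):
    """Synthesize human-readable documentation from the schema data."""
    lines = []
    for field in schema:
        status = "required" if field["required"] else "optional"
        lines.append(f"{field['name']} ({status})")
        lines.append(f"    {field['definition']}")
    return "\n".join(lines)


def validate(record, schema):
    """Check a record (dict of field name -> value) against the same data."""
    known = {f["name"] for f in schema}
    problems = [f"missing required field: {f['name']}"
                for f in schema if f["required"] and f["name"] not in record]
    problems += [f"unknown field: {name}" for name in record if name not in known]
    return problems


print(render_docs(SCHEMA))
print(validate({"Title": "US Government Manual", "Subject": "agencies"}, SCHEMA))
```

Because the documentation and the checker read the same definitions, they
cannot drift apart the way a paper spec and a separately written tool can.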

Developing tools is difficult because schema definition data needed
to process fields in a general manner is scattered across a variety
of paper documents. Some definitions needed for human interaction
are missing. We need to switch standards development methodology.
Incidentally, the same problems exist in other data exchange standards
like EDI, which is a horrible mess.

> I can heartily support this statement. In the Scandinavian countries,
> we have extensive experience using GILS to manage WWW indices. The
> experiences are completely positive, in no small part thanks to the
> effort made by this group to track the development of the Dublin Core
> data elements. An automated webcrawler scanning a website containing
> documents with embedded metadata (Dublin Core or otherwise) can
> produce a very nice GILS database with practically zero effort.

I think that's the right approach. The next step to address is
sharing and exchanging data outside a WWW site-- with people as well as
machines. In the Linux Software Map analogy, new entries are embedded
within an announcement message broadcast over the comp.os.linux.announce
newsgroup. Software scanning this newsgroup allows harvesting of LSM
metadata, keeping distributed databases in sync. Rather than relying on
web crawlers to discover a new or changed document, the push technology
of newsgroups allows distributed databases to be updated instantly upon
release. The gov.* newsgroups could be used by GILS as a method for
broadcasting human/machine readable document announcements with
embedded metadata. The document announcement (marketing blurb along
with pretty-formatted GILS content) could be cross-posted to
gov.topic.info.abstracts and corresponding gov.topic.* subject specific
group (or a gov.dk.* etc group). People accessing these groups could
read about documents and access them with newsreader hot-links, and
could use a variety of third party software, e.g. AltaVista or
DejaNews. GILS-aware infosystems with software scanning the abstracts
groups could easily extract and update a local mirrored metadata
database. (100,000 sites can be updated within about 10 seconds using
NNTP.) The same distribution system could be used to update and
maintain the integrity of redundant distributed electronic archives.
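
The harvesting step could look roughly like the Python sketch below. It
assumes the metadata is embedded in the announcement body between two
marker lines; the markers "Begin-GILS" and "End-GILS", the field names,
and the example URL are invented for illustration and are not an existing
convention. Fetching articles from a news server is left out.

```python
import re

# Hypothetical markers delimiting the embedded metadata block.
BLOCK = re.compile(r"^Begin-GILS\n(.*?)^End-GILS$", re.DOTALL | re.MULTILINE)


def harvest(article_body):
    """Return one {field: value} dict per metadata block in the article."""
    records = []
    for match in BLOCK.finditer(article_body):
        record = {}
        for line in match.group(1).splitlines():
            name, sep, value = line.partition(":")
            if sep:  # skip lines that are not "Name: value"
                record[name.strip()] = value.strip()
        records.append(record)
    return records


def update_mirror(mirror, article_body, key="Control-Identifier"):
    """Insert or replace harvested records in a local mirror keyed by id."""
    for record in harvest(article_body):
        if key in record:
            mirror[record[key]] = record
    return mirror


# Illustrative announcement only.
article = """\
A new report on regional air quality is now available.

Begin-GILS
Control-Identifier: epa-001
Title: Regional Air Quality Report
Available-Linkage: http://www.example.gov/air.html
End-GILS
"""
mirror = update_mirror({}, article)
```

The human-readable blurb and the machine-readable block travel in the
same message, so people and software both see every announcement.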

> I agree that in GILS-related projects, the CV-part is often forgotten
> in the effort to get the basic system operational. Now, GILS is used
> in many different settings (outside of the US government as well), and
> LCSH is by no means appropriate for all.

Yes, I mentioned LCSH as an example of an existing dataset suitable
as a very general purpose starting point (capable of indexing almost
anything), and as an example where copyright and access hamper the
ability to use it. Other more specialized CVs need to be developed and
made accessible, e.g. a thesaurus for the Environment, Research Grants,
etc. Some, like the one for NSF research grants, are based on copyrighted
private thesauri. Many other useful CVs exist, but mostly as paper
documents formatted in an ad hoc manner with no cross-linkage to other CVs.

What are badly needed are standards and an infrastructure to develop and
maintain network-accessible CVs. Instead of having a vague reference
to an external paper standard, each GILS field containing CV content
(more than index terms) should have live linkage to a network-accessible
dataset defining the CV. In XML, for example, the GILS schema definition
or CV index term should have an XML-Link to another XML database
containing the CV terms and definitions. I believe a public GILS record
should be defined as noncompliant if any field has content based on a
CV which does not have a freely copyable network-accessible definition
dataset.
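
One way to picture that compliance rule is the Python sketch below. The
registry layout, the CV names, the "freely_copyable" flag and the record
shape are all hypothetical; the sketch only checks that a definition URL
is declared, not that the dataset can actually be retrieved.

```python
from urllib.parse import urlparse

# Hypothetical registry: CV name -> where its definition dataset lives.
CV_REGISTRY = {
    "example-env-thesaurus": {
        "url": "http://www.example.gov/cv/env.xml",
        "freely_copyable": True,
    },
    "example-private-thesaurus": {"url": None, "freely_copyable": False},
}


def noncompliant_fields(record, registry):
    """Return the fields whose CV lacks an open, network-accessible definition.

    record maps field name -> (cv_name, term).
    """
    bad = []
    for field, (cv_name, _term) in record.items():
        entry = registry.get(cv_name)
        has_url = bool(entry and entry["url"]
                       and urlparse(entry["url"]).scheme in ("http", "https", "ftp"))
        if not (has_url and entry["freely_copyable"]):
            bad.append(field)
    return bad


# Illustrative record only.
record = {
    "Subject": ("example-env-thesaurus", "air pollution"),
    "Category": ("example-private-thesaurus", "grants"),
    "Place": ("unregistered-cv", "Sunnyvale"),
}
problems = noncompliant_fields(record, CV_REGISTRY)
```

Under the proposed rule, a record would be compliant only when this list
is empty.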

In my experience with exchanging and using data, the record format (the
focus of most standards) is almost irrelevant. The major problems are
1) the semantics of the field are undefined or ambiguous, and 2) the
definitions of external standards and CV values are inaccessible or
undefined. Errors and inconsistencies related to common CV terms are
routine. I've worked with databases where the maintainers have no
idea what fields mean. It is common to have CV fields where no complete
definition exists. (For example, US weather bulletins are identified
by product code, but every existing database listing product codes
has errors -- the codes used don't match definitions.) The FGDC/SDTS
standards are invalid because a CV field definition gives the postal
mailing address of NIST for a withdrawn standard (definition of
state-plane coordinate systems).

When I visualize the Information Infrastructure, I think of something
like a world-wide web of cross-linked XML datasets, containing
foundation information like CV terms needed to define and correlate
databases... Building foundation definition databases just to make all
external GILS CV definitions net-accessible will be a major undertaking.
Take the names of cities in the US for example, where NIST, Census,
USPS, and USGS databases mismatch with each other (haven't looked at
LOC and others). USPS copyrights changes to postal zip codes, which
invalidate databases referencing these codes. You download more data
in graphics on their web site just to find out you have to buy zip code
definitions. Any reference to an ISO standard creates a major problem:
these are copyrighted paper-only standards with restricted access,
because funding is based on sales of the definitions.

You mentioned the problem of having one database get out of date with
another. This really isn't the problem I've found, since it's easy
to automatically copy and sync a database with today's networking
technology. The real problem is that definition data like CVs change,
invalidating databases. For example, the USPS changes postal zip
codes with no public notice, and local county governments that change
roads don't update the GIS databases documenting those roads.
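
If the definitions were published as data, finding the records that a
change invalidates would be mechanical. A sketch in Python follows; the
codes, record ids and the "Postal-Code" field name are placeholders, not
real values.

```python
def cv_changes(old_terms, new_terms):
    """Compare two published versions of a controlled vocabulary."""
    old, new = set(old_terms), set(new_terms)
    return {"removed": old - new, "added": new - old}


def stale_records(records, field, removed_terms):
    """Return ids of records whose field uses a term that no longer exists."""
    return [rec_id for rec_id, rec in records.items()
            if rec.get(field) in removed_terms]


# Illustrative data only.
old_codes = {"Z-001", "Z-002", "Z-003"}
new_codes = {"Z-001", "Z-002", "Z-004"}
records = {
    "rec-1": {"Postal-Code": "Z-001"},
    "rec-2": {"Postal-Code": "Z-003"},
}

changes = cv_changes(old_codes, new_codes)
affected = stale_records(records, "Postal-Code", changes["removed"])
```

Without a public, versioned copy of the vocabulary there is nothing to
diff against, which is the underlying problem described above.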

Our ability to manipulate information in this new Internet era will
be crippled until we address the problem of maintaining definitions
of public data as accessible public data. (Or for that matter,
definitions of public physical infrastructure (like a road) as public
data.) Tools these days don't just need to access data; they need
to access metadata and definitions as well, and support human-machine
interaction, not just mainframe-mainframe interaction.

--------------------------------------------------------------------------
Carl Hage C. Hage Associates
<mailto:carl@chage.com> Voice/Fax: 1-408-244-8410 1180 Reed Ave #51
<http://www.chage.com/chage/> Sunnyvale, CA 94086


