Re: GILS Report available


Subject: Re: GILS Report available
Dan Schneider (schneidd@usdoj.gov)
Date: Mon, 10 Aug 1998 09:54:06 -0400 (EDT)


To: Multiple recipients of list <gils@cni.org>
In-Reply-To: <199808092230.RAA13635@rgate2.ricochet.net>
Message-Id: <Pine.NEB.3.95.980810092656.4271B-100000@sleepy.usdoj.gov>

On Sun, 9 Aug 1998, Carl Hage <carl@chage.com> wrote:
>
> On 09 Aug 1998, Sebastian Hammer <quinn@indexdata.dk> wrote:
> >
> > On Fri, 7 Aug 1998, David Landsbergen <landsbergen.1@osu.edu> wrote:
> > >
> > > ...
> > > the tools must be available to these agencies to allow them to
> > > implement.
> >
> > I think the major challenge here is that the "tools" needed to manage
> > and manipulate GILS records depend heavily on the information
> > management systems already in place at the institution.
>
> I agree, but the tools in place are often WordPerfect or MS-Word. If
> an agency has something like MARC/Z39 software, it may be relatively
> easy. Looking at the GILS record counts, a number like 14 for an
> agency means an IM system probably isn't being used. We also don't see
> existing data being converted (in US Federal GILS), e.g. the huge MOCAT
> database or the US Government Manual listing agencies.
>
> IM systems, in my opinion, are too much like isolated cathedrals run
> by high priests rather than like a bazaar of interacting common-folk.
> There's little sharing or exchange. Look at the US Federal IM, for
> example. Within access.gpo.gov the content of MOCAT, browse, and
> GILS@GPO are different. Even within GILS, there's no sharing of data.
> You can search GILS records at GPO, but only those hosted by GPO, and
> you need to follow links to other sites to search elsewhere. With disk
> drives costing $50/GB, it's practical for each GILS site to have a
> private copy of the entire US Federal IM holdings. Instead of building
> another set of more grandiose cathedrals, we need to build a bazaar.
>
> Instead of implementing a solution to the problem of distributed
> searches, we should be addressing the problem of duplicating and
> synchronizing metadata across parallel databases, each offering
> different access methods and reuse of the data. This isn't
> particularly difficult-- Usenet/NNTP has been doing this on a massive
> scale since before the Internet existed.
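
The duplicate-and-synchronize idea in the quoted paragraph above can be
sketched in a few lines. This is a minimal illustration of flooding
records between peers and dropping duplicates by identifier, loosely in
the spirit of Usenet distribution; it is not NNTP itself, and the Site
class, the site names, and the record ID are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One metadata database that keeps a full local copy of all records."""
    name: str
    records: dict = field(default_factory=dict)   # record_id -> record text
    peers: list = field(default_factory=list)     # neighbouring Site objects

    def offer(self, record_id: str, text: str) -> None:
        """Accept a record unless already held, then pass it to every peer."""
        if record_id in self.records:
            return                      # duplicate: stop the flood here
        self.records[record_id] = text
        for peer in self.peers:
            peer.offer(record_id, text)


def link(a: Site, b: Site) -> None:
    """Make two sites exchange records in both directions."""
    a.peers.append(b)
    b.peers.append(a)


# Hypothetical sites wired in a cycle, to show that duplicates are dropped.
gpo, doj, epa = Site("gpo"), Site("doj"), Site("epa")
link(gpo, doj)
link(doj, epa)
link(epa, gpo)

epa.offer("<rec-1@epa.example>", "Title: Example record")
```

After the single offer() call every site holds the same record, and the
identifier check is what keeps the cycle from looping forever.
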
>
> > Of course, it is possible to maintain the GILS records in a completely
> > separate information management framework - divorced from the goings
> > and comings of the organisation. However, that approach frequently
> > leads to a database which is out-of-date, and/or treated as an inferior
> > source of information.
>
> A good approach is building parallel databases linked by data exchange
> software so content is identical. One database might be a legacy
> mainframe, another might be a simple database linked to HTML generation,
> another might be derived from a web crawler, perhaps with tools to
> annotate index terms.
>
> The key is to change the focus of GILS standards and documentation
> to cater to optimizing the exchange of data between systems. Yes,
> information is there, and we need easy ways to perform the collection
> and conversion.
>
> A human/machine readable format, easy to type and easy to process with
> software is needed. XML is one such format, but other even simpler
> equivalent ASCII formats are possible. In any case, SUTRS-like ASCII,
> basic XML, XML-RDF, or other similar formats are trivially converted.
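
To make the "trivially converted" point in the quoted paragraph above
concrete, here is a minimal sketch that turns flat "Name: value" lines
into XML and back using Python's standard library. The field names are
illustrative only, the record is assumed to be flat, and each field name
is assumed to be a valid XML element name.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text: str) -> ET.Element:
    """Turn 'Name: value' lines into <record><Name>value</Name>...</record>."""
    record = ET.Element("record")
    for line in text.splitlines():
        if not line.strip():
            continue                    # skip blank lines
        name, _, value = line.partition(":")
        ET.SubElement(record, name.strip()).text = value.strip()
    return record


def xml_to_text(record: ET.Element) -> str:
    """Inverse of text_to_xml for flat records."""
    return "\n".join(f"{child.tag}: {child.text or ''}" for child in record)


# Illustrative record; the values are made up.
sample = "Title: Example Report\nOriginator: Example Agency\nLanguage: en"
element = text_to_xml(sample)
xml_string = ET.tostring(element, encoding="unicode")
round_trip = xml_to_text(ET.fromstring(xml_string))
```

For this flat sample the round trip reproduces the original text, which
is the sense in which the formats are interchangeable. Repeated or
nested fields would need a convention this sketch does not define.
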
>
> > Rather than producing software tools and/or strict guidelines, I would
> > recommend that a more introductory text is produced, to supplement the
> > standard, and focusing on local management issues.
>
> More _text_ is needed in the standards -- a typical four-word definition
> should be a page of detailed explanation. Instead of having an
> incomplete vague spec augmented by lots of tutorials and side documents,
> the source spec itself should include sufficient detail to minimize the
> need for separate guidelines books.
>
> Even more important is the need to represent the GILS standards as
> machine readable data itself, e.g. as an XML file, from which printed
> standards documents are synthesized, as well as feeding a variety of
> tools. In my opinion XML as it exists today is of little significance
> (yet another record format), but the emerging XML-Data and XML Schema
> (XSC) standards are changing that. Unfortunately, both XML-Data and
> XSC are woefully inadequate in the DTDs for documentation.
>
> Developing tools is difficult because schema definition data needed
> to process fields in a general manner is scattered across a variety
> of paper documents. Some definitions needed for human interaction
> are missing. We need to switch standards development methodology.
> Incidentally, the same problems exist in other data exchange standards
> like EDI-- a horrible mess.
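
The two quoted paragraphs above argue for keeping the standard itself as
machine-readable data that feeds both printed documents and tools. A
minimal sketch of that idea follows. The SCHEMA contents are
placeholders, not the official GILS element definitions, and which
elements are marked mandatory here is an assumption for illustration.

```python
# One data structure drives both the documentation and the validator.
SCHEMA = [
    {"name": "Title", "required": True,
     "definition": "Placeholder text standing in for a full-page definition."},
    {"name": "Originator", "required": True,
     "definition": "Placeholder text standing in for a full-page definition."},
    {"name": "Abstract", "required": False,
     "definition": "Placeholder text standing in for a full-page definition."},
]


def render_docs(schema) -> str:
    """Synthesize the printed form of the standard from the schema data."""
    parts = []
    for element in schema:
        status = "mandatory" if element["required"] else "optional"
        parts.append(f"{element['name']} ({status})\n    {element['definition']}")
    return "\n\n".join(parts)


def validate(record: dict, schema) -> list:
    """Report missing mandatory fields and fields the schema does not define."""
    known = {element["name"] for element in schema}
    problems = [f"missing: {e['name']}" for e in schema
                if e["required"] and e["name"] not in record]
    problems += [f"undefined: {name}" for name in record if name not in known]
    return problems


docs = render_docs(SCHEMA)
problems = validate({"Title": "Example Report", "Colour": "blue"}, SCHEMA)
```

Because both functions read the same SCHEMA, a change to a definition
shows up in the printed text and in the tool behaviour at once, which is
the point of the proposed change in methodology.
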
>
> > I can heartily support this statement. In the Scandinavian countries,
> > we have extensive experience using GILS to manage WWW indices. The
> > experiences are completely positive, in no small part thanks to the
> > effort made by this group to track the development of the Dublin Core
> > data elements. An automated webcrawler scanning a website containing
> > documents with embedded metadata (Dublin Core or otherwise) can
> > produce a very nice GILS database with practically zero effort.
>
> I think that's the right approach. The next step to address is
> sharing and exchanging data outside a WWW site-- with people as well as
> machines. In the Linux Software Map analogy, new entries are embedded
> within an announcement message broadcast over the comp.os.linux.announce
> newsgroup. Software scanning this newsgroup allows harvesting of LSM
> metadata, keeping distributed databases in sync. Rather than relying on
> web crawlers to discover a new or changed document, the push technology
> of newsgroups allows distributed databases to be updated instantly upon
> release. The gov.* newsgroups could be used by GILS as a method for
> broadcasting human/machine readable document announcements with
> embedded metadata. The document announcement (marketing blurb along
> with pretty-formatted GILS content) could be cross-posted to
> gov.topic.info.abstracts and corresponding gov.topic.* subject specific
> group (or a gov.dk.* etc group). People accessing these groups could
> read about documents and access them with newsreader hot-links, and
> could use a variety of third party software, e.g. AltaVista or
> DejaNews. GILS-aware infosystems with software scanning the abstracts
> groups could easily extract and update a local mirrored metadata
> database. (100,000 sites can be updated within about 10 seconds using
> NNTP.) The same distribution system could be used to update and
> maintain the integrity of redundant distributed electronic archives.
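
The harvesting step described in the quoted paragraph above might look
roughly like this. It is a sketch only: the Begin-GILS and End-GILS
markers are hypothetical (the Linux Software Map uses its own
delimiters), the article is a made-up example, and a real harvester
would read articles from a news server rather than from a string.

```python
from email import message_from_string

BEGIN, END = "Begin-GILS", "End-GILS"   # hypothetical markers, not a standard


def harvest(article: str):
    """Return (message_id, fields) for an announcement, or None if no block."""
    msg = message_from_string(article)
    lines = msg.get_payload().splitlines()
    if BEGIN not in lines or END not in lines:
        return None
    block = lines[lines.index(BEGIN) + 1 : lines.index(END)]
    fields = {}
    for line in block:
        name, sep, value = line.partition(":")
        if sep:                          # ignore lines without a colon
            fields[name.strip()] = value.strip()
    return msg["Message-ID"], fields


# Made-up announcement: human-readable blurb followed by embedded metadata.
article = """\
Message-ID: <1@agency.example>
Newsgroups: gov.topic.info.abstracts
Subject: New report available

A short human-readable announcement goes here.

Begin-GILS
Title: Example Report
Originator: Example Agency
End-GILS
"""

local_db = {}
result = harvest(article)
if result is not None:
    message_id, fields = result
    local_db[message_id] = fields        # keyed by Message-ID, so reposts overwrite
```

The same article stays readable in an ordinary newsreader, while a
scanning program can lift the delimited block into a local mirror.
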
>
> > I agree that in GILS-related projects, the CV-part is often forgotten
> > in the effort to get the basic system operational. Now, GILS is used
> > in many different settings (outside of the US government as well), and
> > LCSH is by no means appropriate for all.
>
> Yes, I mentioned LCSH as an example of an existing dataset suitable
> as a very general purpose starting point (capable of indexing almost
> anything), and as an example where copyright and access hamper the
> ability to use it. Other more specialized CVs need to be developed and
> made accessible, e.g. a thesaurus for the Environment, Research Grants,
> etc. Some, like NSF research grants are based on copyrighted private
> thesauri. Many other useful CVs exist, but mostly as paper documents
> formatted in an ad hoc manner with no cross-linkage to other CVs.
>
> What is badly needed is standards and an infrastructure to develop and
> maintain network-accessible CVs. Instead of having a vague reference
> to an external paper standard, each GILS field containing CV content
> (more than index terms) should have live linkage to a network-accessible
> dataset defining the CV. In XML, for example, the GILS schema definition
> or CV index term should have an XML-Link to another XML database
> containing the CV terms and definitions. I believe a public GILS record
> should be defined as noncompliant if any field has content based on a
> CV which does not have a freely copyable network-accessible definition
> dataset.
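
The compliance rule proposed in the quoted paragraph above can be
expressed as a simple check. This is a sketch under stated assumptions:
the CVField class, its attribute names, and the example URL are
hypothetical, and the check only looks at what a record declares; it
does not fetch anything from the network.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CVField:
    """A record field whose value is drawn from a controlled vocabulary."""
    name: str
    value: str
    definition_url: Optional[str] = None   # where the CV dataset can be fetched
    freely_copyable: bool = False          # whether that dataset may be copied


def noncompliant_fields(record) -> list:
    """Names of CV fields lacking a freely copyable, linked definition."""
    return [f.name for f in record
            if not (f.definition_url and f.freely_copyable)]


# Made-up record: one field links to an open definition, two do not.
record = [
    CVField("Subject", "Example term",
            definition_url="http://cv.example/subjects.xml",
            freely_copyable=True),
    CVField("Place", "Example city"),                      # no definition at all
    CVField("Category", "Example code",
            definition_url="http://cv.example/codes.xml"), # linked but restricted
]

bad = noncompliant_fields(record)
compliant = not bad
```

Under the proposed rule this record would be rejected, and the list in
bad says which fields need an open definition before it can pass.
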
>
> In my experience with exchanging and using data, the record format (the
> focus of most standards) is almost irrelevant. The major problems are
> 1) the semantics of the field is undefined or ambiguous, and 2) the
> definitions of external standards and CV values are inaccessible or
> undefined. Errors and inconsistencies related to common CV terms are
> routine. I've worked with databases where the maintainers have no
> idea what fields mean. It is common to have CV fields where no complete
> definition exists. (For example, US weather bulletins are identified
> by product code, but every existing database listing product codes
> has errors -- the codes used don't match definitions.) The FGDC/STDS
> standards are invalid because a CV field definition gives the postal
> mailing address of NIST for a withdrawn standard (definition of
> state-plane coordinate systems).
>
> When I visualize the Information Infrastructure, I think of something
> like a world-wide web of cross-linked XML datasets, containing
> foundation information like CV terms needed to define and correlate
> databases... Building foundation definition databases just to make all
> external GILS CV definitions net-accessible will be a major undertaking.
> Take the names of cities in the US for example, where NIST, Census,
> USPS, and USGS databases mismatch with each other (haven't looked at
> LOC and others). USPS copyrights changes to postal zip codes, which
> invalidate databases referencing these codes. You download more data
> in graphics on their web site just to find out you have to buy zip code
> definitions. Any reference to an ISO standard creates a major problem,
> since these are copyrighted paper-only standards with restricted access,
> since funding is based on sales of the definitions.
>
> You mentioned the problem of having one database get out of date with
> another. This really isn't the problem I've found, since it's easy
> to automatically copy and sync a database with today's networking
> technology. The real problem is that definition data like CVs change,
> invalidating databases. For example, the USPS changes postal zip
> codes with no public notice, and local county governments changing
> roads don't change the GIS databases documenting those roads.
>
> Our ability to manipulate information in this new Internet-era will
> be crippled until we address the problem of maintaining definitions
> of public data as accessible public data. (Or for that matter,
> definitions of public physical infrastructure (like a road) as public
> data.) Tools these days don't just need to access data; they need
> to access metadata and definitions as well, and support human-machine
> interaction, not just mainframe-mainframe interaction.

This has been a splendid dialog, but it seems to me to be at the level
of technology more than Public Policy. I would invite people to
(1) explore the US Government agency Web sites, especially for agencies
whose mission is benefits-transfer or other activity that is not-R&D,
not-data gathering/analysis; (2) explore the GILS postings of those
agencies; and (3) contemplate what the cost-benefit is to the American
Taxpayer for the creation and maintenance of the GILS postings by those
agencies. In contemplating the Return-on-Investment for GILS for
textual materials, regulatory postings, and E-FOIA Reading Rooms, I
invite folks to differentiate between documents accessible and readable
with ordinary Web browsers, and Data Bases of scientific, technical and
research data that are not so accessible and readable.

I subscribe to the Public Policy school of thought that says what the
Federal Government needs to give the American Public is an ability to
enter one inquiry, one time, to one place, in subject-matter terms,
independent of agency, and receive a relevancy-ranked hit list return
that pulls from ALL agencies, across the ENTIRE Federal Government.
I cannot do this with the GILS that I know.

Dan Schneider
USDOJ-JMD/IMSS
<schneidd@usdoj.gov>


