Abstract
This article offers some thoughts on
the problems of access to information in a machine-sensible environment, and
the potential of modern library techniques to help in solving them. It explains
how authors and publishers can make information more accessible by providing
indexing information that uses controlled vocabulary, terms from a thesaurus,
or other linguistic assistance to searchers and readers.
Introduction
Effective communication in any context is made
easier by the use of a common language that both parties understand. In human
speech, common language may occasionally be ambiguous or misleading because of
the way it is used, but usually the volume of exchanged messages and the
feedback they provide allow the parties to communicate with each other. In the world of
recorded information, there is not the same opportunity for immediate
interaction, feedback and realization. Writers are not consistent in their use
of all the words that might describe or refer to topics, so a searcher who
chooses to scan the full text of a document must include in a search statement
all the synonyms, related terms, and all levels of detail that the author might
have used. The searcher is not talking to the author directly but to the
document itself, a static instrument. Therefore we need a substitute for the
result of the human, spoken interchange: the "Ah, yes, I see. What you
mean is ... " That needed substitute is the controlled index language
(thesaurus) that the indexer uses to interpret and represent the themes,
concepts, and language of the author and that the searcher uses to interpret
and represent sometimes vague expressions of a need to know.
Indexing languages
There are three main types of indexing languages.
- Controlled indexing language - Only approved terms can be used by the indexer to describe the document.
- Natural language indexing language - Any term from the document in question can be used to describe the document.
- Free indexing language - Any term (not only from the document) can be used to describe the document.
When indexing a document, the indexer also has to choose the level of indexing
exhaustivity, that is, the level of detail in which the document is described.
For example, with low indexing exhaustivity, minor aspects of the work are not
described with index terms. In general, the higher the indexing exhaustivity,
the more terms are indexed for each document.
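The effect of exhaustivity can be sketched in a few lines of Python. The candidate terms, their importance weights, and the select_index_terms helper are hypothetical, invented for illustration; a real indexer judges importance by reading the document.

```python
def select_index_terms(candidates, exhaustivity):
    """Keep the `exhaustivity` most important candidate terms.

    `candidates` maps each candidate term to a hypothetical importance
    weight between 0 and 1 (how central the topic is to the document).
    """
    ranked = sorted(candidates, key=lambda term: candidates[term], reverse=True)
    return ranked[:exhaustivity]


# Illustrative weights for one imaginary article.
candidates = {
    "stadium architecture": 0.9,
    "urban planning": 0.6,
    "football": 0.2,
}

low = select_index_terms(candidates, exhaustivity=1)
high = select_index_terms(candidates, exhaustivity=3)
print(low)   # only the main topic is indexed
print(high)  # minor aspects such as "football" are indexed as well
```

With low exhaustivity the article receives a single index term; with high exhaustivity its minor aspects become searchable too.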
In recent years, free text search as a means of access to documents has become
popular. This involves using natural language indexing with the indexing
exhaustivity set to maximum (every word in the text is indexed).
Many studies have been done to compare the efficiency and effectiveness of free
text searches against documents that have been indexed by experts using a few
well chosen controlled vocabulary descriptors.
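Indexing every word of the text amounts to building an inverted index. The following is a minimal sketch, assuming a toy in-memory collection; the sample documents and the build_full_text_index name are made up for illustration, and a production system would add stemming, stop words, and ranking.

```python
import re
from collections import defaultdict


def build_full_text_index(documents):
    """Map every lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


# Two invented documents standing in for a collection.
documents = {
    "d1": "The football match ended in a draw.",
    "d2": "Rugby league rules differ from rugby union rules.",
}

index = build_full_text_index(documents)
print(sorted(index["football"]))
print(sorted(index["rugby"]))
```

Every word, including trivial ones such as "the", becomes a search key, which is what maximum exhaustivity means in practice.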
Controlled vocabularies are often claimed to improve the accuracy of free text
searching, for example by reducing the number of irrelevant items in the
retrieval list. These irrelevant items (false positives) are often caused by
the inherent ambiguity of natural language.
Take the English word football, for example. Football is the name given to a
number of different team sports. Worldwide, the most popular of these team
sports is Association football, which is also called soccer in several
countries. The English word football is also applied to Rugby football (Rugby
union and rugby league), American football, Australian rules football, Gaelic
football, and Canadian football. A search for football will therefore retrieve
documents that are about several completely different sports. A controlled
vocabulary solves this problem by tagging the documents in such a way that the
ambiguities are eliminated.
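The football example can be made concrete with a small sketch. The three documents, their descriptors, and both search functions are hypothetical and only illustrate the contrast between matching words and matching assigned terms.

```python
# Invented documents; "descriptors" holds the indexer's controlled terms.
tagged_documents = [
    {"id": "a", "text": "The football final went to penalties.",
     "descriptors": {"Association football"}},
    {"id": "b", "text": "The football season ends with the Super Bowl.",
     "descriptors": {"American football"}},
    {"id": "c", "text": "Football at Croke Park drew a record crowd.",
     "descriptors": {"Gaelic football"}},
]


def free_text_search(docs, word):
    """Return ids of documents whose text contains the word (case-insensitive)."""
    return [d["id"] for d in docs if word.lower() in d["text"].lower()]


def descriptor_search(docs, descriptor):
    """Return ids of documents tagged with exactly this controlled term."""
    return [d["id"] for d in docs if descriptor in d["descriptors"]]


all_hits = free_text_search(tagged_documents, "football")
american_hits = descriptor_search(tagged_documents, "American football")
print(all_hits)       # every sport that happens to be called football
print(american_hits)  # only the sport the searcher meant
```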
Compared to free text searching, the
use of a controlled vocabulary can dramatically increase the performance of an
information retrieval system, if performance is measured by precision (the
percentage of documents in the retrieval list that are actually relevant to the search topic).
In some cases controlled vocabulary
can enhance recall as well, because unlike natural language schemes, once the
correct authorized term is searched, you don't need to worry about searching
for other terms that might be synonyms of that term.
However, a controlled vocabulary
search may also lead to unsatisfactory recall, in that it will fail to retrieve some documents that are
actually relevant to the search question.
This is particularly problematic when the search question involves terms that
are tangential enough to the subject area that the indexer might have tagged
the document with a different term, one that the searcher would regard as
equivalent. Essentially, this can be avoided only by an experienced user of the
controlled vocabulary whose understanding of the vocabulary coincides with the
way it is used by the indexer.
Another possibility is that the
article is just not tagged by the indexer because indexing exhaustivity is low.
For example an article might mention football as a secondary focus, and the
indexer might decide not to tag it with "football" because it is not
important enough compared to the main focus. But it turns out that for the
searcher that article is relevant and hence recall fails. A free text search
would automatically pick up that article regardless.
On the other hand, free text searches have high exhaustivity (you search on
every word), so they have the potential for high recall (assuming you solve the
problem of synonyms by entering every combination), but they will have much
lower precision.
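Precision and recall are simple ratios, and the trade-off described above can be illustrated with a short calculation. Recall is taken here in its usual sense, the share of all relevant documents that the search retrieved. The document ids and both result lists are invented figures, not measurements from any study.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical collection: four documents are truly relevant to the question.
relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical free text search: finds everything relevant, plus noise.
free_text_hits = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

# Hypothetical controlled vocabulary search: no noise, but misses d4.
controlled_hits = ["d1", "d2", "d3"]

print(precision(free_text_hits, relevant), recall(free_text_hits, relevant))
print(precision(controlled_hits, relevant), recall(controlled_hits, relevant))
```

In this made-up case the free text search scores 0.5 precision and 1.0 recall, while the controlled vocabulary search scores 1.0 precision and 0.75 recall.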
Controlled vocabularies also become outdated quickly, and in fast-developing
fields of knowledge the authorized terms a searcher needs might not be available
unless the vocabulary is updated regularly. Even
in the best case scenario, controlled language is often not as specific as
using the words of the text itself. Indexers trying to choose the appropriate
index terms might misinterpret the author, while a free text search is in no
danger of doing so, because it uses the author's own words.
The use of controlled vocabularies
can be costly compared to free text searches because human experts or expensive
automated systems are necessary to index each entry. Furthermore, the user has
to be familiar with the controlled vocabulary scheme to make best use of the
system. But as already mentioned, the control of synonyms and homographs can
help increase precision.
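Synonym control is usually implemented as a lookup from entry terms to the single authorized term. The sketch below assumes a tiny hand-made table; the USE_REFERENCES mapping and the authorized_term function are hypothetical and do not reproduce any published thesaurus.

```python
# Hypothetical entry terms mapped to the one authorized term for each concept.
USE_REFERENCES = {
    "soccer": "Association football",
    "association football": "Association football",
    "gridiron": "American football",
    "american football": "American football",
}


def authorized_term(query):
    """Return the authorized term for a query, or None if the vocabulary lacks it."""
    return USE_REFERENCES.get(query.strip().lower())


print(authorized_term("Soccer"))
print(authorized_term("quidditch"))  # not in this vocabulary
```

A query that returns None shows the cost side of the trade-off: the searcher has to know, or be guided to, a term that the vocabulary actually contains.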
Numerous methodologies have been
developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be
described in multiple ways.
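Faceted classification can be pictured as giving each record one value per facet and letting the searcher combine facets freely. This is a minimal sketch; the facet names, the records, and the filter_by_facets helper are all invented for illustration.

```python
# Invented records, each described along three independent facets.
records = [
    {"title": "Six Nations preview",
     "facets": {"sport": "Rugby union", "region": "Europe",
                "document_type": "Preview"}},
    {"title": "Super Bowl recap",
     "facets": {"sport": "American football", "region": "North America",
                "document_type": "Match report"}},
    {"title": "Champions League final report",
     "facets": {"sport": "Association football", "region": "Europe",
                "document_type": "Match report"}},
]


def filter_by_facets(items, **wanted):
    """Return titles of records that match every requested facet value."""
    return [
        item["title"]
        for item in items
        if all(item["facets"].get(name) == value for name, value in wanted.items())
    ]


european = filter_by_facets(records, region="Europe")
european_reports = filter_by_facets(
    records, region="Europe", document_type="Match report"
)
print(european)
print(european_reports)
```

The same record can be reached by sport, by region, or by document type, which is the sense in which a faceted scheme describes it in multiple ways.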
Applications
Controlled vocabularies, such as the Library of Congress Subject Headings, are
an essential component of bibliography, the study and classification of books.
They were initially developed in library and information science. In the 1950s,
government agencies began to develop controlled vocabularies for the burgeoning
journal literature in specialized fields; an example is the Medical Subject
Headings (MeSH) developed by the U.S. National Library of Medicine.
Subsequently, for-profit firms (called abstracting and indexing services)
emerged to index the fast-growing
literature in every field of knowledge. In the 1960s, an online bibliographic
database industry developed based on dialup X.25 networking. These services were seldom made available to
the public because they were difficult to use; specialist librarians called
search intermediaries handled the searching job. In the 1980s, the first full
text databases appeared; these databases
contain the full text of the indexed articles as well as the bibliographic
information. Online bibliographic databases have migrated to the Internet and
are now publicly available; however, most are proprietary and can be expensive
to use. Students enrolled in colleges and universities may be able to access
some of these services without charge; some of these services may be accessible
without charge at a public library.
In large organizations, controlled vocabularies may be introduced to improve
technical communication. The use of a controlled vocabulary ensures that
everyone is using the same word to mean the same thing. This consistency of
terms is one of the most important concepts in technical writing and knowledge
management, where effort is expended to use the same word throughout a document
or organization instead of slightly different ones to refer to the same thing.
Web searching could be dramatically improved by the development of a controlled
vocabulary for describing Web pages; the use of such a vocabulary could
culminate in a Semantic Web, in which the content of Web pages is described
using a machine-readable metadata scheme. One of the first proposals for such a
scheme is the Dublin Core Initiative. An example of a controlled vocabulary that
is usable for indexing web pages is PSH.
It is unlikely that a single
metadata scheme will ever succeed in describing the content of the entire Web.
To create a Semantic Web, it may be necessary to draw from two or more metadata
systems to describe a Web page's contents. The eXchangeable Faceted Metadata
Language (XFML) is designed to enable controlled vocabulary creators to publish
and share metadata systems. XFML is designed on faceted classification
principles.
Controlled vocabularies of the Semantic Web define the concepts and
relationships (terms) used to describe a field of interest or area of concern.
For instance, to declare a person in a machine-readable format, a vocabulary is
needed that has a formal definition of “Person”. One such vocabulary is Friend
of a Friend (FOAF), which has a Person class that defines typical properties of
a person including, but not limited to, name, honorific prefix, affiliation,
email address, and homepage; another is the Person vocabulary of Schema.org.
Similarly, a book can be described using the Book vocabulary of Schema.org and
general publication terms from the Dublin Core vocabulary, an event with the
Event vocabulary of Schema.org, and so on.
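As an illustration of mixing two vocabularies in one description, the sketch below builds a JSON-LD description of a book in Python, using Schema.org for the type and Dublin Core terms for general publication data. The title, author, year, and page count are placeholders, not a real publication.

```python
import json

# Placeholder book; every value below is invented for illustration.
book = {
    "@context": {
        "schema": "https://schema.org/",
        "dcterms": "http://purl.org/dc/terms/",
    },
    "@type": "schema:Book",
    "schema:name": "An Example Book",
    "schema:numberOfPages": 200,
    "dcterms:creator": "A. N. Author",
    "dcterms:issued": "2001",
    "dcterms:language": "en",
}

book_jsonld = json.dumps(book, indent=2)
print(book_jsonld)
```

The prefixes declared in "@context" tell a consumer which vocabulary each term comes from, so terms from both schemes can sit side by side without ambiguity.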
To use machine-readable terms from
any controlled vocabulary, web designers can choose from a variety of
annotation formats, including RDFa, HTML5 Microdata,
or JSON-LD in the markup, or RDF serializations (RDF/XML, Turtle, N3, TriG, TriX) in
external files.
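Of the in-markup formats, JSON-LD is the easiest to generate programmatically. The following sketch wraps a Schema.org Person description in the script element that carries JSON-LD inside an HTML page. The person, the organization, and the jsonld_script helper are hypothetical; the property names (name, honorificPrefix, affiliation, email, url) are the Schema.org counterparts of the properties listed above.

```python
import json


def jsonld_script(data):
    """Serialize data as JSON-LD and wrap it in an HTML script element."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so that a string value cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Invented person; all values are placeholders.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",
    "honorificPrefix": "Dr.",
    "affiliation": {"@type": "Organization", "name": "Example University"},
    "email": "mailto:jane.doe@example.org",
    "url": "https://example.org/~jane",
}

snippet = jsonld_script(person)
print(snippet)
```

The resulting element can be placed in the head or body of a page; RDFa and Microdata would express the same statements as attributes on the visible HTML instead.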