Computational Linguistics
Application areas of computational linguistics:
Speech-to-text
Text-to-speech
Phone-based speech dialog systems
Speech control
Text generation
Translation technology
Information retrieval
Internet search agents with natural language interface
Information extraction
Question-answer systems
Automatic text summary
Spell checking
Speech-to-text:
Speech-to-text technology is used in dictation systems. A dictation system converts spoken language
input (via a microphone) into the corresponding written text and displays that text on a computer
screen.
The spoken input is analyzed by a speech recognition system. Such a system recognizes a spoken input
(an utterance) by converting the utterance's sound wave into a corresponding speech signal and analyzing
that signal in real time. During the analysis, signal-specific information is extracted from the speech signal.
Using this information together with calculations based on probability theory, the system tries to assign
matching words from a lexicon to the speech signal. For this purpose, the words in the lexicon are stored
together with a transcription in a specific transcription alphabet.
During the development of speech recognition systems, it must be taken into account that identical utterances
by different speakers produce speech signals that are not completely identical, because people have different
voices. Reasons for this include:
- male vs. female voices,
- accents due to regional origin,
- age-dependent voice characteristics.
Nevertheless, a speech recognition system must always arrive at the same result for different speakers.
Text-to-speech:
With the help of text-to-speech systems, written texts (stored on a computer) can be read out automatically.
The text-to-speech system computes the acoustic speech output and plays it back; the output medium is a
loudspeaker. Text-to-speech systems allow some parameters to be customized, such as the language
(English, German, etc.) or a male vs. female voice.
Text-to-speech systems read out their texts with artificial "voices". When a text is read out, correct
pronunciation and intonation matter. A single spoken word consists of individual sounds, and each sound
consists of individual sound components. A text-to-speech system contains all necessary sound components.
Each sound component is a piece of speech signal. To generate a sound or an acoustic word, such sound
components are joined by appending the individual speech signal pieces. Which sound component has to be
selected also depends on the adjoining sound components.
Phone-based speech dialog systems:
Phone-based speech dialog systems enable a dialog between a human and a computer. The purpose of such systems is to relieve
call center staff of answering routine questions, or to offer fully automatic information platforms (voice portals) that normally
only have to answer standardized questions.
For example, a phone-based flight schedule information system lets callers query flights. After the caller has supplied data in
the form of a natural language input, the speech dialog system analyzes the input (sentence) and checks which data have been
supplied and which data are still missing before available flights can be output. If the input sentence already contains all
relevant data, the dialog system can present an answer immediately. If some data are still missing, the speech dialog system
must ask for the missing data in separate dialog steps.
Two examples illustrate how such a dialog works for an automatic flight schedule information system. For such an information
system, the arrival airport, arrival day, arrival time and departure airport are the relevant data. (The departure airport is queried separately.)
In the first example, the user input already contains all relevant data, so the information system is able to present an answer.
Dialog Example 1
System> When do you want to be at the arrival airport?
User> I want to be in Paris next Saturday at 10 a.m.
System> Please say the departure location!
User> Frankfurt.
System> I can offer you the following flights:
Flight ... from Frankfurt to Paris at [date and time].
Flight ... from Frankfurt to Paris at [date and time].
The next example shows the dialog's behavior when the user input does not yet contain all relevant data.
In this example, the arrival time is still missing.
Dialog Example 2
System> When do you want to be at the arrival airport?
User> I want to be in Paris next Saturday.
System> At what time do you want to be in Paris?
User> At 10 a.m.
System> Please say the departure location!
User> Frankfurt.
System> I can offer you the following flights:
Flight ... from Frankfurt to Paris at [date and time].
Flight ... from Frankfurt to Paris at [date and time].
Now all data are available because the missing arrival time was supplied in an additional dialog step.
So the dialog system is able to output the corresponding flights.
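The behavior in the two dialog examples (check which data are present, then ask only for what is missing) can be sketched in Python as follows. This is a minimal illustration, not the method of any particular system: the slot names, prompts and regular expressions are assumptions, the input is assumed to be already-recognized text, and the system's opening question is not modeled.

```python
import re

# Data the flight information system needs, in the order it asks for them.
# Slot names and prompts are illustrative assumptions.
PROMPTS = {
    "arrival_airport": "Where do you want to fly to?",
    "arrival_day": "On which day do you want to arrive?",
    "arrival_time": "At what time do you want to be there?",
    "departure_airport": "Please say the departure location!",
}

# Toy "language understanding": one regular expression per slot.
PATTERNS = {
    "arrival_airport": re.compile(r"\bin ([A-Z][a-z]+)"),
    "arrival_day": re.compile(r"\bnext (\w+day)", re.IGNORECASE),
    "arrival_time": re.compile(r"\bat (\d{1,2}(?::\d{2})? ?[ap]\.m\.)", re.IGNORECASE),
}


def analyze(utterance, expected_slot=None):
    """Return the slot values found in one user utterance."""
    found = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            found[slot] = match.group(1)
    # A bare answer such as "Frankfurt." fills the slot the system just asked for.
    if not found and expected_slot:
        found[expected_slot] = utterance.strip(" .!")
    return found


def next_prompt(slots):
    """Return (slot, prompt) for the first missing slot, or None if complete."""
    for slot, prompt in PROMPTS.items():
        if slot not in slots:
            return slot, prompt
    return None


def run_dialog(user_turns):
    """Feed scripted user turns into the dialog; return the slots and the prompts asked."""
    slots, asked = {}, []
    expected = None
    for turn in user_turns:
        slots.update(analyze(turn, expected))
        step = next_prompt(slots)
        if step is None:
            break  # all data available; flights could be looked up now
        expected, prompt = step
        asked.append(prompt)
    return slots, asked
```

Calling `run_dialog(["I want to be in Paris next Saturday.", "At 10 a.m.", "Frankfurt."])` reproduces the second example: the system first asks for the time and then for the departure location.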
In a phone-based speech dialog system, the caller's spoken utterances are analyzed in real time by a speech
recognition and language processing system. The language processing system computes a formal meaning of the utterance so that
the speech dialog system is able to react accordingly. The dialog system can understand only utterances that refer to the application
domain of the underlying dialog system (in the example: a flight schedule information system). Based on the speech input, the dialog
system computes an answer, which is produced by a text generation system and read out by a text-to-speech system.
Speech control:
Speech control means that technical systems such as robots, electronic devices or application software are controlled
by spoken language. Normally, speech control consists of speech commands. A speech command is a single word
or a short sequence of words. Speech commands for a robot control system could be: "10 meters forward!" or
"Turn 90 degrees to the left!".
Command-based speech control is characterized by short and simple utterances that have to be processed.
System-specific speech commands must be specified by the system developers and made available to the later users (so
that users can learn these commands). Using fixed speech commands ensures that speech control works relatively reliably
and that there are only few misunderstandings between user and computer. If necessary, several speech commands must be
given until an action is executed completely. Speech dialog systems can be used for this.
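Fixed speech commands of this kind can be sketched as a small set of patterns that map recognized text to an action. The following Python snippet is only an illustration: the command set, the action names and the function `parse_command` are assumptions, and the speech recognition step that produces the text is not shown.

```python
import re

# Fixed command patterns (an assumed, illustrative command set).
COMMANDS = [
    (re.compile(r"^(\d+) meters? (forward|backward)!?$", re.IGNORECASE), "move"),
    (re.compile(r"^turn(?:ing)? (\d+) degrees to the (left|right)!?$", re.IGNORECASE), "turn"),
]


def parse_command(text):
    """Map recognized text to an (action, amount, direction) tuple, or None."""
    text = text.strip()
    for pattern, action in COMMANDS:
        match = pattern.match(text)
        if match:
            return action, int(match.group(1)), match.group(2).lower()
    return None  # not a known command; the system could ask the user to repeat
```

Because only a few fixed patterns are accepted, anything else is rejected instead of being misinterpreted, which is one reason why command-based control is comparatively reliable.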
Speech control systems are applied in situations where conventional (haptic) control is too cumbersome.
Meanwhile, systems are being developed that accept not only speech commands but also complete sentences, to make speech
control more user-friendly. This allows more information to be put into an utterance and makes communication
more natural. A possible sentence input for a robot control application could be:
"After moving forward 10 meters, turn 90 degrees to the left!"
Text generation:
Text generation systems automatically generate a completely new text. The basis of such a system is usually a
non-linguistic data source such as a table. Text generation systems are normally developed for specific application
domains such as the automatic generation of weather forecasts.
A text generation system has a text planning component and a sentence planning component, as well as a knowledge base
that contains domain-specific knowledge and into which the data from the data source are loaded. For this purpose, the
knowledge base has so-called application concepts to which the corresponding data from the data source are assigned.
Application concepts for the application domain "weather forecast" could be: wind speed, temperature and air pressure.
The text planning component's task is to determine what the text should express (the content) and how the text should
express it (the means). To do so, the text planning component accesses the knowledge base. The sentence planning component
first forms content-based sentence descriptions and then the corresponding sentences (with the help of linguistic knowledge
and a domain-specific lexicon). The motivation for developing text generation systems is that information in text form is
usually easier to understand than information in a formal structure.
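The pipeline described above (data source, knowledge base with application concepts, text planning, sentence planning) can be sketched in a few lines of Python. This is a deliberately simplified illustration: the table row, the concept names, the wind threshold and the sentence templates are all assumptions, and real systems use far richer linguistic knowledge than fixed templates.

```python
# One row of a hypothetical weather table (the non-linguistic data source).
ROW = {"region": "Heidelberg", "temperature_c": 4, "wind_speed_kmh": 35, "air_pressure_hpa": 1002}


def load_knowledge_base(row):
    """Assign raw data to application concepts (concept names are assumptions)."""
    return {
        "place": row["region"],
        "temperature": row["temperature_c"],
        "wind speed": row["wind_speed_kmh"],
        "air pressure": row["air_pressure_hpa"],
    }


def plan_text(kb):
    """Text planning: decide what to say, as content-based sentence descriptions."""
    plan = [("temperature", kb["place"], kb["temperature"])]
    if kb["wind speed"] >= 30:  # assumed threshold for mentioning the wind
        plan.append(("wind", kb["wind speed"]))
    return plan


def realize(plan):
    """Sentence planning: turn each description into a sentence via templates."""
    sentences = []
    for item in plan:
        if item[0] == "temperature":
            sentences.append(f"The temperature in {item[1]} will be around {item[2]} degrees Celsius.")
        elif item[0] == "wind":
            sentences.append(f"Expect strong wind of up to {item[1]} km/h.")
    return " ".join(sentences)


forecast = realize(plan_text(load_knowledge_base(ROW)))
print(forecast)
```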
Translation technology:
There are two approaches in the area of translation technology:
- machine translation and
- computer-aided translation.
Machine translation systems are systems that can independently translate a given text from one language
into another language.
There are two approaches to machine translation:
- the rule-based approach and
- the statistical approach.
Machine translation systems based on the rule-based approach (such as the transfer approach) analyze the sentences
of the source text one after another by looking at their sentence structures. For the structures of the source
sentence, corresponding target language structures are computed, and from those the corresponding sentences
are generated. Machine translation systems based on the statistical approach select the most probable
translation for each source sentence. To make that possible, statistical machine translation systems must be trained
with huge collections of electronic reference translations. After training, a statistical translation system is in
principle able to translate new sentences.
Machine translation systems are often confronted with ambiguous words. An example of an ambiguous word is
the German word "Bank", which can be translated into English as "bench" or as "bank" (depending on the context).
With the help of certain procedures, the right reading can be determined from the context. The longer and more complex
the sentences to be translated, the harder the translation process. Therefore, pre-editing or post-editing is often
necessary. In application domains that get by with simple sentences and a restricted vocabulary (such as
technical documentation), so-called "controlled language" can be used. This is intended to ensure that the text can be
translated automatically and without problems.
In contrast to machine translation, computer-aided translation deals with tools that support a human translator's work.
So-called translation memory systems belong to this category. A translation memory system contains finished, correct
translations in the form of bilingual sentence pairs. If a human translator wants to translate a sentence, he or she can
look up that sentence in the translation memory system. If the translation memory finds a translation for the sentence,
that translation is presented. It is not necessary to find exactly the same sentence in the translation memory;
it is sufficient if a similar sentence is found. The degree of similarity is shown, for example, as a
percentage.
Translation agencies use translation memory systems so that a given sentence is always translated identically
(quality assurance).
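The lookup in a translation memory, including the similarity percentage, can be sketched with Python's standard `difflib` module. The sentence pairs, the function name `lookup` and the 70 percent threshold are assumptions made for this illustration; commercial translation memory systems use their own matching procedures.

```python
from difflib import SequenceMatcher

# A tiny translation memory: source sentence -> stored translation.
# The sentence pairs are invented for illustration.
MEMORY = {
    "Press the red button to stop the machine.": "Drücken Sie den roten Knopf, um die Maschine anzuhalten.",
    "Switch off the device before cleaning.": "Schalten Sie das Gerät vor der Reinigung aus.",
}


def lookup(sentence, threshold=0.7):
    """Return (source, translation, similarity in percent) of the best match, or None."""
    best_source, best_ratio = None, 0.0
    for source in MEMORY:
        ratio = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if ratio > best_ratio:
            best_source, best_ratio = source, ratio
    if best_source is None or best_ratio < threshold:
        return None  # nothing similar enough is stored
    return best_source, MEMORY[best_source], round(best_ratio * 100)
```

An exact match yields 100 percent. A sentence that differs in one word, such as "Press the green button to stop the machine.", still returns the stored pair with a lower percentage, so the translator can adapt the proposed translation instead of starting from scratch.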
Information retrieval:
The purpose of information retrieval systems is to find documents (e.g. on the internet) that match the criteria
of an entered search query consisting of search terms. These search terms are compared with the index terms of the
available documents. Index terms are terms that describe a document as exactly as possible. If the terms in the search
query are also contained in a document (as document index terms), that document is selected as a "hit".
In order to also find documents that do not contain the exact search query terms but valid alternative
terms, information retrieval systems can be equipped with "linguistic intelligence".
The following example shows how a retrieval system with linguistic intelligence works: for the German search term
"blaues Fahrzeug" (blue vehicle), such a retrieval system can find documents with the index terms
{"blau", "Fahrzeug"} as well as {"blaues", "Fahrzeug"}, because it can analyze the entered search terms
morpho-syntactically. That is, the word form "blau" can be derived automatically from the word form "blaues".
A further option for retrieval systems is the use of semantic methods. That is, for an entered search term,
documents with semantically equivalent index terms can be found. To make semantic search possible, so-called thesauri
are necessary. A thesaurus stores alternative words for a given word. Thus it is possible, for example, to find
documents containing only the index terms "car" or "Toyota" when the search query contains the term "vehicle"
instead. A combination with morpho-syntactic analysis is possible.
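Both kinds of "linguistic intelligence" (deriving base forms and expanding terms through a thesaurus) can be combined in a small Python sketch. The document index, the thesaurus entries and the crude suffix stripping are assumptions for illustration; a real system would use a proper morphological analyzer and a much larger thesaurus.

```python
# Hypothetical document index: document id -> index terms.
INDEX = {
    "doc1": {"blau", "fahrzeug"},
    "doc2": {"rot", "fahrzeug"},
    "doc3": {"car", "toyota"},
}

# Toy thesaurus: a term and alternative words that should also match it.
THESAURUS = {"vehicle": {"car", "toyota"}}

# Crude suffix stripping for German adjective endings (an assumption that
# stands in for real morpho-syntactic analysis).
SUFFIXES = ("es", "er", "em", "en", "e")


def base_form(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(term):
    """Return the term, its base form, and any thesaurus alternatives."""
    variants = {term.lower(), base_form(term)}
    for variant in list(variants):
        variants |= THESAURUS.get(variant, set())
    return variants


def search(query):
    """Return ids of documents that match every query term in some variant."""
    hits = []
    for doc_id, terms in INDEX.items():
        if all(expand(term) & terms for term in query.split()):
            hits.append(doc_id)
    return hits
```

With this index, `search("blaues Fahrzeug")` selects `doc1` through the derived form "blau", and `search("vehicle")` selects `doc3` through the thesaurus.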
Internet search agents with natural language interface:
An internet search agent is a computer program that independently searches the internet for given information.
In doing so, it uses search engines and internet databases. The agent analyzes the information found,
processes it, presents it, and stores it on the computer or sends it to the user by e-mail.
The information to be searched for can be given to the agent in the form of an (interrogative) sentence via a natural
language interface.
The so-called Semantic Web ("Web 3.0") is well suited to the work of a search agent with a natural language interface.
In the Semantic Web, information is (semantically) described with special formal description languages. The purpose
of the Semantic Web is that computer programs, too, can read and use the information contained in web sites (and
not just humans).
A use case for such an internet-based search agent could be a system that searches for certain courses of study.
A corresponding natural language query could look as follows:
> Bachelor in bio-chemistry, at most 100 kilometers away from Heidelberg.
In order to find matching offers, semantic web sites are necessary that contain information about courses of
study and geographic data such as {"bachelor", "bio-chemistry", "100" (km), "Heidelberg"} as well as the
semantic descriptions belonging to them, such as {degree, study, distance, place}.
(Note: Distances could be determined by an ontology that contains a distance table.)
In such a system, each entered sentence is read from the natural language interface and passed to a language
processing program. The language processing program analyzes the sentence and computes a formal meaning for it.
Based on the computed sentence meaning, a corresponding computer statement is determined. With that statement,
the search agent knows what to do and starts its work.
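The step from an entered sentence to a formal statement can be illustrated for the example query above. The following Python sketch handles only this one sentence pattern; the pattern, the field names and the function `to_statement` are assumptions, and a real language processing program would perform a full linguistic analysis instead of matching a single regular expression.

```python
import re

# Pattern for queries of the form
# "<degree> in <subject>, at most <n> kilometers away from <place>."
# The pattern and the field names are assumptions made for this sketch.
QUERY = re.compile(
    r"^(?P<degree>\w+) in (?P<study>[\w-]+), "
    r"at most (?P<distance>\d+) kilometers away from (?P<place>\w+)\.?$",
    re.IGNORECASE,
)


def to_statement(sentence):
    """Turn a query sentence into a formal search statement (a dict), or None."""
    match = QUERY.match(sentence.strip())
    if not match:
        return None
    return {
        "degree": match.group("degree").lower(),
        "study": match.group("study").lower(),
        "max_distance_km": int(match.group("distance")),
        "place": match.group("place"),
    }
```

The resulting dictionary uses the same kinds of semantic descriptions (degree, study, distance, place) that the semantic web sites would have to provide, so that query and data can be compared.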
Information extraction:
Information extraction systems are domain-specific systems that can analyze texts with regard to their content.
The task of such a system is to find domain-specific information in a text, extract it, and store it in
a database for further use. For this, the information extraction system must provide domain-specific concepts
to which information from the text being analyzed can be assigned.
An information extraction system must analyze all sentences one after another in order to compute sentence
meanings (sentence semantics) and to assign information from the underlying text to the corresponding concepts.
Only on the basis of the computed sentence meanings is the system able to decide which sentences contain relevant
information.
An example illustrates how an information extraction system works: an information extraction system
for the domain "enterprise-specific information" analyzes texts that contain information about the sale of
enterprises. For this, the following concepts are provided for the information extraction system:
- [sold enterprise]:
- [purchasing enterprise]:
- [purchase price]:
For a given text, the corresponding information from the underlying text must be assigned to the concepts defined
above. A fictitious text could contain the following sentence:
- "Smith Inc. was sold for 5 million euros to Miller Inc."
Based on this sentence, the information extraction system could make the following assignment:
- [sold enterprise]: "Smith Inc."
- [purchasing enterprise]: "Miller Inc."
- [purchase price]: "5,000,000 Euro"
The information extracted from the underlying text can now be stored in a database. For this purpose, a corresponding
table structure must be provided. Storing it in a database allows other people to access and work with the extracted
information/data.
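The assignment of text information to concepts and the subsequent storage in a database can be sketched as follows. This Python example is a strong simplification: a single hand-written regular expression stands in for the computation of sentence meanings, and the table layout and column names are assumptions.

```python
import re
import sqlite3

# Pattern for sentences like "<A> was sold for <n> million euros to <B>."
# A single hand-written pattern stands in for real sentence analysis here.
SALE = re.compile(
    r"(?P<sold>[A-Z][\w ]*? Inc\.) was sold for "
    r"(?P<millions>\d+) million euros to (?P<buyer>[A-Z][\w ]*? Inc\.)"
)


def extract(sentence):
    """Fill the three concepts from one sentence, or return None."""
    match = SALE.search(sentence)
    if not match:
        return None  # the sentence contains no relevant information
    return {
        "sold enterprise": match.group("sold"),
        "purchasing enterprise": match.group("buyer"),
        "purchase price": int(match.group("millions")) * 1_000_000,
    }


record = extract("Smith Inc. was sold for 5 million euros to Miller Inc.")

# Store the assignment in a database table (in memory for this sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (sold TEXT, buyer TEXT, price_eur INTEGER)")
db.execute(
    "INSERT INTO sales VALUES (?, ?, ?)",
    (record["sold enterprise"], record["purchasing enterprise"], record["purchase price"]),
)
rows = db.execute("SELECT sold, buyer, price_eur FROM sales").fetchall()
```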
Question-answer systems:
Question-answer systems are systems that generate and output an answer text for a natural language
question given as input. The input is normally in written form (keyboard).
The answer text can likewise be presented in written form (computer screen) or acoustically (loudspeaker).
The answer text should be detailed and informative. If the answer text is displayed on a computer screen, it
can be combined with additional multimedia elements such as graphics or sound effects. Normally,
question-answer systems are restricted to a given application domain such as medical or technical knowledge.
There are two approaches for question-answer systems:
- generating a new answer text,
- answer extraction.
In the first case, the question-answer system consists of analysis and generation. A given interrogative sentence is handed over
to a language processing system, which analyzes the interrogative sentence and compares the analysis result with a knowledge base.
The knowledge base contains domain knowledge (e.g. technical knowledge).
For those knowledge entries that fit the given interrogative sentence, a corresponding answer text can be generated
by a text generation system. In the second case, a fitting text segment (as the answer) is sought in an underlying text for a given
interrogative sentence. For this, information extraction methods are applied to the underlying text. That is, concepts are provided
that can conceptually describe the sentences occurring in the text. If a natural language question fits certain
concepts and their assigned text information, the corresponding underlying text segment is presented as the answer text.
Automatic text summary:
Automatic text summarization systems can automatically build a summary of a given text. Such systems can be
used, for example, for searching the internet (and intranets). Because people do not want to read every retrieved
text document in full, it would be useful to generate an automatic summary for each retrieved text document. This
makes it easier for users to get a general idea of the text. Users of such systems should be able to adjust the degree
of summarization (i.e., from a very short summary to a longer one).
There are two approaches to automatic summarization:
- the knowledge-based approach and
- the statistical approach.
One way of implementing the knowledge-based approach is to separate the important from the unimportant parts of a text
segment. For this, the central assertion of each text segment is located. The remaining assertions of the text segment
under consideration are only supplementary information. To be meaningful, a summary produced by this procedure
must contain at least all central assertions of the original text, so that the content of the original text is preserved.
But the summary can also contain additional assertions, depending on the degree of summarization. A second way of
implementing the knowledge-based approach is to use an information extraction system. This information extraction
system can extract relevant information from the underlying text and assign that information to domain-specific
concepts. Based on the concepts and the assigned information, a text generation system can build a text summary.
In the statistical approach, those sentences are selected for the summary that contain a high number of
"important" words. Which words are important must be determined beforehand by analyzing huge text corpora during
the development process. The more frequently a word occurs in such a text corpus, the more important that word is. If a
text summarization system is built for domain-specific texts, then the text corpora used must also contain corresponding
domain-specific texts. In this way, the importance of words corresponds to the respective domain. (The reason is that a
word can be important in one domain but less important in another.)
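The statistical approach can be sketched in Python as follows: word importance is counted once on a reference corpus, and the sentences with the most important words are kept. The stop word list, the `min_count` threshold and the function names are assumptions added for this illustration (the text above does not mention stop words; without them, words like "the" would dominate the counts).

```python
import re
from collections import Counter

# Assumption: very common function words are excluded, otherwise "the" would
# always count as the most important word.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "it", "by"}


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def word_importance(corpus_texts):
    """Count how often each non-stop word occurs in the reference corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(w for w in words(text) if w not in STOP_WORDS)
    return counts


def summarize(text, importance, num_sentences=1, min_count=2):
    """Keep the sentences with the most 'important' words, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    important = {w for w, c in importance.items() if c >= min_count}
    scored = [
        (sum(1 for w in words(s) if w in important), position)
        for position, s in enumerate(sentences)
    ]
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    return " ".join(sentences[position] for _, position in sorted(best, key=lambda p: p[1]))
```

The parameter `num_sentences` corresponds to the adjustable degree of summarization, and supplying a domain-specific corpus to `word_importance` makes the word weights domain-specific.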
Spell checking:
Spell checkers are programs that can find possible spelling mistakes, highlight them and correct them independently.
Spell checkers are frequently an integral part of word processing programs, but they can also be separate
applications.
A spell checker has a system lexicon that contains all common words of a language (e.g. English). A simple spell
checker runs through a text and compares the words occurring in that text with the lexicon entries. If the spell checker
finds a "word" (a sequence of letters) that is not contained in the system lexicon, it tries to find a similar word in its
system lexicon. Which word counts as similar is determined by dedicated similarity procedures. But a purely word-by-word
procedure is not always sufficient. So more advanced spell checkers also take context into account; i.e., they also look at
the surrounding words (context) of a detected spelling mistake.
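A simple word-by-word spell checker of the kind described above can be sketched with Python's standard `difflib` module. The tiny lexicon and the similarity cutoff are assumptions, and `difflib`'s similarity ratio is just one possible similarity procedure; context is not considered in this sketch.

```python
import re
from difflib import get_close_matches

# A tiny stand-in for the system lexicon.
LEXICON = {"a", "spell", "checker", "finds", "possible", "spelling", "mistakes", "similar", "word"}


def check(text):
    """Return {unknown word: suggested lexicon words} for the given text."""
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower not in LEXICON:
            # get_close_matches ranks candidates by difflib's similarity ratio.
            suggestions[word] = get_close_matches(lower, sorted(LEXICON), n=3, cutoff=0.7)
    return suggestions
```

For the input "A spell checker finds a similiar word", the only unknown word is "similiar", and the first suggestion is "similar".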

