Semantic Web Use Cases and Case Studies

Use Case: SINDI-WALKS: Workbench for PLOT-based Technological Information Extraction and Management

 

Sungho Shin, Hong-Woo Chun, Chang-Hoo Jeong, Sung-Pil Choi, Hanmin Jung

KISTI, Korea

Oct. 2012


 

General Description

Introduction

The maturity of text mining methods enables us to examine up-to-date technological trends and activities efficiently. Many commercial and open source analytic tools and services, such as IBM LanguageWare, SAS Enterprise Miner, IBM SPSS, GATE and Apache UIMA, can analyze unstructured textual data and extract hidden technical expertise from it. However, one of the most critical problems with those systems is that many of their approaches are recall-centric: they mask their imprecision with overly abundant analysis outputs, such as retrieved and clustered document sets or statistical data, which users must then interpret further.

 

In this situation, it is important and urgent to provide solid mechanisms, based on both natural language processing and text mining, for effectively analyzing and accessing the vast amount of scientific information in articles, patents and technical reports. One of the most promising ways to analyze scientific literature is to identify significant keywords in texts, namely the names of persons, locations and organizations as well as technological terms (collectively designated PLOTs), and to use them to build a technological knowledge base formed by a large set of semantic triples. These triples can then be used to construct semantic services for technological intelligence.

 

To extract PLOTs and their semantic relations from scientific literature effectively, it is important to optimize the extraction engines and to support the efficient construction of the necessary linguistic resources. We define PLOTs (Person, Location, Organization and Terminology) as all the pivotal named entities appearing in scientific texts, as shown in Table 1. By scientific intelligence in a particular domain, we mean a set of semantic triples, each composed of two PLOTs and their semantic relation. Our proposed system, called SINDI-WALKS, can generate, manage and visualize semantic triples efficiently and can therefore provide essential knowledge for analyzing technological trends and activities.

 

Table 1. TYPES OF PLOTS AND EXAMPLES

PLOT         | Examples
Person       | R&D researchers, tech biz practitioners, R&D policy decision makers
Location     | nations, cities and physical positions
Organization | universities, research centers, institutes and governmental agencies
Term         | technology names, product names, technical models, solutions
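
The definitions above (a PLOT with one of four types, and a semantic triple made of two PLOTs and a relation) can be sketched as a small data model. This is only an illustration under stated assumptions: the class and field names, the relation label "located_in" and the sample sentence are hypothetical and are not taken from SINDI-WALKS itself.

```python
from dataclasses import dataclass
from enum import Enum


class PlotType(Enum):
    """The four PLOT categories listed in Table 1."""
    PERSON = "Person"
    LOCATION = "Location"
    ORGANIZATION = "Organization"
    TERM = "Term"


@dataclass(frozen=True)
class Plot:
    text: str            # surface form found in the document
    plot_type: PlotType  # one of the four categories


@dataclass(frozen=True)
class Triple:
    subject: Plot
    relation: str        # hypothetical relation label, not from the source
    obj: Plot
    source_sentence: str = ""  # sentence the triple was extracted from


def as_tuple(triple: Triple) -> tuple:
    """Flatten a triple to (subject text, relation, object text)."""
    return (triple.subject.text, triple.relation, triple.obj.text)


# Toy example; the relation name and sentence are invented for illustration.
kisti = Plot("KISTI", PlotType.ORGANIZATION)
korea = Plot("Korea", PlotType.LOCATION)
example = Triple(kisti, "located_in", korea, "KISTI is located in Korea.")
```

Keeping the source sentence on each triple mirrors the way the visualization tool described later shows the text from which a triple was extracted.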

 

The solution

(System Architecture) Figure 1 shows the components of our proposed system, SINDI-WALKS, and their inter-relationships. The system contains six core components: SINDI-CORE (the PLOT recognition engine), SINDI-LINK (the relation extraction engine), one test bed for each of the two engines, a collection construction tool and a graph-based visual management tool.

 

Figure 1. System Component Architecture

 

First, the two test beds provide detailed information about the two engines (SINDI-CORE and SINDI-LINK) at runtime. For example, the test bed for SINDI-LINK enables users to easily inspect the lexical and syntactic structures that express the semantic relation between two PLOTs in a sentence. In addition, the collection construction tool can be used to develop evaluation or training sets for improving the performance of the core engines. Finally, the generated semantic triple sets can be managed, analyzed and visualized intuitively by means of the visualization and management tool.

 

(Engines) Figure 2 shows the overall architecture of the core engines (SINDI-CORE and SINDI-LINK). The two engines are combined in a single SINDI framework, but each can be executed independently. They share common resources and components such as linguistic dictionaries, Part-of-Speech (POS) taggers and syntactic analyzers. Given the input documents, SINDI-CORE recognizes the PLOTs appearing in each text based on the PLOT dictionary and a term-hood computation. We use an integrated PLOT dictionary collected from various sources, as shown in Table 2.

 

Figure 2. SINDI Architecture (SINDI-CORE and SINDI-LINK)

 

Table 2. STATISTICS OF THE PLOT DICTIONARY

Domain      | Source                      | Contents   | #entries  | Types
Biomedicine | NIH                         | MeSH       | 166,358   | Term
All         | KISTI                       | NDSL       | 120,718   | Term
All         | KISTI                       | STEAK      | 1,302,776 | Term
Environment | European Environment Agency | GEMET      | 5,293     | Term
Environment | KEITI                       | Env. Dic.  | 35,333    | Term
Biomedicine | NIH                         | UMLS       | 432,822   | Term
All         | Wikipedia                   | Wiki Title | 3,636,000 | Term
All         | OntoNote                    | PLO        | 10,227    | PLO
All         | MUC                         | PLO        | 6,804     | PLO
All         | ACE                         | PLO        | 13,147    | PLO
All         | Wikipedia                   | PLO        | 48,545    | PLO
Total       |                             |            | 5,778,023 | PLOT
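
To make the dictionary-based part of PLOT recognition concrete, the following is a minimal sketch of a greedy longest-match lookup against a PLOT dictionary. The source does not state which matching strategy SINDI-CORE uses, so longest-match is an assumption here, and the function name, the `max_len` parameter and the three-entry toy dictionary are all hypothetical.

```python
def recognize_plots(tokens, dictionary, max_len=5):
    """Greedy longest-match lookup of token n-grams in a PLOT dictionary.

    Returns a list of (start, end, matched_text, plot_type) tuples, where
    start/end are token offsets (end is exclusive).
    """
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in dictionary:
                found = (i, i + n, candidate, dictionary[candidate])
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched span
        else:
            i += 1
    return matches


# Toy dictionary standing in for the integrated PLOT dictionary (assumption).
TOY_DICTIONARY = {
    "kisti": "Organization",
    "korea": "Location",
    "text mining": "Term",
}

tokens = "KISTI in Korea studies text mining".split()
found_plots = recognize_plots(tokens, TOY_DICTIONARY)
```

A pure dictionary lookup like this cannot find terms missing from the dictionary, which is presumably why the engine combines it with a statistical term-hood computation.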

 

Taking the PLOT-annotated documents generated by SINDI-CORE as input, SINDI-LINK identifies the Predicate Argument Structure (PAS) pattern of the context surrounding two PLOTs in a sentence, which may express the semantic relation between them. Using a previously constructed PAS-based relation pattern dictionary, in which each entry is assigned to a particular relation type, SINDI-LINK finds the dictionary pattern most similar to the target pattern and assigns the corresponding relation type to the two PLOTs.
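
The nearest-pattern lookup described above can be illustrated with a small sketch. The source does not specify how PAS patterns are represented or which similarity measure SINDI-LINK uses; here a pattern is assumed to be a tuple of tokens with ARG1/ARG2 placeholders, similarity is assumed to be Jaccard overlap, and the pattern entries, relation names and threshold are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sequences, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical pattern dictionary: pattern -> relation type.
PATTERN_DICT = {
    ("ARG1", "develop", "ARG2"): "develops",
    ("ARG1", "be", "located", "in", "ARG2"): "locatedIn",
    ("ARG1", "cause", "ARG2"): "causes",
}


def assign_relation(target, patterns, threshold=0.6):
    """Return (relation_type, score) for the most similar dictionary pattern.

    relation_type is None when no pattern reaches the threshold.
    """
    best_pattern = max(patterns, key=lambda p: jaccard(p, target))
    score = jaccard(best_pattern, target)
    if score >= threshold:
        return patterns[best_pattern], score
    return None, score


matched = assign_relation(
    ("ARG1", "be", "now", "located", "in", "ARG2"), PATTERN_DICT
)
unmatched = assign_relation(("ARG1", "visit", "ARG2"), PATTERN_DICT)
```

The threshold keeps a pattern that merely shares the two argument placeholders from being assigned a relation; a real engine would more likely compare structured predicate-argument graphs than flat token sets.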

 

(Test Bed/Core part) To recognize PLOTs in texts, SINDI-CORE performs five consecutive steps: sentence splitting, linguistic analysis (POS tagging and chunking), PLOT candidate recognition, dictionary-based PLOT recognition and, finally, statistical PLOT recognition. Each step produces its own interim data. The test bed not only displays these data intuitively but also provides various parameter adjustment functions, so that users can both monitor errors and tune performance. Figures 3 and 4 show the test bed displaying detailed information about each step of the PLOT recognition process. As seen in the figures, the test bed divides PLOT recognition into nine steps: input, sentence splitting, tokenization, POS tagging, chunking, named-entity recognition, term recognition, pronoun resolution and output.

Note that the individual functions needed to recognize PLOTs can easily be customized on the SINDI engine platform shown in Figure 2 and then applied in the test bed. For example, we can build and export a term-hood computation function that uses Web search to estimate the likelihood that a noun phrase is an actual technological term. Owing to its flexibility and extensibility, the test bed can easily incorporate the newly exported function into its main execution flow.

The upper part of each figure shows the raw data generated in each step, while the lower part displays the document as it is gradually annotated by each sub-process. For example, after POS tagging, the upper window displays the raw output of the POS tagger in the so-called BIO format, and the lower part shows an annotated document in which all noun and verb phrases are highlighted in corresponding colors.

 

 


Figure 3. SINDI-CORE Test Bed (1/2)

Figure 4. SINDI-CORE Test Bed (2/2)
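
The idea of a stepwise process whose interim data can be inspected, as the SINDI-CORE test bed does, can be sketched as a pipeline runner that records the output of every step. This is a simplified illustration, not the engine's implementation: only three toy steps are included, the BIO tags mark matches against a two-entry toy dictionary rather than real chunker output, and all function names are hypothetical.

```python
import re

# Toy entries standing in for dictionary lookups (assumption).
TOY_TERMS = {("text", "mining"): "TERM", ("KISTI",): "ORG"}


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentences):
    """Split each sentence into word and punctuation tokens."""
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]


def bio_tag(token_lists):
    """Tag tokens with B-/I-/O labels for spans found in TOY_TERMS."""
    result = []
    for tokens in token_lists:
        tags = ["O"] * len(tokens)
        for entry, label in TOY_TERMS.items():
            n = len(entry)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == entry:
                    tags[i] = "B-" + label
                    for j in range(i + 1, i + n):
                        tags[j] = "I-" + label
        result.append(list(zip(tokens, tags)))
    return result


def run_pipeline(text, steps):
    """Run steps in order, keeping every interim result for inspection."""
    trace = [("input", text)]
    data = text
    for name, step in steps:
        data = step(data)
        trace.append((name, data))
    return trace


STEPS = [
    ("sentence split", split_sentences),
    ("tokenization", tokenize),
    ("BIO tagging", bio_tag),
]
trace = run_pipeline("KISTI studies text mining. It is fast.", STEPS)
```

Because each step is just a named function in a list, a newly exported function (for example a term-hood scorer) could be added to the flow by appending one more entry to `STEPS`.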

 

(Test Bed/Link part) Figures 5 and 6 show screenshots of the SINDI-LINK test bed. The relation extraction process comprises preprocessing, syntactic analysis, predicate-argument conversion and PAS-based relation extraction. To determine the semantic relation between a pair of PLOTs in a sentence, our system exploits lexico-syntactic patterns. It is therefore essential for the test bed to provide detailed grammatical dependency information, as well as the syntactic structure of an input sentence, so that the behavior of the relation extraction engine can be inspected closely. As seen in Figure 6, the input of the system is a PLOT-tagged document processed by SINDI-CORE. The syntactic structure of the document is shown in the third step in the figure; it is presented not as a conventional parse tree but as a nested POS structure, to make it easier to read. For example, in the nested POS structure on the left of Figure 6, the phrase “the occurrence of” is a noun phrase (NP) comprising a determiner (DT), “the”, a noun (NN), “occurrence”, and a preposition (IN), “of”.

 


Figure 5. SINDI-LINK Test Bed (1/2)

Figure 6. SINDI-LINK Test Bed (2/2)
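
The nested POS structure from the example above can be written down as a small recursive data structure. The sketch below encodes the phrase “the occurrence of” exactly as the text describes it (an NP containing DT, NN and IN) and renders it in bracketed form; the tuple representation and the function names are assumptions made for illustration, not the test bed's internal format.

```python
def render(node):
    """Render a (label, content) node as a bracketed string.

    content is either a word (leaf) or a list of child nodes.
    """
    label, content = node
    if isinstance(content, str):
        return f"({label} {content})"
    return f"({label} " + " ".join(render(child) for child in content) + ")"


def leaves(node):
    """Collect the words of a nested structure from left to right."""
    _, content = node
    if isinstance(content, str):
        return [content]
    words = []
    for child in content:
        words.extend(leaves(child))
    return words


# The phrase from the text: NP -> DT "the", NN "occurrence", IN "of".
noun_phrase = ("NP", [("DT", "the"), ("NN", "occurrence"), ("IN", "of")])
bracketed = render(noun_phrase)
```

Rendering the same structure as nested boxes rather than brackets would give the kind of display the test bed uses in place of a conventional parse tree.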

 

(PLOT Annotation Tool) Figure 7 shows a PLOT annotation tool for constructing training or evaluation collections that can be used to improve or validate the performance of the engines. The tool has various editing functions, each of which enables annotators to mark PLOTs and their relations in texts efficiently. All annotations can be made using mouse actions only, without any keyboard input. For example, given a sentence annotated with two PLOTs, annotators can connect the PLOTs by dragging an arrow from one to the other and then select a relation name from a predefined set of semantic relations. The upper left of the figure is the editing area, where the manual PLOT annotations are made, and the lower part is the control panel, where users can pick the types of PLOTs and relations to use in the current document. Finally, our toolkit provides a visualization and management tool, as seen in Figures 8 and 9. This tool presents the large sets of semantic triples generated by SINDI-CORE and SINDI-LINK (Figure 2) in an easy-to-understand manner.

 


Figure 7. PLOT Annotation Tool

 

(Visualization Tool) The system dynamically generates a graph whose center node is a PLOT that the user has entered as a query. The graph is gradually expanded by retrieving semantically connected PLOTs until the stop button is pushed. Users can also enter other PLOTs during the expansion in order to organize and inspect multiple connected graphs, as can be seen in Figures 8 and 9. The figures show that two PLOTs, “erectile dysfunction” and “diabetes”, are connected to each other indirectly, through a few relations. Every triple, i.e. every pair of directly connected nodes together with its relation edge, is linked to its source text: the bottom of the figure shows the sentences from which a particular triple has been extracted. In addition, a small window at the bottom left of the figure shows statistics on the data currently being handled, such as the number of triples and PLOTs.

 


Figure 8. Visualization and Management Tool (1/2)

Figure 9. Visualization and Management Tool (2/2)
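
The query-centered graph expansion described above can be approximated by a breadth-first traversal over the stored triples. This is a sketch under assumptions: the `max_hops` limit stands in for the user pressing the stop button, the placeholder triples ("term A", "org D" and the relation names) are invented, and the real tool's retrieval strategy is not described in the source.

```python
from collections import deque


def expand(triples, query, max_hops=2):
    """Expand a graph outward from a query PLOT, one hop at a time.

    triples: iterable of (subject, relation, object) tuples.
    Returns (nodes, edges) reachable within max_hops of the query.
    """
    adjacency = {}
    for triple in triples:
        subject, _, obj = triple
        adjacency.setdefault(subject, []).append(triple)
        adjacency.setdefault(obj, []).append(triple)

    nodes = {query}
    edges = []
    queue = deque([(query, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # stop expanding at the hop limit
        for triple in adjacency.get(node, []):
            if triple not in edges:
                edges.append(triple)
            # Follow the edge to the node on its other end.
            other = triple[2] if triple[0] == node else triple[0]
            if other not in nodes:
                nodes.add(other)
                queue.append((other, depth + 1))
    return nodes, edges


# Placeholder triples (hypothetical, for illustration only).
TOY_TRIPLES = [
    ("term A", "related_to", "term B"),
    ("term B", "related_to", "term C"),
    ("term C", "used_by", "org D"),
]
nodes, edges = expand(TOY_TRIPLES, "term A", max_hops=2)
stats = {"plots": len(nodes), "triples": len(edges)}
```

In this toy run "term A" and "term C" end up connected only through "term B", which is the same kind of indirect connection the figures show; the `stats` dictionary corresponds to the counts displayed in the statistics window.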

 

Conclusion and further work

SINDI-WALKS is a workbench for PLOT-based knowledge extraction and management. With the toolset, users can both optimize the performance of knowledge extraction and intuitively manage the semantic triples generated by the core engines. Future work should focus first on enhancing the two test beds with more sophisticated functions, such as dynamic configuration of the user interface according to particular requirements. It is also necessary to evaluate precisely how much our toolset improves the performance of knowledge extraction and management.

 

Key benefits of Semantic Web technology

- The proposed toolset facilitates the performance optimization of information extraction as well as the efficient management of extracted semantic triples. It can therefore provide essential knowledge for analyzing technological trends and activities.

- The system provides a user-friendly interface that presents information at a fine level of granularity. It not only displays the data intuitively but also provides various parameter adjustment functions, so that users can both monitor errors and tune performance. Detailed information for each step of the PLOT recognition process is shown as well.

- The application supplies various editing functions, each of which enables annotators to mark PLOTs and their relations in texts efficiently. Annotators use mouse actions only, without any keyboard input, which can improve both the accuracy and the speed of manual annotation.

 

© Copyright 2010-2012, Korea Institute of Science and Technology Information (KISTI)