Workshop on Patent Corpus Processing: July 12, 9:20-9:50
A Patent Document Retrieval System Addressing Both Semantic and Syntactic Properties
Liang Chen*, Naoyuki Tokuda+, Hisahiro Adachi+
*University of Northern British Columbia, Canada
+Sunflare Company, Japan
Contents
• Background
• DLSI Approach
• Template Structure for Storing Patent Abstracts
• The Flow of the Search Process
Background
• Content-based and form-based approaches
• The content-based approach to IR and document classification represents documents as term vectors.
• It runs into considerable difficulty because the resulting vector space has very high dimensionality.
• The PCA principle plays an important role in dimension reduction for image recognition; it has also been exploited extensively as an effective means of dimension reduction for text document processing, where it is called the LSI (latent semantic indexing) technique.
DLSI Method
• Based on Principal Component Analysis
• LSI is one of the most popular dimension reduction tools.
• Two advantages:
  • LSI finds an optimal solution to the dimension reduction problem.
  • LSI with SVD is also effective in dampening the effects of synonymy and polysemy.
• LSI is further improved into the DLSI (differential latent semantic indexing) method, which improves IR performance provided that each document in the document database has more than one document vector.
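The dimension reduction step behind LSI can be sketched with a truncated SVD. This is a minimal illustration only: the terms, the counts, the choice k = 2, and the helper names `lsi_basis` and `to_lsi` are assumptions made for the example and do not come from the slides.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# Terms and counts are invented for illustration.
terms = ["computer", "machine", "processing", "park", "street"]
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

def lsi_basis(term_doc, k):
    """Return the k leading left singular vectors and singular values."""
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k]

def to_lsi(U_k, vec):
    """Coordinates of a term vector (or matrix of column vectors) in the reduced space."""
    return U_k.T @ vec

U_k, s_k = lsi_basis(A, k=2)
reduced_docs = to_lsi(U_k, A)  # shape (k, number of documents)
```

A query is handled the same way: build its term vector over the same vocabulary and pass it to `to_lsi` before comparing it with the reduced document vectors.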
Fortunately, for Patent Documents:
• All patent documents are very well organized, each with a very precise, human-generated abstract.
• Two document vectors are therefore available: one for the abstract and one for the patent document excluding the abstract.
• Thus, the DLSI method can be used for patent document retrieval.
But DLSI is not enough…
• A content-based information retrieval system built on PCA is not robust enough to be applied directly in a real system.
• So the DLSI method is used only to narrow down the search space, as a first stage of filtering in information retrieval.
• A form-based search strategy is then used to pin down the patent document.
There are some conflicting factors:
• A content-based IR system tries to find documents according to the similarity of "meaning", abstracting away from the exact words used.
• For example, an LSI/DLSI-based system should return similar sets of documents for the queries "information processing devices" and "computing machinery", even though some of the documents returned may contain neither phrase, or even none of these words.
• Form-based systems have to depend on the exact words used; that is, unless a "perfect" thesaurus is used, we may not capture the correct documents. Unfortunately, we know of no such complete thesaurus, and even if one existed, the matching or collating method would still be too costly in computing resources.
Fortunately again, patent documents:
• The abstracts, as well as the patent documents themselves, are always written in lengthy sentences.
• This suggests exploiting the automaton-based template structure, which was originally developed for language tutoring systems.
• The template method is effective for searching by form a set of abstracts consisting of lengthy sentences, from among documents having similar meanings.
• As a result, we use the template structure to represent the abstracts of patent documents, and maintain the template structures in the database.
DLSI Approach
What does DLSI do?
• It improves the performance as well as the robustness of information retrieval.
• It exploits both the distances to, and the projections onto, a reduced document space.
[Figure: a document vector, its distance to the LSI space, and its projection onto that space.]
We can also regard them as two angles.
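The two quantities can be computed as follows. This is a minimal sketch assuming the reduced space is given by a matrix with orthonormal columns; the name `U_k`, the helper `project_and_distance`, and the toy numbers are illustrative and not taken from the slides.

```python
import numpy as np

def project_and_distance(U_k, x):
    """Split x into its projection onto span(U_k) and its distance to that space.

    U_k is assumed to have orthonormal columns, for example the leading
    left singular vectors of a term-document matrix.
    """
    coords = U_k.T @ x                         # coordinates inside the reduced space
    projection = U_k @ coords                  # projected point, back in term space
    distance = np.linalg.norm(x - projection)  # length of the residual
    return coords, projection, distance

# Hypothetical 3-term example: the reduced space is the plane of the first two axes.
U_k = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])
x = np.array([3.0, 4.0, 12.0])

coords, projection, distance = project_and_distance(U_k, x)
```

Because the residual is orthogonal to the reduced space, the squared norm of `x` equals the squared norm of the projection plus the squared distance, so the pair carries more information than the projection alone.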
Technically, …
• Two reduced differential document spaces are used, called differential LSI (or DLSI) spaces.
• I-DLSI space: the reduced space formed from the differences between normalized term-document vectors of the same document.
• E-DLSI space: the reduced space formed from the differences between normalized term-document vectors belonging to different documents.
[Figure: distance and projection with respect to a DLSI space; surviving labels: i, D, distance, projection.]
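The construction of the two differential spaces can be sketched as below, assuming each patent contributes one abstract vector and one body vector. The toy counts, the helper names, the choice k = 2, and the decision to use every cross-document pair for the E-DLSI differences are assumptions for illustration; the slides do not specify which pairs are used.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def differential_matrices(abstract_vecs, body_vecs):
    """Build intra- and extra-document differential vectors as matrix columns."""
    a = [normalize(np.asarray(v, dtype=float)) for v in abstract_vecs]
    b = [normalize(np.asarray(v, dtype=float)) for v in body_vecs]
    # Same document: abstract minus body.
    intra = [a[i] - b[i] for i in range(len(a))]
    # Different documents: abstract of one minus body of another (assumed pairing).
    extra = [a[i] - b[j] for i in range(len(a)) for j in range(len(b)) if i != j]
    return np.column_stack(intra), np.column_stack(extra)

def reduced_space(diff_matrix, k):
    """Orthonormal basis of the k-dimensional reduced space, via SVD."""
    U, _, _ = np.linalg.svd(diff_matrix, full_matrices=False)
    return U[:, :k]

# Invented term counts for three patents over a four-term vocabulary.
abstract_vecs = [[2, 1, 0, 0], [0, 2, 1, 0], [0, 0, 1, 2]]
body_vecs = [[3, 1, 1, 0], [0, 3, 2, 1], [1, 0, 2, 3]]

D_I, D_E = differential_matrices(abstract_vecs, body_vecs)
U_I = reduced_space(D_I, k=2)  # basis of the I-DLSI space
U_E = reduced_space(D_E, k=2)  # basis of the E-DLSI space
```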
Templates
• Each patent document is usually provided with an abstract. The abstract can be used for content-based information retrieval by the DLSI method described above.
• The abstract is also used for form-based information retrieval. But then, what about synonymy?
• Use templates.
• Given the lengthy sentences used in patent documents, including their abstracts, the template structure is an extremely efficient way of expressing lengthy sentences together with their synonymous expressions.
Now, what is a template?
Example: There are beautiful parks in Japan across the nation.
Finally, the template in the database will be modified (by machine) into:
[Figure: modified template]
Not all of the paths are necessarily correct!
What if the query sentence is: There are ugly streets in Japan
• By matching, we can find the path with the highest similarity from among all the possible candidates.
Is the path found satisfactory?
• No!
• So we will have to rule it out, so that we do not end up with a template which includes the above sentence as a path, or as part of a path.
• This mechanism should be established by tracking users' responses.
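The matching and feedback steps above can be sketched as follows. This is a simplified stand-in, not the automaton-based template itself: the template is reduced to a list of slots with interchangeable phrases, the slot contents are invented around the example sentence, the similarity measure (difflib's ratio over words) is an assumed choice, and `record_feedback` is one possible reading of how users' responses keep a rejected sentence out of the template.

```python
from difflib import SequenceMatcher
from itertools import product

# Each slot lists interchangeable phrases (invented for illustration).
template = [
    ["there are"],
    ["beautiful", "lovely"],
    ["parks", "gardens"],
    ["in japan", "across the nation", "in japan across the nation"],
]

def all_paths(template, extra_paths=()):
    """Enumerate every sentence the template generates, plus accepted extras."""
    for choice in product(*template):
        yield " ".join(choice)
    yield from extra_paths

def similarity(a, b):
    """Word-level similarity in [0, 1]; an assumed measure."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def best_path(template, query, extra_paths=()):
    """Return (score, path) for the path most similar to the query."""
    return max((similarity(p, query), p) for p in all_paths(template, extra_paths))

def record_feedback(query, accepted, extra_paths, excluded):
    """Add the query as a new path only if the user accepted the match."""
    sentence = " ".join(query.lower().split())
    if accepted and sentence not in excluded:
        extra_paths.add(sentence)
    elif not accepted:
        excluded.add(sentence)
        extra_paths.discard(sentence)

extra_paths, excluded = set(), set()
query = "There are ugly streets in Japan"
score, path = best_path(template, query, extra_paths)
# The user judges the match unsatisfactory, so the query is kept out of the template.
record_feedback(query, accepted=False, extra_paths=extra_paths, excluded=excluded)
```

Enumerating every path is only practical for a toy template; it is used here to keep the example short.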
The Flow of the Search Process
Before starting the search process, we should set up DLSI for all the patent documents.
1. Locate the query in the DLSI space.
2. Using the DLSI matching algorithm, find the patent documents whose abstracts lie close to the query in the vector space (illustrated by the similar sentences of Figure 1).
3. For each of the abstracts obtained in step 2, use the template matching algorithm to calculate the similarity between the abstract and the query, and select the documents whose abstracts have the highest similarity to the query.
4. Show the result to the user.
5. According to the users' responses, modify the abstracts in the database.
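The two-stage flow can be sketched as below. Both scoring functions are stand-ins chosen to keep the example self-contained: a bag-of-words cosine replaces DLSI matching, and a word-level difflib ratio replaces template matching. The patent identifiers, the abstracts, and the `shortlist_size` parameter are hypothetical, and the feedback step is omitted.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for DLSI matching)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def form_similarity(a, b):
    """Word-order-sensitive similarity (stand-in for template matching)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def search(query, abstracts, shortlist_size=2,
           content_score=cosine, form_score=form_similarity):
    """Narrow the candidates by content, then pick the best match by form."""
    # Content-based stage: keep the abstracts closest to the query.
    shortlist = sorted(abstracts,
                       key=lambda d: content_score(query, abstracts[d]),
                       reverse=True)[:shortlist_size]
    # Form-based stage: choose the shortlisted abstract most similar in form.
    return max(shortlist, key=lambda d: form_score(query, abstracts[d]))

# Hypothetical abstracts keyed by made-up identifiers.
abstracts = {
    "P1": "a device for processing information using a memory unit",
    "P2": "a memory unit for a device processing information",
    "P3": "a method for growing tomato plants",
}
result = search("a device for processing information", abstracts)
```

In this toy data the content stage ranks P2 slightly above P1 because word order is ignored, while the form stage prefers P1, whose wording follows the query; this is the division of labor between the two stages described above.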