LattesMiner: a Multilingual DSL for Information Extraction from Lattes Platform

11th Workshop on Domain-Specific Modeling. Alexandre Donizeti Alves, Horacio Hideki Yanasse, Nei Yoshihiro Soma. October 24, 2011

Introduction Lattes Platform is an information system deployed by CNPq (National Council for Scientific and Technological Development) to manage information on science, technology and innovation related to researchers and institutions in Brazil. This platform is undoubtedly the major source of information available on Brazilian researchers.

Introduction: Lattes Platform http://lattes.cnpq.br

Introduction The Lattes CV system, a curricular information system, is the main component of the platform. Currently, the Lattes CV system stores around 2,000,000 curricula of researchers, lecturers, students and professionals from diverse areas of knowledge.

Introduction: Lattes CV system Jorge Almeida Guimaraes http://buscatextual.cnpq.br/buscatextual

Introduction: Lattes curriculum (English)

Introduction: Lattes curriculum (Portuguese)

Introduction In recent years, many studies have used data extracted from the Lattes Platform about researchers in different areas of knowledge. A common problem in these studies is that the curricula and the extracted information had to be obtained manually.

Introduction Therefore, this system has very high potential for high-quality information extraction.

LattesMiner LattesMiner is an internal multilingual DSL for automatic information extraction from Lattes curricula. It is composed of a set of classes written in Java that allow developers to implement their own applications with a high level of abstraction and expressive power.

LattesMiner The Data Acquisition component is responsible for downloading the Lattes curricula of the researchers from the Lattes CV system on the Web. The Data Visualization component is responsible for the identification and visualization of academic social networks. These networks are identified by checking the relationships between researchers. Data Extraction is the main component of LattesMiner. It is responsible for extracting data from the HTML files, using information extraction based on regular expressions. The Data Analysis component is responsible for the analysis of the extracted data and of the relationships identified. The extracted data can be stored in XML files or in any database using the Data Structure component. Data Discovery is used to find the ID number of a researcher, since usually only the name of the researcher is available.
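The regular-expression approach used by the Data Extraction component can be illustrated with a minimal sketch. The HTML markup, the class name RegexExtractionSketch and the method extractYears are assumptions made for illustration only; they are not taken from LattesMiner or from the actual Lattes page structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LattesMiner: pulls publication years
// out of an HTML fragment with a regular expression.
final class RegexExtractionSketch {

    // Assumed markup: each year sits in <span class="year">YYYY</span>.
    private static final Pattern YEAR =
            Pattern.compile("<span class=\"year\">(\\d{4})</span>");

    // Returns every four-digit year found in the assumed markup, in order.
    static List<Integer> extractYears(String html) {
        List<Integer> years = new ArrayList<>();
        Matcher m = YEAR.matcher(html);
        while (m.find()) {
            years.add(Integer.parseInt(m.group(1)));
        }
        return years;
    }

    public static void main(String[] args) {
        String html = "<li>Paper A <span class=\"year\">2009</span></li>"
                    + "<li>Paper B <span class=\"year\">2010</span></li>";
        System.out.println(extractYears(html)); // prints [2009, 2010]
    }
}
```

A pattern like this is tied to the page markup, which is one reason to hide it behind a higher-level interface.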

LattesMiner The LattesMiner class is composed of instances of the classes Biodata and Board, in addition to many others not presented here. [Class diagram. Packages: lattes.miner, lattes.miner.br, lattes.miner.en, lattes.miner.ie, lattes.miner.dao. Classes: LattesMiner, Perfil, Banca, Biodata, Board, BiodataIE, BoardIE, BiodataDao, BoardDao.]

LattesMiner LattesMiner was created as a fluent interface, which provides a compact yet easy-to-read representation of the domain problem. Fluent interfaces are implemented using method chaining. LattesMiner also makes use of static factory methods and static imports.
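The combination of method chaining and a static factory method can be sketched as follows. The class CvQuery and its method describe are hypothetical and exist only for this illustration, and the ID "0000" is a placeholder. The sketch shows the general style, not the internals of LattesMiner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical query builder illustrating the fluent-interface style;
// the class is illustrative and not taken from LattesMiner.
final class CvQuery {
    private final String id;
    private final List<String> sections = new ArrayList<>();

    private CvQuery(String id) { this.id = id; }

    // Static factory method: with a static import, callers can write load("...").
    static CvQuery load(String id) { return new CvQuery(id); }

    // Each step returns this, so calls can be chained.
    CvQuery biodata() { sections.add("biodata"); return this; }
    CvQuery boards()  { sections.add("boards");  return this; }

    // Terminal step: describes which sections were requested.
    String describe() { return id + ":" + String.join(",", sections); }

    public static void main(String[] args) {
        // "0000" is a placeholder ID.
        System.out.println(load("0000").biodata().boards().describe());
        // prints 0000:biodata,boards
    }
}
```

Because every intermediate method returns the same object, a whole request reads as one sentence-like expression.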

Case Study For the following examples, researchers of the Computer Science area with a CNPq Research Productivity Scholarship were considered. The list contains the names of all the researchers. However, their corresponding ID numbers are not provided. http://plsql1.cnpq.br/divulg/RESULTADO_PQ_102003.curso

Listing 1 Java application code
    import java.util.*;
    import lattes.util.Util;
    import static lattes.miner.LattesMiner.*;
    public class Listing1 {
        public static void main(String[] args) {
            // Look up the ID of each researcher by name and write the IDs to a file.
            List<String> list = new ArrayList<String>();
            for (String name : Util.getList("names.txt"))
                list.add(search(name));
            Util.setList(list, "ids.txt");
        }
    }

Listing 2 Code fragment used to download the Lattes curricula of the researchers.
    // Set the target directory, then download and save each curriculum.
    dir("cvs");
    for (String id : Util.getList("ids.txt"))
        download(id).save();

Listing 3 This listing shows how to extract data from the Lattes curricula of the researchers.
    props("mysql");
    for (String id : Util.getList("ids.txt")) {
        load(id).biodata().address();
        publications(JOURNAL).save();
    }

Listing 4 Code fragment illustrating how LattesMiner is used to extract information in different languages.
    for (String id : Util.getList("ids.txt")) {
        // Portuguese
        for (Banca b : carregar(id).bancas().getBancas()) {
            if (b.ano() == 2010)
                System.out.println(b.aluno());
        }
        // English
        for (Board b : load(id).boards().getBoards()) {
            if (b.year() == 2010)
                System.out.println(b.student());
        }
    }
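One way to offer the same operations under two vocabularies is for one language's classes to delegate to the other's. The slides do not state how LattesMiner shares code between its Portuguese and English packages, so delegation is purely an assumption here. SketchBoard, SketchBanca and MultilingualSketch are hypothetical stand-ins, and "Student A" is made-up sample data.

```java
// Simplified stand-ins for the English and Portuguese vocabularies.
// How LattesMiner itself shares code between languages is not stated in
// the slides; delegation is an assumption made for this sketch.
final class SketchBoard {
    private final int year;
    private final String student;

    SketchBoard(int year, String student) {
        this.year = year;
        this.student = student;
    }

    int year() { return year; }
    String student() { return student; }
}

final class SketchBanca {
    private final SketchBoard board;

    SketchBanca(SketchBoard board) { this.board = board; }

    // Portuguese method names forward to the English implementation.
    int ano() { return board.year(); }
    String aluno() { return board.student(); }
}

final class MultilingualSketch {
    public static void main(String[] args) {
        SketchBoard board = new SketchBoard(2010, "Student A"); // sample data
        SketchBanca banca = new SketchBanca(board);
        System.out.println(board.year() == banca.ano());            // prints true
        System.out.println(board.student().equals(banca.aluno()));  // prints true
    }
}
```

With this arrangement the extraction logic lives in one place, and each additional language only adds thin wrappers.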

Results SUCUPIRA is a system for the identification and visualization of academic social networks. Here it shows the geographical distribution of the five researchers who have published the most articles in scientific journals.

Results This is a contact graph of the five researchers who have published the most articles in scientific journals. The graph depicts an academic social network of these five researchers. Nodes are labeled with the name of the researcher. The color of an edge represents the number of relationships between the researchers it connects.
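Counting relationships per pair of researchers, the quantity that edge color encodes, can be sketched as below. The slides do not say which relationships are counted, so shared publications (co-authorship) are assumed here. The class CoauthorGraphSketch, the method edgeWeights and the researcher names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: edge weights of a co-authorship graph.
// Treating "relationship" as co-authorship is an assumption.
final class CoauthorGraphSketch {

    // For every unordered pair of researchers, counts the publications they
    // share. Keys have the form "A|B" with the two names in sorted order.
    static Map<String, Integer> edgeWeights(List<List<String>> authorLists) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> authors : authorLists) {
            for (int i = 0; i < authors.size(); i++) {
                for (int j = i + 1; j < authors.size(); j++) {
                    String a = authors.get(i);
                    String b = authors.get(j);
                    String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                    weights.merge(key, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up author lists, one per publication.
        List<List<String>> publications = Arrays.asList(
                Arrays.asList("Ana", "Bruno"),
                Arrays.asList("Bruno", "Ana", "Carla"));
        System.out.println(edgeWeights(publications));
        // prints {Ana|Bruno=2, Ana|Carla=1, Bruno|Carla=1}
    }
}
```

A visualization layer could then map each weight to an edge color.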

Conclusions Currently, the Lattes curricula are available in HTML format. LattesMiner, however, does not depend on the data format, because it allows users to program their own applications at a high level of abstraction. If the data format is eventually modified, the DSL interface remains the same.

Conclusions An advantage of LattesMiner is that it searches by the name of the researcher. LattesMiner is multilingual. Another advantage is that the extracted data can be stored in a structured format (XML or database), allowing these data to be easily used by other applications.

Future work The next step, which is already being implemented in the LattesMiner DSL, is a statistical analysis of the data.

ACKNOWLEDGMENTS
