Text analysis method using latent topics for field notes in area studies


Text Analysis Method Using Latent Topics for Field Notes in Area Studies. Taizo Yamada Historiographical Institute, The University of Tokyo. Contribution. Text analysis for Area Studies applying topic model to a field note for Area studies


Text Analysis Method Using Latent Topics for Field Notes in Area Studies

Taizo Yamada

Historiographical Institute, The University of Tokyo

PNC2013


Contribution

  • Text analysis for Area Studies

    • Applying a topic model to a field note in Area Studies

      • We use LDA (latent Dirichlet allocation) as the topic model.

      • Similar fragments or scenes in the field note can be retrieved.

    • Visualization of the relationships between place names

      • The place information has no latitude or longitude.

      • We have no dictionary of place names.

Outline

  • Background, purpose

  • Methodology of text analysis

    • Text structuring

    • Term extraction

    • Characterization of term

    • Method of obtaining similar text fragments

    • Visualization and System

  • Conclusion

Background

  • Recently, Area Studies has made remarkable progress.

    • Researchers in Area Studies can search and analyze large volumes of data easily and quickly,

    • using information technology such as web technology, data analysis, and data engineering.

    • To promote such analysis, researchers have published databases of:

      • catalogues, images, statistical data, spatial data, and temporal data.

  • For further progress in the field,

    • we believe that text analysis is one of the essential elements.

    • A text such as a field note describes sights, scenes, and customs,

    • but its latent topics or subjects can be the key elements that characterize the area.

Purpose

  • A text analysis method for field notes in Area Studies.

    • We prepare a field note database in which the data unit is a description of a sight or a scene.

    • To detect latent topics, we use latent Dirichlet allocation (LDA).

      • LDA is a topic model.

      • In LDA, each text is viewed as a mixture of latent topics, and each topic as a probability distribution over words.

    • We also detect the track (route of travel) of the investigator in the field note.

      • Visualizing the track shows the relations between place names.

Text(1)

  • Target: Koichi Takaya, “The Field note collection 2 Sumatra” (in Japanese)

    • Period: 19 Oct. 1984 to 18 Jan. 1985

    • Coverage: the whole of Sumatra Island

Text structuring (1)

Text structuring (2)

Term extraction(1)

Text (a scene)

マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。

(English: “Mangroves. In the sea in front there are many bagan (platforms for catching fish).”)

  • Morphological analysis

    • MeCab + IPADIC (morphological analyzer + dictionary)

Result of morphological analysis (surface form, then part-of-speech features)

マングローブ  名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。  記号,句点,*,*,*,*,。,。,。
前面  名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の  助詞,連体化,*,*,*,*,の,ノ,ノ
海  名詞,一般,*,*,*,*,海,ウミ,ウミ
に  助詞,格助詞,一般,*,*,*,に,ニ,ニ
は  助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン  名詞,一般,*,*,*,*,*
。  記号,句点,*,*,*,*,。,。,。
魚  名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り  名詞,接尾,一般,*,*,*,取り,トリ,トリ
用  名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の  助詞,連体化,*,*,*,*,の,ノ,ノ
櫓  名詞,一般,*,*,*,*,櫓,ロ,ロ
。  記号,句点,*,*,*,*,。,。,。
いくつ  名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も  助詞,係助詞,*,*,*,*,も,モ,モ
ある  動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。  記号,句点,*,*,*,*,。,。,。
EOS

Legend: “名詞”: noun, “助詞”: postpositional particle, “記号”: symbol, “動詞”: verb

Term extraction(2)

Bag-of-words built from the result of morphological analysis (term:count, with English gloss)

Bakauhumi:1
マングローブ:1 (mangrove)
前面:1 (front)
海:1 (sea)
バガン:1 (bagan)
魚取り用:1 (for catching fish)
櫓:1 (platform)
ココヤシ:1 (coconut palm)
下:1 (below)
家:1 (house)
チョウジ:1 (clove)
斜面:1 (slope)

  • Extraction target: nouns only

  • However, the following noun types are not extracted:

    • pronouns, numbers, ...

  • The number of distinct terms is 5,666.
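The extraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function name `extract_terms` is hypothetical, the input is assumed to be standard MeCab/IPADIC output (surface form, a tab, then comma-separated features), and consecutive nouns are assumed to be joined into one term, which is inferred from the compound 魚取り用 in the bag-of-words on the slide.

```python
from collections import Counter

# MeCab/IPADIC output for the sample scene (surface, tab, features).
MECAB_OUTPUT = """\
マングローブ\t名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。\t記号,句点,*,*,*,*,。,。,。
前面\t名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の\t助詞,連体化,*,*,*,*,の,ノ,ノ
海\t名詞,一般,*,*,*,*,海,ウミ,ウミ
に\t助詞,格助詞,一般,*,*,*,に,ニ,ニ
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン\t名詞,一般,*,*,*,*,*
。\t記号,句点,*,*,*,*,。,。,。
魚\t名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り\t名詞,接尾,一般,*,*,*,取り,トリ,トリ
用\t名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の\t助詞,連体化,*,*,*,*,の,ノ,ノ
櫓\t名詞,一般,*,*,*,*,櫓,ロ,ロ
。\t記号,句点,*,*,*,*,。,。,。
いくつ\t名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も\t助詞,係助詞,*,*,*,*,も,モ,モ
ある\t動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。\t記号,句点,*,*,*,*,。,。,。
EOS
"""

# Noun subtypes that are skipped: pronoun (代名詞) and number (数).
EXCLUDED_SUBTYPES = {"代名詞", "数"}


def extract_terms(mecab_text):
    """Return noun terms, joining runs of consecutive nouns (an assumption)."""
    terms, run = [], []
    for line in mecab_text.splitlines():
        if line == "EOS" or "\t" not in line:
            is_noun = False
        else:
            surface, features = line.split("\t", 1)
            pos = features.split(",")
            is_noun = pos[0] == "名詞" and pos[1] not in EXCLUDED_SUBTYPES
        if is_noun:
            run.append(surface)
        elif run:
            # A non-noun token closes the current run of nouns.
            terms.append("".join(run))
            run = []
    if run:
        terms.append("".join(run))
    return terms


bag_of_words = Counter(extract_terms(MECAB_OUTPUT))
```

For the sample scene this yields the six terms マングローブ, 前面, 海, バガン, 魚取り用 and 櫓, each with count 1; the pronoun いくつ is dropped.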

Term extraction(3)

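720km: Jakarta 出発 [Depart Jakarta.]

830km: Bakauhumi(*1)

①マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。 [Mangroves. In the sea in front there are many bagan (platforms for catching fish).]

 ② ココヤシ多い。この下に少し家ある。 [Many coconut palms. A few houses beneath them.]

 ③ チョウジの多い斜面。 [A slope with many clove trees.]

853km: 稲。今若実り。 [Rice. The grain is just beginning to ripen.]

54km: このあたりよりチョウジ多くなる。その下を時に耕している。トウモロコシを植えるらしい。 [From around here clove trees become more common. The ground beneath them is sometimes tilled. Apparently maize is planted.]

70km: 水田をよく見る。東に海見える。 [Paddy fields are seen often. The sea is visible to the east.]

77-79km: ココヤシが多い。時に水田あり、それ実っている。 [Many coconut palms. Occasional paddy fields, which are ripening.]

85km: ココヤシ園広い。時にチョウジがある。 [Extensive coconut plantations. Occasional clove trees.]

90km: 西海岸に来る。マングローブあるが、その背後にはココヤシ多い。 [We reach the west coast. There are mangroves, but behind them are many coconut palms.]

97km: チョウジが多い。この辺りは殆どがジャワ人だという。 [Many clove trees. Most people around here are said to be Javanese.]

01km: Sidomulyo。周り、シラス台地。 [Sidomulyo. The surroundings are a shirasu (volcanic ash) plateau.]

11km: 5 ~ 10 年生のココヤシ多い。他に、チョウジ、バナナ、ランブータン、ドリアン。 [Many coconut palms 5 to 10 years old. Also clove, banana, rambutan, and durian.]

18km: 左の海にはバガンが100 基ほど見える。 [About 100 bagan are visible in the sea to the left.]

22km: 海岸は広くココヤシ。これ60 年生。高みはチョウジ多い。 [The coast is widely covered with coconut palms. These are 60 years old. Clove trees are common on the higher ground.]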

  • Mark up the extracted terms

    • The terms may characterize the scene in the text.

    • The extracted terms differ from scene to scene.

  • What features do the terms have?

    • We need a method for detecting these features.

    • But we have no thesaurus or dictionary.

  • Therefore, we introduce a topic model to detect them.

    • Using a topic model, we can detect latent topics as the features.

Using topic model(1)

  • We use LDA (latent Dirichlet allocation) as the topic model.

    • Topic model

      • Models the co-occurrence of terms.

      • The result gives a classification of terms.

    • Kinds of topic model

      • LSI (latent semantic indexing): introduces latent topics into the VSM (vector space model).

      • PLSI (probabilistic latent semantic indexing): a reformulation of LSI as a probabilistic model.

      • LDA: an extension of PLSI based on Bayesian learning.

Using topic model(2)

  • LDA: D. M. Blei et al., “Latent Dirichlet Allocation”, 2003.

    • A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.

    • Latent topics can be determined once the parameters of LDA have been estimated.

      • $\alpha$, $\beta$: parameters of LDA

      • $z$: latent topic

      • $\theta$: generating probability of the latent topics

      • $d$: document. $w$: term. $N_d$: the total number of terms in $d$

      • Dir: Dirichlet distribution


Generative process:

θ is generated from α (θ follows Dir(α)).

A topic z_k is generated according to θ.

A term is generated according to the topic z_k and β.

A document is generated as the collection of its terms.
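The generative process can be made concrete with a small simulation. The following is a minimal sketch using only the Python standard library; it samples a toy document rather than estimating anything from the field note. The names `sample_dirichlet` and `generate_document`, the four-word vocabulary, and the values of `alpha` and `beta` are all assumptions made up for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(alpha, beta, vocabulary, n_terms, rng):
    """Generate one document of n_terms terms.

    alpha: Dirichlet parameter, one positive value per topic.
    beta:  beta[k][v] is the probability of vocabulary[v] under topic k.
    """
    theta = sample_dirichlet(alpha, rng)  # theta is generated from alpha
    topics = list(range(len(alpha)))
    document = []
    for _ in range(n_terms):
        z = rng.choices(topics, weights=theta)[0]        # topic according to theta
        w = rng.choices(vocabulary, weights=beta[z])[0]  # term according to topic and beta
        document.append((z, w))
    return theta, document


# Toy, hand-made values (not estimated from the field note).
vocabulary = ["mangrove", "sea", "coconut", "clove"]
alpha = [0.5, 0.5]
beta = [
    [0.45, 0.45, 0.05, 0.05],  # hypothetical "coast" topic
    [0.05, 0.05, 0.45, 0.45],  # hypothetical "plantation" topic
]
rng = random.Random(0)
theta, document = generate_document(alpha, beta, vocabulary, 8, rng)
```

Fitting LDA runs this process in reverse: given only the observed terms, it infers the topic proportions θ for each text and the term distributions for each topic.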

Detection of latent topic

  • Features of LDA

    • Text

      • A set of terms

      • Has multiple topics

    • Term

      • Can belong to multiple topics

      • Is not tied to one specific topic

  • Spatial change (scene change)

    • Visualizing the detection results lets us see the change.

    • Latent topics change as the location changes.

  • Then, which scenes are similar?

Similarity between texts (1)

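  • We introduce the VSM (vector space model).

    • The VSM requires a feature vector for each text.

    • Each element of the vector is the total number of terms in the text that belong to one topic.

    • Similarity between vectors is calculated by cosine similarity:

      • $\mathrm{sim}(x, y) = \dfrac{\sum_t w_{x,t}\, w_{y,t}}{\sqrt{\sum_t w_{x,t}^2}\, \sqrt{\sum_t w_{y,t}^2}}$

      • $x$, $y$: texts (scenes)

      • $w_{x,t}$: the weight of topic $t$ in text $x$

      • $w_{x,t} = tf_{x,t} \cdot \log(N / df_t)$: tf·idf weighting

      • $tf_{x,t}$: the frequency of topic $t$ in text $x$

      • $df_t$: the number of texts which have topic $t$

      • $N$: the number of texts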
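As a rough illustration of this similarity computation, the sketch below weights per-topic term counts by tf·idf and compares two scenes by cosine similarity. It assumes the common idf form log(N / df); the slide does not show which variant was used. The function names and the toy counts are hypothetical.

```python
import math


def tfidf_vectors(topic_counts):
    """topic_counts[x][t] = number of terms in text x that belong to topic t."""
    n_texts = len(topic_counts)
    n_topics = len(topic_counts[0])
    # df[t]: number of texts in which topic t occurs at least once.
    df = [sum(1 for counts in topic_counts if counts[t] > 0)
          for t in range(n_topics)]
    vectors = []
    for counts in topic_counts:
        vectors.append([
            counts[t] * math.log(n_texts / df[t]) if df[t] else 0.0
            for t in range(n_topics)
        ])
    return vectors


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Toy counts for three scenes and three topics (made up for illustration).
topic_counts = [
    [4, 1, 0],
    [3, 2, 0],
    [0, 1, 5],
]
vectors = tfidf_vectors(topic_counts)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
```

With this idf form, a topic that occurs in every text gets weight zero, so in the toy data only the first and third topics contribute: scenes 0 and 1 come out as fully similar and scenes 0 and 2 as unrelated.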

Similarity between texts (2)

Track of investigation (1)

  • Beginning of the text

    • Date: Oct. 19, ‘84

    • “Jakarta よりKotabumiへ行く。” (“Go from Jakarta to Kotabumi.”)

    • The text describes a movement from “Jakarta” to “Kotabumi”.

  • Tracking the movement

    • Extract place names.

    • Rules (○○ is a place name, followed by one of the listed markers):

      • from: ○○[から|より|出発|…]

      • to: ○○[へ|まで|に|泊|…]

    • Unfortunately, we have no dictionaries or gazetteers.

    • For the time being, we simply connect the extracted place names.
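The two rules can be approximated with regular expressions. This is a sketch under assumptions, not the author's code: place names are assumed to be written in Latin letters, as in the sample sentence, and only the markers explicitly listed in the rules are covered (the elided “…” entries are not). The names `FROM_RULE`, `TO_RULE`, `extract_moves` and `connect` are hypothetical.

```python
import re

# Assumption: a place name is a run of Latin letters directly before a marker.
FROM_RULE = re.compile(r"([A-Za-z]+)\s*(?:から|より|出発)")
TO_RULE = re.compile(r"([A-Za-z]+)\s*(?:へ|まで|に|泊)")


def extract_moves(sentence):
    """Return (origins, destinations) found by the from/to rules."""
    origins = FROM_RULE.findall(sentence)
    destinations = TO_RULE.findall(sentence)
    return origins, destinations


def connect(place_names):
    """Link consecutive place names into (source, target) pairs."""
    return list(zip(place_names, place_names[1:]))


origins, destinations = extract_moves("Jakarta よりKotabumiへ行く。")
track = connect(origins + destinations)
```

On the sample sentence this gives one origin (Jakarta), one destination (Kotabumi), and a single link between them. Without a gazetteer, any Latin-letter word followed by a marker such as に would also be picked up, so the output needs manual checking.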

Track of investigation (2)

Force-directed graph of the extracted place names, drawn using D3.js (http://d3js.org/)

[Figure: nodes include Jakarta, Pekanbaru, Tembilahan, Solok and Singapore; time labels run from Oct. ‘84 to Jan. ‘85.]
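A graph like this needs its input as a list of nodes and a list of links. The sketch below shows one way to prepare that data in Python. The nodes/links layout is a common convention for force-directed graphs and is an assumption here, since the slides do not say how the data was passed to D3.js. The function name `to_force_graph` is hypothetical, and the order of visits in `track` is invented for illustration; only the place names come from the figure.

```python
import json


def to_force_graph(track):
    """Convert (source, target) pairs into a nodes/links dictionary."""
    names = []
    for source, target in track:
        for name in (source, target):
            if name not in names:
                names.append(name)  # keep first-seen order, no duplicates
    return {
        "nodes": [{"id": name} for name in names],
        "links": [{"source": s, "target": t} for s, t in track],
    }


# Hypothetical track: the visiting order is NOT taken from the field note.
track = [
    ("Jakarta", "Pekanbaru"),
    ("Pekanbaru", "Tembilahan"),
    ("Tembilahan", "Solok"),
]
graph = to_force_graph(track)
graph_json = json.dumps(graph, ensure_ascii=False, indent=2)
```

Because a force-directed layout positions nodes from the links alone, it suits this data, where the place names have no latitude or longitude.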

Conclusion, Future works

  • We introduced a text analysis method for field notes in Area Studies.

    • Using the LDA topic model

    • Tracking the movement of the investigator

  • Future work

    • Improvement of text analysis for Area Studies.

      • What kind of system do researchers in Area Studies want?

      • We will consider the answer and develop a system accordingly.


Thank you for listening to my presentation.

  • E-mail: [email protected]
