Lost Language Decipherment

Kovid Kapoor – 08005037

Aashimi Bhatia – 08D04008

Ravinder Singh – 08005018

Shaunak Chhaparia – 07005019

Outline
  • Examples of ancient languages that were lost
  • Motivation : Why should we care about such languages?
  • The Manual Process of Decipherment
  • Motivation for a Computational Model
  • A Statistical Method for Decipherment
  • Conclusions
What is a "lost" language?
  • A language is said to be “lost” when modern scholars cannot reconstruct text written in it.
    • Slightly different from a "dead" language – a language that people can still translate to and from, but that no one uses in everyday life anymore.
  • This generally happens when one language is replaced by another.
  • For example, Native American languages were replaced by English, Spanish, etc.
Examples of Lost Languages
  • Egyptian Hieroglyphs
    • A formal writing system used by the ancient Egyptians, combining logographic and alphabetic symbols.
    • Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone.
  • Ugaritic Language
    • Tablets with engravings found in the lost city of Ugarit, Syria.
    • Researchers recognized that it was related to Hebrew and could identify some parallel words.
Examples of Lost Languages (cont.)
  • Indus Script
    • Written in and around present-day Pakistan, around 2500 BC
    • Over 4000 samples of the text have been found.
    • Still not deciphered successfully!
    • What makes it difficult to decipher?

http://en.wikipedia.org/wiki/File:Indus_seal_impression.jpg

Motivation for Decipherment of Lost Languages
  • Historical knowledge expansion
    • Very helpful in learning about the history of the place where the language was written.
    • Alternative sources of information : coins, drawings, buried tombs.
    • These sources are not as precise as reading the region's literature, which gives a much clearer picture.
  • Learning about the past explains the present
    • A lot of the culture of a place is derived from ancient cultures.
    • Boosts our understanding of our own culture.
Motivation for Decipherment of Lost Languages (cont.)
  • From a linguistic point of view
    • We can figure out how certain languages developed over time.
    • The origins of some words can be explained.
The Manual Process
  • Similar to a cryptographic decryption process
  • Frequency-analysis-based techniques are used
  • First step : identify the writing system
    • Logographic, alphabetic, or syllabic?
    • Usually determined by the number of distinct symbols (see the sketch below).
  • Identify whether there is a closely related known language
  • Hope for bitexts : translations of a text in the lost language into a known language, such as Latin or Hebrew.

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
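To make the symbol-counting step concrete, here is a minimal Python sketch of a frequency analysis used to guess the writing system. The symbol-count thresholds are rough, illustrative assumptions rather than figures from the slides.

```python
from collections import Counter

def guess_writing_system(symbols):
    """Guess the writing system from the number of distinct symbols.

    Rough, conventional thresholds (illustrative only): a few dozen symbols
    suggests an alphabet, up to a few hundred a syllabary, and many hundreds
    or more a logographic system.
    """
    counts = Counter(symbols)
    n_distinct = len(counts)
    if n_distinct <= 40:
        system = "alphabetic"
    elif n_distinct <= 400:
        system = "syllabic"
    else:
        system = "logographic"
    return system, counts.most_common(5)

# Example: a transliterated corpus represented as one long symbol sequence
corpus = list("abacadabacabadacaba")
system, top_symbols = guess_writing_system(corpus)
print(system, top_symbols)
```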

Examples of Manual Decipherment : Egyptian Hieroglyphs
  • The earliest attempt was made by Horapollo in the 5th century.
    • However, his explanations were mostly wrong!
    • This proved to be an impediment to the process for 1000 years!
  • Arab historians were able to partly decipher the script in the 9th and 10th centuries.
  • Major breakthrough : the discovery of the Rosetta Stone by Napoleon's troops.

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Examples of Manual Decipherment : Egyptian Hieroglyphs
  • The stone bears a decree issued by the king in three scripts : hieroglyphs, Demotic, and ancient Greek!
  • Finally deciphered in 1822 by Jean-François Champollion.
  • Note that even with the availability of a bitext, full decipherment took about 20 more years!

http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Rosetta_Stone_BW.jpeg/200px-Rosetta_Stone_BW.jpeg

Examples of Manual Decipherment : Ugaritic
  • The inscribed words consisted of only 30 distinct symbols.
    • Very likely to be alphabetic.
  • The location where the tablets were found suggested that the language was closely related to the Semitic languages
  • Some words in Ugaritic had the same origin as words in Hebrew
    • For example, the Ugaritic word for "king" is the same as the Hebrew word.

http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script

Examples of Manual Decipherment : Ugaritic (cont.)
  • Lucky discovery : Hans Bauer assumed that the writing on an axe that had been found was the word "axe"!
  • This led to the revision of some earlier hypotheses and resulted in the decipherment of the entire script!

http://knp.prs.heacademy.ac.uk/images/cuneiformrevealed/scripts/ugaritic.jpg

Conclusions on the Manual Process
  • A very time-consuming exercise; successful decipherment has taken years, even centuries.
  • Even when some basic information about the language is known, such as its syntactic structure or a closely related language, it takes a long time to produce character and word mappings.
Need for a Computerised Model
  • Once some knowledge about the language has been learnt, is it possible to use a program to produce word mappings?
  • Can the knowledge of a closely related language be used to decipher a lost language?
  • If possible, this would save a lot of effort and time.
  • "Successful archaeological decipherment has turned out to require a synthesis of logic and intuition … that computers do not (and presumably cannot) possess." – Andrew Robinson
Recent Attempts : A Statistical Model
  • Notice that manual efforts have some guiding principles
    • A common starting point is to compare letter and word frequencies with a known language
  • Morphological analysis plays a crucial role as well
    • Highly frequent morpheme correspondences can be particularly revealing.
  • The model tries to capture these letter/word level mappings and morpheme correspondences.

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Problem Formulation
  • We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same family.
  • Our primary goals :
    • Finding the mapping between the alphabets of the lost and known languages.
    • Translating words in the lost language into their corresponding cognates in the known language.

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Problem Formulation
  • We make several assumptions in this model :
  • That the writing system is alphabetic in nature
    • Can be easily verified by counting the number of distinct symbols in the recovered texts.
  • That the corpus has been transcribed into an electronic format
    • Means that each character is uniquely identified.
  • About the morphology of the language :
    • Each word consists of a stem, a prefix, and a suffix, where the latter two may be omitted (see the sketch below).
    • This holds true for a large variety of human languages.
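The morphological assumption above can be pictured with a small data structure; a minimal sketch, with a hypothetical example word that is not taken from the paper's data:

```python
from dataclasses import dataclass

@dataclass
class AnalyzedWord:
    """Morphological analysis assumed by the model:
    word = prefix + stem + suffix, where prefix and suffix may be empty."""
    stem: str
    prefix: str = ""
    suffix: str = ""

    def surface(self) -> str:
        # Concatenate the morphemes back into the written word form
        return self.prefix + self.stem + self.suffix

# Hypothetical analyzed word: a stem with a plural-like suffix
word = AnalyzedWord(stem="melek", suffix="im")
print(word.surface())  # melekim
```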
Problem Formulation
  • The morpheme inventories and their frequencies in the known language are given.
  • In essence, the input consists of two parts :
    • A list of unanalyzed words in a lost language
    • A morphologically analyzed lexicon in a known related language
Intuition : A toy example
  • Consider the following example, consisting of words in a lost language closely related to English, but written using numerals.
    • 15234 --- asked
    • 1525 --- asks
    • 4352 --- desk
  • Notice the pair of endings, -34 and -5, with the same initial sequence 152-
    • Might correspond to –ed and –s respectively.
    • Thus, 3=e, 4=d and 5=s
Intuition : A toy example
  • Now, we can say that 435=des, and using our knowledge of English, we can suppose that this word is very likely to be desk.
  • As this example illustrates, we proceed by discovering both character- and morpheme-level mappings (see the sketch below).
  • Another intuition the model should capture is the sparsity of the mapping.
    • The correct mapping will preserve phonetic relations between the two related languages
    • Each character in the unknown language will map to only a small number of characters in the related language.
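A rough Python sketch of the toy example above: grouping lost-language words by a shared initial sequence and aligning the resulting endings with known English suffixes recovers the character guesses 3=e, 4=d, 5=s. The fixed stem length and the suffix pair (-ed, -s) are assumptions taken from this toy example, not part of the actual model.

```python
from collections import defaultdict

# Toy corpus in the "lost" language (digits), from the slides
lost_words = ["15234", "1525", "4352"]
# Known-language suffix pair we hope the endings correspond to (assumed)
known_suffixes = ("ed", "s")

def find_suffix_pairs(words, stem_len=3):
    """Group words by a shared initial sequence to expose candidate suffixes."""
    groups = defaultdict(list)
    for w in words:
        groups[w[:stem_len]].append(w[stem_len:])
    return {stem: ends for stem, ends in groups.items() if len(ends) > 1}

pairs = find_suffix_pairs(lost_words)
print(pairs)  # {'152': ['34', '5']}

# Aligning ('34', '5') with ('ed', 's') yields the character guesses
mapping = {}
for lost_suffix, known_suffix in zip(pairs["152"], known_suffixes):
    for lost_char, known_char in zip(lost_suffix, known_suffix):
        mapping[lost_char] = known_char
print(mapping)  # {'3': 'e', '4': 'd', '5': 's'}
```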
Model Structure
  • We assume that each morpheme in the lost language is probabilistically generated jointly with a latent counterpart in the known language
  • The challenge: Each level of correspondence can completely describe the observed data. So using a mechanism based on one leaves no room for the other.
  • The solution: Using a Dirichlet Process to model probabilities (explained further).

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Model Structure (cont…)
  • There are four basic layers in the generative process
    • Structural Sparsity
    • Character-edit Distribution
    • Morpheme-pair Distributions
    • Word Generation
Model Structure (cont…)

Graphical overview of the Model

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Step 1 : Structural Sparsity
  • We need to control the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse.
  • The set of edit operations includes character substitutions, insertions, and deletions. We assign a variable λe to every edit operation e.
  • The set of character correspondences whose variable is set to 1, { (u,h) : λ(u,h) = 1 }, conveys the set of phonetically valid correspondences.
  • We define a joint prior over these variables to encourage sparse character mappings.
Step 1 : Structural Sparsity (cont.)
  • This prior can be viewed as a distribution over binary matrices and is defined to encourage every row and column to sum to low integer values (typically 1).
  • For a given matrix, define a count c(u) as the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑h λ(u,h).
  • We now define a function fi = max(0, |{u : c(u) = i}| - bi). For any i other than 1, fi should be as low as possible.
  • The probability of the matrix is then given by P(λ) = (1/Z) exp(∑i wi · fi).
Step 1 : Structural Sparsity (cont…)
  • Here Z is the normalization factor and w is the weight vector.
  • Each wi is either zero or negative, to ensure that the probability is high when the values of fi are low.
  • The values of bi and wi can be adjusted depending on the number of characters in the lost language and the related language (see the sketch below).
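A minimal sketch of the sparsity prior described above, computing the unnormalized log-probability ∑i wi · fi for a given binary mapping matrix; the particular values of b and w below are illustrative assumptions.

```python
from collections import Counter

def sparsity_log_prior(lam, b, w):
    """Unnormalized log-prior over a binary mapping matrix lam.

    lam: dict (u, h) -> 0/1 indicator that lost character u may map to h
    b:   dict i -> budget of lost characters allowed to have exactly i mappings
    w:   dict i -> weight (zero or negative) penalizing counts over budget

    Implements, up to the normalizer Z from the slides:
        c(u) = sum_h lam[(u, h)]
        f_i  = max(0, |{u : c(u) = i}| - b_i)
        log P(lam) = sum_i w_i * f_i
    """
    c = Counter()
    for (u, h), indicator in lam.items():
        c[u] += indicator
    count_by_i = Counter(c.values())
    return sum(w.get(i, 0.0) * max(0, count_by_i[i] - b.get(i, 0))
               for i in count_by_i)

# Tiny illustrative example (values of b and w are assumptions):
lam = {("A", "x"): 1, ("A", "y"): 1, ("B", "x"): 1}  # "A" maps to 2 letters
b = {1: 2, 2: 0}       # prefer every character to have exactly one mapping
w = {1: 0.0, 2: -5.0}  # penalize characters with two mappings
print(sparsity_log_prior(lam, b, w))  # -5.0: one character exceeds its budget
```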
Step 2 : Character-Edit Distribution
  • We now draw a base distribution G0 over character edit sequences.
  • The probability P(e) of a given edit sequence depends on the indicator variables λe of its individual edit operations, and on a function of the number of insertions and deletions in the sequence, q(#ins(e), #del(e)).
  • This insertion/deletion factor depends on the average word lengths of the lost language and the related language.
Step 2 : Character-Edit Distribution (cont.)

Example: The average Ugaritic word is 2 letters longer than the average Hebrew word.

Therefore, we set q so as to disallow any deletions and allow at most 1 insertion per sequence, with probability 0.4 (see the sketch below).

  • The part depending on the λe's makes the distribution spike at 0 if the indicator value is 0, and leaves it unconstrained otherwise (a spike-and-slab prior)
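A rough sketch of the two factors named above, the λe indicators and the insertion/deletion penalty q, under the Ugaritic/Hebrew setting from the slides. The 0.6 probability for zero insertions is an assumed complement of the stated 0.4, and the paper's actual base distribution is richer than this.

```python
def edit_sequence_prob(edits, lam, q):
    """Sketch of the base distribution's treatment of one character-edit sequence.

    edits: list of edit operations, e.g. ("sub", u, h), ("ins", h), ("del", u)
    lam:   dict (u, h) -> 0/1 sparsity indicator; a 0 forces probability 0
           (the spike of the spike-and-slab prior)
    q:     function (num_insertions, num_deletions) -> length-difference factor
    """
    n_ins = sum(1 for e in edits if e[0] == "ins")
    n_del = sum(1 for e in edits if e[0] == "del")
    for e in edits:
        if e[0] == "sub" and lam.get((e[1], e[2]), 0) == 0:
            return 0.0  # substitution not licensed by the sparsity layer
    return q(n_ins, n_del)

# Ugaritic/Hebrew-style q from the slides: no deletions, at most one insertion
# per sequence, insertion probability 0.4 (0.6 for none is an assumed complement).
def q(n_ins, n_del):
    if n_del > 0 or n_ins > 1:
        return 0.0
    return 0.4 if n_ins == 1 else 0.6

lam = {("u1", "h1"): 1}
print(edit_sequence_prob([("sub", "u1", "h1"), ("ins", "h2")], lam, q))  # 0.4
```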
Step 3 : Morpheme-Pair Distributions
  • The base distribution G0, along with a fixed parameter α, defines a Dirichlet process, which provides a distribution over morpheme-pair distributions.
  • The resulting distributions are likely to be skewed in favor of a few frequently occurring morpheme-pairs, while remaining sensitive to character-level probabilities of the base distribution.
  • Our model distinguishes between the 3 kinds of morphemes : prefixes, stems, and suffixes. We therefore use different values of α.
Step 3 : Morpheme-Pair Distributions (cont.)
  • Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution Gstm for stems, but maintain separate distributions Gsuf|stm and Gpre|stm for each possible stem part-of-speech (see the sketch below).
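A minimal Chinese-restaurant-process sketch of how such a Dirichlet process can be sampled: frequent morpheme pairs are reused with probability proportional to their counts, while new pairs come from the character-level base distribution G0. The dummy base sampler and the value of α are purely illustrative.

```python
import random
from collections import Counter

def dp_draw(counts, alpha, base_sample):
    """Draw one morpheme pair from a Dirichlet process, Chinese-restaurant style.

    counts:      Counter of morpheme pairs generated so far
    alpha:       concentration parameter (separate values per morpheme type)
    base_sample: function returning a fresh pair from the base distribution G0
    """
    total = sum(counts.values())
    if total > 0 and random.random() < total / (total + alpha):
        # Reuse an existing pair, proportional to its frequency (the skew
        # toward a few frequent morpheme pairs mentioned on the slides).
        pairs, weights = zip(*counts.items())
        return random.choices(pairs, weights=weights)[0]
    # Otherwise draw a new pair, keeping sensitivity to the character-level G0
    return base_sample()

# Toy usage with a dummy base sampler (purely illustrative)
counts = Counter()
base = lambda: (random.choice(["-m", "-t"]), random.choice(["-im", "-ot"]))
for _ in range(20):
    counts[dp_draw(counts, alpha=1.0, base_sample=base)] += 1
print(counts.most_common(3))
```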
Step 4 : Word Generation
  • Once the morpheme-pair distributions have been drawn, actual word pairs may now be generated.
  • Based on some prior, we first decide if a word in the lost language has a cognate in the known language.
  • If it does, a cognate word pair (u, h) is produced by drawing its morphemes from the corresponding morpheme-pair distributions.
  • Otherwise, a lone word u is generated (see the sketch below).
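A small sketch of this last generation step, under assumed samplers: the prior p_cognate and the example morpheme pairs are illustrative placeholders, not values from the paper.

```python
import random

def generate_word(p_cognate, draw_stem_pair, draw_suffix_pair, draw_lone_word):
    """Sketch of the word-generation step described above.

    p_cognate:        prior probability that a lost-language word has a cognate
    draw_stem_pair /
    draw_suffix_pair: samplers for (lost, known) morpheme pairs, standing in
                      for the G_stm and G_suf|stm distributions
    draw_lone_word:   sampler for a lost-language word with no cognate
    """
    if random.random() < p_cognate:
        u_stem, h_stem = draw_stem_pair()
        u_suffix, h_suffix = draw_suffix_pair()
        return (u_stem + u_suffix, h_stem + h_suffix)  # cognate pair (u, h)
    return (draw_lone_word(), None)                    # lone lost-language word

# Illustrative samplers (not the paper's actual distributions)
print(generate_word(
    p_cognate=0.5,
    draw_stem_pair=lambda: ("mlk", "melek"),
    draw_suffix_pair=lambda: ("m", "im"),
    draw_lone_word=lambda: "xyz",
))
```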
Summarizing the Model
  • This model captures both character and lexical level correspondences, while utilizing morphological knowledge of the known language.
  • An additional feature of this multi-layered model structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G0.
  • As a result, any character-level mappings learned from one correspondence will be propagated to other morpheme distributions.
  • Also, the character-level mappings obey sparsity constraints
Results of the Process
  • Applied to the Ugaritic language
  • The undeciphered corpus contains 7,386 unique word types.
  • The Hebrew Bible was used as the known-language corpus; Hebrew is closely related to ancient Ugaritic.
  • Morphological and POS annotations are assumed to be available for the Hebrew lexicon.

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Results of the Process
  • The method identifies Hebrew cognates for 2,155 words, covering almost one-third of the Ugaritic vocabulary.
  • The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and translates only 29% of all cognates.
  • This method correctly translates 60.4% of all cognates.
  • This method yields correct mapping for 29 out of 30 characters.
Future Work
  • Even with character mappings, many words can be correctly translated only by examining their context.
  • The model currently fails to take the contextual information into account.

http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf

Conclusions
  • We saw how language decipherment is an extremely complex task.
  • Years of effort are required for the successful decipherment of each lost language.
  • Success depends on the amount of corpus available in the unknown language.
    • But availability alone does not make it easy.
  • The statistical model has shown promise.
  • It can be developed further and used for more languages.
References
  • Wikipedia article on Decipherment of Hieroglyphs. http://en.wikipedia.org/wiki/Decipherment_of_hieroglyphic_writing
  • Lost Languages: The Enigma of the World's Undeciphered Scripts, Andrew Robinson (2009). http://entertainment.timesonline.co.uk/tol/arts_and_entertainment/books/non-fiction/article5859173.ece
  • A Statistical Model for Lost Language Decipherment, Benjamin Snyder, Regina Barzilay, and Kevin Knight, ACL (2010). http://people.csail.mit.edu/bsnyder/papers/bsnyder_acl2010.pdf
References (cont.)
  • A staff talk from Straight Dope Science Advisory Board – How come we can’t decipher the Indus Script? (2005) http://www.straightdope.com/columns/read/2206/how-come-we-cant-decipher-the-indus-script
  • Wade Davis on Endangered Cultures (2008) http://www.ted.com/talks/wade_davis_on_endangered_cultures.html