Large Scale Machine Translation Architectures


  1. Large Scale Machine Translation Architectures Qin Gao

  2. Outline • Typical Problems in Machine Translation • Programming Model for Machine Translation • MapReduce • Required System Components • Supporting software • Distributed streaming data storage system • Distributed structured data storage system • Integrating – How to make a fully distributed system Qin Gao, LTI, CMU

  3. Why large scale MT • We need more data… • But… Qin Gao, LTI, CMU

  4. Some representative MT problems • Counting events in corpora • N-gram counting • Sorting • Phrase table extraction • Preprocessing data • Parsing, tokenizing, etc. • Iterative optimization • GIZA++ (all EM algorithms) Qin Gao, LTI, CMU

  5. Characteristics of different tasks • Counting events in corpora • Extract knowledge from data • Sorting • Process data; the knowledge is inside the data • Preprocessing data • Process data; requires external knowledge • Iterative optimization • For each iteration, process data using existing knowledge and update the knowledge Qin Gao, LTI, CMU

  6. Components required for large scale MT Data Knowledge Qin Gao, LTI, CMU

  7. Components required for large scale MT Data Knowledge Qin Gao, LTI, CMU

  8. Components required for large scale MT • Data stream • Data processor • Structured knowledge Qin Gao, LTI, CMU

  9. Problem for each component • Stream data: • As the amount of data grows, even a single complete pass over the data becomes impractical • Processor: • A single processor's computation power is not enough • Knowledge: • The size of the table is too large to fit into memory • Cache-based / distributed knowledge bases suffer from low speed Qin Gao, LTI, CMU

  10. Make it simple: what is the underlying problem? • We have a huge cake and we want to cut it into pieces and eat them • Different cases: • We just need to eat the cake • We also want to count how many peanuts are inside the cake • (Sometimes) we have only one fork! Qin Gao, LTI, CMU

  11. Parallelization Data Knowledge Qin Gao, LTI, CMU

  12. Solutions • Large-scale distributed processing • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean, Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113. • Handling huge streaming data • The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003, pp. 20-43. • Handling structured data • Large Language Models in Machine Translation, Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858-867. • Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006, pp. 205-218. Qin Gao, LTI, CMU

  13. MapReduce • MapReduce can refer to • A programming model that deals with massive, unordered, streaming data processing tasks (MUD) • A supporting software environment implemented by Google Inc. • Alternative implementation: • Hadoop, by the Apache Foundation Qin Gao, LTI, CMU

  14. MapReduce programming model • Abstracts the computation into two functions: • Map • Reduce • The user is responsible for implementing the Map and Reduce functions, and the supporting software takes care of executing them Qin Gao, LTI, CMU

  15. Representation of data • The streaming data is abstracted as a sequence of key/value pairs • Example: • (sentence_id : sentence_content) Qin Gao, LTI, CMU

  16. Map function • The Map function takes an input key/value pair and outputs a set of intermediate key/value pairs [Diagram: each input pair (Key : Value) is fed to Map(), which emits one or more intermediate Key : Value pairs] Qin Gao, LTI, CMU

  17. Reduce function • The Reduce function accepts one intermediate key and a set of intermediate values, and produces the result [Diagram: all values sharing an intermediate key (Key1 : Value1, Key1 : Value2, …) are passed to one Reduce() call, which produces the result for that key] Qin Gao, LTI, CMU
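To make the division of labor concrete, here is a toy single-process sketch of the flow the framework implements on a cluster (illustrative Python only; it is not Google's or Hadoop's API): map every record, group the intermediate values by key, then reduce each group.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-process stand-in for a MapReduce run.

    records: iterable of (key, value) pairs
    map_fn(key, value): yields intermediate (key, value) pairs
    reduce_fn(key, values): returns the result for one intermediate key
    """
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)   # the "shuffle": group by key
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}
```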

  18. The architecture of MapReduce [Diagram: Map function → distributed sort → Reduce function] Qin Gao, LTI, CMU

  19. Benefits of MapReduce • Automatic data splitting • Fault tolerance • High-throughput computing that uses the nodes efficiently • Most important: simplicity; you just need to convert your algorithm to the MapReduce model Qin Gao, LTI, CMU

  20. Requirements for expressing an algorithm in MapReduce • Process unordered data • The data must be unordered: no matter in what order the data is processed, the result should be the same • Produce independent intermediate keys • A Reduce function cannot see the values of other keys Qin Gao, LTI, CMU

  21. Example • Distributed Word Count (1) • Input key : word • Input value : 1 • Intermediate key : constant • Intermediate value: 1 • Reduce() : Count all intermediate values • Distributed Word Count (2) • Input key : Document/Sentence ID • Input value : Document/Sentence content • Intermediate key : constant • Intermediate value: number of words in the document/sentence • Reduce() : Count all intermediate values Qin Gao, LTI, CMU

  22. Example 2 • Distributed unigram count • Input key : Document/Sentence ID • Input value : Document/Sentence content • Intermediate key : Word • Intermediate value : Number of occurrences of the word in the document/sentence • Reduce() : Sum all intermediate values Qin Gao, LTI, CMU
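A possible rendering of this unigram-count example, reusing the toy driver above (again, a sketch, not any framework's real API):

```python
from collections import Counter

def unigram_map(sentence_id, sentence):
    # Intermediate key: word; intermediate value: its count in this sentence.
    for word, count in Counter(sentence.split()).items():
        yield (word, count)

def unigram_reduce(word, counts):
    # Sum the per-sentence counts to get the corpus-wide count of this word.
    return sum(counts)

# run_mapreduce([(1, "a b a"), (2, "b c")], unigram_map, unigram_reduce)
# -> {'a': 2, 'b': 2, 'c': 1}
```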

  23. Example 3 • Distributed sort • Input key : Entry key • Input value : Entry content • Intermediate key : Entry key (modification may be needed for ascending/descending order) • Intermediate value : Entry content • Reduce() : Output all the entry content • Makes use of the framework's built-in sorting functionality Qin Gao, LTI, CMU
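A sketch of the sort example in the same style. One caveat: the toy driver above only groups by key, whereas a real MapReduce implementation delivers intermediate keys to the reducers in sorted order, which is exactly the built-in functionality this example exploits.

```python
def sort_map(entry_key, entry_content):
    # Emit the sort key as the intermediate key; transform it here
    # (e.g. negate a numeric key) if descending order is needed.
    yield (entry_key, entry_content)

def sort_reduce(entry_key, contents):
    # Identity reduce: the framework has already sorted the intermediate keys.
    return list(contents)
```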

  24. Supporting MapReduce: Distributed Storage • Reminder of what we are dealing with in MapReduce: • Massive, unordered, streaming data • Motivation: • We need to store large amounts of data • Make use of the storage on all the nodes • Automatic replication • Fault tolerance • Avoid hot spots: a client can read from many servers • Google FS and Hadoop FS (HDFS) Qin Gao, LTI, CMU

  25. Design principles of Google FS • Optimized for a specific workload: • Large streaming reads, small random reads • Large streaming writes, rare modifications • Support for concurrent appending • It effectively assumes data are unordered • High sustained bandwidth is more important than low latency; fast response time is not important • Fault tolerant Qin Gao, LTI, CMU

  26. Google FS Architecture • Optimized for large streaming reads and large, concurrent writes • Small random reads/writes are also supported, but not optimized • Allows appending to existing files • Files are split into chunks and stored on several chunk servers • A master is responsible for storing and answering queries about chunk information Qin Gao, LTI, CMU

  27. Google FS architecture Qin Gao, LTI, CMU

  28. Replication • When a chunk is frequently or "simultaneously" read by clients, the chunk server holding it may become a hot spot or fail • A fault in one chunk server may make the file unusable • Solution: store each chunk on multiple machines • The number of replicas of each chunk is the replication factor Qin Gao, LTI, CMU

  29. HDFS • HDFS shares the design principles of Google FS • Write-once-read-many: a file can only be written once; even appending is not allowed • "Moving computation is cheaper than moving data" Qin Gao, LTI, CMU

  30. Are we done? NO… Problems with the existing architecture Qin Gao, LTI, CMU

  31. We are good at dealing with data • What about knowledge, i.e., structured data? • What if the size of the knowledge is HUGE? Qin Gao, LTI, CMU

  32. A good example: GIZA • A typical EM algorithm [Flowchart: for each sentence, do word alignment and collect counts; when no more sentences remain, normalize the counts; repeat while more iterations remain] Qin Gao, LTI, CMU

  33. When parallelized: it seems to be a perfect MapReduce application [Flowchart: the word-alignment and count-collection loop runs in parallel over corpus chunks on the cluster; count normalization and the iteration loop remain on a single node] Qin Gao, LTI, CMU
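A heavily simplified, Model 1 style sketch of how one EM iteration maps onto this picture (illustrative Python, not GIZA++'s actual code). Note that every map call needs the full lexicon table t_table, which is exactly the problem the next two slides point out.

```python
def em_map(sentence_id, sentence_pair, t_table):
    # E-step for one sentence pair, using the current lexicon probabilities
    # t_table[(src, tgt)]; emits fractional counts as intermediate values.
    src_words, tgt_words = sentence_pair
    for tgt in tgt_words:
        norm = sum(t_table.get((src, tgt), 1e-12) for src in src_words)
        for src in src_words:
            yield ((src, tgt), t_table.get((src, tgt), 1e-12) / norm)

def em_reduce(src_tgt, fractional_counts):
    # Sum the fractional counts over all sentence pairs; renormalizing per
    # source word (the M-step) runs after all reduces have finished.
    return sum(fractional_counts)
```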

  34. However: memory [Diagram: the large parallel corpus is split into chunks; each Map worker aligns its chunk and produces count tables, doing heavy data I/O; Reduce combines them into one count table; renormalization (in memory) produces the statistical lexicon, which is redistributed to every worker for the next iteration] Qin Gao, LTI, CMU

  35. Huge tables • Lexicon probability table: T-table • Up to 3 GB in early stages • As the number of workers increases, they all need to load this 3 GB file! • And all the nodes need to have 3 GB+ of memory – do we need a cluster of supercomputers? Qin Gao, LTI, CMU

  36. Another example: decoding • Consider language models: what can we do if the language model grows to several terabytes? • We need a storage/query mechanism for large, structured data • Considerations: • Distributed storage • Fast access: the network has high latency Qin Gao, LTI, CMU

  37. Google Language Model • Storage: • Central storage or distributed storage • How to deal with latency? • Modify the decoder to collect a number of queries and send them in one batch • It is a specific application; we still need something more general Qin Gao, LTI, CMU
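One way such batching might look from the decoder's side (a sketch; lm_server and its lookup_batch call are hypothetical stand-ins, not Google's actual interface): the decoder queues every n-gram it will need for a group of hypotheses, flushes them in a single round trip, and then reads the scores locally.

```python
class BatchedLMClient:
    """Answer n-gram queries with one network round trip per batch."""

    def __init__(self, lm_server):
        self.lm_server = lm_server   # hypothetical remote LM service
        self.pending = []
        self.cache = {}

    def request(self, ngram):
        # Queue a query; nothing goes over the network yet.
        if ngram not in self.cache:
            self.pending.append(ngram)

    def flush(self):
        # One RPC for the whole batch instead of one per n-gram.
        if self.pending:
            self.cache.update(self.lm_server.lookup_batch(self.pending))
            self.pending = []

    def logprob(self, ngram):
        return self.cache[ngram]
```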

  38. Again, made by Google: Bigtable • It is specially optimized for structured data • Serving many applications now • It is not a complete database • Definition: • A Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map Qin Gao, LTI, CMU

  39. Data model in Bigtable • A four-dimensional table: • Row • Column family • Column • Timestamp Qin Gao, LTI, CMU

  40. Distributed storage unit: Tablet • A tablet consists of a range of rows • Tablets can be stored on different nodes and served by different servers • Concurrently reading multiple rows can therefore be fast Qin Gao, LTI, CMU

  41. Random access unit: Column family • Each tablet is a string-to-string map • (Though not stated explicitly, the API suggests that) at the column-family level the index is loaded into memory, so fast random access is possible • Column families should be fixed in advance Qin Gao, LTI, CMU

  42. Tables inside a table: Column and Timestamp • A column can be any arbitrary string value • A timestamp is an integer • A value is a byte array • So it is effectively a table of tables Qin Gao, LTI, CMU
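The "table of tables" view can be pictured as a nested map. A toy in-memory sketch of the data model (not the Bigtable API):

```python
# row -> column family -> column -> timestamp -> value (bytes)
table = {}

def put(row, family, column, timestamp, value):
    table.setdefault(row, {}).setdefault(family, {}) \
         .setdefault(column, {})[timestamp] = value

def get_latest(row, family, column):
    # Return the value with the newest timestamp for this cell.
    cells = table[row][family][column]
    return cells[max(cells)]
```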

  43. Performance • Number of 1000-byte values read/written per second • What is shocking: • Effective I/O for random reads (from GFS) is more than 100 MB/second • Effective I/O for random reads from memory is more than 3 GB/second Qin Gao, LTI, CMU

  44. An example: Phrase Table • Row: First bigram/trigram of the source phrase • Column family: Length of the source phrase, or some hash of the remaining part of the source phrase • Column: Remaining part of the source phrase • Value: All the phrase pairs for the source phrase Qin Gao, LTI, CMU
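A sketch of how a source phrase could be mapped onto that row / column family / column layout (the key scheme follows the slide; the helper itself and its exact choices are hypothetical):

```python
def phrase_table_key(source_phrase, row_words=2):
    """Map a source phrase to (row, column_family, column)."""
    words = source_phrase.split()
    row = " ".join(words[:row_words])        # first bigram of the phrase
    column_family = "len%d" % len(words)     # e.g. group by phrase length
    column = " ".join(words[row_words:]) or "<empty>"
    return row, column_family, column

# phrase_table_key("the big house") -> ("the big", "len3", "house")
```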

  45. Benefits • Different source phrases come from different servers • The load is balanced, and reading can be concurrent and much faster • Filtering the phrase table before decoding becomes much more efficient Qin Gao, LTI, CMU

  46. Another example: GIZA++ • Lexicon table: • Row: Source word ID • Column family: none • Column: Target word ID • Value: The probability value • With a simple local cache, table loading can be extremely efficient compared to the current implementation Qin Gao, LTI, CMU
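A sketch of the simple local cache mentioned above; remote_get stands in for whatever lookup call the storage client (Bigtable, HyperTable, ...) actually provides. Each worker then only fetches the word pairs that occur in its own corpus chunk, instead of loading the whole multi-gigabyte T-table.

```python
class CachedLexicon:
    """Local cache in front of a remote lexicon table."""

    def __init__(self, remote_get):
        self.remote_get = remote_get   # placeholder for the storage client call
        self.cache = {}

    def prob(self, src_id, tgt_id):
        key = (src_id, tgt_id)
        if key not in self.cache:
            self.cache[key] = self.remote_get(src_id, tgt_id)
        return self.cache[key]
```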

  47. Conclusion • Strangely, this talk is all about how Google does it • A useful framework for distributed MT systems requires three components: • MapReduce software • A distributed streaming data storage system • A distributed structured data storage system Qin Gao, LTI, CMU

  48. Open Source Alternatives • MapReduce library → Hadoop • Google FS → Hadoop FS (HDFS) • Bigtable → HyperTable Qin Gao, LTI, CMU

  49. THANK YOU! Qin Gao, LTI, CMU
