Labs 3: Bi-Grams



  1. Labs 3: Bi-Grams

  2. Step 1: Get Started
     • Login:
       • Username: nombre\cc5212
       • Password on board
     • http://aidanhogan.com/teaching/cc5212-1/mdp-lab3.zip
     • C:/Program Files (x86)/eclipse/ (in Spanish)
     • File > Import > …
     • http://aidanhogan.com/teaching/cc5212-1/ExternalMergeSort.java
     • Only if you weren’t here last week (half marks)
     • Use es-abstracts.txt.gz from last time

  3. Scale!
     … knowing how to build a scalable system over many machines requires knowing how to build a scalable system on one machine first
     • How can we count a large set of bi-grams on one machine?
     • Won’t fit in memory, so what do we do?

  4. Phrasing
     • Bi-grams!
     • Phrase of two adjacent words
     • When we counted words …
     • Counting done in memory
     • Merging done in memory
     • Faster on one machine!
     • More bi-grams than single words!
     • So how can we scale the computation?
     • Won’t fit in memory! (or will it?)
     Tengo a? Tengo de? Tengo que?
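A minimal sketch of what "phrase of two adjacent words" means in code. It assumes words are separated by whitespace and lower-cased, which is a simplification; the lab's own tokeniser (org.mdp.wc.WordParserIterator) may behave differently. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Returns every pair of adjacent words in the line, joined by a single space.
    // Assumption: whitespace tokenisation and lower-casing are good enough.
    public static List<String> bigrams(String line) {
        String[] words = line.trim().toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [tengo que, que estudiar]
        System.out.println(bigrams("Tengo que estudiar"));
    }
}
```

A line of w words yields w - 1 bi-grams, and there are many more distinct word pairs than distinct words, which is why the in-memory approach comes under pressure.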

  5. Step 2: Fix Some Noise …
     org.mdp.wc.WordParserIterator
     loadNext()

  6. Step 2: Extract Bigrams to a File
     • org.mdp.cli.ExtractBigrams
     • Small file for testing: -i [path]\es-abstracts.txt.gz -igz -o [path]\bigrams-10k.txt -n 10000
     • Large file for real run (GZipped): -i [path]\es-abstracts.txt.gz -igz -o [path]\bigrams.txt.gz -ogz

  7. Step 3: Try In-memory Count
     • org.mdp.cli.RunBigramCountInMemory -i [path]\bigrams.txt.gz -igz -k 500
     Will it run for the big file?
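For comparison with the external approach, here is a rough sketch of an in-memory count followed by a top-k selection. It is not the code of RunBigramCountInMemory, whose internals are not shown in the slides; the class and method names are hypothetical. The point is that the map needs one entry per distinct bi-gram, so its memory use grows with the input's vocabulary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryCountSketch {
    // Counts every distinct line in a map held entirely in memory.
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to k entries with the highest counts, highest first.
    public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(k, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(List.of("tengo que", "tengo a", "tengo que"));
        System.out.println(topK(counts, 1));
    }
}
```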

  8. External Merge-Sort 1: Batch
     • Sort in batches
     [Figure: input on-disk (input size: n) → in-memory sort (batch size b) → output batches on-disk (⌈n/b⌉ batches)]
     Input: bigram121 bigram42 bigram732 bigram42 bigram123 bigram149 bigram42 bigram1294 bigram123 bigram42 bigram6 bigram123
     Batch 1: bigram42 bigram42 bigram121 bigram732
     Batch 2: bigram42 bigram123 bigram149 bigram1294
     Batch 3: bigram6 bigram42 bigram123 bigram123

  9. Step 4: Implement Batching
     org.mdp.cli.ExternalMergeSort
     • Implement writeSortedBatches()
     • Load batchSize lines into memory
     • ArrayList<String> list
     • When list.size()==batchSize
     • Dump the data to a batch
     • String batchName = getBatchFileName(tmpFolder, batchId);
     • PrintWriter batch = openBatchFileForWriting(batchName);
     • Clear the list and close the batch file
     • Add the batch-name to batchNames()
     • Do some logging!
     • Forget about reverseOrder for now
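The batching steps above could look roughly like the following standalone sketch. It does not use the lab skeleton: getBatchFileName and openBatchFileForWriting are replaced by a hypothetical dump helper with a made-up file name pattern, and the method signature is an assumption. Note the final partial batch, which is easy to forget.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BatchSketch {
    // Reads lines from in, sorts them batchSize at a time, and writes each
    // sorted batch to its own file in tmpFolder. Returns the batch files.
    public static List<Path> writeSortedBatches(BufferedReader in, Path tmpFolder, int batchSize)
            throws IOException {
        List<Path> batchNames = new ArrayList<>();
        List<String> list = new ArrayList<>(batchSize);
        String line;
        while ((line = in.readLine()) != null) {
            list.add(line);
            if (list.size() == batchSize) {
                batchNames.add(dump(list, tmpFolder, batchNames.size()));
                list.clear();
            }
        }
        if (!list.isEmpty()) { // the last batch is usually smaller than batchSize
            batchNames.add(dump(list, tmpFolder, batchNames.size()));
        }
        return batchNames;
    }

    // Sorts one batch in memory and writes it out (hypothetical file name pattern).
    private static Path dump(List<String> list, Path tmpFolder, int batchId) throws IOException {
        Collections.sort(list);
        Path batch = tmpFolder.resolve("batch-" + batchId + ".txt");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(batch))) {
            for (String s : list) {
                out.println(s);
            }
        }
        System.err.println("Wrote batch " + batchId + " with " + list.size() + " lines");
        return batch;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("batches");
        BufferedReader in = new BufferedReader(new StringReader("c\na\nb\ne\nd\n"));
        System.out.println(writeSortedBatches(in, tmp, 2));
    }
}
```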

  10. Step 5: Implement Merging
      org.mdp.cli.ExternalMergeSort
      • Implement mergeSortedBatches()
      • Open files for reading
      • BufferedReader[] brs = new BufferedReader[batches.size()];
      • Read a line from each file into memory
      • Select the lowest line (from file i), write to out
      • Load the next line from file i
      • Do some logging!
      • Forget about reverseOrder for now
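The merge steps above can be sketched as follows. This is a standalone illustration with an assumed signature, not the lab's mergeSortedBatches(); it takes already-open readers so it can be tried on in-memory strings. Only one line per batch is held in memory at any time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

public class MergeSketch {
    // Merges already-sorted readers into out by repeatedly writing the lowest head line.
    public static void mergeSortedBatches(BufferedReader[] brs, PrintWriter out) throws IOException {
        String[] heads = new String[brs.length];
        for (int i = 0; i < brs.length; i++) {
            heads[i] = brs[i].readLine(); // null means the batch is exhausted
        }
        while (true) {
            int lowest = -1;
            for (int i = 0; i < heads.length; i++) {
                if (heads[i] != null && (lowest < 0 || heads[i].compareTo(heads[lowest]) < 0)) {
                    lowest = i;
                }
            }
            if (lowest < 0) {
                break; // every batch is exhausted
            }
            out.println(heads[lowest]);
            heads[lowest] = brs[lowest].readLine(); // load the next line from file i
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader[] brs = {
            new BufferedReader(new StringReader("a\nc\n")),
            new BufferedReader(new StringReader("b\nd\n"))
        };
        StringWriter sw = new StringWriter();
        mergeSortedBatches(brs, new PrintWriter(sw));
        System.out.print(sw);
    }
}
```

The linear scan costs time proportional to the number of batches for every output line. With many batches, a java.util.PriorityQueue over the head lines would reduce that to logarithmic time per line.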

  11. External Merge-Sort 2: Merge
      [Figure: input batches on-disk (⌈n/b⌉ batches) → in-memory sort → sorted output (output size: n)]
      Batch 1: bigram42 bigram42 bigram121 bigram732
      Batch 2: bigram42 bigram123 bigram149 bigram1294
      Batch 3: bigram6 bigram42 bigram123 bigram123
      Sorted output: bigram6 bigram42 bigram42 bigram42 bigram42 bigram121 bigram123 bigram123 bigram123 bigram149 bigram732 bigram1294

  12. Step 6: Try Sorting 10k Bigrams
      org.mdp.cli.ExternalMergeSort -i [path]\bigrams-10k.txt -o [path]\bigrams-10k-sorted.txt -b 3000
      If successful, try sorting the large file! Use batches of size 250000. (Don’t forget -igz/-ogz)
      If not successful, try debugging. If stuck, ask me.

  13. Counting bigrams is then easy?
      Could use merge-sort again to order by occurrence!
      Sorted input: bigram6 bigram42 bigram42 bigram42 bigram42 bigram121 bigram123 bigram123 bigram123 bigram149 bigram732 bigram1294
      Counts: bigram6, 1; bigram42, 4; bigram121, 1; bigram123, 3; bigram149, 1; bigram732, 1; bigram1294, 1

  14. Step 7: Implement Counting
      org.mdp.cli.CountDuplicates
      • Implement countDuplicates()
      • Store two lines: current and last
      • If current line same as last line, increment counter
      • If current line different from last line, print count and line to a file, reset count
      • Use String sortNum = StringWithNumber.getSortableNumber(dupes);
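A sketch of counting runs of identical lines in sorted input, following the current/last idea above. What StringWithNumber.getSortableNumber actually returns is not shown in the slides, so the pad method below is a hypothetical stand-in that zero-pads the count; the tab-separated output format is likewise an assumption. Padding matters because the later sort compares strings, and as strings "10" sorts before "9".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;

public class CountSketch {
    // Writes "<padded count><TAB><line>" for each run of identical lines in sorted input.
    public static void countDuplicates(BufferedReader in, PrintWriter out) throws IOException {
        String last = in.readLine();
        if (last == null) { // empty input: nothing to count
            out.flush();
            return;
        }
        int dupes = 1;
        String current;
        while ((current = in.readLine()) != null) {
            if (current.equals(last)) {
                dupes++;
            } else {
                out.println(pad(dupes) + "\t" + last);
                last = current;
                dupes = 1;
            }
        }
        out.println(pad(dupes) + "\t" + last); // do not forget the final run
        out.flush();
    }

    // Hypothetical stand-in for a sortable number: zero-padded so that
    // string order matches numeric order.
    public static String pad(int n) {
        return String.format(Locale.ROOT, "%010d", n);
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        countDuplicates(new BufferedReader(new StringReader("a\na\nb\n")), new PrintWriter(sw));
        System.out.print(sw);
    }
}
```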

  15. Step 8: Try Counting 10k Bigrams
      org.mdp.cli.CountDuplicates -i [path]\bigrams-10k-sorted.txt -o [path]\bigrams-10k-counts.txt
      If successful, try counting the large file! (Don’t forget -igz/-ogz)
      If not successful, try debugging. If stuck, ask me.

  16. Step 9: Implement Reverse Order
      org.mdp.cli.ExternalMergeSort
      • In writeSortedBatches() & externalMergeSort()
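One possible way to support reverse order is to choose a comparator once and use it both when sorting each batch and when selecting the next line during the merge, so that the two phases agree. The sketch below shows only the comparator choice and a batch sort; the names are hypothetical and the lab skeleton may handle its reverseOrder flag differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReverseOrderSketch {
    // Picks the comparator once so that batching and merging use the same order.
    public static Comparator<String> comparator(boolean reverseOrder) {
        return reverseOrder ? Comparator.<String>reverseOrder() : Comparator.<String>naturalOrder();
    }

    // Sorts a copy of the batch in the requested order.
    public static List<String> sortBatch(List<String> batch, boolean reverseOrder) {
        List<String> copy = new ArrayList<>(batch);
        copy.sort(comparator(reverseOrder));
        return copy;
    }

    public static void main(String[] args) {
        List<String> batch = List.of("0000000001\tb", "0000000004\ta", "0000000003\tc");
        System.out.println(sortBatch(batch, true));
    }
}
```

In the merge, the same comparator would replace the direct compareTo call, so that the highest line is selected first when reverse order is on.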

  17. Step 10: Merge-Sort the Counts
      org.mdp.cli.ExternalMergeSort -i [path]\bigrams-10k-counts.txt -o [path]\bigrams-10k-counts-sorted.txt -b 3000 -r
      If successful, try sorting the large file! Use batches of size 250000. (Don’t forget -igz/-ogz)
      If not successful, try debugging. If stuck, ask me.

  18. Step 11: Get the top 500
      org.mdp.cli.CopyLinesFromFile -i [path]\bigrams-counts-sorted.txt.gz -igz -o [path]\bigrams-counts-sorted-top500.txt -n 500
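Because the counts file is already sorted in descending order, the top 500 are simply its first 500 lines. The sketch below shows the idea with the standard java.util.zip classes; it is not the code of CopyLinesFromFile. The class name, the helper writeGzLines and the UTF-8 encoding are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TopLinesSketch {
    // Copies at most n lines from a GZipped text file to a plain text file.
    public static void copyLines(Path gzIn, Path txtOut, int n) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new GZIPInputStream(Files.newInputStream(gzIn)), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(txtOut))) {
            String line;
            for (int i = 0; i < n && (line = in.readLine()) != null; i++) {
                out.println(line);
            }
        }
    }

    // Helper for trying the sketch: writes the given lines to a GZipped file.
    public static void writeGzLines(Path gzOut, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(
                 new GZIPOutputStream(Files.newOutputStream(gzOut)), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path gz = Files.createTempFile("sorted", ".txt.gz");
        Path top = Files.createTempFile("top", ".txt");
        writeGzLines(gz, List.of("line 1", "line 2", "line 3")); // placeholder lines
        copyLines(gz, top, 2);
        System.out.println(Files.readAllLines(top));
    }
}
```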

  19. Final Step: Profiling (Optional)
      Java Interactive Profiler
      • Run ExternalMergeSort for a large file
      • Use VM arguments: -javaagent:lib\profile.jar -noverify
      • When finished, check profile.txt in your project’s root directory
      • See if you can optimise something in “Most Expensive Methods”

  20. Final Final Steps
      • Remove tmp/ folder from mdp-lab3/ folder and recycle bin (Shift + Del)
      • I set up tareas.
