CSA405: Unix Trix for Empirical CL

Presentation Transcript


  1. CSA405: Unix Trix for Empirical CL. How to use Unix as a toolbox for NLP applications.

  2. Acknowledgements. The contents of this lecture are inspired by Gerald Gazdar (University of Sussex) and Ken Church (AT&T). Thanks.

  3. Unix Tools
     • grep: search for a pattern
     • sort: sort a file
     • uniq: eliminate adjacent duplicate lines
     • tr: translate characters
     • wc: count lines, words and characters
     • sed: edit strings
     • awk: pattern-based programming language
     • cut: cut out selected fields of each line of a file
     • paste: merge corresponding or subsequent lines of files
     • comm: select or reject lines common to two files
     • join: relational database operator
     • Run "man command" for further details of any of these.

  4. Text: I was intrigued by the article "Cloning a human being `a long way off`" (December 3). I attended the well-presented lecture by Dr Bruce Campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained. This involves the transfer of the nucleus from an adult human mature cell, such as skin, hair or mucosa, into the denucleated human ovum of a female of the species, which is then allowed to start developing for a few days to the stage where the placental precursor cells separate from the cells destined to become the foetus.

  5. Punctuation 1
     • Command: sed -f markpunct.sed
     • Contents of markpunct.sed:
         s/"/ xzzdoublequotezzx /g
         s/'/ xzzquotezzx /g
         s/`/ xzzquotezzx /g
         s/(/ xzzleftparenzzx /g
     • Output: I was intrigued by the article xzzdoublequotezzx Cloning a human being xzzquotezzx

  6. Punctuation 2
     • Command: sed -f angle.sed
     • Contents of angle.sed:
         s/xzz/</g
         s/zzx/>/g
     • Output: I was intrigued by the article <doublequote> Cloning a human being <quote>
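The two punctuation steps can be chained in one pipeline. The sketch below passes the rules from markpunct.sed and angle.sed inline with -e rather than from files; the function names `mark_punct` and `to_angle` and the sample sentence are hypothetical, chosen only for illustration.

```shell
# Hypothetical sample sentence standing in for the slide text.
sample='the article "Cloning a human being" (December 3)'

# Step 1: replace punctuation with placeholder tokens (rules as in markpunct.sed).
mark_punct() {
  sed -e 's/"/ xzzdoublequotezzx /g' \
      -e "s/'/ xzzquotezzx /g" \
      -e 's/`/ xzzquotezzx /g' \
      -e 's/(/ xzzleftparenzzx /g'
}

# Step 2: turn the placeholder delimiters into angle brackets (rules as in angle.sed).
to_angle() {
  sed -e 's/xzz/</g' -e 's/zzx/>/g'
}

printf '%s\n' "$sample" | mark_punct | to_angle
```

Doing the replacement in two passes presumably keeps the first script free of characters such as < and >, so the angle brackets appear only once all punctuation has been named.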

  7. Case
     • Command: tr 'A-Z' 'a-z'
     • Output: i was intrigued by the article "cloning a human being `a long way off`" (december 3). i attended the well-presented lecture by dr bruce campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained.
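A minimal sketch of the case-folding step, assuming the input arrives on standard input. The function name `lowercase` and the sample sentence are hypothetical. Note that tr takes the two character sets as separate arguments.

```shell
# Map every upper-case ASCII letter to its lower-case counterpart.
# tr reads standard input and writes standard output.
lowercase() {
  tr 'A-Z' 'a-z'
}

printf '%s\n' 'I attended the lecture by Dr Bruce Campbell' | lowercase
```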

  8. Tokenisation
     • Command: tr -sc 'a-zA-Z' '\012'
     • Output (one token per line): I, was, intrigued, by, the, article, Cloning, a, ...
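The tokenisation command can be tried on a short string as sketched below. The function name `tokenise` and the sample input are hypothetical; the tr invocation is the one from the slide.

```shell
# -c complements the set, i.e. selects everything that is NOT a letter;
# -s squeezes each run of such characters into a single newline ('\012').
tokenise() {
  tr -sc 'a-zA-Z' '\012'
}

printf '%s\n' 'I was intrigued by the article "Cloning a' | tokenise
```

If the input starts with a non-letter, the output starts with an empty line, which later stages would count as an empty token.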

  9. Sorting
     • Command: tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort
     • Output (one token per line): a, a, a, a, adult, allowed, an, article, as, attended, become, being, bruce, by, by, ...

  10. Making a Wordlist
     • Command: tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq
     • Output (one word per line): a, adult, allowed, an, article, as, attended, become, being, bruce, by, campbell, cell, cells, cloning, ...
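The sort stage matters here because uniq only collapses duplicate lines that are adjacent. A small sketch, using the hypothetical function name `wordlist` and a made-up sample sentence:

```shell
# Lower-case, tokenise, sort so that equal words become neighbours,
# then let uniq keep one copy of each.
wordlist() {
  tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq
}

printf '%s\n' 'The cells and the cell' | wordlist

# Without the sort, the two copies of "b" are not adjacent and both survive.
printf 'b\na\nb\n' | uniq
```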

  11. Counting
     • Command: tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c
     • Output (count and word, one pair per line): 4 a, 1 adult, 1 allowed, 1 an, 1 article, 1 as, 1 attended, 1 become, 1 being, 1 bruce, 2 by, 1 campbell, 1 cell, 3 cells, 2 cloning, ...
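As a sketch of the counting step: uniq -c prefixes each distinct line with the number of times it occurred. The function name `count_words` and the sample sentence are hypothetical.

```shell
# Lower-case, tokenise, sort, then count each run of identical words.
count_words() {
  tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c
}

printf '%s\n' 'by the cells, by the cell' | count_words
```

The counts are padded with leading spaces, and the amount of padding differs between implementations, so downstream tools are safer splitting on whitespace (as awk does by default) than on fixed columns.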

  12. Sorted Frequency List
     • Command: tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r
     • Output (count and word, one pair per line): 13 the, 5 of, 4 human, 4 a, 3 to, 3 cells, 2 was, 2 i, 2 from, 2 for, 2 cloning, 2 by, 1 which, 1 wherein, ...

  13. Sorted Frequency List
     • Command: tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r | cat -n
     • Output (rank, frequency, word):
          1   13   the
          2    5   of
          3    4   human
          4    4   a
          5    3   to
          6    3   cells
          7    2   was
          8    2   i
          9    2   from
         10    2   for
         11    2   cloning
         12    2   by
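A sketch of the ranked frequency list as a reusable function. One deliberate deviation from the slide: it uses sort -nr (numeric, reversed) instead of sort -r, on the assumption that a numeric comparison is safer if the counts from uniq -c are not padded to a common width. The function name `ranked_freq` and the sample input are hypothetical.

```shell
# Count words, order by descending count, and number the lines.
# cat -n supplies the rank column.
ranked_freq() {
  tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -nr | cat -n
}

printf '%s\n' 'a b b c c c' | ranked_freq
```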

  14. Zipf
     • Principle of least effort: people act so as to minimise their probable average rate of work.
     • The speaker's effort is conserved by having a small number of very frequent words, whilst the hearer's effort demands a large number of rare words.
     • Consequence (according to Zipf): a relationship between word frequency and rank.
     • Frequency × Rank = constant
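One way to inspect the frequency × rank relationship on a text is to append the product as an extra column to the ranked frequency list. This is a sketch, not part of the original slides: the function name `zipf_table` is hypothetical, and sort -nr is assumed for a numeric ordering of the counts.

```shell
# Columns, tab-separated: rank, frequency, word, rank * frequency.
# awk's NR is the line number, which equals the rank after sorting.
zipf_table() {
  tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -nr |
    awk '{ printf "%d\t%d\t%s\t%d\n", NR, $1, $2, NR * $1 }'
}

printf '%s\n' 'a b b c c c' | zipf_table
```

On a sample this small the last column will not look constant; the relationship is a claim about large corpora.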

  15. Zipf Curve [Figure: Zipf curve; axes labelled Frequency and Rank]

  16. paste and tail
     • paste: The default operation of paste will concatenate the corresponding lines of the input files. The NEWLINE character of every line except the line from the last input file will be replaced with a TAB character.
     • tail: The tail utility copies the named file to the standard output beginning at a designated place.
     • These two utilities can be used to work with n-grams.

  17. Bigrams
     • Commands:
         tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' > foo
         tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | tail -n +2 > foo1
         paste foo foo1 | sort
     • Output (excerpts, one bigram per line):
         ...
         human being
         human mature
         human ovum
         human stem
         ...
         the article
         the cells
         the cutting
         the denucleated
         the foetus
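The bigram recipe can be wrapped in one function, as in the sketch below. It assumes mktemp is available for the two intermediate files (foo and foo1 on the slide) and tokenises the input only once, deriving the shifted list from the first file. The function name `bigrams` and the sample sentence are hypothetical.

```shell
bigrams() {
  words=$(mktemp)     # one token per line
  shifted=$(mktemp)   # the same list, starting at the second token
  tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' > "$words"
  tail -n +2 "$words" > "$shifted"
  # paste joins line i of the first file with line i of the second,
  # i.e. each word with its successor, separated by a TAB.
  paste "$words" "$shifted" | sort
  rm -f "$words" "$shifted"
}

printf '%s\n' 'the human ovum and the human stem cells' | bigrams
```

The last token has no successor, so it is paired with an empty field. Piping the result through uniq -c would give bigram counts in the same way as for single words.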

  18. grep