Vocabulary Size of Moby Dick

Algorithm

  • Read the text into Python (as a huge string)

  • Split the string into words

  • Remove duplicates

  • Count the size
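The four steps above can be sketched in a few lines of Python. This is a minimal sketch rather than the presenter's code: the function name `vocabulary_size` is made up for illustration, a short sample string stands in for the whole novel, and the built-in `set` is used as a shortcut for the remove-duplicates step.

```python
def vocabulary_size(text):
    """Count the distinct whitespace-separated words in text."""
    words = text.split()        # split the string into words
    unique_words = set(words)   # remove duplicates (a set keeps one copy of each)
    return len(unique_words)    # count the size


# A short sample stands in for the book. The real text would first be read
# from a file into one big string (the file name is up to you).
sample = "the whale and the sea and the ship"
print(vocabulary_size(sample))  # 5
```

Note that this version treats "Whale", "whale" and "whale," as three different words, because nothing has been cleaned up yet.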


Remove Duplicates

  • Read the text into Python (as a huge string)

  • Split the string into words

  • Remove duplicates

    • Sort the word list

    • Create an empty vocab list, and put the first word in the word list into the vocab list

    • For each of the remaining words in the word list

      • If it is the same as the last word in the vocab list, do nothing

      • If it is different from the last word in the vocab list, add it to the vocab list

  • Count the size
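The sort-then-compare steps above might look like this in Python. It is a sketch under the assumption that the words are already in a list; the function name `remove_duplicates` is illustrative. Sorting matters because it places equal words next to each other, so each word only has to be compared with the last entry of the vocab list.

```python
def remove_duplicates(words):
    """Return the distinct words, using the sort-then-compare steps."""
    word_list = sorted(words)       # sort the word list
    if not word_list:               # nothing to do for an empty list
        return []
    vocab = [word_list[0]]          # put the first word into the vocab list
    for word in word_list[1:]:      # each of the remaining words
        if word != vocab[-1]:       # different from the last vocab entry?
            vocab.append(word)      # then it is a new word: add it
    return vocab


words = "the whale and the sea and the ship".split()
vocab = remove_duplicates(words)
print(vocab)       # ['and', 'sea', 'ship', 'the', 'whale']
print(len(vocab))  # 5
```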


Algorithm: Take 2

  • Read the text into Python (as a huge string)

  • Clean up the string

    • Remove punctuation

    • Remove numbers

    • Convert all letters to lowercase

  • Split the string into words

  • Remove duplicates

  • Count the size
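One possible way to do the clean-up step is with `str.translate`, which deletes a chosen set of characters in a single pass. This is a sketch, not the presenter's code: the function names are illustrative, and `string.punctuation` only covers ASCII punctuation, so typographic quotes and dashes in a real e-text would need extra handling.

```python
import string


def clean_text(text):
    """Remove ASCII punctuation and digits, then lowercase the text."""
    # Map every punctuation character and digit to None, i.e. delete it.
    table = str.maketrans("", "", string.punctuation + string.digits)
    return text.translate(table).lower()


def vocabulary_size(text):
    """Count distinct words after cleaning (set used for deduplication)."""
    return len(set(clean_text(text).split()))


sample = "The whale! The WHALE, 3 whales; the sea."
print(clean_text(sample).split())
print(vocabulary_size(sample))  # 4: the, whale, whales, sea
```

Deleting punctuation outright also joins contractions and possessives ("Ahab's" becomes "ahabs"), which may or may not be what you want.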


Now Let's Get Even Fancier

  • What about derived words?

    • seem

    • seems

    • seemed

    • seeming

    • seemingly

  • Removing them is almost like removing duplicates, except…
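To make the idea concrete, here is a deliberately crude sketch that strips a few endings so that the derived forms above collapse into one entry. The suffix list, the minimum stem length and the name `crude_stem` are all assumptions made for this illustration; the slides do not say how derived words should be handled, and real stemming is considerably more involved.

```python
# Hypothetical suffix list, longest first so "ingly" is tried before "ing".
SUFFIXES = ("ingly", "ing", "ed", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["seem", "seems", "seemed", "seeming", "seemingly"]
stems = {crude_stem(w) for w in words}
print(stems)  # {'seem'}
```

A rule this simple also damages words it should leave alone (for example, "this" becomes "thi"), which is one reason the problem is only almost like removing duplicates.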


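Searching Through A List

  • Searching through a list is slow.

  • Proof:

      longList = list(range(1, 20000))   # list() makes a real list in Python 3
      for i in range(1, 20000):
          if 19999 in longList:          # last element: the whole list is scanned
              pass

  • It's slow because Python checks the elements for equality one at a time, starting from the front. So this runs much faster:

      longList = list(range(1, 20000))
      for i in range(1, 20000):
          if 1 in longList:              # first element: found immediately
              pass

  • Can we do better?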
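The difference can be measured with the standard-library `timeit` module. This is a minimal sketch assuming Python 3, where the range has to be turned into a real list for the sequential scan to happen; the variable names and the repeat count of 200 are illustrative.

```python
import timeit

long_list = list(range(1, 20000))

# Run each membership test 200 times and record the total time in seconds.
time_last = timeit.timeit(lambda: 19999 in long_list, number=200)
time_first = timeit.timeit(lambda: 1 in long_list, number=200)

print(f"looking for the last element:  {time_last:.6f} s")
print(f"looking for the first element: {time_first:.6f} s")
```

The exact numbers depend on the machine, but the search for the last element should take far longer, because every one of the 19,999 items is compared before the match is found.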


    How do we look up a word in a Dictionary?


You Can Create A Dictionary In Python

  • It provides

    • Not only a thingy (a technical term), but also another thingy associated with it (e.g., a word (string) and its definition (string))

    • Fast searching, just like looking up words in a dictionary

  • Working with a dictionary is almost like working with a list

  • In an interactive shell with tab completion, you can type a name followed by a dot and press Tab to find out what you can do with a list or dictionary (or almost any other thing)
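As an illustration of how a dictionary could serve the vocabulary problem, the sketch below stores each word as a key and the number of times it occurs as the associated value. The function name `count_words` and the sample sentence are assumptions for this example, not part of the slides.

```python
def count_words(words):
    """Map each distinct word to the number of times it appears."""
    counts = {}
    for word in words:
        # get() returns 0 for a word that has not been seen yet.
        counts[word] = counts.get(word, 0) + 1
    return counts


words = "the whale and the sea and the ship".split()
counts = count_words(words)
print(len(counts))        # 5 distinct words
print(counts["the"])      # 3
print("whale" in counts)  # True, found without scanning every key
```

Because each word can appear as a key only once, the size of the dictionary is the vocabulary size, and no separate remove-duplicates pass is needed.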


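    myList = []                        # empty list
    myList += ['banana']               # add an item (+= 'banana' would add each letter separately)
    myList = ['banana', 'mango']       # create a list with items
    myList.pop(0)                      # remove and return the item at index 0
    myList[0]                          # look up by index; now 'mango'
    myList[0] = 'grape'                # replace the item at index 0

    myDic = {}                         # empty dictionary
    myDic['banana'] = 5                # add a key with its value
    myDic = {'banana': 5, 'mango': 5}  # create a dictionary with items
    myDic.pop('banana')                # remove the key and return its value
    myDic['mango']                     # look up by key; 5
    myDic['mango'] = 10                # replace the value for a key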


Now Let's Get A Bit Technical

  • You have seen a lot of thingies (things that you can assign as a value to a variable). Name them.

  • They are all objects (things that can be used in the name.function() way). Even numbers are objects in Python, although you rarely call functions on them.

  • Lists and dictionaries, being objects themselves, are also data structures (i.e., they organize data). It's like boxes: they are used to organize objects, and they are objects themselves. Moreover, they can be organized using other boxes.
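A short sketch of both points: values of every kind, numbers included, can be used in the name.function() way, and the boxes can be nested inside other boxes. The variable names and contents are made up for illustration.

```python
word = "whale"
print(word.upper())      # WHALE: a string used the name.function() way

number = 5
print(number.bit_length())  # 3: numbers have functions too (5 is 101 in binary)

# A dictionary (box) holding a list and another dictionary (more boxes).
shelf = {
    "fruits": ["banana", "mango"],
    "counts": {"banana": 5, "mango": 5},
}
shelf["fruits"].append("grape")   # change the inner list through the outer box
print(shelf["fruits"])            # ['banana', 'mango', 'grape']
print(shelf["counts"]["mango"])   # 5
```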

