Smart Software with F#

Joel Pobar

Language Geek

http://callvirt.net/blog

Agenda
  • What is it?
  • F# Intro
  • Algorithms:
    • Search
    • Fuzzy Matching
    • Classification (SVM)
    • Recommendations
  • Q&A
All This in 45 mins?
  • This is an awareness session!
    • Lots of content, very broad, very fast
    • You’ll get all demos, pointers, and slide deck to take offline and digest
  • Two takeaways:
    • F# is a great language for data
    • Smart algorithms aren’t hard – use them, explore more!
F# is

...a functional, object-oriented, imperative and explorative programming language for .NET

what is Functional Programming?

http://callvirt.net/jaoo.zip

What is Functional Programming?
  • Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data”
  • -> Emphasizes functions
  • -> Emphasizes shapes of data, rather than implementation
  • -> Modeled on lambda calculus
  • -> Reduced emphasis on imperative
  • -> Safely raises level of abstraction
Motivation for Functional
  • Simplicity in life is good: cheaper, easier, faster, better.
    • We typically achieve simplicity in software in two ways:
      • By raising the level of abstraction (and OO was one design to raise abstraction)
      • Increasing modularity
  • Increasing signal-to-noise is another good strategy:
    • Communicate more in less time with more clarity
  • Better composition and modularity == reuse
Functional Programming: Safer, while still being useful

[Diagram: languages plotted on two axes, Unsafe to Safe and Not Useful to Useful. Labels: C#, C++, …; V.Next#; F#; Haskell]

What is F# for?
  • F# is a General Purpose language
    • Can be used for a broad range of programming tasks
    • Superset of imperative and dynamic features
  • Great for learning FP concepts
  • Some particularly important domains
    • Financial modeling and analysis
    • Data mining
    • Scientific data analysis
    • Domain-specific modeling
    • Academic
Let

Type inference.

The static typing of C# with the succinctness of a scripting language

  • ‘Let’ binds values to identifiers

let helloWorld = "Hello, World"
printfn "%s" helloWorld

let myNum = 12

let myAddFunction x y =
    let sum = x + y
    sum

Tuples
  • Simple, and most useful data structure

let site1 = ("msdn.com", 10)
let site2 = ("abc.net.au", 12)
let site3 = ("news.com.au", 22)

let allSites = (site1, site2, site3)

let fst (a, b) = a
let snd (a, b) = b

Lists, Arrays, Seq and Options
  • Lists & Arrays are first-class citizens
  • Options provide a some-or-nothing capability

let list1 = ["Joel"; "Luke"]
let array = [|2; 3; 5|]
let myseq = seq [0; 1; 2]

let option1 = Some("Joel")
let option2 = None
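A minimal sketch of the some-or-nothing idea: a lookup that may fail returns an option, and the caller matches on both cases. The names `sites`, `tryFindCount` and `describeSite` are made up for this illustration and are not from the talk's demos.

```fsharp
// Hypothetical (site, count) pairs, in the same shape as the tuples slide.
let sites = [ ("msdn.com", 10); ("abc.net.au", 12); ("news.com.au", 22) ]

// Returns Some count when the site is known, None otherwise.
let tryFindCount (name: string) : int option =
    sites
    |> List.tryFind (fun (site, _) -> site = name)
    |> Option.map snd

// The caller handles both cases explicitly.
let describeSite (name: string) : string =
    match tryFindCount name with
    | Some count -> sprintf "%s: %d" name count
    | None -> sprintf "%s: unknown" name

printfn "%s" (describeSite "msdn.com")
printfn "%s" (describeSite "example.org")
```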

Records
  • Simple concrete type definition

type Person =
    { Name: string
      DateOfBirth: System.DateTime }

let n = { Name = "Joel"
          DateOfBirth = System.DateTime(1981, 4, 13) }  // 13/04/81

Immutability (by default)

Data is immutable by default

Values may not be changed
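As a small illustration of what this looks like in practice (the identifiers here are invented for the example), a `let` binding cannot be reassigned, "changing" a list produces a new list, and mutation has to be requested explicitly:

```fsharp
// Values bound with 'let' cannot be reassigned.
let total = 10
// total <- 11        // would not compile: 'total' is not mutable

// "Changing" a list means building a new one; the original is untouched.
let numbers = [1; 2; 3]
let moreNumbers = 0 :: numbers

// Mutation is opt-in via the 'mutable' keyword and the '<-' operator.
let mutable counter = 0
counter <- counter + 1

printfn "%A %A %d" numbers moreNumbers counter
```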

Discriminated Unions
  • Great for representing the structure of data

type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

let me = Car ("Holden", "Barina")
let you = Bicycle

Both of these identifiers are of type “Transport”

Functions
  • Functions: like delegates + unified and simple
  • Deep type inference

(fun x -> x + 1)

let myFunc x = x + 1
// val myFunc : int -> int

let rec factorial n =
    if n > 1 then n * factorial (n - 1)
    else 1

let data = [5; 3; 4; 4; 5]
List.sortWith (fun x y -> x - y) data

Pattern Matching

open System

let (fst, _) = ("first", "second")
Console.WriteLine(fst)

let switchOnType (a: obj) =
    match a with
    | :? Int32 -> printfn "int!"
    | :? Transport -> printfn "Transport"
    | _ -> printfn "Everything Else!"

  • Very important part of F#
  • Helps deal with the ‘teasing apart’ of data
  • Works best with Discriminated Unions & Records
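To make the last bullet concrete, here is a short sketch that matches on the `Transport` union from the earlier slide (redefined here so the snippet stands alone). The function name `describe` is an illustrative choice, not something from the talk.

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// Each case is matched by shape; the Car case also binds its two fields.
let describe (t: Transport) : string =
    match t with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

If a case is left out of the `match`, the compiler emits an incomplete-match warning, which is part of why unions and pattern matching pair so well.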
Search
  • Given a search term and a large document corpus, rank and return a list of the most relevant results…
Search
  • Words
    • Stemming? Tokenize?
      • E.g. ‘Python/Ruby’
  • Markup
    • Title, Author, Date
    • Headings (h1,h2 etc)
    • Paragraphs
  • Links
    • A sign of strength?

Let’s explore something simple…

Search
  • Simplify:
    • For easy machine/language manipulation
    • … and most importantly, easy computation
  • Vectors: nature’s own quality data structure
    • Convenient machine representation (lists/arrays)
    • Lots of existing vector math algorithms

After a loving incubation period, moonlight 2.0 has been released. <a href=“whatever”>source code</a><br><a href=“something else”>FireFox binaries</a> …

Term:   after  incubation  loving  moonlight  firefox  linux  binaries
Count:  2      1           1       6          4        6      2

Term Count
  • Document1: Linux post:

      the  incubation  crazy  moonlight  firefox  linux  penguin
      9    1           1      6          4        6      2

  • Document2: Animal post:

      crazy  the  dog  penguin
      2      2    1    5

  • Vector space:

                the  incubation  crazy  moonlight  firefox  linux  dog  penguin
      Linux:    9    1           1      6          4        6      0    2
      Animal:   2    0           2      0          0        0      1    5


Term Count Issues

            the  incubation  crazy  moonlight  firefox  linux  dog  penguin
  Linux:    9    1           1      6          4        6      0    2
  Animal:   2    0           2      0          0        0      1    5

  • ‘the dog penguin’
    • Linux: 9+0+2 = 11
    • Animal: 2+1+5 = 8
  • ‘the’ is overweighted
  • Enter TF-IDF: Term Frequency Inverse Document Frequency
    • A weight to evaluate how important a word is to a document in a corpus
      • i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
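The raw term-count scoring on this slide can be sketched as follows. The two vectors and the vocabulary order are taken from the slide; the function name `score` and the whitespace tokenisation are assumptions made for the example.

```fsharp
// Vocabulary order shared by both document vectors (from the slide).
let vocabulary =
    [| "the"; "incubation"; "crazy"; "moonlight"; "firefox"; "linux"; "dog"; "penguin" |]

let linuxPost  = [| 9; 1; 1; 6; 4; 6; 0; 2 |]
let animalPost = [| 2; 0; 2; 0; 0; 0; 1; 5 |]

// Score = sum of the document's counts for each query term.
// Query terms that are not in the vocabulary contribute nothing.
let score (doc: int[]) (query: string) : int =
    query.Split(' ')
    |> Array.sumBy (fun term ->
        match Array.tryFindIndex (fun v -> v = term) vocabulary with
        | Some i -> doc.[i]
        | None -> 0)

printfn "Linux: %d" (score linuxPost "the dog penguin")    // 9 + 0 + 2
printfn "Animal: %d" (score animalPost "the dog penguin")  // 2 + 1 + 5
```

The Linux post wins the query even though it never mentions a dog, purely because ‘the’ occurs nine times in it.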

TF-IDF
  • Normalise the term count:
    • tf = termCount / docWordCount
  • Measure importance of term
    • idf = log ( |D| / termDocumentCount)
      • where |D| is the total documents in the corpus
  • tfidf = tf * idf
    • A high weight is reached by high term frequency, and a low document frequency
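The three formulas translate almost directly into F#. This is a sketch under stated assumptions: the slide does not say which logarithm base is used, so the natural logarithm is assumed, and the corpus numbers (100 documents, ‘the’ in 98 of them, ‘moonlight’ in 2) are invented for illustration. The term counts 9 and 6 and the document length 29 come from the Linux post vector.

```fsharp
// tf = termCount / docWordCount
let tf (termCount: int) (docWordCount: int) : float =
    float termCount / float docWordCount

// idf = log (|D| / termDocumentCount); natural log is an assumption here.
let idf (totalDocs: int) (termDocumentCount: int) : float =
    log (float totalDocs / float termDocumentCount)

// tfidf = tf * idf
let tfidf termCount docWordCount totalDocs termDocumentCount =
    tf termCount docWordCount * idf totalDocs termDocumentCount

// Hypothetical corpus of 100 documents. In the 29-word Linux post,
// 'the' occurs 9 times and 'moonlight' 6 times.
// Assume 'the' appears in 98 documents and 'moonlight' in only 2.
let theWeight = tfidf 9 29 100 98
let moonlightWeight = tfidf 6 29 100 2

printfn "the: %.4f, moonlight: %.4f" theWeight moonlightWeight
```

Under these assumed numbers ‘moonlight’ ends up weighted far above ‘the’, despite the lower raw count.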
Fuzzy Matching
  • String similarity algorithms:
    • SoundEx; Metaphone
    • Jaro Winkler Distance; Cosine similarity; Sellers; Euclidean distance; …
    • We’ll look at Levenshtein Distance algorithm
  • Defined as: the minimum number of edit operations that transform string1 into string2
Fuzzy Matching
  • Edit costs:
    • In-place copy – cost 0
    • Delete a character in string1 – cost 1
    • Insert a character in string2 – cost 1
    • Substitute a character for another – cost 1
  • Transform ‘kitten’ into ‘sitting’
    • kitten -> sitten (cost 1 – replace k with s)
    • sitten -> sittin (cost 1 - replace e with i)
    • sittin -> sitting (cost 1 – add g)
  • Levenshtein distance: 3
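One common way to compute this is the textbook dynamic-programming table, sketched below with the edit costs listed above. This is not necessarily the implementation used in the talk's demos; the function and variable names are illustrative.

```fsharp
// Levenshtein distance via the standard dynamic-programming table.
// d.[i, j] holds the distance between the first i characters of s
// and the first j characters of t.
let levenshtein (s: string) (t: string) : int =
    let m, n = s.Length, t.Length
    let d : int[,] = Array2D.zeroCreate (m + 1) (n + 1)
    for i in 0 .. m do d.[i, 0] <- i      // delete all i characters
    for j in 0 .. n do d.[0, j] <- j      // insert all j characters
    for i in 1 .. m do
        for j in 1 .. n do
            // In-place copy costs 0, substitution costs 1.
            let cost = if s.[i - 1] = t.[j - 1] then 0 else 1
            let delete = d.[i - 1, j] + 1
            let insert = d.[i, j - 1] + 1
            let copyOrSubstitute = d.[i - 1, j - 1] + cost
            d.[i, j] <- min delete (min insert copyOrSubstitute)
    d.[m, n]

printfn "%d" (levenshtein "kitten" "sitting")  // 3, as on the slide
```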
Fuzzy Matching
  • Estimated string similarity computation costs:
    • Hard on the GC (lots of temporary strings created and thrown away); use arrays if possible.
    • Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shorter string, and ‘k’ is the maximum distance.
    • Parallelisable – split the set of words to compare across n cores.
    • Can do approximately 10,000 compares per second on a standard single core laptop.
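Two of these points can be sketched together: a two-row variant that works only on integer arrays (so no temporary strings are created during a comparison), and `Array.Parallel.map` from the F# core library to spread comparisons across cores. The name `rankByDistance` is invented for this example, no timing claims are implied, and this is the plain O(length1 × length2) variant rather than the bounded O(kl) one.

```fsharp
// Two-row Levenshtein: only two int arrays are kept alive per comparison.
let levenshtein (s: string) (t: string) : int =
    let n = t.Length
    // Row 0: distance from the empty string to each prefix of t.
    let mutable previous : int[] = Array.init (n + 1) id
    let mutable current : int[] = Array.zeroCreate (n + 1)
    for i in 1 .. s.Length do
        current.[0] <- i
        for j in 1 .. n do
            let cost = if s.[i - 1] = t.[j - 1] then 0 else 1
            current.[j] <-
                min (previous.[j] + 1) (min (current.[j - 1] + 1) (previous.[j - 1] + cost))
        // Swap the rows so 'previous' holds the row just computed.
        let tmp = previous
        previous <- current
        current <- tmp
    previous.[n]

// Compare one query against many candidates in parallel,
// then order the candidates by increasing distance.
let rankByDistance (query: string) (candidates: string[]) : (string * int)[] =
    candidates
    |> Array.Parallel.map (fun word -> word, levenshtein query word)
    |> Array.sortBy snd

printfn "%A" (rankByDistance "kitten" [| "sitting"; "kitten"; "mitten" |])
```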
Classification
  • Support Vector Machines (SVM)
    • Supervised learning for binary classification
    • Training Inputs: ‘in’ and ‘out’ vectors.
    • SVM will then find a separating ‘hyperplane’ in an n-dimensional space
  • Training is costly, but classification is cheap
  • Can retrain on the fly in some cases
SVM Issues
  • Classification on 2 dimensions is easy, but most input is multi-dimensional
  • Some ‘tricks’ are needed to transform the input data
F# and Algorithms: Netflix Demo
  • Netflix Prize - $1 million USD
    • Must beat Netflix prediction algorithm by 10%
    • 480k users
    • 100 million ratings
    • 18,000 movies
  • Great example of deriving value out of large datasets
  • Earns Netflix loads and loads of $$$!
Nearest Neighbour Algorithm: Find all my neighbours’ movies
  • Find the best movies my neighbours agree on
A Short Stop-over at Vector Math

A (x1,y1)

B (x2,y2)

C (x0,y0)

If we want to calculate the distance between A and B, we call on Euclidean Distance

We can represent the points in the same way using vectors: magnitude and direction.

Having this vector representation allows us to work in ‘n’ dimensions, yet still perform Euclidean distance and angle calculations.
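As a rough sketch of those calculations, with vectors represented as float arrays of any length: the function names below are illustrative, and cosine similarity is used as the angle measure (it is listed among the similarity algorithms earlier in the deck).

```fsharp
// Dot product of two vectors of equal length.
let dot (a: float[]) (b: float[]) : float =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

// Magnitude (length) of a vector.
let magnitude (a: float[]) : float = sqrt (dot a a)

// Euclidean distance: magnitude of the difference vector.
let euclideanDistance (a: float[]) (b: float[]) : float =
    Array.map2 (fun x y -> x - y) a b |> magnitude

// Cosine of the angle between two vectors: 1.0 means same direction,
// 0.0 means perpendicular. Undefined for zero-length vectors.
let cosineSimilarity (a: float[]) (b: float[]) : float =
    dot a b / (magnitude a * magnitude b)

printfn "%f" (euclideanDistance [| 0.0; 0.0 |] [| 3.0; 4.0 |])
printfn "%f" (cosineSimilarity [| 1.0; 0.0 |] [| 0.0; 1.0 |])
```

The same functions work unchanged on the eight-dimensional term-count vectors from the search slides, or on per-user rating vectors when looking for nearest neighbours.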

Q & A
  • Any questions?
  • http://callvirt.net/
  • joelpobar@gmail.com
  • THANKS!