
Building an Identity Extraction Engine

The key to building customized experiences for your users is understanding who those users are and what they are interested in. The traditional way to learn this, a profile system, has one major problem: the content is entirely user-curated, so users can enter whatever they want and be whoever they want. While this lets people portray themselves as they wish to the outside world, it makes the profile an unreliable identity source because it is based on perceived identity. In this session we take a practical look at constructing an identity entity extraction engine in PHP that works from web sources. The result is a highly personalized, automated identity mechanism that can drive customized experiences for users based on their derived personalities. We will explore concepts such as:

- Building a categorization profile of a user's interests from the web sources that user interacts with.
- Using weighting mechanisms, like the Open Graph Protocol, to drive higher levels of entity relevance.
- Creating personality overlays between multiple users to surface new content sources.
- Dealing with users who are unknown to you by combining identity data capture with HTML5 storage mechanisms.

jcleblanc

Presentation Transcript


  1. Premise
     - You can determine the personality profile of a person based on their browsing habits

  2. Technology was the Solution!

  3. Then I Read This…
     - "Us & Them: The Science of Identity" by David Berreby

  4. The Different States of Knowledge
     - What a person knows
     - What a person knows they don’t know
     - What a person doesn’t know they don’t know

  5. Technology was NOT the Solution
     - Identity and discovery are NOT a technology solution

  6. Our Subject Material

  7. Our Subject Material
     - HTML content is unstructured
     - You can’t trust that anything semantically valid will be present
     - There are some pretty bad web practices on the interwebz

  8. How We’ll Capture This Data
     - Start with base linguistics
     - Extend with available extras

  9. The Basic Pieces
     - Keywords: Without all the fluff
     - Weighting: Word diets FTW
     - Page Data: Scrapey Scrapey

  10. Capture Raw Page Data
      - Semantic data on the web is sucktastic
      - Assume 5 year olds built the sites
      - Language is the key

  11. Extract Keywords
      - We now have a big jumble of words. Let’s extract
      - Why is “and” a top word?
      - Stop words = sad panda

  12. Weight Keywords
      - All content is not created equal
      - Meta and headers and semantics oh my!
      - This is where we leech off the work of others

  13. Questions to Keep in Mind
      - Should I use regex to parse web content?
      - How do users interact with page content?
      - What key identifiers can be monitored to detect interest?

  14. Fetching the Data: The Request
      - The Simple Way

            $html = file_get_contents('URL');

      - The Controlled Way

            $c = curl_init('URL');

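  15. Fetching the Data: cURL

          //$url and $header are assumed to be set earlier;
          //$header is a boolean (true also returns the response headers)
          $req = curl_init($url);
          $options = array(
              CURLOPT_URL => $url,
              CURLOPT_HEADER => $header,
              CURLOPT_RETURNTRANSFER => true,  //return the body instead of printing it
              CURLOPT_FOLLOWLOCATION => true,  //follow redirects...
              CURLOPT_AUTOREFERER => true,     //...and set the Referer header when doing so
              CURLOPT_TIMEOUT => 15,           //seconds
              CURLOPT_MAXREDIRS => 10
          );
          curl_setopt_array($req, $options);

          //not shown on the slide: run the request (returns false on failure)
          $page_content = curl_exec($req);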

  16.

          //list of findable / replaceable string characters
          $find = array('/\r/', '/\n/', '/\s\s+/');
          $replace = array(' ', ' ', ' ');

          //perform page content modification
          $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
          $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
          $mod_content = strip_tags($mod_content);
          $mod_content = strtolower($mod_content);
          $mod_content = preg_replace($find, $replace, $mod_content);
          $mod_content = trim($mod_content);
          $mod_content = explode(' ', $mod_content);
          natcasesort($mod_content);

  17.

          //set up list of stop words and the final found stopped list
          $common_words = array('a', ..., 'zero'); //full stop word list elided on the slide
          $searched_words = array();

          //extract list of keywords with number of occurrences
          foreach($mod_content as $word) {
              $word = trim($word);
              if(strlen($word) > 2 && !in_array($word, $common_words)){
                  //start from 0 on first sight to avoid an undefined-key warning
                  $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
              }
          }

          //most frequent keywords first
          arsort($searched_words, SORT_NUMERIC);

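  18. Scraping Site Meta Data

          //load scraped page data as a valid DOM document
          //(@ suppresses the warnings raised by invalid markup)
          $dom = new DOMDocument();
          @$dom->loadHTML($page_content);

          //scrape title, falling back to an empty string when the page has none
          $titles = $dom->getElementsByTagName("title");
          $title = $titles->length > 0 ? $titles->item(0)->nodeValue : '';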

  19.

          //loop through all found meta tags
          $dataReturn = array();
          $metas = $dom->getElementsByTagName("meta");
          for ($i = 0; $i < $metas->length; $i++){
              $meta = $metas->item($i);
              if($meta->getAttribute("property")){
                  //Open Graph tags use the "property" attribute
                  if ($meta->getAttribute("property") == "og:description"){
                      $dataReturn["description"] = $meta->getAttribute("content");
                  }
              } else {
                  //standard meta tags use the "name" attribute
                  if($meta->getAttribute("name") == "description"){
                      $dataReturn["description"] = $meta->getAttribute("content");
                  } else if($meta->getAttribute("name") == "keywords"){
                      $dataReturn["keywords"] = $meta->getAttribute("content");
                  }
              }
          }

  20. Weighting Important Data
      - Tags you should care about: meta (including OG), title, description, h1+, header
      - Bonus points for adding in content location modifiers
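One way to gather the text of those tags before weighting is sketched below. The function name `collect_weighted_sources` and the bucket layout are assumptions made for illustration (the bucket names mirror the keys of the `$weights` array on the following slide); the deck itself does not show this step.

```php
<?php
// Hypothetical helper, not from the slides: collect the text of the tags worth
// weighting, grouped into "keywords", "meta", "header1" and "header2" buckets.
function collect_weighted_sources(string $page_content): array
{
    $sources = array('keywords' => '', 'meta' => '', 'header1' => '', 'header2' => '');

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser errors instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($page_content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        $property = strtolower($meta->getAttribute('property'));
        $content = trim($meta->getAttribute('content'));
        if ($name === 'keywords') {
            $sources['keywords'] .= ' ' . $content;
        } elseif ($name === 'description' || strpos($property, 'og:') === 0) {
            // Assumption: every Open Graph tag is treated as ordinary "meta" text.
            $sources['meta'] .= ' ' . $content;
        }
    }

    foreach (array('h1' => 'header1', 'h2' => 'header2') as $tag => $bucket) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $sources[$bucket] .= ' ' . trim($node->textContent);
        }
    }

    return array_map('trim', $sources);
}
```

Each bucket comes back as a plain string, so it can be run through the same cleaning and stop word steps as the page body.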

  21. Weighting Important Tags

          //our keyword weights
          $weights = array("keywords" => "3.0",
                           "meta"     => "2.0",
                           "header1"  => "1.5",
                           "header2"  => "1.2");

          //add modifier here
          if(strlen($word) > 2 && !in_array($word, $common_words)){
              //start from 0 on first sight to avoid an undefined-key warning
              $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
          }
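The slide leaves the modifier as a placeholder. A minimal sketch of one way to apply it follows: each word adds the weight of the source it was found in rather than a flat 1. The helper name `weight_keywords`, the per-source input layout, and the `"body" => 1.0` entry for unweighted page text are assumptions, not part of the original deck.

```php
<?php
// Weights from the slide; "body" => 1.0 is an assumption for plain page text.
$weights = array(
    'keywords' => 3.0,
    'meta'     => 2.0,
    'header1'  => 1.5,
    'header2'  => 1.2,
    'body'     => 1.0,
);

// Hypothetical helper: $sources maps a source name to the raw text found there.
function weight_keywords(array $sources, array $weights, array $common_words): array
{
    $scores = array();
    foreach ($sources as $source => $text) {
        $weight = $weights[$source] ?? 1.0; // unknown sources count as plain text
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            if (strlen($word) > 2 && !in_array($word, $common_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    arsort($scores, SORT_NUMERIC); // highest score first
    return $scores;
}
```

With this shape, a word that appears once in the meta keywords and once in an h1 outranks a word repeated a few times in body text, which is the point of weighting.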

  22. Expanding to Phrases
      - 2-3 adjacent words, making up a direct relevant callout
      - Seems easy right? Just like single words
      - Language gets wonky without stop words
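A minimal sketch of phrase extraction, under two assumptions not stated in the deck: the word list is still in page order (taken before the `natcasesort()` call in the cleaning step, which destroys adjacency), and stop words are kept inside a phrase but a phrase may not start or end with one. The function name `extract_phrases` is hypothetical.

```php
<?php
// Hypothetical helper: count 2-3 word phrases from words given in page order.
function extract_phrases(array $words, array $common_words, int $min = 2, int $max = 3): array
{
    $words = array_values($words); // make sure indexes are 0..n-1
    $count = count($words);
    $phrases = array();

    for ($i = 0; $i < $count; $i++) {
        for ($n = $min; $n <= $max && $i + $n <= $count; $n++) {
            $slice = array_slice($words, $i, $n);
            // Skip phrases that begin or end with a stop word ("of the", "and then").
            if (in_array($slice[0], $common_words, true)
                || in_array($slice[$n - 1], $common_words, true)) {
                continue;
            }
            $phrase = implode(' ', $slice);
            $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
        }
    }

    arsort($phrases, SORT_NUMERIC); // most frequent phrase first
    return $phrases;
}
```

Keeping interior stop words is what lets a phrase like "king of pop" survive intact instead of collapsing into "king pop".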

  23. Working with Unknown Users
      - The majority of users won’t be immediately targetable
      - Use HTML5 LocalStorage & Cookie backup

  24. Adding in Time Interactions
      - Interaction with a site does not necessarily mean interest in it
      - Time needs to also include an interaction component
      - Gift buying seasons see interest variations
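The deck gives no formula for this, so the following is only one possible sketch. It assumes a visit counts toward interest only when the user actually interacted with the page (clicks, scrolls), caps time on page at 300 seconds, and halves the score every 30 days so that seasonal interests such as gift shopping fade. The function name, the cap, and the half-life are all made-up illustrative values.

```php
<?php
// Hypothetical scoring function: returns a value between 0.0 and 1.0 for one page visit.
function visit_interest(
    int $seconds_on_page,
    int $interactions,
    int $visited_at,          // Unix timestamp of the visit
    int $now,                 // Unix timestamp of "today"
    int $half_life_days = 30  // assumption: interest halves every 30 days
): float {
    // Time alone is not interest: an idle tab with no interaction scores nothing.
    if ($interactions < 1) {
        return 0.0;
    }

    // Cap the time component so one very long visit cannot dominate (assumed cap: 300s).
    $engagement = min($seconds_on_page, 300) / 300;

    // Exponential decay by the age of the visit.
    $age_days = max(0, $now - $visited_at) / 86400;
    $decay = pow(0.5, $age_days / $half_life_days);

    return $engagement * $decay;
}
```

The returned value could be multiplied into the keyword scores of the visited page, so recent, actively used pages shape the profile more than stale or idle ones.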

  25. Grouping Using Commonality (Venn diagram: the overlap of "Interests User A" and "Interests User B" forms "Common Interests")
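The overlap in that diagram can be sketched with plain array operations, assuming each user profile is an array of keyword => score like the ones built earlier. The three function names are hypothetical, and using the smaller of the two scores for a shared interest is an illustrative choice rather than the author's stated method.

```php
<?php
// Interests both users share, scored by the weaker of the two scores.
function common_interests(array $a, array $b): array
{
    $common = array();
    foreach (array_intersect_key($a, $b) as $keyword => $score) {
        $common[$keyword] = min($score, $b[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Share of all interests that both users have (Jaccard index, 0.0 to 1.0).
function interest_overlap(array $a, array $b): float
{
    $union = count($a + $b); // array union keeps each keyword once
    return $union === 0 ? 0.0 : count(array_intersect_key($a, $b)) / $union;
}

// Interests user B has that user A does not: candidates for new content for A.
function suggest_from(array $a, array $b): array
{
    $new = array_diff_key($b, $a);
    arsort($new, SORT_NUMERIC);
    return $new;
}
```

When two users overlap strongly, the interests one of them has and the other lacks are a natural source of new content to surface, which is the personality overlay idea from the session description.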
