Intelligent Web Agent

Intelligent Web Agent • Under the guidance of Dr. S. B. Nair • Presented by Sandeep Kumar (001127), Dept. of CSE, IIT Guwahati

Goals • To develop an intelligent web agent that learns about the user’s behavior without any user intervention and uses the information gathered to help him find pages of interest by re-ranking Google results according to his interests.

Intelligent Web Agent • Agent – something that perceives its environment through its sensors and acts upon it through its effectors. • Web Agent – an agent whose environment is the World Wide Web. • Intelligent Web Agent – a rational web agent, i.e. one that, when given a choice, makes the decision that leads toward its goal. • ‘Personalized’ Intelligent Web Agent (PIWA) – learns user preferences and behavior over a length of time and exhibits ample intelligence in its decisions.

Problem • Explosive growth of the internet has made it the largest knowledge repository mankind has ever had. • It needs to be used efficiently – providing one with the information one seeks. • Searching for information is a problem. • The structure of the internet, a massive mess of hyperlinks pointing to HTML pages, makes searching difficult. • Search engines, though excellent in their working, fail to satisfy a user’s needs: they return millions of results, most of which are unwanted.

How can we help the user • A web-based search engine lacks information about the user. • Search engines rely on algorithms that need more information than the search query carries – a typical user provides just one or two words. • We can have a software agent on the user’s desktop. • It has more access to the user’s browsing behavior. • It can learn from it. • Using its knowledge base, it can help the user find information matching his interests.

Profiling • Static – profiles are built beforehand, like templates. • Dynamic – profiles are generated dynamically by learning from the user’s behavior. • Need • An intelligent search layer over a general-purpose search engine. • Real-time adaptive learning from monitoring the user’s habits, with no relevance feedback from the user. • The profile must change as the user’s interests change.

The Algorithm • Salient Features: • Representation of a user’s interest as a group of words. • Generation of the group of words by unobtrusive monitoring of the user’s browsing habits, with no relevance feedback from the user. • Dynamic membership of the group of words representing a particular interest. • Using the group of words to improve web query generation. • Using the group of words to rank results from a general-purpose search engine.

Interest • Basic knowledge block of the user profile. • Represented by a group of 10 words, each having an associated weight, plus a timestamp. • The weight represents the importance of the word in that particular interest. • E.g.: • said 16, yesterday 15, city 15, sadr 14, iraq 12, holy 12, news 12, east 11, talks 11, tension 11 • This captures the user’s interest in what Sadr said yesterday in a holy city of Iraq, and his talks about tension and the east.

Generation of Interest • Key Point – unobtrusive, user friendly

Implementation • The agent acts as a proxy for the user’s browser. • Passive monitoring of incoming traffic. • The 10 words are extracted from the very pages the user browses through. • Extracting the top words from an HTML document: • Get the page. • Do feature extraction. • Do stop-word removal. • Do stemming.

Generation of Interest • From HTML pages browsed by the user. • Feature extraction done using the latest features of HTMLEditorKit (available in JDK SE > 1.4). • HTML tags are given weights, e.g. title 10, meta-names 6, block-quote 4, boldface and underline 2, font sizes, etc. • Plain content is given weight 1 (similar to term frequency (tf)). • Weights are summed up for each word. • Commonly used words are removed by stop-word elimination, along with words of length <= 2. • The top 10 words are selected. • Morphological analysis is not done, as many words (such as yahoo) do not occur in a dictionary and the process is still not very efficient.
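The weighting scheme on this slide can be sketched in a few lines. The slides use Java's HTMLEditorKit; the sketch below swaps in Python's standard `html.parser` purely for illustration. The stop list, the choice of which meta names to read (`keywords`, `description`), and the rule that nested tags take the highest applicable weight are assumptions, not details from the slides.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Weights quoted on the slide; any other text counts with weight 1.
TAG_WEIGHTS = {"title": 10, "blockquote": 4, "b": 2, "u": 2}
META_WEIGHT = 6
# Tiny illustrative stop list (assumption: the real list is not given).
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "from", "was", "are"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"meta", "br", "img", "hr", "link", "input"}
SKIPPED = {"script", "style"}


class WeightedWordExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # Assumption: only these two meta names are scored.
            if attrs.get("name") in ("keywords", "description"):
                self._add(attrs.get("content") or "", META_WEIGHT)
        elif tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if SKIPPED.intersection(self.open_tags):
            return
        # Assumption: nested tags use the highest applicable weight.
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        self._add(data, weight)

    def _add(self, text, weight):
        for word in re.findall(r"[a-z]+", text.lower()):
            # Stop-word elimination plus removal of words of length <= 2.
            if len(word) > 2 and word not in STOP_WORDS:
                self.scores[word] += weight


def top_words(html, n=10):
    """Return the n highest-weighted (word, weight) pairs of a page."""
    parser = WeightedWordExtractor()
    parser.feed(html)
    parser.close()
    return parser.scores.most_common(n)
```

Calling `top_words(page_html)` on a fetched page yields the keyword list that the later slides use to build and match interests.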

Creation of Profile • Get 10 keywords from each page visited. • 2 possible cases: • Current page (keywords) matches/is similar to a past interest -> that interest is updated. • Current page (keywords) is new -> a new interest is created. • A match means 3 or more words (>= 30%) are common to the keywords of the current page and a past interest. • Interest update: sum up the weights of the matched words and keep the top 10 from the merged list.
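The match-or-create rule above can be sketched as follows. Interests are modelled as plain `word -> weight` dictionaries, and the first interest that reaches the threshold is the one updated; the slides do not say what happens when several interests match, so that choice, like the function names, is an assumption.

```python
MATCH_THRESHOLD = 3   # 3 of 10 words, i.e. >= 30%
INTEREST_SIZE = 10


def merge(interest, keywords):
    """Sum weights of shared words, then keep the top 10 of the merged list."""
    merged = dict(interest)
    for word, weight in keywords.items():
        merged[word] = merged.get(word, 0) + weight
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:INTEREST_SIZE]
    return dict(top)


def update_profile(profile, keywords):
    """Update the first matching interest or append a new one.

    profile is a list of word->weight dicts; returns the index touched.
    """
    for i, interest in enumerate(profile):
        if len(interest.keys() & keywords.keys()) >= MATCH_THRESHOLD:
            profile[i] = merge(interest, keywords)
            return i
    profile.append(dict(keywords))
    return len(profile) - 1
```

Each visited page contributes one call to `update_profile`, so the profile grows only when a page shares fewer than three keywords with every stored interest.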

Maintenance of Profile • An optimum size is needed: too big a list will contain erroneous interests and cause performance problems, while too small a list may not cover all of the user’s interests – at present it is set at 20 interests. • When an interest is created or updated, its timestamp is updated (the timestamp is stored with an 11th word, the marker “1234567890”). • The product of the timestamp and the sum of weights of an interest is used to determine which interests remain in the list and which one is removed.
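A minimal sketch of this eviction rule, assuming the timestamp is a plain number (for example seconds since the epoch) and that the interest with the smallest product is the one removed. The slides give the product but not the direction of the comparison, so that is an assumption, as are the class and function names.

```python
from dataclasses import dataclass

MAX_INTERESTS = 20  # profile size quoted on the slide


@dataclass
class Interest:
    words: dict        # word -> weight
    timestamp: float   # refreshed whenever the interest is created or updated

    def retention_score(self):
        # Product of the timestamp and the sum of weights, as on the slide.
        return self.timestamp * sum(self.words.values())


def add_interest(profile, interest, max_size=MAX_INTERESTS):
    """Append an interest and evict the weakest ones beyond max_size."""
    profile.append(interest)
    while len(profile) > max_size:
        # Assumption: the lowest product is the one dropped.
        profile.remove(min(profile, key=Interest.retention_score))
    return profile
```

Because the timestamp multiplies the weight sum, an old interest survives only if it is much heavier than the recent ones, which is what lets the profile follow changes in the user's interests.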

Use of Profile for web searches • Direct searching: the user provides a search query. • 3 cases: • Query matches one interest. • Query matches more than one interest. • Query does not match any interest. • No match: simple Google search. • More than one match: for each matched interest, sum up the weights of the query words in it and select the interest for which the sum is maximum. • So now we have one matched interest.
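The three cases collapse into one loop, sketched below. It assumes that a query "matches" an interest when at least one query word appears in it; the slides do not define the match test for queries, and the function name is hypothetical.

```python
def match_interest(query, profile):
    """Return the interest with the largest summed weight of query words.

    profile is a list of word->weight dicts. Returns None when no interest
    contains any query word, in which case a plain Google search is used.
    """
    words = query.lower().split()
    best, best_score = None, 0
    for interest in profile:
        score = sum(interest.get(word, 0) for word in words)
        if score > best_score:
            best, best_score = interest, score
    return best
```

With a single match the loop simply returns that interest; with several, the one where the query words weigh most wins.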

Trigger Pair Model • Trigger Pair – take some words from the matched interest whose weights are less than the smallest weight of any word of the query. • This prevents the original query from being overshadowed by more popular/heavily weighted words. • For the same reason, 1 word is added to a single-word query and 2 words to a query of two or more words. • The Trigger Pair refines results from Google to a great extent.
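A sketch of the query expansion described above. The slides say which words are eligible (weight below the lightest query word) and how many are added, but not which eligible words are picked; taking the heaviest eligible ones is an assumption, and the weights in the usage note are made up for illustration.

```python
def expand_query(query, interest):
    """Append trigger words from the matched interest to the query.

    interest is a word->weight dict. Only words lighter than the lightest
    query word are eligible, so the original query is not overshadowed.
    """
    words = query.lower().split()
    query_weights = [interest[w] for w in words if w in interest]
    if not query_weights:
        return query
    floor = min(query_weights)
    candidates = [(w, wt) for w, wt in interest.items()
                  if wt < floor and w not in words]
    # Assumption: among eligible words, prefer the heaviest.
    candidates.sort(key=lambda kv: (-kv[1], kv[0]))
    count = 1 if len(words) == 1 else 2
    return " ".join([query] + [w for w, _ in candidates[:count]])
```

For instance, with a hypothetical interest `{"iraq": 42, "fallujah": 30, "marines": 25}`, the query `iraq` becomes `iraq fallujah`, which mirrors the behaviour reported in the results slides.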

Ranking of results to user’s interest • Get the top 20 results from Google. • Get the top 10 keywords for each result. • Score each page by summing the products of the weights of the words common to the matched interest and the page keywords. • Get the top 10 pages based on the score – 1st Result. • Take the arithmetic mean of the Google rank and the rank given by the algorithm, and get the top 10 pages – 2nd Result.
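Both result lists can be sketched together. The input is assumed to be a list of `(url, keywords)` pairs in Google's order, where `keywords` is the page's word->weight dict; ties are broken in favour of Google's order, which the slides do not specify.

```python
def piwa_score(interest, page_keywords):
    """Sum of weight products over words shared by interest and page."""
    shared = interest.keys() & page_keywords.keys()
    return sum(interest[w] * page_keywords[w] for w in shared)


def rank_results(interest, results, top_n=10):
    """Return (piwa_order, mixed_order) as lists of URLs.

    results: list of (url, page_keywords) in Google's order.
    """
    google_rank = {url: i + 1 for i, (url, _) in enumerate(results)}
    # sorted() is stable, so equal scores keep Google's order (assumption).
    by_score = sorted(results, key=lambda r: -piwa_score(interest, r[1]))
    piwa_rank = {url: i + 1 for i, (url, _) in enumerate(by_score)}
    piwa_order = [url for url, _ in by_score][:top_n]
    # Mixed list: arithmetic mean of the Google rank and the PIWA rank.
    mixed_order = sorted(
        google_rank, key=lambda u: (google_rank[u] + piwa_rank[u]) / 2
    )[:top_n]
    return piwa_order, mixed_order
```

The first list is the "1st Result" of the slide and the second is the "2nd Result".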

Agent Architecture

Implementation Summary • Keyword: String word, Int val • Interest: 10 keywords + 1 marker with timestamp • Profile: 20 interests • Google element: String title, String snippet, String url
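The data model in this summary might look like the sketch below. The slides describe Java-style fields; the Python dataclasses, the `make_interest` helper, and the choice to store the timestamp in the marker keyword's `val` field are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

MARKER = "1234567890"   # marker word quoted on the maintenance slide
WORDS_PER_INTEREST = 10
INTERESTS_PER_PROFILE = 20


@dataclass
class Keyword:
    word: str
    val: int


@dataclass
class GoogleElement:
    title: str
    snippet: str
    url: str


@dataclass
class Interest:
    keywords: List[Keyword]   # up to 10 weighted words
    marker: Keyword           # Keyword(MARKER, timestamp), the 11th entry

    @property
    def timestamp(self):
        return self.marker.val


def make_interest(weights, timestamp):
    """Build an Interest from a word->weight dict and a numeric timestamp."""
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    top = top[:WORDS_PER_INTEREST]
    return Interest([Keyword(w, v) for w, v in top], Keyword(MARKER, timestamp))
```

A profile is then simply a list of at most `INTERESTS_PER_PROFILE` such interests.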

Results: 1. IRAQ • 10 pages of news on Iraq dated April 14th 2004. • 7 interests formed, 3 pages merged.

Interests

Points to note • 1st interest: lexington and concorde came in because they appeared in advertisements on the first page – the parser is not designed to ignore advertisements and treats them as part of the page. • Ignoring them is possible if the structure of the page is known beforehand, but not in the present case. • The 2nd and 10th pages merged (see the timestamps), as both are from Reuters.

Search: IRAQ • Search query: iraq • Matched interest: last one as weight of iraq is maximum in it – 42 • Trigger Pair causes Fallujah to be appended.

Comparison

Points to note • The 1st result of Google is about Fallujah and has fewer elements of fighting in it – hence it ranks a poor 10th in the PIWA ranking. • The 7th result of Google, which is basically a discussion board on the Iraq war with lots of discussion of “iraq, fallujah, marines, coalition, Baghdad” (words in the matched interest), ranks first in the PIWA ranking. • Mixed results: the page ranked 1st is ranked 2nd by Google and 5th by PIWA.

2. Sandeep • 3 pages: 2 homepages and 1 resume • 3 interests formed

Search: Sandeep • Search query: sandeep • Kumar appended

Points to note • The page from which the matched interest was derived is the 10th result of Google, but it ranks 2nd in the PIWA ranking and hence 5th in the mixed results. • A page visited in the past affects the results greatly. • Sandeep is a very general term, so the results are still not much inclined in my favour.

Search: Sandeep Kumar • Appended by “2004 iitg”

3. India Pakistan Series • 5 news pages dated 14th April 2004

Search: India Pakistan • Appended by “test series”

Points to note • A normal Google search without the trigger pair returned results on war – not wanted by the user. • Google's 7th, 8th, 9th and 10th results don’t make it to the top 10 of PIWA – this shows ranking differences based on the user’s interests.

Mixed Results • In classical AI search terminology • Google:- explore strategy – get new results • PIWA:- exploit strategy – use past information to decide new ranks • Mixed:- a 50-50 mix of both, can be changed to explore more initially and then exploit more as in any other learning process

Conclusion • A profile for a user was generated with absolutely no user relevance feedback. • Dynamic profile maintenance – continuously updated with new information. • The profile is used to improve the user’s web searches to suit his interests.

Future Work • Improve the GUI, especially for the search utility. • Support for plugins to handle non-HTML documents. • Support for encoded pages: SSL, gzipped, etc. • A news reader can be made easily with an improved parser that knows the structure of news pages beforehand.

Thank you
