Implementing Neural Networks for Text Classification: Data Sets

Prerak Sanghvi

Computer Science and Engineering Department

State University of New York at Buffalo

Data Set Selection
  • There are two types of Data Sets that can be used:
    • A collection of documents compiled manually (from the web, etc.) specifically for this project
    • An existing data set that other researchers have already worked on
Advantages of Standard Data Sets
  • We don’t have to spend effort obtaining the data.
  • The distribution of documents in the corpora used is even. Further, the documents are well classified.
  • Results can be compared with those of other researchers, which gives a comparative evaluation of the classification algorithm being used.
Most popular corpora
  • Most popular corpora used for text-classification research are:
    • Reuters-21578 data set (set of 21,578 newswire articles from Reuters – available as SGML documents – 1000 documents in each file)
    • 20-newsgroups data (a set of 20,000 newsgroup postings from 20 newsgroups – available as text files – one document per file)
    • WebKB database (classified web pages from 4 universities)
Reuters-21578 data set
  • Data is classified into five groups of classes: TOPICS, PLACES, PEOPLE, ORGS, and EXCHANGES.
Reuters-21578 data set
  • Categories are overlapping and non-exhaustive.
  • Overlapping: one document can be classified into more than one category. For example, a document can be about ‘nasdaq’ (EXCHANGES) and about ‘USA’ (PLACES) in general.
  • Non-exhaustive: There are categories into which no documents fall, and there are documents that do not fall into any category.
  • Few categories have 20 or more occurrences. An ANN approach would probably not work for categories with so few examples.
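
One way to represent overlapping, non-exhaustive categories as network targets is a multi-hot vector per document. The sketch below is illustrative only: the toy labels, the helper names `multi_hot` and `frequent_categories`, and the use of a minimum-count filter are assumptions, not part of the presentation.

```python
from collections import Counter

# Hypothetical toy labels: each document maps to a (possibly empty) set of categories.
doc_labels = [
    {"nasdaq", "usa"},  # overlapping: two categories at once
    {"usa"},
    set(),              # non-exhaustive: a document with no category
    {"earn", "usa"},
]

# Fix a category order so every document gets a vector of the same length.
categories = sorted(set().union(*doc_labels))


def multi_hot(labels, categories):
    """Return a 0/1 vector with a 1 for every category the document belongs to."""
    return [1 if c in labels else 0 for c in categories]


targets = [multi_hot(labels, categories) for labels in doc_labels]


def frequent_categories(doc_labels, min_count):
    """Keep only categories that occur in at least min_count documents."""
    counts = Counter(c for labels in doc_labels for c in labels)
    return sorted(c for c, n in counts.items() if n >= min_count)


print(categories)
print(targets)
print(frequent_categories(doc_labels, 2))
```

With a real corpus, `frequent_categories(doc_labels, 20)` would select the categories that have at least 20 occurrences, which is the threshold mentioned above.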
Example of a Reuters-21578 document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="13522" NEWID="8001">

<DATE>20-MAR-1987 16:54:10.55</DATE>

<TOPICS><D>earn</D></TOPICS>

<PLACES><D>usa</D></PLACES>

<PEOPLE></PEOPLE>

<ORGS></ORGS>

<EXCHANGES></EXCHANGES>

<COMPANIES></COMPANIES>

<TEXT>

<TITLE>GANTOS INC &lt;GTOS> 4TH QTR JAN 31 NET</TITLE>

<DATELINE> GRAND RAPIDS, MICH., March 20 -</DATELINE>

<BODY>

Shr 43 cts vs 37 cts

Net 2,276,000 vs 1,674,000

Revs 32.6 mln vs 24.4 mln

</BODY>

</TEXT>

</REUTERS>
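
A record of this shape could be turned into category labels and text with a few regular expressions. This is a minimal sketch, assuming every record follows the layout of the example; the helper names (`tag_text`, `tag_items`, `parse_record`) and the shortened sample are hypothetical, and a full SGML parser may be needed for the real files.

```python
import html
import re

# A shortened record in the same shape as the example document.
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="13522" NEWID="8001">
<DATE>20-MAR-1987 16:54:10.55</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<TEXT>
<TITLE>GANTOS INC &lt;GTOS> 4TH QTR JAN 31 NET</TITLE>
<BODY>
Shr 43 cts vs 37 cts
</BODY>
</TEXT>
</REUTERS>"""


def tag_text(record, tag):
    """Return the unescaped text inside <tag>...</tag>, or '' if the tag is absent."""
    m = re.search(r"<{0}>(.*?)</{0}>".format(tag), record, re.DOTALL)
    return html.unescape(m.group(1)).strip() if m else ""


def tag_items(record, tag):
    """Return the <D>...</D> entries listed inside a category tag."""
    m = re.search(r"<{0}>(.*?)</{0}>".format(tag), record, re.DOTALL)
    return re.findall(r"<D>(.*?)</D>", m.group(1)) if m else []


def parse_record(record):
    """Extract attributes, category lists, title, and body from one record."""
    # Attributes live in the opening <REUTERS ...> tag, before the first '>'.
    attrs = dict(re.findall(r'(\w+)="([^"]*)"', record.split(">", 1)[0]))
    return {
        "id": attrs.get("NEWID"),
        "split": attrs.get("LEWISSPLIT"),
        "topics": tag_items(record, "TOPICS"),
        "places": tag_items(record, "PLACES"),
        "people": tag_items(record, "PEOPLE"),
        "title": tag_text(record, "TITLE"),
        "body": tag_text(record, "BODY"),
    }


doc = parse_record(SAMPLE)
print(doc["topics"], doc["places"], doc["title"])
```

Empty category tags such as `<PEOPLE></PEOPLE>` come back as empty lists, so a document without any category is still represented.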

20-newsgroup data set
  • Each document is in a separate text file.
  • There are 1000 documents from each newsgroup.
  • Each document has only one source newsgroup, so each document falls into only one category.
  • The classification task is to determine the source newsgroup of each document.
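
Because each document has exactly one source newsgroup, the target can be a single class index or a one-hot vector. The sketch below assumes the files are stored as `<newsgroup>/<document id>`, so that the label can be read from the parent directory name; the paths, the document ids, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical file paths; the <newsgroup>/<document id> layout is an assumption.
paths = [
    "20_newsgroups/alt.atheism/51060",
    "20_newsgroups/sci.space/60151",
    "20_newsgroups/alt.atheism/51119",
]


def label_from_path(path):
    """Take the source newsgroup to be the name of the parent directory."""
    return PurePosixPath(path).parent.name


# Map each newsgroup name to an integer class index.
labels = [label_from_path(p) for p in paths]
class_index = {name: i for i, name in enumerate(sorted(set(labels)))}


def one_hot(label, class_index):
    """Exactly one entry is 1, since each document has one source newsgroup."""
    vec = [0] * len(class_index)
    vec[class_index[label]] = 1
    return vec


targets = [one_hot(label, class_index) for label in labels]
print(class_index)
print(targets)
```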
Example of a 20-newsgroup document

Newsgroups: alt.atheism

Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!uwm.edu!psuvax1!psuvm!smm125

Organization: Penn State University

Date: Fri, 23 Apr 1993 18:54:23 EDT

From: <SMM125@psuvm.psu.edu>

Message-ID: <93113.185423SMM125@psuvm.psu.edu>

Subject: Re: YOU WILL ALL GO TO HELL!!!

References: <93106.155002JSN104@psuvm.psu.edu> <1qq837$cm6@usenet.INS.CWRU.Edu>

Lines: 1

jsn104 is jeremy scott noonan
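
A posting like this can be split into header fields and body text with Python's standard `email` module, since the format is a block of headers, a blank line, then the body. This is a sketch under that assumption; the helper name `split_post` is hypothetical and the sample keeps only a few of the headers shown in the example.

```python
from email import message_from_string

# A shortened posting: header lines, one blank line, then the body.
SAMPLE = """Newsgroups: alt.atheism
Organization: Penn State University
Date: Fri, 23 Apr 1993 18:54:23 EDT
From: <SMM125@psuvm.psu.edu>
Subject: Re: YOU WILL ALL GO TO HELL!!!
Lines: 1

jsn104 is jeremy scott noonan
"""


def split_post(raw):
    """Return (newsgroups, subject, body) for one posting."""
    msg = message_from_string(raw)
    # The Newsgroups header may list several comma-separated groups.
    newsgroups = [g.strip() for g in msg.get("Newsgroups", "").split(",") if g.strip()]
    return newsgroups, msg.get("Subject", ""), msg.get_payload().strip()


groups, subject, body = split_post(SAMPLE)
print(groups, subject, body)
```

Since the `Newsgroups` header names the source newsgroup directly, it would probably need to be excluded from the classifier's input, otherwise the network could read the label straight from the text.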