Content Classification Analysis based on LDA Topic Model

Presentation Transcript

  1. Content Classification Analysis based on LDA Topic Model • Project leader: Hongbozhao

  2. Content Classification Analysis based on LDA Topic Model • Web crawler: acquiring web news; Chinese parsing & extracting • Advanced TF-IDF: content processing; adding content-based tests; finding best parameters in small data • Testing parameters: testing in big data; comparing to content-based algorithm

  3. Web crawler • Acquiring nearly 17,000 web news items through the Sougou Database • The raw data includes HTML characters and other insignificant content

  4. Web crawler • Using ICTCLAS to parse and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals
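
The filtering step on this slide can be sketched as follows. This is a minimal illustration, not the ICTCLAS API: it assumes the segmenter output is already available as whitespace-separated `word/tag` pairs, and that tags beginning with `c`, `p` and `m` mark conjunctions, prepositions and numerals. The function name `filter_tokens` and the tiny stop list are hypothetical.

```python
# Hypothetical post-processing of segmenter output; this is not the ICTCLAS API.
# Assumption: tags starting with "c", "p", "m" mean conjunction, preposition, numeral.
EXCLUDED_TAG_PREFIXES = ("c", "p", "m")

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"的", "了", "是"}


def filter_tokens(tagged_text, stop_words=STOP_WORDS):
    """Keep words that are neither stop words nor excluded parts of speech."""
    kept = []
    for item in tagged_text.split():
        # Split on the last "/" so that words containing "/" stay intact.
        word, sep, tag = item.rpartition("/")
        if not sep or not word:
            continue  # skip malformed items that have no "word/tag" shape
        if word in stop_words or tag.startswith(EXCLUDED_TAG_PREFIXES):
            continue
        kept.append(word)
    return kept


tokens = filter_tokens("中国/ns 的/u 经济/n 和/c 三/m 在/p 增长/v")
```

Filtering on the tag prefix rather than the full tag is a design choice that also catches finer-grained sub-tags, if the tagger emits them.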

  5. Advanced TF-IDF • Splitting each news item into TITLE, BEGIN, CONTENT and END sections with different weights • Using TF-IDF to calculate the top 5 keywords; the accuracy is 81% compared with the sorted database
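
A section-weighted TF-IDF of the kind described on this slide might look like the sketch below. The slides do not give the actual weights or the exact TF-IDF variant, so the values in `SECTION_WEIGHTS`, the smoothed IDF formula, and the names `weighted_tf` and `top_keywords` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical section weights; the presentation does not state the real values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}


def weighted_tf(doc):
    """doc maps a section name to its token list; returns weighted term counts."""
    tf = Counter()
    for section, tokens in doc.items():
        weight = SECTION_WEIGHTS.get(section, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def top_keywords(docs, index, k=5):
    """Return the k highest-scoring terms of docs[index] under weighted TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter()
    for doc in docs:
        seen = set()
        for tokens in doc.values():
            seen.update(tokens)
        df.update(seen)
    scores = {}
    for term, count in weighted_tf(docs[index]).items():
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # assumed smoothed IDF
        scores[term] = count * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


docs = [
    {"TITLE": ["football"], "BEGIN": ["match"], "CONTENT": ["team", "news"], "END": ["news"]},
    {"TITLE": ["stock"], "CONTENT": ["market", "news"]},
]
keywords = top_keywords(docs, 0, k=2)
```

In this toy corpus a title word outranks a word repeated in the body, which is the effect the section weights are meant to produce.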

  6. Advanced TF-IDF • Adding a content-based algorithm: the accuracy moves only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, so there is no significant change. We conclude that semantics is not useful in this circumstance.

  7. Advanced TF-IDF • Testing for the best parameters on small data (fewer than 2,000 news items), considering accuracy and time efficiency • Testing sets = 30% of the whole data • Training sets = 70% of the whole data
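
The 70% / 30% split can be sketched as below. The slides do not say how items were assigned to each set, so the seeded random shuffle and the helper name `split_train_test` are assumptions.

```python
import random


def split_train_test(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of items and return (training set, testing set)."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = split_train_test(range(10))
```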

  8. Advanced TF-IDF • The number of keywords in the training sets equals that in the testing sets • Unstable

  9. Advanced TF-IDF • Using all keywords in training sets • Extremely low speed

  10. Advanced TF-IDF • Using all keywords in testing sets • When using 10 keywords in training sets, the accuracy, error score and time efficiency are the best

  11. Testing parameters • Testing on big data: when the training set in every section increases gradually to 200, 450, 750 and finally 1343 (all words), the accuracy is shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
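
Reporting one accuracy over all sections and another with a section left out can be sketched as below. The helper name `accuracy` and the labels are made up for illustration; they are not the presentation's data and do not reproduce its 82.5% or 85.1% figures.

```python
def accuracy(true_labels, predicted_labels, exclude=()):
    """Fraction of correct predictions, skipping items whose true label is excluded."""
    pairs = [
        (true, predicted)
        for true, predicted in zip(true_labels, predicted_labels)
        if true not in exclude
    ]
    if not pairs:
        raise ValueError("no samples left to score")
    return sum(true == predicted for true, predicted in pairs) / len(pairs)


# Illustrative labels only.
true_labels = ["sports", "culture", "finance", "culture"]
predicted_labels = ["sports", "finance", "finance", "culture"]

overall = accuracy(true_labels, predicted_labels)
without_culture = accuracy(true_labels, predicted_labels, exclude=("culture",))
```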

  12. Testing parameters • Compared to the content-based algorithm, the accuracy is greater; however, the time efficiency is lower

  13. Summary • Partial encoding & decoding problems • Errors in keyword parsing lead to classification faults • Partially repeated passages lead to errors in accuracy • Successful algorithm in general

  14. Thanks