Text Classification

The Scirus Classifier

Overview
  • Complex program:
    • Porn filter
    • Name extraction
    • Classification based on text
    • Classification by URL/title
    • Fixed classification based on a URL list
    • Extraction of title/author/abstract etc. from articles
    • Output of refinement terms
  • Of interest here: only the classification based on textual content
Text Classification
  • Lexicon-based:
    • Phrases or words
    • Each receives a weight for every category
    • Strong indicators
  • Classification by computing a score:
    • For each occurrence, a counter is incremented for each category
    • Normalization by document length
    • Threshold
Configuration Files

//Number of words to process for subject identification
NWDS=2000000
MINWORDS=100
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=chem all 1 0
SUBJ=comp all 2 0
SUBJ=eng all 3 0
SUBJ=env all 4 0
SUBJ=geo all 5 0
SUBJ=astro all 6 0
SUBJ=life all 7 0
SUBJ=math all 8 0
SUBJ=mat all 9 0
SUBJ=med all 10 0
….
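
A configuration file in this style could be read with a few lines of Python. This is only a sketch: the function name `parse_config` is made up for illustration, and because the slide does not explain the four fields of a SUBJ line, they are kept as raw strings rather than interpreted.

```python
def parse_config(text):
    """Parse KEY=VALUE lines; SUBJ may repeat, so it is collected in a list."""
    settings = {}
    subjects = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, // comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("//") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "SUBJ":
            # Field meanings are not documented on the slide; keep them raw.
            subjects.append(tuple(value.split()))
        else:
            # Assumption: plain settings are integers; otherwise keep the text.
            settings[key] = int(value) if value.isdigit() else value
    return settings, subjects


sample = """\
//Number of words to process for subject identification
NWDS=2000000
MINWORDS=100
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=chem all 1 0
"""
settings, subjects = parse_config(sample)
print(settings)
print(subjects)
```

Lines without an equals sign, such as the elided remainder of the file, are simply skipped.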

Invocation

CIS Subject Identifier and Content Extractor Version 5.0

USAGE: classifier [-h[elp]] [-os|l[A]] [-it|f|h] [-s[ilent]] [-c CONFIG_FILE] [-nout] [-uat] [-URL<filename>] [-smd<number>] [-ps] [-t FILES_TO_IDENTIFY]

-h: Print help
-c CONFIG_FILE: Name of the configuration file. Default is ././config.txt
-os|l[A]: Output format
    -os: Short: only print well identified subjects (default)
    -ol: Long: print all subjects
    -ot: Topics only are output; one line
         Format: filename:WORDCOUNT#GENERALSCIENCESCORE#TOPICSWITHSCORE
    -oA: Store and print all phrases for a topic
    -oT: Print all phrases found in the dictionary
         (Used for dictionary testing only)
-T[t][i][o]: Tasks to carry out and to output (default: all are set)
    t: Topic identification
    i: Information from content extractor
    o: Offensive content filter
-it|h|f: Input format
    -it: Plain text
    -ih: HTML file
    -if: HTML file preceded by header
-nINTEGER: Minimum number of words in a document
-MINTEGER: Maximum number of words to be processed in a document
    Tokenizer stops after INTEGER words
    Documents with fewer words will get tag 'not_enough_data'
-mINTEGER: Minimum score for accepted documents
-rINTEGER: Maximum relative count per phrase form per thousand phrases
    Within a thousand phrases, one phrase form will only be counted
    INTEGER times.
-NINTEGER: Maximum number of phrases to output in results for topics
-t FILES_TO_IDENTIFY: List of files for which the subject
    should be identified. Default: stdin.
-D[r] D1|D2[:F1|F2[:FB1|FB2]]: Process all files in directory and recurse
    Dr: Descend recursively into subdirectories
    D1: Name of directory to list or recurse
    F1...: Filename patterns (may contain *)
    FB1: Patterns for forbidden directories (not recursed)
-s: Print only some important messages, not all.
-nout: Turn off URL/Title classifier.
-uat: Use all titles for classification (not just those enclosed in <head>).
-URL<filename>: Filename of the URL list (format: <file><tab><url><newline>).
-smd<number>: Maximum number of words for small documents (default see config file).
-ps: Print title and URL scores
-xml: Print XML output
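
The one-line `-ot` format listed above (filename:WORDCOUNT#GENERALSCIENCESCORE#TOPICSWITHSCORE) is easy to consume from a script. The sketch below is hypothetical: `parse_ot_line` and `TopicLine` are invented names, the sample line is made up, and since the help text does not describe the internal layout of TOPICSWITHSCORE, that field is returned unparsed. It also assumes the filename contains no `#` and that the general science score is numeric.

```python
from typing import NamedTuple


class TopicLine(NamedTuple):
    filename: str
    word_count: int
    general_science_score: float
    topics_raw: str  # layout not documented in the help text; left as-is


def parse_ot_line(line: str) -> TopicLine:
    """Split one -ot output line into its documented top-level fields."""
    head, score, topics = line.rstrip("\n").split("#", 2)
    # Split at the LAST colon so that colons inside the filename survive.
    filename, _, word_count = head.rpartition(":")
    return TopicLine(filename, int(word_count), float(score), topics)


# Made-up line for illustration only; not real classifier output.
parsed = parse_ot_line("docs/a.txt:1532#0.8#<topics>")
print(parsed)
```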

Processing Steps
  • Read the text up to the specified number of words
  • Match against the lexicon
  • Compute the score
  • Output the result depending on the threshold
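
The first two steps (reading a bounded number of words and matching them against the lexicon) might look roughly like this in Python. The slides do not say how the real classifier tokenizes or how it resolves overlapping phrases, so the lower-cased alphanumeric tokenizer and the greedy longest-match strategy here are assumptions, as are the names `tokenize` and `match_phrases`.

```python
import re


def tokenize(text, max_words):
    """Lower-case the text, split it into words, stop after max_words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words[:max_words]


def match_phrases(words, lexicon, max_phrase_len=4):
    """Return every lexicon phrase found, preferring the longest match."""
    hits = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in lexicon:
                hits.append(phrase)
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no phrase starts here
    return hits


lexicon = {"a d converter", "address register", "converter"}
words = tokenize("The A-D converter feeds an address register.", 2000000)
hits = match_phrases(words, lexicon)
print(hits)
```

With the longest-match rule, "a d converter" is counted once as a phrase and the shorter entry "converter" is not counted again for the same position.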
Scoring Formula
  • Let:
    • d: document,
    • c: category,
    • t: term,
    • l(t): length of t,
    • wn(t): number of words in t,
    • q(t,c): weight of t for c,
    • s(t,c): strong indicator t for c,
    • T(c): classification threshold for c,
    • W = min(words in the document, max. processed words)
  • $\mathrm{Score}(d,c) = \frac{1}{W} \sum_{t \in d} \left( \frac{l(t)}{2} + 2\,(wn(t) - 1) \right) q(t,c)$
  • $\text{Si-score}(d,c) = \sum_{t \in d} s(t,c)$
  • d is classified as c iff $\text{Si-score}(d,c) > 1$ and $\mathrm{Score}(d,c) > T(c)$
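
A small worked example can make the formula concrete. The Python sketch below evaluates both scores for one category; it assumes that l(t) is the character length of the term including spaces (the slide only says "length"), that s(t,c) is 0 or 1, and that the list of hits has one entry per occurrence. The names `Hit`, `score`, `si_score`, and `classify` are illustrative, and the weights are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str      # matched term t
    weight: float  # q(t, c)
    strong: int    # s(t, c), assumed to be 0 or 1


def score(hits, doc_words, max_words):
    """Length- and word-count-weighted sum of term weights, divided by W."""
    w = min(doc_words, max_words)
    if w <= 0:
        return 0.0
    total = 0.0
    for h in hits:
        length = len(h.text)            # l(t), assumed: characters
        n_words = len(h.text.split())   # wn(t)
        total += (length / 2 + (n_words - 1) * 2) * h.weight
    return total / w


def si_score(hits):
    """Sum of the strong-indicator values over all occurrences."""
    return sum(h.strong for h in hits)


def classify(hits, doc_words, max_words, threshold):
    """Apply both conditions from the slide: Si-score > 1 and Score > T(c)."""
    return si_score(hits) > 1 and score(hits, doc_words, max_words) > threshold


hits = [
    Hit("a d converter", 1.0, 1),     # (13/2 + 2*2) * 1 = 10.5
    Hit("a d converter", 1.0, 1),     # second occurrence, another 10.5
    Hit("address register", 8.0, 1),  # (16/2 + 1*2) * 8 = 80.0
]
print(score(hits, doc_words=50, max_words=2000000))   # 101 / 50
print(classify(hits, 50, 2000000, threshold=1))
```

For a 50-word document the sum 101 is divided by W = 50, which clears a threshold of 1; the same hits in a 200-word document would not.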
Classification Lexicon
  • Format: TERM.INFO1/INFO2/...
  • INFO: TOPICS#FREQUENCY#QUALITY#LENGTH#TYPE#ALONE#OUTPUT
    • TOPICS: MAIN:SUB
    • FREQUENCY: 1 (not used)
    • QUALITY: 0...9
    • LENGTH (number of words)
    • TYPE: 0..3
      • 0: genuine topic-subtopic indicator
      • 1: only to distinguish between subtopics, not indicating topic itself
      • 2: as 0, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
      • 3: as 1, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
    • ALONE: 0/1 : strong indicator
    • OUTPUT: Ø,$, PHRASE
Classification Lexicon
  • Example
    • a vinculo matrimonii.18:0#1#0#3#0#0#$
    • a-37 aircraft.14:0#1#1#3#0#1#a 37 aircraft
    • a-address register.2:0#1#1#3#0#1#a address register
    • a-bomb survivors.7:0#1#8#3#0#1#a bomb survivors
    • a-c substitutions.15:0#1#8#3#0#1#a c substitutions/7:0#1#8#3#0#1#a c substitutions
    • a-calcium-calmodulin kinase.11:0#1#8#4#0#1#a calcium-calmodulin kinase
    • a-chromanoxyl radical.7:0#1#8#3#0#1#a chromanoxyl radical
    • a-crystallin gene.15:0#1#8#3#0#1#a crystallin gene/7:0#1#8#3#0#1#a crystallin gene
    • a-d conversion.3:0#1#1#3#0#1#a d conversion
    • a-d converter.13:0#1#1#3#0#1#a d converter/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter
    • a-deficient mice.11:0#1#7#3#0#1#a deficient mice/15:0#1#8#3#0#1#a deficient mice
    • a-delta activity.11:0#1#8#3#0#1#a delta activity
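
Entries like the ones above can be split into their fields with a short parser. This is a sketch under assumptions: the term is taken to end at the first period, INFO blocks are taken to be separated by `/`, and topic IDs are taken to be integers, which holds for the examples shown but is not guaranteed by the format description. `LexInfo` and `parse_entry` are hypothetical names.

```python
from typing import NamedTuple


class LexInfo(NamedTuple):
    main_topic: int
    sub_topic: int
    frequency: int
    quality: int
    length: int
    type: int
    alone: bool   # strong indicator flag
    output: str


def parse_entry(line):
    """Parse 'TERM.INFO1/INFO2/...' into the term and a list of LexInfo."""
    # Assumption: the term itself contains no period.
    term, _, infos = line.strip().partition(".")
    parsed = []
    for info in infos.split("/"):
        topics, freq, quality, length, typ, alone, output = info.split("#", 6)
        main, _, sub = topics.partition(":")
        parsed.append(LexInfo(int(main), int(sub), int(freq), int(quality),
                              int(length), int(typ), alone == "1", output))
    return term, parsed


term, infos = parse_entry(
    "a-d converter.13:0#1#1#3#0#1#a d converter"
    "/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter"
)
print(term, [i.main_topic for i in infos])
```

One term can carry several INFO blocks, one per topic it indicates, which is why the parser returns a list.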