Handy Tools and Frameworks


Presentation Transcript


  1. Handy Tools and Frameworks … for our projects and work

  2. Tools … • Apache Lucene • WEKA • itpp • Misc • *tex • imagemagick, inkscape • graphviz, gnuplot, gs

  3. Tech presentation by Pavel Patz http://lucene.apache.org/java/docs/index.html APACHE LUCENE

  4. Apache Lucene • Apache Lucene is a free, open-source information retrieval software library • Named after Doug Cutting’s wife’s middle name (also her maternal grandmother’s first name)! • One of the most powerful open-source indexers / search engines • Available for Java, with ports to C (with Perl and Python bindings), C++, Objective-C, Delphi, Ruby, PHP, Common Lisp and C# (yep, even .NET) • Fast and efficient solution • Indexes over 20 MB/minute on a Pentium M 1.5 GHz • Index size is 20-30% of the size of the indexed text • Widely adopted solution • Wikipedia (and MediaWiki as well), E.ON, Beagle, Strigi (desktop search), isoHunt, Eclipse, Jira, Digg (it!), abclinuxu.cz, BlogScope, CNET, European Bioinformatics Institute, etc.

  5. Apache Lucene – Text processing • Stemmers • Remove suffixes to find the root of a word • Vs. lemmatizers • Create index storage, a.k.a. Directory • In a database, in RAM, on the file system • Create an Analyzer • We need to separate tokens, find roots, and exclude stop words • Create an IndexWriter • Based on the Directory and the Analyzer • For each “record” (file, row in a table…) create a Document and store it
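
The analyzer step on slide 5 (separate tokens, find roots, exclude stop words) can be sketched in plain Python. This is not Lucene's API: the stop-word list, the suffix rules, and the names `stem` and `analyze` are made up for illustration, and a real stemmer is far more careful than this.

```python
import re

# Hypothetical, tiny stop-word list and suffix rules (not Lucene's).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lower-case, split on non-alphanumerics, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


terms = analyze("The indexer is indexing the indexed documents")
print(terms)
```

Note that "indexing" and "indexed" collapse to the same term while "indexer" does not: a suffix stripper only approximates what a lemmatizer would do with a dictionary.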

  6. Apache Lucene – Directory & Indexing • A Directory consists of documents • A Document consists of fields • Like ID, content, timestamps – whatever you want to store • Fields • Can be stored, compressed (useful for long strings), or not stored • The content of stored fields can be retrieved directly from search results • Content can be indexed as • Tokenized • Not tokenized (for instance brand names – “Faster Runner”) • Indexed without norms (= no scoring) • Not indexed (but can be stored) • Indexing • Each document and/or field can have its own “boost” value • The score of a result is computed from many factors; the boost value multiplies the score of the document / field.
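
To make the stored / tokenized / not tokenized / not indexed distinction on slide 6 concrete, here is a toy in-memory index in Python. It is only a sketch: `ToyIndex`, its field modes, and the way the boost is folded into the posting weight are assumptions for illustration and do not mirror Lucene's classes or its scoring formula.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for a Directory plus IndexWriter (illustrative only)."""

    def __init__(self):
        self.stored = {}                   # doc_id -> {field: original value}
        self.postings = defaultdict(dict)  # (field, term) -> {doc_id: weight}

    def add_document(self, doc_id, fields, boost=1.0):
        """fields maps name -> (value, stored, mode); mode is
        'tokenized', 'not_tokenized' or 'not_indexed'."""
        self.stored[doc_id] = {}
        for name, (value, stored, mode) in fields.items():
            if stored:
                self.stored[doc_id][name] = value
            if mode == "tokenized":
                terms = value.lower().split()
            elif mode == "not_tokenized":
                terms = [value]  # the whole value is a single term
            else:
                terms = []       # not indexed, only (maybe) stored
            for term in terms:
                posting = self.postings[(name, term)]
                # term frequency scaled by the document boost
                posting[doc_id] = posting.get(doc_id, 0.0) + boost


index = ToyIndex()
index.add_document(1, {
    "id": ("doc-1", True, "not_indexed"),
    "brand": ("Faster Runner", True, "not_tokenized"),
    "content": ("faster shoes for every runner", False, "tokenized"),
})
index.add_document(2, {
    "id": ("doc-2", True, "not_indexed"),
    "content": ("runner runner blog", False, "tokenized"),
}, boost=2.0)

print(index.postings[("content", "runner")])
print(index.postings[("brand", "Faster Runner")])
print(index.stored[1])
```

The brand name stays searchable only as the exact string "Faster Runner", the `id` field can be read back from a hit but never matched by a query, and `content` is searchable but not retrievable.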

  7. Apache Lucene – Search • We have an index. So open it! • Use IndexSearcher – keep a single shared instance (singleton) for better performance • Prepare a Query • Lucene has a simple query language • We should use the same analyzer for querying as for indexing • We can search in specific fields, boost parts of a query, build Boolean queries, etc. • Execute the Query • Enjoy the results
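
The search steps on slide 7 (same analyzer for indexing and querying, per-field search, boosted query parts, Boolean combination) can be illustrated with a small self-contained Python sketch. The documents, the `search` function, and the scoring rule (boost times term frequency) are assumptions for illustration; Lucene's real scoring takes many more factors into account. In Lucene's query language the first query below would look roughly like `body:search AND title:search^3`.

```python
from collections import defaultdict


def analyze(text):
    """Shared analyzer: the same function is used for indexing and querying."""
    return text.lower().split()


DOCS = {
    1: {"title": "Lucene in Action", "body": "indexing and search with lucene"},
    2: {"title": "Search engines", "body": "ranking search results"},
    3: {"title": "Cooking", "body": "search for the best recipe"},
}

# (field, term) -> {doc_id: term frequency}
postings = defaultdict(lambda: defaultdict(int))
for doc_id, fields in DOCS.items():
    for field, text in fields.items():
        for term in analyze(text):
            postings[(field, term)][doc_id] += 1


def search(clauses):
    """clauses: list of (field, query text, boost); every clause must match (AND).
    Terms inside one clause are alternatives; score = sum of boost * tf."""
    scores = None
    for field, text, boost in clauses:
        clause_scores = defaultdict(float)
        for term in analyze(text):
            for doc_id, tf in postings.get((field, term), {}).items():
                clause_scores[doc_id] += boost * tf
        if scores is None:
            scores = dict(clause_scores)
        else:
            scores = {d: scores[d] + s
                      for d, s in clause_scores.items() if d in scores}
    # best score first, ties broken by document id
    return sorted((scores or {}).items(), key=lambda item: (-item[1], item[0]))


hits = search([("body", "search", 1.0), ("title", "Search", 3.0)])
print(hits)
print(search([("body", "search lucene", 1.0)]))
```

Because the query text goes through the same `analyze` function as the documents, the capitalised "Search" in the query still matches the lower-cased term in the index.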

  8. http://www.cs.waikato.ac.nz/ml/weka/ WEKA

  9. Weka 3: Data Mining Software in Java • collection of machine learning algorithms for data mining tasks • Library AND environment in one • Tools for data pre-processing, classification, regression, clustering, association rules, and visualization

  10. WEKA Tools • Collection of machine learning algorithms for data mining tasks • Library AND environment • Tools for data pre-processing, classification, regression, clustering, association rules, and visualization • Own data format (ARFF) • Text-oriented, easily editable • Many algorithms (classifiers, preprocessors) • Many parameters • Can be set in the GUI or through the API
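
Since ARFF is plain text, test data for WEKA can be generated by any script. The Python sketch below writes a minimal ARFF document; the helper `to_arff` and the toy "weather" data are invented for illustration, and the helper does no quoting, so it assumes attribute names and values contain no spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """attributes: list of (name, type); type is 'numeric', 'string',
    or a list of nominal values. Values are written without quoting."""
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"
        lines.append("@attribute {} {}".format(name, kind))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff = to_arff(
    "weather",
    [
        ("temperature", "numeric"),
        ("outlook", ["sunny", "rainy"]),
        ("play", ["yes", "no"]),
    ],
    [(27.5, "sunny", "yes"), (14.0, "rainy", "no")],
)
print(arff)
```

The result can be saved with an `.arff` extension and opened in the WEKA Explorer.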

  11. WEKA Modules • The WEKA GUI consists of several parts • Explorer • Data analysis, visualisation, model management • Knowledge Flow • Streaming data processing • Experimenter • Parameterized tests, statistics, performance evaluation, significance tests • CLI • Command line!

  12. http://itpp.sourceforge.net/ ITPP

  13. ITPP Intro • Do you Matlab? • Nope? But there are a number of examples in *.m • … and the API is actually nice • You can IT++ • C++ library of mathematical, signal processing and communication classes and functions • IT++ makes extensive use of existing open-source or commercial libraries for increased functionality, speed and accuracy, in particular BLAS, LAPACK and FFTW • IT++ should work on GNU/Linux, Sun Solaris, Microsoft Windows (with Cygwin, MinGW/MSYS or Microsoft Visual C++) and Mac OS X operating systems

  14. ITPP Features • Basic mathematical features • templated vector and matrix classes • sparse vector and matrix classes • elementary functions on vectors and matrices • statistics classes and functions • matrix decompositions such as eigenvalue, Cholesky, LU, Schur, SVD, and QR • solving linear systems of equations (including over- and underdetermined) • random number generation (Mersenne Twister generator) • binary and Galois types (scalars, vectors, and matrices) • integration of 1-dimensional functions • unconstrained nonlinear optimization (Quasi-Newton search) • Signal processing • filter functions and classes • frequency domain filtering • FFT, DFT, DCT, and Hadamard transforms • time and frequency domain windows • evaluating and finding roots of polynomials (and inverse operations) • filter design functions • fast independent component analysis (fast ICA) • Communications • modulators (BPSK, PSK, PAM, QAM) • vector modulators (e.g. for OFDM and MIMO) • OFDM and CDMA modulators • pulse shaping filters (including RC and RRC) • binary symmetric (BSC) and additive white Gaussian noise (AWGN) channels • multipath fading channels (both frequency-flat and frequency-selective) • COST 207, COST 257, and ITU channel models • Hamming, extended Golay, and CRC codes • BCH and Reed-Solomon codes • convolutional and punctured convolutional codes • recursive convolutional codes, turbo codes, interleavers • Protocol simulation • event-based simulation classes • signals and slots for simplified syntax • TCP clients and servers, selective repeat ARQ • queue classes, packet generators • Source coding • scalar quantizer (SQ) and vector quantizer (VQ) classes and functions for training them • LPC, LSF, and cepstrum parameter calculation for speech processing • Gaussian mixture modeling • reading and saving several different audio file formats • reading and saving images in PNM format

  15. Building ITPP • Cygwin & Linux • Autotools • ./configure [--without-blas --without-lapack --without-fft] • make -j • make -j install

  16. MISC.

  17. HeuristicLab

  18. ((pdf)La | xe)TeX • TeX makes beautiful PDFs • Looks professional; handles math and graphics • Typesetting can be done like software development • Portable, vector-oriented, blah blah • Scriptable

  19. Beautiful figures • ImageMagick • Converts between many formats (e.g. to PDF) • GraphViz • Creates graphs from text files • Many layouts • PS, PDF, SVG output • Java/.NET alternative • gnuplot • Non-graph plots • Many flavors of charts (pie charts, etc.)
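
"Graphs from text files" means a test program only has to print a few lines of DOT source for GraphViz to lay out. A minimal Python sketch follows; the helper `to_dot` and the node names are invented for illustration, and the helper assumes node names contain no double quotes.

```python
def to_dot(name, edges):
    """Render a directed graph as DOT source; edges is a list of (src, dst)."""
    lines = ["digraph {} {{".format(name)]
    for src, dst in edges:
        lines.append('    "{}" -> "{}";'.format(src, dst))
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_source = to_dot("pipeline", [
    ("test data", "test program"),
    ("test program", "text output"),
    ("text output", "report"),
])
print(dot_source)
```

Saved as `pipeline.dot`, this can be rendered with `dot -Tpdf pipeline.dot -o pipeline.pdf`; `-Tsvg` and `-Tps` select the other output formats.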

  20. All together • Put it all together • Test data • Test program • Text output of results (gnuplot, graphviz) • Prepared source for the report (LaTeX) = • Seminar projects generated on demand ;)
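
One possible shape of that pipeline, sketched in Python: a stand-in "test program" emits its results as whitespace-separated text columns, together with a gnuplot script that turns them into a PDF figure for the report. The file names, the sine data, and both helper functions are hypothetical, and the `pdfcairo` terminal is assumed to be available in the local gnuplot build.

```python
import math


def make_data(n):
    """Hypothetical 'test program' output: x and sin(x) as text columns."""
    rows = []
    for i in range(n):
        x = i * 0.5
        rows.append("{:.2f} {:.4f}".format(x, math.sin(x)))
    return "\n".join(rows) + "\n"


def make_gnuplot_script(data_file, pdf_file):
    """gnuplot commands that turn the data file into a PDF figure."""
    return "\n".join([
        "set terminal pdfcairo",
        'set output "{}"'.format(pdf_file),
        'set xlabel "x"',
        'set ylabel "sin(x)"',
        'plot "{}" using 1:2 with linespoints title "result"'.format(data_file),
    ]) + "\n"


data = make_data(4)
script = make_gnuplot_script("results.dat", "results.pdf")
print(data)
print(script)
```

Writing the two strings to `results.dat` and a script file, running gnuplot on the script, and including the resulting PDF from the prepared LaTeX source closes the loop from test data to report.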
