
Datcracker Open data-mining platform connecting Rseslib and WEKA

Marcin Wojnarski, Warsaw University, Poland.



  1. Datcracker: Open data-mining platform connecting Rseslib and WEKA. Marcin Wojnarski, Warsaw University, Poland

  2. Outline • Datcracker is … • Motivation • What is available in version 0.5 • HOWTO … • Architecture • Future releases

  3. Datcracker is… …an open-source, extensible data-mining platform that provides a common architecture for data-processing algorithms of various types. The algorithms can be combined to build highly complex data-processing schemes.

  4. Main characteristics • Extensibility of the algorithm pool through a well-defined API • Extensibility of the types of data that algorithms operate on • Stream-based data processing, for efficient handling of large volumes of data and for freedom in designing complex experiments • Language: Java • Licence: GPL • Download: www.datcracker.org

  5. Motivation • To enable independent research groups to exchange and combine their algorithms • To simplify implementation of new algorithms

  6. Available in version 0.5
     • Rseslib algorithms:
        • classifiers (~20 algorithms)
     • Weka algorithms:
        • ARFF reader
        • classifiers (~60)
        • filters (47)
     • Datcracker algorithms:
        • Train&Test evaluation scheme
     • Data types:
        • vectors of numeric and/or symbolic features

  7. HOWTO: Read ARFF file

     Cell arff = new ArffReaderCell();
     arff.set("filename", "data/iris.arff");
     arff.set("labelIndex", "last");
     arff.open();
     System.out.println(arff.next());
     System.out.println(arff.next());
     arff.close();

     Output:
     [data:[5.1 3.5 1.4 0.2] label:[Iris-setosa]]
     [data:[4.9 3.0 1.4 0.2] label:[Iris-setosa]]

  8. HOWTO: Train classifier (Rseslib)

     Cell learner = new RseslibClassifier("C45");
     learner.set("pruning", "true");
     learner.setSource(arff);        // training data
     learner.build();                // train the classifier
     learner.setSource(arff_test);   // arff_test: a second source cell, not defined on this slide
     learner.open();
     System.out.println(learner.next());
     learner.close();

  9. HOWTO: Train classifier (Weka)

     Cell learner = new WekaClassifier("J48");
     learner.set("minNumObj", "2");
     learner.setSource(arff);
     learner.build();

  10. HOWTO: Apply Weka filter

     Cell filter = new WekaFilter("attribute.Remove");
     filter.set("attributeIndices", "3-6");
     filter.setSource(arff);
     filter.open();
     System.out.println(filter.next());
     System.out.println(filter.next());
     filter.close();

  11. HOWTO: Set parameters

     arff.set("filename", "data/iris.arff");
     arff.set("labelIndex", "last");
     ...

     OR

     Parameters par = new Parameters();
     par.set("filename", "data/iris.arff");
     par.set("labelIndex", "last");
     ...
     arff.setParameters(par);
     par = arff.getParameters();

  12. HOWTO: Train & Test

     Cell learner = new RseslibClassifier("C45");
     learner.set("pruning", "true");
     TrainAndTest tt = new TrainAndTest(learner);
     tt.set("trainPercent", "70");
     tt.set("repetitions", "10");
     tt.setSource(source);
     tt.build();
     System.out.println(tt.report());
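What a scheme configured with trainPercent = 70 and repetitions = 10 might compute can be illustrated without Datcracker. The sketch below is not the TrainAndTest implementation: it assumes that each repetition is an independent random split and that the report is a mean test accuracy. It uses a trivial majority-class predictor, and all names (TrainTestSketch, evaluate, the seed argument) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of a repeated train-and-test scheme.
// Not Datcracker code: the "classifier" is a majority-class predictor.
class TrainTestSketch {

    // Returns the mean test accuracy over `repetitions` random splits.
    // Assumes 0 < trainPercent < 100 so that the test part is non-empty.
    static double evaluate(List<String> labels, int trainPercent,
                           int repetitions, long seed) {
        Random rnd = new Random(seed);
        double sum = 0.0;
        for (int r = 0; r < repetitions; r++) {
            List<String> shuffled = new ArrayList<>(labels);
            Collections.shuffle(shuffled, rnd);
            int cut = shuffled.size() * trainPercent / 100;
            List<String> train = shuffled.subList(0, cut);
            List<String> test = shuffled.subList(cut, shuffled.size());

            // "Training": find the most frequent label in the training part.
            Map<String, Integer> counts = new HashMap<>();
            for (String l : train) {
                counts.merge(l, 1, Integer::sum);
            }
            String majority = null;
            int best = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > best) {
                    best = e.getValue();
                    majority = e.getKey();
                }
            }

            // "Testing": fraction of test labels equal to the prediction.
            int hits = 0;
            for (String l : test) {
                if (l.equals(majority)) {
                    hits++;
                }
            }
            sum += (double) hits / test.size();
        }
        return sum / repetitions;
    }

    public static void main(String[] args) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 80; i++) labels.add("a");
        for (int i = 0; i < 20; i++) labels.add("b");
        System.out.println(evaluate(labels, 70, 10, 42L));
    }
}
```

Fixing the random seed keeps the splits reproducible between runs, which makes repeated experiments comparable.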

  13. Data Processing Chain [Diagram: cells labelled ARFF, ARFF, New ARFF, Filter1, Filter2, Classifier and Another Classifier, linked with Cell.setSource(sourceCell); the filters are configured with set("attributeIndices","0-3") and set("attributeIndices","5").]

  14. Architecture

  15. Outline
     • Cell
        • interfaces
        • state
        • how to override
     • Data
     • MetaData

  16. Cell
     • Main class of the Datcracker architecture
     • Base class for all data-processing algorithms:
        • classifiers
        • clusterers
        • filters
        • data loaders
        • data generators
        • …
     • Cells can be connected in a Data Processing Chain
     • Data transfer between cells takes the form of a stream of samples
     • The receiving cell may immediately consume incoming samples, so large volumes of data are processed efficiently

  17. Cell’s interface Cell can be: • a data source • a data receiver • buildable • parameterized

  18. Cell as a data source. Cell's interface for data transfer:
     • open() : MetaSample (opens a communication session)
     • next() : Sample (retrieves the next sample of data)
     • close() (closes the communication session)

  19. Cell as a data receiver. Cell's interface for receiving data:
     • setSource(Cell) (sets the source cell)
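The source and receiver roles can be sketched as a small pull-based chain. This is a simplified stand-in rather than Datcracker's Cell: samples are plain Strings, open() returns nothing instead of a MetaSample, and signalling the end of the stream with null is an assumption. The names SourceSketch, ListSource and UpperCaseFilter are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for a cell acting as a data source.
interface SourceSketch {
    void open();
    String next();   // returns null when the stream is exhausted (assumption)
    void close();
}

// A source backed by an in-memory list.
class ListSource implements SourceSketch {
    private final List<String> items;
    private Iterator<String> it;

    ListSource(List<String> items) {
        this.items = items;
    }

    public void open() { it = items.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
    public void close() { it = null; }
}

// A receiver that is itself a source: it pulls one sample at a time
// from its upstream cell and transforms it, so nothing is buffered.
class UpperCaseFilter implements SourceSketch {
    private SourceSketch source;

    void setSource(SourceSketch source) {
        this.source = source;
    }

    public void open() { source.open(); }

    public String next() {
        String s = source.next();
        return s == null ? null : s.toUpperCase();
    }

    public void close() { source.close(); }
}

class ChainDemo {
    public static void main(String[] args) {
        UpperCaseFilter filter = new UpperCaseFilter();
        filter.setSource(new ListSource(Arrays.asList("setosa", "virginica")));
        filter.open();
        for (String s = filter.next(); s != null; s = filter.next()) {
            System.out.println(s);
        }
        filter.close();
    }
}
```

Because each next() call on the filter triggers exactly one next() call upstream, only one sample is in flight at a time, which is what makes stream-based processing suitable for large data sets.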

  20. Buildable cells
     • Some cells may be buildable: they have to be built before use
     • Building a cell is implemented by subclasses and may mean different things:
        • training a decision system
        • running an evaluation scheme (T&T, CV, …)
        • buffering input data
        • …
     • Cell's interface for building:
        • build() (builds the cell)
        • erase() (erases the cell; it can be built again afterwards)

  21. Fixed cells
     • Cells that are not buildable are called fixed. They are usable right after construction or after setting parameters:
        • file reader
        • WEKA filter
        • …

  22. Parameterized cells
     • Cell's interface for parameterization:
        • set(String name, String value) (sets a parameter)
        • setParameters(Parameters) (sets all parameters at once)
        • getParameters() : Parameters (returns all parameters that are set)
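A string-keyed parameter set of this kind might look as follows. This is a minimal sketch under assumptions: the real Parameters class may store and copy values differently, and ParametersSketch and ParameterizedSketch are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a string-keyed parameter set.
class ParametersSketch {
    private final Map<String, String> values = new LinkedHashMap<>();

    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);   // null if the parameter is not set
    }

    ParametersSketch copy() {
        ParametersSketch c = new ParametersSketch();
        c.values.putAll(values);
        return c;
    }
}

// Hypothetical holder exposing the three parameterization methods.
class ParameterizedSketch {
    private ParametersSketch params = new ParametersSketch();

    // Sets a single parameter.
    void set(String name, String value) {
        params.set(name, value);
    }

    // Replaces all parameters at once; a copy is stored so that later
    // changes to the caller's object do not affect this holder.
    void setParameters(ParametersSketch p) {
        params = p.copy();
    }

    // Returns a copy of all parameters that are set.
    ParametersSketch getParameters() {
        return params.copy();
    }
}
```

Copying on the way in and out is a design choice of this sketch, not something the slides specify; it keeps a cell's configuration from changing behind its back.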

  23. State of the cell
     • EMPTY: cell has no content, cannot be used
     • CLOSED: content has been built, cell ready to use
     • OPEN: cell is being used now (generating samples of data)
     [State diagram: build() EMPTY → CLOSED; erase() CLOSED → EMPTY; open() CLOSED → OPEN; close() OPEN → CLOSED; next() stays in OPEN]

  24. …motivation
     • To check against access violations when the cell is accessed. Examples:
        • two cells try to retrieve data from a given cell at the same time
        • someone tries to use an empty cell
        • someone tries to reconnect cells during their activity
     • To simplify implementation of subclasses (new algorithms): they may safely assume that access is correct (build() before open(), open() before next(), …)
     • To detect bugs early, which is important in a heterogeneous system!

  25. How to override Cell
     • Methods to override:
        • onBuild()
        • onErase()
        • onOpen()
        • onNext()
        • onClose()
     • Public methods build(), … can't be overridden. They perform state checking and then call the corresponding on…() method
     • Like event handlers in event-driven programming
     • You do not have to override all of them! (e.g. a cell for reading data will not be buildable)
     • You can provide additional interface in your subclass
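The combination of state checking and on…() hooks is an instance of the template-method pattern, and can be sketched as below. This is not Datcracker's Cell: samples are Strings, open() returns nothing, and letting a non-buildable cell start in the CLOSED state is an assumption. CellSketch and CounterCell are hypothetical names.

```java
// Hypothetical sketch of final public methods that check state and
// then delegate to overridable hooks.
abstract class CellSketch {
    enum State { EMPTY, CLOSED, OPEN }

    private State state;

    // Assumption: fixed (non-buildable) cells start out CLOSED.
    protected CellSketch(boolean buildable) {
        state = buildable ? State.EMPTY : State.CLOSED;
    }

    public final void build() {
        require(State.EMPTY, "build");
        onBuild();
        state = State.CLOSED;
    }

    public final void erase() {
        require(State.CLOSED, "erase");
        onErase();
        state = State.EMPTY;
    }

    public final void open() {
        require(State.CLOSED, "open");
        onOpen();
        state = State.OPEN;
    }

    public final String next() {
        require(State.OPEN, "next");
        return onNext();
    }

    public final void close() {
        require(State.OPEN, "close");
        onClose();
        state = State.CLOSED;
    }

    public final State state() {
        return state;
    }

    private void require(State expected, String op) {
        if (state != expected) {
            throw new IllegalStateException(
                op + "() called in state " + state + ", expected " + expected);
        }
    }

    // Hooks default to no-ops so subclasses override only what they need.
    protected void onBuild() {}
    protected void onErase() {}
    protected void onOpen() {}
    protected void onClose() {}
    protected abstract String onNext();
}

// A fixed cell that overrides only onOpen() and onNext().
class CounterCell extends CellSketch {
    private int n;

    CounterCell() {
        super(false);
    }

    @Override
    protected void onOpen() {
        n = 0;
    }

    @Override
    protected String onNext() {
        return "sample-" + (n++);
    }
}
```

With this layout, calling next() on a CounterCell before open() fails with an IllegalStateException, so the hook code never has to check the call order itself.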

  26. Data representation
     • Data set split into samples
     • Sample:
        • data : Data (input data)
        • label : Data (associated decision label)
     • Separation of data and label:
        • useful for complex types of data/labels, e.g. in image processing (like segmentation)
        • useful for meta-learning algorithms, which operate on labels alone
        • labelled / unlabelled / partially labelled samples handled in the same way
     • Data: abstract base class, downcast by cells to the type they expect
     • Currently available subclasses:
        • NumericFeature, SymbolicFeature, DataVector
     • In the future: time series, images, special types of labels, ...

  27. Immutability
     • Data objects are immutable: they cannot be modified after creation (like the String class)
     • They can be freely shared among cells without risk of accidental modification:
        • safety
        • simplicity
        • efficiency:
           • no need to copy data between cells
           • no need for synchronization in multi-threaded execution
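The immutability property can be illustrated with a small value class. This is a sketch of the general technique and not Datcracker's DataVector; the class name ImmutableVectorSketch and its with() method are hypothetical.

```java
import java.util.Arrays;

// Hypothetical immutable numeric vector.
final class ImmutableVectorSketch {
    private final double[] values;

    ImmutableVectorSketch(double... values) {
        // Defensive copy: later changes to the caller's array are not seen.
        this.values = values.clone();
    }

    int size() {
        return values.length;
    }

    double get(int i) {
        return values[i];
    }

    // "Modification" returns a new object; the original is untouched.
    ImmutableVectorSketch with(int i, double v) {
        double[] copy = values.clone();
        copy[i] = v;
        return new ImmutableVectorSketch(copy);
    }

    @Override
    public String toString() {
        return Arrays.toString(values);
    }
}
```

Since no method exposes or changes the internal array, the same instance can be handed to several cells or threads without copying or locking.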

  28. Metadata
     • Many algorithms have to know the "type" of input data in advance, before data processing starts; this is the role of metadata
     • Separation of data and metadata: base class MetaData
     • Describes common properties of all Data objects generated in a given session:
        • number and types of features in a DataVector
        • dictionary of possible values of a SymbolicFeature
        • …
     • Each Data subclass has an associated MetaData subclass
     • Immutable!
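Why an algorithm benefits from knowing the value dictionary before the first sample arrives can be shown with a small example. This is a sketch under assumptions, not Datcracker's MetaData hierarchy: labels are plain Strings, and SymbolicMetaSketch and LabelCounterSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable metadata for a symbolic feature: the
// dictionary of values that samples in a session may take.
final class SymbolicMetaSketch {
    private final List<String> dictionary;

    SymbolicMetaSketch(String... values) {
        dictionary = Collections.unmodifiableList(
            new ArrayList<>(Arrays.asList(values)));
    }

    int size() {
        return dictionary.size();
    }

    // Position of a value in the dictionary, or -1 if it is unknown.
    int indexOf(String value) {
        return dictionary.indexOf(value);
    }
}

// A receiver that allocates its counters from metadata alone,
// before any sample has been read.
class LabelCounterSketch {
    private final SymbolicMetaSketch meta;
    private final int[] counts;

    LabelCounterSketch(SymbolicMetaSketch meta) {
        this.meta = meta;
        this.counts = new int[meta.size()];
    }

    void consume(String label) {
        counts[index(label)]++;
    }

    int count(String label) {
        return counts[index(label)];
    }

    private int index(String label) {
        int i = meta.indexOf(label);
        if (i < 0) {
            throw new IllegalArgumentException("label not in dictionary: " + label);
        }
        return i;
    }
}
```

A value outside the declared dictionary is rejected immediately, which fits the goal of detecting errors early in a heterogeneous system.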

  29. Future releases
     • Architecture
        • Multi-input and multi-output cells
        • Composite cells (e.g. meta-learning)
        • Serialization and copying
        • Progress info and suspension of cell building
     • Algorithms
        • cross-validation
        • data buffering
        • …
     • Data types
        • time series
        • …

  30. Home www.datcracker.org

  31. Thank You
