
Using MongoDB in a Java Enterprise Application

Ellen Kraffmiller - Technical Lead

Robert Treacy - Senior Software Architect

Institute for Quantitative Social Science

Harvard University

iq.harvard.edu

TS-4656


Agenda

What is Consilience?

Consilience Architecture

Why MongoDB?

Data Storage and Access Details

Using JPA vs. Mongo API

Summary

Q & A


Consilience Intro

Research Tool - Discovery

Grimmer, Justin, and Gary King. 2011. General Purpose Computer-Assisted Clustering and Conceptualization. Proceedings of the National Academy of Sciences.

http://j.mp/j4xyav


Consilience History

  • 2010 - brainstorming, first mockups, prototypes, proof of concept, experimentation

  • Initially small document sets

    • Data could be loaded into memory from files

    • LCE calculations could be done on the fly

    • Derby used for user accounts, permissions, user history


Workflow

  • Analyze document terms

  • Calculate clustering solutions for known methods

  • Map solutions to two-dimensional space

  • Generate LCE from user clicks on map

  • User discovers patterns in data, annotates and labels favorites


Consilience Pre-processing

Get word frequency for each document

Stemming

Term document matrix

Stem - words file

Txt files

MongoDB

Run clustering methods

Grid point membership

Labels

Method points

Projection to 2D clustering space

Labeling

Calculate similarity matrix

Run K-means

Cluster membership matrix


Three stages of Java processing

  • Document set ingest

  • Local Cluster Ensemble (LCE) Calculations

    • K-means

    • Labeling (Mutual Information)

  • Cluster Analysis (Clustering Space page)


Consilience Demo


Data Requirements

  • Goal – manage 10 million documents/set

  • Each document set has

    • Document text files and original files

    • Text Analysis Data (Term Document Matrix, Stem Data)

    • Method Clustering solutions – cluster assignments for each method

    • Local Cluster Ensemble - Clustering solutions


Why MongoDB

  • Why not only SQL

    • Needed to consider persisting larger document sets

    • On the fly LCE calculations became impractical – they needed to be pre-calculated and persisted

    • Document set metadata could be any type

    • Need to efficiently handle potentially very large amounts of data

      • 10 Million documents with associated metadata, cluster memberships, 10,000 pre-calculated clustering solutions (grid points)


Why SQL

  • Already had working code written

  • Take advantage of transaction management for frequently updated data

  • SQL data works well with Web Server Security Realm

  • Data in SQL database is relatively small & manageable


Data Storage Details

  • Use Derby for read/write user-related data

    • user accounts

    • document set permissions

    • user workspace history

  • Use MongoDB for read-mostly document set data and clustering solutions

    • Document text, metadata, original files

    • Clustering analysis data

      • Word counts

      • Pre-calculated cluster memberships and keywords


Data Storage Details

  • Combination of MongoDB collections and GridFS

    • Moving flat files to GridFS makes it easier to have multiple servers working on the data

    • MapReduce (future)


Derby and MongoDB Data Storage


Data access from Java

  • Derby JPA Entities

  • MongoDB - JPA Entities and MongoDB API


@NoSQL JPA Entities

GridPoint and Cluster Example


Parent Entity - GridPoint.java

@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint implements Serializable {

    @Id
    @GeneratedValue
    @Field(name = "_id")
    protected String id;

    private double x;
    private double y;
    private double distance;
    private int numberOfMethods;
    private Long rangeId;
    private double[][] prototypes;

    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;

    // ... rest of class methods
}


Child Entity - Cluster.java

@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Cluster implements Serializable {

    @Id
    @GeneratedValue
    @Field(name = "_id")
    private String id;

    private String labels;

    @ManyToOne
    GridPoint gridPoint;

    private Long rangeId;

    @ElementCollection
    private List<MiWord> miWords;

    // ...
}


Embedded Entity - MiWord.java

@NoSql(dataFormat = DataFormatType.MAPPED)
@Embeddable
public class MiWord implements Serializable {

    private String label;  // most common variation of the stem
    public double mi;      // mutual information value
    public int wordIndex;  // index of the word in wordDocMatrix

    // getters and setters ...
}


@NoSQL Mapping to MongoDB

  • Id field maps to String, not ObjectId

  • Bidirectional relationships not allowed (no "mappedBy") – use two unidirectional relationships instead

  • @OneToMany – IDs of child entities are stored in the parent

  • No @OrderBy – MongoDB preserves insertion order

  • Full description of mapping support

    • http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Advanced_JPA_Development/NoSQL/Configuring
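For example, the "two unidirectional relationships" rule means neither side of a parent/child pair uses mappedBy; each mapping is owned independently. A minimal sketch (field names are illustrative, modeled on the GridPoint/Cluster entities above):

```java
// Parent side owns a one-directional @OneToMany; the child IDs
// are embedded in the parent's MongoDB document.
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint implements Serializable {
    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;          // no mappedBy here
}

// Child side, if it needs to navigate back, declares its own
// independent one-directional @ManyToOne instead of an inverse side.
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Cluster implements Serializable {
    @ManyToOne
    private GridPoint gridPoint;             // separate owned mapping
}
```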


Saving GridPoint and related Clusters

public void savePoint(EntityManager em, List<int[]> clusterDocIds, Long rangeId, GridCoordinate coord) {
    GridPoint gridPoint = new GridPoint();
    gridPoint.setRangeId(rangeId);
    // ... set more fields

    List<Cluster> clusters = new ArrayList<>();
    for (int i = 0; i < clusterDocIds.size(); i++) {
        Cluster cluster = new Cluster();
        cluster.setClusterSize(clusterDocIds.get(i).length);
        cluster.setRangeId(rangeId);
        cluster.setMiWords(miWordList);  // miWordList calculated earlier (elided)
        clusters.add(cluster);
    }
    gridPoint.setClusters(clusters);
    em.persist(gridPoint);
    em.flush();
    saveClusterFiles(clusterDocIds, gridPoint.getId());
}


Calling savePoint() within Transaction

public void savePoints(Long rangeId, GridCoordinate[] gridCoordinates) {
    EntityManagerFactory emf =
            Persistence.createEntityManagerFactory("mongo-" + getMongoHost());
    try {
        for (GridCoordinate coord : gridCoordinates) {
            List<int[]> clusterDocIds = calcClustering(coord.x, coord.y);
            EntityManager em = emf.createEntityManager();
            em.getTransaction().begin();
            savePoint(em, clusterDocIds, rangeId, coord);
            em.getTransaction().commit();
            em.close();
        }
    } catch (Exception e) {
        rollback(rangeId);
    }
}


Persistence.xml: Defining Persistence Units

<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0" xmlns="http://java.sun.com/xml/ns/persistence"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">

  <persistence-unit name="text" transaction-type="JTA">
    <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
    <jta-data-source>jdbc/text</jta-data-source>
    <properties>
      <property name="eclipselink.ddl-generation" value="create-tables"/>
      <property name="eclipselink.cache.shared.default" value="false"/>
    </properties>
  </persistence-unit>


Persistence.xml, continued

  <persistence-unit name="mongo-localhost" transaction-type="RESOURCE_LOCAL">
    <class>edu.harvard.iq.text.model.GridPoint</class>
    <class>edu.harvard.iq.text.model.Cluster</class>
    <properties>
      <property name="eclipselink.target-database"
                value="org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform"/>
      <property name="eclipselink.nosql.connection-spec"
                value="org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec"/>
      <property name="eclipselink.nosql.property.mongo.port" value="27017"/>
      <property name="eclipselink.nosql.property.mongo.host" value="localhost"/>
      <property name="eclipselink.nosql.property.mongo.db" value="mydb"/>
      <property name="eclipselink.logging.level" value="SEVERE"/>
    </properties>
  </persistence-unit>

  <persistence-unit name="mongo-ip-10-205-17-163.ec2.internal" transaction-type="RESOURCE_LOCAL">
    . . .
    <property name="eclipselink.nosql.property.mongo.host" value="ip-10-205-17-163.ec2.internal"/>
    . . .
  </persistence-unit>
</persistence>


Composite Persistence Unit

We use separate persistence units for MongoDB and Derby, but then "create-tables" no longer works

Another option – composite persistence unit:

<persistence-unit name="composite-pu" transaction-type="RESOURCE_LOCAL">
  <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
  <jar-file>\lib\polyglot-persistence-rational-pu-1.0-SNAPSHOT.jar</jar-file>
  <jar-file>\lib\polyglot-persistence-nosql-pu-1.0-SNAPSHOT.jar</jar-file>
  <properties>
    <property name="eclipselink.composite-unit" value="true"/>
  </properties>
</persistence-unit>
</persistence>

With composite persistence unit, you can map JPA relationships between SQL and NoSQL entities

More Info: http://java.dzone.com/articles/polyglot-persistence-0
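As a hypothetical sketch of what a cross-store relationship could look like under a composite unit (the UserWorkspace entity and its fields are invented for illustration; only GridPoint comes from the slides): an entity in the relational member unit can hold a JPA relationship to a NoSQL entity in the other member unit.

```java
// Hypothetical Derby-side entity in the relational member unit,
// referencing a GridPoint stored in the MongoDB member unit.
@Entity
public class UserWorkspace implements Serializable {

    @Id
    @GeneratedValue
    private Long id;

    // Cross-store relationship, resolved by the composite persistence unit
    @ManyToOne
    private GridPoint lastViewedPoint;
}
```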


Minor Issues with JPA @NoSQL

Have to restart GlassFish after modifying an Entity

Need to update persistence.xml with class names

JPA Query language not fully supported – sometimes have to revert to native query

Overhead of using EntityManagerFactory, EntityManager & transactions
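The native-query fallback can look like this with EclipseLink's Mongo adapter, which accepts MongoDB's native command language through createNativeQuery (a sketch; the collection name and gridPointId variable are illustrative):

```java
// When JPQL support falls short, drop down to a MongoDB native query.
// EclipseLink maps the result rows back onto the entity class.
Query query = em.createNativeQuery(
        "db.GridPoint.findOne({\"_id\":\"" + gridPointId + "\"})",
        GridPoint.class);
GridPoint point = (GridPoint) query.getSingleResult();
```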


Accessing Data with MongoDB API


Create a Set in Derby and MongoDB

@Stateless
@Named
public class DocSetService {

    @PersistenceContext(unitName = "text")
    protected EntityManager em;

    public void create(File parentDir, DocSet docSet) {
        try {
            MongoSetWrapper.ingestData(parentDir, docSet.getSetId());
            em.persist(docSet);
        } catch (Exception e) {
            MongoSetWrapper.rollbackData(docSet.getSetId());
            // other exception handling
        }
    }
}


Getting a connection to MongoDB

public class MongoDB {

    private static MongoClient mongo;
    private static String dbName = "mydb";

    public static void init() {
        if (mongo == null) {
            try {
                mongo = new MongoClient(new ServerAddress(getHostName()));
            } catch (UnknownHostException e) {
                // ... do exception handling
            }
        }
    }

    public static DB getMyDB() {
        init();
        return mongo.getDB(dbName);
    }
}


Create MongoSet with MongoDB API

private ObjectId createMongoSet(File setDir, String setId) {
    DBCollection coll = MongoDB.getMyDB().getCollection("MongoSet");
    BasicDBObject doc = new BasicDBObject("setId", setId);

    ArrayList<String> summaryFields = readSummaryFieldsFromFile(setDir);
    BasicDBList list = new BasicDBList();
    list.addAll(summaryFields);

    BasicDBList list2 = new BasicDBList();
    list2.addAll(readAllFieldsFromFile(setDir));

    doc.append("summaryFields", list);
    doc.append("allFields", list2);
    coll.insert(doc);
    return (ObjectId) doc.get("_id");
}


Create Document with MongoDB API

private void createDocument(ObjectId mongoSetId, String doc_id, String filename,
        File setDir, HashMap metadata, File origFile) {
    DBCollection coll = MongoDB.getMyDB().getCollection("Document");
    BasicDBObject doc = new BasicDBObject("mongoSet_id", mongoSetId);
    doc.append("doc_id", doc_id);
    doc.append("filename", filename);
    doc.append("text", getDocumentText(setDir, filename));
    // HashMap can contain different types – difficult to do with JPA
    doc.put("metadata", new BasicDBObject(metadata));
    coll.insert(doc);
    createOrigDocument(origFile, doc_id, mongoSetId);
}


Creating a GridFS File

private void createOrigDocument(File origFile, String doc_id, ObjectId mongoSetId) {
    try {
        // put the original file into GridFS,
        // with the document id and MongoSet ObjectId as attributes
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        GridFSInputFile gif = gfs.createFile(origFile);
        gif.setFilename(origFile.getName());
        gif.put("mongoSet_id", mongoSetId);
        gif.put("doc_id", doc_id);
        gif.save();
    } catch (IOException e) {
        throw new ClusterException("Error saving orig doc " + origFile.getName(), e);
    }
}


Read a Document with MongoDB API

public Document getDocumentByIndex(ObjectId mongoSetId, int index) {
    BasicDBObject query = new BasicDBObject();
    query.put("mongoSet_id", mongoSetId);
    query.put("doc_id", Integer.toString(index));

    DBObject obj = MongoDB.getMyDB().getCollection("Document").findOne(query);
    Document doc = new Document(mongoSetId, (String) obj.get("doc_id"),
            (String) obj.get("filename"), (String) obj.get("text"));

    Object metadata = obj.get("metadata");
    if (metadata != null) {
        doc.setMetadata((HashMap) metadata);
    } else {
        // If there is no metadata for this document,
        // just create a single metadata field, "filename"
        HashMap filename = new HashMap();
        filename.put("filename", doc.getFilename());
        doc.setMetadata(filename);
    }
    return doc;
}


Reading a GridFS File

class Document {

    private ObjectId mongoSetId;
    private String docId;
    // ... other fields and methods

    public GridFSDBFile getOrigDocument() {
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        BasicDBObject query = new BasicDBObject()
                .append("mongoSet_id", mongoSetId)
                .append("doc_id", docId);
        return gfs.findOne(query);
    }
}

// ... Servlet method for displaying the original document
protected void processRequest(HttpServletRequest request, HttpServletResponse response) {
    // ... get Document from request parameters
    GridFSDBFile origFile = doc.getOrigDocument();
    BufferedInputStream bis = new BufferedInputStream(origFile.getInputStream());
    // ... read from the stream and write to the response output
}


Accessing data with MongoDB API

  • “Wordier” than JPA access – have to write more code

  • Finer level of control over how data is stored and retrieved

  • Other ways to manage marshalling/unmarshalling – JSON.parse(), GSON

  • Other Mongo Persistence libraries: Morphia, Jongo
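As a rough sketch of what one of those libraries looks like in practice, legacy Morphia maps annotated classes to collections with far less boilerplate than the raw API (assumptions: the legacy org.mongodb.morphia API, and a GridPoint class carrying Morphia's own annotations, which differ from JPA's):

```java
// Legacy Morphia: map an annotated entity, then save/query through a Datastore
Morphia morphia = new Morphia();
morphia.map(GridPoint.class);                          // register the mapped class
Datastore ds = morphia.createDatastore(new MongoClient(), "mydb");

ds.save(gridPoint);                                    // insert/update one document
GridPoint loaded = ds.createQuery(GridPoint.class)     // fluent query API
        .field("rangeId").equal(rangeId)
        .get();
```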


Summary

  • Polyglot storage is useful when you have different types of data - can take advantage of different database features

  • Using @NoSQL JPA is easier when you are familiar with the API and are new to MongoDB.

  • There are details of @NoSQL mapping that you need to be aware of – you still need to understand MongoDB to use JPA effectively

  • MongoDB API is useful when you have fewer entities, and you need more control of storage and access, or you are using GridFS.


Thanks for coming!

  • Questions, Comments?

