Using MongoDB in a Java Enterprise Application (TS-4656)

Ellen Kraffmiller - Technical Lead

Robert Treacy - Senior Software Architect

Institute for Quantitative Social Science

Harvard University




Agenda

What is Consilience?

Consilience Architecture

Why MongoDB?

Data Storage and Access Details

Using JPA vs. Mongo API


Q & A

Consilience Intro

Research Tool - Discovery

Grimmer, Justin, and Gary King. 2011. General Purpose Computer-Assisted Clustering and Conceptualization. Proceedings of the National Academy of Sciences.

Consilience History

  • 2010 - brainstorming, first mockups, prototypes, proof of concept, experimentation

  • Initially small document sets

    • Data could be loaded into memory from files

    • LCE calculations could be done on the fly

    • Derby used for user accounts, permissions, user history


  • Analyze document terms

  • Calculate clustering solutions for known methods

  • Map solutions to two-dimensional space

  • Generate LCE from user clicks on map

  • User discovers patterns in data, annotates and labels favorites

Consilience Pre-processing

The pre-processing pipeline (recovered from the slide diagram):

  • Txt files → get word frequency for each document → term-document matrix and stem-words file

  • Run clustering methods on the term-document matrix → cluster membership matrix

  • Calculate similarity matrix → projection to 2D clustering space → method points

  • Run K-means over the clustering space → grid-point membership
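The first pre-processing step, counting word frequencies per document, can be sketched in plain Java. This is a toy illustration of a term-document count matrix, not the project's ingest code; the class and method names are hypothetical:

```java
import java.util.*;

public class TermDocMatrix {
    // Build a term -> per-document count array from raw document strings.
    static Map<String, int[]> build(List<String> docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String token : docs.get(d).toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                int[] row = matrix.computeIfAbsent(token, k -> new int[docs.size()]);
                row[d]++;   // frequency of this term in document d
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the cat and the dog");
        Map<String, int[]> m = build(docs);
        // prints the per-document counts for the term "the"
        System.out.println(Arrays.toString(m.get("the")));
    }
}
```

In the real system the matrix would be far too large for a `TreeMap` of dense rows (the goal is 10 million documents per set), which is part of why the term data is persisted rather than held in memory.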

Three Stages of Java Processing

  • Document set ingest

  • Local Cluster Ensemble (LCE) Calculations

    • K-means

    • Labeling (Mutual Information)

  • Cluster Analysis (Clustering Space page)
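The K-means step above can be illustrated with a minimal, self-contained one-dimensional sketch. This is toy code for Lloyd's algorithm, not the project's implementation; all names are hypothetical:

```java
public class KMeansSketch {
    // One-dimensional K-means: returns the centroids after a few Lloyd iterations.
    static double[] kmeans(double[] points, double[] centroids, int iterations) {
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // assignment step: each point goes to its nearest centroid
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                sum[best] += p;
                count[best]++;
            }
            // update step: move each centroid to the mean of its assigned points
            for (int c = 0; c < centroids.length; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] cs = kmeans(pts, new double[]{0.0, 10.0}, 10);
        // prints the two cluster centers
        System.out.println(cs[0] + " " + cs[1]);
    }
}
```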

Data Requirements

  • Goal – manage 10 million documents/set

  • Each document set has

    • Document text files and original files

    • Text Analysis Data (Term Document Matrix, Stem Data)

    • Method Clustering solutions – cluster assignments for each method

    • Local Cluster Ensemble - Clustering solutions

Why MongoDB

  • Why not only SQL

    • Needed to consider persisting larger document sets

    • On-the-fly LCE calculations became impractical – they needed to be pre-calculated and persisted

    • Document set metadata could be any type

    • Need to efficiently handle potentially very large amounts of data

      • 10 Million documents with associated metadata, cluster memberships, 10,000 pre-calculated clustering solutions (grid points)

Why SQL

  • Already had working code written

  • Take advantage of transaction management for frequently updated data

  • SQL data works well with Web Server Security Realm

  • Data in SQL database is relatively small & manageable

Data Storage Details

  • Use Derby for read/write user-related data

    • user accounts

    • document set permissions

    • user workspace history

  • Use MongoDB for read-mostly document set data and clustering solutions

    • Document text, metadata, original files

    • Clustering analysis data

      • Word counts

      • Pre-calculated cluster memberships and keywords

Data Storage Details, continued

  • Combination of MongoDB collections and GridFS

    • Moving flat files into GridFS makes it easier to have multiple servers working on the data

    • MapReduce (future)

Data Access from Java

  • Derby JPA Entities

  • MongoDB  JPA Entities and MongoDB API

@NoSQL JPA Entities

GridPoint and Cluster Example

Parent Entity - GridPoint.java


@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint implements Serializable {

    @Field(name = "_id")
    protected String id;
    private double x;
    private double y;
    private double distance;
    private int numberOfMethods;
    private Long rangeId;
    private double[][] prototypes;

    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;

    // ... rest of class methods
}


Child Entity - Cluster.java



public class Cluster implements Serializable {

    @Field(name = "_id")
    private String id;
    private String labels;
    MongoGridPoint gridPoint;
    private Long rangeId;
    private List<MiWord> miWords;

    // ...
}



Embedded Entity - MiWord.java



public class MiWord implements Serializable {

    private String label;  // most common variation of the stem
    public double mi;      // mutual information value
    public int wordIndex;  // index of the word in wordDocMatrix

    // getters and setters ...
}


@NoSQL Mapping to MongoDB

  • Id field maps to String, not ObjectId

  • Bidirectional relationships not allowed (no "mappedBy") – use two one-directional relationships instead

  • @OneToMany – Id’s of child entities stored in the parent

  • No @OrderBy – MongoDB preserves the insertion order

  • Full description of mapping support


Saving GridPoint and Related Clusters

public void savePoint(EntityManager em, List<int[]> clusterDocIds, Long rangeId, GridCoordinate coord) {
    GridPoint gridPoint = new GridPoint();
    // ... set more fields

    List<MongoCluster> clusters = new ArrayList<>();
    for (int i = 0; i < clusterDocIds.size(); i++) {
        MongoCluster cluster = new MongoCluster();
        // ... populate cluster and add it to clusters (details not captured in the transcript)
    }
    // ... persist gridPoint and clusters
    saveClusterFiles(clusterDocIds, gridPoint.getId());
}


Calling savePoint() within a Transaction

public void savePoints(Long rangeId, Coordinate[] gridCoordinates) {
    EntityManagerFactory emf = Persistence.createEntityManagerFactory("mongo-" + getMongoHost());

    for (Coordinate coord : gridCoordinates) {
        List<int[]> clusterDocIds = calcClustering(coord.x, coord.y);
        EntityManager em = emf.createEntityManager();
        try {
            // transaction boundaries reconstructed; the original lines were lost in the transcript
            em.getTransaction().begin();
            savePoint(em, rangeId, coord.x, coord.y, clusterDocIds);
            em.getTransaction().commit();
        } catch (Exception e) {
            // ... exception handling
        } finally {
            em.close();
        }
    }
}

Persistence.xml: Defining Persistence Units


<persistence version="2.0"
             xmlns="http://java.sun.com/xml/ns/persistence"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence
                                 http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">

  <!-- ... persistence-unit definitions ... -->

</persistence>
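The persistence-unit XML on this slide did not survive the transcript. As a hedged sketch (assuming EclipseLink, which supplies the @NoSql annotation used earlier), a MongoDB persistence unit might look like the following; the unit name matches the `"mongo-" + getMongoHost()` pattern used later, and the class name, host, and database values are placeholders:

```xml
<persistence-unit name="mongo-localhost" transaction-type="RESOURCE_LOCAL">
  <!-- hypothetical fully qualified entity class name -->
  <class>model.GridPoint</class>
  <properties>
    <property name="eclipselink.target-database"
              value="org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform"/>
    <property name="eclipselink.nosql.connection-spec"
              value="org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec"/>
    <property name="eclipselink.nosql.property.mongo.host" value="localhost"/>
    <property name="eclipselink.nosql.property.mongo.port" value="27017"/>
    <property name="eclipselink.nosql.property.mongo.db" value="mydb"/>
  </properties>
</persistence-unit>
```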

Persistence.xml, continued

[The XML listing on this slide was not captured in the transcript.]


Composite Persistence Unit

We use separate persistence units for MongoDB and Derby, but "create-tables" no longer works.

Another option – a composite persistence unit:

[The XML listing on this slide was not captured in the transcript.]

With a composite persistence unit, you can map JPA relationships between SQL and NoSQL entities.
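As a hedged sketch of what such a composite persistence unit can look like (assuming EclipseLink's composite-unit support; the unit name and member jar names here are placeholders), the composite aggregates the SQL and NoSQL member units:

```xml
<persistence-unit name="composite-pu" transaction-type="RESOURCE_LOCAL">
  <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
  <!-- each member persistence unit is packaged in its own jar -->
  <jar-file>derby-member.jar</jar-file>
  <jar-file>mongo-member.jar</jar-file>
  <properties>
    <property name="eclipselink.composite-unit" value="true"/>
  </properties>
</persistence-unit>
```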

More Info:

Minor Issues with JPA @NoSQL

Have to restart GlassFish after modifying an Entity

Need to update persistence.xml with class names

JPA query language not fully supported – sometimes have to revert to native queries

Overhead of using EntityManagerFactory, EntityManager & transactions

Create a Set in Derby and MongoDB



public class DocSetService {

    @PersistenceContext(unitName = "text")
    protected EntityManager em;

    public void create(File parentDir, DocSet docSet) {
        try {
            em.persist(docSet);  // Derby side; line reconstructed, lost in the transcript
            MongoSetWrapper.ingestData(parentDir, docSet.getSetId());
        } catch (Exception e) {
            // other exception handling
        }
    }
}

Getting a Connection to MongoDB

public class MongoDB {

    private static MongoClient mongo;
    private static String dbName = "mydb";

    public static void init() {
        if (mongo == null) {
            try {
                mongo = new MongoClient(new ServerAddress(getHostName()));
            } catch (UnknownHostException e) {
                // exception handling
            }
        }
    }

    public static DB getMyDB() {
        return mongo.getDB(dbName);
    }
}

Create MongoSet with MongoDB API

private ObjectId createMongoSet(File setDir, String setId) {
    DBCollection coll = MongoDB.getMyDB().getCollection("MongoSet");
    BasicDBObject doc = new BasicDBObject("setId", setId);

    ArrayList<String> summaryFields = readSummaryFieldsFromFile(setDir);
    BasicDBList list = new BasicDBList();
    // ... fill list from summaryFields
    BasicDBList list2 = new BasicDBList();
    // ... fill list2 with all field names
    doc.append("summaryFields", list);
    doc.append("allFields", list2);

    coll.insert(doc);  // insert reconstructed; line lost in the transcript
    return (ObjectId) doc.get("_id");
}

Create Document with MongoDB API

private void createDocument(ObjectId mongoSetId, String doc_id, String filename, File setDir, HashMap metadata, File origFile) {
    DBCollection coll = MongoDB.getMyDB().getCollection("Document");
    BasicDBObject doc = new BasicDBObject("mongoSet_id", mongoSetId);
    doc.append("doc_id", doc_id);
    doc.append("filename", filename);
    doc.append("text", getDocumentText(setDir, filename));

    // HashMap can contain different types – difficult to do with JPA
    doc.put("metadata", new BasicDBObject(metadata));

    coll.insert(doc);  // insert reconstructed; line lost in the transcript
    createOrigDocument(origFile, doc_id, mongoSetId);
}

Creating a GridFS File

private void createOrigDocument(File origFile, String doc_id, ObjectId mongoSetId) {
    // put the original file into GridFS,
    // with the document id and MongoSet ObjectId as attributes
    try {
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        GridFSInputFile gif = gfs.createFile(origFile);
        gif.put("mongoSet_id", mongoSetId);
        gif.put("doc_id", doc_id);
        gif.save();  // save reconstructed; line lost in the transcript
    } catch (IOException e) {
        throw new ClusterException("Error saving orig doc " + origFile.getName(), e);
    }
}

Read a Document with MongoDB API

public Document getDocumentByIndex(ObjectId mongoSetId, int index) {
    BasicDBObject query = new BasicDBObject();
    query.put("mongoSet_id", mongoSetId);
    query.put("doc_id", Integer.toString(index));

    DBObject obj = MongoDB.getMyDB().getCollection("Document").findOne(query);
    Document doc = new Document(mongoSetId, (String) obj.get("doc_id"), (String) obj.get("filename"), (String) obj.get("text"));

    Object metadata = obj.get("metadata");
    // ... copy metadata onto doc when present

    // If there is no metadata for this document, just create a single metadata field, "Filename"
    HashMap filename = new HashMap();
    filename.put("filename", doc.getFilename());
    // ...

    return doc;
}

Reading a GridFS File


private ObjectId mongoSetId;
private String docId;
// ... other fields and methods

public GridFSDBFile getOrigDocument() {
    GridFS gfs = new GridFS(MongoDB.getMyDB());
    BasicDBObject query = new BasicDBObject().append("mongoSet_id", mongoSetId).append("doc_id", docId);
    GridFSDBFile file = gfs.findOne(query);
    return file;
}

// ... Servlet method for displaying the original document
protected void processRequest(HttpServletRequest request, HttpServletResponse response) {
    // ... get Document from request parameters
    GridFSDBFile origFile = doc.getOrigDocument();
    BufferedInputStream bis = new BufferedInputStream(origFile.getInputStream());
    // ... read from the stream and write to the response output
}

Accessing data with MongoDB API

  • “Wordier” than JPA access – have to write more code

  • Finer level of control over how data is stored and retrieved

  • Other ways to manage marshalling/unmarshalling – JSON.parse(), GSON

  • Other Mongo Persistence libraries: Morphia, Jongo


  • Polyglot storage is useful when you have different types of data - can take advantage of different database features

  • Using @NoSQL JPA is easier when you are familiar with the JPA API and are new to MongoDB

  • There are details of @NoSQL mapping to be aware of – you still need to understand MongoDB to use JPA effectively

  • MongoDB API is useful when you have fewer entities and need more control of storage and access, or when you are using GridFS

Thanks for coming!

  • Questions, Comments?