
DBMS Group Overview of our research

Carlos Ordonez, University of Houston. DBMS Group: Overview of our research. Outline: Research on DBS overview; Research topics; Papers; Working with me; Advice & Recommendations; Members. Notes: Hands on; no math or code! Same presentation I give to my students, once a year. Take notes!


Presentation Transcript


  1. Carlos Ordonez, University of Houston • DBMS Group: Overview of our research

  2. Outline • Research on DBS overview • Research topics • Papers • Working with me • Advice & Recommendations • Members

  3. Notes • Hands on • no math or code! • Same presentation I give to my students • Once a year • Take notes!

  4. OVERVIEW: AREAS & PROGRAMMING

  5. DB systems today • Modeling: ER, UML, temporal, workflows, metadata, docs • Query languages: relational, logic • File systems and indexing: blocked, row store, B-trees, hash, bitmap • Query optimization: SPJA, recursion • Transaction processing: ACID, 2PL • Analytics: OLAP cubes and DM • Non-relational: provenance, keyword, column, XML, sensors, probabilistic

  6. Research: Database Systems • Both CS main areas • Theory: sets, external algorithms, discrete math, relational, complexity, cost analysis • Systems: large indexed files, query optimizer • Software • Languages: C++ and SQL, but also Java and C#. • Systems: DBMS, Unix, MS-DOS • Math Tools: R, WEKA, SAS, Matlab • Libraries: LAPACK, BLAS

  7. Knowledge Background • Computer Science: • algorithms: linear, blocked, complexity, parallel • data structures: external, secondary storage • OS: multithreaded programming, file systems, parallel programming (shared-nothing) • Compilers: parsing, optimization, query languages • Mathematics: • Discrete math: sets, combinatorial math, graphs, boolean algebra, algebra, OR • Continuous math: probability, multivariate statistics, numerical methods, optimization

  8. RESEARCH TOPICS

  9. Research Topics • Main: • Integrating data mining algorithms with a DBMS • Other: • Query optimization: data pre-processing, linear recursive queries, transformation, cubes • Medical data mining (heart disease, cancer) • Data quality: referential integrity, distributed DBs • Information retrieval, keyword search, database and document integration, recommendation

  10. Research Topic: Data Mining • Analytics • OLAP Cubes • Statistics: probability, multivariate stats, categorical data analysis, Bayesian, time series • Machine Learning, but Statistical Learning; much less pure AI: generative models • NOT • generic DM algorithms (complex math on flat files) • “mumbo jumbo” AI algorithms (Bayesian nets, SVM, Kohonen, FA, GMM, MRF, MCMC, ROC)

  11. What we don’t do • Algorithms incompatible with a traditional DBMS • AI, machine learning (saturated) • Text mining • image processing, pattern recognition • search engine infrastructure • generic data mining

  12. Specific topics, all “in a DBMS” • Numerical analysis • Bayesian statistics • Machine Learning models • Linear recursive queries • Measuring and repairing referential integrity • OLAP cubes, association rules and graphs • Keyword search • Query recommendation

  13. PAPERS

  14. Proceedings vs journals • Only CS has proceedings. Likely to change in the future • Most journal papers appear in proceedings first • Thomson Reuters is neutral to all areas, but incomplete and $$

  15. A good CS publication: check yourself in Google! • DBLP • ACM • Has citations in Google Scholar • Has impact factor on MS Academic Search • A good paper in other areas: • Math: MathSciNet • Medical: PubMed • Thomson Reuters

  16. Preferred publications: proceedings • Top proceedings • 1st: ACM (factor >20, SIG*) • 2nd: IEEE (stronger in imaging, AI, medical) • 3rd: LNCS from Springer • 4th: other, no proceedings, high % papers accepted, unknown PC members, local

  17. Preferred publications: journals • 1st in CS: ACM, IEEE • 1st outside CS: SIAM, AMA, ASA • 2nd: Elsevier, Springer, IOS Press, Kluwer (Metapress) • 3rd: IGI, Wiley, Blackwell, Inderscience, KIISE • Impact factor (IF): • Thomson Reuters, Microsoft Academic, not yet Google Scholar • IF above 0.5 is OK, 1 is good, 2 is very good.

  18. WORKING WITH ME

  19. Recommendations • Read DBLP • Check ACM • Read DBWorld • Read CACM • Read SIGMOD site

  20. Students: My experience • You need a lot of guidance • Most of you do not read papers on your own • Not everyone can do research: not motivated, lack initiative, lack knowledge • Few have good DB systems and math background • Some people are slow/bad programmers • Math background varies, but generally not good enough in linear algebra, calculus, multivariate statistics and discrete math

  21. Classification of students • BAD • LAZY: excuses • DUMB: does not understand, sleepy, wishful thinking • LOST: I want to work with someone else on their project • RANDOM: X+Y+.. different ideas without thought • GOOD • CREATIVE: I have this idea backed by these experimental results • WINNER: I made several comparisons and my program is faster or more accurate • CRAZY: I read this paper and I think my idea is better because of XYZ reasons • SCIENTIST: I have a draft of a paper with theory and experiments, here it is

  22. Bad student: typical comments • This stuff looks too difficult • Why resubmit a rejected paper? Not worth it • I cannot write well; it is hard and boring • I cannot follow your notation, I prefer to use one I invented; mine is better • I did some programming, but it is at home/USB • Finally here is my paper (a draft paper full of spelling errors and disconnected phrases) • I forgot to tell you I was not coming (yesterday, another day) • I did not read any new papers • Where is your web page? What is ACM? What is DBLP? What is DBWorld? • Can you help me debug my program? • Random ideas to do new stuff

  23. Good student: am I dreaming? • I did some theoretical analysis that I want to discuss with you • I think this aspect from one of your papers or this famous paper can be improved • I have submitted two papers every semester • I have at least one paper accepted per year • My paper is ready two weeks before the deadline • Can we briefly meet on the weekend? • I would like to stay late (8pm) two days per week, can you let me know when you are around?

  24. My requirements • B.S.: open • MS: One ACM/IEEE accepted paper; journal paper desirable, but not required • PhD • Preferred: 2 accepted journal papers; 5 papers total • Alternative: 10+ proceedings papers • 3+ papers on the same topic; 1+ paper per year • Defend dissertation between 4th and 5th year

  25. Research method • Hypothesis • New knowledge • In context, knowledge boundary • Important, current, needed • Tell me something I don't know • Prove it • Theory: theorem (property, bound, existence, ...) • Experiment: feasible, better, faster • Theory versus Experiment? Mention Einstein

  26. Work routine • Send me status every week, Thursday morning preferred • Send paper, results, comments BEFORE we talk • My time is PRIME time. Do not waste it. • Compile experimental results in an .xls spreadsheet • Collect all status and my answers in a log file • Avoid excuses; avoid “forgetting” email • We can talk on the phone in the morning preferably (<30 minutes)

  27. My commitment • Answer e-mail within 2 days unless traveling • Can call you any time • Guidance to submit papers • Give you lead in proceedings papers • Take lead in journal papers

  28. Deliverables • Experiments: spreadsheet with one sheet per week (main ingredient for a strong paper) • Program: C++ or Java modules that can be independently called (easy to use by someone else) • Source code: follow math notation from my papers, index arrays from 1, log comments, parameters, config • Papers: LaTeX, db.bib, send me PDF in ACM 2-column format by default

  29. How you will be when finished • Understand how to apply math, theory • Strong C++, SQL programmer • More organized, proactive, creative • Better writing, better presenter

  30. Job prospect: excellent • Corporate world runs on DBMSs, academics underexploit DBMSs • DBMS industry: Oracle, Microsoft, Teradata, Greenplum, IBM • Search Engines: Google, Yahoo • Academic positions: always vacant positions due to industry demand

  31. Advice

  32. Quotes • Datta: Every paper has a home • Ezquerra: A significant knowledge contribution should appear in journal form • Vardi: journals should be #1 • Journal papers last forever (archival) • CO: Citation impact: 0 worthless, 1 even, 5+ OK, 10+ good, 100+ famous • Gray: There are lies, damn lies and benchmarks • Europe: Theory is the compass of CS programming • People do care who is 1st author, except in theory; being nth author among 4+ authors is not worth much; 2-author papers are optimal, 3 authors is the upper bound • CO: Free ride OK in proceedings; rarely a free ride in journals

  33. CS vs other areas • A person in CS will not respect you if you do not have a paper in a top conference • A person outside CS will not respect you if you do not have a journal paper • People in industry will not respect you if you do not have a patent, but proceedings help!

  34. Advice (mainly academic PhD) • Avoid gaps in your publication record (year without papers) • Always have journal papers in the pipeline • Submit to top conference every year • Choose 1 or 2 applications, careful 3+ • Paper count • Ph.D. without many papers is worthless, 1 journal • M.S. with 1 paper is exceptional • B.S. 1 paper automatic acceptance to PhD

  35. Recommendations: programming • Try ideas soon (discussed with me) • Always ask yourself: is it O(n)? • Index arrays from 1..d, 1..n in C++ and Java (example: double X[d+1]; X[1]=0;) • Choose an acronym for your research; 8 letters (lrq, udfmodel, olaptest, ssvs) • Every file you send me should have the acronym as prefix • Parameter passing for experiments and GUI; every algorithm has one main call • C++: follow GNU C++, plain editor preferred • SQL: reserved keywords uppercase; tables, columns lower case; indent; one term per line
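The indexing and "one main call" recommendations on this slide can be sketched roughly as follows. This is only an illustration, not code from the group: the acronym prefix `colmean`, the struct `ColmeanParams`, and both function names are hypothetical, and `std::vector` is used in place of the raw array in the slide's example so the size can be chosen at run time. Index 0 is allocated but left unused, so loops run over 1..n and 1..d as in the math notation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle: all experiment parameters travel together.
struct ColmeanParams {
    std::size_t n;  // number of points, indexed 1..n
    std::size_t d;  // number of dimensions, indexed 1..d
};

// Allocates an (n+1) x (d+1) matrix filled with zeros.
// Row 0 and column 0 are padding so that valid indices are 1..n and 1..d.
std::vector<std::vector<double>> colmean_make_matrix(const ColmeanParams& p) {
    return std::vector<std::vector<double>>(
        p.n + 1, std::vector<double>(p.d + 1, 0.0));
}

// Single entry point for the algorithm: the mean of each dimension.
// Returns a vector of size d+1; element 0 is unused.
// Cost: one pass over the data, O(n*d), i.e. linear in n for fixed d.
std::vector<double> colmean_main(const ColmeanParams& p,
                                 const std::vector<std::vector<double>>& X) {
    assert(p.n > 0);
    assert(X.size() == p.n + 1);
    std::vector<double> mu(p.d + 1, 0.0);
    for (std::size_t i = 1; i <= p.n; ++i) {
        assert(X[i].size() == p.d + 1);
        for (std::size_t a = 1; a <= p.d; ++a) {
            mu[a] += X[i][a];
        }
    }
    for (std::size_t a = 1; a <= p.d; ++a) {
        mu[a] /= static_cast<double>(p.n);
    }
    return mu;
}
```

A caller would fill `X[1..n][1..d]`, pass the same `ColmeanParams` to both functions, and read the result from `mu[1..d]`. The padding costs one extra row and column, which is the price of keeping the code aligned with 1-based notation.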

  36. Recommendations: writing • 1st Section 4! End with abstract & related work • Novel writing: be clear Section 2, increase interest Section 3, deliver in Section 4, political Section 5 • Avoid creating multiple versions of same file; instead use source control • Write in ACM format by default (www-acm.cls) • Use our master db.bib • Put everything in zip file with acronym • Log main changes • Keep paper reviews handy

  37. RESEARCH ACCOMPLISHMENTS

  38. Visit my web page • Members • Articles • Courses

  39. Research Accomplishments • Over 70 articles • 20 journal articles, 15 in Thomson Reuters • 50 ACM/IEEE proceedings • 800 citations in Google Scholar • 15 patent applications; 9 patents • 57 articles on DBLP, 56 on ACM • H-index=14 per Google Scholar, 8 MS • Students • 1 PhD dissertation: Javier Garcia-Garcia • 6 MS theses • 2 PhD dissertations in progress (Zhibo, Carlos G)

  40. My Publications • Conferences: • 1st: SIGMOD, CIKM, KDD • 2nd: ICDM, ICDE, ADL • 3rd: MLDM • Journals: • 1st: IEEE TKDE, IEEE TITB, ACM TODS • 2nd: DKE (Elsevier), KAIS (Springer), DSS (Elsevier), IDA (IOS Press) • 3rd: JCSE

  41. Members • 4 PhD students: • Zhibo Chen (US, Halliburton) • Carlos Garcia-Alvarado (Mexico, Greenplum) • Mario Navas (Ecuador) • Sasi K. Pitchaimalai (India)

  42. Old and new students • Old MS students: • Anu Goyal (India) • Kai Zhao (China) • Georgey Golovko (Ukraine) • Waree Rinsurong (Thailand) • Ahmad Qwasmeh (Jordan) • Rengan Xu (China) • New MS students: • Manish Limaye (India) • Manas Saha (India) • Naveen Mohanam (India)
