The Shocking Details of genome.ucsc.edu

History of the Code

  • Started in 1999 in C after Java proved hopelessly unportable across browsers.

  • Early modules include a Worm genome browser (Intronerator), and GigAssembler which produced working draft of human genome.

  • In 2001 a few other grad students started working on the code.

  • In 2002 hired staff to help with Genome Browser

  • Currently project employs ~20 full time people.

The Genome Browser Staff

  • 5 programmers - Mark, Angie, Hiram, Kate, Rachel, Fan, Jim

  • 4 quality assurance engineers - Heather, Bob, Mike, Galt

  • 3 post-docs - Terry, Gill, Katie

  • 9 grad students - Chuck, Daryl, Brian, Robert, Yontao, Krish, Adam, Ryan, Andy

  • 3 system administrators - Paul, Jorge, Patrick

  • 1 writer - Donna

  • David Haussler and CBSE Staff

  • About 1/3 of staff (including me 3 days a week) telecommutes.


The Goal

Make the human genome

understandable by humans.

Prognosis

Maybe we’ll understand it one of these days

Add Your Own Tracks

  • Users can extend the browser with their own tracks.

  • User tracks can be private or public.

  • No programming required.

  • GFF, GTF, PSL or BED formats supported

    #chrom start end [name strand score …]

    chr1 1302347 1302357 SP1 + 800

    chr1 1504778 1504787 SP2 - 980
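The column layout above can be read with a few lines of C. This is a minimal sketch, assuming whitespace-separated fields and that only the first three columns are required; the `bedLine` structure and the `bedLineParse` function are hypothetical names made up for illustration, not part of the browser code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct bedLine
/* Hypothetical record for one BED line. Field names are illustrative. */
    {
    char chrom[32];     /* Chromosome name, such as "chr1". */
    unsigned start;     /* Start position. */
    unsigned end;       /* End position. */
    char name[64];      /* Optional feature name, "" if absent. */
    char strand;        /* Optional '+' or '-', '.' if absent. */
    int score;          /* Optional score, 0 if absent. */
    };

int bedLineParse(const char *line, struct bedLine *bed)
/* Parse a whitespace-separated BED line into bed. Returns the number of
 * fields read (3 to 6), or 0 for a comment, blank, or malformed line. */
{
char strand[2] = ".";
int count;
assert(line != NULL && bed != NULL);
memset(bed, 0, sizeof(*bed));
bed->strand = '.';
if (line[0] == '#' || line[0] == '\0')
    return 0;
count = sscanf(line, "%31s %u %u %63s %1s %d",
    bed->chrom, &bed->start, &bed->end, bed->name, strand, &bed->score);
if (count < 3 || bed->end < bed->start)
    return 0;
if (count >= 5)
    bed->strand = strand[0];
return count;
}
```

Calling `bedLineParse` on the SP1 line above fills in all six fields; a line with only chrom, start and end is accepted with the optional fields left at their defaults.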

The Underlying Database

  • Power users and bioinformaticians sometimes want underlying database.

  • There is a table for each track.

  • Larger tracks have a table for each chromosome.

  • Format of a track table generally similar to add-your-own track formats.

  • Pieces of database available from ‘tables’ browser.

  • Whole database available as tab-separated files.

  • Most of database served via DAS.

Parasol and Kilo Cluster

  • UCSC cluster has 1000 CPUs running Linux

  • 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment

  • We wrote Parasol job scheduler to keep up.

    • Very fast and free.

    • Jobs are organized into batches.

    • Error checking at job and at batch level.

Coding: Discipline Is Required

  • While software development is immune from almost all physical laws, entropy hits us hard. - The Pragmatic Programmer

  • To keep the system from devolving into disorder we have to follow code conventions and insist on a lot of testing.

  • We use CVS (Concurrent Versions System) to help all of us work on the same code at once.

Obtaining the Code from CVS

  • See

  • This gets you a ‘sandbox’ - a local copy of the source to compile and edit.

  • Type ‘make’ in the lib and utilities directory.

  • You can do a ‘cvs update’ to get our updates to the code base.

  • To add permanently to code base email me to enable ‘cvs commit’

Lagging Edge Software

  • C language - compilers still available!

  • CGI Scripts - portable if not pretty.

  • SQL database - at least MySQL is free.

Problems with C

  • Missing booleans and strings.

  • No real objects.

  • Must free things

Advantages of C

  • Very fast at runtime.

  • Very portable.

  • Language is simple.

  • No tangled inheritance hierarchy.

  • Excellent free tools are available.

  • Libraries and conventions can compensate for language weaknesses.

Coping with Missing Data Types in C

  • #define boolean int

  • Fixing lack of real string type much harder

    • lineFile/common modules and autoSql code generator make parsing files relatively painless

    • dyString module not a horrible string ‘class’
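To make the idea of a string 'class' concrete, here is a minimal sketch of a growable string built from a structure and a family of functions. The `growString` name and its functions are hypothetical and are not the dyString interface; the sketch only assumes that such a module keeps a buffer, a length, and a capacity, and doubles the buffer when it fills.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define boolean int   /* Same trick as on the slide. */

struct growString
/* Hypothetical growable string. Not the real dyString interface. */
    {
    char *string;     /* Null terminated contents. */
    size_t size;      /* Length of contents, not counting terminator. */
    size_t capacity;  /* Bytes allocated for string. */
    };

struct growString *growStringNew(size_t initialCapacity)
/* Allocate an empty string. Returns NULL if out of memory. */
{
struct growString *gs = calloc(1, sizeof(*gs));
if (gs == NULL)
    return NULL;
gs->capacity = (initialCapacity > 0 ? initialCapacity : 16);
gs->string = calloc(gs->capacity, 1);
if (gs->string == NULL)
    {
    free(gs);
    return NULL;
    }
return gs;
}

boolean growStringAppend(struct growString *gs, const char *text)
/* Append text, doubling the buffer as needed. Returns 0 if out of memory. */
{
size_t len, needed;
assert(gs != NULL && text != NULL);
len = strlen(text);
needed = gs->size + len + 1;
if (needed > gs->capacity)
    {
    size_t newCapacity = gs->capacity;
    char *newString;
    while (newCapacity < needed)
        newCapacity *= 2;
    newString = realloc(gs->string, newCapacity);
    if (newString == NULL)
        return 0;
    gs->string = newString;
    gs->capacity = newCapacity;
    }
memcpy(gs->string + gs->size, text, len + 1);  /* Copies terminator too. */
gs->size += len;
return 1;
}

void growStringFree(struct growString **pGs)
/* Free string and set pointer to NULL. */
{
struct growString *gs = *pGs;
if (gs != NULL)
    {
    free(gs->string);
    free(gs);
    *pGs = NULL;
    }
}
```

The free function takes the address of the pointer so that it can null it out, which matches the style of the `dnaSeqFree` prototype shown later in these slides.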

Object Oriented Programming in C

  • Build objects around structures.

  • Make families of functions with names that start with the structure name, and that take the structure as the first argument.

  • Implement polymorphism/virtual functions with function pointers in structure.

  • Inheritance is still difficult. Perhaps this is not such a bad thing.


struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's. Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeq's. */


struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;  /* Next in list. */
    char *name;              /* Object name. */
    int x,y,width,height;    /* Bounds of object. */
    void (*draw)(struct screenObj *obj);  /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                             /* Return true if x,y is in object */
    void *custom;            /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);
                             /* Free custom data. */
    };

#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
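As a rough illustration of how the function pointers give polymorphism, the sketch below builds two kinds of object that share one structure but answer the `in` question differently. It uses a trimmed copy of `screenObj` without the custom data fields; `boxIn`, `ballIn`, `screenObjNew` and the `screenObjIn` macro are hypothetical names added for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define boolean int

struct screenObj
/* Trimmed copy of the slide's structure, without the custom data. */
    {
    struct screenObj *next;  /* Next in list. */
    char *name;              /* Object name. */
    int x,y,width,height;    /* Bounds of object. */
    void (*draw)(struct screenObj *obj);
    boolean (*in)(struct screenObj *obj, int x, int y);
    };

#define screenObjDraw(obj) ((obj)->draw(obj))
#define screenObjIn(obj, x, y) ((obj)->in((obj), (x), (y)))

void textDraw(struct screenObj *obj)
/* Stand-in for real drawing: print the name and bounds. */
{
printf("%s at %d,%d size %dx%d\n", obj->name, obj->x, obj->y,
    obj->width, obj->height);
}

boolean boxIn(struct screenObj *obj, int x, int y)
/* Return true if x,y is inside the rectangular bounds. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

boolean ballIn(struct screenObj *obj, int x, int y)
/* Return true if x,y is inside the circle whose diameter is width. */
{
int radius = obj->width/2;
int dx = x - (obj->x + radius);
int dy = y - (obj->y + radius);
return dx*dx + dy*dy <= radius*radius;
}

struct screenObj *screenObjNew(char *name, int x, int y, int width, int height,
    void (*draw)(struct screenObj *obj),
    boolean (*in)(struct screenObj *obj, int x, int y))
/* Allocate an object and plug in its 'virtual' functions. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    return NULL;
obj->name = name;
obj->x = x;
obj->y = y;
obj->width = width;
obj->height = height;
obj->draw = draw;
obj->in = in;
return obj;
}
```

Code that walks a list of objects only ever calls `screenObjDraw` and `screenObjIn`, so a new kind of object needs new functions but no change to the callers.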

Naming Conventions

  • Code is constrained by few natural laws.

  • There are many ways to do things, so programmers make arbitrary decisions.

  • Arbitrary decisions are hard to remember.

  • Conventions make decisions less arbitrary.

  • varName vs. VarName vs varname vs var_name. We use varName.

  • variable vs. var vs. vrbl vs. vble vs varible: if you need to abbreviate, keep it short.

Commenting Conventions

  • Each module has a comment describing its overall purpose.

  • Each function also has an overall comment.

  • Each field in a structure has a comment.

  • Longer functions broken into ‘paragraphs’ that each begin with a comment.

  • The module, function, and structure comments are replicated in the .h file, which serves as an index to the module.

Error Handling

  • Code prints out a message and aborts (via the errAbort function) when there is a problem.

  • This saves loads of error handling code and is generally the right thing to do.

  • You can ‘catch’ an errAbort if necessary, though it rarely is.
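One common way to get abort-by-default with an optional 'catch' in C is setjmp/longjmp. The sketch below shows that pattern under the assumption that a single global catcher is enough; `sketchAbort`, `parseUnsigned` and `tryParseUnsigned` are hypothetical names, and this is not a description of how errAbort itself is implemented.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *catcher = NULL;  /* Where to jump on abort. NULL if no catcher. */
static char lastMessage[256];    /* Most recent abort message. */

void sketchAbort(const char *format, ...)
/* Format a message, then jump to the active catcher or exit the program. */
{
va_list args;
va_start(args, format);
vsnprintf(lastMessage, sizeof(lastMessage), format, args);
va_end(args);
if (catcher != NULL)
    longjmp(*catcher, 1);
fprintf(stderr, "%s\n", lastMessage);
exit(1);
}

unsigned parseUnsigned(const char *s)
/* Convert s to unsigned. Aborts on anything that is not all digits.
 * Note there is no error return for the caller to check. */
{
unsigned value = 0;
const char *p = s;
if (*p == '\0')
    sketchAbort("empty string where number expected");
for (; *p != '\0'; ++p)
    {
    if (*p < '0' || *p > '9')
        sketchAbort("invalid unsigned number: \"%s\"", s);
    value = value * 10 + (unsigned)(*p - '0');
    }
return value;
}

int tryParseUnsigned(const char *s, unsigned *result)
/* Catch an abort from parseUnsigned. Returns 1 on success, 0 on failure. */
{
jmp_buf env;
jmp_buf *saved = catcher;
int ok = 0;
catcher = &env;
if (setjmp(env) == 0)
    {
    *result = parseUnsigned(s);
    ok = 1;
    }
catcher = saved;  /* Restore any outer catcher. */
return ok;
}
```

Most callers would use `parseUnsigned` directly and let a bad input end the program; only the rare caller that can recover pays for the extra setup in `tryParseUnsigned`.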

Memory

  • Uninitialized memory leads to difficult bugs.

  • Compiler set to warn of uninitialized vars

  • Dynamic memory goes through needMem. It is always zeroed.

  • Memory usually freed with freez(), which sets pointer to null as well as freeing it.

  • ‘Careful’ memory handler can be pushed to help track down memory bugs:

    • Sentinel values to detect writing past end of array

    • Detects memory freed twice or not freed

    • Detects heap corruption in general.
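The three habits above (always-zeroed allocation, free-and-null, and sentinel bytes) can be sketched in a few lines. All names here (`zeroedAlloc`, `FREE_AND_NULL`, `guardedAlloc`, `guardedCheck`) are hypothetical stand-ins, and the sentinel value and size are arbitrary choices; the real needMem, freez and 'careful' handler may differ in every detail.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SENTINEL 0xA5     /* Arbitrary byte pattern written after each block. */
#define SENTINEL_SIZE 4   /* Number of sentinel bytes. */

/* Free a pointer and set it to NULL so a stale use fails fast. */
#define FREE_AND_NULL(pt) do { free(pt); (pt) = NULL; } while (0)

void *zeroedAlloc(size_t size)
/* Allocate zeroed memory. Exits with a message if none is available. */
{
void *pt = calloc(1, size);
if (pt == NULL)
    {
    fprintf(stderr, "Out of memory allocating %lu bytes\n", (unsigned long)size);
    exit(1);
    }
return pt;
}

unsigned char *guardedAlloc(size_t size)
/* Allocate size zeroed bytes followed by sentinel bytes. */
{
unsigned char *pt = zeroedAlloc(size + SENTINEL_SIZE);
memset(pt + size, SENTINEL, SENTINEL_SIZE);
return pt;
}

int guardedCheck(const unsigned char *pt, size_t size)
/* Return 1 if the sentinel bytes after the block are intact,
 * 0 if something wrote past the end. */
{
size_t i;
for (i = 0; i < SENTINEL_SIZE; ++i)
    if (pt[size + i] != SENTINEL)
        return 0;
return 1;
}
```

A debugging allocator would typically run a check like `guardedCheck` on every free, so an overrun is reported close to where it happened.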

Generally Useful Modules

  • String handling - common, dystring, wildcmp

  • Collections - common (singly linked lists), hash, dlist, binRange, rbTree

  • DNA - dnautils, dnaseq

  • Web - htmshell, cheapcgi, htmlPage

  • I/O - linefile, xap (XML), fa, nib, twoBit, blastParse, blastOut, maf, chain, gff

  • Graphics - memgfx, gifwrite, psGfx, vGfx
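The singly linked list is the simplest of the collections listed, and the structures elsewhere in these slides all carry a `next` field for it. Below is a minimal typed sketch of the usual add-to-head, reverse, and free-list operations; the `nameList` structure and its functions are hypothetical and are not the routines in the common module.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct nameList
/* Hypothetical list element holding one name. */
    {
    struct nameList *next;  /* Next in list. */
    char *name;             /* Allocated copy of the name. */
    };

int nameListAddHead(struct nameList **pList, const char *name)
/* Add a copy of name to the head of the list. Returns 0 if out of memory. */
{
struct nameList *el = calloc(1, sizeof(*el));
if (el == NULL)
    return 0;
el->name = malloc(strlen(name) + 1);
if (el->name == NULL)
    {
    free(el);
    return 0;
    }
strcpy(el->name, name);
el->next = *pList;
*pList = el;
return 1;
}

void nameListReverse(struct nameList **pList)
/* Reverse order of list in place. Handy after building with add-to-head. */
{
struct nameList *newList = NULL, *el, *next;
for (el = *pList; el != NULL; el = next)
    {
    next = el->next;
    el->next = newList;
    newList = el;
    }
*pList = newList;
}

void nameListFreeList(struct nameList **pList)
/* Free every element and set the list pointer to NULL. */
{
struct nameList *el, *next;
for (el = *pList; el != NULL; el = next)
    {
    next = el->next;
    free(el->name);
    free(el);
    }
*pList = NULL;
}
```

Adding at the head is constant time, so a common pattern is to build the list backwards while reading input and reverse it once at the end.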

Anatomy of a CGI Script

  • Gets called by Web Server when user clicks submit or follows a cgi link.

  • Input is in environment variables and sometimes also stdin. Routines in cheapCgi move this to a hash table.

  • Output is to stdout. Routines in htmshell help with output formatting.

  • In the middle often access a database.
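The input side of a CGI script can be sketched as follows. This handles only the GET case, where the web server passes variables in the QUERY_STRING environment variable, and it scans the string on each lookup rather than loading a hash table. `queryVar` and `cgiSketchVar` are hypothetical names, not the cheapCgi interface, and the variable names used in the comments are made up.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int hexVal(char c)
/* Return value of hex digit c, or -1 if c is not a hex digit. */
{
if (c >= '0' && c <= '9') return c - '0';
if (c >= 'a' && c <= 'f') return c - 'a' + 10;
if (c >= 'A' && c <= 'F') return c - 'A' + 10;
return -1;
}

int queryVar(const char *query, const char *name, char *out, size_t outSize)
/* Look up name in a query string like "db=hg16&position=chr7%3A100-200".
 * Writes the decoded value to out and returns 1, or returns 0 if absent.
 * Values longer than outSize-1 are truncated. */
{
size_t nameLen = strlen(name);
const char *p = query;
assert(outSize > 0);
while (*p != '\0')
    {
    const char *end = strchr(p, '&');
    if (end == NULL)
        end = p + strlen(p);
    if ((size_t)(end - p) > nameLen && strncmp(p, name, nameLen) == 0
        && p[nameLen] == '=')
        {
        const char *s = p + nameLen + 1;
        size_t n = 0;
        while (s < end && n + 1 < outSize)
            {
            if (*s == '+')   /* '+' encodes a space. */
                {
                out[n++] = ' ';
                s += 1;
                }
            else if (*s == '%' && end - s >= 3
                && hexVal(s[1]) >= 0 && hexVal(s[2]) >= 0)
                {            /* %XX encodes one byte in hex. */
                out[n++] = (char)(hexVal(s[1]) * 16 + hexVal(s[2]));
                s += 3;
                }
            else
                out[n++] = *s++;
            }
        out[n] = '\0';
        return 1;
        }
    p = (*end == '&') ? end + 1 : end;
    }
return 0;
}

int cgiSketchVar(const char *name, char *out, size_t outSize)
/* Look up name in the QUERY_STRING the web server set for this request. */
{
const char *query = getenv("QUERY_STRING");
if (query == NULL)
    return 0;
return queryVar(query, name, out, outSize);
}
```

Because the input is just an environment variable, setting QUERY_STRING by hand before running the program is enough to exercise this code outside a web server.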

Challenges of CGI

  • Each click launches program anew.

    • User state can be kept in ‘cart’ variables

  • Run from Web Server, harder to debug

    • Use cgiSpoof to run from command line

    • Push an error handler that will close out the web page, so you can see your error messages. htmShell does this, but webShell may not….

  • Ideally should run in less than 2 seconds.

Relational Databases

  • Relational databases consist of tables, indices, and the Structured Query Language (SQL).

  • Tables are much like tab-separated files:

    #chrom start end name strand score

    chr22 14600000 14612345 ldlr + 0.989

    chr21 18283999 18298577 vldlr - 0.998

  • Fields are simple - no lists or substructures.

  • Can join tables based on a shared field. This is flexible, but only as fast as the index.

  • Tables and joins are accessed a row at a time.

  • The row is represented as an array of strings.

Converting A Row to Object

struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database. Dispose of this with exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);  /* Allocate zeroed memory for ret before filling it in. */
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}

Motivation for AutoSql

  • Row to object code is tedious at best.

  • Also have save object, free object code to write.

  • SQL create statement needs to match C structure.

  • Lack of lists without doing a join can seriously impact performance and complicate schema.

AutoSql Data Declaration

table exoFish
"An evolutionarily conserved region (ecore) with Tetroadon"
    (
    string chrom; "Human chromosome or FPC contig"
    uint chromStart; "Start position in chromosome"
    uint chromEnd; "End position in chromosome"
    string name; "Ecore name in Genoscope database"
    uint score; "Score from 0 to 1000"
    )

See autoSql.doc for more details.

See also autoXml

Coding Conclusion

  • It’s always safer on the lagging edge

  • Consider redesigning system as COBOL character-based application

UCSC Gene Family Browser

Expression and other information on genes in a big sorted, linked table

Conclusions

  • Genome browser - good for exploring genome and displaying your custom tracks

  • ‘kent’ code base - a good starting point for many programming projects

  • Family browser - a fine way to collect data sets.

  • Browser staff - helpful but overworked.