
agenda


zhen


  1. agenda
     • Where is stuff?
     • Automating download of TCGA data
     • DAM web service
     • DCC web service
     • Annotations web service
     • Genomic bins
     • Copy number polymorphisms/variations (CNVs)
     • Mapping CBS segments to genes

  2. /lpg/LPGCommon/schaefec/TCGA subdirectories
     • Downloaded TCGA data (with scripts for download & pre-processing)
       • Annotations
       • Clinical
       • Agilent_GE_data
       • Agilent_MI_data
       • Illumina_ME_data
       • RNASeq
       • SNP_6_data
     • Other
       • BIN_MAPPING
       • CNV
       • mRNA_meta_data
       • miRNA_meta_data
       • methylation_meta_data
     • DAM_WebService, DCC_WS
     • Analysis, AnalysisOutput

  3. DAM Web Service
     • Generic code in directory DAM_WebService
     • Two steps
       • submit request, get back ticket (<disease>.1.xml)
       • poll until <status-message> = OK, then wget <archive-url> (<disease>.2.xml)
     • Submit bunch of requests in parallel
     • Example: SNP_6_data/snp6_dam_ws.csh
     • currently, flattenDir does not appear to work

  4. BRCA.1.xml
     <job-process>
       <ticket>c55fd0dd-20de-494d-8c7f-4fa01f1900c6</ticket>
       <submission-time>2011-03-05T11:48:17.735-05:00</submission-time>
       <estimated-size>9336767</estimated-size>
       <status-check-url>http://tcga-data.nci.nih.gov/tcga/damws/jobprocess/xml/ticket/c55fd0dd-20de-494d-8c7f-4fa01f1900c6</status-check-url>
       <job-status>
         <status-code>201</status-code>
         <status-message>Created</status-message>
       </job-status>
     </job-process>

  5. BRCA.2.xml
     <job-process>
       <ticket>c55fd0dd-20de-494d-8c7f-4fa01f1900c6</ticket>
       <submission-time>2011-03-05T11:48:17.735-05:00</submission-time>
       <estimated-size>9336767</estimated-size>
       <status-check-url>http://tcga-data.nci.nih.gov/tcga/damws/jobprocess/xml/ticket/c55fd0dd-20de-494d-8c7f-4fa01f1900c6</status-check-url>
       <job-status>
         <status-code>200</status-code>
         <status-message>OK</status-message>
         <archive-url>http://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/userCreatedArchives/043dd58c-a3e2-4a3a-a8f7-8a040ab1d2f3.tar.gz</archive-url>
       </job-status>
     </job-process>
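
The two-step ticket workflow on slides 3 to 5 can be sketched in Python. This is only a minimal sketch, not the csh code in DAM_WebService: the function names parse_job_process and poll_until_ready are hypothetical, the polling interval is arbitrary, and the only things assumed about the service are the element names visible in BRCA.1.xml and BRCA.2.xml.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET


def parse_job_process(xml_text):
    """Return (status_message, status_check_url, archive_url or None)."""
    root = ET.fromstring(xml_text)
    status = root.findtext("job-status/status-message")
    check_url = root.findtext("status-check-url")
    # <archive-url> is absent in the "Created" response, so this may be None.
    archive_url = root.findtext("job-status/archive-url")
    return status, check_url, archive_url


def poll_until_ready(ticket_xml, fetch=None, delay_seconds=30, max_polls=100):
    """Poll the status-check URL until status-message is OK; return the archive URL."""
    if fetch is None:
        # Default fetcher: plain HTTP GET (assumes the service is reachable).
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read().decode("utf-8")

    status, check_url, archive_url = parse_job_process(ticket_xml)
    polls = 0
    while status != "OK":
        if polls >= max_polls:
            raise TimeoutError("archive not ready after %d polls" % max_polls)
        time.sleep(delay_seconds)
        status, check_url, archive_url = parse_job_process(fetch(check_url))
        polls += 1
    return archive_url
```

The fetch parameter exists so the polling logic can be exercised without network access; the returned archive URL is what the slide's second step hands to wget.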

  6. DCC Web Service
     • complicated (and slow) but powerful interface
     • example request:
       • http://tcga-data.nci.nih.gov/tcgadccws/GetXML?query=Archive[@isLatest=1][Platform[@name=Genome_Wide_SNP_6]][Disease[@abbreviation=BRCA]][ArchiveType[@type=Level_3]]
     • Generic parser looks for (class, field, attribute [maybe null]), e.g.
       • (“Archive”, “deployLocation”, undef)
       • (“Archive”, “fileCollection”, “xlink:href”)
     • Example script: DCC_WS/FindFiles.pl
       • returns (DISEASE, date, directory, file)
     • Example script: DCC_WS/PatientsPerTSS.pl
       • returns (DISEASE, TSS name, number patients)
       • useful in pulling clinical xml (which is not available via DAM)
       • see Clinical/clinical_dcc_ws.csh
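
The nested-bracket query in the example request can be assembled programmatically. The sketch below only reproduces the pattern of the example URL on this slide; the function names are hypothetical, and whether the service accepts other combinations of criteria or requires percent-encoding is not confirmed here.

```python
DCC_BASE = "http://tcga-data.nci.nih.gov/tcgadccws/GetXML"


def build_archive_query(platform, disease, archive_type, latest_only=True):
    """Compose an Archive query in the bracketed style of the slide's example."""
    query = "Archive"
    if latest_only:
        query += "[@isLatest=1]"
    query += "[Platform[@name=%s]]" % platform
    query += "[Disease[@abbreviation=%s]]" % disease
    query += "[ArchiveType[@type=%s]]" % archive_type
    return query


def build_archive_url(platform, disease, archive_type):
    """Return the full request URL, left unencoded as on the slide."""
    return DCC_BASE + "?query=" + build_archive_query(platform, disease, archive_type)


# Reproduces the example request for Level 3 SNP 6 archives in BRCA.
print(build_archive_url("Genome_Wide_SNP_6", "BRCA", "Level_3"))
```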

  7. Annotations
     • Replaces the old disease_barcode_status.txt files
     • Annotations/process.csh creates annotations.txt
       • (DISEASE, level, barcode, annotation type, TOSS/KEEP)
       • level: patient, sample, portion, aliquot
     • FindKeepers.pl
       • inputs: list of candidate aliquots; annotations.txt
       • output: filtered (KEEP) list of candidate aliquots
     • PickOneAliquot.pl – old script to avoid over-representation of one patient (cases: multiple portions; native DNA and WGA)
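
The filtering step attributed to FindKeepers.pl can be illustrated as follows. This is a hedged sketch, not a port of the Perl script: it assumes that a TOSS annotation at patient, sample, or portion level applies to every aliquot whose barcode begins with the annotated barcode (TCGA barcodes are hierarchical), and the function name find_keepers and the sample barcodes are made up for illustration.

```python
def find_keepers(candidate_aliquots, annotation_rows):
    """Return the candidates not covered by any TOSS annotation.

    annotation_rows: tuples shaped like the rows of annotations.txt,
    (DISEASE, level, barcode, annotation type, TOSS/KEEP).
    """
    tossed = [row[2] for row in annotation_rows if row[4] == "TOSS"]
    keepers = []
    for aliquot in candidate_aliquots:
        # Assumption: a higher-level barcode is a prefix of its aliquots' barcodes.
        if not any(aliquot.startswith(barcode) for barcode in tossed):
            keepers.append(aliquot)
    return keepers


rows = [
    ("GBM", "patient", "TCGA-02-0001", "example annotation", "TOSS"),
    ("GBM", "aliquot", "TCGA-02-0002-01A-01D-0182-01", "example annotation", "KEEP"),
]
candidates = ["TCGA-02-0001-01C-01D-0182-01", "TCGA-02-0002-01A-01D-0182-01"]
print(find_keepers(candidates, rows))
```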

  8. BIN_MAPPING
     • Continuous bin numbering across the genome
       • probably more complicated than necessary
       • several levels of resolution (200K, 20K, 10K)
       • originally to support the CN heatmaps in CGWB
     • Use UCSC chromInfo.txt for chr length
     • Use UCSC refFlat.txt for gene, exon coordinates
     • Major issue on the horizon: hg18 vs hg19
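
One way to obtain continuous bin numbers across the genome is to give each chromosome an offset equal to the number of bins on all preceding chromosomes. The sketch below shows that idea only; it is an assumption about how the numbering could work, not the scheme actually used in BIN_MAPPING, and the chromosome names and lengths are toy values rather than entries from chromInfo.txt.

```python
def build_bin_offsets(chrom_lengths, bin_size):
    """Map each chromosome to the global number of its first bin.

    chrom_lengths: ordered list of (chromosome, length) pairs.
    Returns (offsets, total_bins).
    """
    offsets = {}
    next_bin = 0
    for chrom, length in chrom_lengths:
        offsets[chrom] = next_bin
        next_bin += (length + bin_size - 1) // bin_size  # ceiling division
    return offsets, next_bin


def position_to_bin(offsets, chrom, pos, bin_size):
    """Global bin number for a 0-based position on a chromosome."""
    return offsets[chrom] + pos // bin_size


toy_lengths = [("chrA", 450), ("chrB", 300)]  # toy values for illustration
offsets, total_bins = build_bin_offsets(toy_lengths, bin_size=200)
print(offsets, total_bins)
print(position_to_bin(offsets, "chrB", 250, bin_size=200))
```

Supporting several resolutions (200K, 20K, 10K) would then amount to building one offset table per bin size.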

  9. CNVs -- DGV
     • Based on published studies
     • Keep only VariationType == ‘CopyNumber’ (vast majority)
     • Min sample size for variation: 30
     • Min frequency: 0.30
     • Output format like CBS output:
       • (“dgv”, chr, seg-start, seg-end, “1”, “2.0”)
       • 1 == phony number of markers
       • 2.0 == phony log2ratio
     • Combine overlapping segments by using bins, size=1000
       • so a loss of resolution
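
The DGV filtering and bin-based merging can be sketched as below. The record field names (variation_type, sample_size, frequency) and both function names are hypothetical, and coordinates are assumed to be 0-based. Snapping segments to 1000 bp bins widens each merged segment to bin boundaries and also joins segments that merely fall in adjacent bins, which is the loss of resolution the slide mentions.

```python
def passes_filters(record, min_samples=30, min_freq=0.30):
    """Apply the slide's three DGV filters to one record (a dict)."""
    return (record["variation_type"] == "CopyNumber"
            and record["sample_size"] >= min_samples
            and record["frequency"] >= min_freq)


def merge_by_bins(segments, bin_size=1000):
    """Merge (chrom, start, end) segments by marking the bins they touch."""
    covered = {}
    for chrom, start, end in segments:
        bins = covered.setdefault(chrom, set())
        bins.update(range(start // bin_size, end // bin_size + 1))

    rows = []
    for chrom in sorted(covered):
        run_start = prev = None
        for b in sorted(covered[chrom]):
            if prev is not None and b == prev + 1:
                prev = b  # still inside a contiguous run of bins
                continue
            if run_start is not None:
                rows.append(_cbs_row(chrom, run_start, prev, bin_size))
            run_start = prev = b
        if run_start is not None:
            rows.append(_cbs_row(chrom, run_start, prev, bin_size))
    return rows


def _cbs_row(chrom, first_bin, last_bin, bin_size):
    # "1" and "2.0" are the phony marker count and log2ratio from the slide.
    return ("dgv", chrom, first_bin * bin_size, (last_bin + 1) * bin_size - 1, "1", "2.0")
```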

  10. CNVs – from normals
     • Disease-specific [current] or pooled?
     • Disregard chrX, chrY
     • Filter normal samples
       • nsegs <= 1000
       • -0.10 <= mean log2ratio <= 0.10
     • -0.20 <= diploid <= 0.20
     • Tally non-diploid bins, bin size = 1000 bp
     • Create CNV segments for contiguous bins where tally >= 5% of samples
     • Output format like CBS output:
       • (“normals”, chr, seg-start, seg-end, “1”, “2.0”)
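
The tally-and-threshold procedure for normals can be sketched as follows. Several details are assumptions rather than facts from the slides: the mean log2ratio is taken as an unweighted mean over segments, each sample counts at most once per bin, coordinates are 0-based, and all function names are hypothetical.

```python
from collections import Counter


def usable_normal(segments, max_segs=1000, max_abs_mean=0.10):
    """segments: list of (chrom, start, end, log2ratio) for one normal sample."""
    if not segments or len(segments) > max_segs:
        return False
    mean = sum(seg[3] for seg in segments) / len(segments)
    return abs(mean) <= max_abs_mean


def tally_non_diploid(samples, bin_size=1000, diploid=0.20):
    """Count, per (chrom, bin), the samples with a non-diploid segment there."""
    tally = Counter()
    for segments in samples:
        hit = set()
        for chrom, start, end, log2ratio in segments:
            if chrom in ("chrX", "chrY") or abs(log2ratio) <= diploid:
                continue
            hit.update((chrom, b) for b in range(start // bin_size, end // bin_size + 1))
        tally.update(hit)  # a set, so each sample adds at most 1 per bin
    return tally


def cnv_segments(tally, n_samples, bin_size=1000, min_fraction=0.05):
    """Join contiguous qualifying bins into CBS-like rows."""
    keep = sorted(key for key, n in tally.items() if n >= min_fraction * n_samples)
    rows = []
    for chrom, b in keep:
        last = rows[-1] if rows else None
        if last and last[1] == chrom and last[3] == b * bin_size - 1:
            # Extend the previous segment by one bin.
            rows[-1] = last[:3] + ((b + 1) * bin_size - 1,) + last[4:]
        else:
            rows.append(("normals", chrom, b * bin_size, (b + 1) * bin_size - 1, "1", "2.0"))
    return rows
```

A disease-specific run would pass only that disease's filtered normals as samples; a pooled run would pass all of them.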

  11. gene-level copy number values
     • SNP_6_data/snp6.csh
       • CBSSeg2Gene.pl
       • ComputePairedGeneValues.pl
     • give up on chrX, chrY
     • for each gene choose extreme overlapping CBS segment value
       • MIN_OVERLAP currently set to 1 bp
     • filter out short CBS segments (likely to be artifact/CNV)
       • MIN_SEG currently set to 200 bp
     • binning only to speed up overlapping – no loss of resolution
     • make gene-level calls separate for tumor, matched normal, then subtract
     • output
       • (aliquot, gene, chr, start, stop, log2ratio, capped log2ratio [-2.0..2.0], PAIRED/UNPAIRED, CNV/NOCNV)
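
The gene-level call can be illustrated with a short sketch. It is not a port of CBSSeg2Gene.pl or ComputePairedGeneValues.pl. It assumes that "extreme" means the largest absolute log2ratio, that coordinates are inclusive and on a single chromosome, and that a tumor without a matched normal is reported as UNPAIRED with its own value; the function names are hypothetical.

```python
def gene_value(gene_start, gene_end, segments, min_overlap=1, min_seg=200):
    """Most extreme log2ratio among segments overlapping the gene, or None.

    segments: list of (seg_start, seg_end, log2ratio) on the gene's chromosome.
    """
    best = None
    for seg_start, seg_end, log2ratio in segments:
        if seg_end - seg_start + 1 < min_seg:
            continue  # short segment, likely artifact or CNV
        overlap = min(gene_end, seg_end) - max(gene_start, seg_start) + 1
        if overlap >= min_overlap and (best is None or abs(log2ratio) > abs(best)):
            best = log2ratio
    return best


def cap(value, limit=2.0):
    """Clamp a log2ratio to [-limit, limit]."""
    return max(-limit, min(limit, value))


def paired_value(tumor, normal):
    """Subtract the matched-normal gene value when one exists."""
    if normal is None:
        return tumor, "UNPAIRED"
    return tumor - normal, "PAIRED"


segments = [(100, 1000, 0.3), (900, 2000, -1.2), (950, 1000, 3.0)]
tumor = gene_value(950, 1500, segments)   # the 51 bp segment is ignored
value, status = paired_value(tumor, None)
print(value, cap(value), status)
```

A linear scan over segments is used here for clarity; the slide notes that the real scripts bin segments purely to speed up the overlap search.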

  12. miRNA_meta_data
     • genomic positions from mirbase.org
       • but now miRNA locations (with non-standard names) are also in refFlat.txt
     • targets from targetscan.org
       • 179,129 miRNA/target associations
       • 394 miRNA
       • 9432 targeted mRNA
     • caution: targetscan from UCSC is very reduced
       • 46,841 miRNA/target associations
       • 162 miRNA
       • 7981 targeted mRNA

  13. mRNA_meta_data
     • just the mechanics of updating gene symbols
     • pulls official symbols from refFlat.txt (to be in sync with the CN data)
     • pulls aliases for official symbols from col 5 of Entrez Gene flat file gene_info
     • maps unofficial symbols in UNC Agilent Level 3 data to aliases
     • creates 2-col file replace_syms.txt for use by FilterMapColumn.pl
     • presumably all this will be unnecessary when RNASeq submissions start using GAF
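
The alias lookup behind replace_syms.txt can be sketched as below. It assumes the standard tab-separated gene_info layout, with the symbol in column 3 and pipe-separated synonyms in column 5 ("-" when there are none). Skipping aliases that point to more than one official symbol is a design choice of this sketch, not something stated on the slide, and the function name and gene names are made up.

```python
def build_replacements(official_symbols, gene_info_lines, data_symbols):
    """Return (unofficial symbol, official symbol) pairs for a 2-column file."""
    official = set(official_symbols)
    alias_to_official = {}
    ambiguous = set()
    for line in gene_info_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        symbol, synonyms = fields[2], fields[4]
        if symbol not in official or synonyms == "-":
            continue
        for alias in synonyms.split("|"):
            if alias in alias_to_official and alias_to_official[alias] != symbol:
                ambiguous.add(alias)  # alias claimed by two official symbols
            alias_to_official[alias] = symbol
    return [
        (sym, alias_to_official[sym])
        for sym in data_symbols
        if sym not in official and sym in alias_to_official and sym not in ambiguous
    ]


lines = ["9606\t1\tGENE1\t-\tOLD1|OLDER1\t-\n"]  # made-up record
print(build_replacements(["GENE1"], lines, ["OLD1", "GENE1", "UNKNOWN"]))
```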
