-
Notifications
You must be signed in to change notification settings - Fork 3
Serghei Mangul edited this page Jun 22, 2017
·
10 revisions
To prepare T1 we haev downloaded 1000G data We download
- WES
- WGS
Convert cram to fastq
while read line; do echo "samtools view -uf64 $PWD/../data/HG00264.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram | samtools bam2fq - | gzip >${line}_1.fastq.gz">run1_${line}.sh;done<../samples.txt
while read line; do echo "samtools view -uf128 $PWD/../data/HG00264.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram | samtools bam2fq - | gzip >${line}_2.fastq.gz">run2_${line}.sh;done<../samples.txt
/u/home/s/serghei/project/1000G/subset_24/WGS_fastq/fastq
ls run*sh | awk '{i+=1;print "qsub -cwd -V -N WGS"i" -l h_data=16G,time=10:00:00 "$1}' >all.sh
Estimate based only on the position for which we got GS(golden standard)
Based on WES we estimate position: 18874414,0,69,0,0
Based on WGS : 18874414,0,2,0,0
bam from 1000G
-t 1 -B 4 -O 6 -E 1 -M -R
[-t nThreads]
[-B mmPenalty]
-O INT Gap open penalty. [6]
-E INT Gap extension penalty.
-M Mark shorter split hits as secondary
-R INT Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
Is this read multi-mapped?
SRR098434.64179639 147 chr1 10002 12 3S73M = 10063 -12 CCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA '''''''''''''''''''''''''''''BBFBB<BBBBBBBBBBBBBBBBBBBBB<BBBBB<BBBBBBB<B<B<< AS:i:73 MC:Z:76M MQ:i:13 XA:Z:chr21,+46699884,76M,0;chr19,+58607544,76M,1;chr1,+248946031,26M1D50M,1;chr4,+190122715,76M,2; XS:i:76 MD:Z:73NM:i:0 RG:Z:SRR098434
MAPQ=0
Its a random choice. Also when there are more than 1 equally "best" hits for a read the alignment gets a MAPQ of 0. When the author benchmarks the BWA tools he usually throws out alignments with MAPQ = 0 since those are random assignments.
Alternative hits; format: (chr,pos,CIGAR,NM;)*
1. multiple identical alligments
SRR098434.12026021 83 chr1 10001 0 76M = 10059 113 TAACCCTAACCCTAACCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAAC '''''''''''''B7BB0<B770000B''B<B7BBB0<B<<'0<0'7BBBB<BB<BBBBB<B<B<B<B0<<B<<<7 AS:i:66 MC:Z:55M21S MQ:i:0 XA:Z:chr5,-10127,76M,2;chr5,-10031,76M,2;chr5,-10103,76M,2;chr5,-10049,76M,2;chr5,-10019,76M,2;chr5,-10043,76M,2;chr5,-11658,76M,2;chr5,-10001,76M,2;chr5,-10115,76M,2;chr5,-10007,76M,2;chr5,-10097,76M,2;chr5,-10415,76M,2;chr5,-10481,76M,2;chr5,-10025,76M,2;chr5,-10067,76M,2;chr5,-10055,76M,2;chr13,+114354210,76M,2;chr5,-10091,76M,2;chr5,-10079,76M,2;chr5,-10073,76M,2;chr5,-10013,76M,2;chr5,-10037,76M,2;chr5,-10121,76M,2;chr5,-10061,76M,2;chr5,-10109,76M,2;chr5,-10085,76M,2;chr5,-11519,64M1D12M,2;chr22,+50807890,10M1D66M,2;chr19,+58607523,10M1D66M,3;chr7,-10050,31M3D45M,4;chr7,+159335894,55M1D21M,3;chr4,+190122861,33M1D43M,3;chr17,-113268,7M2D57M2D12M,6;chr7_KI270899v1_alt,-50,31M3D45M,4;chr17_GL383563v3_alt,-53268,7M2D57M2D12M,6; XS:i:66 MD:Z:20A42C12 NM:i:2 RG:Z:SRR098434
MAPQ=0; NM:i:2; XA:Z:chr5,-10127,76M,2
2. Other alligment are worse
SRR098434.106443395 163 chr1 58383 60 76M = 58435 128 ATCCTGTCAAACATATATGCTTCTAGATTTTTTTAAAGACTGTTTCTACTAAGAAAGCATAGACCGCTATTGAGAA <<<<BBBBBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBB0BB7'B<<<B0'0<77<B0<00<077BB<<7< AS:i:76 MC:Z:76M MQ:i:51 XA:Z:chr15,-101932521,76M,2;chr19,+99959,76M,2; XS:i:66 MD:Z:76 NM:i:0 RG:Z:SRR098434
MAPQ=60; NM:i:0; XA:Z:chr15,-101932521,76M,2;chr19,+99959,76M,2;
Rnning this code
python ~/code/ec/scripts/T1/prepare_gold_from_WES.py HG00264.alt_bwamem_GRCh38DH.20150826.GBR.exome.cram chr1 ~/test.txt
run latest here
- /u/home/s/serghei/project/1000G/WES
- qsub -cwd -V -N true -l h_data=8G,time=10:00:00 run.sh
Do we need to know which one is REF or ALT?