Skip to content
Serghei Mangul edited this page Jun 22, 2017 · 10 revisions

T1

To prepare T1 we haev downloaded 1000G data We download

  • WES
  • WGS

Convert cram to fastq

while read line; do echo "samtools view -uf64 $PWD/../data/HG00264.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram | samtools bam2fq - | gzip >${line}_1.fastq.gz">run1_${line}.sh;done<../samples.txt

while read line; do echo "samtools view -uf128 $PWD/../data/HG00264.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram | samtools bam2fq - | gzip >${line}_2.fastq.gz">run2_${line}.sh;done<../samples.txt
/u/home/s/serghei/project/1000G/subset_24/WGS_fastq/fastq

ls run*sh | awk '{i+=1;print "qsub -cwd -V -N WGS"i" -l h_data=16G,time=10:00:00 "$1}' >all.sh

Estimate based only on the position for which we got GS(golden standard)

For example

Based on WES we estimate position: 18874414,0,69,0,0
Based on WGS : 18874414,0,2,0,0

bam from 1000G

-t 1 -B 4 -O 6 -E 1 -M -R
[-t nThreads]
[-B mmPenalty] 
-O INT	Gap open penalty. [6]
-E INT	Gap extension penalty.
-M	Mark shorter split hits as secondary
-R INT	Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).

Is this read multi-mapped?

SRR098434.64179639	147	chr1	10002	12	3S73M	=	10063	-12	CCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA	'''''''''''''''''''''''''''''BBFBB<BBBBBBBBBBBBBBBBBBBBB<BBBBB<BBBBBBB<B<B<<	AS:i:73	MC:Z:76M	MQ:i:13	XA:Z:chr21,+46699884,76M,0;chr19,+58607544,76M,1;chr1,+248946031,26M1D50M,1;chr4,+190122715,76M,2;	XS:i:76	MD:Z:73NM:i:0	RG:Z:SRR098434

MAPQ=0

Its a random choice. Also when there are more than 1 equally "best" hits for a read the alignment gets a MAPQ of 0. When the author benchmarks the BWA tools he usually throws out alignments with MAPQ = 0 since those are random assignments.
Alternative hits; format: (chr,pos,CIGAR,NM;)*

1.  multiple identical alligments

SRR098434.12026021	83	chr1	10001	0	76M	=	10059	113	TAACCCTAACCCTAACCCTATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAAC	'''''''''''''B7BB0<B770000B''B<B7BBB0<B<<'0<0'7BBBB<BB<BBBBB<B<B<B<B0<<B<<<7	AS:i:66	MC:Z:55M21S	MQ:i:0	XA:Z:chr5,-10127,76M,2;chr5,-10031,76M,2;chr5,-10103,76M,2;chr5,-10049,76M,2;chr5,-10019,76M,2;chr5,-10043,76M,2;chr5,-11658,76M,2;chr5,-10001,76M,2;chr5,-10115,76M,2;chr5,-10007,76M,2;chr5,-10097,76M,2;chr5,-10415,76M,2;chr5,-10481,76M,2;chr5,-10025,76M,2;chr5,-10067,76M,2;chr5,-10055,76M,2;chr13,+114354210,76M,2;chr5,-10091,76M,2;chr5,-10079,76M,2;chr5,-10073,76M,2;chr5,-10013,76M,2;chr5,-10037,76M,2;chr5,-10121,76M,2;chr5,-10061,76M,2;chr5,-10109,76M,2;chr5,-10085,76M,2;chr5,-11519,64M1D12M,2;chr22,+50807890,10M1D66M,2;chr19,+58607523,10M1D66M,3;chr7,-10050,31M3D45M,4;chr7,+159335894,55M1D21M,3;chr4,+190122861,33M1D43M,3;chr17,-113268,7M2D57M2D12M,6;chr7_KI270899v1_alt,-50,31M3D45M,4;chr17_GL383563v3_alt,-53268,7M2D57M2D12M,6;	XS:i:66	MD:Z:20A42C12	NM:i:2	RG:Z:SRR098434

MAPQ=0; NM:i:2; XA:Z:chr5,-10127,76M,2

2. Other alligment are worse 

SRR098434.106443395	163	chr1	58383	60	76M	=	58435	128	ATCCTGTCAAACATATATGCTTCTAGATTTTTTTAAAGACTGTTTCTACTAAGAAAGCATAGACCGCTATTGAGAA	<<<<BBBBBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBB0BB7'B<<<B0'0<77<B0<00<077BB<<7<	AS:i:76	MC:Z:76M	MQ:i:51	XA:Z:chr15,-101932521,76M,2;chr19,+99959,76M,2;	XS:i:66	MD:Z:76	NM:i:0	RG:Z:SRR098434

MAPQ=60; NM:i:0; XA:Z:chr15,-101932521,76M,2;chr19,+99959,76M,2;



Rnning this code

python ~/code/ec/scripts/T1/prepare_gold_from_WES.py HG00264.alt_bwamem_GRCh38DH.20150826.GBR.exome.cram chr1 ~/test.txt

run latest here

  • /u/home/s/serghei/project/1000G/WES
  • qsub -cwd -V -N true -l h_data=8G,time=10:00:00 run.sh

Do we need to know which one is REF or ALT?

Clone this wiki locally