Lack of support for (b)gzipped vcf files and analysing only part(s) of the files (region support) #91
Comments
I found the main function in /usr/lib/python2.7/site-packages/svtyper/classic.py:
sv_genotype both reads the input files and genotypes the SVs, so support for gzipped lines would have to be added inside sv_genotype, near line 188. I couldn't find an open() statement there, so I am not sure whether it can simply be added at that point.
The file is read with parser.add_argument('-i', '--input_vcf', metavar='FILE', type=argparse.FileType('r'), default=None, help='VCF input (default: stdin)') at line 25. However, argparse.FileType('r') behaves like open() in text mode, which does not decompress gzipped files. From the Python documentation: Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given. Perhaps setting the type to string and opening and closing the file with ----Edit----
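A minimal sketch of how the argparse suggestion above might look, not svtyper's actual code. The helper names open_vcf and build_parser are hypothetical, the choice is made purely on the file extension, and gzip.open(path, "rt") assumes Python 3 (the text mode of gzip.open is not available on the Python 2.7 install shown in this thread):

```python
import argparse
import gzip
import sys


def open_vcf(path):
    """Open a plain or (b)gzipped VCF for reading as text.

    Hypothetical helper for illustration; it is not part of svtyper.
    """
    if path == "-":
        # Keep the existing behaviour of falling back to stdin.
        return sys.stdin
    if path.endswith((".gz", ".bgz")):
        # bgzip output is a series of gzip members, which the gzip
        # module reads as one continuous stream.
        return gzip.open(path, "rt")
    return open(path, "r")


def build_parser():
    parser = argparse.ArgumentParser()
    # A callable passed as `type` replaces argparse.FileType('r').
    # argparse also applies it to a string default, so "-" becomes stdin.
    parser.add_argument("-i", "--input_vcf", metavar="FILE", type=open_vcf,
                        default="-",
                        help="VCF input, optionally gzipped (default: stdin)")
    return parser
```

With this in place, args.input_vcf is still a file-like object that yields text lines, so code that iterates over it would not need to know whether the input was compressed.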
Yes, we've done something similar in svtools. See https://github.com/hall-lab/svtools/blob/master/svtools/utils.py for the class we put together to handle this. We could apply the same strategy here with a bit of refactoring.
I'll leave this open until the feature is added.
svtyper currently doesn't support the (b)gzipped file format. This leads to the following type of error.
Command:
svtyper -B $(ls analysis/temp/*/*piped.bam | paste -sd",") -i vcf/StructuralVariants.raw.lumpy.sorted.vcf.gz -l my.bams.json > vcf/StructuralVariants.gt.vcf
Output:
Traceback (most recent call last):
  File "/bin/svtyper", line 11, in <module>
    load_entry_point('svtyper==0.6.0', 'console_scripts', 'svtyper')()
  File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 572, in cli
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 565, in main
    args.max_reads)
  File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 212, in sv_genotype
    var = Variant(v, vcf)
  File "/usr/lib/python2.7/site-packages/svtyper/parsers.py", line 253, in __init__
    self.pos = int(var_list[1])
ValueError: invalid literal for int() with base 10: 'b\xee\xbd\xd1\xfb\xf0\x87\r\xa4=)\xa6\xa2\xda\xe8-\x81\x96\xe0\x8f\x7f|7\xbc\xe9\xeb\xe6}\x9f\xc1;\xce.\xf2'
The value of var_list[1] is a string of compressed binary data, which of course cannot be converted to an integer. This issue should be easy to solve by incorporating, for instance, bgzip -d -c or gunzip -c (which is the same as gzip -d -c) at the location where the lines of the file are read. After an hour of looking I haven't found the right place in the code yet.
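As an alternative to shelling out to bgzip or gunzip, the decompression could happen in Python at the point where lines are read. The sketch below is only an illustration under stated assumptions: open_maybe_gzip is a hypothetical name, the stream is assumed to be opened in binary mode (for example sys.stdin.buffer on Python 3), and the VCF is assumed to be UTF-8 text. It detects compression from the two gzip magic bytes rather than the file name, so it would also cover input piped on stdin:

```python
import gzip
import io

# Every gzip member (and therefore every bgzip block) starts with these bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(binary_stream):
    """Return a text stream, decompressing if the input looks gzipped.

    Hypothetical helper for illustration; it is not part of svtyper.
    """
    buffered = io.BufferedReader(binary_stream)
    # peek() looks ahead without consuming, so nothing is lost if the
    # input turns out to be plain text. It can return fewer bytes than
    # requested on a slow pipe, which this sketch does not handle.
    head = buffered.peek(2)[:2]
    if head == GZIP_MAGIC:
        buffered = gzip.GzipFile(fileobj=buffered, mode="rb")
    return io.TextIOWrapper(buffered, encoding="utf-8")
```

The returned object can be iterated line by line in the same way as an uncompressed file, so int(var_list[1]) would then see the decoded POS column instead of compressed bytes.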