Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lack of support for (b)gzipped vcf files and analysing only part(s) of the files (region support) #91

Open
dvanderleest opened this issue Aug 28, 2018 · 4 comments

Comments

@dvanderleest
Copy link

dvanderleest commented Aug 28, 2018

svtyper currently doesn't support the (b)gzipped file format. This leads to the following type of error.

Command:
svtyper -B $(ls analysis/temp/*/*piped.bam | paste -sd",") -i vcf/StructuralVariants.raw.lumpy.sorted.vcf.gz -l my.bams.json > vcf/StructuralVariants.gt.vcf

Output:
Traceback (most recent call last): File "/bin/svtyper", line 11, in <module> load_entry_point('svtyper==0.6.0', 'console_scripts', 'svtyper')() File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 572, in cli sys.exit(main()) File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 565, in main args.max_reads) File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 212, in sv_genotype var = Variant(v, vcf) File "/usr/lib/python2.7/site-packages/svtyper/parsers.py", line 253, in __init__ self.pos = int(var_list[1]) ValueError: invalid literal for int() with base 10: 'b\xee\xbd\xd1\xfb\xf0\x87\r\xa4=)\xa6\xa2\xda\xe8-\x81\x96\xe0\x8f\x7f|7\xbc\xe9\xeb\xe6}\x9f\xc1;\xce.\xf2'

The value of var_list[1] is a string of binary gibberish, which of course cannot be converted to an integer. This issue should be easy to solve by incorporating for instance bgzip -d -c or gunzip ( which is the same as gzip -d -c) at the location where the lines of the file are read. In the past hour I didn't find the correct site in the code yet.

@dvanderleest
Copy link
Author

I found the main function in /usr/lib/python2.7/site-packages/svtyper/classic.py:

def main():
    # parse the command line args
    args = get_args()

    set_up_logging(args.verbose)

    if args.split_bam is not None:
        sys.stderr.write('Warning: --split_bam (-S) is deprecated. Ignoring %s.\n' % args.split_bam)

    # call primary function
    sv_genotype(args.bam,
        args.input_vcf,        <-- The path to input_vcf is passed to sv_genotype
        args.output_vcf,
        args.min_aligned,
        args.split_weight,
        args.disc_weight,
        args.num_samp,
        args.lib_info_path,
        args.debug,
        args.alignment_outpath,
        args.ref_fasta,
        args.sum_quals,
        args.max_reads)

sv_genotype both reads the input files and genotypes the SVs. Therefore in sv_genotype near line 188 an addition should be made for the parsing of gzipped lines. I couldn't find an open() statement, so I am not sure whether it can simply be added here or not.

@dvanderleest
Copy link
Author

dvanderleest commented Aug 28, 2018

The file is read with parser.add_argument('-i', '--input_vcf', metavar='FILE', type=argparse.FileType('r'), default=None, help='VCF input (default: stdin)') at line 25. However, using argparse.FileType is similar to using open() which is unable to read gzipped files.

Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

Perhaps setting the type to string and opening and closing the file with gzip.open() and/or open() enables testing for compressed data.

----Edit----
Come to think of it. It is probably best to incorporate a vcf parser for this.

@dvanderleest dvanderleest changed the title Lack of support for (b)gzipped vcf files Lack of support for (b)gzipped vcf files and analysing only part(s) of the files (region support) Aug 28, 2018
@ernfrid
Copy link
Contributor

ernfrid commented Aug 28, 2018

Yes, we've done something similar in svtools. See https://github.com/hall-lab/svtools/blob/master/svtools/utils.py for the class we put together to handle this. We could apply the same strategy here with a bit of refactoring.

@ernfrid
Copy link
Contributor

ernfrid commented Aug 28, 2018

I'll leave this open until the feature is added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants