
Matching

masenar edited this page Nov 10, 2022 · 1 revision

Summary

Given a genome file and a primer pairs file, the program matches every genome sequence against every primer pair and generates a template containing the best matches found, along with additional information. Apart from the template, it also generates a table listing the genome sequences and primer pairs that did not match, and a file containing statistics. If the output path is "output", the program generates output_positive.csv, output_negative.csv and output_stats.txt, respectively.
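The naming rule for the three output files can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name `output_files` is hypothetical and is not part of QMPrimers.

```python
from pathlib import Path


def output_files(output):
    """Derive the three output paths from the Output entry (hypothetical helper)."""
    return {
        "positive": Path(f"{output}_positive.csv"),  # template with the best matches
        "negative": Path(f"{output}_negative.csv"),  # genome/primer pairs without a match
        "stats": Path(f"{output}_stats.txt"),        # basic statistics
    }


for kind, path in output_files("output").items():
    print(kind, path)
```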

HOWTO

HOWTO Open the program

To execute the program in GUI mode just double click on it.

If you are using the Python script, type in a terminal:

python QMPrimers.py
# or python3 or python3.x, where x depends on your Python version

Go to the next section for a quick review of the program options. Do not worry, there is another section with a guided example too.

To execute the program in command line mode:

If using the python script, type in a terminal:

python QMPrimers.py --match

If using the executable, type in a terminal:

./QMPrimers --match

Go to the next section for a quick review of the program options; a later section walks through a guided example. Both are written for GUI mode, but the options are the same, so they are equally valid for command line users. To display the help page for matching, type:

QMPrimers --match --help

HOWTO Fast Guide

Now you should have this on your screen: Matching Screenshot

  • The Genome entry is used to load the fasta file.

  • The Primer Pairs entry is used to load the primer pairs file, with the format specified on primer pairs file format.

  • The Output entry is used to specify the path and name of the output files. The program outputs two csv files, the template (output_positive.csv) and the negative matches (output_negative.csv), plus a text file (output_stats.txt) containing basic statistics. For more info, see Output Data.

  • The Precomputed Template entry is used to load a previous output file in order to bypass the computation. It is possible to load a template created by another program, provided it follows the same format. The program does not need all the output info to restore the template. When a template is opened, the Parameters section becomes hidden.

  • Hanging primers: mf = maximum mismatches on the forward primer, mr = maximum mismatches on the reverse primer. Primer pairs are allowed to match within [0-mf, len(genome)+mr] instead of only within the bounds of the genome, so a primer may overhang either end of the sequence.

  • Check uppercase converts genome sequences that are in lower case to upper case. During matching, genomes that are still in lower case are ignored.

  • Check integrity checks whether the genome file is correctly formatted. The primer pairs file is always checked, because it is expected to be a small file. Check integrity is triggered if a genome sequence is in lower case, which is why the uppercase check is performed before it. If this option is disabled and there are bad genomes, the program still works, but it emits warnings saying that those genomes are being skipped.

  • The Output info checklist selects which information is written to the output_positive.csv file. This option is only taken into account when saving; all data is generated regardless of the output info selected.

  • Nend miss. If this option is greater than 0, only the last N mismatches of the forward primer and the first N mismatches of the reverse primer are taken into account. This option is only applied when saving: the program considers all mismatches to generate an origin template and then, if Nend is greater than 0, derives another template from the origin template using this parameter.

  • debug: Only works for single matches, that is, when matching a single primer pair against a single genome sequence. If enabled, the program also writes a visual representation of the alignment to a txt file.

  • verbose: If enabled, the program outputs warning messages, for example that a primer has been skipped because it is longer than the genome sequence. An additional log file is also generated in the program root directory. This log file always records the warnings; activating the verbose option makes the program also log messages with the info flag.

HOWTO Guided example

Now you should have this on your screen: Matching Screenshot

So, you want to perform some matching, don't you? You are in the right place. To perform a matching we need a file containing genome sequences: the genome file. We also need a file containing the primer pairs; we can call it the primer file, not very original, but it will do. These files must follow a particular format: the genome file must be a fasta file and the primer file a csv file with the format described in Primer file format. Let me show you an example of the genome file and the primer file.

You can download these files to test them on your computer if you want. Now that we have the files, we should do some matching with them. To load the genome file, click "open" on the first entry, the one that has "<No Genome>" written in it. You can also type the path directly if you want. To load the primer file, do the same with the second entry, the one that has "<No Primer Pairs>" written in it.

Now let's see what we have on the far right, the so-called "Parameters". Here we can configure the maximum number of mismatches allowed on the forward and reverse primers. As for the other parameters, go back to the previous section to recall what they are used for.

Once the genome and primer files are selected (they are not actually loaded yet) and the parameters are set, we can click "Compute". Wait, I almost forgot: if you want the program to display warnings, activate the verbose option at the very bottom. Now you can proceed, happy matching!

The program displays info as the matching is processed and shows a message once it ends. If the files are big, this can take quite some time, and it would be a tragedy to lose all the data gathered after 45 minutes because of an unexpected crash. That is why the program stores the output file as it is being processed. Because of that, there is no need to click the "save" button unless you want the output to be ordered by primer pairs, in which case you should click "save" manually. Furthermore, if the results are saved manually, the parameters used and the time of the save are written too.

Now close the program.

No, wait! Okay, no problem, we can fix this, keep calm! Open the program again, but this time forget the genome and primer files and load the output_positive.csv file (the output file containing the template with the matching results) with the last entry, the one that has "<No template>" written in it. The GUI should now have changed slightly.

Click "load template" and the file should be loaded. We have now loaded the matching result we got before closing the program. With the data recovered, we can proceed with saving a custom template. To do that, use the output info check boxes to choose which info you want. You can also set the Nend mismatches if you want to. After configuring the output info, modify the output directory entry and click "set" (you can also press Enter). Finally, click "save". Remember to change the output directory if you want to keep the previous template.

There is another feature of the matching tab: you can also restore a template if you have the original genome and primer files as well, although this does not work yet with Nend templates. For example, imagine that you have a template with the primer positions on the genome and you want to get the mismatch positions, types and bases. To do so, load the three files into the program and click "load template". The program should be able to restore the template as long as you could do it manually too; that is, it cannot restore a template that has neither the reverse primer position nor the amplicon, because the reverse primer position would then be impossible to recover.

DATA FORMAT

Genome file format

The genome file must follow the fasta format.
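As a rough illustration of what a fasta genome file contains, and of what the uppercase check is for, here is a minimal reader sketch. It is not the parser used by QMPrimers; the function name `read_fasta` and the `to_upper` flag are assumptions made for this example.

```python
def read_fasta(lines, to_upper=False):
    """Parse fasta lines into {header: sequence} (illustrative sketch only)."""
    records = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header line starts a new record
            current = line[1:].strip()
            records[current] = ""
        elif current is not None:
            # Sequence lines are appended to the current record
            records[current] += line.upper() if to_upper else line
    return records


example = [">seq1", "atcg", "GGCA", ">seq2", "TTAA"]

# Sequences that contain lower case letters before any conversion
lower = [name for name, seq in read_fasta(example).items() if seq != seq.upper()]
print(lower)

# The same file read with the conversion enabled
print(read_fasta(example, to_upper=True))
```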

Primer file format

The primer pairs file must have the following header. The order of the columns does not matter:

id;forwardPrimer;fPDNA;reversePrimer;rPDNA;ampliconMinLength;ampiconMaxLength

Forward and reverse primer chains must be formatted as 5' to 3'. Example:

5' ATCGGCGATATCGATCCCG 3'
               3' GGGC 5' Reverse
5' ATCG 3'                Forward
3' TAGCCGCTATAGCTAGGGC 5'

In the primer pairs file, they would be written as:

F ATCG
R CGGG
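The following sketch shows one way to read such a file with Python's csv module, using the header names exactly as listed above. It assumes that forwardPrimer and reversePrimer hold the primer names and that fPDNA and rPDNA hold the sequences; the amplicon length values in the sample row are made up for illustration, and `read_primer_pairs` is a hypothetical helper, not a QMPrimers function.

```python
import csv
import io

# Column names copied verbatim from the header shown above
REQUIRED = {
    "id", "forwardPrimer", "fPDNA", "reversePrimer", "rPDNA",
    "ampliconMinLength", "ampiconMaxLength",
}


def read_primer_pairs(handle):
    """Read a ';'-separated primer pairs file into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=";")
    # DictReader looks columns up by name, so their order does not matter
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# Sample content; the lengths 10 and 30 are illustrative values only
sample = io.StringIO(
    "id;forwardPrimer;fPDNA;reversePrimer;rPDNA;ampliconMinLength;ampiconMaxLength\n"
    "1;F;ATCG;R;CGGG;10;30\n"
)
pairs = read_primer_pairs(sample)
print(pairs[0]["fPDNA"], pairs[0]["rPDNA"])
```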

Template File

Raw header:

id,primerPair,fastaid,primerF,primerR,mismFT,mismRT,amplicon,F_pos,mismFT_loc,mismFT_type,mismFT_base,R_pos,mismRT_loc,mismRT_type,mismRT_base,mismFNn, mismRNn

For more info, see Output data -> *_positive.csv

Output Data

*_positive.csv

This csv table contains all the matches found during the matching. The information saved is:

  • id: A number that uniquely identifies each match in the table. It is needed because it is possible to find multiple entries for the same genome and primer pair.
  • primerPair: Primer Pair identifier.
  • fastaid: Genome identifier.
  • primerF: Identifier of the forward primer.
  • primerR: Identifier of the reverse primer.
  • mism<F/R><Nn/T>: Number of mismatches on the Forward/Reverse primer.
  • ampliconLen: Distance between the forward and the reverse primer.
  • <F/R>_pos: Position of the Forward/Reverse primer on the genome.
  • mism<F/R><Nn/T>_loc: Location of the mismatches on the Forward/Reverse primer. On the forward primer, locations are counted from right to left. On both primers, numbering starts at 1.
  • mism<F/R><Nn/T>_type: Nucleotides that produced each mismatch.
  • mism<F/R><Nn/T>_base: Base of the nucleotides that produced each mismatch. This can be Purine-Purine, Purine-Pyrimidine, Pyrimidine-Purine, Pyrimidine-Pyrimidine or Indeterminate, following the Base type criteria.
  • amplicon: Genome sequence between the forward and the reverse primer.

To explain the nomenclature a little bit: mism<F/R><Nn/T>_base means that in the template can be found:

  • mismFT_base and mismRT_base for the forward and reverse primer respectively, when saving a "normal template"
  • mismFN6_base and mismRN6_base in the event of saving a template with Nend = 6.
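To make the Nend idea concrete, here is a small sketch of how a mismatch count such as mismFN6 could be derived from the locations stored in a _loc column. It rests on an assumption that the page does not state outright: that locations are 1-based and counted from the primer end that Nend refers to, so a mismatch counts when its location is at most N. The function name is hypothetical.

```python
def count_nend_mismatches(mismatch_locs, nend):
    """Count mismatches within the first `nend` locations (assumed semantics)."""
    if nend <= 0:
        # Nend = 0 means every mismatch is taken into account
        return len(mismatch_locs)
    return sum(1 for loc in mismatch_locs if loc <= nend)


locs = [1, 4, 9]  # illustrative mismatch locations on one primer
print(count_nend_mismatches(locs, 6))
print(count_nend_mismatches(locs, 0))
```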

Here is an example, followed by a descriptive image of what was explained above: Output Example Result Explained

* The matching is performed on the 5'-3' template; that is why, for example, A-T is a mismatch, because it is actually an A-A (TODO: I think it is actually T-T and the result is wrong!), as the mismatch output info says.

The reverse primer is transformed into its reverse complement during the execution of the program, as shown in the figure. The format of the primers is specified in the Primer file format section.
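The reverse complement transformation can be sketched as follows. This is a generic implementation using the standard IUPAC complement pairs, not the code QMPrimers itself runs; the function name is an assumption.

```python
# Standard IUPAC complements: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W, N map to themselves
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a primer written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# The reverse primer from the primer file format example
print(reverse_complement("CGGG"))
```

Applied to the reverse primer CGGG from the primer file format example, this gives CCCG, which is the sequence found at the 3' end of the top strand in that example.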

Base type criteria

IUPAC nucleotide code Base type
A Purine
C Pyrimidine
G Purine
T Pyrimidine
R Purine
Y Pyrimidine
S Indeterminate
W Indeterminate
K Indeterminate
M Indeterminate
B Indeterminate
D Indeterminate
H Indeterminate
V Indeterminate
N Indeterminate
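The table above can be expressed as a small lookup. The sketch below also combines two codes into a label of the kind stored in the _base columns; the order (primer first, genome second) and the rule that any Indeterminate code makes the whole label Indeterminate are assumptions, and both function names are hypothetical.

```python
# Codes listed as Purine or Pyrimidine in the table; everything else is Indeterminate
BASE_TYPE = {
    "A": "Purine", "G": "Purine", "R": "Purine",
    "C": "Pyrimidine", "T": "Pyrimidine", "Y": "Pyrimidine",
}


def base_type(code):
    """Classify a single IUPAC nucleotide code."""
    return BASE_TYPE.get(code.upper(), "Indeterminate")


def mismatch_base(primer_code, genome_code):
    """Build a label such as 'Purine-Pyrimidine' (assumed order: primer, genome)."""
    first, second = base_type(primer_code), base_type(genome_code)
    if "Indeterminate" in (first, second):
        return "Indeterminate"
    return f"{first}-{second}"


print(mismatch_base("A", "C"))
print(mismatch_base("A", "N"))
```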

*_negative.csv

This csv table contains the primer pair and genome combinations that have not produced a single match.

*_stats.txt

This simple text file contains a summary of the matching.