Usage

Whole paranome delineation

The first and foremost step in constructing a KS distribution is to infer the paralogous gene families (the whole paranome). The simplest way to do so is shown below.

(ENV) $ wgd dmd cds.fasta

Many parameters affect this step. As listed in the parameter descriptions below, the two key ones are eval, which sets the e-value cut-off for similarity in diamond and thus directly shapes the whole query-subject hits table, and inflation, which sets the inflation factor in the MCL program and thus directly affects the gene family clustering result. Other parameters, usually minor but sometimes important, are normalizedpercent, which controls the percentage of upper hits used for gene length normalization, and bins, which controls the number of bins used in gene length normalization. Both can influence the normalization result and thus the gene family clustering.

Global/Local MRBHs inference

Reciprocal best hits (RBHs) are good candidates for representing orthologous relationships between genes. The global MRBHs (multiple RBHs) can be inferred with the command below.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta --globalmrbh

The parameter that directly affects the global MRBHs is cscore. If it is left at the default None, strict RBHs are used. If it is set to a float between 0 and 1, the number of inter-specific gene pairs will probably increase, and so will the number of global MRBHs. Normally, more species lead to fewer global MRBHs, so lowering the cscore is sometimes needed to acquire enough global MRBHs for downstream analyses such as phylogenetic inference.
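The difference between strict and relaxed reciprocal best hits can be illustrated with a minimal sketch. It is not the wgd implementation: the hit tuples, the function names and, in particular, the rule "keep a hit if its score is at least cscore times the best score of that query" are assumptions made for illustration.

```python
from collections import defaultdict

def best_hits(hits, cscore=None):
    """Return {query: set of kept subjects} for ONE search direction.

    hits:   iterable of (query, subject, score) tuples (hypothetical input)
    cscore: None keeps only the top-scoring subject(s); a float in (0, 1]
            also keeps subjects scoring >= cscore * best score (assumed rule)
    """
    by_query = defaultdict(list)
    for query, subject, score in hits:
        by_query[query].append((subject, score))
    kept = {}
    for query, subjects in by_query.items():
        top = max(score for _, score in subjects)
        threshold = top if cscore is None else cscore * top
        kept[query] = {s for s, score in subjects if score >= threshold}
    return kept

def reciprocal_pairs(a_vs_b, b_vs_a, cscore=None):
    """Pairs (a, b) where b is a kept hit of a AND a is a kept hit of b."""
    fwd = best_hits(a_vs_b, cscore)
    rev = best_hits(b_vs_a, cscore)
    return sorted(
        (a, b) for a, subs in fwd.items() for b in subs if a in rev.get(b, ())
    )

# Hypothetical similarity scores between species A and species B
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 180.0), ("a2", "b2", 150.0)]
b_vs_a = [("b1", "a1", 200.0), ("b2", "a1", 180.0), ("b2", "a2", 150.0)]

strict = reciprocal_pairs(a_vs_b, b_vs_a)               # strict RBHs only
relaxed = reciprocal_pairs(a_vs_b, b_vs_a, cscore=0.8)  # relaxed threshold
print(strict)
print(relaxed)
```

With these toy scores the relaxed setting recovers more inter-specific pairs than the strict one, which mirrors the behaviour described above.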

The local MRBHs, which focus on a specific species, are a logical regime for finding the pairwise orthologues used to build the orthogroups for phylogenetic dating. The following command infers local MRBHs. The influential parameters are the same as for global MRBHs.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta --focus cds1.fasta

Orthogroup inference

Orthogroups are the most common units used in evolutionary biology for comparative analysis. They can be inferred with the command below.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta -oi

There are two ways of inferring orthogroups. The concatenation method first concatenates all cds files into a single super cds file and then infers the query-subject hits table as for a single genome; gene length bias correction is done per species pair, and the normalized scores are fed into the MCL program for the final gene family clustering. The other method differs only in how the query-subject hits table is obtained: pairwise diamond searches are run in parallel instead of relying on the super cds file. The influential parameters are the same as in the whole paranome inference.

Collinear coalescence inference of phylogeny

Traditional phylogenetic inference relies on sequence-based orthologues, while collinear orthologues remain understudied for inferring phylogeny. Such orthologues, conserved in both gene content and gene order, can be used to infer phylogeny under the multi-species coalescent (MSC) model with the command below.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta --anchorpoints apdata --segments smdata --listelements ledata --genetable gtdata --collinearcoalescence

We rely on the software ASTRAL to summarize the species tree from the gene trees inferred per collinear orthologue group. By default, at least 80% of the species must be present in a multiplicon, and the intersected anchor points across all levels (levels on the same scaffold are treated the same as those on different scaffolds) are retrieved as separate collinear orthologue groups (labelled, for instance, Multiplicon1_Ap1, Multiplicon1_Ap2). The species occurrence ratio can be controlled with the parameter msogcut. The gene tree inference method and its settings can be controlled with the parameters tree_method and treeset.
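The species occurrence check can be sketched as follows. The data layout (a dictionary from species name to gene list), the function names and the inclusive comparison with the cutoff are assumptions for illustration, not the wgd internals.

```python
def species_occurrence_ratio(group, all_species):
    """Fraction of species in all_species with at least one gene in the group.

    group: {species_name: [gene ids]}; a missing or empty list counts as absent.
    """
    present = sum(1 for sp in all_species if group.get(sp))
    return present / len(all_species)

def passes_cutoff(group, all_species, msogcut=0.8):
    """True if the group reaches the occurrence cutoff (assumed inclusive)."""
    return species_occurrence_ratio(group, all_species) >= msogcut

species = ["sp1", "sp2", "sp3", "sp4", "sp5"]
# Hypothetical collinear groups: 4 of 5 species and 3 of 5 species present
group_a = {"sp1": ["g1"], "sp2": ["g2"], "sp3": ["g3"], "sp4": ["g4"]}
group_b = {"sp1": ["g5"], "sp2": ["g6"], "sp3": ["g7"]}

print(passes_cutoff(group_a, species))  # ratio 0.8
print(passes_cutoff(group_b, species))  # ratio 0.6
```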

Orthogroup assignment

If an orthogroup file is already available, additional genome or transcriptome sequences can be assigned to the existing orthogroups with the command below.

(ENV) $ wgd dmd old_seq1 old_seq2 old_seq3 --geneassign --seq2assign new_seq1 --seq2assign new_seq2 --fam2assign families.tsv

The “old_seq” files are the sequence files involved in the existing orthogroups, while the “new_seq” files are the additional sequence files to be assigned. The option seq2assign can be given multiple times.

Note

The version of hmmer we tested with is 3.1b2, and the version of numpy is 1.19.0. Higher or lower versions of these dependencies might break this function (and other functions as well). Please consider changing the version of the relevant packages if you run into errors.

cli.dmd(sequences, outdir, tmpdir, cscore, inflation, eval, to_stop, cds, focus, anchorpoints, keepfasta, keepduplicates, globalmrbh, nthreads, orthoinfer, onlyortho, getnsog, tree_method, treeset, msogcut, geneassign, assign_method, seq2assign, fam2assign, concat, segments, listelements, collinearcoalescence, testsog, bins, buscosog, buscohmm, buscocutoff, genetable, normalizedpercent, nonormalization)

Whole paranome inference

Global/Local MRBHs inference

Orthogroups inference

Phylogeny inference based on the collinear coalescence

Parameters:
  • sequences (paths) – Argument of sequence files.

  • outdir (str) – Path of desired output directory, default “wgd_dmd”.

  • tmpdir (str or None) – Path of temporary directory.

  • cscore (float or None) – The c-score to restrict the homologs of MRBHs, default “None”.

  • inflation (float) – The inflation factor for MCL program, default “2.0”.

  • eval (float) – The e-value cut-off for similarity in diamond and/or hmmer, default “1e-10”.

  • to_stop (boolean flags) – Whether to translate through STOP codons, default False.

  • cds (boolean flags) – Whether to only translate complete CDS that start with a valid start codon, contain a single in-frame stop codon at the end and have a length divisible by three, default False.

  • focus (path or None) – The species to be merged on local MRBHs, default “None”.

  • anchorpoints (path or None) – The anchor points data file, default “None”.

  • segments (path or None) – The segments data file used in collinear coalescence analysis if initiated, default “None”.

  • listelements (path or None) – The listsegments data file used in collinear coalescence analysis if initiated, default “None”.

  • keepfasta (boolean flags) – Whether to output the sequence information of MRBHs, default False.

  • keepduplicates (boolean flags) – Whether to allow the same gene to occur in different MRBHs, default False.

  • globalmrbh (boolean flags) – Whether to initiate global MRBHs construction, default False.

  • nthreads (int) – The number of threads to use, default 4.

  • orthoinfer (boolean flags) – Whether to initiate orthogroup inference, default False.

  • onlyortho (boolean flags) – Whether to only conduct orthogroup inference, default False.

  • getnsog (boolean flags) – Whether to initiate the searching for nested single-copy gene families (NSOGs), default False.

  • tree_method (choice ['fasttree', 'iqtree', 'mrbayes']) – which gene tree inference program to invoke, default “fasttree”.

  • treeset (multiple str options or None) – The parameters setting for gene tree inference, default None.

  • msogcut (float) – The ratio cutoff for mostly single-copy family and species representation in collinear coalescence inference, default “0.8”.

  • geneassign (boolean flags) – Whether to initiate the gene-to-family assignment analysis, default False.

  • assign_method (choice ['hmmer', 'diamond']) – Which method to conduct the gene-to-family assignment analysis, default “hmmer”.

  • seq2assign (multiple path options or None) – The queried sequences data file in gene-to-family assignment analysis, default None.

  • fam2assign (path or None) – The queried family data file in gene-to-family assignment analysis, default None.

  • concat (boolean flags) – Whether to initiate the concatenation pipeline for orthogroup inference, default False.

  • collinearcoalescence (boolean flags) – Whether to initiate the collinear coalescence analysis, default False.

  • testsog (boolean flags) – Whether to initiate the unbiased test of single-copy gene families, default False.

  • bins (int) – The number of bins divided in gene length normalization, default “100”.

  • normalizedpercent (int) – The percentage of upper hits used for gene length normalization, default “5”.

  • nonormalization (boolean flags) – Whether to call off the normalization, default False.

  • buscosog (boolean flags) – Whether to initiate the busco-guided single-copy gene family analysis, default False.

  • buscohmm (path or None) – The HMM profile datafile in the busco-guided single-copy gene family analysis, default None.

  • buscocutoff (path or None) – The HMM score cutoff datafile in the busco-guided single-copy gene family analysis, default None.

  • genetable (path or None) – The gene table datafile used in collinear coalescence analysis, default None.

KS distribution construction

After obtaining the paralogous/orthologous gene family file, the construction of KS distribution can be achieved as below.

(ENV) $ wgd ksd families *cds.fasta

The gene family file is a mandatory input and can be obtained from the earlier wgd dmd steps. The meaning of the constructed KS distribution depends on the number of species (that is, on the cds sequence files provided). If only one species is given, the whole paranome KS distribution is built, which is good material for the primary identification of WGDs. If orthogroups of multiple species are given, a mixed paralog-ortholog KS distribution is built, whose subdivision per species pair can be used to diagnose the phylogenetic placement of the WGD events in focus. If the RBH gene family of two species is given, the constructed KS distribution shows the KS age of the divergence event, as illustrated in wgd v1.

This step is quite flexible in that various types of gene family files can be provided, for instance the paralogous gene families, the orthogroups, the RBH gene families, the collinear orthologous gene families (inferred in the collinear coalescence analysis), or any other gene families that users would like to analyze, as long as they are in the correct format: a tab-separated table in which each row represents a gene family and each column a species, with the first column holding the gene family labels and the first row holding the species names.

The influential parameters include pairwise, which determines whether KS is calculated from the alignment of each paralogous gene pair instead of the whole alignment of the family, and strip_gaps, which controls whether all gaps are dropped from the multiple sequence alignment. The latter only makes a difference in combination with pairwise, because the program codeml strips all gaps anyway. The default tree_method is the external software fasttree, which can be replaced with the built-in cluster method.
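The table format described above (one gene family per row, one species per column, family labels in the first column and species names in the first row) can be read with a few lines of Python. This is a sketch for checking your own files, not code from wgd; the function name, the toy table and the assumption that several genes in one cell are separated by commas are all illustrative.

```python
import csv
import io

def read_families(handle):
    """Parse a tab-separated family table into {family: {species: [genes]}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    species = header[1:]          # first row: species names
    families = {}
    for row in reader:
        if not row:
            continue              # skip blank lines
        label, cells = row[0], row[1:]   # first column: family label
        families[label] = {
            # assumption: multiple genes in a cell are comma-separated
            sp: [g.strip() for g in cell.split(",") if g.strip()]
            for sp, cell in zip(species, cells)
        }
    return families

# Hypothetical two-species table with two gene families
table = "\tspeciesA\tspeciesB\nGF1\tA1, A2\tB1\nGF2\tA3\t\n"
families = read_families(io.StringIO(table))
print(families["GF1"])
print(families["GF2"])
```

An empty cell simply yields an empty gene list for that species, which makes it easy to spot families that are missing from a given genome.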

Corrected KS distribution construction

To determine the phylogenetic placement of a certain WGD, a rate-corrected KS distribution is needed, in which the rate-dependent orthologous KS values are rescaled to the substitution rate of the focal species. The command is as below.

(ENV) $ wgd ksd families *cds.fasta --spair speciespair1 --spair speciespair2 --speciestree speciestree.nw

This step is mainly affected by the parameter reweight, which determines whether the weights are recalculated per species pair instead of taken from the whole family, and the parameter onlyrootout, which determines whether rate correction uses only the outgroup at the root. The parameter extraparanomeks adds extra paralogous KS data to any paralogous KS already present in the original KS data; when a paralogous gene pair occurs in both, only the entry from the extra paralogous KS data is kept. This covers the case where the provided gene families contain an incomplete set of paralogous gene pairs that overlaps with the extra paralogous KS data. It is nevertheless suggested to supply paralogous KS only through the extraparanomeks parameter and to keep only orthologous gene pairs in the gene families (for instance, the global MRBHs), because the elmm modeling only considers the paralogs passed in through extraparanomeks. The extraparanomeks parameter does not affect the result of rate correction, which only involves orthologous KS. This is discussed further in the recipe.
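The idea of rescaling an orthologous KS to the rate of the focal species can be illustrated with the classic three-taxon decomposition that uses an outgroup. Whether wgd ksd performs exactly this calculation is an assumption; the function names and the KS values below are hypothetical.

```python
def branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Split the focal-sister KS into two branch-specific parts via an outgroup.

    Uses the relative-rate relations
        focal  = (K_fs + K_fo - K_so) / 2
        sister = (K_fs + K_so - K_fo) / 2
    so that focal + sister equals the observed focal-sister KS.
    """
    focal = (ks_focal_sister + ks_focal_out - ks_sister_out) / 2
    sister = (ks_focal_sister + ks_sister_out - ks_focal_out) / 2
    return focal, sister

def rescale_to_focal_rate(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Divergence KS as if both lineages had evolved at the focal rate."""
    focal, _ = branch_contributions(ks_focal_sister, ks_focal_out, ks_sister_out)
    return 2 * focal

# Hypothetical orthologous KS peaks: the sister lineage evolves faster
focal, sister = branch_contributions(1.0, 1.5, 1.75)
corrected = rescale_to_focal_rate(1.0, 1.5, 1.75)
print(focal, sister, corrected)
```

In this toy case the sister branch accounts for more of the observed KS than the focal branch, so the rescaled divergence KS is smaller than the raw orthologous KS and can be compared directly with the paralogous KS of the focal species.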

KS tree inference

The KS tree is a good way of visualizing and quantifying the variation in substitution rate. In wgd v2 we implemented an additional way of estimating KS branch lengths in wgd ksd, given a species tree or an inferred gene tree. The command is as below.

(ENV) $ wgd ksd kstree_data/fam.tsv data/kstree_data/Acorus_tatarinowii kstree_data/Amborella_trichopoda kstree_data/Aquilegia_coerulea kstree_data/Aristolochia_fimbriata kstree_data/Cycas_panzhihuaensi --kstree --speciestree kstree_data/species_tree1.nw --onlyconcatkstree -o wgd_kstree_topology1

The option kstree tells the program to infer KS tree instead of pairwise KS. The option speciestree takes the input species tree. The option onlyconcatkstree tells the program to only conduct the KS analysis for the concatenated alignment rather than for every alignment per family. We used two global MRBH families for the illustration above. Three alternative topologies were applied given the uncertain phylogenetic relationship within angiosperms.

[Figure: _images/kstree.svg]

cli.ksd(families, sequences, outdir, tmpdir, nthreads, to_stop, cds, pairwise, strip_gaps, tree_method, spair, speciestree, reweight, onlyrootout, extraparanomeks, anchorpoints, plotkde, plotapgmm, components)

KS distribution construction

Corrected KS distribution construction

Parameters:
  • families (path) – Argument of gene family file.

  • sequences (paths) – Argument of sequence files.

  • outdir (str) – Path of desired output directory, default “wgd_ksd”.

  • tmpdir (str or None) – Path of temporary directory.

  • nthreads (int) – The number of threads to use, default 4.

  • to_stop (boolean flags) – Whether to translate through STOP codons, default False.

  • cds (boolean flags) – Whether to only translate complete CDS that start with a valid start codon, contain a single in-frame stop codon at the end and have a length divisible by three, default False.

  • pairwise (boolean flag) – Whether to initiate pairwise KS estimation, default False.

  • strip_gaps (boolean flag) – Whether to drop all gaps in multiple sequence alignment.

  • tree_method (choice ['cluster', 'fasttree', 'iqtree']) – which gene tree inference program to invoke, default “fasttree”.

  • spair (multiple str options or None) – The species pair to be plotted, default None.

  • speciestree (path or None) – The species tree to perform rate correction, default None.

  • reweight (boolean flag) – Whether to recalculate the weight per species pair, default False.

  • onlyrootout (boolean flag) – Whether to only conduct rate correction using the outgroup at root as outgroup, default False.

  • extraparanomeks (path or None) – The extra paranome ks data to be plotted in the mixed KS distribution, default None.

  • anchorpoints (path or None) – The anchorpoints.txt file to plot anchor KS in the mixed KS distribution, default None.

  • plotkde (boolean flag) – Whether to plot kde curve of orthologous KS distribution over histogram in the mixed KS distribution, default False.

  • plotapgmm (boolean flag) – Whether to plot mixture modeling of anchor KS in the mixed KS distribution, default False.

  • plotelmm (boolean flag) – Whether to plot elmm mixture modeling of paranome KS in the mixed KS distribution, default False.

  • components ((int,int)) – The range of the number of components to fit in anchor KS mixture modeling, default (1,4).

Mixture model clustering of KS distribution

This function is inherited from wgd v1: both bgmm and gmm are available for mixture modeling of any KS distribution. The command is as below.

(ENV) $ wgd mix ks.tsv

This analysis relies mainly on the mixture module of the scikit-learn library. Almost all parameters of this function have an impact on the results; see the descriptions below for details.

cli.mix(ks_distribution, filters, ks_range, method, components, bins, output_dir, gamma, n_init, max_iter)

Mixture modeling of KS distribution

Parameters:
  • ks_distribution (path) – Argument of KS distribution file.

  • filters (int) – The cutoff alignment length, default “300”.

  • ks_range ((float,float)) – The KS range to be considered, default (0, 5)

  • method (choice ['gmm', 'bgmm']) – which mixture model to use, default “gmm”.

  • components ((int,int)) – The range of the number of components to fit, default (1,4)

  • bins (int) – The number of bins in KS distribution, default “50”.

  • outdir (str) – Path of desired output directory, default “wgd_mix”.

  • gamma (float) – The gamma parameter for bgmm models, default “0.001”.

  • n_init (int) – The number of k-means initializations, default “200”.

  • max_iter (int) – The maximum number of iterations, default “1000”.
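Before any mixture model is fitted, the filters and ks_range parameters restrict which KS values enter the model. The sketch below shows such a pre-filter under stated assumptions: the column names dS and alignmentlength, the function name and the boundary conventions (lower bound exclusive, upper bound inclusive) are illustrative and may differ from what wgd mix does.

```python
def filter_ks(rows, ks_range=(0.0, 5.0), min_length=300,
              ks_key="dS", length_key="alignmentlength"):
    """Keep rows whose KS lies in ks_range and whose alignment is long enough.

    rows: list of dicts, e.g. from csv.DictReader on a KS table.
    Column names and boundary handling are assumptions, not the wgd defaults.
    """
    low, high = ks_range
    return [
        row for row in rows
        if low < float(row[ks_key]) <= high
        and int(row[length_key]) >= min_length
    ]

# Hypothetical rows of a KS table
rows = [
    {"pair": "g1__g2", "dS": "0.5", "alignmentlength": "450"},
    {"pair": "g3__g4", "dS": "6.2", "alignmentlength": "900"},   # saturated KS
    {"pair": "g5__g6", "dS": "1.2", "alignmentlength": "120"},   # short alignment
    {"pair": "g7__g8", "dS": "0.0", "alignmentlength": "600"},   # zero KS
]
kept = filter_ks(rows)
print([row["pair"] for row in kept])
```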

Synteny inference

Synteny is frequently used nowadays in profiling the evolution of pseudochromosomes. The program wgd syn performs all synteny inference. The simplest command is as below.

(ENV) $ wgd syn families.tsv gff3 (-ks ksdata)

The influential parameters for synteny inference are minlen, which controls the minimum length of a scaffold to be considered; maxsize, which controls the maximum family size to be considered; minseglen, which sets the minimum length of segments to be considered; keepredun, which determines whether redundant multiplicons are kept; and mingenenum, which controls the minimum number of genes on a segment for it to be considered.

cli.syn(families, gff_files, ks_distribution, outdir, feature, attribute, minlen, maxsize, ks_range, iadhore_options, ancestor, minseglen, keepredun, mingenenum, dotsize, apalpha, hoalpha)

Synteny inference

Parameters:
  • families (path) – Argument of gene family file.

  • gff_files (paths) – Argument of gff3 files.

  • ks_distribution (path or None) – The KS distribution datafile, default None.

  • outdir (str) – Path of desired output directory, default “wgd_syn”.

  • feature (str) – The feature for parsing gene IDs from GFF files, default “gene”.

  • attribute (str) – The attribute for parsing the gene IDs from the GFF files, default “ID”.

  • minlen (int) – The minimum length of a scaffold to be included in dotplot, default “-1”.

  • maxsize (int) – The maximum family size to include, default “200”.

  • ks_range ((float,float)) – The KS range in colored dotplot, default (0, 5).

  • iadhore_options (str) – The parameter setting in iadhore, default “”.

  • ancestor (str or None) – The assumed ancestor species, it’s still under development, default None.

  • minseglen (float) – The minimum length of segments to include, ratio if <= 1, default 100000.

  • keepredun (boolean flags) – Whether to keep redundant multiplicons, default False.

  • mingenenum (int) – The minimum number of genes for a segment to be considered, default 30.

  • dotsize (float) – The dot size in dot plot, default “1”.

  • apalpha (float) – The opacity of anchor dots, default “1”.

  • hoalpha (float) – The opacity of homolog dots, default “0.1”.

Search of anchor pairs for molecular dating

If we want to date the identified WGD events, determining anchor pairs to be dated is the key step. The program wgd peak can achieve this goal. The command is as below.

(ENV) $ wgd peak ksdata -ap apdata -sm smdata -le ledata -mp mpdata

Two methods are available in this step. The first is the heuristic method, enabled with the flag heuristic, which refines the anchor pairs heuristically based on the peaks detected by scipy.signal and their associated properties. The second retrieves the highest density region (HDR) of the segment-guided syntelogs, subject to a user-defined KS saturation cutoff.
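The notion of a highest density region can be illustrated on a plain sample of KS values: for a unimodal sample it is the shortest interval that contains the requested share of the data. This sketch is not the wgd peak implementation, which may estimate the HDR from a fitted density instead; the function name, the handling of the saturation cutoff and the toy values are assumptions.

```python
import math

def highest_density_interval(values, hdr=95, kscutoff=5.0):
    """Shortest interval holding at least hdr percent of the retained values.

    values:   KS values (hypothetical sample)
    hdr:      percentage of the sample to cover, as an integer
    kscutoff: values above this saturation cutoff are discarded first
    """
    xs = sorted(v for v in values if v <= kscutoff)
    n = len(xs)
    if n == 0:
        raise ValueError("no KS values left after applying the cutoff")
    k = max(1, math.ceil(hdr * n / 100))   # points the interval must cover
    # slide a window of k consecutive points and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Hypothetical anchor-pair KS values with two outliers
ks_values = [0.1, 0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 4.5]
interval = highest_density_interval(ks_values, hdr=80)
print(interval)
```

Anchor pairs whose KS falls inside the returned interval would then be the candidates kept for dating, while the outliers on either side are left out.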

cli.peak(ks_distribution, anchorpoints, outdir, alignfilter, ksrange, bin_width, weights_outliers_included, method, seed, em_iter, n_init, components, boots, weighted, plot, bw_method, n_medoids, kdemethod, n_clusters, kmedoids, guide, prominence_cutoff, kstodate, family, rel_height, ci, manualset, segments, hdr, heuristic, listelements, multipliconpairs, kscutoff, gamma)

Search of anchor pairs used in WGD dating

Parameters:
  • ks_distribution (path) – Argument of KS distribution datafile.

  • anchorpoints (path or None) – The anchor points datafile, default None.

  • segments (path or None) – The segments datafile, default None.

  • outdir (str) – Path of desired output directory, default “wgd_peak”.

  • alignfilter ((float,int,float)) – The cutoff for alignment identity, length and coverage, default (0.0, 0, 0.0).

  • ksrange ((float,float)) – The range of KS to be analyzed, default (0,5).

  • bin_width (float) – The bandwidth of KS distribution, default “0.1”.

  • weights_outliers_included (boolean flags) – Whether to include KS outliers, default False.

  • method (choice ['gmm', 'bgmm']) – Which mixture model to use, default “gmm”.

  • seed (int) – Random seed given to initialization, default “2352890”.

  • em_iter (int) – The number of EM iterations to perform, default “200”.

  • n_init (int) – The number of k-means initializations, default “200”.

  • components ((int,int)) – The range of the number of components to fit, default (1,4).

  • boots (int) – The number of bootstrap replicates of kde, default “200”.

  • weighted (boolean flags) – Whether to use node-weighted method of de-redundancy, default False.

  • plot (choice(['stacked', 'identical'])) – The plotting method to be used, default “identical”.

  • bw_method (choice['silverman', 'ISJ']) – The bandwidth method to be used, default “silverman”.

  • n_medoids (int) – The number of medoids to fit, default “2”.

  • kdemethod (choice['scipy', 'naivekde', 'treekde', 'fftkde']) – The kde method to be used, default “scipy”.

  • n_clusters (int) – The number of clusters to plot Elbow loss function, default “5”.

  • kmedoids (boolean flags) – Whether to initiate K-Medoids clustering analysis, default False.

  • guide (choice['multiplicon', 'basecluster', 'segment']) – The regime residing anchors, default “segment”.

  • prominence_cutoff (float) – The prominence cutoff of acceptable peaks, default “0.1”.

  • kstodate ((float,float)) – The range of KS to be dated in heuristic search, default (0.5, 1.5).

  • family (path or None) – The family to filter KS upon, default None.

  • manualset (boolean flags) – Whether to output anchor pairs with manually set KS range, default False.

  • rel_height (float) – The relative height at which the peak width is measured, default “0.4”.

  • ci (int) – The confidence level of log-normal distribution to date, default “95”.

  • hdr (int) – The highest density region (HDR) in a given distribution to date, default “95”.

  • heuristic (boolean flags) – Whether to initiate heuristic method of defining CI for dating, default False.

  • listelements (path or None) – The listsegments datafile, default None.

  • multipliconpairs (path or None) – The multipliconpairs datafile, default None.

  • kscutoff (float) – The KS saturation cutoff in dating, default “5”.

  • gamma (float) – The gamma parameter for bgmm models, default “1e-3”.

Concatenation-/Coalescence-based phylogenetic inference

The program wgd focus can handle various analyses. To recover the phylogeny with either the concatenation- or the coalescence-based method, the following command can be used.

(ENV) $ wgd focus families *cds.fasta --concatenation/--coalescence

This step 1) writes the sequences of each gene family, 2) infers a multiple sequence alignment (MSA) for each gene family, and then 3a) with the concatenation method, concatenates all MSAs and infers the maximum likelihood tree from the concatenated alignment as the species tree, or 3b) with the coalescence method, infers a maximum likelihood gene tree for each gene family and summarizes the species tree using ASTRAL. The influential parameters are tree_method, which selects the gene tree inference program, and treeset, which passes parameter settings to that program.
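The concatenation of per-family alignments into one supermatrix can be sketched as below. This is an illustration rather than the wgd focus code: the data layout (one dictionary of species to aligned sequence per family), the function name and the choice to pad absent species with gap characters are assumptions.

```python
def concatenate_alignments(alignments, species):
    """Join per-family alignments into one sequence per species.

    alignments: list of {species: aligned sequence}, one dict per gene family
    species:    species order of the resulting supermatrix
    Species absent from a family are padded with gaps (assumed behaviour).
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError("sequences within one alignment must have equal length")
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}

# Two hypothetical family alignments; species C is missing from the first one
alignments = [
    {"A": "ATG-", "B": "ATGC"},
    {"A": "CC", "B": "C-", "C": "CT"},
]
supermatrix = concatenate_alignments(alignments, ["A", "B", "C"])
for name, seq in supermatrix.items():
    print(name, seq)
```

Every species ends up with a sequence of the same total length, which is what a tree inference program expects from a concatenated alignment.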

Functional annotation of gene families

The functional annotation of gene families can also be achieved in wgd focus using the command below.

(ENV) $ wgd focus families *cds.fasta --annotation eggnog --eggnogdata eddata --dmnb dbdata

This step relies on databases provided by the user. The executables of the annotation programs, for instance eggnog, hmmer and interproscan, must be made available in the environment beforehand. The influential parameter is evalue, which controls the e-value cut-off for annotation.

WGD dating

The program wgd focus is the final step in WGD dating. After obtaining the anchor pairs-merged orthogroups from wgd dmd, the phylogenetic dating can be conducted with the command below.

(ENV) $ wgd focus families sequence1 sequence2 sequence3 --dating mcmctree --speciestree species_tree.nw

The command shown above is a simple example that calls the molecular dating program mcmctree to conduct the dating. A species tree is a mandatory input. Further dating parameters can be provided with the datingset parameter.

cli.focus(families, sequences, outdir, tmpdir, nthreads, to_stop, cds, strip_gaps, aligner, tree_method, treeset, concatenation, coalescence, speciestree, dating, datingset, nsites, outgroup, partition, aamodel, ks, annotation, pairwise, eggnogdata, pfam, dmnb, hmm, evalue, exepath, fossil, rootheight, chainset, beastlgjar, beagle, protdating)

Concatenation-/Coalescence-based phylogenetic inference

Functional annotation of gene families

Phylogenetic dating of WGDs

Parameters:
  • families (path) – Argument of gene family file.

  • sequences (paths) – Argument of sequence files.

  • outdir (str) – Path of desired output directory, default “wgd_focus”.

  • tmpdir (str or None) – Path of temporary directory.

  • nthreads (int) – The number of threads to use, default 4.

  • to_stop (boolean flags) – Whether to translate through STOP codons, default False.

  • cds (boolean flags) – Whether to only translate complete CDS that start with a valid start codon, contain a single in-frame stop codon at the end and have a length divisible by three, default False.

  • strip_gaps (boolean flag) – Whether to drop all gaps in multiple sequence alignment.

  • aligner (choice(['muscle', 'prank', 'mafft'])) – Which alignment program to use, default “mafft”.

  • tree_method (choice ['fasttree', 'iqtree', 'mrbayes']) – which gene tree inference program to invoke, default “fasttree”.

  • treeset (multiple str options or None) – The parameters setting for gene tree inference, default None.

  • concatenation (boolean flag) – Whether to initiate the concatenation-based species tree inference, default False.

  • coalescence (boolean flag) – Whether to initiate the coalescence-based species tree inference, default False.

  • speciestree (path or None) – The species tree datafile for dating, default None.

  • dating (choice(['beast', 'mcmctree', 'r8s', 'none'])) – Which molecular dating program to use, default None.

  • datingset (multiple str options or None) – The parameters setting for dating program, default None.

  • nsites (int) – The nsites information for r8s dating, default None.

  • outgroup (str) – The outgroup information for r8s dating, default None.

  • partition (boolean flag) – Whether to initiate partition dating analysis for codon, default False.

  • aamodel (choice(['poisson','wag', 'lg', 'dayhoff'])) – Which protein model to be used in mcmctree, default “poisson”.

  • ks (boolean flag) – Whether to initiate KS calculation, default False.

  • annotation (choice(['none','eggnog', 'hmmpfam', 'interproscan'])) – Which annotation program to use, default None.

  • pairwise (boolean flag) – Whether to initiate pairwise KS estimation, default False.

  • eggnogdata (path or None) – The eggnog annotation datafile, default None.

  • pfam (choice(['none', 'denovo', 'realign'])) – Which option to use for pfam annotation, default None.

  • dmnb (path or None) – The diamond database for annotation, default None.

  • hmm (path or None) – The HMM profile for annotation, default None.

  • evalue (float) – The e-value cut-off for annotation, default “1e-10”.

  • exepath (path or None) – The path to the interproscan executable, default None.

  • fossil ((str,str,str,str,str)) – The fossil calibration information in Beast, default (‘clade1;clade2’, ‘taxa1,taxa2;taxa3,taxa4’, ‘4;5’, ‘0.5;0.6’, ‘400;500’).

  • rootheight ((float,float,float)) – The root height calibration info in Beast, default (4,0.5,400).

  • chainset ((int,int)) – The parameters of MCMC chain in Beast, default (10000,100).

  • beastlgjar (path or None) – The path to beastLG.jar, default None.

  • beagle (boolean flag) – Whether to use beagle in Beast, default False.

  • protdating (boolean flag) – Whether to only initiate the protein-concatenation based dating analysis, default False.

KS distribution visualization

The program wgd viz can be used in plotting KS distribution and synteny. To visualize the KS distribution, the command below can be used.

(ENV) $ wgd viz --data ks.tsv

The program wgd viz will automatically calculate and plot the exponential-lognormal mixture modeling (ELMM) result. The influential parameters are em_iterations, which controls the maximum number of EM iterations; em_initializations, which controls the maximum number of EM initializations; prominence_cutoff, which sets the prominence cutoff of acceptable peaks; and rel_height, which sets the relative height at which the peak width is measured.

Corrected KS distribution visualization

To add rate correction result into the KS plot, one can use the command below.

(ENV) $ wgd viz --data ks.tsv --spair speciespair1 --spair speciespair2 --speciestree speciestree.nw --gsmap gene_species.map

The additional required file gene_species.map is produced automatically by the wgd ksd step when it produces the ks.tsv file.

Note

The gene_species.map file is no longer needed from v2.0.21 onwards. Leaving this parameter at its default None is enough.

Synteny visualization

The synteny plot produced by the program wgd syn can be reproduced by wgd viz too. The command is as below.

(ENV) $ wgd viz --anchorpoints apdata --segments smdata --multiplicon mtdata --genetable gtdata --plotsyn (--datafile ksdata)

The influential parameters are minlen, which controls the minimum length of a scaffold to be included in the dotplot; maxsize, which sets the maximum family size to include; minseglen, which sets the minimum length of segments to include; and keepredun, which controls whether redundant multiplicons are kept.

cli.viz(datafile, spair, outdir, gsmap, plotkde, reweight, em_iterations, em_initializations, prominence_cutoff, segments, minlen, maxsize, anchorpoints, multiplicon, genetable, rel_height, speciestree, onlyrootout, minseglen, keepredun, extraparanomeks, plotapgmm, plotelmm, components, mingenenum, plotsyn, dotsize, apalpha, hoalpha)

KS distribution visualization

Synteny visualization

Parameters:
  • datafile (path or None) – The path to datafile, default None.

  • spair (multiple str options) – The species pair to be plotted, default None.

  • outdir (str) – Path of desired output directory, default “wgd_focus”.

  • gsmap (path or None) – The gene name-species name map, default None.

  • plotkde (boolean flag) – Whether to plot kde curve upon histogram, default False.

  • reweight (boolean flag) – Whether to recalculate the weight per species pair, default False.

  • em_iterations (int) – The maximum EM iterations, default “200”.

  • em_initializations (int) – The maximum EM initializations, default “200”.

  • prominence_cutoff (float) – The prominence cutoff of acceptable peaks, default “0.1”.

  • segments (path or None) – The segments data file, default None.

  • minlen (int) – The minimum length of a scaffold to be included in dotplot, if “-1” was set, the 10% of the longest scaffolds will be used, default “-1”.

  • maxsize (int) – The maximum family size to include, default “200”.

  • anchorpoints (path or None) – The anchor points datafile, default None.

  • multiplicon (path or None) – The multiplicons datafile, default None.

  • genetable (path or None) – The gene table datafile, default None.

  • rel_height (float) – The relative height at which the peak width is measured, default “0.4”.

  • speciestree (path or None) – The species tree to perform rate correction, default None.

  • onlyrootout (boolean flag) – Whether to only conduct rate correction using the outgroup at root as outgroup, default False.

  • minseglen (float) – The minimum length of segments to include, ratio if <= 1, default “100000”.

  • keepredun (boolean flag) – Whether to keep redundant multiplicons, default False.

  • extraparanomeks (path or None) – The extra paranome ks data to be plotted in the mixed KS distribution, default None.

  • plotapgmm (boolean flag) – Whether to plot mixture modeling of anchor KS in the mixed KS distribution, default False.

  • plotelmm (boolean flag) – Whether to plot elmm mixture modeling of paranome KS in the mixed KS distribution, default False.

  • components ((int,int)) – The range of the number of components to fit in anchor KS mixture modeling, default (1,4).

  • mingenenum (int) – The minimum number of genes for a segment to be considered, default “30”.

  • plotsyn (boolean flag) – Whether to initiate the synteny plot, default False.

  • dotsize (float) – The dot size in dot plot, default “1”.

  • apalpha (float) – The opacity of anchor dots, default “1”.

  • hoalpha (float) – The opacity of homolog dots, default “0.1”.