Usage

Whole paranome delineation

The first and foremost step in constructing a K_S distribution is to infer the paralogous gene families (the whole paranome). The simplest way to do so is shown below.

(ENV) $ wgd dmd cds.fasta

This step is affected by many parameters. As listed in the parameter description below, the two key parameters are eval, which sets the e-value cut-off for similarity in diamond and thus directly shapes the whole query-subject hits table, and inflation, which sets the inflation factor of the MCL program and thus directly affects the gene family clustering result. Other parameters that are usually minor (but can be major in some conditions) include normalizedpercent, which controls the percentage of upper hits used for gene length normalization, and bins, which controls the number of bins used in gene length normalization. Both may influence the normalization result and thus the gene family clustering.

Global/Local MRBHs inference

Reciprocal best hits (RBHs) are good candidates for representing orthologous relationships between genes. The global MRBHs (multiple RBHs) can be inferred with the command below.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta --globalmrbh

The parameter that directly affects the result of global MRBHs is cscore. If it is left at the default None, strict RBHs are used. If it is set to a float between 0 and 1, the number of inter-specific gene pairs will probably increase, and so will the number of global MRBHs. Normally, more species lead to fewer global MRBHs. Lowering cscore is sometimes needed to acquire enough global MRBHs for other analyses such as phylogenetic inference.
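The effect of cscore can be illustrated with a minimal Python sketch. It assumes that a hit is retained when its score is at least cscore times the best score of its query, and that two genes form a pair when each retains the other. This reading of the c-score, the toy scores and the function names are assumptions for illustration, not the actual wgd implementation.

```python
from collections import defaultdict

def retained_hits(hits, cscore=None):
    """Return {query: set of subjects kept for that query}.

    hits maps (query, subject) to a similarity score. With cscore=None only
    the best-scoring subject(s) are kept (strict RBH); otherwise every subject
    scoring at least cscore * best score is kept (assumed definition).
    """
    by_query = defaultdict(dict)
    for (query, subject), score in hits.items():
        by_query[query][subject] = score
    kept = {}
    for query, subjects in by_query.items():
        best = max(subjects.values())
        threshold = best if cscore is None else cscore * best
        kept[query] = {s for s, score in subjects.items() if score >= threshold}
    return kept

def reciprocal_pairs(a_to_b, b_to_a, cscore=None):
    """Gene pairs (a, b) that are retained in both search directions."""
    kept_ab = retained_hits(a_to_b, cscore)
    kept_ba = retained_hits(b_to_a, cscore)
    return sorted((a, b) for a, subs in kept_ab.items()
                  for b in subs if a in kept_ba.get(b, set()))

# Hypothetical scores between species A and species B.
a_to_b = {("a1", "b1"): 200.0, ("a1", "b2"): 190.0, ("a2", "b2"): 150.0}
b_to_a = {("b1", "a1"): 200.0, ("b2", "a1"): 190.0, ("b2", "a2"): 150.0}

strict = reciprocal_pairs(a_to_b, b_to_a)                # cscore left at None
relaxed = reciprocal_pairs(a_to_b, b_to_a, cscore=0.7)   # relaxed criterion
print(len(strict), len(relaxed))
```

In this toy case the strict setting keeps a single pair, while the relaxed setting also keeps the near-best hits, which is why lowering cscore tends to yield more inter-specific pairs.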

The local MRBHs, which focus on a specific species, are a logical regime for searching pairwise orthologues to build the orthogroups used in phylogenetic dating. The following command can be used to infer local MRBHs. The influential parameters are the same as for global MRBHs.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta --focus cds1.fasta

Orthogroup inference

Orthogroups are the most common units used in evolutionary biology for comparative analysis. They can be inferred with the command below.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta -oi

There are two ways of inferring orthogroups. The concatenation method first concatenates all cds files into a single super cds file and then infers the query-subject hits table as for a single genome, with gene length bias correction done per species pair and the normalized scores fed into the MCL program for the final gene family clustering. The other method differs only in how the query-subject hits table is obtained: via pairwise diamond searches run in parallel instead of via the super cds file. The influential parameters are the same as in the whole paranome inference.

Collinear coalescence inference of phylogeny

Traditional phylogenetic inference relies on sequence-based orthologues. Collinear orthologues are still understudied in inferring phylogeny. Such orthologues, conserved in both gene content and gene order, can be used to infer phylogeny under the multi-species coalescent (MSC) model with the command below.

(ENV) $ wgd dmd cds1.fasta cds2.fasta cds3.fasta --anchorpoints apdata --segments smdata --listelements ledata --genetable gtdata --collinearcoalescence

We rely on the software ASTRAL to summarize the species tree from the gene trees inferred per collinear orthologue group. By default, at least 80% of the species are required to be present in a multiplicon, and the intersected anchor points across all levels (levels on the same scaffold are treated the same as levels on different scaffolds) are retrieved as separate collinear orthologue groups (labelled, for instance, Multiplicon1_Ap1 and Multiplicon1_Ap2). The species occurrence ratio is controlled by the parameter msogcut. The gene tree inference method and its settings are controlled by the parameters tree_method and treeset.
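The species occurrence criterion can be sketched as a simple ratio test. The function name, the inclusive comparison and the species labels below are assumptions for illustration; only the default ratio of 0.8 comes from the description of msogcut.

```python
def passes_species_cutoff(species_present, all_species, msogcut=0.8):
    """True if the fraction of species represented reaches msogcut."""
    expected = set(all_species)
    ratio = len(set(species_present) & expected) / len(expected)
    return ratio >= msogcut  # "at least" is read as an inclusive bound

all_species = ["sp1", "sp2", "sp3", "sp4", "sp5"]  # hypothetical species labels
kept = passes_species_cutoff(["sp1", "sp2", "sp3", "sp4"], all_species)
dropped = passes_species_cutoff(["sp1", "sp2", "sp3"], all_species)
print(kept, dropped)
```

A multiplicon covering four of five species reaches the default ratio, whereas one covering three of five does not.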

Orthogroup assignment

If an orthogroup file is already available, additional genome or transcriptome sequences can be assigned to the existing orthogroups with the command below.

(ENV) $ wgd dmd old_seq1 old_seq2 old_seq3 --geneassign --seq2assign new_seq1 --seq2assign new_seq2 --fam2assign families.tsv

Here “old_seq” denotes a sequence file already included in the existing orthogroups, while “new_seq” denotes an additional sequence file to be assigned. The option seq2assign can be given multiple times.

Note

The version of hmmer we tested with is 3.1b2. The version of numpy we tested with is 1.19.0. Higher or lower versions of these dependencies might interfere with the proper functioning of this function (and of other functions as well). Please consider changing the version of the relevant packages if you encounter errors.

cli.dmd(sequences, outdir, tmpdir, cscore, inflation, eval, to_stop, cds, focus, anchorpoints, keepfasta, keepduplicates, globalmrbh, nthreads, orthoinfer, onlyortho, getnsog, tree_method, treeset, msogcut, geneassign, assign_method, seq2assign, fam2assign, concat, segments, listelements, collinearcoalescence, testsog, bins, buscosog, buscohmm, buscocutoff, genetable, normalizedpercent, nonormalization)

Whole paranome inference

Global/Local MRBHs inference

Orthogroups inference

Phylogeny inference based on the collinear coalescence

Parameters:

sequences (paths) – Argument of sequence files.
outdir (str) – Path of desired output directory, default “wgd_dmd”.
tmpdir (str or None) – Path of temporary directory.
cscore (float or None) – The c-score to restrict the homologs of MRBHs, default “None”.
inflation (float) – The inflation factor for MCL program, default “2.0”.
eval (float) – The e-value cut-off for similarity in diamond and/or hmmer, default “1e-10”.
to_stop (boolean flags) – Whether to translate through STOP codons, default False.
cds (boolean flags) – Whether to only translate complete CDS, that is, sequences that start with a valid start codon, contain a single in-frame stop codon at the end, and have a length divisible by three, default False.
focus (path or None) – The species to be merged on local MRBHs, default “None”.
anchorpoints (path or None) – The anchor points data file, default “None”.
segments (path or None) – The segments data file used in collinear coalescence analysis if initiated, default “None”.
listelements (path or None) – The listsegments data file used in collinear coalescence analysis if initiated, default “None”.
keepfasta (boolean flags) – Whether to output the sequence information of MRBHs, default False.
keepduplicates (boolean flags) – Whether to allow the same gene to occur in different MRBHs, default False.
globalmrbh (boolean flags) – Whether to initiate global MRBHs construction, default False.
nthreads (int) – The number of threads to use, default 4.
orthoinfer (boolean flags) – Whether to initiate orthogroup inference, default False.
onlyortho (boolean flags) – Whether to only conduct orthogroup inference, default False.
getnsog (boolean flags) – Whether to initiate the searching for nested single-copy gene families (NSOGs), default False.
tree_method (choice ['fasttree', 'iqtree', 'mrbayes']) – Which gene tree inference program to invoke, default “fasttree”.
treeset (multiple str options or None) – The parameters setting for gene tree inference, default None.
msogcut (float) – The ratio cutoff for mostly single-copy family and species representation in collinear coalescence inference, default “0.8”.
geneassign (boolean flags) – Whether to initiate the gene-to-family assignment analysis, default False.
assign_method (choice ['hmmer', 'diamond']) – Which method to conduct the gene-to-family assignment analysis, default “hmmer”.
seq2assign (multiple path options or None) – The queried sequences data file in gene-to-family assignment analysis, default None.
fam2assign (path or None) – The queried family data file in gene-to-family assignment analysis, default None.
concat (boolean flags) – Whether to initiate the concatenation pipeline for orthogroup inference, default False.
collinearcoalescence (boolean flags) – Whether to initiate the collinear coalescence analysis, default False.
testsog (boolean flags) – Whether to initiate the unbiased test of single-copy gene families, default False.
bins (int) – The number of bins divided in gene length normalization, default “100”.
normalizedpercent (int) – The percentage of upper hits used for gene length normalization, default “5”.
nonormalization (boolean flags) – Whether to call off the normalization, default False.
buscosog (boolean flags) – Whether to initiate the busco-guided single-copy gene family analysis, default False.
buscohmm (path or None) – The HMM profile datafile in the busco-guided single-copy gene family analysis, default None.
buscocutoff (path or None) – The HMM score cutoff datafile in the busco-guided single-copy gene family analysis, default None.
genetable (path or None) – The gene table datafile used in collinear coalescence analysis, default None.

K_S distribution construction

After obtaining the paralogous/orthologous gene family file, the construction of K_S distribution can be achieved as below.

(ENV) $ wgd ksd families *cds.fasta

The gene family file is a mandatory input and can be obtained from the earlier steps using wgd dmd. The meaning of the constructed K_S distribution depends on the number of species (that is, the cds sequence files provided). If only one species is given, the whole paranome K_S distribution is built, which is a good basis for the primary identification of WGDs. If orthogroups of multiple species are given, a mixed paralog-ortholog K_S distribution is built, and its further subdivision per species pair can be used to diagnose the phylogenetic placement of the WGD events of interest. If the RBH gene family of two species is given, the constructed K_S distribution shows the K_S age of the divergence event, as illustrated in wgd v1.

This step is quite flexible in that various types of gene family files can be provided, for instance the paralogous gene family, the orthogroup, the RBH gene family, the collinear orthologous gene family (inferred in the collinear coalescence inference analysis), or any other gene families for which users would like to calculate K_S, as long as the file is in the correct format: a tab-separated table in which each row represents a gene family and each column represents a species, with the first column holding the gene family labels and the first row holding the species names.

The influential parameters include pairwise, which determines whether K_S is calculated from the alignment of each paralogous gene pair instead of from the whole alignment of the family, and strip_gaps, which controls whether all gaps are dropped from the multiple sequence alignment. The strip_gaps parameter only makes a difference in combination with pairwise, because the program codeml strips all gaps anyway. The default tree_method is the external software fasttree, which can be replaced with the built-in cluster method.
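The expected gene family table format can be made concrete with a short Python sketch that parses such a table. The separator between several genes of one species within a cell (a comma here), the family labels and the gene names are assumptions for illustration; only the row and column layout follows the description above.

```python
import csv
import io

def read_families(handle, gene_sep=","):
    """Parse a gene family table into {family: {species: [genes]}}.

    The first row holds the species names and the first column holds the
    family labels. gene_sep, the separator between genes sharing a cell,
    is an assumption.
    """
    reader = csv.reader(handle, delimiter="\t")
    species = next(reader)[1:]
    families = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        label, cells = row[0], row[1:]
        families[label] = {
            sp: [g.strip() for g in cell.split(gene_sep) if g.strip()]
            for sp, cell in zip(species, cells)
        }
    return families

# Hypothetical two-species table: one multi-copy family, one species-specific family.
table = (
    "\tspeciesA\tspeciesB\n"
    "GF00000001\tA_g1, A_g2\tB_g1\n"
    "GF00000002\tA_g3\t\n"
)
families = read_families(io.StringIO(table))
n_species = len(next(iter(families.values())))
print(n_species, families["GF00000001"])
```

With a single species column the same structure describes a whole paranome, and with several columns it describes orthogroups or RBH families.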

Corrected K_S distribution construction

To determine the phylogenetic placement of a given WGD, a rate-corrected K_S distribution is needed, in which the rate-dependent orthologous K_S values are rescaled to the substitution rate of the focal species. The command to achieve this is shown below.

(ENV) $ wgd ksd families *cds.fasta --spair speciespair1 --spair speciespair2 --speciestree speciestree.nw

This step is mainly affected by the parameter reweight, which determines whether the weights are recalculated per species pair instead of being taken from the whole family, and the parameter onlyrootout, which determines whether rate correction is conducted using only the outgroup at the root. The parameter extraparanomeks allows extra paralogous K_S data to be added to those possibly present in the original K_S data; when the same paralogous gene pair occurs in both, only the entry from the extra paralogous K_S data is kept. This handles the case in which users provide gene families containing an incomplete set of paralogous gene pairs and supplement them with extra data that may overlap. It is nonetheless suggested to supply the paralogous K_S only through extraparanomeks and to keep only orthologous gene pairs in the gene families (for instance, the global MRBHs), because the elmm modeling only considers the paralogs passed through extraparanomeks. The extraparanomeks parameter does not affect the result of rate correction, which only involves orthologous K_S. We will discuss this further in the recipe.
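The idea behind rate correction can be sketched with a relative rate test, in which an outgroup is used to split the orthologous K_S between two species into branch-specific contributions. The arithmetic below is one common formulation and is an assumption here; the exact procedure implemented in wgd ksd, including how several outgroups are combined, may differ. All names and numbers are hypothetical.

```python
def rate_correct(ks_focal_sister, ks_focal_out, ks_sister_out):
    """Rescale a focal-sister K_S to the substitution rate of the focal species.

    Returns (corrected K_S, focal branch K_S, sister branch K_S), where the
    branch values are the K_S accumulated on each lineage since divergence.
    """
    focal_branch = (ks_focal_sister + ks_focal_out - ks_sister_out) / 2
    sister_branch = (ks_focal_sister + ks_sister_out - ks_focal_out) / 2
    # Had both lineages evolved at the focal rate, the divergence would
    # appear at twice the focal branch contribution.
    return 2 * focal_branch, focal_branch, sister_branch

# Hypothetical orthologous K_S peaks in which the sister species evolves faster.
corrected, focal_branch, sister_branch = rate_correct(1.0, 1.5, 2.0)
print(corrected, focal_branch, sister_branch)
```

In this toy case the corrected divergence K_S is smaller than the raw focal-sister K_S because the faster sister lineage inflated the raw value.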

K_S tree inference

The K_S tree is a good way of visualizing and quantifying the variation in substitution rate. In wgd v2 we implemented, in wgd ksd, an additional way of estimating K_S branch lengths given a species tree or an inferred gene tree. The command is shown below.

(ENV) $ wgd ksd kstree_data/fam.tsv data/kstree_data/Acorus_tatarinowii kstree_data/Amborella_trichopoda kstree_data/Aquilegia_coerulea kstree_data/Aristolochia_fimbriata kstree_data/Cycas_panzhihuaensi --kstree --speciestree kstree_data/species_tree1.nw --onlyconcatkstree -o wgd_kstree_topology1

The option kstree tells the program to infer a K_S tree instead of pairwise K_S. The option speciestree takes the input species tree. The option onlyconcatkstree tells the program to conduct the K_S analysis only for the concatenated alignment rather than for the alignment of every family. We used two global MRBH families for the illustration above. Three alternative topologies were applied given the uncertain phylogenetic relationships within angiosperms.

cli.ksd(families, sequences, outdir, tmpdir, nthreads, to_stop, cds, pairwise, strip_gaps, tree_method, spair, speciestree, reweight, onlyrootout, extraparanomeks, anchorpoints, plotkde, plotapgmm, components)

K_S distribution construction

Corrected K_S distribution construction

Parameters:

families (path) – Argument of gene family file.
sequences (paths) – Argument of sequence files.
outdir (str) – Path of desired output directory, default “wgd_ksd”.
tmpdir (str or None) – Path of temporary directory.
nthreads (int) – The number of threads to use, default 4.
to_stop (boolean flags) – Whether to translate through STOP codons, default False.
cds (boolean flags) – Whether to only translate complete CDS, that is, sequences that start with a valid start codon, contain a single in-frame stop codon at the end, and have a length divisible by three, default False.
pairwise (boolean flag) – Whether to initiate pairwise K_S estimation, default False.
strip_gaps (boolean flag) – Whether to drop all gaps in multiple sequence alignment.
tree_method (choice ['cluster', 'fasttree', 'iqtree']) – Which gene tree inference program to invoke, default “fasttree”.
spair (multiple str options or None) – The species pair to be plotted, default None.
speciestree (path or None) – The species tree to perform rate correction, default None.
reweight (boolean flag) – Whether to recalculate the weight per species pair, default False.
onlyrootout (boolean flag) – Whether to only conduct rate correction using the outgroup at root as outgroup, default False.
extraparanomeks (path or None) – The extra paranome ks data to be plotted in the mixed K_S distribution, default None.
anchorpoints (path or None) – The anchorpoints.txt file to plot anchor K_S in the mixed K_S distribution, default None.
plotkde (boolean flag) – Whether to plot the kde curve of the orthologous K_S distribution over the histogram in the mixed K_S distribution, default False.
plotapgmm (boolean flag) – Whether to plot mixture modeling of anchor K_S in the mixed K_S distribution, default False.
plotelmm (boolean flag) – Whether to plot elmm mixture modeling of paranome K_S in the mixed K_S distribution, default False.
components ((int,int)) – The range of the number of components to fit in anchor K_S mixture modeling, default (1,4).

Mixture model clustering of K_S distribution

This function is inherited from wgd v1: Gaussian mixture models (gmm) and Bayesian Gaussian mixture models (bgmm) are available for mixture modeling of any K_S distribution. The command to achieve this is shown below.

(ENV) $ wgd mix ks.tsv

This analysis relies mainly on the mixture module of the scikit-learn library. Almost all parameters of this function have an impact on the results. Please see the description below for detailed information.
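To show what fitting a mixture model to a K_S distribution involves, the following dependency-free sketch runs expectation-maximization (EM) for a one-dimensional Gaussian mixture on log-transformed K_S values. It is a stand-in written for illustration and does not use scikit-learn; the log transform, the deterministic initialization and the toy K_S values are assumptions rather than a description of how wgd mix works internally.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(values, n_components=2, max_iter=1000, tol=1e-8):
    """Fit a 1D Gaussian mixture by EM; returns (weights, means, variances)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: each component takes one chunk of the sorted data.
    bounds = [i * n // n_components for i in range(n_components + 1)]
    means = [sum(data[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
    grand_mean = sum(data) / n
    variances = [sum((x - grand_mean) ** 2 for x in data) / n] * n_components
    weights = [1.0 / n_components] * n_components
    previous = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of every component for every data point.
        resp, loglik = [], 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsing component
        if abs(loglik - previous) < tol:
            break
        previous = loglik
    return weights, means, variances

# Hypothetical K_S values with a young and an older peak.
ks = [0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]
weights, means, variances = fit_gmm_1d([math.log(k) for k in ks])
peaks = [math.exp(m) for m in means]  # component medians on the K_S scale
print([round(p, 2) for p in peaks])
```

The parameters n_init and max_iter of wgd mix play the same kind of role as the initialization and iteration limit in this sketch, which is why they can change the fitted components.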

cli.mix(ks_distribution, filters, ks_range, method, components, bins, output_dir, gamma, n_init, max_iter)

Mixture modeling of K_S distribution

Parameters:

ks_distribution (path) – Argument of K_S distribution file.
filters (int) – The cutoff alignment length, default “300”.
ks_range ((float,float)) – The K_S range to be considered, default (0, 5).
method (choice ['gmm', 'bgmm']) – Which mixture model to use, default “gmm”.
components ((int,int)) – The range of the number of components to fit, default (1,4).
bins (int) – The number of bins in K_S distribution, default “50”.
outdir (str) – Path of desired output directory, default “wgd_mix”.
gamma (float) – The gamma parameter for bgmm models, default “0.001”.
n_init (int) – The number of k-means initializations, default “200”.
max_iter (int) – The maximum number of iterations, default “1000”.

Synteny inference

Synteny is nowadays frequently used in profiling the evolution of pseudochromosomes. The program wgd syn performs all synteny inference. The simplest command is shown below.

(ENV) $ wgd syn families.tsv gff3 (-ks ksdata)

The influential parameters for synteny inference include minlen, which controls the minimum length of a scaffold to be considered; maxsize, which controls the maximum family size to be considered; minseglen, which sets the minimum length of segments to be considered; keepredun, which determines whether redundant multiplicons are kept; and mingenenum, which controls the minimum number of genes a segment must carry to be considered.

cli.syn(families, gff_files, ks_distribution, outdir, feature, attribute, minlen, maxsize, ks_range, iadhore_options, ancestor, minseglen, keepredun, mingenenum, dotsize, apalpha, hoalpha)

Synteny inference

Parameters:

families (path) – Argument of gene family file.
gff_files (paths) – Argument of gff3 files.
ks_distribution (path or None) – The K_S distribution datafile, default None.
outdir (str) – Path of desired output directory, default “wgd_syn”.
feature (str) – The feature for parsing gene IDs from GFF files, default “gene”.
attribute (str) – The attribute for parsing the gene IDs from the GFF files, default “ID”.
minlen (int) – The minimum length of a scaffold to be included in dotplot, default “-1”.
maxsize (int) – The maximum family size to include, default “200”.
ks_range ((float,float)) – The K_S range in colored dotplot, default (0, 5).
iadhore_options (str) – The parameter setting in iadhore, default “”.
ancestor (str or None) – The assumed ancestor species, it’s still under development, default None.
minseglen (float) – The minimum length of segments to include, ratio if <= 1, default 100000.
keepredun (boolean flags) – Whether to keep redundant multiplicons, default False.
mingenenum (int) – The minimum number of genes for a segment to be considered, default 30.
dotsize (float) – The dot size in dot plot, default “1”.
apalpha (float) – The opacity of anchor dots, default “1”.
hoalpha (float) – The opacity of homolog dots, default “0.1”.

Search of anchor pairs for molecular dating

If we want to date the identified WGD events, determining anchor pairs to be dated is the key step. The program wgd peak can achieve this goal. The command is as below.

(ENV) $ wgd peak ksdata -ap apdata -sm smdata -le ledata -mp mpdata

Two methods are available in this step. The first is the heuristic method, enabled with the flag heuristic, which refines the anchor pairs heuristically based on the peaks detected by scipy.signal and their associated properties. The second retrieves the highest density region (HDR) of the segment-guided syntelogs, subject to a user-defined K_S saturation cutoff.
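The HDR-based method can be illustrated with a minimal sketch that takes the HDR of a sample to be the shortest interval covering the requested percentage of the sorted K_S values, after discarding values above the saturation cutoff. This interval-based definition, the handling of the cutoff, and all names and values below are assumptions for illustration; wgd peak may derive the HDR differently, for instance from a fitted density.

```python
def highest_density_region(values, hdr=95, kscutoff=5.0):
    """Shortest interval covering at least hdr percent of the retained values."""
    data = sorted(v for v in values if v <= kscutoff)  # drop saturated K_S
    n = len(data)
    if n == 0:
        raise ValueError("no K_S values left after applying the cutoff")
    k = max(1, (hdr * n + 99) // 100)  # integer ceiling of hdr percent of n
    # Slide a window of k points over the sorted data and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: data[i + k - 1] - data[i])
    return data[start], data[start + k - 1]

# Hypothetical anchor-pair K_S values with two outliers and one saturated value.
ks = [0.1, 0.9, 1.0, 1.05, 1.1, 1.15, 1.2, 1.3, 2.8, 6.0]
low, high = highest_density_region(ks, hdr=75, kscutoff=5.0)
print(low, high)
```

Anchor pairs whose K_S falls inside the returned interval would then be the candidates passed on to dating.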

cli.peak(ks_distribution, anchorpoints, outdir, alignfilter, ksrange, bin_width, weights_outliers_included, method, seed, em_iter, n_init, components, boots, weighted, plot, bw_method, n_medoids, kdemethod, n_clusters, kmedoids, guide, prominence_cutoff, kstodate, family, rel_height, ci, manualset, segments, hdr, heuristic, listelements, multipliconpairs, kscutoff, gamma)

Search of anchor pairs used in WGD dating

Parameters:

ks_distribution (path) – Argument of K_S distribution datafile.
anchorpoints (path or None) – The anchor points datafile, default None.
segments (path or None) – The segments datafile, default None.
outdir (str) – Path of desired output directory, default “wgd_peak”.
alignfilter ((float,int,float)) – The cutoff for alignment identity, length and coverage, default (0.0, 0, 0.0).
ksrange ((float,float)) – The range of K_S to be analyzed, default (0,5).
bin_width (float) – The bandwidth of K_S distribution, default “0.1”.
weights_outliers_included (boolean flags) – Whether to include K_S outliers, default False.
method (choice ['gmm', 'bgmm']) – Which mixture model to use, default “gmm”.
seed (int) – Random seed given to initialization, default “2352890”.
em_iter (int) – The number of EM iterations to perform, default “200”.
n_init (int) – The number of k-means initializations, default “200”.
components ((int,int)) – The range of the number of components to fit, default (1,4).
boots (int) – The number of bootstrap replicates of kde, default “200”.
weighted (boolean flags) – Whether to use node-weighted method of de-redundancy, default False.
plot (choice(['stacked', 'identical'])) – The plotting method to be used, default “identical”.
bw_method (choice['silverman', 'ISJ']) – The bandwidth method to be used, default “silverman”.
n_medoids (int) – The number of medoids to fit, default “2”.
kdemethod (choice['scipy', 'naivekde', 'treekde', 'fftkde']) – The kde method to be used, default “scipy”.
n_clusters (int) – The number of clusters to plot Elbow loss function, default “5”.
kmedoids (boolean flags) – Whether to initiate K-Medoids clustering analysis, default False.
guide (choice['multiplicon', 'basecluster', 'segment']) – The regime residing anchors, default “segment”.
prominence_cutoff (float) – The prominence cutoff of acceptable peaks, default “0.1”.
kstodate ((float,float)) – The range of K_S to be dated in heuristic search, default (0.5, 1.5).
family (path or None) – The family to filter K_S upon, default None.
manualset (boolean flags) – Whether to output anchor pairs with manually set K_S range, default False.
rel_height (float) – The relative height at which the peak width is measured, default “0.4”.
ci (int) – The confidence level of log-normal distribution to date, default “95”.
hdr (int) – The highest density region (HDR) in a given distribution to date, default “95”.
heuristic (boolean flags) – Whether to initiate heuristic method of defining CI for dating, default False.
listelements (path or None) – The listsegments datafile, default None.
multipliconpairs (path or None) – The multipliconpairs datafile, default None.
kscutoff (float) – The K_S saturation cutoff in dating, default “5”.
gamma (float) – The gamma parameter for bgmm models, default “1e-3”.

Concatenation-/Coalescence-based phylogenetic inference

The program wgd focus can handle various analyses. To recover the phylogeny with either the concatenation-based or the coalescence-based method, the following command can be used.

(ENV) $ wgd focus families *cds.fasta --concatenation/--coalescence

This step 1) writes the sequences of each gene family and 2) infers a multiple sequence alignment (MSA) for each gene family. With the concatenation method, it then 3) concatenates all MSAs and infers the maximum likelihood tree, which is taken as the species tree. With the coalescence method, it instead 3) infers a maximum likelihood gene tree for each gene family and summarizes the species tree using ASTRAL. The influential parameters include tree_method, which selects the program used to infer gene trees, and treeset, which passes parameter settings to the gene tree inference.
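The concatenation of per-family alignments can be sketched as follows. The data structures, the gap padding for species missing from a family, and the toy sequences are assumptions for illustration and are not taken from the wgd source.

```python
def concatenate_alignments(alignments, species):
    """Join per-family alignments into one sequence per species.

    alignments is a list of {species: aligned sequence}. A species missing
    from a family is padded with gaps so that all rows keep the same length.
    """
    supermatrix = {sp: [] for sp in species}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError("sequences within an alignment must have equal length")
        for sp in species:
            supermatrix[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(parts) for sp, parts in supermatrix.items()}

# Hypothetical alignments of two gene families across three species.
family1 = {"A": "ATG---", "B": "ATGAAA", "C": "ATGAAG"}
family2 = {"A": "CCC", "B": "CCT"}  # species C lacks this family
concatenated = concatenate_alignments([family1, family2], ["A", "B", "C"])
print(concatenated)
```

The resulting supermatrix is the kind of input a tree inference program takes in the concatenation method, whereas the coalescence method keeps the per-family alignments separate.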

Functional annotation of gene families

The functional annotation of gene families can also be achieved in wgd focus using the command below.

(ENV) $ wgd focus families *cds.fasta --annotation eggnog --eggnogdata eddata --dmnb dbdata

This step relies on a database provided by the user. The executable of the annotation program (for instance eggnog, hmmer or interproscan) must be pre-set in the environment variables. The influential parameter is evalue, which controls the e-value cut-off for annotation.

WGD dating

The program wgd focus is the final step in WGD dating. After obtaining the anchor pairs-merged orthogroups from wgd dmd, the phylogenetic dating can be conducted with the command below.

(ENV) $ wgd focus families sequence1 sequence2 sequence3 --dating mcmctree --speciestree species_tree.nw

The command shown above is a simple example that calls the molecular dating program mcmctree to conduct the dating. A species tree is a mandatory input. Further dating parameters can be provided via the datingset parameter.

cli.focus(families, sequences, outdir, tmpdir, nthreads, to_stop, cds, strip_gaps, aligner, tree_method, treeset, concatenation, coalescence, speciestree, dating, datingset, nsites, outgroup, partition, aamodel, ks, annotation, pairwise, eggnogdata, pfam, dmnb, hmm, evalue, exepath, fossil, rootheight, chainset, beastlgjar, beagle, protdating)

Concatenation-/Coalescence-based phylogenetic inference

Functional annotation of gene families

Phylogenetic dating of WGDs

Parameters:

families (path) – Argument of gene family file.
sequences (paths) – Argument of sequence files.
outdir (str) – Path of desired output directory, default “wgd_focus”.
tmpdir (str or None) – Path of temporary directory.
nthreads (int) – The number of threads to use, default 4.
to_stop (boolean flags) – Whether to translate through STOP codons, default False.
cds (boolean flags) – Whether to only translate complete CDS, that is, sequences that start with a valid start codon, contain a single in-frame stop codon at the end, and have a length divisible by three, default False.
strip_gaps (boolean flag) – Whether to drop all gaps in multiple sequence alignment.
aligner (choice(['muscle', 'prank', 'mafft'])) – Which alignment program to use, default “mafft”.
tree_method (choice ['fasttree', 'iqtree', 'mrbayes']) – Which gene tree inference program to invoke, default “fasttree”.
treeset (multiple str options or None) – The parameters setting for gene tree inference, default None.
concatenation (boolean flag) – Whether to initiate the concatenation-based species tree inference, default False.
coalescence (boolean flag) – Whether to initiate the coalescence-based species tree inference, default False.
speciestree (path or None) – The species tree datafile for dating, default None.
dating (choice(['beast', 'mcmctree', 'r8s', 'none'])) – Which molecular dating program to use, default None.
datingset (multiple str options or None) – The parameters setting for dating program, default None.
nsites (int) – The nsites information for r8s dating, default None.
outgroup (str) – The outgroup information for r8s dating, default None.
partition (boolean flag) – Whether to initiate partition dating analysis for codon, default False.
aamodel (choice(['poisson','wag', 'lg', 'dayhoff'])) – Which protein model to be used in mcmctree, default “poisson”.
ks (boolean flag) – Whether to initiate K_S calculation, default False.
annotation (choice(['none','eggnog', 'hmmpfam', 'interproscan'])) – Which annotation program to use, default None.
pairwise (boolean flag) – Whether to initiate pairwise K_S estimation, default False.
eggnogdata (path or None) – The eggnog annotation datafile, default None.
pfam (choice(['none', 'denovo', 'realign'])) – Which option to use for pfam annotation, default None.
dmnb (path or None) – The diamond database for annotation, default None.
hmm (path or None) – The HMM profile for annotation, default None.
evalue (float) – The e-value cut-off for annotation, default “1e-10”.
exepath (path or None) – The path to the interproscan executable, default None.
fossil ((str,str,str,str,str)) – The fossil calibration information in Beast, default (‘clade1;clade2’, ‘taxa1,taxa2;taxa3,taxa4’, ‘4;5’, ‘0.5;0.6’, ‘400;500’).
rootheight ((float,float,float)) – The root height calibration info in Beast, default (4,0.5,400).
chainset ((int,int)) – The parameters of MCMC chain in Beast, default (10000,100).
beastlgjar (path or None) – The path to beastLG.jar, default None.
beagle (boolean flag) – Whether to use beagle in Beast, default False.
protdating (boolean flag) – Whether to only initiate the protein-concatenation based dating analysis, default False.

K_S distribution visualization

The program wgd viz can be used in plotting K_S distribution and synteny. To visualize the K_S distribution, the command below can be used.

(ENV) $ wgd viz --data ks.tsv

The program wgd viz automatically calculates and plots the exponential-lognormal mixture modeling (ELMM) result. The influential parameters include em_iterations, which controls the maximum number of EM iterations; em_initializations, which controls the maximum number of EM initializations; prominence_cutoff, which sets the prominence cutoff of acceptable peaks; and rel_height, which sets the relative height at which the peak width is measured.

Corrected K_S distribution visualization

To add rate correction result into the K_S plot, one can use the command below.

(ENV) $ wgd viz --data ks.tsv --spair speciespair1 --spair speciespair2 --speciestree speciestree.nw --gsmap gene_species.map

The additional required file gene_species.map is automatically produced by the wgd ksd step when producing the ks.tsv file.

Note

The gene_species.map file is no longer needed from v2.0.21 onwards. Leaving this parameter at its default None is enough.

Synteny visualization

The synteny plot produced by the program wgd syn can be reproduced by wgd viz too. The command is as below.

(ENV) $ wgd viz --anchorpoints apdata --segments smdata --multiplicon mtdata --genetable gtdata --plotsyn (--datafile ksdata)

The influential parameters include minlen, which controls the minimum length of a scaffold to be included in the dotplot; maxsize, which sets the maximum family size to include; minseglen, which sets the minimum length of segments to include; and keepredun, which controls whether redundant multiplicons are kept.

cli.viz(datafile, spair, outdir, gsmap, plotkde, reweight, em_iterations, em_initializations, prominence_cutoff, segments, minlen, maxsize, anchorpoints, multiplicon, genetable, rel_height, speciestree, onlyrootout, minseglen, keepredun, extraparanomeks, plotapgmm, plotelmm, components, mingenenum, plotsyn, dotsize, apalpha, hoalpha)

K_S distribution visualization

Synteny visualization

Parameters:

datafile (path or None) – The path to datafile, default None.
spair (multiple str options) – The species pair to be plotted, default None.
outdir (str) – Path of desired output directory, default “wgd_focus”.
gsmap (path or None) – The gene name-species name map, default None.
plotkde (boolean flag) – Whether to plot kde curve upon histogram, default False.
reweight (boolean flag) – Whether to recalculate the weight per species pair, default False.
em_iterations (int) – The maximum EM iterations, default “200”.
em_initializations (int) – The maximum EM initializations, default “200”.
prominence_cutoff (float) – The prominence cutoff of acceptable peaks, default “0.1”.
segments (path or None) – The segments data file, default None.
minlen (int) – The minimum length of a scaffold to be included in dotplot, if “-1” was set, the 10% of the longest scaffolds will be used, default “-1”.
maxsize (int) – The maximum family size to include, default “200”.
anchorpoints (path or None) – The anchor points datafile, default None.
multiplicon (path or None) – The multiplicons datafile, default None.
genetable (path or None) – The gene table datafile, default None.
rel_height (float) – The relative height at which the peak width is measured, default “0.4”.
speciestree (path or None) – The species tree to perform rate correction, default None.
onlyrootout (boolean flag) – Whether to only conduct rate correction using the outgroup at root as outgroup, default False.
minseglen (float) – The minimum length of segments to include, interpreted as a ratio if <= 1, default “100000”.
keepredun (boolean flag) – Whether to keep redundant multiplicons, default False.
extraparanomeks (path or None) – The extra paranome ks data to be plotted in the mixed K_S distribution, default None.
plotapgmm (boolean flag) – Whether to plot mixture modeling of anchor K_S in the mixed K_S distribution, default False.
plotelmm (boolean flag) – Whether to plot elmm mixture modeling of paranome K_S in the mixed K_S distribution, default False.
components ((int,int)) – The range of the number of components to fit in anchor K_S mixture modeling, default (1,4).
mingenenum (int) – The minimum number of genes for a segment to be considered, default “30”.
plotsyn (boolean flag) – Whether to initiate the synteny plot, default False.
dotsize (float) – The dot size in dot plot, default “1”.
apalpha (float) – The opacity of anchor dots, default “1”.
hoalpha (float) – The opacity of homolog dots, default “0.1”.
