Homology Workflow

Welcome to REAT
version - 0.6.0

Command-line call:
reat homology --help


usage: reat homology [-h] --genome GENOME [-p OUTPUT_PREFIX]
                     --alignment_species ALIGNMENT_SPECIES
                     [--annotations_csv ANNOTATIONS_CSV]
                     [--protein_sequences [PROTEIN_SEQUENCES [PROTEIN_SEQUENCES ...]]]
                     [--annotation_filters {all,none,exon_len,intron_len,internal_stop,aa_len,splicing} [{all,none,exon_len,intron_len,internal_stop,aa_len,splicing} ...]]
                     --mikado_config MIKADO_CONFIG --mikado_scoring
                     MIKADO_SCORING [--junctions JUNCTIONS] [--utrs UTRS]
                     [--pick_extra_config PICK_EXTRA_CONFIG]
                     [--min_cdna_length MIN_CDNA_LENGTH]
                     [--max_intron_length MAX_INTRON_LENGTH]
                     [--filter_min_cds FILTER_MIN_CDS]
                     [--filter_max_intron FILTER_MAX_INTRON]
                     [--filter_min_exon FILTER_MIN_EXON]
                     [--alignment_min_exon_len ALIGNMENT_MIN_EXON_LEN]
                     [--alignment_filters {all,none,exon_len,intron_len,internal_stop,aa_len,splicing} [{all,none,exon_len,intron_len,internal_stop,aa_len,splicing} ...]]
                     [--alignment_min_identity ALIGNMENT_MIN_IDENTITY]
                     [--alignment_min_coverage ALIGNMENT_MIN_COVERAGE]
                     [--alignment_max_per_query ALIGNMENT_MAX_PER_QUERY]
                     [--alignment_recursion_level ALIGNMENT_RECURSION_LEVEL]
                     [--alignment_show_intron_length]
                     [--exon_f1_filter EXON_F1_FILTER]
                     [--junction_f1_filter JUNCTION_F1_FILTER]

optional arguments:
  -h, --help            show this help message and exit
  --genome GENOME       Fasta file of the genome to annotate (default: None)
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the final output files (default: xspecies)
  --alignment_species ALIGNMENT_SPECIES
                        Species specific parameters, select a value from the first or second column of https://raw.githubusercontent.com/ogotoh/spaln/master/table/gnm2tab (default: None)
  --annotations_csv ANNOTATIONS_CSV
                        CSV file with reference annotations to extract proteins/cdnas for spliced alignments. The CSV fields are: genome_fasta,annotation_gff
                        Example:
                        Athaliana.fa,Athaliana.gff (default: None)
  --protein_sequences [PROTEIN_SEQUENCES [PROTEIN_SEQUENCES ...]]
                        List of files containing protein sequences to use as evidence (default: None)
  --annotation_filters {all,none,exon_len,intron_len,internal_stop,aa_len,splicing} [{all,none,exon_len,intron_len,internal_stop,aa_len,splicing} ...]
                        Filter annotation coding genes by the filter types specified (default: ['none'])
  --mikado_config MIKADO_CONFIG
                        Base configuration for Mikado consolidation stage. (default: None)
  --mikado_scoring MIKADO_SCORING
                        Scoring file for Mikado pick at consolidation stage. (default: None)
  --junctions JUNCTIONS
                        Validated junctions BED file for use in Mikado consolidation stage. (default: None)
  --utrs UTRS           Gene models that may provide UTR extensions to the homology based models at the mikado stage (default: None)
  --pick_extra_config PICK_EXTRA_CONFIG
                        Extra configuration for Mikado pick stage (default: None)
  --min_cdna_length MIN_CDNA_LENGTH
                        Minimum cdna length for models to consider in Mikado consolidation stage (default: 100)
  --max_intron_length MAX_INTRON_LENGTH
                        Maximum intron length for models to consider in Mikado consolidation stage (default: 1000000)
  --filter_min_cds FILTER_MIN_CDS
                        If 'aa_len' filter is enabled for annotation coding features, any CDS smaller than this parameter will be filtered out (default: 20)
  --filter_max_intron FILTER_MAX_INTRON
                        If 'intron_len' filter is enabled, any features with introns longer than this parameter will be filtered out (default: 200000)
  --filter_min_exon FILTER_MIN_EXON
                        If 'exon_len' filter is enabled, any features with exons shorter than this parameter will be filtered out (default: 20)
  --alignment_min_exon_len ALIGNMENT_MIN_EXON_LEN
                        Minimum exon length, alignment parameter (default: 20)
  --alignment_filters {all,none,exon_len,intron_len,internal_stop,aa_len,splicing} [{all,none,exon_len,intron_len,internal_stop,aa_len,splicing} ...]
                        Filter alignment results by the filter types specified (default: ['none'])
  --alignment_min_identity ALIGNMENT_MIN_IDENTITY
                        Minimum identity filter for alignments (default: 50)
  --alignment_min_coverage ALIGNMENT_MIN_COVERAGE
                        Minimum coverage filter for alignments (default: 80)
  --alignment_max_per_query ALIGNMENT_MAX_PER_QUERY
                        Maximum number of alignments per input query protein (default: 4)
  --alignment_recursion_level ALIGNMENT_RECURSION_LEVEL
                        SPALN's Q value, indicating the level of recursion for the Hirschberg algorithm (default: 6)
  --alignment_show_intron_length
                        Add an attribute to the alignment gff with the maximum intron len for each mRNA (default: False)
  --exon_f1_filter EXON_F1_FILTER
                        Filter alignments scored against its original structure with a CDS exon f1 lower than this value (default: None)
  --junction_f1_filter JUNCTION_F1_FILTER
                        Filter alignments scored against its original structure with a CDS junction f1 lower than this value (default: None)
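
As a rough illustration of how the required options above fit together, the following Python sketch assembles a `reat homology` command line and prints it. Only flags listed in the help text are used. All file names and the `SPECIES_CODE` value are hypothetical placeholders, and any top-level `reat` options that may need to precede the subcommand are not covered here.

```python
import shlex


def build_homology_command(genome, species, annotations_csv, mikado_config,
                           mikado_scoring, output_prefix="xspecies",
                           min_identity=50, min_coverage=80):
    """Assemble an argument list for `reat homology` from flags in the help text."""
    return [
        "reat", "homology",
        "--genome", genome,
        "--alignment_species", species,
        "--annotations_csv", annotations_csv,
        "--mikado_config", mikado_config,
        "--mikado_scoring", mikado_scoring,
        "--output_prefix", output_prefix,
        "--alignment_min_identity", str(min_identity),
        "--alignment_min_coverage", str(min_coverage),
    ]


# Every path below is a placeholder; SPECIES_CODE stands for a value taken
# from the first or second column of the spaln gnm2tab table.
cmd = build_homology_command(
    genome="target_genome.fa",
    species="SPECIES_CODE",
    annotations_csv="related_species.csv",
    mikado_config="mikado_config.yaml",
    mikado_scoring="mikado_scoring.yaml",
)

# Print a shell-quoted command line that can be reviewed before running it.
print(shlex.join(cmd))
```

Building the command as a list keeps each flag paired with its value and leaves quoting to `shlex.join`.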

When protein evidence is available from related species, the homology workflow can use it to generate gene models. The proteins provided through a set of related species annotations are aligned to the genome, and these alignments are evaluated to generate a score.

After the proteins from related species are aligned to the genome being annotated, the alignments are filtered using a selectable set of criteria: exon length, intron length, protein length, canonical splicing and presence of internal stop codons. The alignments that pass the filters are then evaluated.
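
The filtering step can be pictured with a minimal Python sketch. It is not REAT's implementation: the `Model` class, the `failed_filters` function and the coordinate convention are hypothetical, the thresholds reuse the documented defaults (20, 200000, 20), and `aa_len` is assumed to count amino acids. The `splicing` filter is left out because it needs the genome sequence.

```python
from dataclasses import dataclass


@dataclass
class Model:
    exons: list   # sorted (start, end) pairs, 1-based inclusive (assumed convention)
    protein: str  # translated sequence, '*' marks a stop codon


# Filters implemented in this sketch; 'splicing' is deliberately omitted.
ALL_FILTERS = {"exon_len", "intron_len", "aa_len", "internal_stop"}


def intron_lengths(exons):
    """Lengths of the gaps between consecutive exons."""
    return [nxt_start - prev_end - 1
            for (_, prev_end), (nxt_start, _) in zip(exons, exons[1:])]


def failed_filters(model, enabled, min_exon=20, max_intron=200000, min_cds=20):
    """Return the names of the enabled filters that the model fails."""
    enabled = set(enabled)
    if "none" in enabled:  # assumption: 'none' disables all filtering
        return set()
    if "all" in enabled:
        enabled = ALL_FILTERS
    residues = model.protein.rstrip("*")  # ignore the terminal stop codon
    failed = set()
    if "exon_len" in enabled and any(e - s + 1 < min_exon for s, e in model.exons):
        failed.add("exon_len")
    if "intron_len" in enabled and any(n > max_intron for n in intron_lengths(model.exons)):
        failed.add("intron_len")
    if "aa_len" in enabled and len(residues) < min_cds:
        failed.add("aa_len")
    if "internal_stop" in enabled and "*" in residues:
        failed.add("internal_stop")
    return failed


# The first exon is 11 bp long, below the default minimum of 20.
short_exon = Model(exons=[(100, 110), (500, 700)], protein="M" + "A" * 60 + "*")
print(failed_filters(short_exon, {"all"}))  # {'exon_len'}
```

A model would be kept only when the returned set is empty, so selecting fewer filter types makes the step more permissive.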

Protein alignments are evaluated in two ways: by the coherence of the alignment structure with the original model’s structure, and by the consensus structure across the multiple species. Mikado then uses these scores to group and filter models, generating a set of predicted models.

Comparing original structure vs alignment result

The following example shows two cases of the comparison between an alignment and the original structure. At the top of the image, the original gene structure matches the structure of the protein alignment. At the bottom of the image, the original structure does not match the aligned protein: the first and second exons have different lengths.

Comparison of original structure vs alignment result
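
One way to put a number on this comparison is an F1 score over CDS exons and over CDS junctions, in the spirit of the `--exon_f1_filter` and `--junction_f1_filter` options. The sketch below is an assumption about how such a score could be computed, not REAT's actual code. It describes each structure by its CDS exon lengths so that the original and aligned models share one coordinate system, and the function names and example lengths are hypothetical.

```python
from itertools import accumulate


def cds_intervals(exon_lengths):
    """Exons as (start, end) offsets along the concatenated CDS."""
    ends = list(accumulate(exon_lengths))
    starts = [0] + ends[:-1]
    return set(zip(starts, ends))


def cds_junctions(exon_lengths):
    """Offsets along the CDS at which one exon ends and the next begins."""
    return set(list(accumulate(exon_lengths))[:-1])


def f1(reference, predicted):
    """Harmonic mean of precision and recall over two sets of features."""
    if not reference and not predicted:
        return 1.0  # assumption: two single-exon models trivially agree
    true_positives = len(reference & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


original = [120, 90, 150]    # hypothetical CDS exon lengths of the original model
matching = [120, 90, 150]    # alignment that reproduces the original structure
shifted = [100, 110, 150]    # first and second exons differ in length

print(f1(cds_intervals(original), cds_intervals(matching)))  # 1.0
print(f1(cds_junctions(original), cds_junctions(shifted)))   # 0.5
```

In the mismatched case only the third exon and the second junction are shared, so the exon F1 is 1/3 and the junction F1 is 0.5. A threshold between those values and 1.0 would discard the shifted alignment.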

Cross species scoring

When comparing alignments from multiple species, we evaluate the consensus structure and provide a score corresponding to the most commonly observed structure. In the following example, a clear majority of protein alignments support a structure with five exons, which leads to an xspecies score of 70% for the models marked correct.

Comparison alignment structure from multiple species
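
The consensus idea can be sketched in Python as follows. We assume here that an alignment's score is the fraction of alignments sharing exactly the same structure. This is an illustrative simplification, not necessarily how REAT derives its xspecies score, and the ten structures below are invented to reproduce the 70% figure.

```python
from collections import Counter


def xspecies_scores(structures):
    """Score each alignment by the share of alignments with the same structure."""
    counts = Counter(structures)
    total = len(structures)
    return [counts[structure] / total for structure in structures]


# Each structure is a tuple of CDS exon lengths (hypothetical values).
five_exons = (120, 90, 150, 60, 200)
four_exons = (120, 240, 60, 200)
six_exons = (120, 90, 75, 75, 60, 200)

# Seven alignments agree on the five-exon structure, three disagree.
alignments = [five_exons] * 7 + [four_exons] * 2 + [six_exons]

scores = xspecies_scores(alignments)
print(scores[0])   # 0.7, the majority five-exon structure
print(scores[-1])  # 0.1, the lone six-exon structure
```

Representing each structure as a tuple keeps it hashable, so `Counter` can group identical structures directly.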

Configurable computational resources

"ei_homology.CombineXspecies.runtime_attr_override": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",
"ei_homology.aln_attr": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",
"ei_homology.index_attr": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",
"ei_homology.mikado_attr": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",
"ei_homology.score_attr": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)"
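
To make the listing above concrete, the following Python sketch writes out a JSON object that overrides resources for two of the listed tasks. The task names and field names come from the listing. The numeric values are illustrative guesses, and how the resulting file is passed to REAT is not shown in this section. `cpu_cores` is the only field not marked optional (`?`), so each override sets it.

```python
import json

# Task and field names are taken from the listing; all values are hypothetical.
overrides = {
    "ei_homology.aln_attr": {"cpu_cores": 8, "mem_gb": 16.0, "max_retries": 1},
    "ei_homology.mikado_attr": {"cpu_cores": 4, "mem_gb": 8.0},
}


def render_overrides(settings):
    """Serialise the overrides as indented JSON with a stable key order."""
    return json.dumps(settings, indent=4, sort_keys=True)


resources_json = render_overrides(overrides)
print(resources_json)
```

Tasks left out of the object are not overridden by this file.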
Homology workflow diagram