
    Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

    Author
    Neuwald, A.F.
    Lanczycki, C.J.
    Hodges, T.K.
    Date
    2020
    Journal
    Database: The Journal of Biological Databases and Curation
    Publisher
    Oxford University Press
    Type
    Article
    
    See at
    https://doi.org/10.1093/database/baaa042
    Abstract
    For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may tend to misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/. Copyright The Author(s) 2020.
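    The hierarchical arrangement described in the abstract can be pictured with a small Python sketch. This is an illustration only, not the MAPGAPS file format or algorithm: the `SubgroupMSA` class, the `to_parent` column map standing in for a template MSA, and the `project_to_root` helper are all hypothetical names introduced here. The sketch also assumes, as a simplification, that subgroup-specific insertion columns are simply dropped when a row is lifted to a higher level.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

GAP = "-"


@dataclass
class SubgroupMSA:
    """One node of a hypothetical hierarchical MSA (illustrative only)."""
    name: str
    rows: Dict[str, str]  # sequence id -> aligned row; all rows share one width
    parent: Optional["SubgroupMSA"] = None
    # Stand-in for a template MSA: for each column of this node, the parent
    # column it maps to, or None for a subgroup-specific insertion.
    to_parent: List[Optional[int]] = field(default_factory=list)

    @property
    def width(self) -> int:
        return len(next(iter(self.rows.values())))


def project_to_root(node: SubgroupMSA, row: str) -> str:
    """Re-express a row aligned to `node` in the root's column coordinates."""
    while node.parent is not None:
        lifted = [GAP] * node.parent.width
        for col, residue in enumerate(row):
            target = node.to_parent[col]
            if target is not None:  # assumption: insertions are dropped
                lifted[target] = residue
        row = "".join(lifted)
        node = node.parent
    return row


def merged_alignment(nodes: List[SubgroupMSA]) -> Dict[str, str]:
    """Lift every row of every node into a single root-level alignment."""
    return {
        seq_id: project_to_root(node, row)
        for node in nodes
        for seq_id, row in node.rows.items()
    }


# Toy data, invented for illustration.
root = SubgroupMSA("root", {"r1": "ACDEF", "r2": "AC-EF"})
child = SubgroupMSA(
    "child", {"c1": "AKCF"}, parent=root, to_parent=[0, None, 1, 4]
)
print(merged_alignment([root, child])["c1"])  # AC--F
```

    Storing only a per-level column map keeps each subgroup alignment independent of its siblings, which is one plausible reason a hierarchy of this kind can scale to very large numbers of sequences.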
    Keyword
    multiple sequence alignment
    hierarchical alignments
    Sequence Analysis, Protein
    Data Curation
    Big Data
    Identifier to cite or link to this item
    https://www.scopus.com/inward/record.uri?eid=2-s2.0-85086008674&doi=10.1093%2fdatabase%2fbaaa042&partnerID=40&md5=ec513dca4afcdec7c833830670c67cb6; http://hdl.handle.net/10713/13107
    10.1093/database/baaa042
    Collections
    UMB Open Access Articles 2020
