LCR-eXXXplorer User Manual


General Description

LCR-eXXXplorer is a web-service designed to assist biologists in displaying, searching and sharing Low Complexity Regions (LCRs) contained within protein sequences. The service is built upon a modified instance of GBrowse (Stein et al., 2002) and it currently contains 545,000 sequences (taken from the UniProt/SwissProt 2014_05 release) with more than 16 million LCR-related annotations. Biased regions are detected using two of the most commonly employed algorithms, namely SEG and CAST. Along with information about sequence complexity (SEG) or compositional bias (CAST), LCR-eXXXplorer displays external annotations from UniProtKB/SwissProt, as well as predicted disordered and binding regions by utilizing IUPRED (Dosztányi et al., 2005b, 2005a) and ANCHOR (Mészáros et al., 2009; Dosztányi et al., 2009) respectively. LCR-related annotations are fully searchable through an advanced search form and results can either be displayed in the browser or downloaded in a tab-delimited format for further processing. Users may also search for data in LCR-eXXXplorer using a local installation of BLASTP. Furthermore, users can initiate external BLASTP searches against the NCBI databases, using the currently displayed sequence as the query after masking it with any combination of amino acid residue types and detection algorithms.

While creating LCR-eXXXplorer we focused on four key issues: data, visualization, search and sharing/downloads.

 

Data

Features displayed through the LCR-eXXXplorer interface can be of three major types: (i) pre-calculated features, hosted on our webserver, (ii) features calculated on-the-fly and (iii) features from external sources, such as UniProt/SwissProt, or generated by the user.
Pre-calculated features include low complexity regions detected by SEG and CAST (using default settings), as well as generic sequence-related information, such as sequence length, protein and gene name, organism and the sequence itself. These features are stored in a MySQL database for fast search and retrieval of data.
With over half a million sequences stored in LCR-eXXXplorer, it makes sense to calculate some of the less frequently requested features on-the-fly, i.e. just before displaying them to the user. Such features include the composition of each protein sequence per residue type, predicted disordered and binding regions (along with their scores) from IUPRED and ANCHOR, and GBrowse tracks that combine two or more stored features into one visually coherent piece of information (combo-tracks).
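As an illustration of the kind of feature that is cheap enough to compute on-the-fly, the per-residue composition of a sequence can be obtained in a few lines. This is a minimal sketch and not the code that LCR-eXXXplorer runs; the function name and the example sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes, in alphabetical order.
STANDARD_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"

def residue_composition(sequence):
    """Return the percentage of each standard residue type in `sequence`."""
    sequence = sequence.upper()
    total = len(sequence)
    if total == 0:
        # An empty sequence has no composition; report zeros for every type.
        return {aa: 0.0 for aa in STANDARD_RESIDUES}
    counts = Counter(sequence)
    return {aa: 100.0 * counts[aa] / total for aa in STANDARD_RESIDUES}

# Made-up sequence, used only to show the call; prints the three most
# abundant residue types.
composition = residue_composition("MPPPPLLSSA")
for aa, pct in sorted(composition.items(), key=lambda item: -item[1])[:3]:
    print(f"{aa}: {pct:.1f}%")
```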


Displaying low complexity regions in a protein sequence might not be that useful unless there is a way to associate these LCRs with other regions that are linked to functional or structural features of the specific protein. By taking advantage of the underlying GBrowse capability to display features stored on a remote web accessible server, LCR-eXXXplorer incorporates selected annotations from UniProt/SwissProt into the main browser interface. UniProt/SwissProt annotations displayed in LCR-eXXXplorer are of two major types: (i) general annotations associated with the protein sequence (e.g. protein name, GO terms, PDB accession IDs) and (ii) position specific annotations, which may include domains, sites, secondary structure etc. These annotations are fetched from UniProt/SwissProt on demand for the protein sequence of interest, facilitated by a custom-designed cgi-bin script. The retrieved features are further post-processed to a format suitable for the LCR-eXXXplorer. More details about how we handle and display UniProtKB-derived annotations are given in the following section.

 

LCR-eXXXplorer category | UniProt/SwissProt tags
Amino acid modifications | Lipidation, Glycosylation, Disulfide bond, Cross-link
Molecule processing | Chain, Propeptide, Signal peptide
Repeats | Repeat, Compositional bias
Secondary structure | Helix, Beta strand, Turn
Sites | Active site, Metal binding, Binding site, Site
Topological Regions | Topological Domain, Transmembrane, Intramembrane
Others | Domain, Region, Coiled coil, Motif, Alternative sequence
Ignored features | Initiator methionine, Non-standard residue, Modified residue, Natural variant, Mutagenesis, Sequence uncertainty, Sequence conflict, Non-adjacent residues, Non-terminal residue

UniProt/SwissProt annotations displayed in LCR-eXXXplorer and their organization in categories. More details on UniProt/SwissProt features are provided here.
Category “Ignored features” contains annotation types that are excluded from LCR-eXXXplorer.

 

Visualization

The main strength of LCR-eXXXplorer –setting it apart from similar services– is its visualization capabilities. The best way to explain the available visualization options is by describing the interface for an example protein sequence:

The above figure shows the typical LCR-eXXXplorer entry for a protein sequence containing LCRs. Starting from the top, we display the record summary in the form of a table that contains basic information regarding the example sequence, such as protein and gene names, organism, length, GO terms and links to entries of associated 3D structures, if available in the Protein Data Bank. Some of these annotations are retrieved in real-time from UniProt/SwissProt while others are stored in the LCR-eXXXplorer database. Whenever a specific annotation type exists in both sources, we first try to obtain the entry from UniProt/SwissProt and, if this fails, we fall back to the internal entry. LCR-related annotations describing properties of the full sequence are also displayed here. Such annotations include the percentage of the sequence masked by SEG and CAST, as well as the amino acid composition of the protein sequence, which is displayed below the overall sequence composition for the LCR-eXXXplorer database.

Just below the record summary lies the main browser interface, where a user can view location-specific annotations in a highly customisable manner. Annotations (or their groups) are displayed in horizontal strips, called tracks, which can be individually rearranged, modified, added or removed by the user. The very first track, called “Region”, displays an overview of the protein sequence with residues masked by SEG and CAST marked as purple and magenta rectangles respectively. The “Region” track always displays the full length of the protein sequence and a user can click and drag on it to zoom and pan on a specific region of the sequence. All the other tracks always display the part of the sequence that is highlighted in the “Region” track.

 

Default tracks

By default LCR-eXXXplorer displays the following tracks to a user:

 

Optional tracks

Besides the 5 default tracks, an LCR-eXXXplorer user has the option to display masked residues of a specific type as a separate track. These tracks are available in the track selection tab of LCR-eXXXplorer under the category “Masked residues per type”. Tracks representing LCRs rich in a specific amino acid residue type are available under the category “Low Complexity Regions”. These are different from masked residues, since a region rich in (e.g.) Proline most of the time contains other residue types as well. These tracks are generated using the output files produced by CAST when called with the –stat or –tab parameters. In summary, LCR-eXXXplorer provides 75 different types of annotations as tracks, plus the option to display an unlimited number of custom, user-generated tracks as described in section Uploading Your Own and 3rd Party Annotations below.

Within the track table users can enable or disable any combination of the available annotations as well as uploaded tracks. Furthermore users can:

 

Simple search

To select a protein sequence to view, enter its name in the text field labeled UniProtKB Acc/Name or Gene name.
Recognized types of search terms are the following:

a UniProtKB Accession or Entry Name
Whenever a new version of UniProtKB/SwissProt is loaded in LCR-eXXXplorer we also keep the associated Accession and Entry Names. Click here for more information about Accession IDs and here for Entry Names.

a gene name
You can enter the name of a gene using the standard nomenclature for that species.

You may use the wildcard character "*" if you are unsure of the protein name. If the requested protein is not found, the browser will display a message to this effect.

LCR-eXXXplorer offers autocompletion; a menu of possible choices will pop up as you type in gene names and other search terms.

 

Advanced search

If you want to search using additional fields enable Advanced Search by selecting it from the dropdown menu named Plugins and pressing Configure.... This way you can search by sequence properties such as length and the percentage of the sequence masked by CAST and/or SEG. You can also retrieve sequences with specific compositional bias type(s).

The first two fields (Accession, Gene name) are similar to those available in simple search. The added functionality here is syntax checking (e.g. gene names must contain only letters, digits and “-”), targeted search against specific fields, and the ability to use the more advanced SQL-compatible wildcards for some of the fields.
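The gene-name rule quoted above, and the relation between the simple-search "*" wildcard and SQL-style wildcards, can be sketched as follows. This is an illustration only: both function names are hypothetical, and the translation of "*" into the SQL LIKE wildcard "%" is an assumption about how such a search could be implemented, not a description of the LCR-eXXXplorer server code.

```python
import re

# Letters, digits and "-" only, as stated for gene names in the text above.
GENE_NAME_RE = re.compile(r"[A-Za-z0-9-]+")

def is_valid_gene_name(name):
    """True if `name` is non-empty and contains only letters, digits and '-'."""
    return GENE_NAME_RE.fullmatch(name) is not None

def star_to_sql_like(term):
    """Translate a '*' wildcard term into an SQL LIKE pattern (assumed mapping).

    Literal backslash, '%' and '_' characters are escaped with a backslash so
    that they are not read as SQL wildcards; this assumes the query declares
    the backslash as its LIKE escape character.
    """
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escaped.replace("*", "%")

print(is_valid_gene_name("HLA-A"))   # accepted: letters and a hyphen
print(is_valid_gene_name("BRCA 1"))  # rejected: contains a space
print(star_to_sql_like("INS*"))
```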

Importantly, LCR-eXXXplorer allows a user to retrieve sequences with specific compositional bias type(s): using a multiple selection list, a user can search for protein sequences locally enriched in one or more residue types. These criteria can be combined with the sequence properties mentioned above, so complex search scenarios such as “find all human protein sequences between 200 and 800 residues long, rich in Proline and Leucine according to CAST and in Serine according to SEG” are also possible.

 

Search within UniProtKB/SwissProt Accessions

One limitation of the current Advanced Search implementation in LCR-eXXXplorer is that it doesn’t allow for full Boolean searches, since it only returns results matching all search criteria. It also doesn’t support searching based on other sequence properties such as the presence of specific domains. To compensate for these, LCR-eXXXplorer allows a user to input/upload a list of UniProt/SwissProt Accessions and use them as search criteria. This way one can use other, more powerful, search engines –such as those available in UniProt/SwissProt or InterPro– and then perform LCR-specific searches within the results using LCR-eXXXplorer.

 

Viewing the Search results

The Overview and Detail Panels

If a single protein is found in the database, the browser will automatically display it. The protein is displayed in two graphical panels:

Protein Map
This panel displays the whole protein sequence context, as well as an overview of the detected LCRs.
A highlighted rectangle indicates the region of the protein sequence that is displayed in the details panel.

Details panel
This panel displays a zoomed-in view of the sequence corresponding to the selection rectangle in the protein map. The details panel consists of one or more tracks showing annotations and other features that have been placed on the sequence. The details panel is described at length later.

 

Viewing a Precise Region around a Protein

You can view a precise region of a protein sequence using the notation UniProtKBAcc:start..stop, where start and stop are the start and stop positions of the region. The beginning of the feature is position 1.
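For readers who generate such region strings programmatically, the notation can be checked with a small parser. This is a minimal sketch: the function name is hypothetical, the accepted characters for the identifier are an assumption, and rejecting a stop position smaller than the start is a choice made here rather than documented browser behaviour.

```python
import re

# <identifier>:<start>..<stop>, with 1-based positions as described above.
REGION_RE = re.compile(r"(?P<acc>[A-Za-z0-9_]+):(?P<start>\d+)\.\.(?P<stop>\d+)")

def parse_region(text):
    """Split 'Acc:start..stop' into (accession, start, stop)."""
    match = REGION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not in Acc:start..stop notation: {text!r}")
    start, stop = int(match["start"]), int(match["stop"])
    if start < 1:
        raise ValueError("positions are 1-based, so start must be >= 1")
    if stop < start:
        raise ValueError("stop must not be smaller than start")
    return match["acc"], start, stop

print(parse_region("P79734:150..250"))
```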

 

Dealing with Multiple Hits

If more than one protein is found, then the browser will summarize the various hits in a table. You can click on one of the sequences in the table in order to view it in the detail view:

You can click on the header of any sortable column to sort the results. Use the filter fields to further refine your search. You can utilize multiple filters at the same time.
You can also create powerful filters by using Regular Expressions. Click here for a primer on Regular Expressions.

Keep in mind that filtering and sorting thousands of results may take a few seconds. Please be patient :)

 

BLASTP search

BLAST against LCR-eXXXplorer database

LCR-eXXXplorer supports BLASTP searches in two modes, internal and external. In internal mode a user can BLAST the LCR-eXXXplorer database for similar sequences. Results are displayed in a simple and intuitive interface. Clicking on a single result redirects the user to the main browser interface, where the respective high-scoring segment pair (HSP) is added as a new track. What differentiates BLAST searches on LCR-eXXXplorer from a similar search in UniProt/SwissProt is the ability to search against masked databases (either with SEG or CAST) and with settings that are fine-tuned for rejecting spurious hits.

BLAST against NCBI databases

In external mode a user can use any sequence (or part of it) stored in the LCR-eXXXplorer as a query sequence for a BLASTP search on the NCBI servers. This capability is provided through a novel plugin specifically developed for LCR-eXXXplorer. To take full advantage of this functionality the user should first enable the desired annotation tracks in the tracks browser and select the part (or the whole length) of the sequence that’s going to be used as the query sequence for BLASTP.
Selecting Submit to NCBI BLASTP in the plugins drop-down list and pressing Configure... will reveal the available options for external BLASTP searches.

In the top section the user may soft mask the protein sequence using any of the enabled annotation tracks (only tracks showing internal LCR-eXXXplorer annotations are available). Tracks can be combined freely to create a soft mask that matches the user’s needs (e.g. mask all positively charged [Arg, His, Lys] residues within CAST-detected LCRs and/or all negatively charged [Asp, Glu] residues masked by SEG).
The option to invert the applied soft mask is also available, essentially allowing the user to mask out all the non-LCR residues. Doing so will enable searching against the NCBI databases for sequences having compositional bias patterns similar to the currently loaded sequence in LCR-eXXXplorer. To our knowledge, this approach to BLAST searching has not been documented before, and we have not yet experimented with it in detail. For this type of search to work, though, one should disable all types of LCR handling in NCBI BLASTP as discussed in the next paragraph.
In the lower section of the panel a series of LCR-related BLASTP configuration options are available. These options are also available at the NCBI web-BLAST interface but are hidden by default. The exposed configuration settings are:

 

Once a region is displayed, you can navigate through it in a number of ways:

Click and drag on any of the rulers
Any of the rulers that show residue position, including those in the protein map and detail panels, are clickable. Click once to recenter the view at the desired position, or click and drag to select a region.

Scroll left or right with the scroll buttons
These buttons, which appear in the "Scroll/Zoom" section of the screen, and at the top right and left corners of the detail panel, will scroll the detail panel to the left or right. One pair of buttons scrolls an entire screen's worth, while the other pair scrolls half a screen.

Zoom in or out using the Show XXX aa menu.
Use the menu that appears in the center of the "Scroll/Zoom" section to change the zoom level. The menu item name indicates the number of amino acid residues (aa) to show in the detail panel. For example, selecting the item "100 aa" will zoom the detail panel so as to show a region 100 aa wide.

Make fine adjustments to the zoom level using the zoom buttons.
Press the zoom-in and zoom-out buttons to change the zoom level by small increments (usually 10-20%, depending on how the browser is configured).

Flip the orientation by selecting the "flip" checkbox
To flip the browser display so that the N-terminus is on the right, select the "flip" checkbox.

Recenter the protein map panel by clicking on its scale
The scale at the top of the protein map panel is live. Clicking on it will recenter the detail panel around the location you clicked. This is a fast and easy way to make fine adjustments in the displayed region.

Mouse over a feature to get more information about it
Mousing over a feature will usually bring up a text bubble that contains more identifying information about the feature.

Click on a feature to see its details
Clicking on a feature in the details view will display a popup box with additional data or link to a page that displays detailed information on it. This could be a page on the browser's web site, or a page on another resource somewhere on the web.

 

The detailed view is composed of a number of distinct tracks which stretch horizontally from one end of the display to another. Each track corresponds to a different type of sequence feature, and has a distinctive graphical shape and color.

Each track has a descriptive name printed above it, and a series of small control icons to the left and right of the name.

 

Download and Export data and visuals

An important aspect of LCR-eXXXplorer is that it provides users with the option to share and/or download LCR-related annotations for further processing. This can be achieved in several ways, some of which are available on all GBrowse-based web-services, while others are unique to LCR-eXXXplorer. The simplest way to download all the detected LCRs of a protein in a Generic Feature Format (GFF) version 3 compliant file is by using the menu option “File > Export as… > …GFF annotation table”.

The same menu section also offers the option to generate clean versions of the tracks browser in Portable Network Graphics (PNG) and Encapsulated PostScript (EPS) formats for use in publications. Downloads in FASTA, GFF3 & GenBank (actually GenPept) format for all stored annotations or for each track separately are also available through the “Download this track” command, available at the header of each track. Similarly, the option to “Share” each track with other GBrowse-enabled web-services is also available from the track header.
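A downloaded GFF3 file can be post-processed with very little code. The sketch below relies only on the standard nine-column, tab-separated GFF3 layout; the function name is hypothetical, and the source and type values in the example line are made up, since the exact values written by LCR-eXXXplorer are not listed in this manual.

```python
from urllib.parse import unquote

# The nine columns defined by the GFF3 format, in order.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3(lines):
    """Parse GFF3 feature lines into a list of dictionaries."""
    features = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence data
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        fields = line.split("\t")
        if len(fields) != len(GFF3_COLUMNS):
            raise ValueError(f"expected 9 tab-separated fields: {line!r}")
        record = dict(zip(GFF3_COLUMNS, fields))
        record["start"] = int(record["start"])  # 1-based, inclusive
        record["end"] = int(record["end"])      # 1-based, inclusive
        attributes = {}
        for pair in record["attributes"].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[unquote(key)] = unquote(value)
        record["attributes"] = attributes
        features.append(record)
    return features

# Made-up example content; real exports will use different values.
example = [
    "##gff-version 3",
    "P79734\tCAST\tlow_complexity\t150\t250\t.\t.\t.\tName=example_LCR",
]
for feature in parse_gff3(example):
    print(feature["seqid"], feature["start"], feature["end"])
```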

Using either the Download decorated FASTA file or the Download masked FASTA file plugins, a user has the option to download a FASTA file containing a custom selection of annotations. The first plugin can overlay any of the currently displayed tracks on top of the standard FASTA sequence with lower-case characters; depending on the output format, annotations can be represented as capital letters or with any combination of italics, bold, underlined and/or coloured characters. The Download masked FASTA file plugin was designed and built specifically for LCR-eXXXplorer; it allows the user to download any combination of LCR-related annotations as a (soft or hard) masked FASTA file that can be used as input in other bioinformatics tools. Besides letting the user apply any of the annotation tracks as either a soft or hard mask, it also offers the option to invert the mask. Such capability can help when searching for sequences having similar types of low complexity regions.

Download (A) decorated & (B) masked FASTA files.
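The difference between soft, hard and inverted masks can be sketched as follows. The sketch assumes the common conventions that a soft mask lower-cases the selected residues and a hard mask replaces them with "X"; the function name, its parameters and the example sequence are hypothetical, and the exact output of the plugin is not specified in this manual.

```python
def mask_sequence(sequence, regions, residues=None,
                  hard=False, invert=False, mask_char="X"):
    """Mask `sequence` inside `regions`.

    regions  -- iterable of (start, stop) pairs, 1-based and inclusive
    residues -- optional set of one-letter codes; when given, only these
                residue types are selected within the regions
    hard     -- replace selected residues with `mask_char` instead of
                lower-casing them (soft mask)
    invert   -- mask everything that was NOT selected
    """
    sequence = sequence.upper()
    selected = [False] * len(sequence)
    for start, stop in regions:
        # Clamp each region to the sequence and convert to 0-based indexes.
        for index in range(max(start, 1) - 1, min(stop, len(sequence))):
            if residues is None or sequence[index] in residues:
                selected[index] = True
    if invert:
        selected = [not flag for flag in selected]
    masked = []
    for residue, flag in zip(sequence, selected):
        if not flag:
            masked.append(residue)
        elif hard:
            masked.append(mask_char)
        else:
            masked.append(residue.lower())
    return "".join(masked)

seq = "MKKRAAAADE"  # made-up sequence
print(mask_sequence(seq, [(2, 4)]))                  # soft mask
print(mask_sequence(seq, [(2, 4)], hard=True))       # hard mask
print(mask_sequence(seq, [(2, 4)], invert=True))     # inverted soft mask
print(mask_sequence(seq, [(2, 4)], residues={"K"}))  # only Lys in the region
```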

Another way to download files from LCR-eXXXplorer is through the Advanced Search plugin. Selecting “Download CSV file” will write the search results to a tab-delimited text file. The text file contains 47 columns with annotations such as Accession, Name, Description, length and the portion of the sequence masked by SEG or CAST. For each of the matching proteins, counts of masked amino acid residues grouped by type and algorithm are also provided. Using this mechanism, users can download such files for all the sequences matching specific criteria (e.g. all human protein sequences) and post-process them using other tools. Finally, a shortcut to download (or browse) all the sequences of a specific organism is also provided inside the record summary table for every sequence stored in LCR-eXXXplorer.

Column # | Header(s) | Description(s)
1 | load_id | UniProtKB Accession
2 | gene | Gene name
3 | Note | Protein name
4 | Organism | Organism name
5 | length | Protein sequence length
6 | perCAST | Percentage of protein sequence length masked by CAST
7 | perSEG | Percentage of protein sequence length masked by SEG
8 - 27 | cX* | Counts of residues of type X masked by CAST
28 - 47 | sX* | Counts of residues of type X masked by SEG

Description of the fields (columns) in the CSV files available for download in LCR-eXXXplorer.
* All of the 20 standard amino-acid types in ascending lexicographical order.
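As an example of post-processing, the downloaded file can be filtered with Python's csv module. The sketch below makes several assumptions that should be checked against a real download: that the file starts with a header row, that the count columns are named "c" or "s" followed by the one-letter residue code (e.g. cP, sS), and that "rich in" can be approximated by a non-zero count of masked residues of that type. The function names and the demo rows are hypothetical.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed header: the 7 named columns plus 20 CAST and 20 SEG count columns.
HEADER = (["load_id", "gene", "Note", "Organism", "length", "perCAST", "perSEG"]
          + ["c" + aa for aa in AMINO_ACIDS]
          + ["s" + aa for aa in AMINO_ACIDS])

def find_matches(handle, min_len, max_len, cast_rich="", seg_rich="", organism=None):
    """Return accessions whose rows satisfy all of the given criteria."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if not min_len <= int(row["length"]) <= max_len:
            continue
        if organism is not None and organism.lower() not in row["Organism"].lower():
            continue
        # "Rich in X" is approximated as: at least one masked residue of type X.
        if any(int(row["c" + aa]) == 0 for aa in cast_rich):
            continue
        if any(int(row["s" + aa]) == 0 for aa in seg_rich):
            continue
        hits.append(row["load_id"])
    return hits

def demo_row(accession, organism, length, cast=None, seg=None):
    """Build one made-up row; counts that are not given default to 0."""
    cast, seg = cast or {}, seg or {}
    row = {"load_id": accession, "gene": "-", "Note": "-", "Organism": organism,
           "length": length, "perCAST": 0, "perSEG": 0}
    row.update({"c" + aa: cast.get(aa, 0) for aa in AMINO_ACIDS})
    row.update({"s" + aa: seg.get(aa, 0) for aa in AMINO_ACIDS})
    return row

def demo_table():
    """Return a file-like object holding a tiny made-up table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADER, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerow(demo_row("Q00001", "Homo sapiens", 450, {"P": 30, "L": 12}, {"S": 9}))
    writer.writerow(demo_row("Q00002", "Homo sapiens", 950, {"P": 30, "L": 12}, {"S": 9}))
    writer.writerow(demo_row("Q00003", "Mus musculus", 300, {"P": 5}))
    buffer.seek(0)
    return buffer

print(find_matches(demo_table(), 200, 800,
                   cast_rich="PL", seg_rich="S", organism="Homo sapiens"))
```

For a real download, open the file with open(path, newline="") and pass the handle to find_matches in place of demo_table().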

 

Uploading Your Own and 3rd Party Annotations

LCR-eXXXplorer supports uploading your own data set tracks. You can then share these tracks with other individuals and groups.

Entering New Data Sets

To upload a custom track to LCR-eXXXplorer, click on the link labeled "Custom Tracks" located towards the top of the screen and select "Add custom tracks: [From Text]". This will open a new text editing area. Type in one or more lines of sequence feature information following the example in the screenshot below. The simplest upload format is called BED, and takes three space-separated fields:

<UniProtKBAcc> <start> <end>

Where <UniProtKBAcc> is the UniProtKB Accession of the protein that the feature lives on, and <start> and <end> indicate the range that the feature spans. (The BED format uses zero-based coordinates, so the first amino acid of the sequence is actually position 0. Other formats accepted by LCR-eXXXplorer use the more familiar one-based coordinates.)
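Because positions elsewhere in LCR-eXXXplorer are one-based, converting a region to a BED line is an easy place to introduce off-by-one errors. The sketch below assumes the standard BED convention, in which the start is zero-based and the end is exclusive, so only the start is shifted; the function name is hypothetical and the result should be verified on a small test upload.

```python
def to_bed_line(accession, start, stop):
    """Convert a 1-based, inclusive region into a three-field BED line."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    # BED start is 0-based; BED end is exclusive, so it equals the
    # 1-based inclusive stop position.
    return f"{accession} {start - 1} {stop}"

print(to_bed_line("P79734", 150, 250))
```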

Press the "Upload" button. The indicated features will be uploaded to LCR-eXXXplorer and information about the upload set will be displayed in a new section similar to this one:

To view your uploaded annotations in a new browser track, navigate to the Browser tab. Note that you may need to navigate to the region of the sequence containing the annotations in order to see them. To edit the annotations, click the "[edit]" link to the right of the name of the upload file (in this example "upload_19442").

To customize the appearance of the track, press the "[edit]" link to the right of the "Configuration" line. Note that this involves editing a configuration file that has many options. You can perform all basic customization tasks such as changing the height and color of your uploaded features, much more conveniently by clicking the icon next to the track itself.

To add a description to the uploaded set, click on the line that reads "Click to add a description" and type in a one-line description of the data. Press the escape key to undo the changes. Press return or mouse out of the description area to accept the change.

To delete the uploaded annotation data completely, click the trash can icon.

To share an uploaded track with another person, click on the icon.

 

Uploading Files

For anything more than a short list of annotations, you will want to prepare your custom track(s) in a file on your local machine and upload the file to LCR-eXXXplorer. This is also mandatory for file formats that require fields to be separated by tabs rather than spaces. Use a text editor to prepare the track definitions and data and be sure to save as text only. Now click on the "Add custom tracks: [From a file]" link in the upload area, and follow the prompts to upload the file.

You can upload or type in as many annotation files as you desire. If you upload a new file with exactly the same name as an old one, LCR-eXXXplorer will rename the new file by appending a "-1" to its end to avoid your inadvertently overwriting the old file. To replace an existing file, delete the old one before you upload the new one.

Annotation File Formats

LCR-eXXXplorer accepts custom track data in a variety of formats including BED, GFF, and GFF3. For dense quantitative data use Wiggle (WIG) format.

GBrowse also accepts a simple internal format called feature file format (FFF). The main advantage of this format is that it allows for extensive customization of the appearance of each track and feature.

In most cases, you can define multiple tracks in a single file by creating several "track" definition lines in BED or WIG format files, or by defining several track configuration sections in FFF format files. See the documentation of each format for assistance.

 

Importing tracks from Remote Sources

LCR-eXXXplorer is based on GBrowse and thus it allows you to import track data from a variety of remote locations, including:

  1. A track from another GBrowse instance that shares the same sequence coordinates.
  2. A track definition file that someone has placed on an internet-accessible web or FTP server.

To import a track from a remote source, you must know its URL.

To import a track from another GBrowse instance, click on the radio wave icon in the track title and then copy the URL that appears there. Go to the Custom Tracks panel, and click on the "Add custom tracks: [From a file]" link. This will open up a new text field. Paste in the copied URL and press Import.

If the remote track was successfully imported, a description of the remote track similar to the previous section's listing of uploaded tracks will appear.

It should be noted here that LCR-eXXXplorer, although based on GBrowse, has received extensive modifications to support displaying protein instead of genomic sequences. For that reason, most of the tracks available in other GBrowse instances cannot be directly imported into LCR-eXXXplorer because of the different coordinate system used.

To import a track from a Web or FTP file, make sure that the web or FTP server is accessible from the internet, and that the file is in one of the upload formats recognized by GBrowse (e.g. BED or GFF3). Then click the "Add custom tracks: [From a file]" link and paste in the annotation file's URL as before.

You may add as many remote tracks as you wish. To delete one, simply click on the trash icon next to its name in the remote track listing.

 

Quick Uploads

A quick way to upload one or two features is by adding an "add" argument to the LCR-eXXXplorer URL. For example, this URL will create a track containing a feature that occupies the region on protein sequence P79734 between residues 150 and 250:

http://repeat.biol.ucy.ac.cy/fgb2/gbrowse/swissprot/?add=P79734:150..250

or a multipart feature with segments at 150..250 and 280..300:

http://repeat.biol.ucy.ac.cy/fgb2/gbrowse/swissprot/?add=P79734:150..250,280..300

If you are not already viewing the protein you want to add the features to, then add name=reference to the URL, where "reference" is the protein Accession:

http://repeat.biol.ucy.ac.cy/fgb2/gbrowse/swissprot/?add=P79734:150..250,280..300;name=P79734

You can have as many "add" options as you like:

http://repeat.biol.ucy.ac.cy/fgb2/gbrowse/swissprot/?add=P79734:150..250,280..300;add=P79734:1..156

To label the track and the features within the track, you may use the extended format shown here:

reference+trackName+featureName+start..stop,start..stop,start..stop ...

where "reference" is the UniProtKB Accession of the protein, "trackName" is the name of the new track you wish to place the feature in, and "featureName" is the name of the feature you are uploading. As before, the list of "start..stop" coordinates are the start and end coordinates relative to "reference". For multisegmented features, specify multiple start..stop pairs separated by commas.

http://repeat.biol.ucy.ac.cy/fgb2/gbrowse/swissprot/?add=P79734+MyLCRs+LCRs+150..250,280..300
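When many such links need to be generated (for example from a spreadsheet of regions), the URL patterns shown above can be assembled with a small helper. This is a sketch only: the function name and its parameters are hypothetical, and it simply reproduces the formats given in this section without escaping special characters, so track and feature names should be kept to plain letters and digits.

```python
BASE_URL = "http://repeat.biol.ucy.ac.cy/fgb2/gbrowse/swissprot/"

def quick_upload_url(accession, segments, track_name=None,
                     feature_name=None, view=None, base_url=BASE_URL):
    """Build a quick-upload URL for one (possibly multi-segment) feature.

    segments -- list of (start, stop) pairs relative to `accession`
    view     -- optional accession passed as name=... to select the protein
    """
    if (track_name is None) != (feature_name is None):
        raise ValueError("give both track_name and feature_name, or neither")
    coords = ",".join(f"{start}..{stop}" for start, stop in segments)
    if track_name is None:
        add = f"{accession}:{coords}"
    else:
        # Extended format: reference+trackName+featureName+start..stop,...
        add = f"{accession}+{track_name}+{feature_name}+{coords}"
    url = f"{base_url}?add={add}"
    if view is not None:
        url += f";name={view}"
    return url

print(quick_upload_url("P79734", [(150, 250), (280, 300)]))
print(quick_upload_url("P79734", [(150, 250), (280, 300)],
                       track_name="MyLCRs", feature_name="LCRs"))
```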

See the uploaded annotation help file for more details.

 

References

Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., et al. (2005) Protein Database Searches Using Compositionally Adjusted Substitution Matrices. FEBS J. 272 (20), 5101–5109.
Dosztányi, Z., Csizmók, V., Tompa, P. & Simon, I. (2005a) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 21 (16), 3433–3434.
Dosztányi, Z., Csizmók, V., Tompa, P. & Simon, I. (2005b) The Pairwise Energy Content Estimated from Amino Acid Composition Discriminates between Folded and Intrinsically Unstructured Proteins. Journal of Molecular Biology. 347 (4), 827–839.
Dosztányi, Z., Mészáros, B. & Simon, I. (2009) ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics. 25 (20), 2745–2746.
Dunker, A.K., Lawson, J.D., Brown, C.J., Williams, R.M., et al. (2001) Intrinsically disordered protein. Journal of Molecular Graphics and Modelling. 19 (1), 26–59.
Mészáros, B., Simon, I. & Dosztányi, Z. (2009) Prediction of Protein Binding Regions in Disordered Proteins. PLoS Comput Biol. 5 (5), e1000376.
Romero, P., Obradovic, Z., Li, X., Garner, E.C., et al. (2001) Sequence complexity of disordered protein. Proteins. 42 (1), 38–48.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29 (14), 2994–3005.
Stein, L.D., Mungall, C., Shu, S., Caudy, M., et al. (2002) The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Res. 12 (10), 1599–1610.
Yu, Y.-K. & Altschul, S.F. (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics. 21 (7), 902–911.
Yu, Y.-K., Wootton, J.C. & Altschul, S.F. (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A. 100 (26), 15688–15693.


LCR-eXXXplorer User Manual contains sections adapted from the General Help document of the Generic Genome Browser Software.
(last updated January 2015).

The LCR-eXXXplorer development team: Ioannis Kirmitzoglou and Vasilis J. Promponas
(c) 2006-2014 Bioinformatics Research Laboratory, University of Cyprus