Multalin help page
Welcome to Multalin!
This software lets you align several biological sequences
simultaneously.
What is a multiple sequence alignment? It is the arrangement of several
protein or nucleic acid sequences with postulated gaps so that similar residues
are juxtaposed. A positive score is attached to identities, conservative or
non-conservative substitutions (the score amplitude measuring the similarity)
and a penalty to gaps; an ideal program would maximise the total score, taking
account of all possible alignments and allowing for a gap of any length at any position.
Unfortunately the computing requirements, both of time and memory, grow as the
nth power of the sequence length, where n is the number of sequences, so this ideal
alignment can be found only for two sequences or three short sequences. In the general
case, practicable programs must restrict the conditions of the optimisation. Nevertheless
it is undeniably useful to have an automatic system available for multiple sequence
alignment to provide a starting point for further manual analysis.
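For two sequences, the ideal alignment described above can be found by dynamic programming. The sketch below is only an illustration and is not MultAlin's own code: it assumes a hypothetical scoring of 1 for an identity, 0 for a mismatch and a fixed penalty per gap position, and it charges terminal gaps like internal ones.

```python
def best_pair_score(a, b, match=1, mismatch=0, gap=1):
    """Return the maximum alignment score of sequences a and b.

    Assumed scoring (not MultAlin's tables): `match` for identical
    residues, `mismatch` otherwise, and `gap` subtracted for every gap
    position, terminal gaps included.
    """
    # prev[j] is the best score of the already processed prefix of a
    # against b[:j]; it starts with the empty prefix of a.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                           prev[j] - gap,       # a[i-1] faces a gap in b
                           cur[j - 1] - gap))   # b[j-1] faces a gap in a
        prev = cur
    return prev[-1]
```

The table filled here has one cell per pair of positions. With n sequences the same approach would need an n-dimensional table, which is why the cost grows as the nth power.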
Multalin creates a multiple
sequence alignment from a group of related sequences using progressive pairwise
alignments. The method used is described in "Multiple sequence alignment with
hierarchical clustering", F. Corpet, 1988, Nucl. Acids Res. 16, 10881-10890.
Warning: No computer skills are required to use MultAlin, only basic WWW knowledge!
On the MultAlin home page you will see a large rectangle. This is where you
paste (as in cut and paste) your sequences (try a sample set of sequences the first time).
Instead of pasting your sequences, you can give the name of your sequence file,
or select it with the Browse button.
The next step is to set the parameters.
This requires only basic WWW knowledge, and you can find help by clicking on
the associated question mark. Simply use the pop-up menus or type in text or numbers
where required.
When you are ready, click on the "submit data" button (you can use either the button at the top
or the one at the bottom of the page).
Now you will have to wait for our server to calculate (this can take up to a few
hours for very large sequences).
The result will be sent back to your internet browser in the form of a GIF image (default),
plain text or a coloured HTML page.
You will be able to change the colours, font size, line size etc. and even the consensus
levels (see Presentation options for details).
The procedure is the same as for the MultAlin set-up: just use the pop-up menus and type in
text or numbers where required.
When ready, click on the "Apply Changes" button.
The new image will appear shortly after (only the image is changed; no realignment is done).
On your result page, you can add a sequence to the
alignment. This sequence will be aligned with your already aligned sequences
and you'll get a new result page, with the new sequence placed beside its most
similar sequence. For this step, MultAlin performs an optimal alignment of the
new sequence and the block of the already aligned sequences: the result can
differ from the one you would get by directly asking for an alignment of all the
sequences in the first form.
Paste your new sequence in the rectangle area in Fasta/MultAlin format (i.e.
one line beginning with '>' for the sequence name, and other lines with the
sequence itself). Click on the "Apply Changes" button when ready.
The MultAlin format is similar to Fasta. Sequences can be interrupted by spaces
or digits, which are ignored (see samples in MultAlin and pure Fasta formats).
> SeqName the sequence name is the
> first word of the first comment line
> max: 8 letters
> comment lines begin with >
AAAACCGTTAAA...
> SeqNam2 the 2nd sequence beginning
> shows the end of the first one
AAACCTGGAC...
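As a rough illustration of the rules above, the following sketch reads MultAlin/Fasta text into (name, sequence) pairs. The function name `parse_multalin` is hypothetical, and cutting names longer than 8 letters is an assumption: the page only states the limit, not how MultAlin enforces it.

```python
def parse_multalin(text):
    """Return a list of (name, sequence) pairs from MultAlin/Fasta text."""
    records = []
    name, residues, in_comments = None, [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if not in_comments:
                # First comment line after sequence data: a new record starts.
                if name is not None:
                    records.append((name, "".join(residues)))
                words = line[1:].split()
                # Name = first word of the first comment line.
                # Truncating to 8 letters is an assumption of this sketch.
                name = words[0][:8] if words else ""
                residues = []
                in_comments = True
            # Further comment lines of the same record are skipped.
        else:
            in_comments = False
            if name is not None:
                # Spaces and digits inside the sequence are ignored.
                residues.append("".join(
                    c for c in line if not c.isspace() and not c.isdigit()))
    if name is not None:
        records.append((name, "".join(residues)))
    return records
```

Sequence lines that appear before any '>' line are dropped by this sketch, since they have no name to attach to.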
GenBank format
LOCUS SeqName
any lines
ORIGIN anything
1 aggtcccttt tgtgttgttt
The sequence name is the first word after the LOCUS key-word.
The sequence begins on the line following the ORIGIN key-word.
The next sequence information begins with the LOCUS key-word.
See sample.
EMBL/Swiss-Prot format
ID SeqName
any lines
SQ anything
aauccagug gagaucaaag
any sequence lines
//
The sequence name is the first word after the ID key-word.
The sequence begins on the line following the SQ key-word.
The next sequence information begins on the line following //
See sample.
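The two keyword-based layouts above (LOCUS/ORIGIN, and ID/SQ with a closing //) follow the same pattern, so a single loop can read both. This is a minimal sketch under that reading of the rules; `parse_keyword_format` and its parameters are hypothetical names, and only letters are kept from the sequence lines.

```python
def parse_keyword_format(text, name_key, seq_key, terminator=None):
    """Return (name, sequence) pairs from keyword-delimited records.

    name_key   -- keyword followed by the sequence name ("LOCUS" or "ID")
    seq_key    -- keyword on the line just before the sequence ("ORIGIN" or "SQ")
    terminator -- line ending a record ("//"), or None when the next
                  name_key line ends it
    """
    records = []
    name, residues, in_sequence = None, [], False
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        first = words[0]
        if in_sequence:
            if terminator is not None:
                ends_record = first == terminator
            else:
                ends_record = first == name_key
            if not ends_record:
                # Keep letters only: position numbers and blanks are dropped.
                residues.append("".join(c for c in line if c.isalpha()))
                continue
            records.append((name, "".join(residues)))
            name, residues, in_sequence = None, [], False
        if first == name_key:
            name = words[1] if len(words) > 1 else ""
        elif first == seq_key and name is not None:
            in_sequence = True  # the sequence starts on the next line
    if name is not None and in_sequence:
        records.append((name, "".join(residues)))
    return records
```

For example, `parse_keyword_format(text, "LOCUS", "ORIGIN")` reads the first layout and `parse_keyword_format(text, "ID", "SQ", terminator="//")` reads the second.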
a coloured image
a GIF image is loaded like any other image. Click the image button
if you have not selected the "automatically load images" option. The GIF image that you
will see is configurable. You can change the colour of the comment text, the font size,
the background colour, the high and low consensus colours and the neutral colour.
a plain text
it is the fastest way if you have problems loading images or
large HTML pages.
a coloured html text
this HTML page uses a style sheet, so you must select
the "Enable style sheets" option of your browser. The HTML page that you will see
is configurable. You can change the background colour, the high and low consensus colours
and the neutral colour. To change the font size, use your browser Preferences.
In any case you can adjust the consensus levels.
Just underneath you will be able to see the input sequence file, the
cluster file, the alignment as plain text in Fasta or MSF format, and the alignment in MSF format
with colour indications as coded text, HTML text or a GIF image.
Any of these files can be saved to your local disk, simply using your WWW browser.
The plain texts can be viewed, edited or printed with any text editor; the HTML
page and the GIF image, with your browser or a text processor that supports these
formats.
To translate the colour indications of the coded text to true colours, you can use Microsoft Word
and the MultAlin macro (download multalin.dot by FTP
and save it to disk, even if you see odd characters in your browser) as follows:
Open your .doc file with Microsoft Word (File/Open)
Change the templates (File/Models... or Tools/Models..., Link..., search the disk to
select multalin.dot, Open)
Run MultAlin Macro (Tools/Macro..., select MultAlin, Run)
You can also add MultAlin macro to your current model (Normal.dot):
Tools/Macro..., Organizer, Close File then Open File (on the same
button), search the disk to select multalin.dot, Open, select MultAlin,
Copy >> into Normal.dot, Close
Other parameters
-
S. Henikoff and J.G. Henikoff, Amino acid substitution matrices from protein blocks, 1992, P.N.A.S. USA 89, 10915-10919.
This table is the original Blosum62 with a value of 4 added to each entry for it to be non-negative.
-
M.O. Dayhoff, R.M. Schwartz and B.C. Orcutt, Atlas of Protein Sequence and Structure,
Ed. M.O. Dayhoff, National Biomedical Research Foundation (Washington D.C. 1979).
This table is Dayhoff's PAM250 with a value of 8 added to each entry for it to be non-negative.
-
Each value is the maximum number of common bases in the corresponding amino acid codon.
-
J.L. Risler, M.O Delorme, H. Delacroix, A.Henaut, Journal of Molecular Biology, 204, 1019, 1988.
-
This table scores a match for any overlap between any IUB (International Union of Biochemistry)
nucleic acid ambiguity symbols, except X/N, as follows:
A or C = M; A or G = R; A or T = W; C or G = S; C or T = Y; G or T = K; A or C or G = V;
A or C or T = H; A or G or T = D; C or G or T = B; A or C or G or T = X or N;
These codes are compatible with the codes used by the EMBL, GenBank and PIR data libraries
and by the GCG package.
-
This table scores:
8 for a match
6 for a match with a two-base ambiguity symbol
4 for a match with a three-base ambiguity symbol
3 for a match with a four-base ambiguity symbol
where the ambiguity symbols are:
A or C = M; A or G = R; A or T = W; C or G = S; C or T = Y; G or T = K; A or C or G = V;
A or C or T = H; A or G or T = D; C or G or T = B; A or C or G or T = X or N;
These codes are compatible with the codes used by the EMBL, GenBank and PIR data libraries
and by the GCG package.
-
This table scores 1 for a match and 0 for a mismatch between any two letters.
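The overlap rule for ambiguity symbols described above can be sketched as a set intersection. The symbol definitions come from the list above; the function name `symbols_overlap` is hypothetical, and reading "except X/N" as "X and N never count as a match" is an assumption of this sketch.

```python
# Bases covered by each IUB nucleic acid symbol, as listed above.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def symbols_overlap(x, y):
    """Return True if symbols x and y share at least one base.

    Assumption: X and N are excluded, so they never produce a match.
    """
    x, y = x.upper(), y.upper()
    if x in ("X", "N") or y in ("X", "N"):
        return False
    return bool(set(IUB[x]) & set(IUB[y]))
```

For instance, A overlaps M (A or C), while M (A or C) and K (G or T) share no base.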
Personal table
You can use your own comparison table by giving its file name, or selecting it
with the Browse button. To write your own table, use the same format as the
standard MultAlin tables (see Dayhoff symbol comparison table
for format details). You can also select a comparison table from the GCG
package: in this case the table file name must end with ".cmp" (e.g. pileupdna.cmp).
This penalty is subtracted from the alignment score of 2 clusters each time a new gap
is inserted in one cluster. This penalty is length dependent: it is the sum of
"penalty at gap opening" and of "penalty at gap extension" times the gap length;
both values must be non-negative; their maximum value is 255.
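Written out, the penalty for one gap is opening + extension x length. A minimal sketch, with the hypothetical name `gap_penalty` and a range check that mirrors the limits stated above:

```python
def gap_penalty(length, opening, extension):
    """Penalty for one gap: opening + extension * length."""
    for value in (opening, extension):
        if not 0 <= value <= 255:
            raise ValueError("gap penalties must lie between 0 and 255")
    return opening + extension * length
```

With an opening penalty of 10 and an extension penalty of 2, a gap of length 3 costs 10 + 2 x 3 = 16.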
The similarity score is equal to the sum of the values of the matches (each match scored with
the scoring table) less the gap penalties. The gap penalty is charged for every internal
gap. By default, no penalty is charged for terminal gaps.
An optimal alignment is one with the maximum possible score.
It is sensitive to the symbol comparison values and to the gap penalties.
By default no penalty is charged for terminal gaps. The user can change that for
particular alignments where terminal gaps must be treated like internal
ones. Choose "beginning" to charge a gap at the sequence beginning, "end"
to charge one at the end and "both" to charge all terminal gaps.
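The scoring rules above can be put together for a pair of already aligned rows. This is a sketch, not MultAlin's implementation: the function name, the `table` dictionary keyed by residue pairs, and the value "none" for the default terminal-gap setting are assumptions; "beginning", "end" and "both" are the choices named above.

```python
def alignment_score(a, b, table, opening, extension, terminal="none"):
    """Score two aligned rows of equal length ('-' marks a gap).

    table    -- dict mapping (residue, residue) to a match value
    terminal -- "none" (assumed name for the default), "beginning",
                "end" or "both"
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    n, score, i = len(a), 0, 0
    while i < n:
        if a[i] != "-" and b[i] != "-":
            score += table[a[i], b[i]]
            i += 1
            continue
        # Find the extent of the gap run in whichever row holds the gap.
        row = a if a[i] == "-" else b
        j = i
        while j < n and row[j] == "-":
            j += 1
        at_start, at_end = i == 0, j == n
        charged = (
            (not at_start and not at_end)                      # internal gap
            or (at_start and terminal in ("beginning", "both"))
            or (at_end and terminal in ("end", "both"))
        )
        if charged:
            score -= opening + extension * (j - i)
        i = j
    return score
```

With a 1/0 identity table, an opening penalty of 2 and an extension penalty of 1, the rows `--ACGT` and `TTAC-T` score 3 - 3 = 0 by default, and 0 - 4 = -4 once the gap at the beginning is charged.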
With this option, the final alignment can be obtained more quickly, but it
may not be the best possible alignment.
For a coloured image
you can choose the text size, the
text colour, the background colour and three colours for the sequence residues
(high consensus, low consensus, neutral).
For a coloured html text
you can choose the background colour and
three colours for the sequence residues (high consensus, low consensus,
neutral). The text colour is automatically set to the neutral colour. The font
size can be set with your WWW browser preferences.
You can choose the conservation thresholds for a position to be a high or low
consensus position. A residue that is highly conserved appears in high-consensus
colour and as an uppercase letter in the consensus line. A residue that is weakly
conserved appears in low-consensus colour and as a lowercase letter
in the consensus line. Other residues appear in neutral colour. A
position with no conserved residue is represented by a dot in the consensus
line.
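The effect of the two thresholds can be illustrated with a simplified consensus line. This sketch is an assumption-laden approximation: it counts identical residues only, the default values 0.9 and 0.5 are placeholders rather than MultAlin's documented defaults, and special consensus symbols such as the % seen in the sample consensus line on this page are not modelled.

```python
from collections import Counter


def consensus_line(rows, high=0.9, low=0.5):
    """Build a consensus string from aligned rows of equal length.

    high, low -- fractions of rows that must share a residue for a
                 high or low consensus position (placeholder values)
    """
    line = []
    for column in zip(*rows):
        residues = [c.upper() for c in column if c != "-"]
        if not residues:
            line.append(".")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # Gaps count in the denominator: the fraction is over all rows.
        fraction = count / len(column)
        if fraction >= high:
            line.append(residue)          # high consensus: uppercase
        elif fraction >= low:
            line.append(residue.lower())  # low consensus: lowercase
        else:
            line.append(".")              # no conserved residue
    return "".join(line)
```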
- Normal
- In all sequences, all positions are in upper-case.
- Case
- All the positions in each sequence that are identical with the
consensus are in upper-case, the other positions are in lower-case.
CCQF2P aGDAAvGEK iakaKCtACH dlnkggpi-- -----KvGPp LFGVfGRTtG TfagYs-Ysp GytvmGqKG-
Consensus ..GDaa.GeK .fn.kC.aCH .i....gt.i .....KtGPn L%GVvgrtag t...%k.Y.e g..e.gakg.
- Difference
- The first sequence is normal; in the other sequences, a
residue identical to the first sequence residue at the same position is
represented by a dot (.), the others are in upper-case.
CCPC50 QDGDAAKGEK EFN-KCKACH MIQAPDGTDI I-KGGKTGPN LYGVVGRKIA SEEGFK-YGE GILEVAEKNP
CCRF2C ........ ...-...T.. S.I.....E. V-..A..... .......TAG TYPE..-.KD S.VALGASG-
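The "Case" and "Difference" styles above can be sketched as two small functions working on aligned rows. The function names are hypothetical, comparisons ignore letter case, and leaving gap characters unchanged follows what the samples above show.

```python
def case_style(row, consensus):
    """Upper-case where the residue equals the consensus, lower-case elsewhere."""
    return "".join(
        r.upper() if r.upper() == c.upper() else r.lower()
        for r, c in zip(row, consensus)
    )


def difference_style(first, row):
    """Dots where `row` equals the first sequence, upper-case elsewhere.

    Gap characters ('-') are kept as they are.
    """
    return "".join(
        r if r == "-" else ("." if r.upper() == f.upper() else r.upper())
        for f, r in zip(first, row)
    )
```

For example, `difference_style("EFN-KCKACH", "EFN-KCKTCH")` gives `...-...T..`.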
-
An alignment can be very large if sequences are long. If you prefer to see the
alignment by blocks, you can choose to reduce the line length. By default, it is
set to 1000 residues. For a printable page, 60 or 100 can be better (it depends
on the font size).
-
To count the positions in the alignment, a rule line gives the first and last
position in the alignment for each block. In the image, there is also a plus
sign (+) every 10 positions and in the HTML text, there is a blank position every
10 positions. You can change this graduation step to any value between 1 and the
line size.
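The line length and graduation step can be pictured with a small formatter that cuts an alignment into blocks. It is only a sketch: `format_blocks` is a hypothetical name, the exact layout of the rule line (first position at the left, last position at the right) is an assumption, and the blank every `step` positions imitates the HTML text output.

```python
def format_blocks(rows, line_size=60, step=10):
    """Return text lines showing aligned rows in blocks of `line_size`.

    Each block starts with a rule line giving its first and last
    position, followed by one line per row with a blank every `step`
    positions, then an empty separator line.
    """
    if not 1 <= step <= line_size:
        raise ValueError("step must lie between 1 and the line size")
    out = []
    length = len(rows[0]) if rows else 0
    for start in range(0, length, line_size):
        end = min(start + line_size, length)
        lines = []
        for row in rows:
            chunk = row[start:end]
            lines.append(" ".join(
                chunk[k:k + step] for k in range(0, len(chunk), step)))
        first, last = str(start + 1), str(end)
        width = len(lines[0])
        # Rule line layout is an assumption of this sketch.
        padding = " " * max(1, width - len(first) - len(last))
        out.append(first + padding + last)
        out.extend(lines)
        out.append("")
    return out
```

Calling `"\n".join(format_blocks(rows, line_size=60))` gives a printable page of 60-residue blocks.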
Florence Corpet
MultAlin's author. (Comments and suggestions very welcome)
If you use MultAlin frequently you may be interested in downloading the program.
For this you must have prior authorisation from the author. Please e-mail.
Last modified: 2000/03/21