Downloading Stanford's Current Assembly of Candida albicans Sequence
The completed and annotated sequence has been published in Proc Natl Acad Sci U S A. 2004 May 11;101(19):7329-34. Epub 2004.
This web site is no longer supported. A copy of the current data has been handed over to CGD.
Our most recent assemblies of Candida albicans are based on 10.4X sequence coverage.
The genome of Candida albicans is diploid. No haploid form of the organism is known to exist. Standard sequence assembly software does not recognize the possibility of diploidy, and when confronted with sufficiently different alleles, has no option but to assemble them into separate contigs. In our final standard assembly (assembly 6), heterozygosity resulted in a great deal of fragmentation in the assembled sequence.
As of May 2002, we have completed assembly 19, a reconstruction of the diploid genome of Candida albicans. Assemblies 7-18 were produced during the development of our software and methods for assembly of the diploid genome. Assembly of diploid whole-genome shotgun sequence, at least in an organism with the degree of divergence between alleles observed in Candida, cannot be regarded as a routine task at this time. Consequently the early diploid assemblies (7-18) were inferior and sometimes quite incorrect, depending on the success of the techniques being tested. Assembly 19 is therefore our first public release of diploid sequence.
Contigs in our Candida assemblies are numbered for identification purposes. Contig names contain the assembly number in the form Contig(Assembly#)-(Contig#), e.g., Contig6-1511. The numbering of contigs within assembly 19 is somewhat complex because there are two sequences for most of the contigs, representing the two alleles. Please see the release notes below for details.
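The naming scheme above can be split into its two numbers mechanically. The following is a minimal sketch, assuming names always follow the exact form Contig(Assembly#)-(Contig#); the function name parse_contig_name is illustrative and not part of any project software.

```python
import re

# Names have the form Contig(Assembly#)-(Contig#), e.g. Contig6-1511.
CONTIG_NAME = re.compile(r"^Contig(\d+)-(\d+)$")

def parse_contig_name(name):
    """Return (assembly_number, contig_number) as integers.

    Hypothetical helper: raises ValueError for names that do not
    follow the documented pattern.
    """
    match = CONTIG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a contig name: {name!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_contig_name("Contig6-1511"))
```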
The contigs, trimmed for quality, along with ORF sequences and their translations, are available by anonymous ftp on the project's ftp site.
Contigs from assemblies 4 and 5 are still available on the ftp site.
Assembly 6 is expected to be the final assembly of Candida albicans sequence data starting from the individual reads. After preprocessing steps to remove most highly repeated sequences, a total of 313,165 reads were assembled into 1213 contigs of 2kb or greater. These data represent 10.4X mean coverage assuming a haploid genome size of 15.5MB (excepting repeats such as the rDNA, which assemble as a single copy). The contigs total 17.4MB, exceeding the genome size because heterozygous regions assemble separately. The mtDNA is not included in the assembly as it must be translated with a different genetic code.
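As a rough check on these figures, the coverage and genome size together imply a total amount of assembled read sequence, and dividing by the read count gives an implied mean read length of roughly 515 bases. The sketch below only rearranges the numbers quoted above; it assumes "MB" means 10^6 bases, and the implied read length is a derived estimate, not a figure reported by the project.

```python
# Figures quoted in the release notes for assembly 6.
reads = 313_165        # reads entering the assembly
coverage = 10.4        # mean coverage (X)
genome_bp = 15.5e6     # assumed haploid genome size, taking MB as 10**6 bases
contigs_bp = 17.4e6    # summed length of the assembled contigs

# Coverage = total read bases / genome size, so:
sequenced_bp = coverage * genome_bp
mean_read_bp = sequenced_bp / reads   # implied, not reported

# Extra contig length attributed to heterozygous regions assembling separately.
excess_bp = contigs_bp - genome_bp

print(f"total read sequence: {sequenced_bp / 1e6:.1f} Mb")
print(f"implied mean read length: {mean_read_bp:.0f} bases")
print(f"contig length beyond genome size: {excess_bp / 1e6:.1f} Mb")
```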
Translation of the assembly resulted in 9168 ORFs capable of encoding proteins 100aa or greater in length, including ambiguities. In general, the ORFs contained a start and a stop codon. ORFs extending to the end of a contig but lacking a stop codon were included because, except in rare cases where the contig lies at a chromosome end, they will eventually reach one. Reading frames that remain open upstream to the beginning of a contig were divided into two classes. Those that contained an ORF of 100aa or greater with a start codon within them are represented as the smaller ORF with the start, because upstream sequences are more likely to encounter a stop before a start. For completeness, those lacking an internal 100aa ORF with a start are included up to the beginning of the contig.
Along with assembled DNA, a reference set of ORFs is being provided with assembly 6. The ORFs are numbered sequentially with the lowest numbers deriving from the contigs with the lowest numbers. A typical fasta header line reads as follows:
orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp
This is interpreted as the protein sequence for ORF 2 from assembly 6. It derives from nucleotides 610-281 in contig 6-1097 (a start coordinate greater than the stop indicates it is read from the complementary strand). The letter "e" indicates that an entire ORF is present (start and stop). The letter "i" is used to indicate an incomplete ORF with the letters N and C to indicate the end of the ORF that is incomplete. The letter "n" is used to indicate that while an entire ORF is given, the reading frame remains open to the beginning of the contig. A count of codons that could be used to extend such an ORF is given in the header line. Examples are:
orf6.1.prot orf6-1072:1371-1:iC 1371 bp, 457 aa, contig 2182 bp
orf6.4.prot orf6-1097:2281-1868:iN 414 bp, 137 aa, contig 2283 bp
orf6.14.prot orf6-1174:18-545:en 528 bp, 175 aa, contig 2692 bp upcont=5
Fasta headers for the DNA sequences of ORFs are formed in the same way except that the ORF name is lacking the ".prot" extension. ORF DNA sequences are always given as the sense strand.
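The header layout described above lends itself to a small parser. The following is a sketch only: the regular expression is inferred from the example headers rather than from a published grammar, the optional leading ">" and the reading of upcont as the upstream codon count are assumptions, and the function and field names are illustrative.

```python
import re

# Layout inferred from the example headers; not an official specification.
HEADER = re.compile(
    r"^>?orf(?P<assembly>\d+)\.(?P<orf>\d+)(?P<prot>\.prot)?\s+"
    r"orf\d+-(?P<contig>\d+):(?P<start>\d+)-(?P<end>\d+):(?P<status>[A-Za-z]+)\s+"
    r"(?P<bp>\d+) bp, (?P<aa>\d+) aa, contig (?P<contig_bp>\d+) bp"
    r"(?:\s+upcont=(?P<upcont>\d+))?\s*$"
)

def parse_orf_header(line):
    """Parse one ORF header line into a dictionary (hypothetical helper)."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    f = match.groupdict()
    start, end = int(f["start"]), int(f["end"])
    return {
        "assembly": int(f["assembly"]),
        "orf": int(f["orf"]),
        "is_protein": f["prot"] is not None,   # ".prot" marks a translation
        "contig": int(f["contig"]),
        "start": start,
        "end": end,
        # A start coordinate greater than the end means the complementary strand.
        "strand": "-" if start > end else "+",
        "status": f["status"],                 # e, en, iN or iC
        "length_bp": int(f["bp"]),
        "length_aa": int(f["aa"]),
        "contig_bp": int(f["contig_bp"]),
        # Assumed to be the count of codons available to extend an "n" ORF.
        "upstream_codons": int(f["upcont"]) if f["upcont"] else None,
    }

info = parse_orf_header(
    "orf6.2.prot orf6-1097:610-281:e 330 bp, 109 aa, contig 2283 bp"
)
print(info["strand"], info["status"], info["length_aa"])
```

In each of the example headers the span between the two coordinates, counted inclusively, equals the stated length in bp, which gives a simple consistency check on parsed records.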
The set of 9168 ORFs contains a large number of ORFs that are internal
to or are overlapping with larger ORFs. These smaller ORFs are
currently included for completeness and will be removed from the
reference set at a later date. In a smaller number of cases, two
ORFs are parts of the same gene. Causes include: introns, gaps in the
current sequence, and remaining frameshifts.
The Candida genome contains regions that are homozygous, and others that are not. In homozygous regions, the assembler can combine reads from both alleles into the same contig. In heterozygous regions where the level of heterozygosity is low, it can do the same in spite of a few disagreements between alleles (it treats the polymorphisms as if they resulted from sequencing errors). From the assembler's point of view these regions are effectively homozygous. In these release notes, the term "homozygous" should be interpreted as looking homozygous to the assembler, and a low level of polymorphism between alleles can still be found in the homozygous regions. Assembly 19 does not currently contain information on polymorphisms in such regions. In the near future we will provide annotation of such residual polymorphisms.
In regions with more than minimal divergence between alleles, the assembler must put reads from the two alleles into different contigs. This happened frequently in assembly 6, resulting in considerable fragmentation and difficulty in interpretation, e.g., in distinguishing allele pairs from family members.
In assembly 19, we have developed techniques to detect separate assembly of alleles and to combine separated contigs from assembly 6 into diploid contigs in assembly 19. For most contigs in assembly 19, we present distinct sequences for the two alleles.
Contig numbering. For some contigs from assembly 6, we found no indication of allele sequence assembled separately. Such contigs passed unchanged, except possibly for minor differences in trimming of low-quality bases at the end, into assembly 19. Contigs of this kind have the same number in assembly 19; for example, Contig19-1785 is the same as Contig6-1785, and is presumed to be homozygous.
When we were able to detect separation of alleles in assembly 6, we combined the affected assembly 6 contigs into larger diploid contigs in assembly 19. All contigs so formed were assigned numbers starting at 10000; for example, Contig19-10014 is made up from contigs 6-1076, 6-2434, 6-1473, 6-1632, 6-2141, and 6-2001. A diagram is provided in PDF format for Contig19-10014 (and all others) showing how it is formed from assembly 6 contigs. A dotted line separates the assembly 6 contigs assigned to the two alleles. In regions where one allele has a gap, the sequence is presumed to be homozygous and is filled in from the other allele. Otherwise the top allele derives its sequence from the assembly 6 contig shown above the dotted line, and the bottom allele from the contig at the same position shown below the line. This process results in two sequences representing the two alleles for the contig. The top allele is arbitrarily designated as primary, and the sequence given for Contig19-10014 is that derived from the top set of assembly 6 contigs. The sequence for the other allele is given the name Contig19-20014 (i.e., add 10000 to the number of the primary allele). In viewing the diagrams, note that because of insertions and deletions between alleles, corresponding positions on the two alleles are not always connected by a direct vertical line, but usually in large diploid contigs the size of insertions is visually negligible.
Contig19-10262 is exceptional in that it was constructed by joining two assembly 6 contigs with no evidence of separation of alleles. Accordingly it does not have a second allele, and there is no contig 19-20262.
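The numbering rules for assembly 19 can be summarized in a few lines of code. This is a minimal sketch, assuming that primary diploid contigs occupy the range 10000-19999 and secondary alleles the range from 20000 up (the notes state only that numbering starts at 10000 and that 10000 is added for the second allele); the function name allele_partner is hypothetical.

```python
import re

# The one diploid-series contig stated to have no second allele.
NO_SECOND_ALLELE = {10262}

def allele_partner(name):
    """Return the assembly 19 contig holding the other allele, or None.

    Contigs numbered below 10000 passed unchanged from assembly 6 and are
    presumed homozygous, so they have no partner.
    """
    match = re.fullmatch(r"Contig19-(\d+)", name)
    if match is None:
        raise ValueError(f"not an assembly 19 contig name: {name!r}")
    number = int(match.group(1))
    if number < 10000 or number in NO_SECOND_ALLELE:
        return None
    if number < 20000:                        # assumed primary range
        return f"Contig19-{number + 10000}"
    return f"Contig19-{number - 10000}"       # assumed secondary range

print(allele_partner("Contig19-10014"))
print(allele_partner("Contig19-1785"))
```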
ORFs. ORFs were called using the same methods described for assembly 6, with one addition. In a small number of cases, the construction of diploid contigs involved the insertion of blocks of "N" bases to fill gaps on one allele where evidence indicated that the sequence should not be filled in from the other allele. Usually the number of N's to be inserted was known at best approximately. To avoid having ORFs crossing large blocks of N of essentially arbitrary length, ORF calling stopped at any group of 12 or more N's, and ORFs that run up against such N-blocks are labeled using the same incompleteness rules applied to ORFs running off the ends of contigs in assembly 6. ORFs of this type are identifiable by inclusion of 12 N's at the affected end, which translate to 4 X's in the protein sequence.
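The rule that ORF calling stops at any group of 12 or more N's amounts to splitting each contig at such blocks before scanning for ORFs. The sketch below illustrates only that splitting step, not the ORF caller itself; the function name and the 0-based, half-open coordinates are choices made for this example.

```python
import re

# A run of 12 or more N's acts as a barrier to ORF calling.
N_BLOCK = re.compile(r"N{12,}", re.IGNORECASE)

def segments_between_n_blocks(sequence):
    """Yield (start, end, subsequence) for each stretch between N-blocks.

    Coordinates are 0-based and half-open. Runs of fewer than 12 N's are
    left inside the segments, since only longer blocks stop ORF calling.
    """
    position = 0
    for block in N_BLOCK.finditer(sequence):
        if block.start() > position:
            yield position, block.start(), sequence[position:block.start()]
        position = block.end()
    if position < len(sequence):
        yield position, len(sequence), sequence[position:]

example = "ACGT" + "N" * 12 + "GGCC" + "N" * 5 + "TT"
for start, end, piece in segments_between_n_blocks(example):
    print(start, end, piece)
```

An ORF that runs up against one of these boundaries would then be labeled incomplete at that end, in the same way as an ORF running off the end of a contig.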
ORFs were called using both alleles of the diploid contigs. There are 14220 ORFs in the complete set so obtained. In many cases ORFs are exactly duplicated between alleles.
ORF Alleles. Nonredundant Protein Set. A computational process identified pairs of ORFs that are deemed to be alleles based on position and protein sequence similarity. Generally the identification of alleles is straightforward. In complicated instances we recommend examination of the ORFs and blast results to understand the situation. The web pages identify ORFs designated as alleles and give indications of which cases are complicated. The allele pairs were used to generate a nonredundant protein set using the following rule: whenever the protein sequences for a pair of alleles were identical, the translation of the ORF derived from the secondary allele (the 20000-series contig) was excluded from the nonredundant protein set. This set of proteins was used as the blast database in performing the searches of Candida ORFs against all other Candida ORFs. There are 9259 proteins in the nonredundant set.
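The exclusion rule for the nonredundant protein set can be expressed compactly. The following sketch assumes the allele pairs have already been identified and are supplied as (primary, secondary) name pairs; the ORF names and sequences in the example are placeholders, not real identifiers from the data set.

```python
def nonredundant_proteins(proteins, allele_pairs):
    """Apply the stated rule to build a nonredundant protein set.

    proteins: dict mapping ORF name -> protein sequence.
    allele_pairs: iterable of (primary_orf, secondary_orf) names, where the
        secondary ORF comes from the 20000-series contig.
    The secondary translation is dropped only when the two protein
    sequences are identical; differing alleles are both kept.
    """
    excluded = {
        secondary
        for primary, secondary in allele_pairs
        if proteins[primary] == proteins[secondary]
    }
    return {name: seq for name, seq in proteins.items() if name not in excluded}

# Placeholder data for illustration only.
proteins = {
    "primary_a": "MKTLLV",
    "secondary_a": "MKTLLV",   # identical to primary_a, so excluded
    "primary_b": "MSTNPK",
    "secondary_b": "MSTNPR",   # differs from primary_b, so kept
    "unpaired_c": "MAAGLW",
}
pairs = [("primary_a", "secondary_a"), ("primary_b", "secondary_b")]
print(sorted(nonredundant_proteins(proteins, pairs)))
```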