
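Self-compiled textbook

Introduction to Genomics

(Genomics)

Huang Jianchang

Zhongkai Agricultural Technology College (仲恺农业技术学院)

College of Horticulture and Landscape Architecture

November 2010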

Introduction

§0.1 From Genetics to Genomics

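1 Scope and tasks of genomics

Genomics is the study of the structure and function of the genes of organisms. It grew out of genetics as a frontier discipline of modern biotechnology, it underpins modern molecular biology and genetic engineering, and it is among the most active and fastest-moving areas of biological research today. Its main tasks are to investigate the structure and function of genes, to construct genetic and physical maps, to build and develop bioinformatics, and to provide the technical basis for the genetic improvement of organisms and for the prevention and treatment of hereditary disease.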

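2 Milestones in the development of genomics

1900: Rediscovery of Mendelian genetics. In 1865 Mendel, building on earlier work and on his own eight years of pea hybridization experiments, proposed that hereditary factors segregate and recombine, which set the stage for genetics to emerge as an independent discipline.

1910: Morgan discovered genetic linkage and established the relationship between genes and chromosomes. In 1911 the American biologist Thomas Hunt Morgan set out his theory of the gene, and in 1933 he received the Nobel Prize in Physiology or Medicine.

1944: Avery and colleagues showed that the genetic material is DNA.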

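1953: Watson and Crick discovered the double-helix structure of DNA. Working together at the University of Cambridge in 1953, the British scientist Francis Crick and the young American biologist James Watson found that DNA is a double helix made of two nucleotide chains, for which they received the 1962 Nobel Prize in Physiology or Medicine.

1973: Cohen and colleagues produced recombinant DNA for the first time, opening the era of genetic engineering. In 1973 Stanley Cohen of Stanford University and his colleagues joined two different plasmids and introduced the product into Escherichia coli. Cohen then applied to the US Patent Office, as an inventor of recombinant DNA technology, for the world's first patent on a genetic engineering technique. The work showed that the natural barriers between species, built up over hundreds of millions of years, could be crossed: people could now modify the heritable traits of organisms in a directed way and even create new forms of life.

A year earlier, in 1972, the American biochemist Paul Berg, a founder of modern genetic engineering, had cut the DNA of two viruses with the same restriction endonuclease and joined the two DNAs with DNA ligase. The resulting recombinant DNA molecules were the first DNA from two different organisms to be joined in vitro, and they mark the birth of genetic engineering. Berg received the 1980 Nobel Prize in Chemistry for this work.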

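1983: Mullis invented PCR, which made large-scale amplification of DNA in vitro possible. PCR, short for polymerase chain reaction, amplifies DNA in large quantities in a short time. The idea came to Kary Mullis in 1983 during a drive to the countryside for a holiday. PCR has been called a "photocopier for DNA molecules".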

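1990: Launch of the Human Genome Project. On 7 March 1986 the American virologist Renato Dulbecco (Nobel laureate, 1975) published an article in Science entitled "A turning point in cancer research: sequencing the human genome", which later came to be regarded as the "tender document" for the Human Genome Project. In 1989 the US National Institutes of Health set up the National Center for Human Genome Research, with Watson as its first director. In 1990 the US Congress approved the project, and the Human Genome Project, often called the "Apollo programme of the life sciences", formally began.

The number of human chromosomes had been settled earlier. In 1955 Joe Hin Tjio (蒋有兴), a scientist of Chinese descent, and the Swedish scientist Albert Levan established experimentally that human cells have 46 chromosomes, and they published the finding the following year.

A genome is all of the DNA in an organism, including its genes. The Human Genome Project set out to determine the sequence of all the DNA in the 23 pairs of human chromosomes, which comprises about 3.1647 billion base pairs and roughly 30,000 genes. In other words, the book of life is written in more than 3 billion letters; set in newspaper type, it would fill about 200,000 pages. The main tasks of the project were to find all the genes in human DNA, to determine the order of the 3 billion base pairs, to build the corresponding databases and analyse the data, and to examine the racial, ethical and social issues the project might raise.

On 26 June 2000 US President Clinton and British Prime Minister Blair jointly announced that the first draft of the human genome had been completed. On 12 February 2001 scientists from six countries (China, the United States, Japan, Germany, France and the United Kingdom) and the US company Celera published the human genome map and an initial analysis. On 15 April 2003, shortly before the 50th anniversary of the publication of the double-helix model, the heads of state or government of the six countries signed a declaration, and their scientists jointly announced that the human genome sequence was complete.

The completion of the human genome map is a milestone in humanity's exploration of itself. Many commentators regard it as marking the birth of the century of biotechnology, in the same way that the birth of quantum theory a century earlier is regarded as having opened a twentieth century dominated by physics.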

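1995: First genome of a prokaryote (a bacterium) sequenced.

1996: First genome of a eukaryote (yeast) sequenced.

1998: First genome of a multicellular organism (a nematode) sequenced.

2000: Genomes of the fruit fly and Arabidopsis sequenced; first drafts of the human and rice genomes completed.

2001: Draft human genome sequence and initial analysis published.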

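2002: Draft genomes of rice (indica and japonica) completed. Rice is one of the most important food crops, and more than half of the world's population depends on it as a staple. The international rice genome sequencing consortium was formed in September 1997 during a plant molecular biology congress in Singapore. In February 1998 representatives of five countries (China, Japan, the United States, the United Kingdom and Korea) drew up the International Rice Genome Sequencing Project, another major international genome collaboration following the Human Genome Project. On 21 November 2002 the cover of Nature again showed heavy ears of rice, and the issue carried papers reporting the finished sequence of chromosome 4, by Chinese scientists, and of chromosome 1, by Japanese scientists. The project had been expected to reach its goal by 2008 but finished ahead of schedule: its sequencing results were published in Nature on 11 August 2005.

1 September 2005: Nature published the sequence of the chimpanzee genome.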

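From the birth of the double-helix model to the completion of the Human Genome Project, exactly half a century passed. Over those 50 years scientists deciphered the secrets of life step by step. In that story we see the development of gene science, the ingenuity of the scientists, and above all their persistent pursuit of truth. DNA is a helix, and the path of the life sciences is also a spiral, one without end. Countless unsolved questions about the human body and the natural world still await exploration.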

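§0.3 The molecular basis of genomics

1 Basic concepts

Heredity: the appearance of parental traits in the offspring, so that offspring broadly resemble their parents. The sayings "plant melons and you get melons, sow beans and you get beans" for plants and "dragons beget dragons, phoenixes beget phoenixes" for animals describe heredity.

DNA: deoxyribonucleic acid, an important biological macromolecule. It is the main material basis of heredity and the store of genetic information. DNA is a long-chain polymer built from four kinds of nucleotides, which can be joined in any order into chains millions of nucleotides long.

Gene: the basic unit of heredity that controls the expression of a trait; a segment of DNA.

Genome: the complete set of genes carried by an organism.

The basic factors that drive evolution are heredity, variation and selection. Genomics is concerned with the first of these, heredity.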

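2 Molecular biology of DNA

2.1 Composition of DNA

DNA is a long-chain polymer of four kinds of nucleotides, supplied for synthesis as dATP, dCTP, dGTP and dTTP. Two complementary single strands normally pair to form a double strand. A polynucleotide chain can be extended only in the 5'→3' direction.

2.2 Base pairing

A pairs with T and C pairs with G; these are the complementary base pairs.

2.3 The double-helix conformation of DNA

3 Molecular biology of RNA

3.1 Composition of RNA

RNA resembles DNA in structure, but its bases are A, G, U and C.

3.2 Conformation of RNA

RNA exists as a single strand, but single-stranded RNA of any size folds to form longer or shorter intramolecular double-helical regions.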

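4 Genome sequences

4.1 C-value

The total amount of DNA in a haploid genome.

4.2 Complexity of genome sequences

The genome sequences of all organisms are highly complex.

4.3 Repetitive sequences

The genomes of all organisms contain large amounts of repetitive sequence.

4.4 Single-copy sequences

The genomes of lower organisms consist largely of single-copy sequences, whereas in higher organisms single-copy sequences make up only a small share of the genome.

5 Genes and gene families

5.1 RNA-coding genes: rRNA genes, tRNA genes, scRNA genes, snRNA genes, snoRNA genes

5.2 Protein-coding genes

5.3 Gene families

5.4 Genes with unusual structure

5.5 Pseudogenes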

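Formation and development of the gene concept

1866: The Austrian geneticist G. Mendel, on the basis of nearly ten years of pea hybridization experiments, first described the basic laws of heredity, namely the law of segregation and the law of independent assortment.

1909: The Danish biologist W. Johannsen coined the word "gene", from a Greek root meaning "to give life", to replace Mendel's "hereditary factor".

1910: The American geneticist T. H. Morgan, working with the fruit fly, established that the material basis of heredity lies in the chromosomes and discovered the law of linkage and crossing over.

1928: F. Griffith first observed transformation in pneumococcus (Streptococcus pneumoniae).

1944: The American microbiologist O. T. Avery and colleagues repeated the experiment and showed that the transforming factor is DNA. This carried the theory that genes are located on chromosomes a step further and led to the proposal that "DNA is the carrier of genes".

1953: The DNA double helix of J. Watson and F. Crick became the key to understanding DNA replication. In particular, the discovery of semiconservative replication resolved the long-standing puzzle of how genes copy themselves, and it laid the basis for the view that genes reside in DNA and that genetic information is passed on through semiconservative replication of DNA.

With continued progress in molecular biology and molecular genetics, and especially with the development of experimental tools such as DNA cloning, rapid and accurate nucleotide sequencing and nucleic acid hybridization, it became possible to study gene structure and function at the molecular level. Concepts such as movable genes, split genes, pseudogenes and overlapping genes followed. They enriched and deepened our understanding of the nature of the gene and strengthened the theoretical basis of genetic engineering.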

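1.1.2 Structure and properties of nucleic acids

1.1.2.1 Structure and properties of DNA

1. Structure of DNA

DNA is a long-chain polymer, a polynucleotide, built from monomers called nucleotides. Each nucleotide contains a sugar, a nitrogen-containing ring-shaped base and a phosphate group. The sugar in DNA is 2'-deoxyribose. Each nucleotide carries one of four bases: adenine (A), guanine (G), cytosine (C) or thymine (T). The base is covalently attached to the 1' position of 2'-deoxyribose to form a nucleoside (Figure 1-1).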

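2. DNA replication

DNA replication is the process by which a cell copies its DNA, so that the genetic information is passed to the daughter cells at cell division. During replication DNA is copied by DNA polymerase, which acts on single-stranded DNA and synthesizes a new strand complementary to it. DNA synthesis proceeds in the 5'→3' direction, and replication is semiconservative (Figure 1-4).

The mechanism of replication is very similar across organisms; the differences lie in the enzymes and proteins involved. In prokaryotes such as Escherichia coli, DNA polymerases I and III carry out DNA synthesis. In eukaryotes, five polymerases (α, β, γ, δ, ε) are described as taking part. Replication must be very accurate: even a small error rate would cause important genetic information to be lost after a few cell divisions. DNA polymerase can check whether each base inserted into the new strand is correct, which safeguards the accuracy of replication. This function, called proofreading, relies on the enzyme's reverse (3'→5') exonuclease activity, which removes a wrongly inserted base from the new strand so that the correct one can be added. The error rate of DNA replication is far lower than that of transcription: it is estimated that only about one base in five billion is inserted incorrectly.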

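4. Denaturation, renaturation and hybridization of DNA

Denaturation: the transition from the native state to the denatured state, also called melting.

Renaturation: under suitable conditions denatured DNA can recover the structure of native DNA; this process is called renaturation.

Molecular hybridization: renaturation in which the two strands come from different sources.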

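1.1.3 The eukaryotic genome

1. Eukaryotes have a true nucleus and a defined number of chromosomes. Gametes are haploid, and somatic cells are generally diploid. About 99% of the DNA is in the nuclear genome.

2. Eukaryotic genomes are much larger than prokaryotic genomes and have multiple origins of replication.

3. Most genes contain introns, so eukaryotic genes are generally discontinuous and are also called split genes.

4. Non-coding sequence exceeds coding sequence.

5. There are large amounts of repetitive DNA.

6. Operon structures of the prokaryotic type are generally not found in eukaryotes.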

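1.2 Structure and types of eukaryotic genes

1.2.1 Structures involved in gene transcription

1.2.1.1 Promoter

Gene expression is controlled by a stretch of DNA upstream of the coding sequence called the promoter. Conserved sequences in the promoter are recognized and bound by RNA polymerase and by other transcription-related proteins (transcription factors), which starts RNA transcription of the gene. The expression of a gene in the cell is determined by the sequence of its promoter and by how well that promoter binds RNA polymerase and transcription factors.

The promoter is a DNA sequence required for the initiation of transcription. It generally lies upstream of the structural gene, it is the site where RNA polymerase binds specifically to start transcription, and it is not itself transcribed. By convention the nucleotide at which transcription starts is numbered +1. DNA running from +1 toward the 5' end is upstream of the start point and is given negative numbers; DNA running from +1 toward the 3' end is downstream and is given positive numbers. Thus "-10" is the tenth nucleotide upstream of the transcription start point, and "+10" is the tenth nucleotide downstream.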

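1. Promoter elements

① Cap site: the transcription start point. Its base is usually A (on the coding strand), flanked on each side by several pyrimidine nucleotides.

② TATA box, also called the Goldberg-Hogness box: a conserved sequence upstream of the transcription start point, with the sequence TATAAATA, located at about -25 to -30. Most eukaryotic genes have a TATA box. It is not strictly required for transcription to start: when it is deleted, transcription still occurs, but it no longer starts at the original position and can begin at several different positions, giving multiple transcripts. Mutation of any base in the TATA box sharply reduces transcription. The TATA box therefore determines correct selection of the start point and affects the efficiency of initiation. Some genes lack a TATA box and presumably use an alternative mechanism.

③ CAAT box: present in some eukaryotic genes, with the consensus sequence GGTCAATCT, usually near -75. Despite the name, the leading GG is no less important than the CAAT part. Mutation of the CAAT box sharply reduces transcription efficiency; the box is essential for transcription of some genes and dispensable for others (such as the thymidine kinase gene).

④ GC box: some genes transcribed by RNA polymerase II carry a CCGCCC sequence further upstream of the start point. It is called the GC box and is involved in regulating transcription.

The cap site (transcription start site) and the TATA box are present in most genes and are called the core promoter. For most genes the core promoter supports a basal level of transcription.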

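1.2.1.2 Introns, exons and mRNA splicing

The DNA sequences of a gene that are retained in the mature mRNA are called exons, and those that are removed are called introns. Because the gene as a whole is a mosaic of exons and introns, it is called a split gene.

Introns are widespread in eukaryotes and eukaryotic viruses, and for a long time after their discovery they were thought to be a hallmark of eukaryotes. From 1983 onward, however, introns were found outside the eukaryotes as well, for example in the thymidylate synthase gene of bacteriophage T4 of Escherichia coli and in the leucine tRNA and serine tRNA genes of Sulfolobus. These findings overturned the idea that introns occur only in eukaryotes.

The number of introns per gene varies widely, from none to more than 50. Exon and intron lengths also vary, but introns are usually longer than exons and make up most of the gene sequence. Introns characteristically begin with GT at the 5' end and end with AG at the 3' end, which is known as the GT/AG rule.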

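1.2.1.3 Terminator

A transcription terminator is a DNA sequence downstream of the coding region of a gene that RNA polymerase recognizes as a signal to stop RNA synthesis. These sequences often contain self-complementary regions that form stem-loop (hairpin) secondary structures in the RNA product (Figure 1-9). Such structures cause the polymerase to pause and then terminate transcription.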

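1.1.2.2 Structure and properties of RNA

RNA is similar in structure to DNA, with some important differences. In RNA, ribose replaces the 2'-deoxyribose of DNA, and uracil, which also pairs with adenine, replaces thymine. In addition, RNA usually exists as a single-stranded polynucleotide and does not form a double helix between two strands, although complementary parts of the same RNA chain can base-pair to form short double-stranded regions.

Cells contain three main classes of RNA: tRNA (transfer RNA), rRNA (ribosomal RNA) and mRNA (messenger RNA), all transcribed from DNA.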

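1.2.1.4 Enhancer

Transcription from many eukaryotic promoters can be increased by regulatory elements located thousands of bases from the transcription start site; such an element is called an enhancer. The phenomenon was first found in the genome of the DNA virus SV40: a sequence of about 100 bp from SV40 DNA markedly increases transcription from a basal promoter even when placed far upstream. Enhancers are typically 100-200 bp long and contain several sequence elements that contribute to overall enhancer activity. These elements are either broadly active or cell-type specific.

1.2.1.5 Silencer

A DNA sequence that represses gene expression is called a silencer. Like an enhancer, it is a cis-acting regulatory element: its effect is independent of position and orientation, and it shows tissue and cell specificity. Silencers are common in eukaryotes but rare in bacteria. The main difference from an enhancer is that a silencer is a negative regulatory element.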

1.2.2.2 5’非翻译区

 基因的转录起始位点到翻译起始密码子之间的一段序列被称为5’非翻译区。该区的5’端是前体mRNA加帽(7-甲基鸟嘌呤核苷)的位点。有些基因的5’非翻译区中还有一些茎环结构,这些茎环结构与翻译起始密码子中的旁邻序列对翻译的效率都有影响。此外有些植物基因的5’非翻译区中还鉴定出有内含子存在。如在Ubiquitin基因、Sh基因、Actin基因、Adhl基因与Wx基因中的5’非翻译区中均有内含子存在,且这些内含子都有增强基因表达的作用。

1.2.2.3 编码区

1.起始密码子

 Kozak比较了47种植物基因与植物病毒基因中翻译起始位点附近的23个

核苷酸,除了一个基因例外,其余的都是从5’端的第一个AUG作为翻译起始密码子的。例外的是菜豆的凝集素基因,它的转录本5’端有4个AUG,但彼此的读码框不同。可能是由于前面3个AUG的旁邻序列不适宜核糖体的识别而不被使用,第4个AUG是真正的翻译起始密码子。

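1.2.3 Naming of genes

① Each gene locus is denoted by three lower-case italic letters, such as tur, taken from the first three letters of a word that describes the gene's properties. Where one word cannot describe the properties, letters from two or three words are used instead.

② Different genes whose mutants have the same phenotype are distinguished by a capital letter after the three letters.

③ Different mutation sites within one gene are denoted by Arabic numerals after the gene symbol. If the gene to which a mutation site belongs has not been determined, the capital letter is replaced by a hyphen.

④ The protein product and the phenotype of a gene are written with the gene symbol capitalized and in upright (non-italic) type.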

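1.2.4 Basic types of plant genes

Plant genes
· Structural protein genes
  · Nuclear structural protein genes
  · Cell wall protein genes
  · Cell membrane protein genes
  · Cytoplasmic structural protein genes
· Preferentially expressed genes
  · Embryo development-specific genes
  · Seed storage protein genes
  · Floral organ-specific genes
  · Fruit-specific genes
  · Vegetative organ-specific genes
  · Specialized organ-specific genes
  · Environmentally induced genes
· Cellular metabolic enzyme genes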

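1.3.1 Gene expression

1.3.1.1 The central dogma

1.3.1.2 Transcription

Transcription is the first stage of gene expression; it produces an RNA copy of the gene's DNA sequence. RNA is synthesized by RNA polymerase using DNA as the template. One of the two strands of the double helix is the template strand and the other is the non-template strand. The RNA is built on the template strand, so its sequence is a copy of the non-template strand, with U in place of T (Figure 1-13). The sequence of a gene normally refers to the base sequence of the non-template strand, which is also called the sense (+) strand or coding strand. The RNA molecule produced is called the transcript; it may then be translated into protein or serve as rRNA or tRNA.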
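The relationship between the two DNA strands and the transcript can be checked with a few lines of Python. This is only an illustrative sketch: the example sequence and the function names are made up here, and both strands are written 5'→3'.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(template_5to3: str) -> str:
    """RNA made on the template strand (RNA also written 5'->3')."""
    return reverse_complement(template_5to3).replace("T", "U")


coding = "ATGGCCTAA"                   # hypothetical non-template (coding) strand
template = reverse_complement(coding)  # the strand the polymerase reads
rna = transcribe(template)
print(template, rna)
```

The transcript has the same sequence as the coding strand except that U replaces T, which is why gene sequences are quoted from the non-template strand.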

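1.3.1.3 Protein synthesis (translation)

1. The genetic code

Each unit of three bases is called a codon and specifies one amino acid (Table 1-2).

Degeneracy (redundancy) of the genetic code

The four bases of DNA and RNA can form 4^3 = 64 codons, which encode the 20 amino acids found in proteins. Because there are more codons than amino acids, every amino acid except methionine and tryptophan has more than one codon. This is called degeneracy, or redundancy, of the genetic code.

Synonymous codons

Codons that encode the same amino acid are called synonymous codons. They usually differ in the third base, and this position is called the wobble position.

Termination codons (stop codons)

UAG, UGA and UAA do not encode any amino acid; they signal the end of protein synthesis.

Initiation codon

AUG, which encodes methionine, is also the signal for the start of protein synthesis and is called the initiation codon. Synthesis of all proteins begins with methionine, although in some cases the methionine is removed after synthesis is complete.

The genetic code has no punctuation. To read it correctly, reading must start at the correct position and then proceed codon by codon without interruption until a stop signal is reached. If one base is inserted into or deleted from the nucleotide sequence, every codon after that point is misread; this is a frameshift.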
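The codon count and the effect of a frameshift can be illustrated with a short Python sketch. The mRNA sequences below are invented for the example, and the reader function is a simplification that starts at the first AUG and ignores everything else about real translation.

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of the four bases: 4**3 = 64 codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]


def read_codons(mrna: str) -> list:
    """Read codons from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


normal = "AUGGCUGAAUAA"                   # hypothetical message
shifted = normal[:4] + "C" + normal[4:]   # one extra base inserted

print(len(all_codons))
print(read_codons(normal))
print(read_codons(shifted))
```

In this example the single inserted base changes the second codon and brings a UGA stop codon into frame, so the reading ends early.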

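1.3.2 Differences between prokaryotes and eukaryotes in the regulation of gene expression

1.3.2.1 Features of gene regulation in prokaryotes

Prokaryotes adapt and respond very readily to their environment, and this is the basis of their survival and reproduction.

Regulation in prokaryotes occurs mainly at the level of transcription, and regulation at the operon level is its main form.

Prokaryotes also have a number of mechanisms that act on translation, such as regulation by antisense RNA.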

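§0.4 Prospects for the application of genomics

1 Biodiversity

2 Cloning of organisms

3 Genetic improvement of organisms: a transgenic organism is produced by transferring a foreign gene into an animal or plant so that it expresses a trait it did not originally have. The resulting organisms are called transgenic animals or transgenic plants.

4 Human health: albinism in animals was recorded in ancient China. The Tang dynasty poet Li Bai wrote in "Songs of Qiupu" (秋浦歌): "Qiupu has many white apes, leaping and bounding like flying snow ...". Albinism in animals is in fact an inherited disorder. With gene therapy, a long and healthy life may no longer be just a dream.

5 Evolution of organisms

Figure caption: Phylogenetic relationships among multicellular organisms whose genomes have been sequenced or are currently being sequenced. Rice is the only cereal to have its genome sequenced. The genome sequence of the model plant Arabidopsis was largely completed in 2000. Species in dark blue are those with completed sequences or drafts that have been published; sequencing of genomes for species in turquoise is ongoing. Ma, millions of years ago.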

Chapter 1 Foundation of Genomics

§1 Cell Foundation of Genomics

Understanding what makes up a cell and how that cell works is fundamental to all the biological sciences. Appreciating the similarities and differences between cell types is particularly important to the fields of cell and molecular biology. These fundamental similarities and differences provide a unifying theme, allowing the principles learned from studying one cell type to be extrapolated and generalized to other cell types.

1.1 What is a Cell?

Cells (细胞) are the structural and functional units of all living organisms. Some organisms, like bacteria, are unicellular (单细胞), consisting of a single cell. Other organisms, such as humans, are multicellular (多细胞) and have many cells: an estimated 100,000,000,000,000 cells!

1.1.1 Cell Organization

There are two general categories of cells: prokaryotes and eukaryotes.

Prokaryotic (原核生物) Organisms

Prokaryotes are distinguished from eukaryotes on the basis of nuclear organization, specifically their lack of a nuclear membrane (核膜). Prokaryotes also lack the intracellular organelles and structures that are characteristic of eukaryotic cells.

Eukaryotic (真核生物) Organisms

The major and extremely significant difference between prokaryotes and eukaryotes is that eukaryotic cells contain membrane-bounded compartments in which specific metabolic activities take place. Most important among these is the nucleus, which houses the eukaryotic cell's DNA. It is this nucleus (细胞核) that gives the eukaryote (literally, "true nucleus") its name.

1.1.2 Cell Structures

The Plasma Membrane: A Cell's Protective Coat

The Cytoskeleton: A Cell's Scaffold

The Cytoplasm: A Cell's Inner Space

The Nucleus: A Cell's Center

The Ribosome: The Protein Production Machine

Mitochondria and Chloroplasts: The Power Generators

The Endoplasmic Reticulum and the Golgi Apparatus: Macromolecule Managers

Lysosomes and Peroxisomes: The Cellular Digestive System

1.1.3 Making New Cells

Mitosis (有丝分裂): Every time a cell divides, it must ensure that its DNA is shared between the two daughter cells. Mitosis is the process of "divvying up" the genome between the daughter cells.

Meiosis (减数分裂): Meiosis is a specialized type of cell division that occurs during the formation of gametes. Meiosis I (减数分裂I) refers to the first of the two divisions and is often called the reduction division. Meiosis II (减数分裂II) is quite simply a mitotic division of each of the haploid cells produced in Meiosis I.

§1.2 What is a Genome

The genome (基因组) is the entire DNA content of a cell, including all of the genes and all of the intergenic regions. Life is specified by genomes. Every organism, including humans, has a genome that contains all the biological information needed to build and maintain a living example of that organism. The biological information contained in a genome is encoded in its DNA, deoxyribonucleic acid (脱氧核糖核酸), and divided into discrete units called genes (基因).

1.2.1 Eukaryotic Nuclear Genomes

Eukaryotic nuclear genomes range in size from less than 10Mb to over 100,000Mb. Genome size broadly coincides with organism complexity, the genomes of higher eukaryotes being larger than those of lower eukaryotes, but size is determined not only by the number of genes in the genome but also by the amount of repetitive DNA (重复DNA), the larger genomes tending to be ones in which the copy numbers of the repeat sequences are highest.

The nuclear genome is split into a set of linear DNA molecules, each contained in a chromosome (染色体). No exceptions to this pattern are known: all eukaryotes that have been studied have at least two chromosomes and the DNA molecules are always linear.

Table 1.1 Size of genomes

Organism                                  Genome size (Mb)
Prokaryotes
  Mycoplasma genitalium                   0.58
  Escherichia coli                        4.64
  Bacillus megaterium                     30
Eukaryotes
 Fungi
  Saccharomyces cerevisiae (yeast)        12.1
  Aspergillus nidulans                    25.4
 Protozoa
  Tetrahymena pyriformis                  190
 Invertebrates
  Drosophila melanogaster (fruit fly)     100
  Bombyx mori (silkworm)                  490
  Locusta migratoria (locust)             5,000
 Vertebrates
  Fugu rubripes (pufferfish)              400
  Homo sapiens (humans)                   3,000
  Mus musculus (mouse)                    3,300
 Plants
  Arabidopsis thaliana (thale cress)      100
  Oryza sativa (rice)                     565
  Zea mays (maize)                        5,000
  Triticum aestivum (wheat)               17,000
  Fritillaria assyriaca (fritillary)      120,000

1.2.2 Eukaryotic Organelle Genomes

Almost all eukaryotes have mitochondrial genomes and all photosynthetic eukaryotes have chloroplast genomes. Most mitochondrial and chloroplast genomes are circular, but in many eukaryotes the circular genomes coexist in their organelles with linear versions.

Copy numbers for organelle genomes are not particularly well understood. Each human mitochondrion contains about 10 identical molecules, which means that there are about 8000 per cell. Photosynthetic microorganisms such as Chlamydomonas have approximately 1000 chloroplast genomes per cell, about one fifth the number present in a higher plant cell.

Table 1.2 Sizes of mitochondrial and chloroplast genomes

Species                       Type of organism                 Genome size (kb)
A. Mitochondrial genomes
  Chlamydomonas reinhardtii   Green alga                       16
  Mus musculus                Vertebrate (mouse)               16
  Homo sapiens                Vertebrate (human)               17
  Drosophila melanogaster     Invertebrate (fruit fly)         19
  Chondrus crispus            Red alga                         26
  Saccharomyces cerevisiae    Yeast                            75
  Brassica oleracea           Flowering plant (cabbage)        160
  Arabidopsis thaliana        Flowering plant (thale cress)    367
  Zea mays                    Flowering plant (maize)          570
  Cucumis melo                Flowering plant (melon)          2500
B. Chloroplast genomes
  Pisum sativum               Flowering plant (pea)            120
  Marchantia polymorpha       Liverwort                        121
  Oryza sativa                Flowering plant (rice)           136
  Nicotiana tabacum           Flowering plant (tobacco)        156
  Chlamydomonas reinhardtii   Green alga                       195

1.2.3 Prokaryotic Genomes

Most prokaryotic genomes are less than 5 Mb in size, although there are a few that are substantially larger than this. The traditional view has been that in a typical prokaryote the genome is contained in a single circular DNA molecule, localized within the nucleoid (拟核). A complication concerns the precise status of plasmids (质粒) with regard to the prokaryotic genome. A plasmid is a small piece of DNA, often but not always circular, that coexists with the main chromosome in a bacterial cell. Some types of plasmid are able to integrate into the main genome, but others are thought to be permanently independent.

§1.3 What is a Chromosome

1.3.1 Number of Chromosomes (染色体数目)

Species (物种)                              Chromosome number (2n)
Human, Homo sapiens                         46
House mouse, Mus musculus                   40
Fruit fly, Drosophila melanogaster          8
Wheat, Triticum vulgare                     42
Rice, Oryza sativa                          24
Pea, Pisum sativum                          14
Neurospora, Neurospora crassa               7 (haploid number, n)
Chlamydomonas, Chlamydomonas reinhardi      16 (haploid number, n)

1.3.2 Chromosome form

· Chromosome length
· Centromere (着丝粒)
· Telomere (端粒)
· Nucleolar organizer (核仁组成区)
· Satellite (随体)
· Chromomere (染色粒)
· Euchromatin (常染色质) and heterochromatin (异染色质)

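1.3.3 Chromosome banding patterns

Bands of polytene chromosomes: in the polytene chromosomes of Diptera, the specific pairing of large numbers of chromomeres produces bands (Painter 1939). The bands usually have clear outlines and characteristic widths. Each band is sharply delimited from its neighbours and separated from them by regions called interbands.

The bands contain about 95% of the chromosomal DNA. They behave like units of genetic function and show selective transcriptional activity, which appears morphologically as a loosening of the band.

Chromosome banding techniques:

· G bands: bands produced by Giemsa staining;
· Q bands: bands produced by staining with fluorescent dyes;
· R bands: the reverse of the G-band pattern;
· C bands: bands that show constitutive heterochromatin.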

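1.3.4 Banding analysis of human chromosomes

1956: The cytogeneticist Joe Hin Tjio (蒋有兴), working in Sweden with Albert Levan, reported that the human chromosome number is 46, not 48 as previously thought.

April 1960: An international conference in Denver, the capital of Colorado in the United States, standardized the terms, symbols and methods for grouping and naming human chromosomes.

1971: The Fourth International Congress of Human Genetics, held in Paris, proposed a banding system for human chromosomes.

1977: The Standing Committee on Human Cytogenetic Nomenclature met to revise the Paris system and afterwards published An International System for Human Cytogenetic Nomenclature (ISCN) (1978).

1981: The Standing Committee published An International System for Human Cytogenetic Nomenclature: High-Resolution Banding.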


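Denver system (1960): human chromosomes are divided into seven groups by size and centromere position: A (1-3), B (4-5), C (6-12), D (13-15), E (16-18), F (19-20) and G (21-22), plus the X and Y chromosomes.

International System for Human Cytogenetic Nomenclature, high-resolution banding (1981): the short arm of a chromosome is denoted p and the long arm q. Each arm is divided into regions, and each region into bands. For example, 12p14 denotes band 4 of region 1 on the short arm of chromosome 12, and 9q34 denotes band 4 of region 3 on the long arm of chromosome 9.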
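The band notation is regular enough to be parsed mechanically. The Python sketch below handles only the simple four-part form used in the examples above (chromosome, arm, one-digit region, one-digit band); sub-bands and other extensions of the nomenclature are not covered, and the function name is made up for illustration.

```python
import re

# chromosome (1-2 digits, X or Y), arm (p or q), region digit, band digit
BAND_RE = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)$")


def parse_band(label: str) -> dict:
    """Split a label such as '9q34' into its four components."""
    match = BAND_RE.match(label)
    if match is None:
        raise ValueError(f"not a simple band label: {label!r}")
    chromosome, arm, region, band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",
        "region": int(region),
        "band": int(band),
    }


print(parse_band("12p14"))
print(parse_band("9q34"))
```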


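Chapter 2 Molecular Markers (分子标记)

§2.1 Genetic Markers

Properties of a genetic marker:

· It is polymorphic between the parents (they differ), so it can be recognized.
· The polymorphism between the parents reappears in the progeny, so it is heritable.

Types of genetic marker:

· Morphological, or visible, markers
· Cytological markers
· Isozyme markers
· DNA markers

Morphological or visible markers: genetic markers that can be seen on the individual, such as flower colour (red, white) or plant height (tall, dwarf).

Cytological markers: markers that can be observed cytologically, mainly recognizable chromosome features such as knobs (at the end of maize chromosome 9) and bands.

Isozyme markers: markers based on isozyme banding patterns, with polymorphism revealed by electrophoresis, for example esterase and catalase isozymes.

DNA markers: markers based on DNA fragments, with polymorphism revealed by electrophoresis of the fragments, for example RFLP.

Limitations of morphological, cytological and isozyme markers:

· They are limited in number. Despite nearly a century of effort, few such markers are available, which restricts their use.
· They are laborious to score, which makes large-scale research and application difficult.

Advantages of DNA markers:

· They are extremely numerous.
· They are relatively simple to score and suit large-scale work.
· They are distinct and easy to recognize.
· They are little affected by the environment, since the marker is the genetic material itself.

Isozyme markers and DNA markers are both molecular markers, but the term molecular marker now generally refers to DNA markers.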

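§2.2 RFLP

RFLP stands for restriction fragment length polymorphism. It was first described as a marker system by Botstein et al. in 1980 and has since been widely used in microorganisms, plants, animals and humans. RFLP markers were the first DNA markers to be used. The technique is relatively complex and involves several steps.

2.2.1 DNA extraction

RFLP analysis requires DNA to be extracted from the organism and processed in vitro, so DNA extraction is its first step.

There are many extraction methods, and they differ between organisms. The basic principles and steps are as follows.

1 Choice of sample:

Plants and animals are made up of different organs and tissues. DNA should be extracted from actively growing organs, tissues and cells, such as plant seedlings or newly emerged leaves, or animal blood. Extraction is judged by the quality and yield of the DNA, and a suitable choice of organ or tissue is what secures both. Samples should be placed at low temperature as soon as possible after collection and kept at ultra-low temperature for long-term storage.

2 Cell lysis:

As the first step of extraction, samples of multicellular organisms are ground to break up the tissue. The powder is then treated with a lysis solution to lyse the cells. The lysis solution usually consists of Tris [tris(hydroxymethyl)aminomethane] buffer plus a lysing agent such as SDS (sodium dodecyl sulfate) or CTAB (hexadecyltrimethylammonium bromide).

3 Protein precipitation:

Reagents that denature proteins are generally used, such as potassium acetate, sodium acetate or chloroform.

4 RNA digestion:

RNase is generally used.

5 DNA precipitation:

Ice-cold isopropanol (2-propanol), 70% ethanol and similar reagents are generally used.

6 Dissolving the DNA:

TE buffer (10 mM Tris, 1 mM EDTA, pH 8.0, sterilized) is generally used. EDTA is ethylenediaminetetraacetic acid, used here as the disodium salt (disodium ethylenediaminetetraacetate).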

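2.2.2 Southern blotting

Southern blotting was invented by E. M. Southern in 1975 and is named after him.

1 Generating restriction fragments:

DNA is a very large molecule: the rice genome is about 4.3 × 10^8 bp and the wheat genome about 1.6 × 10^10 bp. Restriction endonucleases recognize restriction sites in DNA and cut it into fragments of defined length, the restriction fragments.

2 Electrophoresis:

Gels are made from agarose, usually at 0.9-1.0%. Electrophoresis of the digested DNA separates the fragments and arranges them by molecular weight.

Preparing the hybridization membrane:

The DNA is transferred from the gel to a special hybridization membrane, commonly Hybond N+ nylon membrane (Amersham).

· Gel treatment: depurination in 0.25 M HCl; denaturation in 0.4 M NaOH.
· DNA transfer: transfer solution, 0.4 M NaOH.
· Membrane rinse: rinse solution, 2× SSC buffer.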

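2.2.3 Preparation of probes

The probes used in RFLP analysis are DNA fragments about 0.5-2.5 kb long. Sources include random genomic fragments and cDNAs.

Probes are maintained as clones in plasmid vectors such as pUC19 and propagated in Escherichia coli.

1 Vector:

A plasmid used as a vector must have three features:

· an origin of replication (Ori);
· a dominant selectable marker, such as the ampicillin resistance gene (Amp^r);
· unique restriction sites.

2 Probe labelling:

Probes are labelled, for example with a radioisotope (32P-dCTP), so that they can be traced.

Isotope labelling: DNA polymerase (Klenow fragment) incorporates 32P-dCTP into the DNA fragment as it copies the DNA.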

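2.2.4 DNA hybridization

1 Prehybridization:

Hybridization solution plus sheared salmon sperm DNA (SSS DNA), incubated at 65°C for at least 2 hours.

Hybridization solution:
20× SSPE                250 ml
100× Denhardt's         50 ml
10% (w/v) SDS           50 ml
Make up to 1000 ml

2 Hybridization:

Add the labelled probe to the hybridization solution, add the prehybridized membrane, and incubate at 65°C overnight.

3 Washing the membrane:

A. 2× SSC, 0.1% SDS, 65°C, 20 min;
B. 1× SSC, 0.1% SDS, 65°C, 20 min;
C. 0.5× SSC, 0.1% SDS, 65°C, 20 min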

2.2.5 Autoradiography

Wrapping: wrap the hybridized membrane in plastic film and place it in a cassette.

Exposure: place X-ray film on the membrane and press firmly. Expose in a -80°C freezer for 2-4 days.

Development: remove the X-ray film in a darkroom, then develop, fix and rinse it.

2.2.6 Southern blotting

Eukaryotic DNA is cleaved with one or several restriction enzymes. The cleaved DNA is separated by size using agarose gel electrophoresis. The gel is then laid on a piece of nitrocellulose, and a flow of buffer is set up through the gel onto the nitrocellulose. This causes the DNA fragments to flow out of the gel and bind to the filter. A replica of the DNA fragments in the gel is created on the filter. The filter can be hybridized to a suitable labeled probe; fragments that hybridize to the probe will give a signal following autoradiography.

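§2.3 PCR

PCR (polymerase chain reaction) is a DNA amplification technique. It was invented in 1983 by K. B. Mullis and colleagues in the human genetics laboratory of the Cetus Corporation in the United States.

2.3.1 Principles of DNA replication

1 Semiconservative replication:

As the main genetic material, DNA must be able to replicate itself. Replication begins with the hydrogen bonds between the two strands breaking progressively. When part of the double helix has opened into two single strands while the rest remains double-stranded, each separated strand serves as a template: free nucleotides are taken up and joined according to the base-pairing rules (A with T, G with C) to form a new strand. Each new strand winds together with its template to form double-stranded DNA. As the double helix unwinds completely, two new DNA molecules are gradually formed. Because each new DNA molecule retains one strand of the parental molecule, this mode of replication is called the semiconservative model.

2 Semidiscontinuous replication:

When double-stranded DNA unwinds into two single-stranded templates, the molecule takes on a Y shape called the replication fork. During replication the fork moves in one direction.

Of the two template strands at the fork, one has its 3'→5' direction running the same way as fork movement, and the new strand made on it is synthesized continuously. The other has its 3'→5' direction running against fork movement, and the new strand made on it can be synthesized only discontinuously. Replication as a whole is therefore semidiscontinuous.

3 Primers and DNA polymerase:

During replication a primase complex (the primosome) binds to the single-stranded DNA and synthesizes short RNA primers. DNA polymerase III then extends from the primer to synthesize the DNA strand complementary to the template.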

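2.3.2 PCR technique

1 Basic principle:

PCR is a method for the enzymatic synthesis of a specific DNA fragment in vitro. Its main steps are:

· Denaturation at high temperature: the DNA to be amplified is melted into single-stranded templates.
· Annealing at low temperature: two synthetic oligonucleotide primers bind by complementarity to the two strands flanking the target fragment.
· Extension at intermediate temperature: at 72°C DNA polymerase adds nucleotides to the 3' end of each primer, so that the new strand grows in the 5'→3' direction along the template.

Because the DNA produced in each cycle serves as template in the next, the PCR product increases exponentially. After 25-30 cycles the theoretical amplification is up to about 10^9-fold; in practice it is at least 10^5-fold and usually 10^6 to 10^7-fold.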
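The figures quoted above follow from simple doubling arithmetic, as the Python sketch below shows. The per-cycle efficiency parameter is an assumption introduced here to show why real yields fall short of the theoretical value; it is not part of the original text.

```python
def fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Fold increase after `cycles` rounds.

    efficiency = 1.0 means every template is copied in every cycle
    (perfect doubling); lower values model incomplete copying.
    """
    return (1.0 + efficiency) ** cycles


ideal = fold_amplification(30)           # 2**30, about 1.07e9
partial = fold_amplification(30, 0.5)    # 1.5**30, about 1.9e5
print(f"{ideal:.3g} {partial:.3g}")
```

Thirty perfect doublings give about 10^9 copies per template, while an assumed efficiency of 50% per cycle gives only about 2 × 10^5, which is the order of magnitude of the lower practical figure.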

2 PCR reaction mix (example):

Component                     Volume (µl)
10× PCR buffer                2.5
15 mM MgCl2                   2.5 (adjustable)
1 mM dNTP                     5.0
Taq (1 U)                     1.0 (adjustable)
Primer (100 ng/ml)            1.0 × 2
Genomic DNA (20 ng/ml)        5.0
H2O                           x (to make up the total)
Total                         25.0

10× PCR buffer:

500 mM KCl, 200 mM Tris-HCl (pH 8.2), 0.01% gelatin
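The water volume and the final concentrations implied by this recipe follow from C1·V1 = C2·V2. The Python sketch below uses the volumes of the example as given; the dictionary keys and the helper function are made up for illustration.

```python
TOTAL_UL = 25.0

# Volumes (µl) from the example reaction mix; two primers at 1.0 µl each.
components_ul = {
    "10x PCR buffer": 2.5,
    "15 mM MgCl2": 2.5,
    "1 mM dNTP": 5.0,
    "Taq": 1.0,
    "primer 1": 1.0,
    "primer 2": 1.0,
    "genomic DNA": 5.0,
}

# Water makes up the remainder of the 25 µl reaction.
water_ul = TOTAL_UL - sum(components_ul.values())


def final_conc(stock: float, volume_ul: float, total_ul: float = TOTAL_UL) -> float:
    """Final concentration after dilution, in the units of `stock`."""
    return stock * volume_ul / total_ul


mgcl2_mM = final_conc(15.0, 2.5)    # MgCl2 in the reaction
dntp_mM = final_conc(1.0, 5.0)      # dNTP in the reaction
kcl_mM = final_conc(500.0, 2.5)     # KCl contributed by the 10x buffer
print(water_ul, mgcl2_mM, dntp_mM, kcl_mM)
```

With these volumes the reaction needs 7.0 µl of water and contains 1.5 mM MgCl2, 0.2 mM dNTP and 50 mM KCl.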

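3 PCR cycle:

The temperatures and times of the PCR programme follow the principle of high-temperature denaturation, low-temperature annealing and intermediate-temperature extension.

94°C        4-10 min     1 cycle

94°C        1 min
37-55°C     1 min        25-40 cycles
70-75°C     2 min

72°C        5-10 min     1 cycle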

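4 Factors affecting PCR:

PCR is simple to perform but is affected by many factors; good results require the conditions to be optimized for each case.

A. Template
B. Primers
C. Mg2+ concentration
D. dNTP concentration
E. Amount of Taq DNA polymerase
F. Cycling parameters

4.1 Taq DNA polymerase:

PCR originally used the Klenow fragment of Escherichia coli DNA polymerase I. Because this enzyme cannot withstand the high temperature (93-95°C) needed to melt the DNA in each cycle, fresh polymerase had to be added at every round of amplification, which made the procedure laborious, inefficient and costly. In addition, because the Klenow fragment works at a low temperature (37°C), primers readily paired non-specifically with the template and DNA secondary structure interfered with the reaction, giving many non-specific product bands.

A Properties of Taq polymerase:

Taq DNA polymerase is isolated and purified from Thermus aquaticus, a thermophilic bacterium that lives in hot springs and grows at 70-75°C.

Its molecular weight is 93.9 kD, and its specific activity can reach 200,000 U/mg protein.

In a reaction containing the four deoxynucleoside triphosphates, Taq DNA polymerase uses heat-denatured DNA as template and, starting from the primers bound at the two ends of the target region, synthesizes new strands in the 5'→3' direction along the template.

Taq DNA polymerase is highly thermostable. In experiments it retained 50% of its activity after 130 minutes at 92.5°C, 40 minutes at 95°C and 5-6 minutes at 97.5°C.

B Factors affecting enzyme activity:

Although Taq DNA polymerase tolerates a wide range of temperatures, temperatures above 90°C still denature and inactivate part of the enzyme. If the temperature is too low, enzyme activity is reduced and, in addition, primers may bind (especially at 25-27°C) to partly homologous sequences elsewhere in the genome, so that some of the products are not the target sequence. Raising the temperature appropriately dissociates most mismatched pairs and increases the specificity of the reaction. The optimum temperature for Taq DNA polymerase is 70°C.

Taq DNA polymerase is very sensitive to Mg2+ concentration. Like other polymerases, it is Mg2+-dependent. Measurements show that activity is highest at 2.0 mM MgCl2.

The optimum KCl concentration is 50 mM; above 75 mM polymerization is markedly inhibited.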

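4.2 Primer design:

Primers are the key determinant of the PCR result. The principles of primer design are:

(1) Primer length should be 15-30 bases and (G+C) content about 50%; runs of several consecutive purines or pyrimidines should be avoided.

(2) Avoid secondary structure within a primer.

(3) The two primers should not be complementary to each other, especially at their 3' ends, to avoid forming primer dimers.

(4) The 3' end of the primer should be G or C.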
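Several of these rules can be screened automatically. The Python sketch below is a rough illustration only: the 40-60% GC window, the maximum run length of 4 and the 4-base 3' overlap are assumed thresholds chosen for the example (the text gives only "about 50%" and "several"), the primer sequences are invented, and secondary structure (rule 2) is not checked.

```python
PURINES = set("AG")
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(primer: str) -> float:
    return (primer.count("G") + primer.count("C")) / len(primer)


def longest_run_same_class(primer: str) -> int:
    """Longest stretch of consecutive purines or consecutive pyrimidines."""
    longest = run = 1
    for prev, cur in zip(primer, primer[1:]):
        run = run + 1 if (prev in PURINES) == (cur in PURINES) else 1
        longest = max(longest, run)
    return longest


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def three_prime_overlap(a: str, b: str, k: int = 4) -> bool:
    """True if the last k bases of `a` can pair with some stretch of `b`."""
    return reverse_complement(a[-k:]) in b


def check_primer(primer: str, max_run: int = 4) -> dict:
    primer = primer.upper()
    return {
        "length_ok": 15 <= len(primer) <= 30,              # rule (1)
        "gc_ok": 0.40 <= gc_content(primer) <= 0.60,       # rule (1), assumed window
        "no_long_run": longest_run_same_class(primer) <= max_run,  # rule (1)
        "gc_at_3prime": primer[-1] in "GC",                # rule (4)
    }


forward = "AGTCCTGATCGGAATCCAGC"       # hypothetical primer
dimer_prone = "TGACCGTTAGCTGGATTCCG"   # hypothetical partner containing GCTG
print(check_primer(forward))
print(three_prime_overlap(forward, dimer_prone))   # rule (3)
```

A check of this kind only flags obvious problems; dedicated primer design software evaluates melting temperature and secondary structure as well.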

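4.3 Detection of PCR products and polymorphism analysis:

PCR products are usually separated by agarose gel electrophoresis, stained with ethidium bromide and viewed under ultraviolet light. PCR products usually show codominant banding patterns.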

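§2.3.4 RAPD

RAPD (random amplified polymorphic DNA) was introduced by Williams et al. in 1990.

RAPD is PCR with random amplification. Its main difference from ordinary PCR is the use of random primers. The primers are usually 10 nucleotides long and of arbitrary sequence, so a RAPD primer usually amplifies several DNA fragments from the genome. RAPD products are detected in the same way as ordinary PCR products. RAPD banding patterns are usually dominant.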

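§2.4 AFLP

AFLP (amplified fragment length polymorphism) was invented by the Dutch scientists M. Zabeau and P. Vos. On 16 December 1992 they applied to the European Patent Office for a patent on SRFA (selective restriction fragment amplification) as a method for DNA fingerprinting. The application was published in March 1993 and attracted strong interest from laboratories in many countries. In 1995 Vos et al. published the first paper on AFLP in Nucleic Acids Research (Nucl Acids Res.).

Selective restriction fragment amplification (SRFA): RFLP and PCR, described above, each have many advantages and some shortcomings. RFLP offers many markers and can fingerprint the whole genome, but the technique is complex. PCR is technically simple, but specific primers are hard to develop and limited in number, which makes it hard to use for fingerprinting. Combining the two could give a simple method that quickly generates large numbers of markers and is suited to fingerprinting. AFLP was developed on this basis. In a broad sense, the polymorphisms produced by PCR and by PCR-derived methods such as RAPD are all amplified fragment length polymorphisms. Here AFLP refers specifically to SRFA.

§2.4.1 Generating restriction fragments

1 Design of the restriction fragments:

The main purpose of SRFA is fingerprinting, which requires many bands to be generated quickly from the genome. Detecting many bands at once is usually done on sequencing gels. These are polyacrylamide gels, which separate fragments of about 100-500 bp most effectively. For the best resolution, the restriction fragments used in SRFA should therefore be mainly 100-500 bp.

2 Choice of restriction enzymes:

Restriction enzymes were introduced above. The question here is which enzymes cut the genome so that most fragments are 100-500 bp long. Estimation and experiment show that digesting the DNA simultaneously with one enzyme that has a 6-bp recognition site and one that has a 4-bp recognition site achieves this. In SRFA experiments the 6-bp cutter is usually PstI and the 4-bp cutter MseI.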

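§2.4.2 Amplification of restriction fragments

1 Design and ligation of adapters:

An adapter is a short, artificially designed piece of DNA that is ligated to both ends of the restriction fragments. Its role is to provide a known sequence for primer design. The principles of adapter design are:

(1) The adapter must be able to ligate to the restriction site.

(2) The two oligonucleotide strands of the adapter must be complementary and form a double-stranded structure.

(3) The adapter sequence must meet the requirements of primer design.

Adapters are ligated to the restriction fragments with T4 DNA ligase, and the two ends of each restriction fragment receive two different adapters. After ligation the adapters become part of the template DNA.

2 Design of non-selective primers and non-selective amplification:

The primers used in SRFA are synthesized according to the adapter sequence. The primer, written 5'→3', must be complementary to the 3'→5' strand of the adapter, so a non-selective primer is in effect the 5'→3' strand of the adapter.

Because the restriction fragments are produced by two enzymes and carry two different adapters, there are two corresponding primers.

Because the two primers match the adapters on either side of the restriction fragments, every adapter-ligated fragment can serve as template, so the amplification is non-specific. The procedure is similar to ordinary PCR.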

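§2.4.3 Selective amplification of restriction fragments

Design of selective primers:

Non-selective amplification yields hundreds of thousands of different products, which give no resolvable bands after electrophoresis. The products must therefore be amplified again selectively, so that a limited number, usually a few dozen, are amplified and distinguishable bands appear.

The degree of selection depends on the number of selective bases in the primers.

[Table: number of selective bases versus the template DNA selected (adapter-ligated templates)]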
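The effect of selective bases can be estimated with a simple model, sketched below in Python. It assumes that each selective base matches one template in four, which holds only for random sequence with equal base frequencies; the starting number of 300,000 fragments is an illustrative value within the "hundreds of thousands" mentioned above, and the function name is made up.

```python
def expected_bands(n_fragments: float, selective_bases_total: int) -> float:
    """Expected number of fragments still amplified.

    Each selective base is assumed to match 1 template in 4, so the
    amplified fraction is 1 / 4**n for n selective bases in total
    (summed over both primers).
    """
    return n_fragments / 4 ** selective_bases_total


for n in (0, 2, 4, 6):
    print(n, round(expected_bands(300_000, n)))
```

Under this model, three selective bases on each primer (six in total) reduce 300,000 templates to roughly 70 products, consistent with the "few dozen" bands described above.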

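Selective amplification:

Selective amplification is achieved by using selective primers; the procedure is otherwise similar to non-selective amplification. So that the products can be detected by autoradiography, 32P-dCTP (2.5 mCi = 0.0015 mM) is included in the PCR to make them radioactive.

§2.4.4 Detection of selectively amplified products

The products are separated by polyacrylamide gel electrophoresis, and the bands are visualized by autoradiography.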

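§2.5 SSLP

SSLP (simple sequence length polymorphism) is polymorphism in the length of amplified fragments caused by differences in the number of repeats of a simple sequence.

2.5.1 Microsatellite DNA

A microsatellite, also called a simple sequence repeat (SSR) or short tandem repeat (STR), is a tandem repeat sequence up to several tens of nucleotides long whose repeat unit is a few nucleotides (generally 2-4).

DNA sequences are usually divided into single-copy and repetitive sequences. When eukaryotic DNA is broken into fragments of about 10^4 bp and centrifuged in a caesium chloride density gradient, a main peak and some minor peaks appear. The DNA in the minor peaks is called satellite DNA. DNA in the main peak has an even base distribution and a fairly constant GC/AT ratio. DNA in the satellite peaks has an uneven base distribution, and its G-C content differs from that of the main peak; this changes its density, so it separates from the main peak. The lower the G-C content, the lower the density. Satellite DNA is in essence repetitive DNA.

Microsatellites are of many types, according to the repeat unit. The frequency of each type differs markedly between organisms. In the human genome the most abundant microsatellites are (AC)n and (GA)n. (AT)n is the most abundant in plant genomes. In the rice genome the most abundant are (GA)n and (GT)n. Overall microsatellite abundance also differs considerably between genomes: it is about five times higher in mammalian genomes than in plant genomes, and among plants about three times higher in dicots than in monocots.

2.5.2 Polymorphism information content

In microsatellite DNA the number of repeats of the simple sequence varies widely among varieties or individuals of the same species. In other words, microsatellite loci have very many alleles. In rice, for example, RFLP loci have 2-4 alleles whereas microsatellite loci have 2-25. Measured by polymorphism information content (PIC), RFLP has a PIC of 0.39 and microsatellites 0.69.

Because microsatellite loci are so rich in alleles, microsatellite markers are close to ideal molecular markers.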
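The text quotes PIC values without giving the formula. A commonly used definition, from Botstein et al. (1980), is PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of allele i. The Python sketch below applies that definition; the allele frequencies are invented for illustration and are not the rice data cited above.

```python
def pic(freqs: list) -> float:
    """Polymorphism information content for one locus.

    freqs: allele frequencies, which must sum to 1.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1 - homozygosity - cross_term


print(pic([0.5, 0.5]))   # two equally frequent alleles
print(pic([0.2] * 5))    # five equally frequent alleles
```

A locus with two equally frequent alleles cannot exceed a PIC of 0.375, whereas five equally frequent alleles give about 0.77, which illustrates why multi-allelic microsatellites score higher than markers with few alleles.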

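2.5.3 Development of microsatellite markers

Microsatellite markers are produced by specific PCR amplification of microsatellite sequences. This requires primers designed from the sequences flanking the microsatellite; such primers are specific, that is, unique in the genome. Primer design presupposes that the DNA sequence of the locus is known, so sequencing the DNA on both sides of microsatellites is the main limiting factor in developing microsatellite markers.

There are two main routes to developing microsatellite markers:

(1) Library construction and sequencing
A. Construct a library
B. Screen the library
C. Sequence
D. Design primers

(2) Screening known DNA sequences
A. Search existing DNA sequence data for microsatellite sequences
B. Design primers

§2.5.4 Comparison of several types of molecular marker

Marker   Band pattern   Number of alleles    Heterozygosity   Stability     Technical difficulty   Cost
RFLP     codominant     medium (4.8)         0.63             high          difficult              high
RAPD     dominant       few (2.0)            0.36             low           easy                   low
AFLP     dominant       few (2.0)            0.34             fairly low    difficult              high
SSLP     codominant     many (6.8)           0.72             high          easy                   low
ISSR     dominant       fairly many                           high          fairly easy            fairly low

Note: figures in the table are from Pejic I., et al., TAG (1998) 97: 1248-1255.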

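Chapter 3 Genetic mapping

Genetic mapping is the construction of genetic maps: it uses the principles and methods of genetics to build a map that reflects the genetic relationships among the genetic markers of a genome. In genetic mapping, then, "genetic" describes both the means and the end.

§3.1 Classical genetic maps

Genetic markers

The markers in classical genetic maps are mainly morphological. Morphological markers are few, but they were used for mapping for a long time: throughout the first eighty years of the twentieth century maps were built mainly with morphological markers.

Linkage analysis

Since Morgan's time, linkage analysis has been an important tool of genetic analysis and the main tool of genetic mapping. The genetic relationship between markers is expressed mainly through linkage, and linkage is measured by the recombination frequency (percentage of recombinants):

Recombination frequency (%) = (number of recombinant gametes / total number of gametes) × 100

Linkage analysis is usually carried out by two-point testcrosses and three-point testcrosses.

Map construction

From the results of linkage analysis, the genetic distances between markers and their order are determined, reasonably complete linkage groups are formed, and the linkage groups together make up the genetic map of the genome. This involves:

· the genetic distance between markers, in units of recombination frequency;
· the order of the markers;
· establishing linkage groups and relating them to chromosomes;
· constructing the map, covering all linkage groups of the genome.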

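§3.2 Mapping populations

A mapping population is a segregating population used for genetic mapping, such as an F2, BC1, DH or RI population. Both classical genetic mapping and the construction of modern molecular maps require one.

Basic requirements of a mapping population:

· the population is large enough;
· it segregates at random;
· polymorphism between the parents is high.

§3.2.1 F2 populations

Genotype scores:

1: homozygous genotype of parent 1 (aa)
2: heterozygote (ab)
3: homozygous genotype of parent 2 (bb)
0: missing data

For dominant banding patterns:

4: not homozygous for parent 1 (ab or bb)
5: not homozygous for parent 2 (ab or aa)

Features of the population:

Advantages:
· easy to produce;
· segregation follows Mendelian ratios;
· genotypes (banding patterns) are easy to recognize.

Disadvantages:
· not a permanent population;
· phenotyping error is large, especially for quantitative traits.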
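Whether a codominant marker segregates in the expected 1:2:1 ratio in an F2 population is usually checked with a chi-square goodness-of-fit test. The Python sketch below is an illustration under stated assumptions: the counts are hypothetical, and 5.991 is the standard critical value for 2 degrees of freedom at the 5% level.

```python
def chi_square(observed: list, ratio: list) -> float:
    """Chi-square statistic for observed counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_DF2_P05 = 5.991   # chi-square critical value, df = 2, alpha = 0.05

# Hypothetical scores for one marker in 146 F2 plants:
# code 1 (aa), code 2 (ab), code 3 (bb)
counts = [35, 75, 36]
stat = chi_square(counts, [1, 2, 1])
fits_mendelian_ratio = stat < CRITICAL_DF2_P05
print(round(stat, 4), fits_mendelian_ratio)
```

A statistic below the critical value means the data give no reason to reject 1:2:1 segregation. For dominant markers scored with codes 4 or 5, the same function can be used with a 3:1 ratio and the critical value for 1 degree of freedom.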

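§3.2.2 BC populations

Genotype scores:

1: homozygous genotype of the recurrent parent (aa)
2: heterozygote (ab)
0: missing data

Features of the population:

Advantages:
· easy to produce;
· segregation follows Mendelian ratios;
· suitable for distantly related parents.

Disadvantages:
· not a permanent population;
· phenotyping error is large, especially for quantitative traits;
· with dominance, both phenotypes and genotypes can be difficult to score.

§3.2.3 DH populations

A DH (doubled haploid) population is a segregating population made up of doubled haploid lines.

Genotype scores:

1: homozygous genotype of parent 1 (aa)
3: homozygous genotype of parent 2 (bb)
0: missing data

Features of the population:

Advantages:
· a permanent population that can be used repeatedly;
· segregation usually follows Mendelian ratios;
· lines are the unit of segregation, so phenotyping is more reliable.

Disadvantages:
· hard to produce;
· pollen culture may select among genotypes and induce variation.

§3.2.4 RI populations

An RI population is a segregating population made up of recombinant inbred lines.

Genotype scores:

1: homozygous genotype of parent 1 (aa)
3: homozygous genotype of parent 2 (bb)
0: missing data

Features of the population:

Advantages:
· a permanent population that can be used repeatedly;
· lines are the unit of segregation, so phenotyping is more reliable.

Disadvantages:
· hard to produce;
· segregation is usually distorted, as a result of natural and artificial selection while the RI population is being developed.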

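§3.3 Molecular marker analysis

Suitable molecular markers are chosen according to need and feasibility, and the marker genotypes of the parents and the mapping population are analysed to collect the data needed for map construction.

§3.3.1 Analysis of the parents

A mapping population is developed from a cross between two parents, so the polymorphism between the parents is normally examined before the population is developed and used.

Genetic distance between parents and level of polymorphism:

The level of polymorphism between parents is positively correlated with their genetic distance. When developing a mapping population, the parents should therefore be as distantly related as possible, to obtain a high level of polymorphism.

Restriction enzymes and level of polymorphism:

The choice of restriction enzyme also affects the level of polymorphism.

The more bases in the recognition site, the longer the fragments produced and the higher the fragment length polymorphism. Enzymes with 6-bp recognition sites therefore usually reveal more polymorphism than enzymes with 4-bp recognition sites.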
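Why a longer recognition site gives longer fragments can be seen from a simple probability argument, sketched below in Python. It assumes random sequence with the four bases equally frequent, which real genomes only approximate, and the function name is made up for illustration.

```python
def expected_fragment_length(site_length: int) -> int:
    """Average spacing (bp) between occurrences of an n-bp recognition site.

    In random sequence with equal base frequencies, a given n-bp site
    occurs on average once every 4**n bp.
    """
    return 4 ** site_length


print(expected_fragment_length(4))   # 4-bp cutter
print(expected_fragment_length(6))   # 6-bp cutter
```

Under this assumption a 4-bp cutter gives fragments averaging 256 bp and a 6-bp cutter fragments averaging 4096 bp.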

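Enzymes with recognition sites of the same length also differ somewhat in the polymorphism they reveal, so they can be chosen by experiment.

Analysis of polymorphism between parents:

To avoid using markers blindly when the mapping population is analysed, the polymorphism of the parents is analysed first. The aims are:

· to screen for polymorphic markers;
· in RFLP analysis, also to screen for enzymes that reveal polymorphism and to identify polymorphic marker/enzyme combinations;
· to learn how the polymorphic markers are distributed on the map. The distribution should be reasonable and suitable for linkage analysis. If new markers are used, consider whether enough markers are available for the mapping population.

§3.3.2 Analysis of the mapping population

Choosing the mapping population:

· The population should be large enough to reduce experimental error and improve precision.
· The population should remain unchanged throughout the marker analysis.

Screening polymorphic markers:

Select polymorphic markers from the results of the parental analysis. For RFLP, fix the marker/enzyme combinations.

Marker analysis of the mapping population:

With the two parents as controls, analyse the mapping population with each selected marker in turn. The banding patterns should segregate in the population. Observe and record the segregation.

Building the database:

· convert the genotype segregation results to numerical codes;
· arrange the data as the mapping software requires;
· create the computer files.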

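§3.4 Construction of genetic maps

§3.4.1 Basic principles of linkage mapping

As noted above, the recombination frequency is estimated as the proportion of recombinant individuals in a segregating population. This simple estimate does not come with a standard error, so it cannot be tested for significance. Estimating the recombination frequency by maximum likelihood estimation (MLE) solves this problem.

Likelihood ratio and the test for linkage:

Likelihood: the probability of the observed phenotypes under a given hypothesis.

Maximum likelihood: the estimate is the value under which the probability of the observed results is greatest.

Likelihood ratio = L(r) / L(1/2)

L(r): the probability of the data if the two markers are linked with recombination frequency r;

L(1/2): the probability of the data if the two markers are unlinked.

Test for linkage:

LOD = log10 [L(r) / L(1/2)]

that is, LOD is the base-10 logarithm of L(r) / L(1/2). In practice the likelihood ratio L(r) / L(1/2) must exceed 1000:1, that is LOD > 3.0, before linkage between two markers is accepted, since

LOD = log10 [L(r) / L(1/2)] = log10 1000 = 3.0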
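For the simplest case, in which recombinant and non-recombinant gametes can be counted directly (for example in a backcross), the likelihood is L(r) = r^k (1 − r)^(n − k) for k recombinants among n gametes, and the maximum likelihood estimate is r = k/n. The Python sketch below computes the LOD score under that assumption; F2 data need a more elaborate likelihood, and the counts used here are hypothetical.

```python
import math


def lod_score(recombinants: int, total: int) -> tuple:
    """Return (estimated r, LOD) for directly countable gametes.

    L(r) = r**k * (1 - r)**(n - k); r is estimated as k/n, capped at 0.5,
    and LOD = log10 L(r) - log10 L(0.5).
    """
    k, n = recombinants, total
    r_hat = min(k / n, 0.5)

    def log10_likelihood(theta: float) -> float:
        value = 0.0
        if k:                      # skip the term when k = 0 (avoids log10(0))
            value += k * math.log10(theta)
        if n - k:
            value += (n - k) * math.log10(1.0 - theta)
        return value

    return r_hat, log10_likelihood(r_hat) - log10_likelihood(0.5)


print(lod_score(10, 100))   # 10 recombinants in 100 gametes
print(lod_score(50, 100))   # no evidence of linkage
print(lod_score(0, 10))     # ten gametes, none recombinant
```

With 10 recombinants among 100 gametes the LOD is about 16, well above the threshold of 3. Ten gametes with no recombinants are just enough to pass the threshold, since 10 × log10 2 ≈ 3.01.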

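3.4.2 Using the Mapmaker software

Mapmaker, designed by Lander et al. (1987), is software written specifically for genetic mapping.

Prepare data:

For data from an F2 population, after the data are read the screen shows:

    data type f2 intercross
    146 362 0        (number of individuals, number of loci, number of quantitative traits)
    symbols 1=A 2=H 3=B 0=-

where:

1=A: homozygous genotype of parent A (aa)
2=H: heterozygous genotype (ab)
3=B: homozygous genotype of parent B (bb)
4=C: not homozygous for parent A (ab, bb)
5=D: not homozygous for parent B (aa, ab)
0=-: missing data

Group:

Two-point analysis computes the linkage between each pair of loci and divides the loci into linkage groups. The screen shows:

    group 1: 1 4 12 35 …
    group 2: 2 7 18 42 …

The number of linkage groups can be changed by setting the LOD value: raising the LOD value increases the number of groups. Grouping is more reliable at LOD > 3.0. When enough loci are available, the number of linkage groups equals the haploid chromosome number.

Sequence:

Multipoint analysis computes, for each possible order within a linkage group, the distances between loci and the total length of the group.

Compare:

The mapping results for different orders within a linkage group are compared, and the order with the highest log-likelihood is taken as the most plausible.

Map:

Displays the final mapping result.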
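The grouping step described above can be pictured as clustering loci that are connected by pairwise LOD scores at or above a threshold. The Python sketch below illustrates that idea with a union-find structure. It is not Mapmaker's own code or command syntax, and the locus numbers and LOD values are invented.

```python
def linkage_groups(loci: list, pair_lod: dict, threshold: float = 3.0) -> list:
    """Group loci linked, directly or through other loci, at LOD >= threshold."""
    parent = {locus: locus for locus in loci}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), lod in pair_lod.items():
        if lod >= threshold:
            parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for locus in loci:
        groups.setdefault(find(locus), []).append(locus)
    return sorted(groups.values())


loci = [1, 2, 4, 7, 12]
pair_lod = {(1, 4): 8.2, (4, 12): 3.5, (2, 7): 6.0, (1, 2): 0.4}   # hypothetical

print(linkage_groups(loci, pair_lod, threshold=3.0))
print(linkage_groups(loci, pair_lod, threshold=5.0))
```

Raising the threshold from 3.0 to 5.0 drops the weak link between loci 4 and 12, so the same data split into three groups instead of two, which is the behaviour described for the LOD setting.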

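§3.4.3 Map analysis

Total number of markers: the number of markers (loci) on the map.

Map length (unit: centimorgan, cM): the length of each linkage group and the total length of the genome.

Marker density (loci per cM): usually reported as the average distance between markers. A map with sufficiently high marker density is called a high-density map.

Evenness of marker distribution: markers should be fairly evenly distributed on the map, without large gaps.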

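Chapter 4 Mapping of genes

Gene mapping determines, by genetic mapping methods, the relationship between genes and genetic markers.

§4.1 Identifying genes

4.1.1 Concepts of the gene

Broad concept:

A gene is a DNA segment with some function. This includes structural genes and regulatory genes, that is, functional DNA sequences whether or not they are transcribed, so the meaning is very wide.

Molecular genetic concept:

A gene is a transcription unit of DNA. The RNA transcript can be reverse-transcribed into cDNA, so one cDNA is usually taken to correspond to one gene. The function of most cDNAs is still unknown, however, so genes in this sense are recognized by their transcription.

Mendelian concept:

A gene is a hereditary factor that controls a trait. Genes are recognized through the phenotype: the phenotype is controlled by the genotype, so the genotype can be analysed through the phenotype.

Major genes and minor genes:

Under the Mendelian concept genotypes are recognized through phenotypes, and genes are commonly divided by the size of their effect on the phenotype into major genes and minor genes. Major genes control qualitative traits, which show discontinuous variation. Minor genes control quantitative traits, which show continuous variation; their loci are usually called quantitative trait loci (QTL).

Locus and allele:

Locus: the position of a gene on the genetic map.

Allele: one of two or more variant forms of a gene at the same locus. A set of more than two alleles at one locus is called multiple alleles.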

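4.1.2 Genetic analysis

To judge whether a trait is controlled by major genes, and by how many gene pairs, the trait must first be analysed genetically. The basic method is:

(1) cross two parents that differ in the contrasting trait;

(2) observe in the F1 whether dominance is complete or incomplete;

(3) analyse the segregation of the contrasting trait in the F2 population.

4.1.3 Allelism test

Because some genes have already been mapped, an allelism test is needed before mapping to confirm that the gene of interest has not yet been mapped. The basic approach is:

(1) Review the literature to learn the current state of research on the trait, including how many genes have been found for it and which of them have been mapped.

(2) If the relationship between the gene to be mapped and the genes already mapped is unknown but their phenotypes are the same, they may be the same gene, and the relationship between them must be determined.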

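§4.2 Mapping major genes

Gene mapping usually means the mapping of major genes; the mapping of minor genes is usually referred to by the special term QTL mapping.

4.2.1 Principle of gene mapping

Genetic markers and genes alike occupy loci on chromosomes, and each locus is a segment of DNA, so as loci they do not differ. Because map construction has fixed the chromosomal positions of many molecular markers, establishing the linkage between a gene and molecular markers fixes the gene's position on the chromosome.

Molecular markers are recognized by their banding patterns, whereas genes are recognized by the phenotypes they produce. Unlike the genetic mapping described earlier, gene mapping therefore requires analysis of both marker banding patterns and the phenotypes produced by the genotypes. Its basic principle is to determine the position of a gene on the genetic map by analysing the linkage between molecular markers and the phenotype.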

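4.2.2 Phenotypic analysis

Development of the mapping population:

A mapping population is indispensable experimental material for gene mapping. The basic requirements are:

(1) The two parents differ in the contrasting characters.

(2) The two parents show a high level of polymorphism, so that tightly linked molecular markers can be found.

(3) The mapping population is of adequate size.

(4) The contrasting characters segregate in the mapping population according to Mendel's laws.

Phenotyping:

In gene mapping the genotype of the target gene is determined from its phenotype, so phenotyping is the key to the quality of the mapping. The requirements are:

(1) Because mapping populations are large, phenotyping is usually done under natural conditions. The population must grow normally, and experimental error between individuals should be kept low.

(2) The two parents and the F1 should be used as controls, with a uniform scoring standard.

(3) Phenotypes must be measured accurately. Errors arise easily during measurement and scoring, so the measurement standard must be uniform and the work done by experienced staff.

(4) A gene-specific assay must be used, for example a specific pathogen strain or pollen fertility.

Preparing the phenotype data:

The raw phenotype data are obtained by some measurement method and carry specific units, such as length, weight or percentage. Because the trait is qualitative, the mapping population generally shows a discontinuous distribution and can be divided into a few classes. To meet the needs of the mapping software, each phenotype class is represented by a number, consistent with the coding used for marker genotypes. The coding rules for phenotypes are:

1: homozygous for the parent 1 allele (aa)

2: heterozygous (ab)

3: homozygous for the parent 2 allele (bb)

For dominant scoring:

4: not homozygous for the parent 1 allele (ab or bb), i.e. the parent 2 phenotype is dominant

5: not homozygous for the parent 2 allele (ab or aa), i.e. the parent 1 phenotype is dominant

0: missing data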

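4.2.3 Analysis of molecular markers

Use of mapped markers:

Genetic map construction has already fixed the chromosomal positions of many molecular markers, so once a marker linked to the gene is found, the chromosomal position of the gene is known. Markers of this kind include RFLP and SSR markers.

Mapped markers are usually chosen from an existing genetic map at regular intervals, and then used as follows:

1) Analyse parental polymorphism and select the markers that are polymorphic between the parents.

2) Genotype the mapping population with these markers and find the markers linked to the target gene.

Use of unmapped markers:

Some molecular markers are random markers, such as RAPD and AFLP markers. Their chromosomal positions are usually not known, so each use is random. With random markers, the markers linked to the target gene can only be found by screening a large number of markers at random.

Bulked analysis:

Bulked analysis (bulked segregant analysis) is a fast way of screening for molecular markers linked to a target gene, first introduced by Michelmore et al. (1991). Whether markers are mapped or unmapped, their linkage to the target gene can only be established with a mapping population, and because mapping populations are usually large, screening for linked markers is very laborious. Bulked analysis greatly reduces this workload.

Composition of the bulks:

A bulk consists of DNA from individuals of the mapping population that share the same contrasting character. Usually the DNA of 10-20 individuals with the same character is pooled into one bulk, so that one mapping population gives two bulks, one for each contrasting character. In theory the two bulks differ only in the genotype of the target gene and are the same at all other loci, which is why some authors call the pair of bulks "near-isogenic pools". The individuals chosen for a bulk should show a typical phenotype.

Use of the bulks:

In practice the two bulks are analysed with molecular markers together with the two parents. If the banding patterns of the two parents differ but those of the two bulks do not, the marker is not linked to the target gene. If the parents differ and the bulks show the corresponding difference, the marker may be linked to the target gene.

After the bulks have been used to screen for markers that may be linked to the target gene, linkage analysis with the whole mapping population is still needed to confirm the linkage.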
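The decision rule for the bulks can be written down directly. In this sketch a banding pattern is modelled as a set of band sizes (bp); the function name `classify_marker` and the three labels are hypothetical, and a "candidate" result still needs confirmation in the whole mapping population, as the text says.

```python
def classify_marker(parent1, parent2, bulk1, bulk2) -> str:
    """Classify one marker from the banding patterns of two parents and two bulks."""
    p1, p2, b1, b2 = (frozenset(x) for x in (parent1, parent2, bulk1, bulk2))
    if p1 == p2:
        # No polymorphism between the parents: the marker carries no information.
        return "uninformative"
    if b1 == b2:
        # Parents differ but the bulks do not: not linked to the target gene.
        return "unlinked"
    # Parents differ and so do the bulks: possibly linked, to be confirmed.
    return "candidate"


if __name__ == "__main__":
    print(classify_marker({300}, {350}, {300, 350}, {300, 350}))
    print(classify_marker({300}, {350}, {300}, {300, 350}))
```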

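Use of near-isogenic lines:

Development of near-isogenic lines: near-isogenic lines are usually developed by repeated backcrossing. Two parents differing in the contrasting characters are crossed, and the progeny are backcrossed to the parent carrying the recessive character, which serves as the recurrent parent. The target gene is selected in every segregating generation, and at least 7-8 backcross generations are needed before a near-isogenic line is obtained. Apart from the target trait, a near-isogenic line should resemble the recurrent parent in all other traits.

Use of near-isogenic lines in gene mapping:

Because the genetic background of a near-isogenic line resembles that of the recurrent parent except at the target gene, near-isogenic lines are ideal material for gene mapping. A near-isogenic line is usually used as one of the parents when developing a mapping population.

1) The phenotype of a near-isogenic line is not disturbed by a complex genetic background, so it is expressed faithfully and reflects the true size of the gene effect.

2) When the near-isogenic line and the recurrent parent are analysed together with molecular markers, any polymorphic marker that is found is very likely to be linked to the target gene, which reduces the work of screening for linked markers.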

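4.2.4 Genetic mapping

Finding a molecular marker linked to the target gene shows that the gene and the marker are linked. If the marker has already been mapped, the position of the gene is known. If the marker has not been mapped, the position of the gene is still unknown, and the linked marker itself has to be mapped.

1 Linkage analysis:

The basic method of gene mapping is to establish linkage between the phenotype and mapped markers. The phenotypes and marker genotypes are converted to numbers under the same coding rules and combined in one data set, and the Mapmaker software is then used to analyse the linkage between them.

2 Mapping of random markers:

The strategy is to use known markers to place unknown ones, and the best way of doing so is with a permanent mapping population. The segregation of the random marker is scored in the permanent mapping population, the data are added to the data set already built for that population, and the map is recalculated to give the position of the random marker on the genetic map. Permanent mapping populations are therefore an important tool for mapping random markers.

3 Fine mapping:

Finding a mapped marker linked to the gene only completes the coarse mapping: the approximate position of the gene is known, which is not sufficient. Fine mapping refines the result of coarse mapping. Its targets depend on the purpose. For marker-assisted selection the targets are:

(1) Tightly linked molecular markers are found on both sides of the gene.

(2) The distance between a marker and the gene is < 5 cM, or the distance between the two flanking markers is < 10 cM.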

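§4.3 QTL mapping

A QTL (quantitative trait locus) is a locus for a quantitative trait. Because genes of this kind control quantitative traits and each has only a small effect, they are difficult to map with the methods used for major genes. With the continuing development of molecular marker techniques, the methods of QTL mapping have matured.

The expression of a trait depends on two factors, the genotype inherited from the parents and the environment. The phenotype is the combined result of genotype and environment:

P = G + E

where P is the phenotypic value, G is the genotypic value and E is the variation caused by the environment.

For quantitative traits the environmental component of the phenotypic value is relatively large. Classical genetic analysis usually estimates the proportion of the phenotypic value that is due to genotype, in order to assess the heritability of the trait, or estimates the number of gene pairs controlling the trait by methods such as diallel analysis, but it can hardly determine where on the chromosomes those genes lie. Molecular markers provide a powerful means of QTL mapping.

4.3.1 Basic methods

By the number of markers used:
· single-marker method
· flanking (two-marker) method
· multiple-marker method

By the statistical method used:
· analysis of variance
· regression analysis
· maximum likelihood analysis

By the number of intervals used in mapping:
· zero-interval mapping
· single-interval mapping
· multiple-interval mapping

Single-marker analysis of variance (single-marker ANOVA) was proposed by Thoday (1961).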
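Single-marker ANOVA groups the individuals by their genotype at one marker and asks whether the trait means of the groups differ. The following is a minimal one-way ANOVA sketch in pure Python, assuming the genotype coding used earlier (1, 2, 3, with 0 for missing data); the function name is hypothetical, and the resulting F value would be compared with a tabulated critical value for the returned degrees of freedom.

```python
def single_marker_anova(genotypes: list[int], phenotypes: list[float]):
    """One-way ANOVA of trait values grouped by marker genotype.

    Returns (F, df_between, df_within). Genotype code 0 (missing) is skipped.
    """
    groups: dict[int, list[float]] = {}
    for genotype, value in zip(genotypes, phenotypes):
        if genotype != 0:
            groups.setdefault(genotype, []).append(value)
    n = sum(len(values) for values in groups.values())
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two genotype classes and replication")
    grand_mean = sum(sum(values) for values in groups.values()) / n
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ss_between += len(values) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in values)
    if ss_within == 0:
        raise ValueError("no within-class variation; F is undefined")
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


if __name__ == "__main__":
    print(single_marker_anova([1, 1, 1, 3, 3, 3], [10, 11, 12, 20, 21, 22]))
```

A large F at a marker indicates that a QTL is linked to it, although the single-marker method cannot separate the size of the QTL effect from its distance to the marker.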

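4.3.2 New strategies for QTL mapping

Directions in which QTL analysis is developing:

· turning complex traits into simple traits;
· resolving polygenes into single genes (single Mendelian factors);
· turning quantitative traits into qualitative traits.

Use of permanent mapping populations

The advantages of permanent mapping populations in QTL analysis are:

(1) The segregating unit is a line, so quantitative traits can be measured more accurately.

(2) Replicates can be included in the experiment to reduce experimental error.

(3) Multi-location trials are possible, so gene-by-environment interaction can be assessed.

AB-QTL analysis

AB-QTL (advanced backcross QTL) analysis postpones the QTL analysis to the BC2 or BC3 generation. The method is better at identifying and exploiting QTLs, and is especially suited to identifying favourable QTLs in non-elite material such as wild types and transferring them into elite breeding lines. Its advantages are:

(1) Compared with early-generation populations, also called balanced populations (F2, BC1, etc.), the individuals of an AB population are more similar in genotype (and phenotype) to the elite parent (the recurrent parent), so quantitative traits such as yield can be measured more accurately.

(2) Early-generation populations from wide crosses often show undesirable traits (such as sterility or shattering) that interfere with the measurement of quantitative traits. During development of an AB population some undesirable traits from the donor are gradually eliminated, so the quantitative traits of the AB population are expressed more normally.

(3) Because donor alleles are at low frequency in an AB population, interactions between donor genotypes (epistasis) are small, and the effect of a donor QTL changes little after transfer to the recipient.

(4) With further backcrosses near-isogenic lines, i.e. QTL-NILs, are obtained fairly easily. These QTL-NILs can be used in replicated field trials.

(5) Repeated backcrossing gives more opportunities for recombination at meiosis, so linkage between a QTL and undesirable genes is more easily broken and the QTL can be exploited better.

Use of secondary mapping populations

Secondary mapping populations are developed from secondary lines such as near-isogenic lines (NIL), introgression lines (IL) and chromosomal segment substitution lines (CSSL). They are widely used in QTL analysis, usually under the names QTL-NIL, IL-QTL and so on. Their advantages are:

(1) They have the properties of permanent populations and can be used repeatedly, so environmental effects can be excluded.

(2) Each line has a genetic background very similar to that of the recurrent parent, which excludes interference from the genetic background, so the phenotypic analysis of QTLs is more accurate.

(3) Each line carries only a very small part of the donor genome, usually one or a few segments. The several QTLs controlling one quantitative trait can therefore be separated, that is, polygenes are resolved into single Mendelian factors and a complex trait is made simple, which greatly improves the accuracy and reliability of QTL analysis.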

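§4.4 Marker-assisted selection

The purpose of gene mapping is to make better use of genes. Marker-assisted selection (MAS) uses molecular markers tightly linked to a gene to assist selection for the target gene, and so puts the gene to effective use.

4.4.1 Basic principles and conditions

1 Relationship between marker and gene

In marker-assisted selection, what is selected directly is the genotype of the molecular marker, not the genotype of the target gene, so the marker only assists selection for the target genotype. When the marker is tightly linked to the target gene, selecting the marker genotype selects the target genotype effectively. When the marker lies farther from the gene, selection becomes less effective. The effectiveness of marker-assisted selection depends on how tight the linkage is.

To raise the accuracy of marker-assisted selection, markers should not only be tightly linked to the target gene but should also flank it on both sides. Suppose a marker lies 5 cM from the target gene. The recombination frequency between marker and gene is then about 5%, so selection with this marker has an error rate of about 5%. If the gene has one marker on each side, each 5 cM away, the frequency of simultaneous crossovers between the gene and both markers is 5% x 5% = 0.25%. Allowing for interference between crossovers, the error rate of selection with the two markers should be below 0.25%. An error rate of this size meets the needs of breeding.

2 Basic conditions for using MAS in breeding

There are two main classes of molecular markers, RFLP markers and PCR-based markers. RFLP markers were used most in the past, but they are technically demanding and hard to use in breeding. PCR-based markers such as STS and SSR rely mainly on PCR and are simple, quick and cheap, which suits the needs of breeding. Establishing a PCR-based system for marker-assisted selection therefore makes the technique practicable.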
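The flanking-marker arithmetic in item 1 of this section can be checked with a few lines of code. The sketch assumes, as the text does, that for short distances the recombination frequency is approximately the map distance in cM divided by 100 and that crossovers in the two intervals are independent (no interference); the function names are hypothetical.

```python
def single_marker_error(distance_cm: float) -> float:
    """Approximate selection error with one marker: recombination frequency."""
    return distance_cm / 100.0


def flanking_marker_error(left_cm: float, right_cm: float) -> float:
    """Approximate error with two flanking markers: both intervals recombine."""
    return single_marker_error(left_cm) * single_marker_error(right_cm)


if __name__ == "__main__":
    print(f"one marker at 5 cM:    {single_marker_error(5):.2%}")
    print(f"flanking, 5 cM each:   {flanking_marker_error(5, 5):.2%}")
```

With positive interference the true double-crossover frequency is lower than this product, so the product is an upper bound on the error rate.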

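4.4.2 PCR-based MAS

Converting RFLP markers to PCR markers

Most genes mapped so far were mapped with RFLP markers, and many of them, although mapped, are hard to use in breeding. To make these genes usable in breeding, the RFLP markers first have to be converted to PCR markers.

STS markers

Converting an RFLP marker into a PCR-based marker means converting it into a sequence-tagged site (STS) marker. The steps are:

(1) Sequence the ends of the RFLP probe (about 0.5-2.5 kb long). The end sequences read are usually 100-300 bp long.

(2) Design PCR primers from the end sequences, following the primer design rules described earlier.

(3) Run PCR with these primers. The amplification product is the STS marker. The STS marker is at the same locus as the original RFLP marker, and its fragment is slightly shorter than the RFLP probe.

Length polymorphism among the fragments amplified with STS primers is called amplicon length polymorphism (ALP).

ALP is much less frequent than RFLP. This is because an RFLP arises within a long stretch of DNA, usually 2.0-20 kb, whereas the amplicon is short, usually only 0.5-2.5 kb. The marked drop in polymorphism after conversion of RFLPs to STSs creates a new problem for the use of RFLP data.

PBR

PBR (PCR-based RFLP), also called CAPS (cleaved amplified polymorphic sequence), applies restriction fragment analysis to the amplified fragment.

Converting random PCR markers to specific PCR markers:

(1) Start from random PCR markers such as RAPD and AFLP.

(2) Clone and sequence the random PCR marker.

(3) Design primers from the sequence, which converts the random PCR marker into a specific PCR marker. Such a marker is called a sequence characterized amplified region (SCAR) marker.

Gene mapping with PCR markers:

When genes are mapped with PCR markers such as microsatellites, the markers can be used directly for marker-assisted selection once the gene is mapped, which unites gene mapping and marker-assisted selection in one process.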

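4.4.3 Gene pyramiding

Gene pyramiding is a breeding method that brings several favourable genes together.

Pyramiding several genes for one trait: several genes controlling the same trait are combined so that the trait is expressed more strongly. For example, several rice bacterial blight resistance genes are combined to breed new rice cultivars with high, broad-spectrum and durable resistance.

Pyramiding genes for several traits: genes controlling different traits are combined so that the new cultivar has several desirable traits. For example, a rice bacterial blight resistance gene is combined with a gall midge resistance gene to breed new rice cultivars resistant to both bacterial blight and gall midge.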

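Chapter 5 Physical Mapping

§5.1 Overview

Development of structural genomics: genetic mapping → physical mapping → genome sequencing

Main differences between physical mapping and genetic mapping:

Size of the DNA units: molecular markers → large-insert clones

Mapping measure: genetic distance → physical distance

Main steps of physical mapping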

Size of genomes

Organism                                  Genome size (Mb)

Prokaryotes
  Mycoplasma genitalium                   0.58
  Escherichia coli                        4.64
  Bacillus megaterium                     30

Eukaryotes
 Fungi
  Saccharomyces cerevisiae (yeast)        12.1
  Aspergillus nidulans                    25.4
 Protozoa
  Tetrahymena pyriformis                  190
 Invertebrates
  Drosophila melanogaster (fruit fly)     100
  Bombyx mori (silkworm)                  490
  Locusta migratoria (locust)             5,000
 Vertebrates
  Fugu rubripes (pufferfish)              400
  Homo sapiens (humans)                   3,000
  Mus musculus (mouse)                    3,300
 Plants
  Arabidopsis thaliana (thale cress)      100
  Oryza sativa (rice)                     565
  Zea mays (maize)                        5,000
  Triticum aestivum (wheat)               17,000

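§5.2 Separation of large DNA fragments

The basic material of physical mapping is large DNA fragments, generally 100-500 kb long. Fragments of this length are difficult to separate by ordinary methods. The problem is solved with a newer electrophoresis technique, pulsed-field gel electrophoresis (PFGE).

Principle of PFGE:

The simple constant (unidirectional) electric field is replaced by a field whose direction changes repeatedly. DNA molecules that are held up in the gel reorient and change their direction of migration each time the field changes, and are separated as a result.

Orthogonal field alternation gel electrophoresis:

In orthogonal field alternation gel electrophoresis (OFAGE) there are two pairs of electrodes, placed at the ends of the two diagonals of the gel. The two pairs are switched on alternately, producing a field that alternates between two directions, so that the DNA molecules keep changing their direction of migration in the gel. Alternating-field electrophoresis is mainly used to separate DNA molecules of high molecular weight.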

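§5.3 Mapping libraries

Large DNA fragments can be cloned in special vectors. A mapping library is a library of large-insert DNA clones, such as YAC, BAC and TAC libraries. These libraries provide the material for physical mapping.

5.3.1 YAC

A YAC (yeast artificial chromosome) has the properties of a yeast chromosome, uses yeast cells as host and replicates in yeast cells. A YAC vector can carry DNA fragments of 1 Mb and was the first large-insert vector used in physical mapping.

Advantages of yeast as material for genetic research:

· It grows fast, with a generation time of only 2 hours.

· Its genome is small, only 1.4 x 10^7 bp, just a few times larger than that of E. coli and 200 times smaller than the human genome.

· Yeast cells have a haploid and a diploid phase, so recessive mutations are easy to obtain, which suits genetic analysis.

Construction of a YAC library:

· preparation of high-molecular-weight DNA from the target genome
· preparation of the vector
· ligation of vector and insert DNA
· transformation
· identification of transformants

[Figure: A. yeast artificial chromosome vector; B. cloning of foreign DNA; C. recombinant yeast artificial chromosome.]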

Cloning DNA in yeast artificial chromosomes

The cloning vector pYAC4 contains the elements needed for replication of the vector as a linear artificial chromosome: a centromere (CEN4), two telomeres (TEL) from Tetrahymena, and ARS1, an autonomously replicating sequence. In addition, there are two genes for selection of transformed yeast cells: URA3 and TRP1. The ori is a bacterial origin of replication, and there is also a gene for ampicillin resistance (AmpR) for selection in bacterial cells. When cut with EcoRI and BamHI, the fragment between the telomere sequences is lost, and two arms are produced with EcoRI ends for cloning.

YAC transformation:

1) Preparation of yeast protoplasts: the yeast cell wall is removed with snailase (lyticase).

2) The yeast protoplasts are held in osmotic buffer, the ligation mixture is added, and the mixture is incubated overnight at 20 °C.

3) Selective medium: yeast minimal medium lacking uracil (URA), tryptophan and adenylic acid.

4) Plating: the transformed yeast protoplasts are suspended in molten osmotically stabilized selective medium and plated immediately.

5) After 4-5 days of culture at 30 °C, white colonies are the positive clones.

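5.3.2 BAC

A BAC (bacterial artificial chromosome) has the properties of a bacterial chromosome, uses bacterial cells as host and replicates in bacterial cells. A BAC vector can carry DNA fragments of about 300 kb and is currently the most widely used large-insert vector in physical mapping.

The pBAC vector is a plasmid derived from mini-F. The oriS and repE genes mediate unidirectional replication of the F factor, and parA and parB keep the plasmid copy number low. cosN is the specific cleavage site for λ phage terminase and loxP is the site acted on by the P1 Cre protein; either can be used to convert the circular BAC DNA into a linear molecule, which is convenient for physical mapping. HindIII and BamHI are the insertion sites for foreign DNA, and the flanking NotI sites can be used to check the cloned high-molecular-weight DNA directly.

BAC transformation:

1) Ligation of vector and high-molecular-weight insert DNA: an agarose slice containing the high-molecular-weight DNA is held at 68 °C for 5 min, cooled to 37 °C and digested with agarase. Part of the DNA sample is then taken, vector is added at a 1:1 ratio by weight, and ligase is added for overnight ligation.

2) Preparation of host cells: the chosen host strain is grown in complete medium to an optical density of 0.2 and the cells are collected.

3) The cell density is adjusted to 10^9, the ligation mixture is added, and the mixture is transferred to an electroporation cuvette (0.1 cm). Electroporation setting: 25 000 V/cm.

4) Positive clones are selected on chloramphenicol medium. Positive clones are white.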

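5.3.3 TAC

A TAC (transformation-competent artificial chromosome) has the properties of a bacterial chromosome and uses bacterial and Agrobacterium cells as hosts. A TAC vector can carry DNA fragments of about 300 kb and can also be transferred by Agrobacterium-mediated transformation, so it combines large-insert cloning and transformation in one vector.

Comparison of the different types of artificial chromosome

Vector  Host                    Maximum insert  Stability  Chimeras  Technical difficulty  Transformation ability
YAC     Yeast                   >1 Mb           poor       yes       high                  no
BAC     E. coli                 300 kb          good       no        low                   no
TAC     E. coli/Agrobacterium   300 kb          good       no        low                   yes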

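§5.4 Mapping methods

5.4.1 In situ hybridization

In situ hybridization is DNA hybridization carried out directly on chromosomes, in order to identify the position of a probe on the chromosome.

5.4.2 Restriction mapping

Restriction mapping builds a physical map by comparing the sizes of the DNA fragments produced by different restriction enzymes and placing the restriction sites at their relative positions on the DNA molecule. The basic procedure is:

· Digest the sample with one restriction enzyme to obtain the first set of DNA fragments.

· Digest the sample with a second restriction enzyme to obtain the second set of fragments.

· Digest with both enzymes together to obtain the third set of fragments.

· Compare the three sets of data and work out the positions of the restriction sites by addition and subtraction of fragment sizes.

Use of rare-cutting endonucleases:

Rare cutters are restriction enzymes with very few recognition sequences in genomic DNA. Their recognition sites are generally 6-8 base pairs long and have a high G/C content.

The E. coli genome is 4.5 x 10^6 bp long. Digestion with NotI gives 21 fragments of 20-1000 kb. After separation by pulsed-field gel electrophoresis the fragments are transferred to a hybridization membrane. Hybridization with probes for genes that have already been mapped establishes the order of the fragments.

5.4.3 DNA marker anchoring

DNA marker anchoring builds on the genetic map: mapped molecular markers are used to anchor large-insert DNA clones and so determine the positions of the cloned fragments on the physical map.

An ideal physical map of a chromosome

The overlapping bars in the physical map are large-insert genomic DNA clones such as BACs and/or YACs. Letters from A to F indicate anchor DNA markers selected from the developed RFLP linkage map. The markers are integrated into the physical map by probing its source library with them and mapped to the chromosome by FISH of the marker-associated BACs. The distances between markers are measured in cM in the genetic map, in kb in the physical map, and in μm on the chromosome cytogenetic map.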

5.4.4 Clone fingerprinting

A "DNA fingerprint" is the particular set of DNA fragments that characterizes a DNA sample. The fingerprint of a clone represents defined sequence features of that clone and can be compared with fingerprints of the same kind from other clones. If two fingerprints overlap, the two clones share a common region.

Principle of clone fingerprinting:

If two clones overlap, they must contain the same sequence.

Ordering cosmids by fingerprinting

In this example, the cloned DNA in a cosmid vector has three HindIII sites (H). The cut ends of the fragments are end-labeled with a radiolabeled nucleotide. A second digest is performed using Sau3AI (S), which cuts more frequently than HindIII, so a set of smaller fragments is produced, only some of which are end-labeled. The fragments are sorted on a polyacrylamide sequencing gel, and the labeled fragments are detected by autoradiography. The sets of labeled fragments depend on the distributions of HindIII and Sau3AI sites, and are characteristic for each cosmid clone (hence a "fingerprint").
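Comparing two fingerprints amounts to counting the fragments the clones have in common. The sketch below assumes each fingerprint is a list of labeled fragment sizes (bp) and that two bands count as the same fragment when their sizes agree within a tolerance; the function names, the tolerance and the scoring by the smaller fingerprint are illustrative assumptions, not the procedure of any particular fingerprinting program.

```python
def shared_fragments(fp_a: list[int], fp_b: list[int], tolerance: int = 2) -> int:
    """Count fragments present in both fingerprints, matching sizes within tolerance."""
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


def overlap_fraction(fp_a: list[int], fp_b: list[int], tolerance: int = 2) -> float:
    """Shared fragments as a fraction of the smaller fingerprint."""
    smaller = min(len(fp_a), len(fp_b))
    return shared_fragments(fp_a, fp_b, tolerance) / smaller if smaller else 0.0


if __name__ == "__main__":
    clone_1 = [120, 340, 560, 800]
    clone_2 = [341, 559, 1000, 1500]
    print(shared_fragments(clone_1, clone_2), overlap_fraction(clone_1, clone_2))
```

Clones whose overlap fraction exceeds a chosen cut-off would be treated as overlapping and joined into the same contig.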

5.4.5 Chromosome walking

This method is used to move systematically along a chromosome from a known location. A cloned probe is used to isolate phage clones carrying genomic fragments from that region of the chromosome. A small DNA fragment from the end of the largest phage clone is used to rescreen the same library. Among the clones recovered are new phages that carry the probe sequence but whose sequences also extend farther along the chromosome. A new probe is generated from the far end of one of these phages and used to screen the library again to isolate new clones extending still farther. Hundreds of kilobases of contiguous chromosomal DNA can be cloned by repeated cycles of walking.

Chapter 6 DNA Sequencing

The ultimate objective of a genome project is the complete DNA sequence for the organism being studied, ideally integrated with the genetic and/or physical maps of the genome so that genes and other interesting features can be located within the DNA sequence. This chapter describes the techniques and research strategies that are used during the sequencing phase of a genome project, when this ultimate objective is being directly addressed.

§6.1 The Methodology for DNA Sequencing

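DNA sequencing is an important DNA technology that grew out of nucleic acid enzymology and biochemistry. It is of great practical value for studying the relationship between gene structure and function at the molecular level and for the manipulation of cloned DNA fragments.

Two methods of DNA sequencing:

· the chemical degradation method (Maxam and Gilbert, 1977);

· the chain termination method (Sanger et al., 1977).

6.1.1 Chemical Degradation Method

The Maxam-Gilbert chemical modification method:

The chemical modification method of DNA sequencing was invented in 1977 by A. M. Maxam and W. Gilbert of Harvard University. It is mainly used to sequence short DNA fragments.

Principle of the Maxam-Gilbert method:

An end-labeled (radioactive) DNA fragment is treated with chemical reagents that cleave it at specific bases. The resulting mixture of DNA chains of different lengths is separated by gel electrophoresis and detected by autoradiography, and the nucleotide sequence of the fragment is read directly.

Base-specific chemical cleavage reactions:

The reagents used to modify nucleotides chemically and open the base ring are dimethyl sulphate and hydrazine.

· Dimethyl sulphate cleaves specifically at G.

· Formic acid cleaves specifically at G and/or A.

· Hydrazine in the presence of NaCl cleaves specifically at C.

· Hydrazine without NaCl cleaves specifically at T and/or C.

Steps of the chemical modification method:

(a) Label the end of the DNA fragment with 32P.

(b) Divide the 32P end-labeled fragment among four reaction tubes and carry out the chemical cleavage reactions.

(c) Run the products on a polyacrylamide sequencing gel, expose for autoradiography, and read the sequence from the band pattern.

The Maxam and Gilbert DNA-sequencing procedure

A segment of DNA is labeled at one end with 32P. The labeled DNA is divided into four samples and each sample is treated with a chemical that specifically destroys one or two of the four bases in the DNA. This generates a series of labeled fragments, the lengths of which depend on the distance of the destroyed base from the labeled end of the molecule. The pattern of bands on the x-ray film is read to determine the sequence of the DNA.

6.1.2 The Chain Termination Method

The Sanger dideoxy chain termination method: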

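The method of determining a DNA nucleotide sequence with DNA polymerase and dideoxy chain terminators was invented in 1977 by the biochemist F. Sanger and colleagues at the Laboratory of Molecular Biology in Cambridge, England.

Principle of the Sanger dideoxy chain termination method:

The method uses two catalytic properties of DNA polymerase:

· it can use single-stranded DNA as a template to synthesize an accurate complementary DNA strand;

· it can use 2',3'-dideoxynucleoside triphosphates as substrates and incorporate them at the 3' end of the growing oligonucleotide chain, which terminates chain growth.

Each reaction tube receives a different ddNTP together with all four dNTPs, one of which carries a 32P label. The reaction mixtures are loaded on a polyacrylamide sequencing gel and separated by fragment size. The bands are read from the bottom of the gel upwards, and the nucleotide sequence obtained is that of the strand complementary to the template.

The Sanger DNA-sequencing procedure

(a) 2',3'-Dideoxynucleotides of each of the four bases are prepared. Once incorporated into a growing DNA strand, the dideoxynucleotide (ddNTP) cannot form a phosphodiester bond with the next incoming dNTP. Growth of that particular DNA chain stops.

(b) A DNA strand to be sequenced, along with labeled primer, is split into four DNA polymerase reactions, each containing one of the four ddNTPs. The resultant labeled fragments are separated by size on an acrylamide gel, autoradiography is performed, and the pattern of the fragments gives the DNA sequence.

The Sanger dideoxy-pUC sequencing system:

(a) Clone the DNA fragment to be sequenced into a pUC plasmid vector.

(b) Transform E. coli with the recombinant pUC plasmid.

(c) Prepare plasmid DNA for sequencing.

(d) Sequence the DNA by the dideoxy chain termination method.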

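6.1.3 Automation of DNA sequencing

Automation of DNA sequencing covers two things: automation of the sequencing reactions and automation of reading the results.

Automated DNA sequencing is also based on the Sanger dideoxy chain termination method. A laser beam is directed through a precise optical system onto the detection zone at the gel surface. There the beam strikes the gel at right angles, meets the DNA fragments passing the detection window, and excites the fluorescent dyes, which emit fluorescence at their characteristic wavelengths. The fluorescence is collected by a focusing lens and passed to a filter/prism assembly, which separates the different wavelengths that label the four bases. After the imaging lens, a highly sensitive camera collects the signal and passes it to the computer for analysis.

Dye-terminator sequencing uses fluorescence colour as the signal, with a different colour for each ddNTP. The whole reaction is run in a single tube. As each newly synthesized, terminated strand passes the fluorescence detector, its terminal nucleotide is read from the light signal and recorded by the computer.

Automated sequencing by capillary electrophoresis:

Replacing polyacrylamide slab gels with capillary electrophoresis makes DNA sequencing faster still. The instrument has 96 lanes and runs 96 sequencing reactions at a time. Each run takes less than 2 hours, so nearly a thousand reactions can be completed in a day.

Capillary electrophoresis sequencing apparatus:

The DNA samples are run through a bundle of gel-filled capillaries. A confocal fluorescence scanning microscope amplifies the signal from the fluorescently labeled nucleotides, and the base sequence is read by computer.

Read length of the Sanger dideoxy chain termination method:

· manual sequencing: up to about 300 bases;
· automated sequencing: up to about 600 bases.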

§6.2 Assembly of a Contiguous DNA Sequence

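Assembly of DNA sequence:

Each chromosome of a genome may be millions of base pairs long or more, and assembling the short reads obtained by the chain termination method into their true order is a large and exacting task. There are three main approaches to sequence assembly:

· the shotgun approach;
· the clone contig approach;
· the directed shotgun approach.

6.2.1 Sequence assembly by the shotgun approach

In shotgun assembly, overlapping sequenced clones are found directly among the short reads, and the sequence is then extended step by step into the neighbouring sequence on either side. The approach needs no prior knowledge of the genome: a whole genome can be assembled even without a genetic or physical map.

Advantages of the shotgun approach:

The main advantages of the shotgun approach are that sequencing is fast and that no genetic or physical map is required.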
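The overlap-and-extend idea behind shotgun assembly can be sketched with a toy greedy assembler. This is only an illustration under simplifying assumptions (error-free reads, all from the same strand, exact suffix-prefix overlaps); it is not the algorithm of any real assembler, and the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["TTACGGATCC", "ATCCAGTCAA", "TCAAGCTG"]))
```

Because the merge is driven purely by sequence overlap, two copies of a repeat look like a genuine overlap to this procedure, which is the source of the misassemblies discussed for large genomes.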

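In the mid-1990s the genome of Haemophilus influenzae was sequenced by the shotgun approach in a fairly short time, which set off a wave of microbial genome sequencing. These projects showed that shotgun sequencing can be organized as a production line in which a research group divides the work, some members preparing DNA and others sequencing and analysing data. Experience shows that a genome of 5 Mb can be sequenced within a year.

Limitations of the shotgun approach:

For large genomes of complex structure, the initial stage of shotgun assembly is an enormous amount of work. The short reads must first be analysed to find the contigs, and the number of reads to be analysed is so large that it reaches the limit of current computing power. In addition, the repeated sequences that are common in genomes are a very awkward problem: they can cause wrong joins during assembly, so that a fragment jumps from its true position to an unrelated one. For small genomes (< 5 Mb) with few repeats the shotgun approach is still the best choice, but large genomes need a more effective strategy.

6.2.2 The clone contig approach (map-based approach)

Experience shows that shotgun sequencing and assembly work well over a range of about 5 Mb. For a large genome, the genomic DNA can therefore first be broken into a number of large fragments and a genomic library constructed. Each large-insert clone can be shotgun sequenced independently, or the clones can first be assembled into contigs and then shotgun sequenced. This is called the clone contig approach.

Better still, the contigs are placed on the genetic or physical map with molecular markers, and the markers are used to check the assembled sequence. For this reason the method is also called the map-based approach.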

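6.2.3 The directed shotgun approach

The directed shotgun approach is a refinement of the random shotgun approach. Its main purpose is to overcome the ordering errors that repeated sequences cause in random shotgun assembly.

In sequencing the human genome, for example, covering 99% of the genome sequence requires 70 million sequencing reads (at about 500 bp per read). With a purely random shotgun approach, finding the overlaps among 70 million sequences while avoiding interference from the widespread, densely distributed repeats is clearly not feasible.

Arrangements for sequencing the human genome:

(1) Construct a human genomic plasmid library with inserts averaging 2 kb. Each clone is sequenced from both ends, giving reads of about 500 bp.

(2) Construct a human genomic plasmid library with inserts averaging 10 kb. Each clone is sequenced from both ends, giving end sequences of about 500 bp.

Most repeats in the human genome are 5 kb long or slightly longer, so the 10 kb library helps to correct errors caused by repeats during sequence assembly.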

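§6.3 Genome Sequencing Projects

1990 The Human Genome Project is launched.

1995 The first prokaryotic (bacterial) genome sequence is completed.

1996 The first eukaryotic (yeast) genome sequence is completed.

1998 The first genome sequence of a multicellular organism (nematode) is completed.

2000 The genome sequences of the fruit fly and Arabidopsis are completed; the first draft genome sequences of human and rice are completed.

2001 The draft human genome sequence and its initial analysis are published.

2002 Draft genome sequences of rice (indica and japonica) are completed.

Sequencing strategies used for different organisms:

· E. coli genome: map-based approach;
· Haemophilus influenzae genome: shotgun approach;
· fruit fly genome: shotgun approach;
· human genome: map-based and shotgun approaches;
· rice genome: map-based and shotgun approaches.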

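The draft human genome

· Result of the international consortium:
  Total length: 2 692 Mb    Number of genes: 26 383

· Result of Celera Genomics:
  Total length: 2 847 Mb    Number of genes: 31 778

Results of human genome sequencing

Comparison of the human genome assemblies produced by Celera Genomics with the WGA and CSA strategies

Item                                 WGA              CSA
Sequence in scaffolds (bp)           2 847 890 390    2 905 568 203
Sequence in contigs (bp)             2 586 634 108    2 653 979 733
Number of scaffolds                  118 968          170 033
Number of contigs                    221 036          170 442
Number of gaps                       102 068          116 442
Number of gaps of 1 kb or less       62 356           72 091
Mean scaffold length (bp)            23 938           54 217
Mean contig length (bp)              11 702           15 609
Mean scaffold gap length (bp)        2 560            2 161

WGA: whole-genome assembly; CSA: compartmentalized shotgun assembly.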

Chapter 7 Interpreting Genome Sequences

A genome sequence is not an end in itself. A major challenge still has to be met in understanding what the genome contains and how the genome functions.

Understanding how the genome functions is, to a certain extent, merely a different way of stating the objectives of molecular biology over the last 30 years. The difference is that in the past attention has been directed at the expression pathways for individual genes, with groups of genes being considered only when the expression of one gene is linked to that of another. Now the issues have become more general and relate to the expression of the genome as a whole.

§7.1 Locating Genes in DNA Sequences

Once a DNA sequence has been obtained, whether it is the sequence of a single cloned fragment or of an entire chromosome, then various methods can be employed to locate the genes that are present. These methods can be divided into those that involve simply inspecting the sequence, by eye or more frequently by computer, to look for the special sequence features associated with genes, and those methods that locate genes by experimental analysis of the DNA sequence.

7.1.1 Gene location by sequence inspection

Sequence inspection can be used to locate genes because genes are not random series of nucleotides but instead have distinctive features. These features determine whether a sequence is a gene or not, and so by definition are not possessed by noncoding DNA. If in the future we can fully understand the exact nature of the specific sequence features that define a gene then sequence inspection will become a foolproof way of locating genes. We are not yet at this stage but sequence inspection is still a powerful tool in genome analysis.

Genes are open reading frames

A gene is a segment of the genome that is transcribed into RNA. If the RNA is a transcript of a protein-coding gene then it is called a messenger RNA (mRNA) and is translated into protein. If the RNA is noncoding, such as ribosomal RNA (rRNA) and transfer RNA (tRNA), then it is not translated.

The part of a protein-coding gene that is translated into protein is called the open reading frame (ORF). Each triplet of nucleotides in the ORF is a codon that specifies an amino acid in accordance with the rules of the genetic code. The ORF is read in the 5' to 3' direction along the mRNA. The ORF starts with an initiation codon and ends with a termination codon.

Searching for open reading frames

· Initiation codon: ATG
· Termination codon: TAA, TAG or TGA

Searching a DNA sequence for ORFs that begin with an ATG and end with a termination triplet is therefore one way of looking for genes. The analysis is complicated by the fact that each DNA sequence has six reading frames, three in one direction and three in the reverse direction on the complementary strand, but computers are quite capable of scanning all six reading frames for ORFs.

GAC →   TGA →   ATG →

5'-ATGACGAGAGAGCAGCCATTTTAG-3'

3'-TACTGCTCTCTCGTCGGTAAAATC-5'

← ATC   ← AAT   ← AAA

A double-stranded DNA molecule has six reading frames

Both strands are read in the 5'→3' direction. Each strand has three reading frames, depending on which nucleotide is chosen as the starting position.
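A six-frame ORF scan of the kind described here can be sketched in a few lines. The code below is a minimal illustration, assuming an uppercase DNA string containing only A, C, G and T; the function names and the `min_codons` cut-off are hypothetical, and real gene finders add the refinements discussed later in this section.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple]:
    """Scan all six reading frames for ATG ... stop open reading frames.

    Each hit is (strand, frame, start, end, orf_sequence); start and end are
    0-based positions on the strand that was scanned, and the stop codon is
    included in the ORF. min_codons counts the codons before the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    # The 24-bp example sequence from the figure above.
    for hit in find_orfs("ATGACGAGAGAGCAGCCATTTTAG"):
        print(hit)
```

For the example sequence only the first forward frame contains an ATG followed in frame by a stop codon, so a single ORF spanning the whole sequence is reported.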

The key to the success of ORF scanning is the frequency with which termination triplets appear in the DNA sequence. If the DNA has random sequence and a GC content of 50% then each of the three termination triplets (TAA, TAG and TGA) will appear, on average, once every 4^3 = 64 bp. If the GC content is >50% then the termination triplets, being AT-rich, will occur less frequently but one would still be expected every 100-200 bp. Most genes are longer than 50 codons (the average lengths are 317 codons for E. coli and 483 codons for S. cerevisiae).

Simple ORF scans are less effective with higher eukaryotic DNA

Although ORF scans work well for simple bacterial genomes, they are less effective for locating genes in DNA sequences from higher eukaryotes. This is partly because there is substantially more space between real genes (70% of the human genome is intergenic), increasing the chances of finding spurious ORFs. But the main problem with the human genome and those of higher eukaryotes in general is that their genes are often split by introns, and so do not appear as continuous ORFs in the DNA sequence.

Exons and introns

Many genes in eukaryotes are discontinuous, being split into exons and introns. The introns are removed from the primary transcript by splicing to produce the functional RNA molecule.

"Upstream" refers to the region of DNA before a gene; "downstream" is after the gene.

The organization of a split gene

DNA containing the gene for the protein ovalbumin was allowed to hybridize with ovalbumin mRNA. The eight exons (L, 1-7) of the gene anneal to the complementary regions of RNA, and the seven introns (A-G) loop out from the hybrid. The 5' and 3' ends of the mRNA are indicated, as is the poly-A tail (Chambon, 1981).

ORF scans are complicated by introns

The nucleotide sequence of a short gene containing a single intron is shown. The correct translation is given immediately below the nucleotide sequence: in this translation the intron has been left out and the amino acid sequence is hence split into two sequences. In the lower line, the sequence has been translated without realizing that an intron is present. As a result of this error, the amino acid sequence appears to terminate within the intron.

Solving the problem posed by introns is the main challenge for bioinformaticists writing new software programs for ORF location. Three modifications to the basic procedure for ORF scanning have been adopted (Fickett, 1996):

· Codon bias refers to the fact that not all codons are used equally frequently in the genes of a particular organism.

· Exon-intron boundaries can be searched for as these have distinctive sequence features.

· Upstream control sequences can be used to locate the regions where genes begin.

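7.1.2 Homology searches give an extra dimension to sequence inspection

The limitations of ORF scanning with higher eukaryotic genomes are offset to a certain extent by the use of a homology search to test whether a series of triplets is a real exon or a chance sequence. In this analysis the DNA databases are searched to determine if the test sequence is identical or similar to any genes that have already been sequenced.

Homology search (amino acid sequence)

In this example, the two nucleotide sequences are 76% identical, as indicated by the asterisks. This might be taken as evidence that the sequences are homologous. However, when the sequences are translated into amino acids the identity decreases to 28%, suggesting that the similarity at the nucleotide level was fortuitous.

Homology, identity and similarity

1) Homologous genes are genes that descend from a common ancestor but whose sequences have diverged. Homologous genes in different species are called orthologous genes; homologous genes within one species are called paralogous genes.

2) Homology is a matter of yes or no; it has no percentage.

3) Identity refers to the same base at the same position in homologous DNA sequences, or the same amino acid at the same position in proteins, and can be expressed as a percentage.

4) Similarity is the proportion of identical plus substitutable amino acids in the sequences of homologous proteins. Substitutable amino acids are those with the same properties, such as polar or nonpolar amino acids, whose exchange does not affect the biological function of the protein (or enzyme).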
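The difference between identity and similarity in points 3) and 4) can be made concrete with a short calculation on two aligned sequences. The sketch assumes the sequences are already aligned to the same length with "-" for gaps; the function names and the four amino acid property groups are illustrative assumptions (real programs use substitution matrices such as BLOSUM rather than fixed groups).

```python
# Illustrative property groups; real tools use substitution matrices instead.
AMINO_ACID_GROUPS = ["AVLIMFWPG", "STCYNQ", "DE", "KRH"]


def _aligned_pairs(a: str, b: str) -> list[tuple[str, str]]:
    """Aligned positions where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no aligned positions to compare")
    return pairs


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions carrying the same residue."""
    pairs = _aligned_pairs(a, b)
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def percent_similarity(a: str, b: str) -> float:
    """Percentage of positions that are identical or share a property group."""
    def similar(x: str, y: str) -> bool:
        return x == y or any(x in g and y in g for g in AMINO_ACID_GROUPS)

    pairs = _aligned_pairs(a, b)
    return 100.0 * sum(similar(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(percent_identity("LKDS", "IRET"), percent_similarity("LKDS", "IRET"))
```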

§7.2 Prokaryotic Genomes

The relatively small sizes of prokaryotic genomes, which means that they are amenable to rapid sequence analysis by the shotgun approach, has resulted in the frequent publication over the last few years of complete genome sequences for members of the bacteria and archaea(古细菌). As a consequence we are beginning to understand a great deal about the anatomies of prokaryotic genomes, and in many respects we know more about these organisms than we do about eukaryotes.

7.2.1 The physical structure of the prokaryotic genome

Most prokaryotic genomes are less than 5 Mb in size, although there are a few that are substantially larger than this. In a typical prokaryote the genome is contained in a single circular DNA molecule, localized within the nucleoid of the prokaryotic cell.

The current model has the E. coli DNA attached to a protein core from which 40-50 supercoiled loops radiate out into the cell. Each loop contains approximately 100 kb of supercoiled DNA.

A model for the structure of the E. coli nucleoid

Between 40 and 50 supercoiled loops of DNA radiate from the central protein core. One of the loops is shown in circular form, indicating that a break has occurred in this segment of DNA, resulting in a loss of the supercoiling.

7.2.2 The genetic organization of the prokaryotic genome

We have already learnt that bacterial genomes have compact genetic organizations with very little space between genes. This compact organization is beneficial to prokaryotes, for example by enabling the genome to be replicated relatively quickly.

Operons are characteristic features of prokaryotic genomes

A characteristic feature of prokaryotic genomes is the presence of operons. An operon is a group of genes that are located adjacent to one another in the genome, with perhaps just one or two nucleotides between the end of one gene and the start of the next. All the genes in an operon are expressed as a single unit. This type of arrangement is common in prokaryotic genomes, a typical E. coli example being the lactose operon, the first operon to be discovered.

The lactose operon

The three genes are called lacZ, lacY and lacA, the first two separated by 52 bp and the second two by 64 bp. All three genes are expressed together, lacY coding for the lactose permease which transports lactose into the cell, and lacZ and lacA coding for enzymes that split lactose into its component sugars, galactose and glucose.

The tryptophan operon

The tryptophan operon contains five genes coding for enzymes involved in the multistep biochemical pathway that converts chorismic acid into the amino acid tryptophan. The genes in the tryptophan operon are closer together than those in the lactose operon: trpE and trpD overlap by one bp, as do trpB and trpA; trpD and trpC are separated by 4 bp and trpC and trpB by 12 bp.

Estimating the minimal genome

Comparing the gene catalogues of different genera gives interesting results. E. coli, for example, has 243 identified genes concerned with energy metabolism, whereas Haemophilus influenzae has only 112 genes of this kind and Mycoplasma genitalium even fewer, only 31. Comparisons of this sort raise the question of the minimum number of genes a free-living single cell needs to survive. The main data come from the two organisms with the smallest sequenced genomes, Mycoplasma genitalium and its relative M. pneumoniae, which have 470 and 679 genes respectively. The genes shared by the two can be regarded as the indispensable basis of single-celled life, and the theoretical minimum was estimated at 256 genes (Mushegian and Koonin, 1996). Later mutagenesis experiments raised the number of genes required by Mycoplasma genitalium to 300.

§7.3 Eukaryotic Organelle Genomes

The possibility that some genes might be located outside of the nucleus (extra-chromosomal genes as they were initially called) was first raised in the 1950s as a means of explaining the unusual inheritance patterns of certain phenotypes. Electron microscopy and biochemical studies at about the same time provided hints that DNA molecules might be present in mitochondria and chloroplasts. Eventually, in the early 1960s, these various lines of evidence were brought together and the existence of mitochondrial and chloroplast genomes, independent of and distinct from the nuclear genome, was accepted.

7.3.1 Physical features of organelle genomes

Almost all eukaryotes have mitochondrial genomes and all photosynthetic eukaryotes have chloroplast genomes. Most mitochondrial and chloroplast genomes are circular, but in many eukaryotes the circular genomes coexist in their organelles with linear versions and, in the case of chloroplasts, with smaller circles that contain subcomponents of the genome as a whole. We also now realize that the mitochondrial genomes of some microbial eukaryotes are always linear.

Sizes of organelle genomes

Mitochondrial genome sizes are variable and unrelated to the complexity of the organism. Most multicellular animals have small mitochondrial genomes with a compact genetic organization, the genes being close together with little space between them. The human mitochondrial genome is typical of this type. Lower eukaryotes such as Saccharomyces cerevisiae, and flowering plants, have larger and less compact mitochondrial genomes, with a number of the genes containing introns. Chloroplast genome sizes are less variable and all appear to be organized along similar lines.

7.3.2 Genetic content of organelle genomes

Organelle genomes are much smaller than their nuclear counterparts and their gene contents are much more limited.

Mitochondrial genomes display the greater variability, gene contents ranging from 12 to 92. All mitochondrial genomes contain genes for the mitochondrial rRNAs and at least some of the protein components of the respiratory chain, the latter being the main biochemical feature of the mitochondrion. Most chloroplast genomes appear to possess the same set of 200 or so genes, again coding for ribosomal RNAs and for proteins involved in photosynthesis.

§7.4 Eukaryotic Nuclear Genomes

We have already learnt that the human genome is split into two components: the nuclear genome and the mitochondrial genome. This is the typical pattern for most eukaryotes, the bulk of the genome being contained in the chromosomes in the cell nucleus with a much smaller part located in the mitochondria and in the case of photosynthetic organisms, in the chloroplasts.

7.4.1 Organization of Nuclear GenomesEukaryotic nuclear genomes range in size from less than 10 Mb to over 100 000 Mb. Genome size broadly coincides with organism complexity, the genomes of higher eukaryotes being larger than those of lower eukaryotes, but size is determined not only by the number of genes in the genome but also by the amount of repetitive DNA, the larger genomes tending to be ones in which the copy numbers of the repeat sequences are highest.

7.4.2 Genes and gene-related sequences When one looks at a set of human chromosome under the microscope the inevitable question that springs to mind is “Where exactly are the genes?”. The EST map largely answers this question for the

human genome and shows that the genes are not distributed evenly along the

chromosomes but that each chromosome is made up of regions in which genes are abundant interspersed with regions of lower gene content. 7.4.3 Repetitive DNA

There are various types of repetitive DNA and several classification systems have been devised. The repeats are divided into those that are clustered into tandem arrays and those which are dispersed around the genome. 7.4.3.1 Tandemly repeated DNA

Tandemly repeated DNA is a common feature of eukaryotic genomes but is virtually unknown in prokaryotes. This type of repeat is also call satellite DNA, because DNA fragments containing tandemly repeated sequences form „satellite‟ bands when genomic DNA is fractionated by density gradient centrifugation.7.4.3.2 Satellite(卫星)DNA

The satellite bands in density gradients of eukaryotic DNA are made up of fragments that are composed of long series of tandem repeats, possibly hundreds of kb in length. A single genome can contain several different types of satellite DNA, each with a different repeat unit, these units being anything from <5 to >200 bp. Although some satellite DNA is scattered around a genome, most is located in the centromeres, supporting the hypothesis that repetitive DNA plays a structural role in the centromere.

7.4.3.3 Minisatellite(小卫星)DNA

Minisatellite DNA is a second type of repetitive DNA. Minisatellites form clusters up to 20 kb in length, with repeat units up to 25 bp. Telomeric DNA, which in humans comprises hundreds of copies of the motif 5'-TTAGGG-3', is an example of a minisatellite. In addition to telomeric minisatellites, some eukaryotic genomes contain various other clusters of minisatellite DNA.

7.4.3.4 Microsatellite(微卫星)DNA

Microsatellite clusters are shorter, usually <150 bp, and the repeat unit is usually 4 bp or less. The function of microsatellites is equally mysterious. The typical microsatellite consists of a 1-, 2-, 3- or 4-bp unit repeated 10-20 times. Although each microsatellite is relatively short, there are many of them in the genome, which is why they are used as markers on genome maps.
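As a rough illustration of this definition, the following Python sketch scans a DNA string for a short unit repeated in tandem. The function name, the thresholds (a unit of 1 to 4 bp present in at least 10 copies) and the example sequence are assumptions chosen for illustration; this is not a standard marker-discovery tool.

```python
import re

def find_microsatellites(seq, max_unit=4, min_repeats=10):
    """Return (start, unit, repeat_count) for each tandem run found in seq."""
    # The lazy group captures the shortest candidate unit; the back-reference
    # then requires that unit to recur at least min_repeats - 1 more times.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# Hypothetical test sequence: a (CA)12 run and an (A)15 run in unique flanks.
example = "GGTC" + "CA" * 12 + "TTGACC" + "A" * 15 + "GC"
print(find_microsatellites(example))
```

Each hit reports the 0-based start position, the repeat unit and the number of copies, which is the information needed to treat a locus as a candidate map marker.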

In humans, for example, microsatellites with a CA repeat, such as

5'-CACACACACACACA-3'
3'-GTGTGTGTGTGTGT-5'

make up 0.5% of the genome, 15 Mb in all. Single base-pair repeats of the type

5'-AAAAAAAAAAAAAA-3'
3'-TTTTTTTTTTTTTT-5'

make up another 0.3%.

7.4.3.5 Interspersed genome-wide repeats

Interspersed repeats must have arisen by a mechanism that can result in a copy of a repeat unit appearing in the genome at a position distant from the location of the original sequence. The most frequent way in which this occurs is by transposition, and most interspersed repeats have inherent transpositional activity.

Transposition via an RNA intermediate

The version that involves an RNA intermediate is called retrotransposition(逆转录转座). The basic mechanism involves three steps:

1. An RNA copy of the transposon(转座子)is synthesized by the normal process of transcription.

2. The RNA transcript is copied into DNA, which initially exists as an independent molecule outside of the genome. This conversion of RNA to DNA, the reverse of the normal transcription process, requires a special enzyme called reverse transcriptase(反转录酶).

3. The DNA copy of the transposon integrates into the genome, possibly back into the same chromosome occupied by the original unit, or possibly into a different chromosome.

Reverse transcription of cytoplasmic mRNAs leads to intronless pseudogenes

The presence of several pseudogenes in the chromosomes of higher eukaryotes has led to speculation that these spliced DNA copies arose through reverse transcription of cytoplasmic mRNA, perhaps during an abortive retrovirus infection. The spliced copies are thought to have then been reinserted into the chromosome by a mechanism that is still unclear.

RNA transposons or retroelements(逆转座因子)are features of eukaryotic genomes but have not so far been discovered in prokaryotes. They have attracted a great deal of attention because there are clear similarities between some types of retroelement and free-living viruses called retroviruses(逆转录病毒), which include many benign forms but also virulent types such as the human immunodeficiency viruses which cause AIDS.

Retroviruses and retrotransposons are LTR elements, as they have long terminal repeats(长末端重复顺序)at either end which play a role in the transposition process. Other retroelements do not have LTRs. These are called retroposons and in mammals include:

· LINEs (long interspersed nuclear elements 长分散核因子), which contain a reverse transcriptase-like gene probably involved in the retrotransposition process.
· SINEs (short interspersed nuclear elements 短分散核因子), which do not have a reverse transcriptase gene but can still transpose, probably by borrowing reverse transcriptase enzymes that have been synthesized by other retroelements.

Retroelements

A comparison of the structures of four types of retroelement. Retroviruses (A) and retrotransposons (B) are LTR elements that possess long terminal repeats at each end. The gag gene codes for a series of proteins in the virus core, pol codes for the reverse transcriptase and other enzymes involved in replication of the element, and env codes for coat proteins. LINEs (C) and SINEs (D) are non-LTR retroelements or retroposons. Both have a poly(A) region at one end.

DNA transposons

Not all transposons require an RNA intermediate. Many are able to transpose in a more direct, DNA to DNA manner. With these elements we are aware of two distinct transposition mechanisms, one involving direct interaction between the donor transposon and the target site, resulting in copying of the donor element (replicative transposition 复制转座), and the second involving excision of the element and reintegration at a new site (conservative transposition 保守转座). Both mechanisms require enzymes which are usually coded by genes within the transposon.

In eukaryotes, DNA transposons are less common than retrotransposons, but they have a special place in genetics because a family of plant DNA transposons, the Ac/Ds elements of maize, were the first transposable elements to be discovered, by Barbara McClintock in the 1950s. Her conclusions, that some genes are mobile and can move from one position to another in a chromosome, were based on exquisite genetical experiments, but were widely disbelieved by other biologists until the late 1970s when the molecular basis of transposition was first appreciated.

Although relatively uncommon in eukaryotes, DNA transposons are an important component of prokaryotic genome anatomy. The insertion sequences(插入序列)IS1 and IS186, present in the 50-kb segment of E. coli DNA, are examples of DNA transposons, and a single E. coli genome may contain as many as 20 of these, of various types. Most of the sequence of an IS is taken up by one or two genes that specify the transposase(转座酶)enzyme that catalyzes its transposition. IS elements can transpose either replicatively or conservatively.

Other kinds of DNA transposon known in E. coli, and fairly typical of prokaryotes in general, are:

· Composite transposons(复合转座子), which are basically a pair of IS elements flanking a segment of DNA usually containing one or more genes, often ones coding for antibiotic resistance. The transposition of a composite transposon is catalyzed by the transposase coded by one or both of the IS elements. Composite transposons use the conservative mechanism of transposition.
· Tn3-type transposons(Tn-3型转座子), which have their own transposase gene and so do not require flanking IS elements. They transpose replicatively.
· Transposable phages(转座噬菌体), which are bacterial viruses that transpose replicatively as part of their normal infection cycle.

DNA transposons of prokaryotes

Four types are shown. Insertion sequences (A), Tn3-type transposons (C) and transposable phages (D) are flanked by short (<50 bp) inverted terminal repeat (ITR) sequences(反向末端重复序列). The resolvase gene of the Tn3-type transposon codes for a protein involved in the transposition process.

§7.5 Genome Sequence of Arabidopsis thaliana

Arabidopsis thaliana has many advantages for genome analysis, including a short generation time, small size, large number of offspring, and a relatively small nuclear genome. These advantages promoted the growth of a scientific community that has investigated the biological processes of Arabidopsis and has characterized many genes. To support these activities, an international collaboration (the Arabidopsis Genome Initiative, AGI) began sequencing the genome in 1996 and completed it in 2000.

§7.5.1 Genome maps

A Model System for Plant Science

Arabidopsis thaliana is a small plant that is used as a model for studying plant biology. This simple angiosperm is becoming an important research tool for addressing fundamental questions of biological function and organization that extend across the major kingdoms of living organisms. Arabidopsis offers many advantages for genetic and molecular studies, including a short life cycle, small genome, ability to be transformed, widespread availability of mutants, and prolific seed production. The Multinational Arabidopsis Genome Project was established in 1990 to facilitate genome analysis and coordination of research and training programs.

§7.5.2 Overview of the genome sequence

The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. The sequenced regions cover 115.4 megabases of the 125-megabase genome. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications. The genome contains 25,498 genes encoding proteins from 11,000 families. Arabidopsis has many families of new proteins but also lacks several common protein families. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

§7.5.3 Characterization of the Coding Regions

The 25,498 genes predicted in the genome of Arabidopsis are the largest gene set published to date: C. elegans has 19,099 genes and Drosophila 13,601 genes. Arabidopsis and C. elegans have similar gene density, whereas Drosophila has a lower gene density; Arabidopsis also has a significantly greater extent of tandem gene duplications and segmental duplications, which may account for its larger gene set.
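The figures quoted in this section allow a quick consistency check. The short Python sketch below uses only the numbers given in the text (115.4 Mb sequenced, a 125-Mb genome, 25,498 predicted genes) to estimate the fraction of the genome covered and the average spacing of genes in the sequenced regions. The variable names are illustrative, and the spacing is a simple average that ignores the uneven distribution of genes along the chromosomes.

```python
# Values taken from the text above.
sequenced_mb = 115.4       # sequenced regions, in megabases
genome_mb = 125.0          # estimated total genome size, in megabases
predicted_genes = 25_498   # predicted protein-coding genes

coverage = sequenced_mb / genome_mb                    # fraction sequenced
kb_per_gene = sequenced_mb * 1000 / predicted_genes    # average kb per gene

print(f"coverage: {coverage:.1%}")
print(f"average spacing: {kb_per_gene:.1f} kb per gene")
```

On these figures, about 92% of the genome is covered and there is on average one gene for roughly every 4.5 kb of sequenced DNA.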

The functions of 69% of the genes were classified according to sequence similarity to proteins of known function in all organisms; only 9% of the genes have been characterized experimentally. The significant proportion of genes with predicted functions involved in metabolism, gene regulation and defence is consistent with previous analyses. Roughly 30% of the 25,498 predicted gene products could not be assigned to functional categories.

· Only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes, reflecting the independent evolution of many plant transcription factors.
· In contrast, 48–60% of genes involved in protein synthesis have counterparts in the other eukaryotic genomes, reflecting highly conserved gene functions.
· The relatively high proportion of matches between Arabidopsis and bacterial proteins in the categories 'metabolism' and 'energy' reflects both the acquisition of bacterial genes from the ancestor of the plastid and high conservation of sequences across all species.
· Finally, a comparison between unicellular and multicellular eukaryotes indicates that Arabidopsis genes involved in cellular communication and signal transduction have more counterparts in multicellular eukaryotes than in yeast, reflecting the need for sets of genes for communication in multicellular organisms.

Comparison of functional categories between organisms

Subsets of the Arabidopsis proteome containing all proteins that fall into a common functional class were assembled. Each subset was searched against the complete set of translations from Escherichia coli, Synechocystis sp. PCC6803, Saccharomyces cerevisiae, Drosophila, C. elegans and a Homo sapiens non-redundant protein database.

A total of 11,601 protein types were identified. Thirty-five per cent of the predicted proteins are unique in the genome, and the proportion of proteins belonging to families of more than five members is substantially higher in Arabidopsis (37.4%) than in Drosophila (12.1%) or C. elegans (24.0%). The absolute number of Arabidopsis gene families and singletons (types) is in the same range as the other multicellular eukaryotes, indicating that a proteome of 11,000–15,000 types is sufficient for a wide diversity of multicellular life. The proportion of gene families with more than two members is considerably higher in Arabidopsis than in other eukaryotes.

Distribution of tandemly repeated gene arrays in the Arabidopsis genome

Tandemly repeated gene arrays were identified using the BLASTP program. The histogram gives the number of clusters in the genome containing 2 to n similar gene units in tandem.

§7.5.4 Genome Organization and Duplication

The duplicated regions encompass 67.9 Mb, 60% of the genome, slightly more than was found in the DNA-based alignment. The extent of sequence conservation of the duplicated genes varies greatly, with 6,303 (37%) of the 17,193 genes in the segments classified as highly conserved and a further 1,705 (10%) showing less significant similarity. The proportion of homologous genes in each duplicated segment also varies widely, between 20% and 47% for the highly conserved class of genes.

Segmentally duplicated regions in the Arabidopsis genome

Individual chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top); centromeres are marked black. Coloured bands connect corresponding duplicated segments. Similarities between the rDNA repeats are excluded. Duplicated segments in reversed orientation are connected with twisted coloured bands. The scale is in megabases.

Transposable elements

For many plants with large genomes, class I retrotransposons contribute most of the nucleotide content. In the small Arabidopsis genome, class I elements are less abundant and primarily occupy the centromere. In contrast, Basho elements and class II transposons such as MITEs and MULEs predominate on the periphery of pericentromeric domains. Among class II transposons, MULEs and CACTA elements are clustered near centromeres and heterochromatic knobs, whereas MITEs and hAT elements have a less pronounced bias. The distribution pattern of transposable elements observed in Arabidopsis may reflect different types of pericentromeric heterochromatin regions and may be similar to those found in animals.

Distribution of class I, II and Basho transposons in Arabidopsis chromosomes

The frequencies of class I retroelements (green), class II DNA transposons (blue) and Basho elements (purple) are shown at 100-kb intervals along the five chromosomes of Arabidopsis.

Nucleolar organizers (NORs)

Nucleolar organizers (NORs) contain arrays of unit repeats encoding the 18S, 5.8S and 25S ribosomal RNA genes and are transcribed by RNA polymerase I. Together with 5S RNA, which is transcribed by RNA polymerase III, these rRNAs form the structural and catalytic cores of cytoplasmic ribosomes. In Arabidopsis, the NORs juxtapose the telomeres of chromosomes 2 and 4, and comprise uninterrupted 18S, 5.8S and 25S units all orientated on the chromosomes in the same direction. Both NORs are roughly 3.5–4.0 megabase-pairs and comprise 350–400 highly methylated rRNA gene units, each 10 kb. The sequence between the euchromatic arms and NORs has been determined.

Telomeres

Arabidopsis telomeres are composed of CCCTAAA repeats and average 2–3 kb. For TEL4N (telomere 4 North), consensus repeats are adjacent to the NOR; the remaining telomeres are typically separated from coding sequences by repetitive subtelomeric regions measuring less than 4 kb. Imperfect telomere-like arrays of up to 24 kb are found elsewhere in the genome, particularly near centromeres. These arrays might affect the expression of nearby genes and may have resulted from ancient rearrangements, such as inversions of the chromosome arms.

Centromeres

Centromere DNA mediates chromosome attachment to the meiotic and mitotic spindles and often forms dense heterochromatin. Arabidopsis centromeres, like those of many higher eukaryotes, contain numerous repetitive elements including retroelements, transposons, microsatellites and middle repetitive DNA. These repeats are rare in the euchromatic arms and often most abundant in pericentromeric DNA.

Predicted centromere composition

Genetically defined centromere boundaries are indicated by filled circles; fully and partially assembled BAC sequences are represented by solid and dashed black lines, respectively. Estimates of repeat sizes within the centromeres were derived from consideration of repeat copy number, physical mapping and cytogenetic assays.

The twentieth century began with the rediscovery of Mendel's rules of inheritance in pea, and it ends with the elucidation of the complete genetic complement of a model plant, Arabidopsis. The analysis of the completed sequence of a flowering plant provides insights into the genetic basis of the similarities and differences of diverse multicellular organisms. It also creates the potential for direct and efficient access to a much deeper understanding of plant development and environmental responses, and permits the structure and dynamics of plant genomes to be assessed and understood.

Chapter 8 Gene Cloning

In brief, genes are cloned by taking a piece of DNA from an organism and splicing it into a cloning vector to make a recombinant DNA molecule. A cloning vector is an artificially constructed DNA molecule capable of replication in a host organism, such as a bacterium, and into which a piece of DNA to be studied can be specifically inserted at known positions. The recombinant DNA molecule is introduced into a host such as E. coli, yeast, an animal cell, or a plant cell. Replication of the recombinant DNA molecule (molecular cloning) occurs in the host cell, thereby producing many identical copies.

All techniques for gene isolation exploit one or more of the four characteristics that define genes:

· they have a defined primary structure (sequence);
· they occupy a particular location within the genome;
· they encode an RNA with a particular expression pattern;
· many genes encode protein or mRNA products with a defined function.

Higher plants and animals tend to have relatively large quantities of DNA in their genomes, making it difficult to fish out particular genes of interest. Further, higher plants and animals have relatively long generation times. Based on phenotypes, one is seldom able to study millions of individuals and seldom knows enough about the function of a gene to isolate it from the many thousands of other genes in the organism, although many major genes have been cloned based on cDNA for some organisms.

§8.1 cDNA Cloning

DNA copies, called complementary DNA (cDNA), can be made from mRNA molecules isolated from cells. These cDNA molecules can then be cloned. Thus, if a specific mRNA molecule can be isolated, the corresponding cDNA can be made and cloned. The analysis of that cloned cDNA molecule can then provide information about the gene that encoded the mRNA. More typically, the entire mRNA population of a cell is isolated and a corresponding set of cDNA molecules is made and inserted into a cloning vector to produce a cDNA library. Since a cDNA library reflects the gene activity of the cell type at the time the mRNAs are isolated, the construction and analysis of cDNA libraries is useful for comparing gene activities in different cell types of the same organism, because there would be similarities and differences in the clones represented in the cDNA libraries of each cell type.

8.1.1 Cloning vectors

Different kinds of cloning vectors have been developed to construct and clone recombinant DNA molecules. All cloning vectors must:

(1) Replicate within at least one host organism;

(2) Have one or more restriction sites into which foreign DNA can be inserted;

(3) Have one or more dominant selectable markers to detect those cells which contain the vectors.

8.1.2 Synthesis of cDNA Molecules

Note that the clones in the cDNA library represent the mature mRNAs found in the cell. In eukaryotes, mature mRNAs are processed molecules, so the sequences obtained are not equivalent to gene clones. In particular, intron sequences are typically present in gene clones but not in cDNA clones. For any mRNA, cDNA clones can be useful for subsequently isolating the gene that codes for that mRNA. The gene clone can provide more information than can the cDNA clone, for example, on the regulatory sequences associated with the gene.

Process of Transcription

During transcription, one strand of a DNA double helix is used as a template by RNA polymerase to synthesize an mRNA. The mRNA then passes through various processing steps, including one called splicing, in which the non-coding sequences are eliminated.

cDNA libraries are mostly made from eukaryotic mRNAs. This can be achieved relatively easily because, among the RNAs found in eukaryotes, only mRNAs contain a poly(A) tail. These poly(A)+ mRNAs can be purified from the mixture by passing the RNA molecules over a column to which short chains of deoxythymidylic acid, called oligo(dT) chains, have been attached. As the RNA molecules pass through an oligo(dT) column, the poly(A) tails on the mRNA molecules form complementary base pairs with the oligo(dT) chains. As a result the mRNAs are captured on the column while the other RNAs pass through. The captured mRNAs are subsequently released and collected, for example, by decreasing the ionic strength of the buffer passing over the column so that the hydrogen bonds are disrupted. This method results in significant enrichment of poly(A)+ mRNAs in the mixed RNA population, to about 50 percent versus about 3 percent in the cell.

RNA Isolation

In this example, the column material consists of glass beads to which thymine nucleotides are attached. Since adenine and thymine readily pair with each other, mRNAs with poly(A) tails are selectively retained on the beads. As seen on the left-hand side of the diagram, a solution containing various RNA populations is applied to the separation column. Only the poly(A) RNA is retained, as it is immobilized on the solid support material. The other RNAs and cellular material pass through the column. On the right, the bound poly(A) mRNA is retrieved by treating the column with a special buffer solution that disrupts the pairing between the thymine nucleotides and the poly(A) tail. The mRNA can be collected in a tube for further experimentation.
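The selection principle can be mimicked in a few lines of Python. This is only a toy model under stated assumptions: retention on the column is reduced to "the RNA ends in at least `min_tail` adenines", and the function name, the threshold of 20 and the example sequences are all hypothetical.

```python
def oligo_dt_select(rnas, min_tail=20):
    """Split RNAs into (bound, flow_through) by 3' poly(A) tail length."""
    bound, flow_through = [], []
    for name, seq in rnas:
        seq = seq.upper()
        tail = len(seq) - len(seq.rstrip("A"))  # number of trailing A residues
        (bound if tail >= min_tail else flow_through).append(name)
    return bound, flow_through

# Hypothetical sample: one polyadenylated mRNA and two RNAs without a tail.
sample = [
    ("mRNA_1", "AUGGCUUCGUAA" + "A" * 30),
    ("rRNA_fragment", "GGCUACGGCCAUACC"),
    ("tRNA_like", "GCGGAUUUAGCUCAGUUGGGCCA"),
]
print(oligo_dt_select(sample))
```

Only the polyadenylated molecule is reported as bound; the other two pass through, as described for the column.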

8.1.2 Synthesis of cDNA Molecules

Once an enriched population of mRNA molecules has been isolated, double-stranded complementary DNA (cDNA) copies are made in vitro. First, a short oligo(dT) chain is hybridized to the poly(A) tail at the 3' end of each mRNA strand. The oligo(dT) acts as a primer for reverse transcriptase, which makes a complementary DNA copy of the mRNA strand. Next, RNase H, DNA polymerase I, and DNA ligase are used to synthesize the second DNA strand. RNase H degrades the RNA strand in the hybrid DNA-mRNA, DNA polymerase I makes new DNA fragments using the partially degraded RNA fragments as primers, and finally DNA ligase ligates the new DNA fragments together to make a complete chain. The result is a double-stranded cDNA molecule, the sequence of which is derived from the original poly(A)+ mRNA molecule.

Complementary DNA

cDNA is a form of DNA prepared in the laboratory using an enzyme called reverse transcriptase. cDNA production is the reverse of the usual process of transcription in cells, since the procedure uses mRNA as a template rather than DNA. Unlike genomic DNA, cDNA contains only expressed DNA sequences, or exons.
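At the level of sequence, the two synthesis steps amount to taking reverse complements. The sketch below shows this in Python for a made-up mRNA; it models only the base-pairing logic, not the enzymology (priming, RNase H nicking and ligation are not represented), and the function names are illustrative.

```python
# Pairing rules for copying RNA into DNA (first strand).
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
# Pairing rules for copying DNA into DNA (second strand).
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') copied from an mRNA."""
    # The template is read 3'->5', so the product is the reverse complement.
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna.upper()))

def second_strand(cdna):
    """Return the second DNA strand (5'->3'), complementary to the first."""
    return "".join(DNA_TO_DNA[base] for base in reversed(cdna.upper()))

mrna = "AUGGCCUAA" + "A" * 8      # hypothetical mRNA with a short poly(A) tail
first = first_strand_cdna(mrna)
print(first)                       # starts with the oligo(dT)-derived T run
print(second_strand(first))        # mRNA sequence in DNA letters
```

The second strand has the same sequence as the mRNA with T in place of U, which is why a cDNA clone can be read directly as the coding sequence.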

8.1.3 Production of cDNA Libraries

Once cDNA molecules have been synthesized, they must be cloned. The cDNAs are cloned using a restriction site linker, or linker, which is a relatively short, double-stranded piece of DNA about 8 to 12 nucleotide pairs long. In this example the linker contains the BamHI restriction site. Both the cDNA molecules and the linkers have blunt ends, and they can be ligated together at high concentrations of T4 DNA ligase. Sticky ends are then produced by cleaving the linker-bearing cDNA with BamHI. The resulting DNA is inserted into a cloning vector that has also been cleaved with BamHI, and the recombinant DNA molecule produced is transformed into an E. coli host cell for cloning.
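The effect of the BamHI digestion can be sketched as follows. BamHI recognizes GGATCC and cuts after the first G, leaving a 5'-GATC overhang. The linker and cDNA sequences below are hypothetical, only the top strand is modelled, and the function name is illustrative.

```python
BAMHI_SITE = "GGATCC"
CUT_OFFSET = 1  # BamHI cuts after the first G of the site: G^GATCC

def bamhi_digest(top_strand):
    """Cut the top strand at every BamHI site and return the fragments."""
    seq = top_strand.upper()
    fragments, start = [], 0
    pos = seq.find(BAMHI_SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(BAMHI_SITE, pos + 1)
    fragments.append(seq[start:])
    return fragments

linker = "CCGGATCCGG"   # hypothetical 10-bp palindromic BamHI linker
cdna = "ATGGCTTCGTAA"   # hypothetical blunt-ended cDNA
construct = linker + cdna + linker   # cDNA with a linker ligated to each end
print(bamhi_digest(construct))
```

The middle fragment carries the cDNA and begins with GATC, the sticky end that pairs with a BamHI-cut vector. The same function also makes a practical caveat visible: a BamHI site inside the cDNA itself would be cut as well.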

§8.2 Cloning Based on mRNA Level

Several techniques for isolating plant genes take advantage of the fact that many genes have characteristic patterns of expression. For example, many potentially valuable plant genes, such as those involved in the synthesis of valuable secondary metabolites, are expressed in specialized tissues. A commonly used method to enrich for such genes is differential screening. In this technique, mRNA is prepared from tissues of different plants that are distinguished by some criterion such as exposure to particular environmental conditions, or from different tissues or stages of development. A DNA library from the appropriate organism is then probed sequentially with labeled cDNA produced from each of the mRNA samples. Clones that are more highly labeled by the cDNA from one mRNA sample than from another contain genes that are differentially expressed in the two samples.

8.2.1 Differential Hybridization

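Differential hybridization, also called differential screening, is suited to isolating cDNA clones of mRNAs whose expression is induced by a particular treatment.

The technique requires two different cell populations: one in which the target gene is expressed normally and one in which it is not expressed. Two different mRNA preparations can then be made. The first is a total mRNA population that contains a certain proportion of the target mRNA; the second is a total mRNA population that lacks the target mRNA. Probes made from these two mRNA populations are hybridized in parallel to a clone library constructed from the total mRNA of the cells that express the target gene. With the probe that includes the target mRNA, every colony carrying a recombinant gives a positive signal; with the probe that lacks the target mRNA, every colony gives a positive signal except those containing the target gene.

Differential hybridization

(a) Total mRNA is prepared separately from the two cell populations, one expressing and one not expressing the target mRNA;

(b) Using an oligo(dT) primer or short random oligonucleotide primers, the total mRNA is reverse-transcribed into radioactively labeled probes;

(c) Each probe is hybridized separately to the cDNA library, which was constructed from the total mRNA of the cells expressing the target gene.

Limitations of differential hybridization:

· Low sensitivity, especially for low-abundance mRNAs;
· Poor reproducibility.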
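The comparison of the two parallel hybridizations can be expressed as a simple filter. In the Python sketch below, each clone has a signal intensity with the probe from expressing cells and with the probe from non-expressing cells; the clone names, the intensities and the three-fold threshold are illustrative assumptions, not values from any protocol.

```python
def differential_clones(signal_induced, signal_control, min_ratio=3.0):
    """Return clones labeled much more strongly by the induced probe."""
    selected = []
    for clone, induced in signal_induced.items():
        control = signal_control.get(clone, 0.0)
        # A clone that gives no control signal at all is kept outright;
        # otherwise the induced/control ratio must reach min_ratio.
        if induced > 0 and (control == 0 or induced / control >= min_ratio):
            selected.append(clone)
    return sorted(selected)

# Hypothetical hybridization intensities for four library clones.
induced = {"c1": 120.0, "c2": 95.0, "c3": 80.0, "c4": 5.0}
control = {"c1": 110.0, "c2": 0.0, "c3": 20.0, "c4": 6.0}
print(differential_clones(induced, control))
```

Clones c2 and c3 are selected as candidates. The low-sensitivity limitation listed above corresponds to clones such as c4, whose weak signals cannot be distinguished reliably from background.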

8.2.2 DDRT-PCR

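mRNA differential display reverse transcription PCR (DDRT-PCR) is a rapid and effective method for isolating target genes whose encoded products are unknown.

According to their expression characteristics, the genes of higher eukaryotes fall into two broad classes. House-keeping genes are expressed constitutively and maintain the basic metabolic activities of the cell. Developmentally regulated genes are expressed in a time- and tissue-specific manner and bring about the normal growth, development and differentiation of the individual. This ordered expression of different genes in time and space, at different stages of development or in different tissues or cells, is called differential expression.

Basic principle of DDRT-PCR

The 3' end of almost all eukaryotic mRNA molecules carries a stretch of polyadenylate, the poly(A) tail. A cDNA copy can therefore be synthesized by reverse transcriptase, using the mRNA as template and oligo(dT) as primer. To identify and isolate differentially expressed genes efficiently from a pair of individuals of different genotype, suitable PCR primers must be designed:

· 3' anchored primer: eleven consecutive deoxythymidylate residues followed by two further deoxynucleotides, written 5'-T11MN-3'. M is any nucleotide except T (A, G or C) and N is any nucleotide (A, G, C or T), so MN has 12 possible combinations.
· 5' arbitrary primer: ten nucleotides of arbitrary sequence.

Basic procedure of DDRT-PCR

(a) Extract total mRNA from group A and from group B;

(b) Add a 3' anchored primer (5'-T11MN-3') and reverse-transcribe to synthesize first-strand cDNA;

(c) Add a specific combination of one 5' arbitrary primer and one 3' anchored primer (5'-T11MN-3'), together with 35S-labeled dNTP, and amplify by PCR;

(d) Load the isotope-labeled ("hot") PCR products on a denaturing DNA sequencing gel, separate them by electrophoresis and carry out autoradiography.

Apart from the bands specific to group A or group B, most bands are common to both groups.

When one 3' anchored primer and one 5' arbitrary primer are used as a pair to amplify first-strand cDNA, electrophoresis on a standard sequencing gel for 2-3 hours reveals 50-100 DNA bands of 100-500 bp. Using 12 anchored primers and 20 arbitrary primers, giving 240 primer pairs, should therefore produce about 20,000 DNA bands, each representing one particular mRNA. This number roughly covers all the mRNAs expressed in a given cell type at a given developmental stage.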
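The primer arithmetic in this section can be reproduced directly. The Python sketch below enumerates the anchored primers of the form 5'-T11MN-3' and multiplies out the number of primer pairs and expected bands; the function and variable names are illustrative, and the 20 arbitrary 10-mers are only counted, not designed.

```python
from itertools import product

def anchored_primers(t_length=11):
    """List the 5'-T(11)MN-3' primers: M in {A, C, G}, N in {A, C, G, T}."""
    return ["T" * t_length + m + n for m, n in product("ACG", "ACGT")]

anchors = anchored_primers()
n_arbitrary = 20                      # number of 5' arbitrary 10-mers used
pairs = len(anchors) * n_arbitrary    # anchored x arbitrary combinations
low, high = pairs * 50, pairs * 100   # 50-100 bands displayed per pair

print(anchors[:4])
print(len(anchors), pairs, low, high)
```

The count comes to 12 anchored primers and 240 primer pairs, and 240 pairs at 50-100 bands each gives 12,000 to 24,000 bands, consistent with the figure of about 20,000 quoted above.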

8.2.3 Suppression Subtractive Hybridization

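Suppression subtractive hybridization (SSH) is a method for isolating differentially expressed genes that was established by Diatchenko et al. in 1996. It is a DNA subtractive hybridization method based on suppression PCR, and so relies on two techniques: suppression PCR and subtractive hybridization.

Subtractive hybridization

The essence of subtractive hybridization is to remove cDNA sequences that are common to both samples, or that are not induced, so that the sequences of the target genes are effectively enriched and the sensitivity with which differentially expressed cDNAs are isolated is increased.

Suppression PCR

Long inverted repeats flanking the non-target fragments form a “panhandle” structure during annealing. This structure cannot pair with the primer, so amplification of non-target sequences is selectively suppressed.

Advantages of SSH

· Low false-positive rate, a consequence of its two rounds of hybridization and two rounds of suppression PCR;
· High sensitivity: normalization and enrichment of target fragments ensure that low-abundance mRNAs can also be detected;
· High efficiency: a single SSH reaction can isolate tens or even hundreds of differentially expressed genes at once.

Cloning of SARS virus genes

Starting from extracted SARS virus RNA, the Institute of Biochemistry and Cell Biology of the Shanghai Institutes for Biological Sciences used RT-PCR to clone the genes for six major SARS virus proteins: S, M, N, E, the RNA polymerase and the protease (3CL).

Building on this work, the Institute of Materia Medica of the Shanghai Institutes for Biological Sciences, together with collaborators, expressed, isolated and purified important SARS virus proteins and obtained samples of the E protein, the N protein and the 3CL protease, which have been used for in vitro screening of anti-SARS drugs.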

§8.3 Map-Based Cloning

High-density linkage maps of molecular markers provide a new gene-cloning approach, map-based cloning or positional cloning, which makes it possible to fish out a single gene (~10³ bp) responsible for an incremental improvement in an endpoint measurement without knowing the biophysical basis of the change. Map-based cloning usually consists of:

· identifying the markers that flank and show tight genetic linkage to the target gene;
· walking to the gene by using various genomic libraries constructed in, for example, the yeast artificial chromosome (YAC) vector;
· confirming the gene effects by the comparison of the isolated gene with a wild-type allele or, in the case of plants, complementation of the recessive phenotype by transformation.

Theoretically, map-based cloning methods permit the isolation of any gene which can be precisely mapped.

8.3.1 Genetic Mapping of Genes

The strategy of map-based cloning is first to find one or more DNA markers linked to a gene of interest. The DNA markers are then used to screen the library and isolate (or “land” on) the clone containing the gene by chromosome walking.

Restriction fragment length polymorphism (RFLP) map of the Xa-21 genomic region on rice chromosome 11. The distance between markers is shown in centimorgans on the left.

8.3.2 Physical Mapping of Genes

An essential requirement for map-based cloning is the availability of comprehensive genomic libraries of relatively large DNA fragments (YAC or BAC libraries). Another requirement, and the key to the successful application of these methods, is the availability of DNA markers that are closely linked to the gene of interest (ideally less than a few hundred kilobases apart).

Chromosome walking

This method is used to move systematically along a chromosome from a known location. A cloned probe is used to isolate phage clones carrying genomic fragments from that region of the chromosome. A small DNA fragment from the end of the largest phage clone is used to rescreen the same library. Among the clones recovered are new phages that carry the probe sequence but whose sequences also extend farther along the chromosome. A new probe is generated from the far end of one of these phages and used to screen the library again to isolate new clones extending still farther. Hundreds of kilobases of contiguous chromosomal DNA can be cloned by repeated cycles of walking.
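The repeated cycle of screening and re-probing can be sketched as a simple loop over overlapping clones. In the Python example below, each clone is reduced to a (start, end) interval in kb on one chromosome, "screening the library" means finding every clone that contains the current probe position, and the walk proceeds in one direction only. The clone names and coordinates are hypothetical, and a real walk would also have to deal with repeats and chimeric clones, which this model ignores.

```python
def chromosome_walk(clones, probe_pos, target_pos):
    """Walk from probe_pos toward a target lying further along the chromosome.

    clones maps a clone name to its (start, end) coordinates in kb.
    Returns the list of clone names used; raises ValueError at a library gap.
    """
    path, position = [], probe_pos
    while True:
        # Screen the library: every clone that contains the current probe.
        hits = [(end, name) for name, (start, end) in clones.items()
                if start <= position <= end]
        if not hits:
            raise ValueError(f"no clone contains position {position} kb")
        far_end, best = max(hits)      # clone extending farthest along
        path.append(best)
        if clones[best][0] <= target_pos <= far_end:
            return path                # the target lies inside this clone
        if far_end <= position:
            raise ValueError(f"walk stalled at {position} kb")
        position = far_end             # new probe from the far end

# Hypothetical library of overlapping clones (coordinates in kb).
library = {"A": (0, 40), "B": (30, 75), "C": (60, 110),
           "D": (100, 150), "E": (35, 60)}
print(chromosome_walk(library, probe_pos=10, target_pos=120))
```

Each pass through the loop corresponds to one walking step in the text: the probe picks up overlapping clones, the one reaching farthest is chosen, and its far end supplies the next probe. A gap in the library stops the walk, which is one reason comprehensive large-insert libraries are required.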

8.3.3 Complementation Testing

Proof of successful gene identification and cloning usually requires complementation of the mutant phenotype by transformation with a wild-type allele. The major drawback of positional cloning, however, is the difficulty of narrowing down the field of candidate clones to a manageable number for complementation testing. For fine-scale mapping of a mutation locus, it is usually necessary to analyze nearly a thousand progeny (usually F2 plants) or even more. In addition, even after accurate mapping, positional cloning procedures that use YAC or BAC clones require subcloning of many small fragments into a transformation-competent vector for complementation testing. In many cases, these steps are rate-limiting hurdles to positional cloning.

Plant transformation-competent vectors, such as the cosmid vector pOCA18 and the λ-phage vector λT12, have been developed for construction of genomic libraries with inserts of 5-25 kb that are used for genetic complementation of mutants. The low cloning capacity of these vectors, however, limits their usefulness for efficient gene isolation by positional cloning.

Recently, a 150-kb human DNA fragment was transferred into the tobacco genome by using a binary-BAC vector by Agrobacterium-mediated transformation. To accelerate gene isolation from plants by positional cloning, vector systems suitable for both chromosome walking and genetic complementation have been developed. The transformation-competent artificial chromosome (TAC) vector (pYLTAC7), can accept and maintain large genomic DNA fragments (40-80 kb) stably in both Escherichia coli and Agrobacterium tumefaciens. Furthermore, it has the cis sequences required for Agrobacterium-mediated gene transfer into plants. Physical map of pYLTAC7

The map shows the location of some sites for endonucleases that cleave the molecule once or twice. LB and RB, left and right borders, respectively; OD, overdrive sequence; Pnos, promoter of the nopaline synthase gene; HPT, coding region of the hygromycin phosphotransferase gene; nos 3', polyadenylation signals of the nopaline synthase gene; KanR, kanamycin-resistance gene (NRT1).

8.3.4 Identification of Candidate Gene

A case study in tomato

Genetic mapping in tomato indicated that the YAC clone PTY538-1 spans the Pto locus. The YAC clone was then used to probe approximately 920,000 plaque-forming units of a leaf cDNA library. Of approximately 200 hybridizing plaques, 30 were investigated further. The cDNA inserts were used to probe a tomato mapping population consisting of 85 plants with recombination events in the Pto region. Two of the clones, CD127 and CD146 (both 1.2 kb), contained sequences that cross-hybridized with each other. When CD127 was mapped, it cosegregated with Pto. The genetic cosegregation of CD127 with Pto and the fact that the clone was isolated from a leaf-tissue library made the cDNA a strong candidate for the Pto gene.

8.3.5 Predicted Sequences

A case study in rice

Sequencing of the 9.6-kb KpnI genomic fragment containing the 2.3-kb HindIII fragment revealed a single, large open reading frame (ORF) of 3075 base pairs, interrupted by one intron of 843 base pairs. Sequencing of complementary DNAs (cDNAs) indicates that the intron is processed as predicted in both IRBB21 and the transgenic line 106-17. In RNA blot experiments of IRBB21, four bands hybridize with RG103. The largest band of ~3.1 kb is consistent with the size of the full-length cDNA isolated from line 106-17.

Partial restriction enzyme map and rice complementation analysis of the Xa-21 coding region

Transformation of Taipei 309 with genomic subclones pB821, pC822, pB852 and pB853 produced plants with resistance (R) or susceptibility (S) to Xoo race 6 strain PXO99Az. The 9.6-kb KpnI DNA fragment of cosmid 116 was cloned into plasmid pTA818 to generate pC822. HindIII (H), KpnI (K), and HindIII-KpnI DNA fragments of cosmid 116 were ligated to pBluescript SK+ (Stratagene) to generate pB821, pB852, and pB853, respectively. In the Xa21 coding region, the ATG and TGA codons, the RG103-hybridizing region, and the 5' and 3' splice junctions corresponding to the consensus sequences of eukaryotic mRNAs are marked. The intron is designated by a horizontal bar.

Chapter 9 Functional genomics

The field of functional genomics has emerged to address the function of genes discovered by genome sequencing. In contrast to the previously prevalent gene-by-gene approaches, new high-throughput methods are being developed for expression analysis as well as for the recovery and identification of mutants. The experimental approach is consequently changing from hypothesis-driven work to nonbiased data collection and an archiving methodology that makes these data available for analysis by bioinformatics tools. The functional genomics methodology is also changing the experimental strategy from a forward genetics (or mutant to gene) approach to a reverse genetics (or sequenced gene to mutant and function) approach. It is expected that the functional genomics of model plants will contribute to the understanding of basic plant biology as well as the exploitation of genomic information for crop improvement.

§9.1 An Overview

Functional genomics is made possible by the integration of genome-wide analytical tools, genetic resources and biological knowledge of the traits. The functional analysis of all genes in a genome therefore requires high-throughput equipment for reverse genetics, rapid genotyping, discovery of candidate genes, and analysis of gene expression in metabolic pathways. Several terms for global analysis have come into frequent use in recent years:

· Genome: discovery of all genes,

· Transcriptome: quantification of gene expression,

· Proteome: cataloguing of all proteins,

· Metabolome: estimation of all types of metabolites.

To assign cellular functions to the many novel genes predicted from whole-genome analysis, several high-throughput approaches can be employed in functional genomics, such as:

· DNA chips

· Serial analysis of gene expression (SAGE)

· RNA-mediated interference

· Gene traps

· Yeast two-hybrid screening

· Metabolite quantification.

§9.2 Transcriptome

The generation of messenger RNA expression profiles is referred to as transcriptomics, as these profiles are based on the process of transcription. The complement of mRNAs transcribed from a cell's genome is called the transcriptome.

9.2.1 ESTs

Expressed Sequence Tags, or ESTs, are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. ESTs provide researchers with a quick and inexpensive route for discovering new genes, for obtaining data on gene expression and regulation, and for constructing genome maps.

From cDNA to ESTs

Once cDNA representing an expressed gene has been isolated, scientists can then sequence a few hundred nucleotides from either end of the molecule to create two different kinds of ESTs. Sequencing only the beginning portion of the cDNA produces what is called a 5' EST. A 5' EST is obtained from the portion of a transcript that usually codes for a protein. These regions tend to be conserved across species and do not change much within a gene family. Sequencing the ending portion of the cDNA molecule produces what is called a 3' EST. As these ESTs are generated from the 3' end of a transcript, they are likely to fall within non-coding, or untranslated, regions (UTRs), and therefore tend to exhibit less cross-species conservation than do coding sequences.

An overview of how Expressed Sequence Tags are generated

ESTs are generated by sequencing cDNA, which itself is synthesized from the mRNA molecules in a cell. The mRNAs in a cell are copies of the genes that are being expressed. mRNA does not contain sequences from the regions between genes, nor from the noncoding introns that are present within many interesting parts of the genome.

9.2.2 ESTs as Genome Landmarks

For a map to make navigational sense, it must include reliable landmarks or "markers". Currently, the most powerful mapping technique, and one that has been used to generate many genome maps, relies on STS mapping. A Sequence Tagged Site (STS) is a short DNA sequence that is easily recognizable and occurs only once in a genome (or chromosome). The 3' ESTs serve as a common source of STSs due to their likelihood of being unique to a particular species, and provide the additional feature of pointing directly to an expressed gene.

A YAC-Based Rice Transcript Map Containing 6591 EST Sites

A comprehensive rice YAC-based EST map is established (Wu et al, Plant Cell 14, 2002). Clone-specific primer pairs designed from 6713 unique EST sequences (3' end) derived from 19 cDNA libraries were screened on YAC clones and used for map construction in combination with genetic analysis. The map is composed of 2782 YAC clones, containing 364 YAC contigs with 6591 assigned EST sites from 6421 unique sequences, and covers 80.8% of the rice genome.

9.2.3 Large-Scale cDNA Analyses

Large-scale cDNA analyses have several great advantages in genome investigations. For example, isolated and partially characterized cDNA clones have been used not only as expressed sequence tags (ESTs) on RFLP linkage maps, but also as effective probes to screen YAC clones for the construction of physical maps of chromosomes. A good-quality cDNA clone library will also be a powerful tool for isolation and characterization of useful genes for breeding and other applications. Furthermore, if a cDNA library contains cDNAs corresponding to all mRNAs in an organism, then the primary structure of any synthesized protein can be deduced from the cDNA library, since the cDNA sequences encode the amino acid sequences of the expressed proteins.

Schematic diagram of large-scale cDNA analysis

cDNA clones are isolated from various tissues and calluses of plants to capture as many expressed genes as possible. Isolated clones are then partially sequenced and characterized. The data obtained have been used effectively for mapping and will be used for sequencing of genomic regions, investigation of gene expression mechanisms, and so on.

Toward the complete cataloguing of all rice genes

So far, cDNA clones have been randomly chosen from the cDNA libraries derived from various tissues and calluses. Recently, however, the sequence redundancies of isolated clones in the various libraries have been increasing with the increase in the number of analyzed clones. The reason for this trend is thought to be that most genes expressed to a significant extent in each tissue or callus have already been isolated. Thus, from now on, it will be necessary to capture the genes whose rates of expression are very low and/or are specific to certain tissues, growth stages or stress conditions.

9.2.4 Serial Analysis of Gene Expression

Serial Analysis of Gene Expression (SAGE) is a rapid, high-throughput and comprehensive transcriptome profiling tool. SAGE is a sequence-independent method that involves the isolation of a short sequence tag (14 bp) from each transcript and the ligation of many tags to form concatemers. These concatemers are then cloned into a vector and sequenced, and the data are analyzed using SAGE software programs. Recently, a modified version of SAGE called LongSAGE has been reported, which isolates 21-bp tags from each transcript. The LongSAGE method is complementary to computational gene prediction methods. These tags can be used as anchoring points on chromosomes to identify novel exons or genes in the genomic sequences.

SAGE rests on two main principles:

· A short oligonucleotide sequence (9-11 bp) taken from a defined position within a transcript contains enough information to identify that transcript uniquely, and can therefore serve as a tag that distinguishes it from other transcripts;

· The tags can be joined by simple methods into long concatemers. Sequencing each concatemer cloned into a vector and analyzing the result with SAGE software identifies which genes are expressed, and the frequency with which each tag occurs gives the expression abundance of the corresponding gene.

Construction of a SAGE library

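(3) cDNA is synthesized with a biotinylated oligo(dT) primer and converted to double-stranded cDNA. The double-stranded cDNA is digested with an anchoring enzyme that recognizes a 4-bp site, such as NlaIII (recognition site CATG). This releases the 5' sequences, while the biotinylated 3' ends remain bound to streptavidin-coated magnetic beads.

(4) The bead-bound cDNA fragments, which carry the 3' poly(A) tail, are isolated and ligated to linkers (A and B). The cleavage site generally lies about 20 bp downstream of the recognition site, so cleavage releases SAGE tags with the linkers attached.

(5) The linker-carrying SAGE tags are blunt-ended with DNA polymerase (Klenow) and joined by ligase into ditags that carry two linkers. The ditags are amplified by PCR and then digested with the anchoring enzyme, giving tail-to-tail SAGE ditags with anchoring-enzyme sites at both ends.

(6) The SAGE ditags, now free of linkers, are ligated to one another to form concatemers of varying length. After separation by electrophoresis, fragments of suitable size are collected and cloned into a high-copy plasmid vector, forming the SAGE library.

SAGEmap in NCBI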

Serial analysis of gene expression (SAGE) is a powerful method for the identification of gene expression patterns. Its advantages over other methods, such as oligonucleotide or cDNA microarrays, are that SAGE does not depend on prior knowledge of transcript information and is able to detect transcripts expressed at low copy number. Results are reported in terms of absolute or relative numbers of tags, facilitating direct comparison of SAGE results obtained in different laboratories. NCBI has developed a public repository, called SAGEmap, for SAGE transcriptome information from a number of different organisms and tissues. Recently, SAGE has been used to produce the first quantitative expression profile of adult mouse heart, and this transcriptome has been made available at SAGEmap (GSM1681). This represents an important step forward in the quantitative determination of the cardiac transcriptome and is an approach that is likely to be extended to other species in the near future.

The data represented by the SAGEmap

All of the data represented by the SAGEmap resource was derived using the latter method. Briefly, the steps in the latter algorithm are as follows:

1. Locate the NlaIII sites (i.e., CATG "punctuation signals") within the ditag concatemer,

2. extract ditags of length 20-26 bases which fall between these sites,

3. remove repeat occurrences of ditags, including repeat occurrences in the reverse-complemented orientation,

4. define tags as the end-most 10 bases of each ditag, reverse-complementing the right-handed tag,

5. remove tags corresponding to linker (e.g., TCCCCGTACA and TCCCTATTAA), as well as those with unspecified bases (i.e., bases other than A, C, G, or T), and

6. for each tag, count its number of occurrences.

The product of this processing is a list of tags with their corresponding count values, and thus is a digital representation of cellular gene expression.
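
The six steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual SAGEmap software: the function names are hypothetical, the concatemer is assumed to be a plain string, and only the two linker tags quoted in the text are filtered out.

```python
from collections import Counter

# Linker-derived tags named in the text; a real pipeline may use others.
LINKER_TAGS = {"TCCCCGTACA", "TCCCTATTAA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_sage_tags(concatemer, anchor="CATG", min_len=20, max_len=26, tag_len=10):
    """Count 10-base tags in one concatemer sequence."""
    # Steps 1-2: split at the anchoring-enzyme sites and keep only the
    # pieces that lie between two sites and have a plausible ditag length.
    pieces = concatemer.upper().split(anchor)[1:-1]
    seen = set()
    counts = Counter()
    for ditag in pieces:
        if not (min_len <= len(ditag) <= max_len):
            continue
        # Step 3: drop repeated ditags, in either orientation.
        key = min(ditag, reverse_complement(ditag))
        if key in seen:
            continue
        seen.add(key)
        # Step 4: the two tags are the outermost 10 bases; the right-hand
        # tag is reverse-complemented so both read in the same sense.
        left = ditag[:tag_len]
        right = reverse_complement(ditag[-tag_len:])
        for tag in (left, right):
            # Step 5: discard linker tags and tags with unspecified bases.
            if tag in LINKER_TAGS or set(tag) - set("ACGT"):
                continue
            # Step 6: count occurrences.
            counts[tag] += 1
    return counts
```

Calling `count_sage_tags` on each sequenced concatemer and summing the resulting counters gives the tag list with counts that the text describes as a digital representation of gene expression.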

§9.3 Gene Disruption

The generation of many mutants will facilitate the identification and cloning of many valuable genes. Most efforts are presently made to disrupt as many genes as possible, observe a phenotype and deduce a putative function from it.

9.3.1 Deletion Mutants

The first type is a collection of deletion mutants. This collection has been established using fast neutron and gamma ray irradiation as well as diepoxybutane. Screening has been initiated for morphological defects and responses to biotic or abiotic stress. The major difficulty with this collection is that it is not easy to identify the gene causing the phenotype, since there is no tag. On the other hand, such a collection can be used to search for mutations in a given known gene by PCR screening.

Development of IR64 mutants using three mutagens: diepoxybutane (DEB), fast neutron (FN), and gamma ray (GR). Treatment doses are indicated. Gy = Gray.

9.3.2 T-DNA Insertion

The second type of collection is the one derived from direct transfer DNA (T-DNA) transformation. One of the most advanced programmes has produced 22,000 primary transgenic plants from the Dong Jin variety, corresponding to approximately 25,700 tagging events. The vector used contains a GUS reporter gene, and 1.6 to 2.1% of the tested organs show positive staining, indicating that the vector functions as a gene trap. The main advantages of the T-DNA strategy are the stability of the insertions, their low copy number and the possibility of immediately having plants which can be screened. The major drawback is that the plants are transgenic, which might cause difficulty in the future if the negative public perception of field trials of transgenic plants is not reversed.

Two binary vectors were constructed for T-DNA insertional mutagenesis of rice. The first plasmid, pGA1633, contains the promoterless gus gene immediately next to the right border and the cauliflower mosaic virus (CaMV) 35S promoter-hygromycin phosphotransferase (hph) chimeric gene as a selectable marker. The second plasmid, pGA2144, was constructed to increase gene trap efficiency. In this plasmid, an intron carrying three putative splicing donors and acceptors was placed in front of gus. In pGA2144, the CaMV 35S promoter was replaced with the strong promoter of the rice α-tubulin gene OsTubA4 along with its first intron for expression of the selectable marker hph gene.

Schematic diagrams of pGA1633 and pGA2144 T-DNA tagging vectors

The RB and LB in shaded boxes represent the right border and left border of T-DNA, respectively. E = EcoRI site, gus = β-glucuronidase, Tn = nopaline synthase (nos) terminator, p35S = cauliflower mosaic virus 35S promoter, hph = hygromycin phosphotransferase, T7 = transcription termination region of gene 7 of pTiA6, I = the OsTubA4 intron 3 carrying three putative splicing acceptor and donor sites, pOsTubA = the promoter of the rice α-tubulin gene, and OsTubA4 = OsTubA4-1, the first intron of OsTubA4. The gus probe (probe A) and hph probe (probe B) used for DNA blot analyses are indicated.

A variety of GUS staining patterns was observed from the tagged lines. Some were tissue- or organ-specific and others were expressed ubiquitously. This observation supports T-DNA insertion as a random event. The flanking sequences of the GUS-positive lines are being isolated to obtain the genes that provided GUS expression. The next generation of the tagged lines is studied to determine whether these lines display any mutant phenotypes in the organs where the gus gene was activated, and whether the phenotypes cosegregate with the T-DNA.

The frequency of GUS expression in various organs of transgenic plants. GUS activities were analyzed from leaves and roots of 5,353 seedlings, mature flowers of 7,026 transgenic plants, and developing seeds of 1,948 plants.

9.3.3 Ac/Ds System

The controlling element first recognized by Barbara McClintock has become a powerful tool for gene isolation, since the element can be employed to correlate biological phenomena with molecular interactions. Many plant genes have been isolated using transposons as molecular tags without information on the biochemical properties of their gene products or their expression patterns. The first cloning of a plant gene by transposon tagging was performed in 1984. Afterwards, a large number of genes were isolated using endogenous transposons as tags.

Construction of a transposon mutant library

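In plants, transposon tagging makes it possible to isolate and clone functional genes and regulatory sequences systematically by constructing insertional mutant libraries. The strategy rests on the following points:

· Plant cells are totipotent, so whole plants can be regenerated from somatic cells;

· Well-established transformation systems allow foreign genes to be expressed successfully in transgenic plants;

· Plants have many transposon systems whose transposition mechanisms are understood, and random transposon insertion yields large numbers of mutants;

· Probes synthesized from the inserted transposon sequence can be used to isolate the disrupted loci and analyze their composition;

· Transposons can undergo reversion: excision from the insertion site restores the wild-type phenotype in the mutant line.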

Ac element: The 4.6 kb Ac element from the wx-m7 allele of maize was inserted between the CaMV 35S promoter and the hygromycin-resistance (Hmr) gene.

Ds element: The non-autonomous Ds element carries a 1.6 kb internal deletion in the Ac sequence.

AcTPase gene: Ds can excise only in the presence of the AcTPase gene driven by the CaMV 35S promoter.

Schematic diagram of plasmids pCKR262, pCKR234 and pCKR532

In the Ac construct, the Ac element was inserted in the reverse orientation. Triangles at the ends of Ac represent the inverted repeats. P35S, CaMV 35S promoter; Hmr, hygromycin phosphotransferase; T35S, CaMV 35S terminator; H, HindIII; B, BanII.

Target site duplication and footprints of Ac/Ds in rice

Bold letters represent mutated nucleotides in footprints. Asterisks represent deletions in footprints.

The Ac/Ds system was improved by using enhancer-trap and gene-trap plasmids to transform Arabidopsis. This allows disrupted genes that give no visible phenotype to be detected through the expression of a reporter gene (such as GUS).

Transposon vectors

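A. An Ac element inserted into a T-DNA vector; it lacks the terminal inverted repeats and carries the transposase gene and the marker gene KanR;

B. A DsE element in a T-DNA vector; it has both terminal inverted repeats, can be mobilized passively when transposase is present, and is used to trap enhancers;

C. A DsG element in a T-DNA vector; it has both terminal inverted repeats and is used to trap coding genes.

D. When the DsE element inserts near an enhancer, the GUS gene is expressed;

E. When the DsG element inserts into an exon of a chromosomal gene, one of the mRNA splicing patterns can express a fusion protein containing GUS;

F. When the DsG element inserts into an intron of a chromosomal gene, one of the mRNA splicing patterns can likewise express a fusion protein containing GUS;

G. A Ds element excised from the T-DNA has several possible destinations:

Construction of transposon vectors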

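(1) Transposase expression vector. The coding sequence of the Ac transposase is joined to a constitutive promoter such as 35S to build a chimeric gene expression vector. Because the inverted repeats flanking the transposable element have been removed, the transposase gene cannot transpose itself. Plants regenerated from cells transformed with this vector are designated A.

(2) Exon-trap vector. An intron splice-acceptor sequence is inserted between the transposon border and the marker gene. Cells are transformed with this construct and regenerated to give plant B.

(3) Enhancer-trap vector. A core promoter (TATA box) is joined to the coding sequence of the marker gene and flanked on both sides by transposon borders. Transformed cells are regenerated to give plant C.

Insertion events: (1) unlinked to, or far from, the original position; (2) transposition between different chromosomes; (3) close to the original position; (4) immediately adjacent to the IAAH gene.

GUS, β-glucuronidase;

IAAH, indoleacetamide hydrolase, which confers sensitivity to the substrate NAM and is used to test whether a transposon excision event has occurred.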

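Insertion events: (1) Plant A is crossed with plant B. Under the action of the transposase, the transposon from plant B can excise and transpose. When it inserts into an exon, processing of the transcript may yield an mRNA with the correct reading frame. Transformed lines are screened for cosegregation of the mutant phenotype with the marker gene, and selfing then gives homozygous insertion lines that no longer carry the transposase gene.

(2) Plant A is crossed with plant C. Under the action of the transposase, the transposon from plant C can move to a position downstream of an enhancer, which then drives expression of the marker gene. Homozygous insertion lines are isolated in the same way as above, and the tissue-specific sites of enhancer-driven expression are then examined.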

This type of insertional library makes use of a transposable element, such as Ac/Ds of maize, to produce a relatively small number (such as 1,000) of primary anchor mutants. After crossing a Ds-containing plant with an Ac-containing plant, a larger number of secondary insertional mutants can be generated from each primary mutant after transposition. One major advantage of this method is that from 1,000 anchor plant lines, over 200,000 secondary insertional-mutant plant lines can be generated without the need for additional time-consuming transformation steps. The advantage of the above approach is that a large number of plant lines can be screened relatively quickly using PCR. The disadvantage is that when a gene-specific primer is used in the hope of learning more about the function of a gene, the identified insertional mutant plant line may not show any phenotype. Since in Arabidopsis and rice, so far, only a low percentage of insertional mutant plant lines give identifiable phenotypic changes, a considerable amount of effort may be consumed without learning the function of the gene of interest.

All of the insertional mutant libraries described above have been constructed based on random insertions of DNA into the plant genome. In other words, the insertional mutagenesis libraries are produced by a "shotgun"-type approach because the site of insertion is presumed to be random. In shotgun libraries, one usually needs to include a 4-fold redundancy in order to cover most of the genome on a statistical basis. Since, in fact, insertions are not random, one needs to include perhaps a 10-fold redundancy. It is very labor-intensive to produce and analyze approximately 800,000 insertional mutant plant lines in rice.
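
The statistical basis for the 4-fold figure can be illustrated with a simple Poisson model. The sketch below assumes, purely for illustration, that insertions land uniformly at random and that all targets are the same size, which the text itself notes is not true in practice; the function names are hypothetical.

```python
import math


def fraction_hit(redundancy):
    """Expected fraction of targets hit at least once.

    redundancy: average number of insertions per target (e.g. 4 for 4-fold).
    Assumes insertions follow a Poisson distribution (idealized model).
    """
    return 1.0 - math.exp(-redundancy)


def redundancy_needed(target_fraction):
    """Average insertions per target needed to hit the given fraction."""
    return -math.log(1.0 - target_fraction)


for fold in (1, 4, 10):
    print(f"{fold}-fold redundancy: {fraction_hit(fold):.4f} of targets hit")
```

Under this idealized model, 4-fold redundancy already hits about 98% of targets. Real insertion preferences leave some regions under-represented, which is one way to read the suggestion that a 10-fold redundancy may be needed.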

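9.3.5 RNA Interference

In an experiment that used antisense RNA to block gene expression, it was unexpectedly found that double-stranded RNA made up of the sense and antisense strands inhibited expression of the target gene effectively. This phenomenon is called RNA interference (RNAi). In most organisms, when a double-stranded RNA (dsRNA) longer than 500 bp is injected into recipient cells, it suppresses expression of the gene with the corresponding sequence but has no effect on unrelated genes.

The RNA interference phenomenon

RNA is easily synthesized in a test tube by adding phage RNA polymerases that recognize phage promoters housed in expression vectors (rNTPs are of course required as well). Phage T7 and T3 promoters are included in many expression vectors to enable transcription of individual strands of the gene when T7 or T3 polymerases are provided. Often both T7 and T3 promoters flank the cloned gene, so that T7 polymerase can drive expression of one strand and T3 polymerase can drive expression of the complementary strand (scheme below).

RNAi suppresses gene expression in at least two ways:

· by causing methylation of the target gene;

· by promoting degradation of the target mRNA.

RNAi-induced mRNA degradation involves two steps:

· a sequence-specific intermediate (R) is produced from the dsRNA, a step that in the nematode depends on the genes rde-1 and rde-4;

· under the control of the genes rde-2 and mut-7, the intermediate R is directed to act on the target mRNA.

RNAi has some remarkable properties: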

RNAi is highly gene specific. The dsRNA should include at least about 300 bp of coding sequence homology, although specific requirements for size and amount of coding sequence that must be present are not well understood. It does work if introns are present in the intended interfering RNA. It is not clear how susceptible one member of a given gene family is to the inhibitory effects of a dsRNA of another closely related family member, but it has been suggested on the basis of some data that genes 80% identical at the nucleotide level are spared inhibitory effects of their closely related family member.

dsRNA appears to move freely within the worm

It is not necessary to inject dsRNA directly into the gonad to get progeny that exhibit RNAi effects. Injection of dsRNA into the tail or gut will do! This indicates that the introduced dsRNA species must be able to move across cell boundaries freely. Even more remarkably, nematodes can be soaked in dsRNA or can be fed plasmids that make dsRNA and consequently exhibit RNAi effects.

Interference with target gene expression by dsRNA has been found across unicellular and multicellular eukaryotes, including protists, fungi, plants and animals. This mechanism is called post-transcriptional gene silencing (PTGS).

§9.4 DNA Chips

9.4.1 What Exactly is a DNA Microarray?

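A chip is a semiconductor device made from silicon crystal and used for purposes such as storing information and scientific computing; the various computer chips are examples. A silicon chip represents logical 1 or logical 0 by high and low voltage levels in its circuits, and different combinations of 0 and 1 can represent any information in the natural world, which makes the information easy to store.

Classification of biochips

· Bioelectronic chips

· DNA chips: capillary electrophoresis chips, and DNA probe arrays (oligonucleotide chips and cDNA chips)

Biochips take many forms: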

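· By substrate material: nylon membranes, glass slides, plastics, silicon wafers, micro magnetic beads and so on;

· By the biological signal detected: nucleic acids, proteins, fragments of biological tissue and so on;

· By working principle: hybridization, synthesis, ligation and affinity-recognition types.

Oligonucleotide matrix chips

The aim of the Human Genome Project is to establish the sequence of the 3×10^9 nucleotides of the human genome and so decipher all human genetic information. Current sequencing technology relies mainly on the Sanger and Maxam-Gilbert methods; high cost, low efficiency and low reliability are the main drawbacks of these traditional methods. In 1994, scientists at Argonne National Laboratory in the United States and the Engelhardt Institute of Molecular Biology of the Russian Academy of Sciences, with more than 10 million US dollars of funding from the US Department of Energy, a US defense projects agency, the Russian Academy of Sciences and the Russian Human Genome Program, developed a biochip. It has been used to detect gene mutations in samples prepared from β-thalassaemia patients, screening more than one hundred known β-thalassaemia mutations. This biochip decodes genes 1000 times faster than traditional methods. The chip currently costs 100 US dollars apiece, but in future it will cost only 1 US dollar or even less. It has therefore been hailed as a fast, inexpensive biochip for biomedical testing.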

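Oligonucleotide matrix chips

This biochip consists of a gel-coated glass slide of 1 square inch carrying more than 10,000 micro gel pads. A special robot immobilizes short DNA strands (oligonucleotides) 6-9 bases long on each gel pad. Each short DNA strand acts as a sensor, and its position is determined precisely by the computer on the robot. The short DNA strands are like "words" made up of the four DNA letters A, C, G and T. For octamers, that is, DNA strands 8 bases long, there are 4^8 = 65,536 possible combinations, and each sensor is one particular DNA word out of these 65,536.

(a) Two 17-mer target DNAs (I and II), which differ by a single-base change of C to T at position 8, hybridize with nine 8-mer oligonucleotides to form perfect duplexes or duplexes with one mismatched base;

(b) labeled 8-mer oligonucleotides hybridize in solution with the immobilized 17-mer target DNA fragments;

(c) 8-mer oligonucleotides immobilized as a two-dimensional array hybridize with labeled 17-mer target DNA fragments. In hybridizations (b) and (c), perfect duplexes are marked "+" and "¢", respectively; duplexes with internal and some terminal base mismatches are marked "-" and "£", respectively; intermediate duplexes with a G-C terminal base mismatch are marked "±" and "£", respectively.

Oligonucleotide matrix chips

When this biochip is used for DNA sequencing, the sample is first amplified by PCR, cut with restriction enzymes and labeled with a fluorescent dye. A small drop of the sample is then placed on the biochip, and the DNA fragments in the sample hybridize with the complementary oligonucleotides on the chip. After a few hours, the biochip is placed under a fluorescence microscope that is fitted with a CCD camera and connected to a computer. The position of every fluorescently labeled sensor is recorded, showing where the DNA fragments have bound to particular oligonucleotide words on the chip. The computer translates the fluorescence pattern on the CCD into the DNA words of the sample. Using the overlapping parts of the strands, the DNA words are joined into "sentences" and "paragraphs" made up of genes, and the whole gene sequence is determined. This biochip reads the sentences and paragraphs of a whole gene at once, rather than single characters and words. Biochips can therefore carry out DNA sequencing and tests of genetic variation, gene expression, protein interactions and immune responses in a single run.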
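
The idea of joining overlapping DNA "words" into a longer sequence can be sketched as follows. This is a naive illustration, not the algorithm used with the chip described above: it assumes error-free hybridization, that every 8-mer in the target is detected, and that no 7-base overlap occurs twice in the target. The function names `spectrum` and `assemble` are hypothetical.

```python
def spectrum(seq, k=8):
    """Return the set of all k-mers ("words") contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assemble(kmers):
    """Rebuild a sequence from its k-mers by chaining (k-1)-base overlaps."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    suffixes = {w[1:] for w in kmers}
    # The first word is the only one whose prefix is not another word's suffix.
    starts = [w for w in kmers if w[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot find a unique starting word")
    by_prefix = {}
    for w in kmers:
        by_prefix.setdefault(w[:-1], []).append(w)
    seq = starts[0]
    used = {seq}
    while True:
        # Words whose first k-1 bases match the current last k-1 bases.
        candidates = [w for w in by_prefix.get(seq[-(k - 1):], []) if w not in used]
        if not candidates:
            break
        if len(candidates) > 1:
            raise ValueError("ambiguous overlap: repeated (k-1)-mer in target")
        seq += candidates[0][-1]  # each overlapping word adds one new base
        used.add(candidates[0])
    if used != kmers:
        raise ValueError("some words could not be placed")
    return seq


target = "ATGCCGTAAGCTTGACA"  # an arbitrary 17-mer chosen for illustration
words = spectrum(target)
print(len(words), assemble(words) == target)
print(4 ** 8)  # number of possible 8-mer words
```

The ambiguity check shows why longer words help: the more bases each word shares with its neighbour, the less likely it is that two different words fit the same overlap.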

9.4.2 Designing a Microarray Experiment

A DNA Microarray Experiment

1. Prepare your DNA chip using your chosen target DNAs.

2. Generate a hybridization solution containing a mixture of fluorescently labeled cDNAs.

3. Incubate your hybridization mixture containing fluorescently labeled cDNAs with your DNA chip.

4. Detect bound cDNA using laser technology and store data in a computer.

5. Analyze data using computational methods.

The Basic Steps

The whole process is based on hybridization probing, a technique that uses fluorescently labeled nucleic acid molecules as "mobile probes" to identify complementary molecules, that is, sequences that are able to base-pair with one another. Each single-stranded DNA fragment is made up of four different nucleotides, adenine (A), thymine (T), guanine (G) and cytosine (C), that are linked end to end. So, the complementary sequence to G-T-C-C-T-A will be C-A-G-G-A-T. When two complementary sequences find each other (such as the immobilized target DNA and the mobile probe DNA, cDNA, or mRNA), they will lock together, or hybridize.
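
The base-pairing rule in the example above (A with T, G with C) can be written as a small lookup. This is a minimal sketch with hypothetical function names; it treats sequences as plain strings of A, C, G and T.

```python
# Watson-Crick pairing for DNA: A<->T, C<->G.
_PAIRS = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Return the base-by-base complement, written in the same left-to-right order."""
    return seq.upper().translate(_PAIRS)


def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return complement(seq)[::-1]


print(complement("GTCCTA"))          # the pairing partner used in the text
print(reverse_complement("GTCCTA"))  # the same partner strand read 5' to 3'
```

The text writes the partner strand base by base beneath the original; because the two strands of a duplex run antiparallel, sequence databases usually report the same partner as the reverse complement.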

Consider two cells: cell type 1, a healthy cell, and cell type 2, a diseased cell. Both contain an identical set of four genes, A, B, C, and D. Scientists are interested in determining the expression profile of these four genes in the two cell types. To do this, they isolate mRNA from each cell type and use this mRNA as a template to generate cDNA with a "fluorescent tag" attached. Different tags (red and green) are used so that the samples can be differentiated in subsequent steps. The two labeled samples are then mixed and incubated with a microarray containing the immobilized genes A, B, C, and D. The labeled molecules bind to the sites on the array corresponding to the genes expressed in each cell.

After this hybridization step is complete, a researcher will place the microarray in a "reader" or "scanner" that consists of some lasers, a special microscope, and a camera. The fluorescent tags are excited by the laser, and the microscope and camera work together to create a digital image of the array. These data are then stored in a computer, and a special program is used either to calculate the red-to-green fluorescence ratio or to subtract out background data for each microarray spot by analyzing the digital image of the array. Some microarray experiments can contain up to 30,000 target spots. Therefore, the data generated from a single array can mount up quickly.

The Colors of a Microarray

Each spot on an array is associated with a particular gene. Each color in an array represents either healthy (control) or diseased (sample) tissue. Depending on the type of array used, the location and intensity of a color will tell us whether the gene, or mutation, is present in the control DNA, the sample DNA, or both. It will also provide an estimate of the expression level of the gene(s) in the sample and control DNA.

GREEN represents Control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA.

RED represents Sample DNA, where either DNA or cDNA derived from diseased tissue is hybridized to the target DNA.

YELLOW represents a combination of Control and Sample DNA where both hybridized equally to the target DNA.

BLACK represents areas where neither the Control nor Sample DNA hybridized to the target DNA.
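
The colour scheme above can be expressed as a rule on the background-corrected red and green intensities of a spot. The sketch below is illustrative only: the function name is hypothetical, and the thresholds `min_signal` and `fold` are assumed values, since the text does not specify where the boundaries between colours lie.

```python
import math


def spot_colour(red, green, background=0.0, min_signal=100.0, fold=2.0):
    """Classify one spot from its red (sample) and green (control) intensities.

    min_signal and fold are assumed cut-offs chosen for illustration.
    """
    r = max(red - background, 0.0)
    g = max(green - background, 0.0)
    if r < min_signal and g < min_signal:
        return "black"   # neither sample nor control hybridized
    if g == 0.0:
        return "red"     # only the sample hybridized
    if r == 0.0:
        return "green"   # only the control hybridized
    log_ratio = math.log2(r / g)
    if log_ratio >= math.log2(fold):
        return "red"     # more sample than control bound
    if log_ratio <= -math.log2(fold):
        return "green"   # more control than sample bound
    return "yellow"      # roughly equal hybridization


for red, green in [(5000, 1000), (1000, 5000), (3000, 3200), (20, 30)]:
    print(red, green, spot_colour(red, green))
```

Working with the logarithm of the ratio treats a two-fold increase and a two-fold decrease symmetrically, which is why it is a common choice for this kind of comparison.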

9.4.3 Types of Microarrays

There are three basic types of samples that can be used to construct DNA microarrays: two are genomic and the other is "transcriptomic," that is, it measures mRNA levels. What makes them different from each other is the kind of immobilized DNA used to generate the array and, ultimately, the kind of information that is derived from the chip. The target DNA used will also determine the type of control and sample DNA that is used in the hybridization solution.

I. Changes in Gene Expression Levels

Determining the level, or volume, at which a certain gene is expressed is called microarray expression analysis, and the arrays used in this kind of analysis are called "expression chips." The immobilized DNA is cDNA derived from the mRNA of known genes, and, at least in some experiments, the control and sample DNA hybridized to the chip is cDNA derived from the mRNA of normal and diseased tissue, respectively. If a gene is overexpressed in a certain disease state, then more sample cDNA, as compared to control cDNA, will hybridize to the spot representing that expressed gene. In turn, the spot will fluoresce red with greater intensity than it will fluoresce green. Once researchers have characterized the expression patterns of various genes involved in many diseases, cDNA derived from diseased tissue from any individual can be hybridized to determine whether the expression pattern of the gene from the individual matches the expression pattern of a known disease.

II. Genomic Gains And Losses

DNA repair genes are thought to be the body's frontline defense against mutations, and, as such, play a major role in cancer. Mutations within these genes often manifest themselves as lost or broken chromosomes. Using different laboratory methods, researchers can measure gains and losses in the copy number of chromosomal regions in tumor cells. Then, using mathematical models to analyze these data, they can predict which chromosomal regions are most likely to harbor important genes for tumor initiation and disease progression. The results of such an analysis may be depicted as a hierarchical treelike branching diagram, referred to as a "tree model of tumor progression."

III. Mutations in DNA

When researchers use microarrays to detect mutations or polymorphisms in a gene sequence, the target, or immobilized DNA, is usually that of a single gene. In this case, though, the target sequence placed on any given spot within the array will differ from that of other spots in the same microarray, sometimes by only one or a few specific nucleotides. One type of sequence commonly used in this type of analysis is called a Single Nucleotide Polymorphism, or SNP, a small genetic change, or variation, that can occur within a person's DNA sequence. Another difference in mutation microarray analysis, as compared to expression or comparative genomic hybridization (CGH) microarrays, is that this type of experiment only requires genomic DNA derived from a normal sample for use in the hybridization mixture.

Chapter 10 Comparative Genomics

We have already seen how similarities between homologous genes from different organisms provide one way of assigning a function to an unknown gene. This is an example of how knowledge about the genome of one organism can help in understanding the genome of a second organism. The possibility that comparative genomics might be a valuable means of deciphering the human genome was recognized when the Human Genome Project was planned in the late 1980s, and the Project has actively stimulated the development of genome projects for model organisms such as the mouse and fruit fly. In this Chapter, we will explore the extent to which comparisons between different genomes are proving useful.

§10.1 Comparison Based on Genomic Maps

The basis of comparative genomics is that the genomes of related organisms are similar. The argument is the same one that we considered when looking at homologous genes. Two organisms with a relatively recent common ancestor will have genomes that display species-specific differences built onto the common plan possessed by the ancestral genome. The closer two organisms are on the evolutionary scale, the more related their genomes will be. If the two organisms are sufficiently closely related, then their genomes might display synteny, the partial or complete conservation of gene order. Then it is possible to use map information from one genome to locate genes in the second genome.
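
One simple way to put a number on conservation of gene order is to ask how many shared markers appear in the same relative order on two maps. The sketch below uses the longest common subsequence for this. It is an illustrative measure chosen here, not a method taken from the text; the marker names are made up, and inverted segments are not treated specially.

```python
def conserved_order(order_a, order_b):
    """Return (markers in conserved order, markers shared by both maps).

    order_a, order_b: lists of marker names in map order, each name used once.
    """
    shared = set(order_a) & set(order_b)
    a = [m for m in order_a if m in shared]
    b = [m for m in order_b if m in shared]
    # Longest common subsequence by dynamic programming, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1], len(shared)


map_1 = ["m1", "m2", "m3", "m4", "m5"]
map_2 = ["m1", "m3", "m2", "m4", "m5", "m9"]
print(conserved_order(map_1, map_2))
```

Here five markers are shared and four of them keep the same relative order, so one local rearrangement separates the two hypothetical maps.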

10.1.1 Plant Comparative Genetics

In the mid 1980s, when restriction fragment length polymorphism (RFLP) analysis was first applied to plants--tomato and maize in the United States and wheat in the United Kingdom--it became clear that complementary DNA RFLP probes could be cross-mapped to provide anchors that allowed genomes to be compared. Two studies, one that showed that the tomato and potato maps were very similar and another that showed that the three diploid genomes that form present-day hexaploid bread wheat had retained almost identical gene orders, gave the first hints that plant gene linkage arrangements might have remained conserved over long evolutionary periods. Over the past 10 years, close relationships have been demonstrated between the genomes of almost all economic grass crops, between the Solanaceae crops, between the Brassica crops and Arabidopsis, between pines, between rosaceous fruit tree species, and between several legumes.

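Comparison of Grass (Poaceae) Genomes

Rice, 2n=2x=24, n=12: the short arms of chromosomes 11 and 12 carry a duplicated segment (R11 and R12).

Triticeae (the wheat tribe), x=7: relative to the rice genome, rice R10 is inserted into the long arm of R5 to form wheat group 1, R8 is inserted into R6 to form group 7, and R7 is inserted into R4 to form group 2.

Oat, hexaploid AACCDD: R10 is inserted into R5, and R8 is inserted into R6.

Maize, n=10: most markers have duplicated loci. Maize was an autotetraploid 16 million years ago, with x=5. In contrast to wheat, the two "genomes" of maize have no completely homologous chromosomes. Most of the larger chromosomes (1, 2, 3, 4, 6) make up one genome, while the smaller chromosomes (5, 7, 8, 9, 10) make up the other.

Sorghum, 2n=2x=20, n=10: taxonomically close to maize, but there is no evidence that it was ever an autotetraploid.

Sugar cane: cultivated species have 2n=80, x=10; wild species have 2n=40-128, x=8.

Foxtail millet, 2n=2x=18, n=9: five chromosomes are essentially unchanged relative to rice, and the species is very closely related to sorghum.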

10.1.2 Is Colinearity Good Enough for Cross-Species Gene Isolation?

If colinearity is perfect, then it should be possible to isolate genes that have been mapped precisely on the genetic map in large genome species by map-based cloning in a smaller genome model species, such as wheat genes in rice or oilseed rape genes in Arabidopsis. A map-based cloning approach in rice has been used for the isolation of the wheat Ph gene, which controls chromosome pairing. Similarly, work is under way to isolate Rpg1, a stem rust resistance gene in barley, by "walking" in rice. Although neither walk has yet been concluded successfully, remarkably precise colinearity has been observed over most of the corresponding regions. However, in both cases, breaks in complete correspondence did occur in or near the target regions. These indications that everything may not be perfect at the microlevel are similar to results from human-mouse comparisons, where colinearity is often interrupted by insertions, deletions, and inversions.

Within the Cruciferae, colinearity looks to be extremely strong between Arabidopsis and oilseed rape, which are said to be only 10 million years apart, although, as yet, very little genomic sequence is available from the crop genome. One of the first results to emerge from the cross-mapping of Arabidopsis genes onto the Brassica genomes was that the basic Arabidopsis gene set is essentially triplicated in the diploid Brassica crops. The DNA content of the diploid Brassica crops, at 480 Mb, is, in fact, about three times that of Arabidopsis. Triplicated regions of similar genetic length have been identified that correspond with almost precise colinearity to segments of Arabidopsis that carry major flowering time genes. In another fine-mapping study, T.C. Osborn's group at the University of Wisconsin, Madison, has established that the major vernalization-responsive flowering time gene in Brassica rapa, VFR2, is likely to be a homolog of FLC, which is located at the top of Arabidopsis chromosome 5. Preliminary data from R. Schmidt's lab in Cologne show that there is extensive microcolinearity between a 200-kb region of Arabidopsis chromosome 4 and a region of the Capsella rubella genome where 17 Arabidopsis genes mapped to four Capsella cosmid contigs. Within the contigs, gene orders were completely conserved and distances between genes, where they were established, were highly similar.

Ten million years is a short time in crucifer evolution

Although the genomes of B. rapa (red circle), B. oleracea (blue circle), and B. nigra (green circle) have different chromosome numbers, the maps of the three genomes can be aligned simply, revealing only a few chromosomal rearrangements that disturb complete colinearity. Moreover, each Brassica genome comprises three complete Arabidopsis genomes. The blue regions correspond to a 7.7-Mb Arabidopsis chromosome 4 region surrounding FCA. The yellow regions relate to the 2.2-Mb chromosome 5 region surrounding CONSTANS.

10.1.3 Animal Comparative Genetics

Traces of evolutionary history appear in functional morphology and DNA sequences of living and extinct species. These remnants of the past can lead to insights into the relationships among extant groups of animals, the forces driving evolution, and the utility of animal models for studying human disease. We present below one evolutionary interpretation of the still-disputed hierarchy of surviving placental mammalian orders (excluding monotremes and marsupials), a synthesis of accumulated molecular and morphological inferences. The time scale is derived largely from molecular data; indicated fossil remains are much younger, raising controversies around the precise age of mammal ancestors.

§10.2 Comparison Based on Genomic Sequences

Relatively few plant species have been subjected to comprehensive sequence analysis. Comparisons across species are incredibly valuable, but the questions that can be asked are quite different, depending on the degree of relatedness. For instance, identification of conserved processes between plants and animals will help us to understand the very basis of multicellular existence. However, to understand what makes mouse different from human, or rice from Arabidopsis, we need to identify both the shared and diverged genic complements of the individual species. When more plant sequencing projects are initiated, they will benefit greatly from the availability of rice sequences for gene annotation and map assembly. More important, the comparisons and contrasts between the grasses will provide our first step toward understanding the commonalities and specific niche exploitations that have made this family of plants exceptionally successful since it first emerged 50 to 70 million years ago.

10.2.1 GC Contents

Genomic, exon, and intron GC contents. The average genomic GC content for prokaryotes and eukaryotes varies widely. It ranges from less than 22% in the human malaria parasite, Plasmodium falciparum, to more than 68% in the large amplicon of Halobacterium sp. NRC1. Local heterogeneity in GC content can be enormous, ranging from 26 to 65% in the human genome alone. In contrast, AG content (purine) is homogeneous, fluctuating by just a few percent about a mean of 50%. Discussions have focused on the characterization of the human genome as a mosaic of GC-rich and AT-rich "isochores," which are observed in warm-blooded vertebrates, but not in cold-blooded vertebrates.

Major differences between sequence content in A. thaliana, rice, and human are observable even at the simplest level, from distributions of genomic GC content. We used a 500-bp window size, to obtain a smaller size than that of most plant genes. As previously reported, the A. thaliana distribution displayed a "shoulder" on the AT-rich side, which could be attributed to the sizable fraction of the genome that was in intergenic DNA. The primary peak at 0.382 was nearly identical to the 0.388 GC content of the average A. thaliana gene. In contrast, no shoulder was observed in rice. However, a "tail" was apparent on the GC-rich side. The human distribution also displayed no shoulder, but a minor tail might have been present.
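A windowed GC distribution of this kind can be sketched in a few lines of Python. The code below is a minimal illustration, not the analysis pipeline behind the figures: the function names are hypothetical, the toy sequence is invented, and the use of non-overlapping windows (step equal to window size) is an assumption, since the text states only the 500-bp window size.

```python
def gc_fraction(seq):
    """GC fraction of a sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return None  # window contains no unambiguous bases
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window=500, step=500):
    """GC fraction for each full window along the sequence.

    step == window gives non-overlapping windows (an assumption here).
    """
    values = []
    for start in range(0, len(seq) - window + 1, step):
        values.append(gc_fraction(seq[start:start + window]))
    return values


# Invented toy sequence: an AT-only, a GC-only, and a mixed 500-bp stretch.
toy = "AT" * 250 + "GC" * 250 + "ATGC" * 125
print(windowed_gc(toy))
```

Binning the returned values into a histogram gives the kind of distribution described above, in which a "shoulder" or "tail" appears as an excess of windows on the AT-rich or GC-rich side of the main peak.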

GC content distribution for exons and introns in A. thaliana, O. sativa, and H. sapiens. All exon and intron sequences were derived from cDNA-to-genomic alignments.

GC content for individual exons as a function of their gene size, in A. thaliana, O. sativa, and H. sapiens. All exon and intron sequences were derived from cDNA-to-genomic alignments. Each data point is a single exon. Exons for the same gene are plotted at the same abscissa and connected by a vertical line. The genes are sorted by size, where gene size is defined as the sum of exon and intron lengths. To make the figure legible, we use constant spacing between genes, thus resulting in nonuniform abscissa labels. We show only the 41 largest genes for which the entire cDNA could be aligned to genomic sequence. Given the draft nature of the rice genome, some of the largest rice genes had to be omitted.

GC content for homologous genes in A. thaliana and O. sativa as a function of gene position from the 5' to 3' end, computed on a sliding 129-bp window (equal to the median exon size in rice). Only the coding region is shown. GenBank locus identifiers are specified in the legend. The smaller gene is "potassium channel beta subunit," and the larger gene is "phytochrome B."

10.2.2 Functional Classification of Genes

Although 25,426 genes have been identified in A. thaliana, fewer than 10% have been documented experimentally. Consequently, functional classification of plant genes must rely heavily on homology, coupled with a few nonhomology-based methods, such as phylogenetic profiling, correlated gene expression, and conserved gene orders. Only 27.3 and 36.3% of A. thaliana genes have been classified by InterPro and Gene Ontology Consortium, respectively. In total, 15.9 and 20.4% of rice gene predictions were classified by InterPro and Gene Ontology Consortium, respectively. As a percentage of classified genes, the predicted gene sets for rice and A. thaliana are similarly distributed among different functional categories. We depict Gene Ontology Consortium because more genes were classified.

Functional classification of rice genes, according to Gene Ontology Consortium, and assigned by homology to categorized A. thaliana genes. In this ontology, "biological process," "cellular location," and "molecular function" are treated as independent attributes. Only 36.3% of the 25,426 predicted genes for A. thaliana are classified. For rice, only 20.4% of the 53,398 complete predictions, with both initial and terminal exons, could be classified.

Homology between monocots-eudicots. The asymmetry in the monocot-eudicot analysis was striking. About 80.6% of A. thaliana genes had a homolog in rice. The mean extent of homology was 80.1% of the protein length, and there was 60.0% amino acid identity. In contrast, only 49.4% of predicted rice genes had a homolog in A. thaliana. The mean extent of homology was 77.8% of the protein length, and there was 57.8% amino acid identity.

Distributions in extent of homology and maximum amino acid identity, for Arabidopsis-to-rice and rice-to-Arabidopsis comparisons. These values are based on a comparison of predicted protein sequence against all six reading frames of the target genome sequence.

Size distribution of predicted rice genes with a homolog (WH), and with no homolog (NH), in A. thaliana, plus exon GC content and intron size for a random sampling of 3000 NH genes. Gene size refers to the size of the predicted coding region.

10.2.3 Synteny Between Rice and Arabidopsis

Having established the major difference between the gene sets for rice and A. thaliana, we now consider the similarity. We had reported that 80.6% of the predicted A. thaliana genes, and 94.9% of the SwissProt genes, had a homolog in rice. The actual number is likely to be even higher, because the gradients kept us from identifying potential homologs for smaller genes. We know that, within A. thaliana, the genes are highly duplicated. Are these genes duplicated in the same manner when mapped to rice? As a proxy for the number of gene homologs within and between genomes, we used the "hits per gene," as defined in the notes. Considering that, in the Arabidopsis-to-rice comparison, we used a low coverage rule of 25% to compensate for the gradients, it was inevitable that we would experience more difficulty than usual in distinguishing between duplicated domains and duplicated genes. Thus, the number of hits per gene is an overestimate of the number of gene homologs.

Distributions in number of hits per gene and maximum-versus-minimum amino acid identity, for Arabidopsis-to-rice and Arabidopsis-to-Arabidopsis comparisons. "Hits per gene" is a proxy for the number of gene homologs, between and within genomes.

Distributions in the number of hits per gene, sorted according to Gene Ontology Consortium, for Arabidopsis-to-rice and Arabidopsis-to-Arabidopsis comparisons. This figure shows only the 36.3% of predicted A. thaliana genes that are classified.

10.2.4 Rice As a Model for Other Cereals

Sequence-based markers from syntenic regions of one cereal can be used for fine mapping and candidate gene identification across cereals. The small genome of rice provides the genomic foundation for all cereals--enabling efficient identification of orthologous genes, regulatory regions, gene functions--and may facilitate the sequencing of other cereal genomes. The extent of gene conservation was determined by compiling a set of full-length, nonredundant complete coding sequences for each nonrice cereal species. At significant similarity levels, almost every cereal protein was found to have a related gene in rice. At higher stringency, 80 to 90% of cereal gene queries identified rice homologs. These observations suggest that most genes are conserved across cereals, and that phenotypic variation is due to a small number of different genes or functional differences within similar genes.

TBLASTN comparison of rice versus other cereal proteins from GenBank. A set of full-length nonredundant cereal protein sequences was compiled using all available sequences from GenBank. Pairs of proteins with greater than 90% identity over an alignment of at least 100 amino acids were considered redundant and one of the two was removed.
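The redundancy rule in this caption can be sketched as a simple greedy filter. The Python below is a hedged illustration: it assumes the pairwise identities and alignment lengths have already been computed by some aligner, the protein identifiers and values are invented, and the function name `remove_redundant` is hypothetical. The caption does not say which member of a redundant pair was dropped, so keeping the first one encountered is an assumption.

```python
def remove_redundant(ids, pair_hits, min_identity=90.0, min_length=100):
    """Greedy redundancy filter over precomputed pairwise alignments.

    ids: sequence identifiers in the order they should be considered.
    pair_hits: (id_a, id_b, percent_identity, alignment_length) tuples.
    A pair is redundant if identity > min_identity over an alignment of
    at least min_length residues; the later member of the pair is dropped.
    """
    redundant_pairs = set()
    for a, b, identity, aln_len in pair_hits:
        if a != b and identity > min_identity and aln_len >= min_length:
            redundant_pairs.add(frozenset((a, b)))

    kept = []
    for seq_id in ids:
        # Drop the sequence if it is redundant with anything already kept.
        if any(frozenset((seq_id, k)) in redundant_pairs for k in kept):
            continue
        kept.append(seq_id)
    return kept


# Invented example data.
ids = ["P1", "P2", "P3", "P4"]
pair_hits = [
    ("P1", "P2", 95.0, 250),  # redundant: above 90% over at least 100 aa
    ("P1", "P3", 95.0, 60),   # alignment too short to count
    ("P3", "P4", 90.0, 300),  # exactly 90% is not "greater than 90%"
]
nonredundant = remove_redundant(ids, pair_hits)
print(nonredundant)
```

Because the filter is greedy, the result can depend on the order of `ids`. Sorting by sequence length first, so that the longest representative is retained, is a common alternative choice.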

Rice-maize synteny. Maize markers were mapped to the rice genome in silico. Maize map and sequence information were derived from MaizeDB (610 markers) and GenBank, respectively. Maize chromosomes are indicated along the vertical black lines; positions of specific markers and bins are defined by horizontal lines. Rice chromosomes are represented by numbered, colored rectangles. Significant homology (at least 80% identity, over 100 continuous base pairs, between a maize chromosomal region and a particular rice region) is indicated by a colored rectangle to the right of the maize chromosome.
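The significance threshold used in this caption (at least 80% identity over at least 100 base pairs) can be applied to alignment results with a short filter. The sketch below assumes the hits are available in BLAST+'s default 12-column tabular layout (query, subject, percent identity, alignment length, and so on). The source does not state which output format was used, and the marker names and values are invented for illustration.

```python
def significant_hits(tabular_text, min_identity=80.0, min_length=100):
    """Keep hits meeting the identity and alignment-length thresholds.

    tabular_text: tab-separated lines whose first four columns are
    query id, subject id, percent identity, alignment length
    (the column order of BLAST+ tabular output, assumed here).
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if identity >= min_identity and length >= min_length:
            hits.append((query, subject, identity, length))
    return hits


# Invented example rows (marker names and values are hypothetical).
example = "\n".join([
    "markerA\tchr1\t85.20\t140\t20\t1\t1\t140\t5000\t5139\t1e-30\t150",
    "markerB\tchr3\t92.00\t80\t6\t0\t1\t80\t700\t779\t1e-20\t110",
    "markerC\tchr5\t78.50\t300\t64\t1\t1\t300\t100\t399\t1e-40\t200",
    "markerD\tchr10\t80.00\t100\t20\t0\t1\t100\t900\t999\t1e-15\t95",
])
hits = significant_hits(example)
for hit in hits:
    print(hit)
```

Here markerB fails on length and markerC on identity, so only markerA and markerD would be drawn as colored rectangles under this rule. Note that an alignment length of 100 with gaps is not strictly "100 continuous base pairs"; a stricter check would also require zero gap openings.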

Maize QTLs mapped to the rice genome.

(A) Rice-maize comparative QTL mapping. Portions of maize chromosomes, represented by numbered, colored rectangles, that show sequence similarity (at least 80% identity over 100 continuous base pairs) with specific regions of the top of rice chromosome 1 are shown. The rice map is from the IRGSP. Genetic distance is indicated by the numbers to the left of the rice chromosome; specific markers that map to this region are indicated to the right. Regions from maize chromosomes 1, 2, and 7 show similarity with the tip of rice chromosome 1 as shown, and maize QTLs in these regions are indicated. The region represented by the thick black line comprises ~650 kbp in rice; each colored block represents varying amounts of maize DNA.

(B) Detailed example of rice-maize comparative QTL mapping. Grain yield QTL 21 is mapped to maize map bin 1.03 between cDNA markers csu 710 and csu 392, and is syntenic with rice chromosome 3. Additional markers from the same maize bin confirm microsynteny in this target region, which contains ~220 candidate genes and 120 SSR markers in rice. Dotted lines connect homologous genes with the indicated BLAST expectation values.

10.2.5 Rice Polymorphisms

Polymorphism rates relative to 93-11 (indica). Comparisons were made to finished BAC sequences from GLA (indica) and Nipponbare (japonica), as well as to PA64s contigs. Rates were computed for repeated and unique regions, in single-base substitutions (SNPs) and insertion-deletions (InDels).

                                   Nipponbare (japonica)   PA64s   GLA (indica)
SNPs in repeated sequence (%)      0.88                    0.68    0.65
InDels in repeated sequence (%)    0.33                    0.45    0.27
SNPs in unique sequence (%)        0.50                    0.35    0.50
InDels in unique sequence (%)      0.14                    0.16    0.15
Repeated sequence fraction (%)     24.1                    25.5    22.8
Unique sequence fraction (%)       74.8                    74.3    74.1
Parts unalignable by BLAST (%)     1.1                     0.3     3.1
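As a rough arithmetic illustration of how the per-class rates combine, the sketch below weights the repeated-sequence and unique-sequence SNP rates by their respective sequence fractions. The combined figure is not reported in the source. It is computed here only from the table values, under the assumption that the rates and fractions refer to the same aligned sequence, and the function name `combined_rate` is hypothetical.

```python
# Values copied from the table: SNP rates (%) and sequence fractions (%).
table = {
    "Nipponbare (japonica)": {"snp_rep": 0.88, "snp_uni": 0.50,
                              "frac_rep": 24.1, "frac_uni": 74.8},
    "PA64s":                 {"snp_rep": 0.68, "snp_uni": 0.35,
                              "frac_rep": 25.5, "frac_uni": 74.3},
    "GLA (indica)":          {"snp_rep": 0.65, "snp_uni": 0.50,
                              "frac_rep": 22.8, "frac_uni": 74.1},
}


def combined_rate(rate_rep, rate_uni, frac_rep, frac_uni):
    """Fraction-weighted mean rate over the alignable sequence.

    Unalignable sequence is excluded by dividing by the aligned total.
    """
    aligned = frac_rep + frac_uni
    return (rate_rep * frac_rep + rate_uni * frac_uni) / aligned


combined = {
    name: combined_rate(row["snp_rep"], row["snp_uni"],
                        row["frac_rep"], row["frac_uni"])
    for name, row in table.items()
}
for name, rate in combined.items():
    print(f"{name}: {rate:.2f}% SNPs over aligned sequence")
```

The same function applies to the InDel rows. Because repeats make up only about a quarter of each comparison, the combined rate stays close to the unique-sequence rate.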

Friends and relations. Phylogenetic relationships among multicellular organisms whose genomes have been sequenced or are currently being sequenced. Rice is the only cereal to have its genome sequenced. The genome sequence of the model plant Arabidopsis was largely completed in 2000. These two genome sequences will enable a detailed comparison between monocotyledonous and dicotyledonous flowering plants to be made. Species in dark blue are those with completed sequences or drafts that have been published; sequencing of genomes for species in turquoise is ongoing. Ma, millions of years ago.

Chapter 11 Bioinformatics

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize and index the data, and for specialized tools to view and analyze the data.

Biology in the 21st century is being transformed from a purely lab-based science to an information science as well.

Bioinformatics Milestones

Listed below are some of the major events in bioinformatics over the last several decades. Most of the events in the list occurred long before the term "bioinformatics" was coined.

Current Status in Utilization

Genome Mapping

Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wishing to localize a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high quality, genome-wide maps are available to the scientific community for use in their research.

Computerized maps make gene hunting faster, cheaper and more practical for almost any scientist. In a nutshell, a scientist would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then use a physical map to examine the region of interest close up, in order to determine a gene's precise location. In light of these advances, a researcher's burden has shifted from mapping a genome or genomic region of interest, to navigating a vast number of Web sites and databases.

Map Viewer: A Tool for Visualizing Whole Genomes or Single Chromosomes

NCBI's Map Viewer is a tool that allows a user to view an organism's complete genome; integrated maps for each chromosome (when available); and/or sequence data for a genomic region of interest. When using Map Viewer, a researcher has the option of selecting either a "Whole-Genome View" or a "Chromosome or Map View." The Genome View displays a schematic for all of an organism's chromosomes, while the Map View shows one or more detailed maps for a single chromosome. If more than one map exists for a chromosome, Map Viewer allows you to display these maps simultaneously.

Organisms represented in Map Viewer:

•Arabidopsis thaliana--a plant

•Fruit fly

•Human

•Mouse

•Corn

Using Map Viewer, researchers can find answers to questions such as:

•Where does a particular gene exist within an organism's genome?

•Which genes are located on a particular chromosome and in what order?

•What is the corresponding sequence data for a gene that exists in a particular chromosomal region?

•What is the distance between two genes?

Protein Modeling

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or NMR spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target).

Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. As the different genome projects are producing more sequences, and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.

The Four Steps of Protein Modeling

•Identify the proteins with known three-dimensional structures that are related to the target sequence.

•Align the related three-dimensional structures with the target sequence and determine those structures that will be used as templates.

•Construct a model for the target sequence based on its alignment with the template structure(s).

•Evaluate the model against a variety of criteria to determine if it is satisfactory.

Cn3D 4.0 display shows a structural alignment of two human tyrosine kinases, 1BYG and 1FG1, as computed by the VAST algorithm. The structures include a bound inhibitor, shown in a spacefilling representation along with all atoms within a radius of 5 angstroms in a ball and stick representation. The alignment viewer displays a portion of the structural alignment, with aligned residues in capital letters and each aligned block represented in the bar above the sequences.

Evolutionary Biology

New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to the fact that two genes share a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.

Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. So far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show a significant sequence conservation indicating a clear evolutionary relationship are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate the time of divergence between two organisms since they last shared a common ancestor.
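The observation that closer relatives have more similar sequences can be quantified with the simplest distance measure, the proportion of differing sites between two aligned sequences (often called the p-distance). The sketch below is illustrative only: the sequences are invented, the species labels and the function name are hypothetical, and the sequences are assumed to be already aligned to equal length, with "-" marking gaps.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ("-") are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Invented aligned sequences for three hypothetical species.
aligned = {
    "species_1": "ATGGCCATTG",
    "species_2": "ATGGCCATTA",
    "species_3": "ATGACTAT-A",
}
d12 = p_distance(aligned["species_1"], aligned["species_2"])
d13 = p_distance(aligned["species_1"], aligned["species_3"])
print(f"species_1 vs species_2: {d12:.3f}")
print(f"species_1 vs species_3: {d13:.3f}")
```

In this toy case species_1 differs less from species_2 than from species_3, which would suggest a more recent common ancestor with species_2. Converting such distances into divergence times requires a substitution model and a calibrated rate, which this sketch does not attempt.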

The rapidly emerging field of bioinformatics promises to lead to advances in understanding basic biological processes, and in turn, advances in the diagnosis, treatment, and prevention of many genetic diseases. Bioinformatics has transformed the discipline of biology from a purely lab-based science to an information science as well. Increasingly, biological studies begin with a scientist conducting vast numbers of database and Web site searches to formulate specific hypotheses or design large-scale experiments. The implications behind this change, for both science and medicine, are staggering.
