About VISDB

Oncogenic viruses account for about one sixth of tumorigenesis cases. A detailed and clear understanding of viral integration location and distribution, and the identification of specific viral sequences within the human genome, is helpful for curing cancer caused by viral infection. Investigating the integration of viruses into host genes or sequences from the perspective of genomics is of great significance for screening virus-susceptible populations, preventing virus infection, developing new therapeutics and precisely optimizing treatment. In this study, we develop VISDB (Virus Integration Site DataBase) to provide a knowledge base of information related to virus integration sites in the human genome, and to benefit researchers who study viruses and related malignant diseases.
VISDB covers 9 main viruses, including 5 DNA oncoviruses (HBV, HPV, EBV, MCV, AAV2) and 4 RNA retroviruses (HIV, MLV, HTLV, XMRV). The current version of VISDB deposits 77,602 integration sites carefully curated from 108 publications (some articles harbor multiple viruses).
  1. HBV, 20,558 VISes, 45 publications
  2. HPV, 5,118 VISes, 31 publications
  3. EBV, 1,144 VISes, 7 publications
  4. MCV, 55 VISes, 9 publications
  5. AAV2, 24 VISes, 3 publications
  6. HIV, 16,797 VISes, 10 publications
  7. HTLV-1, 33,845 VISes, 4 publications
  8. MLV, 32 VISes, 1 publication
  9. XMRV, 29 VISes, 2 publications
Figure 1 shows an overview of VISDB. Scientific literature harboring VISes, the VISes themselves, and VIS-related data such as virus sequences, genes, miRNAs, fragile sites and other kinds of annotations are curated and stored in VISDB. We first extract VISes from literature downloaded from public databases such as PubMed and ScienceDirect. Target genes, nearby genes, fragile sites and miRNAs are provided in some publications, but in other cases we can get only part of this information from the original publication. However, the missing information can be curated if the publication provides the exact integration position, including genome assembly, chromosome and location on the chromosome. Therefore, we discard studies that do not contain exact integration position information. Because VISDB already holds a large number of VISes, discarding a small number of them does not affect the integrity of VIS coverage. Oncogenes and tumor suppressor genes are further screened out from target genes and nearby genes, and the associations between genes and miRNAs are also curated. Finally, functions such as browse, search, curation, gene feature, microRNA feature, download, statistics, help, and feedback are provided for users.
...
Figure 1. Overview of VISDB

VIS data model

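We build a general virus integration event model to harmonize virus integration sites from various original publications or datasets. The model covers the detection, analysis and archiving of virus integration sites (VISes), and allows the junction sequence to represent complex integration cases such as rearrangements, mutations, reverse insertions and microhomology junctions. As shown in Figure 2, our model consists of four kinds of data: basic information, virus sequence, target sequence and junction sequence. The basic information category covers the objects or tools used to detect or describe a VIS in the original article: the sample, the experimental assay, the literature source, etc. In addition, we add some identifying attributes, such as a status attribute that shows whether the VIS has been validated by experimental assay, and an integrity attribute that assesses the completeness of the VIS.
For the virus sequence category of VISDB, we extract and record information about the reference genome to which the virus sequence was originally aligned, including the genome assembly, the genome name, the FASTA file of the virus genome, and the hyperlink to the genome in the NCBI Nucleotide database. If the reference genome and breakpoints are provided, we also record the integrated sequence and metadata about each sequence, such as the start breakpoint, the end breakpoint and the viral gene name. When two breakpoints are given, we extract the sequence between the start breakpoint and the end breakpoint. When only one breakpoint is given, we extract two 100 bp sequences, one upstream and one downstream of the VIS, and concatenate them into a single sequence divided by the substring "||".
The integration position on the human reference genome is critical for curating a VIS. In the best case, we can extract the chromosome, cytoband, location on the chromosome and genome assembly from the original article. The sequences into which the virus integrated are also extracted and deposited in the database. For a VIS with a single location, we record the sequences both upstream and downstream of the insertion site. However, if both a start position and an end position are provided, then the sequence upstream of the start position, the sequence downstream of the end position, and the sequence between the two positions are curated according to the genome assembly declared in the article.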
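The two extraction rules for virus sequences can be sketched in Python. This is a minimal illustration rather than the VISDB implementation: the function name `extract_virus_sequence`, the 1-based inclusive coordinates, and the choice to start the downstream flank at the breakpoint itself are all assumptions made for the example.

```python
FLANK = 100       # flank length in bp, as stated in the text
SEPARATOR = "||"  # substring dividing the upstream and downstream flanks


def extract_virus_sequence(genome, start, end=None, flank=FLANK):
    """Return the integrated sequence for a VIS on a virus reference genome.

    Coordinates are assumed to be 1-based and inclusive (hypothetical convention).
    """
    if end is not None:
        # Two breakpoints: take the sequence between them.
        lo, hi = sorted((start, end))
        return genome[lo - 1:hi]
    # One breakpoint: up to `flank` bp on each side, joined by the separator.
    upstream = genome[max(0, start - 1 - flank):start - 1]
    downstream = genome[start - 1:start - 1 + flank]
    return upstream + SEPARATOR + downstream
```

Near the ends of the genome the flanks are simply shorter than 100 bp; a circular genome such as HBV would need wrap-around handling that this sketch does not attempt.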
The junction sequence category plays a significant role in analyzing integration patterns. Although we wish to store the FASTA file of the junction sequence together with its annotation, coverage and mapped reads for each VIS, we find junction sequences for fewer than 5 percent of VISes. Furthermore, we mark all endpoints pertaining to the human sequence or the virus sequence and map these points to specific positions in the reference genome, so that the integration event can be visualized as shown in Figure 3.
...
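Figure 2. Data view of an integration event. An integration event contains basic information about the VIS, information about the integrated virus sequence, the target sequence or position in the host genome, and the junction sequence after the integration event takes place. A double ellipse means that multiple virus sequences may exist.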
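One way to picture the four data categories is as a set of record types. The sketch below is hypothetical: the class names, field names and the `integrity` formula are invented for illustration and are not the VISDB schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BasicInfo:
    sample: Optional[str] = None
    assay: Optional[str] = None
    pubmed_id: Optional[str] = None
    validated: bool = False  # status attribute: validated by experimental assay?


@dataclass
class VirusSequence:
    reference_genome: Optional[str] = None
    start_breakpoint: Optional[int] = None
    end_breakpoint: Optional[int] = None
    gene_name: Optional[str] = None


@dataclass
class TargetSequence:
    assembly: Optional[str] = None
    chromosome: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None


@dataclass
class IntegrationEvent:
    basic: BasicInfo
    target: TargetSequence
    # A list, because one event may carry several virus sequences.
    virus_sequences: List[VirusSequence] = field(default_factory=list)
    junction_sequence: Optional[str] = None

    def integrity(self) -> float:
        """Fraction of key items that are filled in (an assumed scoring rule)."""
        items = [
            self.basic.sample, self.basic.assay,
            self.target.assembly, self.target.chromosome, self.target.start,
            self.junction_sequence,
            self.virus_sequences or None,  # an empty list counts as missing
        ]
        return sum(item is not None for item in items) / len(items)
```

Optional fields default to `None`, which mirrors the point that many items cannot be obtained from the original publication.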

VIS visualization

We develop a visualization tool to display rich information about the features of integration sites. Virus integration into the human genome may follow many different patterns. The simplest pattern is that a segment of the virus sequence breaks off and is inserted into the host genome without any other process during the integration event. However, reverse insertions, rearrangements, microhomology and mutations may take place in the process of integration, and the integration event may be complex. In this study, we consider a virus integrated within a human sequence to have the form "human sequence" + "virus-mixed sequences" + "human sequence". In other words, a junction sequence is composed of a human sequence preceding the integration region, a sequence mixing virus sequences and unknown sequences but excluding human sequences, and a human sequence following the integration region. Notably, an overlap of human sequence and virus sequence, and an unknown sequence between human sequence and virus sequence, are both allowed. However, no human sequence can exist in the mixed sequence; otherwise, the integration event is divided into two events.
...
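Figure 3. Examples of integration events. Red lines are virus sequences, black lines are human sequences, gray lines are sequences that belong to neither virus nor human, and green lines are overlaps of virus and human sequences. Mutations in the integration event are not included. Characters in blue are coordinates on the virus or human reference genome. The coordinates are designed as hyperlinks to the NCBI genome assembly or NCBI Nucleotide.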
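The rule that a human sequence inside the mixed region splits one event into two can be illustrated with a short Python sketch. It assumes a junction sequence has already been decomposed into an ordered list of segment labels ("human", "virus", "unknown"); that representation and the function name `split_events` are assumptions for the example, and overlaps are not modeled.

```python
def split_events(segments):
    """Split an ordered list of segment labels into integration events.

    Each event has the form human + mixed (virus/unknown) + human. A human
    segment between two mixed regions closes one event and opens the next.
    """
    events = []
    left = None        # index of the human segment that opens the current event
    has_virus = False  # whether the current mixed region contains virus sequence
    for i, kind in enumerate(segments):
        if kind == "human":
            if left is not None and has_virus:
                events.append(segments[left:i + 1])
            left = i
            has_virus = False
        elif kind == "virus":
            has_virus = True
        elif kind != "unknown":
            raise ValueError(f"unexpected segment kind: {kind}")
    return events
```

For example, `["human", "virus", "unknown", "virus", "human", "virus", "human"]` yields two events, with the middle human segment serving as the right flank of the first and the left flank of the second.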

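Steps to build this knowledge base

VISes in our database are either extracted from scientific literature or downloaded from other VIS databases. Because our data model is very comprehensive in its criteria for a VIS, only a few VISes fully meet our requirements. Therefore, we first populate these items with values extracted from the original articles or downloaded from public databases, and then curate the VISes to improve the completeness of VIS information. However, null values are inevitable for VIS information that cannot be obtained by these methods.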

Literature-based VISes

Collecting literature containing VISes
All scientific literature containing VISes was downloaded from PubMed, ScienceDirect, Google Scholar and Wiley with the authorization of the University of Texas Health Science Center at Houston. We searched these data sources using different combinations of the following keywords: virus integration, viral integration, integration, integration site, full name of virus, abbreviation of virus, etc. The retrieved articles are referred to as the initial literature set. For example, articles related to HBV are retrieved by the following statements:
  1. (( viral integration[Title/Abstract] or virus integration[Title/Abstract] ) AND (HBV[Title/Abstract] or hepatitis B virus[Title/Abstract]))
  2. ((integration site[Title/Abstract]) AND (HBV[Title/Abstract] or hepatitis B virus[Title/Abstract]))
  3. ((integration[Title]) AND (HBV[Title/Abstract] or hepatitis B virus[Title/Abstract]))
In this way, we collected about 150 research articles for HBV, 23 papers for AAV2, 29 papers for EBV, 21 papers for MCV, 90 papers for HPV, etc.
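The three search statements follow one template, so the queries for another virus can be generated rather than typed. The Python sketch below only builds the query strings; the function name `build_queries` is an assumption, and the boolean operators are written in uppercase throughout, which differs slightly from the statements as listed.

```python
def build_queries(abbreviation, full_name):
    """Build the three search statements for one virus."""
    virus = f"({abbreviation}[Title/Abstract] OR {full_name}[Title/Abstract])"
    topics = [
        "(viral integration[Title/Abstract] OR virus integration[Title/Abstract])",
        "(integration site[Title/Abstract])",
        "(integration[Title])",
    ]
    return [f"({topic} AND {virus})" for topic in topics]


if __name__ == "__main__":
    for query in build_queries("HBV", "hepatitis B virus"):
        print(query)
```

Calling `build_queries("HPV", "human papillomavirus")` gives the corresponding statements for HPV.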
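Filtering literature by quality and downloading supplementary files
For the initial literature set, we scanned the full text to check whether it contained valid integration sites. If an article contained VISes, the supplementary files of the original article were downloaded for further curation. If the required supplementary files were not provided, we discarded the paper. Furthermore, with the development of next-generation sequencing, many more VISes have been discovered, and the VISes in articles published in recent years cover almost all of those reported in literature published 20 or 30 years ago. In fact, few studies published before 2004 reported the specific positions of breakpoints in the host genome or the virus genome, and we had to discard them. Therefore, some articles published before 2004 are not included in VISDB. After this step, we identified 108 papers.
Extracting information from literature
VIS information is scattered across scientific literature and public biological databases without any fixed format. Therefore, we have to extract the relevant information from the literature manually. In some cases, the following VIS information can be extracted from literature databases, original papers or supplementary files.
  1. Paper metadata, such as PubMed ID, title, journal, publication year, authors and their affiliations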
  2. Samples used in detecting VISes, including sample name, sample type, sample size, donor's age and gender
  3. Human reference genome used to align the junction sequence
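  4. Virus genome used to detect the integrated virus sequences
  5. Methods used to detect VISes
  6. Disease
  7. Integration locations in the human genome, including chromosome, loci, orientation, start position and end position, etc.
  8. Integrated virus sequences, including start breakpoint, end breakpoint, orientation, etc. (coverage and read numbers are also recorded for VISes detected by NGS-related technology)
  9. Target genes or genes near the VIS
  10. Fragile sites where the VIS is located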
  11. miRNA related to target genes
  12. Other information
Unfortunately, obtaining the above information about VISes directly from the literature is cumbersome. In most cases, we use the following strategies:
  1. Search the whole article to extract objective information such as human genome, virus genome and experimental assay
  2. Copy the text from literature in PDF format, save it to a text file and import it into Excel
  3. Use string functions provided by Excel (such as concat, find, left, right and len) to extract or normalize data items
  4. Use the sort and replace functions provided by Excel to remove duplicates or group rows to speed up data compilation.
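The same kind of extraction and de-duplication could also be scripted. The sketch below is a Python stand-in for the Excel steps, not the workflow described above; the input format `chr5:1295000-1295100` and the function names are assumptions made for illustration.

```python
import re

# Optional "chr" prefix, chromosome name, start, optional end; commas allowed in numbers.
LOCATION = re.compile(
    r"^\s*(chr)?(\w+)\s*:\s*(\d[\d,]*)(?:\s*-\s*(\d[\d,]*))?\s*$",
    re.IGNORECASE,
)


def parse_location(text):
    """Parse a location string into (chromosome, start, end), or None if malformed."""
    match = LOCATION.match(text)
    if match is None:
        return None
    chrom = "chr" + match.group(2).upper()
    start = int(match.group(3).replace(",", ""))
    end = int(match.group(4).replace(",", "")) if match.group(4) else start
    return chrom, start, end


def unique_locations(lines):
    """Parse every line, drop malformed rows and duplicates, and sort the rest."""
    seen = set()
    for line in lines:
        location = parse_location(line)
        if location is not None:
            seen.add(location)
    return sorted(seen)
```

Rows that fail to parse are dropped silently here; in manual curation they would more likely be set aside for inspection.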
Curating VISes with public biological databases
After extracting VIS information from the literature, we curate VISes with public biological databases such as NCBI GenBank, KEGG, ENCODE, GeneCards, RID, ONGene, TSGene, HumCFS, miRTarBase, miRNA, etc.
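  1. Verification of target genes
  2. Verification of genes near a VIS that does not target any gene
  3. Annotation of VISes with oncogenes from the ONGene oncogene database
  4. Annotation of VISes with tumor suppressor genes from the TSGene database
  5. Extraction of upstream and downstream sequences
  6. Extraction of the integrated virus sequence and the human sequence if two positions or breakpoints are provided
  7. Extraction of fragile sites from HumCFS and association of them with VISes
  8. Extraction of miRNAs from miRTarBase and miRNA and association of them with VISes
  9. Setting links to GenBank, KEGG, ENCODE, GeneCards, etc.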
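The distinction between a target gene and a nearby gene in steps 1 and 2 can be sketched as an interval check. The Python example below is an illustration under assumptions: genes are given as `(symbol, start, end)` tuples on the same chromosome and assembly as the VIS, and the gene names and coordinates in the usage example are placeholders, not real annotations.

```python
def annotate_gene(position, genes):
    """Classify the gene relationship of a VIS at `position`.

    Returns (symbol, distance_in_bp, kind), where kind is "target" if the VIS
    falls inside a gene and "nearby" otherwise; returns None if `genes` is empty.
    """
    best = None
    for symbol, start, end in genes:
        if start <= position <= end:
            return symbol, 0, "target"
        # Distance in bp from the VIS to the closest edge of the gene.
        distance = start - position if position < start else position - end
        if best is None or distance < best[1]:
            best = (symbol, distance, "nearby")
    return best


if __name__ == "__main__":
    genes = [("GENE_A", 100, 200), ("GENE_B", 500, 600)]  # placeholder genes
    print(annotate_gene(150, genes))
    print(annotate_gene(450, genes))
```

A linear scan is enough for a sketch; a sorted gene list with binary search or an interval tree would be the usual choice for genome-scale annotation.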
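Revising VISes and computing auxiliary data to visualize VISes
Completing the VISes for all 9 viruses was a big challenge, because we found too many inconsistencies among VIS information of the same type. For example, some authors use chr23 and chr24 to refer to the sex chromosomes, while most papers use chrX and chrY. MLL4 is reported as a hotspot oncogene leading to liver cancer, but the official symbol of this gene in GenBank is KMT2B. In addition, some authors use its aliases as the gene name, such as HRX2, MLL2, TRX2, WBP7, DYT28, MLL1B, WBP-7 and CXXC10. These cases lead to inaccuracies in statistics. To reduce the risk of inaccuracy, we perform the following revisions or normalizations:
  1. Normalize gene names to the official symbols in GenBank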
  2. Use GRCh38/hg38 as the reference genome to visualize the integration event for literature-curated VISes (for imported VISes, we navigate to the source database)
  3. State VISes detected by NGS-related technology (no need for experimental assay)
  4. Use bp as the unit of measurement when computing the distance to genes
  5. Use mm as the unit of measurement for sample size
  6. Give each virus a default code if no specific virus reference genome is provided; this code does not link to any genome in the Nucleotide database
  7. Give all VISes without a specific reference genome a default code that does not link to UCSC or NCBI.
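The chromosome and gene-name normalizations can be sketched with two lookup tables in Python. The tables contain only the examples mentioned in the text (chr23/chr24 and a few KMT2B aliases); the function names are assumptions, and a real curation pipeline would draw the alias table from an authoritative gene resource.

```python
# Numeric names some authors use for the sex chromosomes.
CHROMOSOME_ALIASES = {"chr23": "chrX", "chr24": "chrY"}

# A few aliases of KMT2B taken from the text; deliberately incomplete.
GENE_ALIASES = {alias: "KMT2B" for alias in ("MLL4", "MLL2", "TRX2", "WBP7")}


def normalize_chromosome(name):
    """Return a chromosome name in the chrN / chrX / chrY style."""
    key = name.strip().lower()
    if not key.startswith("chr"):
        key = "chr" + key
    if key in CHROMOSOME_ALIASES:
        return CHROMOSOME_ALIASES[key]
    return "chr" + key[3:].upper()


def normalize_gene(symbol):
    """Map a known alias to its official symbol; leave other symbols unchanged."""
    symbol = symbol.strip()
    return GENE_ALIASES.get(symbol, symbol)
```

Gene symbols are not upper-cased, since some official symbols contain lowercase letters.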

Imported VISes

Downloading VISes and literature
RID (Retrovirus Integration Database, https://rid.ncifcrf.gov/) is a relational database containing information about retrovirus integration sites in host genomes and is sponsored by the HIV Dynamics and Replication Program (HIV DRP), National Cancer Institute, NIH. It collects about 4 million VISes from 18 papers on HIV, HTLV, MLV and ALV. The insert position on the host chromosome, the target or nearest genes, and the distance are presented. In addition, the locations in the human host genome are mapped to hg19. However, the website does not list the reference virus genome or the details of the sample and experimental assay, let alone the sequence around the integration site and the junction sites. To give more details about those VISes, we download some VISes as well as the related literature from RID and further curate the VIS information that RID does not provide.
HPVbase is an integrated viral resource and analysis platform for human papillomavirus-mediated carcinomas. It contains 1,257 entries, including methylation patterns and miRNA expression. We download the VISes that are consistent with our schema and curate them with more VIS information.

Replenishing VISes to match VISDB
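After downloading VISes from RID and consulting the literature, we curated those VISes according to the original papers and public biological databases.
  1. Extract the experimental assay from the original paper and associate it with the VIS
  2. Extract sample information from the original paper and associate it with the corresponding VISes
  3. Extract disease information related to the sample and associate it with the sample
  4. Extract the virus reference genome to which the junction sequence was aligned
  5. Supplement genes that have no gene ID in RID and normalize them by official symbol.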

Contact us

We appreciate your feedback. Please send an Email if you wish to make a request, a comment, or report a bug.

Zhongming Zhao, PhD, MS
Chair Professor for Precision Health
Professor of Biomedical Informatics and Human Genetics
School of Biomedical Informatics and School of Public Health
Founding Director, Center for Precision Health
Director, UTHealth Cancer Genomics Core
Phone: 713-500-3631
Email: zhongming.zhao@uth.tmc.edu

Deyou Tang, PhD
Visiting Scholar
School of Biomedical Informatics and School of Public Health
University of Texas Health Science Center at Houston
Email: deyou.tang@uth.tmc.edu

Citation of the database:

To cite the VISDB website in publications, please cite the following:
Tang D, Li B, Xu T, Hu R, Tan D, Song X, Jia P, Zhao Z (2020) VISDB: a manually curated database of viral integration sites in the human genome. Nucleic Acids Research 48(D1):D633-D641
https://www.ncbi.nlm.nih.gov/pubmed/31598702
