The 2nd SBMI Healthcare Machine Learning Datathon is calling capable and motivated undergraduate and graduate students from Gulf Coast Consortia institutions and other Houston-area universities. Join us for this opportunity to challenge your coding skills, meet new people, and enjoy a gathering of young hackers. This 24-hour Datathon is organized by the Center for Secure Artificial intelligence For hEalthcare (SAFE) at the School of Biomedical Informatics at UTHealth. The event is sponsored by Vir Biotechnology, and the total prize money for the winners is $2,100. Undergraduate, master's, and PhD students (first and second year only) from Gulf Coast Consortia institutions (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor) and from colleges near the TMC are encouraged to apply. This is an individual event (no team participation).
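Identifying new therapies for diseases is a long-standing goal of medicine. In the traditional setting, drug discovery is very costly. Systematically integrating prior results and knowledge could be a game changer: identifying highly promising drugs saves cost and accelerates discovery.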
High-throughput drug screening with computational methods has the potential to greatly improve cost-effectiveness by automatically estimating drug sensitivity from genomic and pharmacological data [1-7]. These computational methods use drug sensitivity data from certain cell lines to predict promising drugs that are likely to show high sensitivity in other cell lines. In this Datathon, participants are asked to build a prediction model that ranks promising drugs for given cancer cell lines.
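Key words: drug repositioning, collaborative filtering, recommender systems, cold start, graph convolutional neural networks, tensor factorization, random walk.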
Objective
The goal is to use machine learning to predict drug sensitivity in given cell lines. Participants should rank the drugs that are likely to be sensitive (i.e., relative inhibition > 50%) in the cell lines given in the test set. A key challenge is that the test cell lines have limited experimental drug response data in the training set. Participants are encouraged to use the relatively dense observations from other types of cancer cell lines to predict responses for these data-poor cancer cell lines.
Cold-start problem
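A key challenge lies in the imbalance of experimental data across tissues. For example, many drug sensitivity measurements have been accumulated for cell lines from major tissues (e.g., lung and breast cancer). Conventional methods target these commonly studied tissues [8-15]: they take known drug response data from certain cell lines and attempt to infer responses to other drugs in other cell lines within the data-rich tissues.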
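In this Datathon, we focus on ranking drugs for pancreatic cancer, which is lethal and incurable. Pancreatic cancer cell lines have been studied only to a limited extent, so the training data available for machine learning models are insufficient. One potential way to alleviate this data scarcity is to leverage a common feature, gene expression levels: different tissues partially share biological commonalities in gene expression and therefore respond to drugs in similar ways [16]. We will provide each cell line's gene expression levels as external features, and participants need to incorporate these features into the prediction model.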
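One simple way to act on this idea is a similarity-weighted vote: score each drug for a data-poor cell line by averaging the known labels from data-rich cell lines, weighted by how similar their gene expression is. The following is only a minimal sketch under that assumption, not the organizers' baseline; the function names, cell and drug identifiers, and expression values are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_drugs(target_expr, train_expr, train_labels):
    """Score each drug for a data-poor cell line.

    target_expr  : expression vector of the target cell line
    train_expr   : {cell_id: expression vector} for data-rich cell lines
    train_labels : {cell_id: {drug_id: 0 or 1}} binarized sensitivity
    Returns {drug_id: similarity-weighted mean label}.
    """
    num, den = {}, {}
    for cell, expr in train_expr.items():
        w = max(cosine(target_expr, expr), 0.0)  # clip negative similarity to zero
        for drug, label in train_labels.get(cell, {}).items():
            num[drug] = num.get(drug, 0.0) + w * label
            den[drug] = den.get(drug, 0.0) + w
    return {d: num[d] / den[d] for d in num if den[d] > 0}

# Toy data (hypothetical identifiers and values).
train_expr = {"lung_1": [5.0, 0.1, 2.0], "breast_1": [0.2, 4.0, 0.1]}
train_labels = {"lung_1": {"drugA": 1, "drugB": 0},
                "breast_1": {"drugA": 0, "drugB": 1}}
pancreas_expr = [4.5, 0.3, 1.8]  # expression profile close to lung_1

scores = score_drugs(pancreas_expr, train_expr, train_labels)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['drugA', 'drugB']
```

Because the toy pancreatic profile resembles the lung cell line, the drug that works in the lung line is ranked first. Stronger models (matrix or tensor factorization, graph neural networks) follow the same principle of sharing information through expression features.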
./train.csv
The training data give the drug sensitivity for each (drug, cell) pair. We use relative inhibition (RI) as the drug sensitivity measure. The RI has been binarized: 1 if the drug shows efficacy in the cell, and 0 otherwise.
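The binarization rule can be sketched as follows, assuming the 50% relative-inhibition threshold given in the Objective. The provided files are already binarized, so this is purely illustrative; the record layout and identifiers are hypothetical.

```python
def binarize_ri(relative_inhibition, threshold=50.0):
    """Return 1 if relative inhibition (in percent) exceeds the threshold, else 0."""
    return 1 if relative_inhibition > threshold else 0

def build_label_table(records, threshold=50.0):
    """Group (drug, cell, RI%) records into {cell: {drug: 0 or 1}}."""
    table = {}
    for drug, cell, ri in records:
        table.setdefault(cell, {})[drug] = binarize_ri(ri, threshold)
    return table

# Hypothetical raw measurements: (drug, cell, relative inhibition in %).
raw = [("drugA", "cell_1", 72.0), ("drugB", "cell_1", 50.0), ("drugA", "cell_2", 12.5)]
labels = build_label_table(raw)
print(labels)  # {'cell_1': {'drugA': 1, 'drugB': 0}, 'cell_2': {'drugA': 0}}
```

Note that a value of exactly 50% maps to 0 here because the rule is a strict inequality.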
The training data are provided as a single CSV file (train.csv) or a binary pickle file (train.pkl) with the format below:
./drug/target_genes.tsv
./drug/fingerprint.tsv (optional)
./drug/profiles.csv (optional)
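One of the most important features for drug repositioning is a drug's target genes. A drug inhibits specific target genes, which in turn modifies the disease. We provide the drug features as a TSV file and as a binary pickle file.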
(Optional) In addition, a drug's chemical features can be studied further using its molecular structure. Contestants are free to use Molecular ACCess System (MACCS) fingerprints [17] and native chemical structures to boost prediction performance. MACCS fingerprints contain 166 structural keys, such as the number of oxygens, S-S bonds, and rings. We also provide each drug's native chemical structure as a SMILES string. SMILES is an unambiguous linear notation for chemical compounds: atoms are written as their atomic symbols (e.g., C for carbon), and special characters denote relationships between them (e.g., "=": double bond; "#": triple bond; ".": no covalent bond, as between the ions of a salt; ":": aromatic bond) [18]. SMILES can provide a richer feature space that strictly represents functional substructures and expresses structural differences such as a compound's chirality [19].
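A common way to use binary fingerprints such as MACCS keys is to compare drugs by Tanimoto similarity, the number of bits set in both fingerprints divided by the number set in either. The sketch below assumes fingerprints are already available as lists of 0/1 values; the short vectors are made up for illustration and are not read from fingerprint.tsv.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length 0/1 fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

# Hypothetical 8-bit fingerprints (real MACCS fingerprints are much longer).
fp_x = [1, 1, 0, 1, 0, 0, 1, 0]
fp_y = [1, 0, 0, 1, 0, 1, 1, 0]
print(tanimoto(fp_x, fp_y))  # 3 shared bits / 5 bits set in either = 0.6
```

A drug-drug similarity computed this way can serve as side information, for example to propagate known responses between structurally similar drugs.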
./cell/gene_expressions.csv
./cell/profiles.csv (optional)
The cell lines come from 14 different tissues, including lung, ovary, and skin, and represent 14 different types of cancer, including carcinoma, adenocarcinoma, and melanoma. We provide the gene expression profile of each cell line in Fragments Per Kilobase of transcript per Million mapped reads (FPKM) [20,21]. These gene expression profiles quantify the cell's genetic status. Note that gene expression is missing for some cell lines.
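For readers unfamiliar with the unit, the sketch below spells out the standard FPKM definition and two preprocessing steps that are often applied before modeling: a log2(FPKM + 1) transform and mean imputation of missing values. These preprocessing choices are assumptions for illustration, not part of the Datathon specification, and the helper names are hypothetical.

```python
import math

def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)

def log_fpkm(value):
    """log2(FPKM + 1); returns None when the value is missing."""
    return None if value is None else math.log2(value + 1.0)

def impute_missing(profile, gene_means):
    """Replace missing entries (None) with a per-gene mean from other cell lines."""
    return [m if x is None else x for x, m in zip(profile, gene_means)]

# 500 fragments on a 2,000 bp transcript with 25 million mapped fragments.
print(fpkm(500, 2000, 25_000_000))                      # 10.0
print(impute_missing([1.5, None, 0.0], [2.0, 3.0, 4.0]))  # [1.5, 3.0, 0.0]
```

The log transform compresses the long right tail of expression values, which tends to make distance and similarity measures between cell lines better behaved.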
(Optional) We also provide cell line profiles with identifiers that link to external databases.
Contestants are asked to select the 20 most efficacious drugs for each of the 44 pancreatic cancer cell lines given in the test set. Contestants should predict and rank the efficacious drugs with a computational method. The test data contain the drug and cell line pairs whose sensitivity we would like to predict. They are provided as a CSV file (test.csv) with the format below:
The submission file (submission.csv) should be formatted as:
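The lowest rank value (0, 1, 2, 3, ...) corresponds to the most efficacious drug. Submissions will be judged on a ranking score, normalized discounted cumulative gain (NDCG).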
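Turning model scores into ranks can be sketched as below, assuming higher scores mean higher predicted efficacy and that rank 0 is the best drug. The function and identifiers are hypothetical; the exact column layout of submission.csv should follow the organizers' template.

```python
def top_k_rows(scores_by_cell, k=20):
    """Turn {cell: {drug: score}} into (cell, drug, rank) rows, rank 0 = best."""
    rows = []
    for cell in sorted(scores_by_cell):
        scores = scores_by_cell[cell]
        # Sort by descending score; break ties by drug id for reproducibility.
        ordered = sorted(scores, key=lambda d: (-scores[d], d))
        for rank, drug in enumerate(ordered[:k]):
            rows.append((cell, drug, rank))
    return rows

demo = {"cell_1": {"drugA": 0.2, "drugB": 0.9, "drugC": 0.5}}
print(top_k_rows(demo, k=2))  # [('cell_1', 'drugB', 0), ('cell_1', 'drugC', 1)]
```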
Because our ultimate goal is to rank drugs by likelihood of efficacy and to help prioritize in vitro drug experiments, we measure model accuracy as ranking performance. As the ranking measure, we evaluate normalized discounted cumulative gain (NDCG). In information retrieval, cumulative gain is the sum of the relevance values (here, relative inhibition) of the high-ranked drugs for each query (cell). Discounted cumulative gain (DCG) is the sum of the graded, discounted relevance scores in the top-ranked list. The formula for DCG accumulated over the top-20 ranking list is
The DCG will be high if highly efficacious drugs are ranked high and/or if highly efficacious drugs are ranked above marginally efficacious drugs. The DCG score can be normalized by the maximum DCG, or ideal DCG (IDCG), obtained when the ranking is perfectly matched:
We will use the average NDCG@20 over all 44 cell lines as the final performance measure for this Datathon.
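For local validation, the metric can be approximated with the sketch below. It assumes one widely used DCG form, in which the relevance at 0-based rank i is divided by log2(i + 2); the organizers' exact gain and discount functions may differ, so treat the scores as indicative only. Relevance values may be binary labels or graded values such as relative inhibition.

```python
import math

def dcg_at_k(relevances, k=20):
    """DCG over the first k relevance values, listed in predicted rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=20):
    """DCG of the submitted ranking divided by the ideal DCG for that cell."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def mean_ndcg(per_cell, k=20):
    """Average NDCG@k over cells; per_cell maps cell -> (ranked, all) relevances."""
    values = [ndcg_at_k(ranked, everything, k) for ranked, everything in per_cell.values()]
    return sum(values) / len(values)

# Toy cell with four candidate drugs, two of which are efficacious.
truth = [1, 0, 1, 0]
perfect = ndcg_at_k([1, 1, 0], truth, k=3)  # efficacious drugs ranked first
swapped = ndcg_at_k([0, 1, 1], truth, k=3)  # an ineffective drug ranked first
print(perfect, swapped)
```

The perfect ordering scores exactly 1, while placing an ineffective drug at the top lowers the score because the top position carries the largest weight.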
Useful links and literature:
First prize: $1200
Second prize: $600
Third prize: $300
Location:
School of Biomedical Informatics (SBMI), University of Texas Health Science Center at Houston (UTHealth)
February 1, 2020
February 2, 2020
Reference: