SBMI Healthcare Machine Learning Datathon

February 1-2, 2020

School of Biomedical Informatics (SBMI), University of Texas Health Science Center at Houston (UTHealth)

E6 Level, 7000 Fannin St., Houston, TX 77030

Co-organizers: Xiaoqian Jiang, Yejin Kim

Project Manager: Marijane deTranaltes


First prize: $1200
Second prize: $600
Third prize: $300


Architectural/Content/Logistical Support: Judy Young, David Ha, Luyao Chen, Queen Chambliss, Marcos Hernandez, Angela Wilkes

Steering Committee: Jing Tang, Shuyu Zheng, Pora Kim, Fei Wang, Jim Zheng, Shaghayegh Agah, Steve Wong, Rong Xu, Santiago Segarra



ABOUT DATATHON


The 2nd SBMI Healthcare Machine Learning Datathon is calling capable and motivated undergraduate and graduate students from Gulf Coast Consortia institutions and other Houston-area universities. Come join us for this great opportunity to challenge your coding skills, meet new people, and enjoy the gathering of young hackers. This 24-hour Datathon is organized by the Center for Secure Artificial Intelligence For hEalthcare (SAFE) at the School of Biomedical Informatics at UTHealth. The event is sponsored by Vir Biotechnology, with a total prize of $2,100 for the winners. Undergraduate, master's, and PhD students (first and second year only) from institutes within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor) and from colleges in the vicinity of TMC are encouraged to apply. This is an individual-based event (no team participation).


THEME


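Identifying new therapies for diseases is a long-standing goal of medicine. In the traditional setting, drug discovery is very costly. Systematic integration of prior results and knowledge could be a game changer, saving cost and accelerating discovery by identifying highly promising drugs.

High-throughput drug screening with computational methods has the potential to greatly improve cost-effectiveness by automatically estimating drug sensitivity from genomic and pharmacological data [1-7]. These computational methods use drug sensitivity data from certain cell lines to predict promising drugs that potentially have high sensitivity in other cell lines. In this Datathon, participants are asked to build a prediction model that ranks promising drugs in given cancer cell lines.

Key words: drug repositioning, collaborative filtering, recommender systems, cold start, graph convolutional neural networks, tensor factorization, random walk.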


PROBLEM DESCRIPTION


Objective

The goal is to use machine learning to predict drug sensitivity in given cell lines. Participants should rank the drugs to which the cell lines in the test set are likely to be sensitive (i.e., relative inhibition > 50%). A key challenge is that the test cell lines have limited experimental drug response data in the training set. Participants are encouraged to use the relatively dense observations from other types of cancer cell lines to make predictions for these data-scarce cancer cell lines.

Cold-start problem

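A key challenge lies in the imbalance of experimental data across tissues. For example, many drug sensitivity experiments have accumulated for cell lines of major tissues (e.g., lung and breast cancer). Traditional methods target these commonly studied tissues [8-15]: they take known drug response data at certain cell lines and attempt to find other drug responses at other cell lines within the data-rich tissues.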

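In this Datathon, we focus on ranking drugs for pancreatic cancer, which is deadly and incurable. Pancreatic cancer cell lines have been studied only to a limited extent, so training data for machine learning models are insufficient. One potential way to alleviate this data-scarcity problem is to exploit a common feature, the gene expression level, because different tissues partially share biological commonalities in gene expression and therefore respond to drugs in similar ways [16]. We will provide each cell's gene expression level as an external feature, and participants need to incorporate these features into the prediction model.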
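One way to picture the idea of borrowing information through gene expression is a nearest-neighbor transfer: score each drug for an unseen cell line by the labels observed in the most similar training cell lines. The sketch below is only an illustration under stated assumptions; the function names, the toy identifiers, the expression values, and the choice of cosine similarity with `k=2` are all hypothetical and are not the organizers' baseline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_drugs(target_expr, train_expr, train_labels, k=2):
    """Score drugs for an unseen cell line from its k most similar training cells.

    train_expr:   {cell_id: expression vector}
    train_labels: {cell_id: {drug_id: 0 or 1}}  (binarized sensitivity)
    """
    neighbors = sorted(
        ((cosine(target_expr, expr), cell) for cell, expr in train_expr.items()),
        reverse=True,
    )[:k]
    num, den = {}, {}
    for sim, cell in neighbors:
        weight = max(sim, 0.0)  # ignore negatively correlated neighbors
        for drug, label in train_labels.get(cell, {}).items():
            num[drug] = num.get(drug, 0.0) + weight * label
            den[drug] = den.get(drug, 0.0) + weight
    # Similarity-weighted average of the neighbors' labels for each drug.
    return {d: num[d] / den[d] for d in num if den[d] > 0}

# Toy data (hypothetical identifiers and values).
train_expr = {
    "lung_1": [5.0, 0.1, 2.0],
    "breast_1": [4.8, 0.2, 2.1],
    "skin_1": [0.1, 6.0, 0.3],
}
train_labels = {
    "lung_1": {"drugA": 1, "drugB": 0},
    "breast_1": {"drugA": 1, "drugC": 1},
    "skin_1": {"drugA": 0, "drugB": 1},
}
pancreas_expr = [5.1, 0.3, 1.9]

scores = score_drugs(pancreas_expr, train_expr, train_labels, k=2)
ranking = sorted(scores, key=lambda d: (-scores[d], d))
print(ranking)
```

In this toy case the pancreatic profile resembles the lung and breast profiles, so only their labels contribute. Methods named in the key words (tensor factorization, graph convolutional networks) would replace this simple neighbor vote.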

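Figure 1. Number of tested drugs in each cell line. The drug distribution has a long tail.

Figure 2. Cold-start problem. Participants are asked to predict drug sensitivity for cells in pancreatic tissue, which are not included in the training set.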


DATA DESCRIPTION


Sensitivity of drug and cell pair
./train.csv

The training data give the drug sensitivity for each (drug, cell) pair. We use relative inhibition (RI) as the drug sensitivity measure. The RI has been binarized: 1 if the drug shows efficacy in the cell, and 0 otherwise.
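As a minimal sketch of what such a binarization looks like, the snippet below assumes the cutoff is the same "relative inhibition > 50%" criterion mentioned in the Objective; the exact rule used to build train.csv is not stated here, and the function name and toy measurements are hypothetical.

```python
def binarize_ri(relative_inhibition, threshold=50.0):
    """Return 1 if relative inhibition (in percent) exceeds the threshold, else 0."""
    return 1 if relative_inhibition > threshold else 0

# Hypothetical (drug, cell, RI in percent) measurements.
raw = [("drugA", "cell1", 82.5), ("drugB", "cell1", 50.0), ("drugC", "cell2", 12.0)]
labels = [(drug, cell, binarize_ri(ri)) for drug, cell, ri in raw]
print(labels)
```

Note that with a strict inequality a value of exactly 50% maps to 0.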

The training data are provided as a single CSV file (train.csv) or a binary pickle file (train.pkl) with the format below:




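Drug's features.
./drug/target_genes.tsv
./drug/fingerprint.tsv (optional)
./drug/profiles.csv (optional)

One of the most important features for drug repositioning is a drug's target genes. A drug inhibits specific target genes, which in turn modifies the disease. We will provide the drug features as a TSV file and as a binary pickle file.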


(Optional) In addition, the chemical characteristics of a drug can be investigated further using its molecular structure. Contestants are free to use Molecular ACCess System (MACCS) fingerprints [17] and native chemical compounds to boost prediction performance. MACCS fingerprints contain 166 chemical structures, such as the number of oxygens, S-S bonds, and rings. In addition, we represent each drug as a native chemical structure using SMILES. SMILES is a linear notation that represents a chemical compound in a unique way: atoms are represented by their atomic symbols (e.g., C for carbon), and special characters represent relationships (e.g., "=": double bond; "#": triple bond; ".": ionic bond; ":": aromatic bond) [18]. SMILES can provide a richer feature space that strictly represents functional substructures and expresses structural differences such as a compound's chirality [19].
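To make the SMILES notation concrete, here is a small sketch that splits a SMILES string into tokens and counts a few symbols. It is not the MACCS key set and not a chemistry toolkit; the regular expression, the function names, and the chosen features are illustrative assumptions, and a cheminformatics library would normally be used for real fingerprints.

```python
import re
from collections import Counter

# Bracket atoms, two-letter halogens, common organic atoms (upper case = aliphatic,
# lower case = aromatic), bond symbols, branches, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\*|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

def simple_features(smiles):
    """Count a few symbols; atoms written inside brackets are not counted here."""
    counts = Counter(tokenize(smiles))
    return {
        "n_oxygen": counts["O"] + counts["o"],
        "n_double_bonds": counts["="],
        "n_triple_bonds": counts["#"],
        "n_aromatic_atoms": sum(counts[a] for a in "bcnosp"),
    }

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
features = simple_features(aspirin)
print(tokenize(aspirin))
print(features)
```

The token sequence can also be fed directly to sequence models, which is the kind of use described in references 18 and 19.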







Cell line’s features.
./cell/gene_expressions.csv
./cell/profiles.csv (optional)

The cell lines come from 14 different tissues, including lung, ovary, and skin, and 14 different types of cancer, including carcinoma, adenocarcinoma, and melanoma. We provide the gene expression profile of each cell line in Fragments Per Kilobase of transcript per Million mapped reads (FPKM) [20,21]. These gene expression profiles are an accurate quantification of the cell's genetic status. Note that some cell lines have missing gene expression.
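Two preprocessing steps are commonly applied to data of this kind: a log transform of the FPKM values and some handling of cell lines without a profile. The sketch below assumes a log2(FPKM + 1) transform and per-gene mean imputation; both are illustrative choices rather than anything prescribed by the organizers, and the function names and toy values are hypothetical.

```python
import math

def log_fpkm(values):
    """Apply log2(FPKM + 1) to one expression profile."""
    return [math.log2(v + 1.0) for v in values]

def fill_missing_profiles(profiles, cell_ids):
    """Return a log-transformed profile per cell.

    Cells without a profile receive the per-gene mean of the available profiles.
    """
    available = [log_fpkm(p) for p in profiles.values()]
    n_genes = len(available[0])
    mean_profile = [
        sum(p[g] for p in available) / len(available) for g in range(n_genes)
    ]
    return {
        c: (log_fpkm(profiles[c]) if c in profiles else list(mean_profile))
        for c in cell_ids
    }

# Toy FPKM values for two genes; "cell3" has no measured profile.
profiles = {"cell1": [0.0, 3.0], "cell2": [1.0, 15.0]}
filled = fill_missing_profiles(profiles, ["cell1", "cell2", "cell3"])
print(filled)
```

Dropping cell lines without expression data, or imputing from the same tissue only, are equally reasonable alternatives.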




(Optional) We also provide cell line profiles with identifiers that link to external databases.



EVALUATION


Contestants are asked to select the 20 most efficacious drugs for each of the 44 pancreatic cancer cell lines given in the test set. Contestants should predict and rank the efficacious drugs with a computational method. The test data contain the drugs and cell lines whose sensitivity we would like to predict. It is a CSV file (test.csv) with the format below:



The submission file (submission.csv) should be formatted as:




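The lowest rank (0, 1, 2, 3, ...) corresponds to the most efficacious drug. Submissions will be judged on a ranking score, the normalized discounted cumulative gain (NDCG).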
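The ranking convention (rank 0 for the most efficacious drug, 20 drugs per cell line) can be sketched as follows. The function name, the `(cell, drug, rank)` tuple layout, and the toy scores are assumptions for illustration; the actual column names and order must follow the submission.csv format given by the organizers.

```python
def top_k_ranking(scores_by_cell, k=20):
    """Return rows (cell, drug, rank) with rank 0 for the highest-scoring drug."""
    rows = []
    for cell in sorted(scores_by_cell):
        drug_scores = scores_by_cell[cell]
        # Sort by descending score; break ties by drug name for reproducibility.
        ordered = sorted(drug_scores, key=lambda d: (-drug_scores[d], d))[:k]
        rows.extend((cell, drug, rank) for rank, drug in enumerate(ordered))
    return rows

# Hypothetical model scores for one cell line, keeping only the top 2 drugs.
scores = {"cellX": {"drugA": 0.2, "drugB": 0.9, "drugC": 0.5}}
rows = top_k_ranking(scores, k=2)
print(rows)
```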

As our ultimate goal is to rank drugs based on the likelihood of efficacy and to help prioritize in vitro drug experiments, we will measure the accuracy of the model as ranking performance. As the ranking measure, we will use the normalized discounted cumulative gain (NDCG). In information retrieval, cumulative gain is the sum of the relevance values (i.e., relative inhibition) of the high-ranked drugs for each query (cell). Discounted cumulative gain (DCG) is the sum of graded/weighted/discounted relevance scores in the top ranking list. The formula for DCG accumulated at top-20 ranking list is

The DCG will be high if highly efficacious drugs are ranked high and/or if highly efficacious drugs are ranked higher than marginally efficacious drugs. The DCG score can be normalized by the maximum DCG or ideal DCG (IDCG) in which the ranking is perfectly matched:
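For orientation, a commonly used form of these two quantities is given below. It is stated as an assumption rather than as the organizers' exact definition (some variants use a gain of 2^rel - 1 instead of rel). Here rel_i is the relevance of the drug at position i, with positions starting at 1, so position i corresponds to submission rank i - 1.

```latex
\mathrm{DCG@20} = \sum_{i=1}^{20} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG@20} = \frac{\mathrm{DCG@20}}{\mathrm{IDCG@20}}
```

IDCG@20 is the DCG@20 obtained when the drugs are sorted by their true relevance.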

We will use the average NDCG@20 over all 44 cell lines as the final performance measure for this Datathon.
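For local validation, the averaged metric can be sketched in a few lines. This assumes the common linear-gain DCG with a log2 discount, which may differ in detail from the official scorer; the function names and toy data are hypothetical.

```python
import math

def dcg_at_k(relevances, k=20):
    """DCG over the first k relevance values, listed in predicted rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_drugs, true_relevance, k=20):
    """NDCG@k for one cell line; true_relevance maps drug -> relevance."""
    gains = [true_relevance.get(d, 0.0) for d in ranked_drugs]
    ideal = sorted(true_relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mean_ndcg(predictions, truth, k=20):
    """Average NDCG@k over all cell lines in truth."""
    return sum(ndcg_at_k(predictions.get(c, []), truth[c], k) for c in truth) / len(truth)

# Toy example: cell1 is ranked perfectly, cell2 puts its only active drug second.
truth = {"cell1": {"A": 1, "B": 0, "C": 1}, "cell2": {"A": 0, "B": 1}}
predictions = {"cell1": ["A", "C", "B"], "cell2": ["A", "B"]}
score = mean_ndcg(predictions, truth)
print(round(score, 4))
```

Cell lines with no active drug get a score of 0 in this sketch; how the official scorer treats that case is not specified here.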


Useful links and literature:


RULES


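  • Participants are required to submit source code (e.g., a Jupyter Notebook) in a self-contained form.
  • Downloading data from our server and saving them locally for use after the competition is not permitted.
  • Privately sharing data outside the environment we provide is not permitted during the competition.
  • External data may be used as long as they are not directly related to the labels in the official competition data.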
  • Participants must use an algorithmic approach to classify the segments. Any changes to the methodology must be done in an automated way, so that your approach will generalize to new subjects.
  • Top contestants are asked to prepare summary slides to describe their models at the end of the Datathon.
  • The top three participants will be asked to give a short presentation, and those on the leaderboard (top 10) may have an opportunity to publish their results in a special issue of a journal.


PRIZE


First prize: $1200
Second prize: $600
Third prize: $300



AGENDA


Location:

School of Biomedical Informatics (SBMI), University of Texas Health Science Center at Houston (UTHealth)
UCT Classrooms 612 and 614
7000 Fannin St., Houston, TX 77030

February 1, 2020

  • 1:00pm Opening remarks by the organizer Dr. Xiaoqian Jiang and Dean Jiajie Zhang
    • Challenge projects announcement and greetings from the sponsor
  • 1:30pm Warm up
    • Environment setup
  • 2:00pm Enjoy the Datathon!

February 2, 2020

  • 2:00pm End of Datathon
  • 2:30pm Announcement of Awardees
    • Mrs. Marijane deTranaltes
  • 3:00pm Demonstrations


FREQUENTLY ASKED QUESTIONS

Undergraduate or graduate students (including 1st and 2nd year PhD students) from a Texas institute, a program from the institutes within the Gulf Coast Consortia (including UTHealth, MDACC, UH, Rice, TAMU, UTMB, IBT, and Baylor), and colleges in the vicinity of TMC. Those who are affiliated with the Center for Secure Healthcare Machine Learning are not eligible to participate.
No, this Datathon is completely free!
Just bring your laptop and charger, student ID, and your brilliant mind. Of course, feel free to bring things that can help you - earbuds, clothes/blankets, sleeping bag if you plan on sleeping, etc.
This is a coding Datathon. You are expected to have basic programming skills and machine learning knowledge.
Yes! There will be meals, snacks, drinks, coffee, and more!
YES! There is a total prize of $2,100 for the winners sponsored by Vir Biotechnology.
The Datathon will be held at the University Center Tower (UCT) of UTHealth, Houston. The address is Room 614, 7000 Fannin St., Houston, TX 77030. Public parking is available in the UCT garage. We will validate UCT parking tickets for those who drive. If you come by Houston METRORail, the TMC Transit Center Station on the Red Line is just across the street. We will not provide travel reimbursement.
Our panel of judges consists of faculty members of the School of Biomedical Informatics at UTHealth. Your project will be judged on usefulness, design, difficulty and creativity. Top selected projects will be demonstrated to the group at the end of the event.
If you have a question that is not listed here, please contact Dr. Xiaohong Bi.


Reference:

  1. Guan N-N, Zhao Y, Wang C-C, Li J-Q, Chen X, Piao X. Anticancer Drug Response Prediction in Cell Lines Using Weighted Graph Regularized Matrix Factorization. Mol Ther Nucleic Acids. 2019;17:164-174.
  2. Menden MP, Iorio F, Garnett M, et al. Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One. 2013;8(4):e61318.
  3. Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol. 2014;15(3):R47.
  4. Donner Y, Kazmierczak S, Fortney K. Drug Repurposing Using Deep Embeddings of Gene Expression Profiles. Mol Pharm. 2018;15(10):4314-4325.
  5. Pushpakom S, Iorio F, Eyers PA, et al. Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov. 2019;18(1):41-58.
  6. Yang J, Li A, Li Y, Guo X, Wang M. A novel approach for drug response prediction in cancer cell lines via network representation learning. Bioinformatics. 2019;35(9):1527-1535.
  7. Madhukar NS, Khade PK, Huang L, et al. A Bayesian machine learning approach for drug target identification using diverse data types. Nat Commun. 2019;10(1):5221.
  8. Sun X, Vilar S, Tatonetti NP. High-throughput methods for combinatorial drug discovery. Sci Transl Med. 2013;5(205):205rv1.
  9. Celebi R, Bear Don’t Walk O 4th, Movva R, Alpsoy S, Dumontier M. In-silico Prediction of Synergistic Anti-Cancer Drug Combinations Using Multi-omics Data. Sci Rep. 2019;9(1):8949.
  10. Preuer K, Lewis RPI, Hochreiter S, Bender A, Bulusu KC, Klambauer G. DeepSynergy: predicting anti-cancer drug synergy with Deep Learning. Bioinformatics. 2018;34(9):1538-1546.
  11. Huang L, Li F, Sheng J, et al. DrugComboRanker: drug combination discovery based on target network analysis. Bioinformatics. 2014;30(12):i228-i236.
  12. Bansal M, Yang J, Karan C, et al. A community computational challenge to predict the activity of pairs of compounds. Nat Biotechnol. 2014;32(12):1213-1222.
  13. Zhao X-M, Iskar M, Zeller G, Kuhn M, van Noort V, Bork P. Prediction of drug combinations by integrating molecular and pharmacological data. PLoS Comput Biol. 2011;7(12):e1002323.
  14. Chen G, Tsoi A, Xu H, Jim Zheng W. Predict effective drug combination by deep belief network and ontology fingerprints. J Biomed Inform. 2018;85:149-154. doi: 10.1016/j.jbi.2018.07.024
  15. Tang J, Gautam P, Gupta A, et al. Network pharmacology modeling identifies synergistic Aurora B and ZAK interaction in triple-negative breast cancer. NPJ Syst Biol Appl. 2019;5:20.
  16. Xu C, Ai D, Shi D, et al. Accurate Drug Repositioning through Non-tissue-Specific Core Signatures from Cancer Transcriptomes. Cell Rep. 2019;29(4):1055.
  17. Polton DJ. Installation and operational experiences with MACCS (Molecular Access System). Online Review. 1982;6(3):235-242. doi: 10.1108/eb024099
  18. Segler MHS, Kogej T, Tyrchan C, Waller MP. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent Sci. 2018;4(1):120-131.
  19. Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics. 2018;19(Suppl 19):526.
  20. Barretina J, Caponigro G, Stransky N, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603-607.
