Tensor Completion for Cancer Drug Repositioning

In this notebook, we use ENSIGN's ability to find the low-rank structure of multi-dimensional data to propose novel drug-target-disease relationships. The goal is to suggest candidates for drug repositioning, the process of repurposing an existing drug to treat a different disease. Drug repositioning can improve patients' quality of life by giving them new therapeutic options. It also benefits pharmaceutical companies, because repurposed drugs reach the market faster and at lower cost.

Given known drug-protein, protein-disease, and drug-disease relationships, we construct a 3-mode tensor describing drug-target-disease relationships. Specifically, we construct the tensor proposed by Wang et al. in "Predicting associations among drugs, targets and diseases by tensor decomposition for drug repositioning." Decomposing this tensor yields components that reconstruct the known entries and also assign non-zero scores to entries that were not part of the original tensor. These additional entries correspond to previously unknown drug-target-disease relationships, and they indicate possible candidates for drug repositioning.

In [1]:
from itertools import product

import numpy as np
import pandas as pd

import ensign.csv2tensor as c2t
import ensign.cp_decomp as cpd

%matplotlib inline


As in the paper cited above, we use the data that Luo et al. provide in a public repository alongside "A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information." We use the lists of drugs, proteins, and diseases together with the drug-protein, protein-disease, and drug-disease interaction matrices. In this study, we consider only cancers, which we select by matching keywords in the disease names. After filtering, there are 708 drugs, 1512 targets, and 313 diseases.

In [2]:
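def read_labels(path):
    # One label per line. Labels may contain spaces or commas, so split on newlines only.
    # (np.loadtxt with delimiter='\n' is rejected by recent NumPy releases.)
    with open(path) as f:
        return np.array(f.read().splitlines())

drugs = read_labels('data/drug.txt')
proteins = read_labels('data/protein.txt')
diseases = read_labels('data/disease.txt')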

drug_protein = np.loadtxt('data/mat_drug_protein.txt', dtype=int, delimiter=' ')
protein_disease = np.loadtxt('data/mat_protein_disease.txt', dtype=int, delimiter=' ')
drug_disease = np.loadtxt('data/mat_drug_disease.txt', dtype=int, delimiter=' ')

# Keywords that mark a disease name as a cancer.
CANCER_KEYWORDS = ('cancer', 'carcinoma', 'melanoma', 'lymphoma', 'sarcoma',
                   'cytoma', 'leukemia', 'neoplasm', 'tumor')

def cancer_filter(d):
    return any(keyword in d for keyword in CANCER_KEYWORDS)

cancer_idxs = [i for i, d in enumerate(diseases) if cancer_filter(d)]

diseases = diseases[cancer_idxs]
protein_disease = protein_disease[:, cancer_idxs]
drug_disease = drug_disease[:, cancer_idxs]

len(drugs), len(proteins), len(diseases)
(708, 1512, 313)

Tensor Construction

We build a 3-mode binary tensor whose modes index the drugs, proteins, and diseases, respectively. The entry for a (drug, protein, disease) triple is 1 if all three of the following hold, and 0 otherwise: the drug targets the protein, there is a known relationship between the protein and the disease, and the drug has been used to treat the disease. Not all of the drugs, proteins, and diseases in the dataset take part in such a triple, so the final tensor contains only 473 drugs, 397 proteins, and 229 diseases.

In [3]:
drug_target_disease = []
for i, drug in enumerate(drugs):
    for j, target in enumerate(proteins):
        if drug_protein[i, j]:
            for k, disease in enumerate(diseases):
                if protein_disease[j, k] and drug_disease[i, k]:
                    drug_target_disease.append([drug, target, disease])

data = pd.DataFrame(data=drug_target_disease, columns=['drug', 'target', 'disease'])
data.to_csv('data/bioinformatics_data.csv', index=False)

tensor = c2t.csv2tensor('data/bioinformatics_data.csv', entries='boolean')
[473, 397, 229]
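
The construction rule can also be written as a single element-wise product of the three interaction matrices. The following is a minimal sketch on made-up toy matrices (the values and the names `toy_tensor`, `triples`, and `mode_sizes` are illustrative assumptions, not part of the dataset or of ENSIGN). It also shows why labels that never take part in a triple drop out of the final tensor.

```python
import numpy as np

# Toy binary interaction matrices (hypothetical values, not the real data).
drug_protein = np.array([[1, 0, 1],
                         [0, 1, 0]])      # 2 drugs x 3 proteins
protein_disease = np.array([[1, 0],
                            [1, 1],
                            [0, 1]])      # 3 proteins x 2 diseases
drug_disease = np.array([[1, 1],
                         [0, 0]])         # 2 drugs x 2 diseases

# T[i, j, k] = drug_protein[i, j] * protein_disease[j, k] * drug_disease[i, k]
toy_tensor = np.einsum('ij,jk,ik->ijk', drug_protein, protein_disease, drug_disease)

# Index triples of the nonzero entries, in the order nested loops would visit them.
triples = [tuple(int(x) for x in t) for t in np.argwhere(toy_tensor)]

# Only indices that appear in at least one triple survive into the final tensor.
mode_sizes = [len({t[m] for t in triples}) for m in range(3)]

print(triples)
print(mode_sizes)
```

In this toy case the second drug treats neither disease, so it appears in no triple and the drug mode shrinks from 2 to 1.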

Tensor Completion

We now decompose the tensor encoding drug-protein-disease relationships in order to find previously unknown potential drug-protein-disease relationships. We found that CP-ALS with the rank suggested by Wang et al. resulted in the best fit. After the decomposition, we consider each component individually: its entries are reconstructed as the weighted outer product of its mode vectors. Any non-trivial entry in a component (here, one built from mode-vector values above 1% of that vector's maximum magnitude) that was not in the original tensor indicates a previously unknown drug-protein-disease relationship. As the task is drug repositioning, we present a list of candidate drug-protein-disease relationships sorted by their scores in the reconstructed tensor.

In [4]:
decomp = cpd.cp_als(tensor, 250)
cpd.write_cp_decomp_dir('bioinformatics_decomposition', decomp, write_tensor=True)
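
To make the reconstruction step concrete, here is a minimal NumPy sketch on a hypothetical rank-2 CP model of a 2 x 2 x 2 tensor. It does not use ENSIGN, and the weights, factor matrices, and `observed` set are invented for illustration. It reconstructs every entry as a weighted sum of outer products and keeps the nonzero entries that were not observed.

```python
import numpy as np

# Hypothetical rank-2 CP model: one weight per component, one factor matrix per mode.
weights = np.array([1.0, 0.5])
A = np.array([[1.0, 0.0], [1.0, 1.0]])   # mode 0 (e.g. drugs); columns are components
B = np.array([[1.0, 0.0], [0.0, 1.0]])   # mode 1 (e.g. targets)
C = np.array([[1.0, 1.0], [1.0, 0.0]])   # mode 2 (e.g. diseases)

# Sum over components r of weights[r] * outer(A[:, r], B[:, r], C[:, r]).
reconstruction = np.einsum('r,ir,jr,kr->ijk', weights, A, B, C)

# Entries known in the (toy) original tensor.
observed = {(0, 0, 0), (1, 0, 0)}

# Nonzero reconstructed entries that were not observed are the candidate predictions.
candidates = {}
for idx in np.argwhere(reconstruction > 0):
    triple = tuple(int(x) for x in idx)
    if triple not in observed:
        candidates[triple] = float(reconstruction[triple])

print(sorted(candidates.items(), key=lambda item: item[1], reverse=True))
```

The first component fills in the whole slice for target 0, so the unobserved entries (0, 0, 1) and (1, 0, 1) receive the same score as the known ones. The second component contributes the weaker candidate (1, 1, 0).
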
In [5]:
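def get_predictions(decomp, tensor):
    # Entries already present in the original tensor (the known relationships).
    # Assumption: tensor.entries stores mode indices, so the lookup keys are index triples.
    lookup = tensor.entries.groupby(['drug', 'target', 'disease']).aggregate(list).to_dict()['val_idx']
    predictions = {}
    for comp_id in range(decomp.rank):
        modes = [f[:, comp_id] for f in decomp.factors]
        maxes = [np.abs(v).max() for v in modes]
        # Keep only indices whose magnitude exceeds 1% of the mode vector's maximum.
        indices = [np.where(np.abs(v) > 0.01 * m)[0] for v, m in zip(modes, maxes)]
        weight = decomp.weights[comp_id]
        for triple in product(*indices):
            if triple in lookup:
                continue  # known relationship, not a new prediction
            labels = tuple(decomp.labels[i][triple[i]] for i in range(3))
            score = weight * np.prod([modes[i][triple[i]] for i in range(3)])
            # Accumulate the score of a triple across components.
            if labels in predictions:
                predictions[labels] += score
            else:
                predictions[labels] = score

    return sorted(predictions.items(), key=lambda item : item[1], reverse=True)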
In [6]:
get_predictions(decomp, tensor)[:10]
[(('DB00661', 'Q14654', 'neoplasm invasiveness'), 0.9816644861410989),
 (('DB01159', 'O14649', 'neoplasm invasiveness'), 0.9170851635908437),
 (('DB00661', 'Q01668', 'carcinoma, renal cell'), 0.9127377844163775),
 (('DB01595', 'Q99928', 'prostatic neoplasms'), 0.9068675304732172),
 (('DB01159', 'O14649', 'colonic neoplasms'), 0.9014446759030068),
 (('DB01159', 'P59768', 'carcinoma'), 0.8989606965149075),
 (('DB00277', 'P29275', 'leukemia, t-cell'), 0.8980218298878795),
 (('DB01088', 'P34995', 'thyroid neoplasms'), 0.8787987904563658),
 (('DB00795', 'O15111', 'nasopharyngeal neoplasms'), 0.8750328467615847),
 (('DB01115', 'Q13936', 'pituitary neoplasms'), 0.8736441050723573)]
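
For further inspection, it can be convenient to flatten the list of `(labels, score)` pairs into a table. The sketch below assumes only pandas and uses the first three pairs from the output above. The names `top`, `table`, and `per_drug` are illustrative.

```python
import pandas as pd

# First three (labels, score) pairs copied from the printed predictions.
top = [
    (('DB00661', 'Q14654', 'neoplasm invasiveness'), 0.9816644861410989),
    (('DB01159', 'O14649', 'neoplasm invasiveness'), 0.9170851635908437),
    (('DB00661', 'Q01668', 'carcinoma, renal cell'), 0.9127377844163775),
]

# Flatten each ((drug, target, disease), score) pair into one row.
table = pd.DataFrame(
    [(*labels, score) for labels, score in top],
    columns=['drug', 'target', 'disease', 'score'],
)

# Example query: how many candidate triples involve each drug.
per_drug = table.groupby('drug').size()

print(table)
print(per_drug)
```

With the full prediction list in place of `top`, the same table can be filtered by drug, target, or disease before consulting the literature.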

Validating the Predictions

The drug-protein-disease associations listed above are not present in the original data, yet many of them have non-trivial scores in the reconstructed tensor. To check whether the top predictions are medically plausible, we consult the literature: we identify each drug using the DrugBank database and look for studies indicating a possible connection between the drug and the disease.

  • ('DB00661', 'Q14654', 'neoplasm invasiveness'): The drug labeled 'DB00661' is Verapamil, a phenylalkylamine calcium channel blocker used in the treatment of high blood pressure, heart arrhythmias, and angina. One study finds that verapamil can inhibit tumor cell invasion.

  • ('DB01159', 'O14649', 'colonic neoplasms'): The drug labeled 'DB01159' is Halothane, a nonflammable, halogenated, hydrocarbon anesthetic. One study finds that halothane shows anti-tumor activity toward colon cancer cells.

  • ('DB00277', 'P29275', 'leukemia, t-cell'): The drug labeled 'DB00277' is Theophylline, a methylxanthine derivative from tea with diuretic, smooth muscle relaxant, bronchial dilation, cardiac and central nervous system stimulant activities used to treat chronic asthma and chronic lung disease. One review finds that it has therapeutic effects on leukemia.

Key Takeaways

In this notebook, we showed that tensor decompositions can be used to infer missing data and that they are a useful machine learning technique in the bioinformatics toolkit. The tensor completion functionality that we demonstrated transfers readily to recommendation system problems. For example, a similar approach could be applied to past purchase histories to recommend products on an e-commerce platform. Taking greater advantage of the multidimensionality of tensor methods, a more advanced approach could also incorporate customer demographic information or other factors. Moreover, tensor decompositions are applicable to bioinformatics problems other than the one we investigated here. Notably, Hore et al. use tensor decompositions to discover relationships between genetic variation and biological processes in their paper "Tensor decomposition for multi-tissue gene expression experiments." Their use of tensors allows for the incorporation of omics, environmental, and phenotypic data.