--- Build Your Own AI Models to Contribute
Eric Shi, 2023-05-16
Photo by National Cancer Institute on Unsplash
Simulated biomedicine, an interdisciplinary field combining computer science, mathematics, physics, biology, and medicine, holds immense potential for medical research. By applying computer simulations to biological systems and processes, simulated biomedicine overcomes the limitations of experimental sciences. It enables us to uncover new phenomena, propose and test hypotheses, and design innovative disease treatments. This article explores the exciting world of simulated biomedicine and demonstrates how individuals with a bit of coding knowledge can contribute to the expansion of human knowledge.
Application Examples of Simulated Biomedicine
Simulated biomedicine harnesses the power of AI models to tackle challenging biomedical issues that are difficult to handle experimentally. Here are some notable applications:
Virtual Screening to Identify Potential Drug Candidates: Virtual screening employs AI methods to screen vast databases of compounds and identify potential drug candidates. This approach significantly reduces time and cost compared to traditional experimental screening methods. Techniques such as molecular docking, molecular dynamics simulations, and machine learning algorithms are utilized to predict compounds' binding affinity and efficacy to target proteins. For instance, in a study published in the Journal of Medicinal Chemistry [1], virtual screening was used to identify potential inhibitors of SARS-CoV-2 (the virus causing COVID-19).
Predicting Drug Interactions and Effects: Simulated biomedicine assists in predicting how small molecule drugs interact with target proteins or enzymes, as well as how drugs are absorbed, distributed, metabolized, and excreted by the body. These predictions aid in optimizing drug structures, dosages, and dosing schedules, leading to enhanced efficacy and reduced side effects.
Assessing Drug Toxicity and Clinical Trial Outcomes: Simulation models help predict the potential toxicity of drug candidates, enabling researchers to eliminate harmful candidates early in the development process. Additionally, simulated biomedicine can predict outcomes of clinical trials, allowing for more efficient trial designs and informed decisions by regulatory agencies.
Simulated biomedicine offers opportunities for individuals to engage in their own biomedical research by building AI models. Even those with basic coding knowledge can make valuable contributions. Let's explore a recent research example.
Virtual Screening to Identify Potential Inhibitors of SARS-CoV-2
In the study [1], researchers employed virtual screening to identify potential inhibitors of SARS-CoV-2. By analyzing chemical properties and 3D molecular configurations, the simulation determined drug molecules' binding affinity and efficacy to the virus. The study successfully identified several promising drug candidates from an extensive compound database, showcasing the power of simulated biomedicine in accelerating drug discovery.
As illustrated in Figure 1, researchers derived molecular binding affinity values and medical efficacy data from the drug molecules' chemical formulas and 3D molecular configurations.
Figure 1. A computer-assisted visualization of the binding poses of four small drug molecules to a large SARS-CoV-2 protein. Source of the graph: reference [1]. In this figure, the four small drug molecules are E01 in Inset A, E19 in Inset B, E20 in Inset C, and E25 in Inset D.
In this research, the objective function of the AI simulation in the virtual screening is strong binding between the candidate drug molecules and the target protein of SARS-CoV-2 (the virus causing COVID-19).
The four candidate drug molecules are shown in Figure 2. As one can see, these molecules are about the right size to fit into the target cavity of the SARS-CoV-2 protein. At the same time, they can bind to the critical amino acid residues and block the active site. The binding relies on a combination of hydrogen bonds, π−π stacking interactions, and halogen bond interactions between the guest drug molecules and the critical amino acids of the host SARS-CoV-2 protein.
Figure 2. Illustration of molecular structures of small drug molecules E01, E19, E20, and E25 mentioned in Ref. [1].
In the case of E01, the AI simulation identified a hydrogen bond between the carbonyl oxygen atom of E01 and the proton of Glu 166 of SARS-CoV-2, as well as π−π stacking interactions between the aromatic rings of E01 and His 41 of SARS-CoV-2.
Similarly, in the case of E19, the AI simulation identified a hydrogen bond between the carbonyl oxygen atom of E19 and the proton of Glu 166 of SARS-CoV-2, π−π stacking interactions between the aromatic rings of E19 and His 41 of SARS-CoV-2, and a halogen bond interaction between the chlorine atom of E19 and Gly 143 of SARS-CoV-2.
In the case of E20, the AI simulation identified π−π stacking interactions between the aromatic rings of E20 and His 41 of SARS-CoV-2, as well as a hydrogen bond between the carbonyl oxygen atom of E20 and the proton of Glu 166 of SARS-CoV-2.
Similarly, in the case of E25, the AI simulation identified a hydrogen bond between the carbonyl oxygen atom of E25 and the proton of Glu 166 of SARS-CoV-2, π−π stacking interactions between the aromatic rings of E25 and the catalytic His 41 of SARS-CoV-2, and a hydrogen bond between the nitrile group of E25 and the proton of Cys 44 of SARS-CoV-2.
The large databases of compounds used include the well-known ZINC database, which contains millions of compounds that can be screened for potential drug candidates.
The researchers of the study [1] screened over 1.3 million compounds from the ZINC database and identified several compounds that showed promising results in vitro. It took only a few days to complete the virtual screening (over the 1.3 million compounds), whereas any traditional experimental method would have taken years to complete. In the face of an urgent public health crisis, such as the outbreak of the COVID-19 pandemic, the power of simulated biomedicine over traditional medical research is quite apparent.
Take a Closer Look at Related AI Models
To perform a virtual screening, one needs to build an AI model that can be trained through machine learning (ML) to analyze the database(s) at hand. These databases are typically large collections of compounds.
For example, one such study, published in Scientific Reports, used a combination of molecular docking and machine learning algorithms to identify potential inhibitors of the influenza A virus from a selected database [2]. The researchers trained a machine learning model on a set of known virus inhibitors and used this model to predict the likelihood that each compound in the database binds to the target protein. The top candidate compounds were subsequently tested in vitro and in vivo, and several were found to be effective virus inhibitors.
One AI model commonly used in simulated biomedical research is the Support Vector Machine (SVM). This technique can be incorporated into the virtual screening process.
The SVM model learns, through training, how to find the best hyperplane that separates the promising inhibitor candidates (if any) from the non-inhibitors (e.g., the rest of the ZINC database). Once the SVM is trained, it can predict the likelihood that a drug compound binds to the target protein based on the compound's chemical properties.
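To make this idea concrete, here is a minimal sketch of SVM-based screening. It is not the workflow of study [1] or [2]. The "compound library" is synthetic data from scikit-learn's make_classification, the eight features stand in for hypothetical molecular descriptors, and the choice of an RBF kernel with probability=True is just one assumed way to obtain likelihood-style scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a compound library (ASSUMPTION: not real chemistry).
# Each row plays the role of one compound with 8 hypothetical descriptors;
# label 1 = "inhibitor", label 0 = "non-inhibitor" (about 10% inhibitors).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Treat 60% as compounds with known labels and 40% as the unscreened library
X_known, X_library, y_known, y_library = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Scale the descriptors, then fit an SVM that also outputs class probabilities
screen_model = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, probability=True, random_state=0))
screen_model.fit(X_known, y_known)

# Estimated probability of the "inhibitor" class for each unscreened compound
inhibitor_scores = screen_model.predict_proba(X_library)[:, 1]

# Rank the library and keep the 10 highest-scoring candidates
top_hits = np.argsort(inhibitor_scores)[::-1][:10]
print("Top candidate indices:", top_hits)
print("Their scores:", np.round(inhibitor_scores[top_hits], 3))
```

In a real screening, the top-ranked compounds would then go on to docking or laboratory testing, as the studies above describe.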
Python Code for a 13-line SVM Model
A short Python script is given below to illustrate the minimal building blocks needed to construct a simple SVM model. To spare readers too many coding details, the script (13 lines of code, not counting comments) relies on the scikit-learn library, which is a third-party package that must be installed separately, and on a straightforward dataset: iris.
# Import the required libraries and functions
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an SVM model
svm_model = SVC(kernel='linear', C=1.0)
# Train the SVM model
svm_model.fit(X_train, y_train)
# Predict the classes for the testing data
y_pred = svm_model.predict(X_test)
# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this short script, we first import what we need from an ML library for Python: sklearn (i.e., scikit-learn). Then we load the iris dataset and split it into training and testing sets using the train_test_split function from scikit-learn. We then create an SVM model using the SVC class from scikit-learn and specify the linear kernel and the regularization parameter C. The model is trained on the training data using the fit method, and the predicted classes for the testing data are obtained using the predict method. Finally, we evaluate the performance of the SVM model using the accuracy_score function from scikit-learn's metrics module.
The iris dataset is one of the smallest datasets one can use to test and evaluate a classifier, which makes it a convenient stand-in for a real screening dataset. It contains only 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four measured features are listed for each of the 150 samples: sepal length, sepal width, petal length, and petal width, stored in a 150x4 data array (numpy.ndarray).
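If you would like to confirm this description yourself, the few lines below inspect the dataset as scikit-learn ships it. This is only a quick check and is not part of the 13-line script.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)  # (150, 4)

# Names of the four measured features
print(iris.feature_names)

# Number of samples per species (class labels 0, 1, 2)
for name, count in zip(iris.target_names, np.bincount(iris.target)):
    print(name, count)
```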
The 13-line script shown above is written to help readers visualize the skeleton architecture needed for virtual screening. By executing this code, you will obtain an accuracy score (e.g., "Accuracy: 1.0") on your computer screen, confirming the model's performance.
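A single train/test split on 150 samples leaves only 30 flowers for testing, so one accuracy number can be optimistic. One common way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 stratified folds, a choice made here for illustration and not one the article prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5 stratified folds: each fold keeps the three species in equal proportion
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Train and score the same linear SVM once per fold
scores = cross_val_score(SVC(kernel='linear', C=1.0), X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy: %.3f" % scores.mean())
```

The spread across folds gives a rough idea of how much the score depends on which flowers happen to land in the test set.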
Python Code for Visualizing the Data Distribution
To visualize the distribution of the data that the SVM model classifies, you can extend the script as follows (it reuses the X, y, and iris variables defined above):
# Import the required libraries and function
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Visualize the data distribution
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Assign different colors to each class
colors = ['red', 'green', 'blue']
# Plot the data points
for target, color in zip(range(3), colors):
    indices = y == target
    ax.scatter(X[indices, 0], X[indices, 1], X[indices, 2], c=color, label=iris.target_names[target])
# Set labels and title
ax.set_xlabel('Sepal Length (cm)')
ax.set_ylabel('Sepal Width (cm)')
ax.set_zlabel('Petal Length (cm)')
ax.set_title('Data Distribution of Iris Flowers')
# Add a legend
ax.legend()
# Save the 3D plot as a JPG file
plt.savefig('iris_data_distribution.jpg')
# Show the 3D plot
plt.show()
In this second script:
We import matplotlib.pyplot and mpl_toolkits.mplot3d.Axes3D to create the 3D scatter plot.
After evaluating the performance of the model, we proceed with visualizing the data distribution by creating a figure and an Axes3D subplot for the 3D plot.
We assign different colors to each class using the colors list.
We create a boolean array (indices) for each class to identify the corresponding data points. We then use the scatter function to plot the data points with the appropriate color and label. The labels for the legend are set using iris.target_names[target].
The x-axis represents the sepal length in centimeters, the y-axis represents the sepal width in centimeters, and the z-axis represents the petal length in centimeters. These are three of the four features in the Iris dataset. We set the plot title as “Data Distribution of Iris Flowers.”
Finally, we add a legend, save the plot (into a jpg, but you can change it to another format if you want) using plt.savefig, and display the 3D plot using plt.show().
By running the merged script (i.e., the one containing both the 'Python Code for a 13-line SVM Model' and the 'Python Code for Visualizing the Data Distribution'), you can visualize the data distribution in a 3D plot, showcasing how the iris flowers' features are distributed among the different classes. The plot, illustrated in Figure 3, shows the measured features colored by true species rather than the model's predictions, but it gives a sense of how an SVM model can separate and classify the data.
Figure 3. An illustration of the data distribution (as generated by the author) of the iris dataset to which the SVM model is applied. It provides a visual aid for imagining the positions of the hyperplanes that separate the red dots from the green and blue dots, the green dots from the red and blue dots, and so forth.
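For readers who want numbers rather than imagination, a linear SVM exposes its hyperplanes directly. The sketch below fits the same kind of model on the full iris dataset (an assumption made for brevity; the 13-line script fits on the training split only) and prints the weight vector w and offset b of each hyperplane. Note that scikit-learn's SVC handles three classes by training one hyperplane per pair of classes (one-vs-one), so the pairs differ slightly from the one-against-the-rest picture in the caption.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
svm_model = SVC(kernel='linear', C=1.0).fit(iris.data, iris.target)

# One hyperplane (w . x + b = 0) per pair of classes: 3 pairs for 3 species.
# The rows of coef_ follow the order 0-vs-1, 0-vs-2, 1-vs-2.
names = iris.target_names
pairs = [(0, 1), (0, 2), (1, 2)]
for (i, j), w, b in zip(pairs, svm_model.coef_, svm_model.intercept_):
    print(f"{names[i]} vs {names[j]}: w = {w.round(2)}, b = {b:.2f}")
```

Each w has four entries, one per feature, so the hyperplanes live in the full four-dimensional feature space and the 3D plot shows only a projection of it.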
Suppose your aim is not to classify a flower as one of the three types of Iris but to screen for one type of bird among many others, one type of sedan among all known vehicles, or a promising set of inhibitor candidates for a virus. In that case, you will need to import a suitable dataset (or construct it yourself) and adjust the hyperparameters and kernel function(s), depending on the specific problem being addressed.
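One common way to adjust hyperparameters and kernels is a cross-validated grid search. The sketch below still uses the iris dataset as a placeholder, and the candidate values for kernel, C, and gamma are arbitrary illustrations, not recommendations for any particular biomedical problem.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder dataset: swap in your own feature matrix X and labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling the features first usually helps SVMs, especially with RBF kernels
pipeline = make_pipeline(StandardScaler(), SVC())

# Candidate settings to try (illustrative values only);
# gamma is ignored by the linear kernel but harmless to include
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': ['scale', 0.1],
}

# Try every combination with 5-fold cross-validation on the training data
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best settings:", search.best_params_)
print("Cross-validated accuracy: %.3f" % search.best_score_)
print("Held-out test accuracy: %.3f" % search.score(X_test, y_test))
```

Keeping the test set out of the search means the final score is measured on data that played no part in choosing the settings.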
With all of these issues discussed, it is relevant to point out that there are ways to improve the sophistication level of the SVM model. For instance, we can rewrite the code to incorporate deep learning into the SVM model and build a Support Vector Machine-Deep Learning (SVM-DL) model or a Support Vector Regression-Deep Learning (SVR-DL) model. These AI models will combine the strengths of deep learning and SVMs to bring the model's performance to a higher level and enable the model to undertake more sophisticated classification and regression tasks.
Summary
Simulated biomedicine, with its integration of computer simulations and AI models, is poised to revolutionize the field of medicine. By leveraging the power of AI, researchers can overcome experimental limitations, accelerate drug discovery, predict drug interactions and effects, assess toxicity, and optimize clinical trials. Furthermore, individuals with coding knowledge can actively participate in simulated biomedical research by building their own AI models.
As the realm of simulated biomedicine continues to evolve, it is an excellent opportunity for researchers, practitioners, and enthusiasts to explore its potential and contribute to its growth. By combining the principles of computer science, mathematics, physics, biology, and medicine, simulated biomedicine offers a promising path toward expanding our understanding of the human body, discovering novel treatments, and transforming the landscape of healthcare.
… …
If you have read this far, you must be intrigued by simulated biomedicine. In a forthcoming article, I will delve into SVM-DL and SVR-DL models and their applications in simulated biomedicine.
… …
References
[1] Sarah Huff, Indrasena Reddy Kummetha, Shashi Kant Tiwari, Matthew B. Huante, Alex E. Clark, Shaobo Wang, William Bray, Davey Smith, Aaron F. Carlin, Mark Endsley, and Tariq M. Rana. Journal of Medicinal Chemistry 2022, 65 (4), 2866-2879.
[2] Zekun Liu, Junpeng Zhao, Weichen Li, Li Shen, Shengbo Huang, Jingjing Tang, Jie Duan, Fang Fang, Yuelong Huang, Haiyan Chang, Ze Chen, and Ran Zhang. Scientific Reports 2016, 6:19095. DOI: 10.1038/srep19095.