Volcano plots are commonly used in data analysis workflows where you want to compare effect size and statistical significance in the same chart. A typical volcano plot places log fold change on the x-axis and `-log10(p-value)` on the y-axis, making it easy to spot points that are both far from zero and highly significant.

This tutorial shows how to build a volcano plot in Matplotlib, add significance thresholds, style important groups, and create a more practical reporting example.

### Basic Volcano Plot

Let's start with a simple volcano plot using simulated fold changes and p-values.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Sample data
log_fold_change = np.random.normal(0, 1.2, 200)
p_values = np.random.uniform(0.001, 1.0, 200)

plt.scatter(log_fold_change, -np.log10(p_values), color="black", alpha=0.6)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Basic Volcano Plot")
plt.show()
```

- **`plt.scatter(log_fold_change, -np.log10(p_values))`** plots effect size against significance.
- Larger y-values indicate smaller p-values and therefore stronger statistical evidence.

### Adding Threshold Lines

Threshold lines help readers quickly distinguish between points that pass significance cutoffs and those that do not.
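As a rough sketch of where such a cutoff lands on the chart, the snippet below converts a few p-value cutoffs to the `-log10` scale used on the y-axis. The cutoff values are illustrative assumptions, not recommendations.

```python
import numpy as np

# Illustrative p-value cutoffs (assumed for this sketch)
cutoffs = np.array([0.05, 0.01, 0.001])

# Height of the horizontal threshold line on the -log10 axis
line_heights = -np.log10(cutoffs)

for cutoff, height in zip(cutoffs, line_heights):
    print(f"p = {cutoff:g} -> y = {height:.2f}")
```

This prints heights of 1.30, 2.00, and 3.00, so a stricter cutoff moves the horizontal line up, not down. The full example for this section draws the line at `-np.log10(0.05)`: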
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

# Sample data
log_fold_change = np.random.normal(0, 1.1, 250)
p_values = np.random.uniform(0.001, 1.0, 250)

significance_threshold = 0.05
fold_change_threshold = 1.0

plt.scatter(log_fold_change, -np.log10(p_values), color="black", alpha=0.5)
plt.axhline(-np.log10(significance_threshold), color="red", linestyle="--")
plt.axvline(-fold_change_threshold, color="gray", linestyle="--")
plt.axvline(fold_change_threshold, color="gray", linestyle="--")
plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano Plot with Thresholds")
plt.show()
```

- **`plt.axhline(-np.log10(significance_threshold), ...)`** marks the statistical significance cutoff.
- **`plt.axvline(...)`** lines mark the effect-size boundaries on both sides of zero.

### Highlighting Significant Points

Coloring points by category makes the plot easier to interpret, especially when you want to separate upregulated, downregulated, and non-significant observations.
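One way to derive those categories, sketched here on a tiny hand-written dataset, is to build one label per point with `np.select` and count the groups before plotting. The array values, the cutoffs, and the label strings are assumptions chosen for illustration.

```python
import numpy as np

# Tiny hand-written dataset (assumed values, for illustration only)
log_fold_change = np.array([2.1, -1.8, 0.3, 1.5, -0.2, -2.4])
p_values = np.array([0.001, 0.03, 0.02, 0.40, 0.80, 0.004])

fc_cutoff = 1.0
p_cutoff = 0.05

# A point needs both a large effect and a small p-value to be flagged
conditions = [
    (log_fold_change >= fc_cutoff) & (p_values < p_cutoff),
    (log_fold_change <= -fc_cutoff) & (p_values < p_cutoff),
]
choices = ["Upregulated", "Downregulated"]
categories = np.select(conditions, choices, default="Not significant")

# Count how many points fall into each group
labels, counts = np.unique(categories, return_counts=True)
summary = dict(zip(labels.tolist(), counts.tolist()))
print(summary)
```

Counting the groups first is a quick check that the conditions behave as expected. The plotting example for this section applies the same conditions as boolean masks: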
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# Sample data
log_fold_change = np.random.normal(0, 1.3, 300)
p_values = np.random.uniform(0.001, 1.0, 300)

# Boolean masks for each group
significant_up = (log_fold_change >= 1) & (p_values < 0.05)
significant_down = (log_fold_change <= -1) & (p_values < 0.05)
not_significant = ~(significant_up | significant_down)

plt.scatter(
    log_fold_change[not_significant],
    -np.log10(p_values[not_significant]),
    color="lightgray", alpha=0.6, label="Not significant",
)
plt.scatter(
    log_fold_change[significant_up],
    -np.log10(p_values[significant_up]),
    color="crimson", alpha=0.8, label="Upregulated",
)
plt.scatter(
    log_fold_change[significant_down],
    -np.log10(p_values[significant_down]),
    color="royalblue", alpha=0.8, label="Downregulated",
)

plt.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=1)
plt.axvline(-1, color="black", linestyle="--", linewidth=1)
plt.axvline(1, color="black", linestyle="--", linewidth=1)
plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano Plot with Highlighted Groups")
plt.legend()
plt.show()
```

- Points above the horizontal cutoff and outside the vertical cutoffs are typically the ones of most interest.
- Separate colors make directional changes much easier to scan.

### Labeling Top Hits

If a few points are especially important, you can label them directly on the chart.
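Which points count as important is a judgment call. As one possible rule, the sketch below keeps only points that pass both cutoffs and then ranks them by p-value. The gene names, values, and cutoffs are made up for illustration.

```python
import numpy as np

# Made-up data for illustration
genes = np.array(["Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"])
log_fold_change = np.array([2.3, -0.4, 1.6, -2.0, 0.9, -1.2])
p_values = np.array([0.004, 0.0001, 0.03, 0.0005, 0.01, 0.20])

# Keep only points that pass both the effect-size and significance cutoffs
passes = (np.abs(log_fold_change) >= 1.0) & (p_values < 0.05)
candidates = np.flatnonzero(passes)

# Rank the remaining candidates from smallest to largest p-value
ranked = candidates[np.argsort(p_values[candidates])]

# Pick the two strongest candidates to label
top_genes = genes[ranked[:2]]
print(top_genes)
```

Here "Gene B" has the smallest p-value but is skipped because its fold change is small. The section's example uses a simpler rule and labels the five smallest p-values regardless of fold change: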
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(3)

# Sample data
genes = [f"Gene {i}" for i in range(1, 21)]
log_fold_change = np.random.normal(0, 1.5, 20)
p_values = np.random.uniform(0.0005, 0.2, 20)
y_values = -np.log10(p_values)

plt.scatter(log_fold_change, y_values, color="darkslategray")

# Indices of the five smallest p-values
top_hits = np.argsort(p_values)[:5]

# Place each label slightly above its point
for index in top_hits:
    plt.text(log_fold_change[index], y_values[index] + 0.1, genes[index], fontsize=9, ha="center")

plt.axhline(-np.log10(0.05), color="red", linestyle="--")
plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano Plot with Labeled Hits")
plt.show()
```

- **`np.argsort(p_values)[:5]`** returns the indices of the five smallest p-values.
- **`plt.text(...)`** adds a label just above each selected point.

### Practical Example: Differential Expression Style Plot

Here is a more realistic example showing how you might summarize many features at once and emphasize the most important ones.
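Real result tables sometimes report p-values as exactly zero, which `-np.log10` turns into infinity. The snippet below is a minimal sketch of one workaround, assuming it is acceptable to floor zeros at the smallest positive p-value in the data. Whether that choice is appropriate depends on your analysis, and the values are invented.

```python
import numpy as np

# Invented p-values, including one reported as exactly zero
p_values = np.array([0.0, 0.002, 0.5, 1e-6])

# Assumed floor: the smallest positive p-value present in the data
floor = p_values[p_values > 0].min()

# Raise zeros to the floor before taking the logarithm
safe_p_values = np.maximum(p_values, floor)
y_values = -np.log10(safe_p_values)

print(y_values)
```

The zero entry ends up at the same height as the smallest observed p-value instead of at infinity. The simulated data in the example that follows never hits this case, because its p-values are drawn from a range that starts above zero: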
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(4)

# Simulated differential expression style data
n = 500
log_fold_change = np.random.normal(0, 1.4, n)
p_values = np.random.uniform(0.0001, 1.0, n)
y_values = -np.log10(p_values)

fc_cutoff = 1.0
p_cutoff = 0.05

# Boolean masks for each group
upregulated = (log_fold_change >= fc_cutoff) & (p_values < p_cutoff)
downregulated = (log_fold_change <= -fc_cutoff) & (p_values < p_cutoff)
background = ~(upregulated | downregulated)

fig, ax = plt.subplots(figsize=(9, 6))

ax.scatter(log_fold_change[background], y_values[background], color="#CFCFCF", alpha=0.5, s=24, label="Background")
ax.scatter(log_fold_change[upregulated], y_values[upregulated], color="#D62728", alpha=0.8, s=28, label="Upregulated")
ax.scatter(log_fold_change[downregulated], y_values[downregulated], color="#1F77B4", alpha=0.8, s=28, label="Downregulated")

ax.axhline(-np.log10(p_cutoff), color="black", linestyle="--", linewidth=1)
ax.axvline(-fc_cutoff, color="black", linestyle="--", linewidth=1)
ax.axvline(fc_cutoff, color="black", linestyle="--", linewidth=1)

ax.set_xlabel("Log2 Fold Change")
ax.set_ylabel("-log10(p-value)")
ax.set_title("Differential Expression Style Volcano Plot")
ax.legend(loc="upper right")
ax.grid(alpha=0.2)

plt.show()
```

- This layout makes it easy to identify observations with both large effect size and strong statistical support.
- In practice, you would replace the simulated arrays with results from your analysis pipeline.

### Conclusion

Volcano plots in Matplotlib are an effective way to combine effect size and significance in a single figure. By adding cutoffs, color coding, and selective labels, you can turn a basic scatter plot into a much more useful analysis and reporting tool.