# Volcano Plots in Matplotlib

Volcano plots show effect size and statistical significance in the same chart. A typical volcano plot places log2 fold change on the x-axis and `-log10(p-value)` on the y-axis, so points that are both far from zero and highly significant end up in the upper-left and upper-right corners, where they are easy to spot.

This tutorial shows how to build a volcano plot in Matplotlib, add significance thresholds, style important groups, and create a more practical reporting example.
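To make the two axis transformations concrete, here is a small sketch using hand-picked values (the numbers are illustrative and do not come from a real dataset). It shows that halving and doubling land symmetrically at -1 and +1 on the x-axis, and that smaller p-values become larger heights on the y-axis.

```python
import numpy as np

# Illustrative values only: ratios of treated to control measurements
fold_change = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
p_values = np.array([0.5, 0.05, 0.01, 0.001, 0.0001])

# x-axis: log2 makes halving (-1) and doubling (+1) symmetric around zero
log2_fold_change = np.log2(fold_change)

# y-axis: -log10 turns small p-values into large positive heights
neg_log10_p = -np.log10(p_values)

for fc, x, p, y in zip(fold_change, log2_fold_change, p_values, neg_log10_p):
    print(f"fold change {fc} -> x = {x:+.1f} | p = {p} -> y = {y:.2f}")
```

A p-value of 0.05 sits at a height of about 1.3, which is why the usual significance line in a volcano plot is drawn near that value.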

### Basic Volcano Plot

Let's start with a simple volcano plot using simulated fold changes and p-values.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # fixed seed so the figure is reproducible

# Sample data
log_fold_change = np.random.normal(0, 1.2, 200)
p_values = np.random.uniform(0.001, 1.0, 200)

# Effect size on the x-axis, transformed significance on the y-axis
plt.scatter(log_fold_change, -np.log10(p_values), color="black", alpha=0.6)

plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Basic Volcano Plot")
plt.show()
```

- **`plt.scatter(log_fold_change, -np.log10(p_values))`** plots effect size against significance.
- Larger y-values correspond to smaller p-values, that is, stronger evidence against the null hypothesis.
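
Because the sample p-values in this section are drawn independently of the fold changes, the scatter shows no relationship between effect size and significance, so it lacks the V shape that gives the plot its name. The sketch below is one way to simulate data where larger effects tend to be more significant. It assumes each fold change is tested with a two-sided z-test against a made-up standard error; `standard_error` and `z_scores` are names introduced here for illustration only.

```python
import math

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_features = 200

# Assumption: every estimate comes with its own (simulated) standard error
log_fold_change = rng.normal(0, 1.2, n_features)
standard_error = rng.uniform(0.4, 1.0, n_features)

# Two-sided p-value for a standard normal statistic: p = erfc(|z| / sqrt(2))
z_scores = log_fold_change / standard_error
p_values = np.array([math.erfc(abs(z) / math.sqrt(2)) for z in z_scores])

fig, ax = plt.subplots()
ax.scatter(log_fold_change, -np.log10(p_values), color="black", alpha=0.6)
ax.set_xlabel("Log2 Fold Change")
ax.set_ylabel("-log10(p-value)")
ax.set_title("Simulated Volcano Shape")
```

Call `plt.show()` to display the figure. Real results rarely follow this exact model, so treat it as a way to get plausible-looking test data rather than as a description of any particular analysis.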

### Adding Threshold Lines

Threshold lines help readers quickly distinguish between points that pass significance cutoffs and those that do not.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

# Sample data
log_fold_change = np.random.normal(0, 1.1, 250)
p_values = np.random.uniform(0.001, 1.0, 250)

significance_threshold = 0.05
fold_change_threshold = 1.0

plt.scatter(log_fold_change, -np.log10(p_values), color="black", alpha=0.5)

plt.axhline(-np.log10(significance_threshold), color="red", linestyle="--")
plt.axvline(-fold_change_threshold, color="gray", linestyle="--")
plt.axvline(fold_change_threshold, color="gray", linestyle="--")

plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano Plot with Thresholds")
plt.show()
```

- **`plt.axhline(-np.log10(significance_threshold), ...)`** marks the statistical significance cutoff.
- **`plt.axvline(...)`** lines mark the effect-size boundaries on both sides of zero.
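
If you draw the same cutoffs in several figures, it can help to wrap them in a small function and give the lines legend labels so readers know which values were used. The helper below is a hypothetical sketch (`add_threshold_lines` is not a Matplotlib function); it only relies on `Axes.axhline`, `Axes.axvline`, and `Axes.legend`.

```python
import numpy as np
import matplotlib.pyplot as plt


def add_threshold_lines(ax, p_cutoff=0.05, fc_cutoff=1.0):
    """Draw significance and effect-size cutoffs on ax (hypothetical helper)."""
    # The p-value cutoff must be transformed the same way as the data
    h_line = ax.axhline(-np.log10(p_cutoff), color="red", linestyle="--",
                        label=f"p = {p_cutoff}")
    # Label only one vertical line so the legend shows a single entry for both
    left = ax.axvline(-fc_cutoff, color="gray", linestyle="--",
                      label=f"|log2 FC| = {fc_cutoff}")
    right = ax.axvline(fc_cutoff, color="gray", linestyle="--")
    return h_line, left, right


rng = np.random.default_rng(1)
log_fold_change = rng.normal(0, 1.1, 250)
p_values = rng.uniform(0.001, 1.0, 250)

fig, ax = plt.subplots()
ax.scatter(log_fold_change, -np.log10(p_values), color="black", alpha=0.5)
h_line, left, right = add_threshold_lines(ax, p_cutoff=0.05, fc_cutoff=1.0)
ax.legend()
```

Call `plt.show()` to display the figure. Passing the raw p-value into the helper and transforming it inside keeps the line consistent with the y-axis, which is easy to get wrong when the cutoff is typed in by hand.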

### Highlighting Significant Points

Coloring points by category makes the plot easier to interpret, especially when you want to separate upregulated, downregulated, and non-significant observations.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# Sample data
log_fold_change = np.random.normal(0, 1.3, 300)
p_values = np.random.uniform(0.001, 1.0, 300)

# Boolean masks for the three groups
significant_up = (log_fold_change >= 1) & (p_values < 0.05)
significant_down = (log_fold_change <= -1) & (p_values < 0.05)
not_significant = ~(significant_up | significant_down)

plt.scatter(
    log_fold_change[not_significant], -np.log10(p_values[not_significant]),
    color="lightgray", alpha=0.6, label="Not significant",
)
plt.scatter(
    log_fold_change[significant_up], -np.log10(p_values[significant_up]),
    color="crimson", alpha=0.8, label="Upregulated",
)
plt.scatter(
    log_fold_change[significant_down], -np.log10(p_values[significant_down]),
    color="royalblue", alpha=0.8, label="Downregulated",
)

plt.axhline(-np.log10(0.05), color="black", linestyle="--", linewidth=1)
plt.axvline(-1, color="black", linestyle="--", linewidth=1)
plt.axvline(1, color="black", linestyle="--", linewidth=1)

plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano Plot with Highlighted Groups")
plt.legend()
plt.show()
```

- Points above the horizontal cutoff and outside the vertical cutoffs are typically the ones of most interest.
- Separate colors make directional changes much easier to scan.
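
As an alternative to three separate masks, you can assign each point a category label once and then loop over the categories. The sketch below uses nested `np.where` calls for the labels and `np.unique` for the counts; the `styles` dictionary and the `(n=...)` legend suffix are choices made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
log_fold_change = rng.normal(0, 1.3, 300)
p_values = rng.uniform(0.001, 1.0, 300)

# One label per point, using the same cutoffs as the masks above
significant = p_values < 0.05
category = np.where(
    significant & (log_fold_change >= 1), "Upregulated",
    np.where(significant & (log_fold_change <= -1), "Downregulated",
             "Not significant"),
)

# Count how many points fall into each group that actually occurs
labels, counts = np.unique(category, return_counts=True)
for label, count in zip(labels, counts):
    print(f"{label}: {count}")

# Draw the background first so the highlighted groups sit on top
styles = {
    "Not significant": "lightgray",
    "Upregulated": "crimson",
    "Downregulated": "royalblue",
}
fig, ax = plt.subplots()
for name, color in styles.items():
    mask = category == name
    ax.scatter(log_fold_change[mask], -np.log10(p_values[mask]),
               color=color, alpha=0.7, label=f"{name} (n={mask.sum()})")
ax.legend()
```

Call `plt.show()` to display the figure. Putting the counts in the legend gives readers the group sizes without a separate table.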

### Labeling Top Hits

If a few points are especially important, you can label them directly on the chart.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(3)

# Sample data
genes = [f"Gene {i}" for i in range(1, 21)]
log_fold_change = np.random.normal(0, 1.5, 20)
p_values = np.random.uniform(0.0005, 0.2, 20)

y_values = -np.log10(p_values)

plt.scatter(log_fold_change, y_values, color="darkslategray")

# Indices of the five smallest p-values
top_hits = np.argsort(p_values)[:5]
for index in top_hits:
    # Offset each label slightly upward so it does not cover its point
    plt.text(log_fold_change[index], y_values[index] + 0.1, genes[index], fontsize=9, ha="center")

plt.axhline(-np.log10(0.05), color="red", linestyle="--")
plt.xlabel("Log2 Fold Change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano Plot with Labeled Hits")
plt.show()
```

- **`np.argsort(p_values)[:5]`** returns the indices of the five smallest p-values.
- **`plt.text(...)`** places each label 0.1 y-axis units above its point, centered horizontally.
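
The `+ 0.1` offset passed to `plt.text` is measured in data units, so the gap between point and label changes with the y-axis range. One alternative, sketched below, is `Axes.annotate` with `textcoords="offset points"`, which keeps the gap fixed in printer's points regardless of the axis scale. The 6-point offset is an arbitrary choice for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
genes = [f"Gene {i}" for i in range(1, 21)]
log_fold_change = rng.normal(0, 1.5, 20)
p_values = rng.uniform(0.0005, 0.2, 20)
y_values = -np.log10(p_values)

fig, ax = plt.subplots()
ax.scatter(log_fold_change, y_values, color="darkslategray")

# Indices of the five smallest p-values, most significant first
top_hits = np.argsort(p_values)[:5]
for index in top_hits:
    ax.annotate(
        genes[index],
        xy=(log_fold_change[index], y_values[index]),  # the point being labeled
        xytext=(0, 6),                                 # shift the label 6 points up
        textcoords="offset points",
        ha="center",
        fontsize=9,
    )

ax.axhline(-np.log10(0.05), color="red", linestyle="--")
ax.set_xlabel("Log2 Fold Change")
ax.set_ylabel("-log10(p-value)")
```

Call `plt.show()` to display the figure. Neither approach prevents labels from overlapping when the top hits are close together, so dense plots may still need manual adjustment.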

### Practical Example: Differential Expression Style Plot

Here is a more realistic example showing how you might summarize many features at once and emphasize the most important ones.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(4)

# Simulated differential expression style data
n = 500
log_fold_change = np.random.normal(0, 1.4, n)
p_values = np.random.uniform(0.0001, 1.0, n)
y_values = -np.log10(p_values)

fc_cutoff = 1.0
p_cutoff = 0.05

# Classify each feature against both cutoffs
upregulated = (log_fold_change >= fc_cutoff) & (p_values < p_cutoff)
downregulated = (log_fold_change <= -fc_cutoff) & (p_values < p_cutoff)
background = ~(upregulated | downregulated)

fig, ax = plt.subplots(figsize=(9, 6))

ax.scatter(log_fold_change[background], y_values[background],
           color="#CFCFCF", alpha=0.5, s=24, label="Background")
ax.scatter(log_fold_change[upregulated], y_values[upregulated],
           color="#D62728", alpha=0.8, s=28, label="Upregulated")
ax.scatter(log_fold_change[downregulated], y_values[downregulated],
           color="#1F77B4", alpha=0.8, s=28, label="Downregulated")

ax.axhline(-np.log10(p_cutoff), color="black", linestyle="--", linewidth=1)
ax.axvline(-fc_cutoff, color="black", linestyle="--", linewidth=1)
ax.axvline(fc_cutoff, color="black", linestyle="--", linewidth=1)

ax.set_xlabel("Log2 Fold Change")
ax.set_ylabel("-log10(p-value)")
ax.set_title("Differential Expression Style Volcano Plot")
ax.legend(loc="upper right")
ax.grid(alpha=0.2)

plt.show()
```

- This layout makes it easy to identify observations with both large effect size and strong statistical support.
- In practice, you would replace the simulated arrays with results from your analysis pipeline.
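
When you switch to real results, two issues tend to appear that simulated data hides: missing values, and p-values reported as exactly 0 (for example through floating-point underflow), which `-np.log10` turns into infinity. The function below is a hypothetical sketch of a reusable wrapper that handles both. The name `plot_volcano`, the `p_floor` parameter, and its default value are assumptions made for this example, not part of Matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_volcano(log_fold_change, p_values, fc_cutoff=1.0, p_cutoff=0.05,
                 p_floor=1e-300, ax=None):
    """Hypothetical helper: volcano plot from two equal-length sequences."""
    x = np.asarray(log_fold_change, dtype=float)
    p = np.asarray(p_values, dtype=float)
    if x.shape != p.shape:
        raise ValueError("log_fold_change and p_values must have the same shape")

    # Drop features with missing results
    keep = np.isfinite(x) & np.isfinite(p)
    x, p = x[keep], p[keep]

    # Floor p-values so that p == 0 maps to a large finite height, not inf
    y = -np.log10(np.clip(p, p_floor, 1.0))

    up = (x >= fc_cutoff) & (p < p_cutoff)
    down = (x <= -fc_cutoff) & (p < p_cutoff)
    background = ~(up | down)

    if ax is None:
        _, ax = plt.subplots(figsize=(9, 6))
    groups = [
        (background, "#CFCFCF", "Background"),
        (up, "#D62728", "Upregulated"),
        (down, "#1F77B4", "Downregulated"),
    ]
    for mask, color, label in groups:
        ax.scatter(x[mask], y[mask], color=color, alpha=0.7, s=24, label=label)

    ax.axhline(-np.log10(p_cutoff), color="black", linestyle="--", linewidth=1)
    ax.axvline(-fc_cutoff, color="black", linestyle="--", linewidth=1)
    ax.axvline(fc_cutoff, color="black", linestyle="--", linewidth=1)
    ax.set_xlabel("Log2 Fold Change")
    ax.set_ylabel("-log10(p-value)")
    ax.legend(loc="upper right")

    counts = {"up": int(up.sum()), "down": int(down.sum()),
              "background": int(background.sum())}
    return ax, counts


# Tiny made-up input: one missing fold change and one p-value of exactly 0
ax, counts = plot_volcano(
    [2.0, -2.0, 0.1, np.nan, 3.0],
    [0.001, 0.01, 0.5, 0.2, 0.0],
)
print(counts)
```

Call `plt.show()` to display the figure, or `ax.figure.savefig(...)` to write it to a file for a report. Returning the group counts alongside the axes makes it straightforward to quote the number of hits in accompanying text.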

### Conclusion

Volcano plots in Matplotlib are an effective way to combine effect size and significance in a single figure. By adding cutoffs, color coding, and selective labels, you can turn a basic scatter plot into a much more useful analysis and reporting tool.