Scatterplot


A scatter plot displays the relationship between two numeric variables. Each data point is represented as a circle. Several tools can build one in Python; this section provides code samples for Seaborn and Matplotlib, and for Plotly when an interactive version is needed. Note that this online course has a chapter dedicated to scatterplots.

⏱ Quick start (Seaborn)

The regplot() function of the Seaborn library is one of the quickest ways to build a scatterplot. 🔥 Note that it also draws a linear regression fit by default; pass fit_reg=False to show the points only.

Simply pass one numeric column of a data frame to x and another to y, and the function will handle the rest.

# library & dataset
import seaborn as sns
df = sns.load_dataset('iris')

# use the function regplot to make a scatterplot
sns.regplot(x=df["sepal_length"], y=df["sepal_width"])

⚠️ Scatterplot and overplotting

The main danger with scatterplots is overplotting. When the sample size gets big, circles tend to overlap, making the figure unreadable.

Several workarounds exist, such as using opacity or switching to another chart type.
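
The two workarounds mentioned above can be sketched as follows. This is a minimal illustration with Matplotlib, assuming synthetic data generated with NumPy; the sample size, marker size, alpha value and grid size are arbitrary choices, not recommendations from the original text.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, heavily overlapping data (an assumption made for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
y = x + rng.normal(size=20_000)

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Workaround 1: small, semi-transparent markers so dense regions look darker
ax_alpha.scatter(x, y, s=5, alpha=0.05)
ax_alpha.set_title("Opacity (alpha=0.05)")

# Workaround 2: another chart type, here hexagonal binning of the point counts
hb = ax_hex.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=ax_hex, label="points per bin")
ax_hex.set_title("Hexbin")
```

Call plt.show() or fig.savefig() afterwards to display or export the figure.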

Scatterplots with Seaborn

Seaborn is a Python library that makes it easy to build good-looking charts. The regplot() function should get you started in minutes. The first example below explains how to build the most basic scatterplot with Python. Then, several types of customization are described: adding a regression line, tweaking markers and axes, adding labels and more.
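
As a rough sketch of the customizations listed above, the snippet below tweaks the markers, the regression line and the axis labels of a regplot() chart. The data frame and its column names x and y are hypothetical (synthetic data is used so the example runs offline), and the colors and sizes are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small synthetic data frame (hypothetical columns, not a real dataset)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 80)})
df["y"] = 2 * df["x"] + rng.normal(scale=3, size=80)

fig, ax = plt.subplots()
sns.regplot(
    data=df, x="x", y="y",
    marker="x",                                            # marker shape
    fit_reg=True,                                          # False hides the regression line
    scatter_kws={"color": "teal", "alpha": 0.6, "s": 60},  # forwarded to the scatter call
    line_kws={"color": "black", "linewidth": 1},           # forwarded to the line call
    ax=ax,
)

# Axis labels and title are set on the Matplotlib Axes
ax.set_xlabel("x (arbitrary units)")
ax.set_ylabel("y (arbitrary units)")
ax.set_title("regplot with custom markers and regression line")
```

Because regplot() draws on a regular Matplotlib Axes, any further tweak (limits, ticks, annotations) can go through the usual ax methods.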

If you are interested in scatterplots, some other chart types could be useful to you.

A scatterplot with marginal distributions lets you check the distribution of both the x and y variables. A correlogram lets you check the relationship between each pair of numeric variables in a dataset.

⏱ Quick start (Matplotlib)

Matplotlib also requires only a few lines of code to draw a scatterplot thanks to its plot() function. The default styling is plainer than Seaborn's, but the function generally offers more flexibility in terms of customization.

# libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

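# Create a dataset: y follows x with some Gaussian noise
df = pd.DataFrame({
    'x_values': range(1, 101),
    'y_values': np.random.randn(100) * 15 + np.arange(1, 101)
})

# plot the markers only (linestyle='none' hides the connecting line)
plt.plot('x_values', 'y_values', data=df, linestyle='none', marker='o')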
plt.show()
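
As an illustration of that extra flexibility, here is a minimal sketch using Matplotlib's scatter() function, which (unlike plot()) accepts a color and a size per point. The dataset mirrors the one above but is regenerated with a seeded generator; the size mapping, colormap and styling values are arbitrary assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same kind of dataset as above, with a fixed seed for reproducibility
rng = np.random.default_rng(3)
df = pd.DataFrame({"x_values": np.arange(1, 101)})
df["y_values"] = df["x_values"] + rng.normal(scale=15, size=100)

fig, ax = plt.subplots()
points = ax.scatter(
    df["x_values"], df["y_values"],
    c=df["y_values"],          # color each point by its y value
    s=20 + df["x_values"],     # marker area grows with x
    cmap="viridis",
    alpha=0.7,
    edgecolors="black",
    linewidths=0.5,
)
fig.colorbar(points, ax=ax, label="y value")
ax.set_xlabel("x_values")
ax.set_ylabel("y_values")
```

Call plt.show() afterwards to display the chart.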

From the web

The web is full of astonishing charts made by awesome bloggers (often using R). The Python graph gallery tries to display (or translate from R) some of the best creations and explain how their source code works.

The first example below demonstrates how to add clean labels to a scatterplot while automatically avoiding overlaps. It also explains how to control the background, fonts, titles and more.

If you want to display your work here, please drop me a word or even better, submit a Pull Request!

Contact

👋 This document is a work by Yan Holtz. Any feedback is highly encouraged. You can file an issue on GitHub, drop me a message on Twitter, or send an email pasting yan.holtz.data with gmail.com.
