matplotlib scatter: the more overlapping points the bigger the marker


You can use the occurrence frequency of the x-values (or the y-values, which works equally well for this particular data set), obtained with the Counter class from the collections module. The frequencies then serve as a scaling factor for the marker sizes. The factor 200 is just a large number chosen to make the size differences obvious.

from matplotlib import pyplot as plt
from collections import Counter

a = [1, 1, 1, 1, 2, 2]
b = [2, 2, 2, 2, 1, 1]

# Size each marker by how often its x-value occurs
counts = Counter(a)
weights = [200 * counts[x] for x in a]
plt.scatter(a, b, s = weights)
plt.show()
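
The snippet above counts x-values only, which works because every x-value in this data set pairs with a single y-value. A minimal sketch of a more general variant, assuming you want to count exact (x, y) pairs and draw each unique point once, could look like this. The helper names marker_sizes and plot_counts and the scale parameter are hypothetical, not part of the original answer.

```python
from collections import Counter


def marker_sizes(xs, ys, scale=200):
    """Return the unique (x, y) points and one marker size per point.

    `scale` is an arbitrary factor (200, as in the answer above).
    """
    counts = Counter(zip(xs, ys))       # how often each exact point occurs
    points = list(counts)               # unique points, in first-seen order
    ux = [x for x, _ in points]
    uy = [y for _, y in points]
    sizes = [scale * counts[p] for p in points]
    return ux, uy, sizes


def plot_counts(xs, ys, scale=200):
    """Draw each unique point once, sized by its frequency."""
    # Imported here so marker_sizes() can be used without Matplotlib
    import matplotlib.pyplot as plt

    ux, uy, sizes = marker_sizes(xs, ys, scale)
    plt.scatter(ux, uy, s=sizes)
    plt.show()


a = [1, 1, 1, 1, 2, 2]
b = [2, 2, 2, 2, 1, 1]
ux, uy, sizes = marker_sizes(a, b)
```

Calling plot_counts(a, b) draws two markers, one per unique point, instead of six stacked ones.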

Another option to visualise the distribution is a bar chart:

freqs = Counter(a)

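# Convert the dict views to lists so the call works across Matplotlib versions
plt.bar(list(freqs.keys()), list(freqs.values()), width = 0.5)
plt.xticks(list(freqs.keys()))
plt.show()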

Suggestion : 2

You can try decreasing the marker size in your plot so that the points overlap less and the patterns become clearer. When points overlap, you can also add jitter, which makes the distribution easier to see; seaborn provides the stripplot() function for this purpose. A density plot is a good alternative to an overplotted scatter plot because it shows the relationship between the variables more clearly. Lastly, you can draw a marginal plot with seaborn to avoid overlapping in your graph.

# libraries
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

# Dataset:
df = pd.DataFrame({
   'x': np.random.normal(10, 1.2, 20000),
   'y': np.random.normal(10, 1.2, 20000),
   'group': np.repeat('A', 20000)
})
tmp1 = pd.DataFrame({
   'x': np.random.normal(14.5, 1.2, 20000),
   'y': np.random.normal(14.5, 1.2, 20000),
   'group': np.repeat('B', 20000)
})
tmp2 = pd.DataFrame({
   'x': np.random.normal(9.5, 1.5, 20000),
   'y': np.random.normal(15.5, 1.5, 20000),
   'group': np.repeat('C', 20000)
})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, tmp1, tmp2])

# plot
plt.plot('x', 'y', "", data = df, linestyle = '', marker = 'o')
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting looks like that:', loc = 'left')
plt.show()

# Plot with small marker size
plt.plot('x', 'y', "", data = df, linestyle = '', marker = 'o', markersize = 0.7)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to reduce the dot size', loc = 'left')
plt.show()

# Plot with transparency
plt.plot('x', 'y', "", data = df, linestyle = '', marker = 'o', markersize = 3, alpha = 0.05, color = "purple")

# Titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to use transparency', loc = 'left')
plt.show()

# 2D density plot (fill replaces the deprecated shade argument):
sns.kdeplot(data = df, x = "x", y = "y", cmap = "Reds", fill = True)
plt.title('Overplotting? Try 2D density graph', loc = 'left')
plt.show()

# Sample 1000 random lines
df_sample = df.sample(1000)

# Make the plot with this subset
plt.plot('x', 'y', "", data = df_sample, linestyle = '', marker = 'o')

# titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Sample your data', loc = 'left')
plt.show()

# Keep only the group to highlight (group A)
df_filtered = df[df['group'] == 'A']
# Plot the whole dataset
plt.plot('x', 'y', "", data = df, linestyle = '', marker = 'o', markersize = 1.5, color = "grey", alpha = 0.3, label = 'other group')

# Add the group to study
plt.plot('x', 'y', "", data = df_filtered, linestyle = '', marker = 'o', markersize = 1.5, alpha = 0.3, label = 'group A')

# Add titles and legend
plt.legend(markerscale = 8)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Show a specific group', loc = 'left')
plt.show()
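
The jitter idea mentioned at the start of this suggestion has no code above, so here is a minimal sketch of it. It adds plain uniform noise by hand rather than calling seaborn's stripplot(); the helper names jitter and plot_jittered and the amount and seed parameters are hypothetical and chosen only for illustration.

```python
import random


def jitter(values, amount=0.1, seed=None):
    """Return a copy of `values` with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


def plot_jittered(xs, ys, amount=0.1, seed=0):
    """Scatter the points after nudging each one by a small random offset."""
    # Imported here so jitter() can be used without Matplotlib
    import matplotlib.pyplot as plt

    plt.scatter(jitter(xs, amount, seed), jitter(ys, amount, seed + 1),
                s=10, alpha=0.5)
    plt.show()


xs = [1, 1, 1, 1, 2, 2]
jx = jitter(xs, amount=0.1, seed=0)   # same length as xs, each value moved by at most 0.1
```

Jitter suits discrete or rounded data where many points share identical coordinates. The amount should stay small relative to the spacing between distinct values, otherwise neighbouring groups blur together.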

Suggestion : 3

Using the s argument, you can set the size of your markers in points squared. If you want a marker 10 points high, choose s=100. Using the marker argument and the right character code, you can choose whichever style you like; here are a few of the common ones. There is no way to specify multiple marker styles in a single scatter() call, but you can separate the data into groups and plot each marker style separately; here the data is split into three equal groups. The real power of the scatter() function comes out when you want to modify markers individually.

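import matplotlib.pyplot as plt
import numpy as np

# Illustrative data (an assumption, not part of the original snippet): 90 random points
x, y = np.random.normal(size = (2, 90))
fig, ax = plt.subplots()

ax.scatter(x, y, s = 80)  # one size for every marker
# One size per marker; the factor 100 is an assumption so the markers stay visible
sizes = 100 * np.random.sample(size = x.size)
ax.scatter(x, y, s = sizes)
ax.scatter(x, y, marker = "v")  # one marker style for every point
# Three equal groups, one marker style each
(x1, x2, x3), (y1, y2, y3) = np.split(x, 3), np.split(y, 3)
ax.scatter(x1, y1, marker = "o")
ax.scatter(x2, y2, marker = "x")
ax.scatter(x3, y3, marker = "s")
ax.scatter(x, y, c = "orange")  # one color for every point
ax.scatter(x, y, c = x - y)  # color mapped from a per-point value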
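
Because s is an area in points squared, converting between a marker width in points and an s value is a matter of squaring or taking a square root. The two helpers below are a small sketch of that arithmetic; their names are hypothetical.

```python
def scatter_size(width_pt):
    """Convert a marker width in points to the `s` value used by scatter()."""
    return width_pt ** 2


def marker_width(s):
    """Convert a scatter() `s` value back to a marker width in points."""
    return s ** 0.5


s_for_10pt = scatter_size(10)      # 100, matching the example in the text
width_for_100 = marker_width(100)  # 10.0
```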

Suggestion : 4

Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape. We'll start by setting up the notebook for plotting and importing the functions we will use.

For even more possibilities, these character codes can be used together with line and color codes to plot points along with a line connecting them.

A second, more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function.

Let's show this by creating a random scatter plot with points of many colors and sizes. In order to better see the overlapping results, we'll also use the alpha keyword to adjust the transparency level.

%matplotlib inline
import matplotlib.pyplot as plt
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import numpy as np
x = np.linspace(0, 10, 30)
y = np.sin(x)

plt.plot(x, y, 'o', color = 'black');
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
   plt.plot(rng.rand(5), rng.rand(5), marker,
      label = "marker='{0}'".format(marker))
plt.legend(numpoints = 1)
plt.xlim(0, 1.8);
plt.plot(x, y, '-ok');
plt.plot(x, y, '-p', color = 'gray',
   markersize = 15, linewidth = 4,
   markerfacecolor = 'white',
   markeredgecolor = 'gray',
   markeredgewidth = 2)
plt.ylim(-1.2, 1.2);
plt.scatter(x, y, marker = 'o');
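
The text describes a random scatter plot with many colors and sizes and an alpha value for transparency, but the code above stops at a basic plt.scatter call. A minimal sketch of such a plot follows. The specific numbers (100 points, areas up to 1000, alpha of 0.3) and the viridis colormap are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)          # one value per point, mapped through the colormap
sizes = 1000 * rng.rand(100)    # one area per point, in points squared

# alpha < 1 makes the markers translucent, so overlapping regions look darker
points = plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar(points)            # show the color scale
```

In a script, finish with plt.show() to display the figure.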

Suggestion : 5

At this point, any plt plot command will cause a figure window to open, and further commands can be run to update the plot. Some changes (such as modifying properties of lines that are already drawn) will not draw automatically; to force an update, use plt.draw(). Using plt.show() in Matplotlib mode is not required.

The simplest legend can be created with the plt.legend() command, which automatically creates a legend for any labeled plot elements (Figure 4-41).

Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as the shorthand used in plt.plot, outlined in "Simple Line Plots" and "Simple Scatter Plots".

It's important to note that this interface is stateful: it keeps track of the "current" figure and axes, which are where all plt commands are applied. You can get a reference to these using the plt.gcf() (get current figure) and plt.gca() (get current axes) routines.
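
As a small illustration of the stateful interface, the sketch below creates a figure explicitly and then checks that plt.gcf() and plt.gca() return those same objects. The data and the variable names same_figure and same_axes are made up for the example.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()               # creates a figure and makes it "current"
ax.plot([0, 1, 2], [0, 1, 4], label="squares")

# The stateful interface refers to the same objects
same_figure = plt.gcf() is fig
same_axes = plt.gca() is ax

plt.legend()                           # acts on the current axes, which is ax here
legend = ax.get_legend()               # the legend that plt.legend() just created
```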

Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:

In[1]: import matplotlib as mpl
import matplotlib.pyplot as plt

We will use the plt.style directive to choose appropriate aesthetic styles for our figures. Here we will set the classic style, which ensures that the plots we create use the classic Matplotlib style:

In[2]: plt.style.use('classic')

So, for example, you may have a file called myplot.py containing the following:

# ------- file: myplot.py ------
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))

plt.show()

You can then run this script from the command-line prompt, which will result in a window opening with your figure displayed:

$ python myplot.py

It can be very convenient to use Matplotlib interactively within an IPython shell (see Chapter 1). IPython is built to work well with Matplotlib if you specify Matplotlib mode. To enable this mode, you can use the %matplotlib magic command after starting ipython:

In[1]: %matplotlib
Using matplotlib backend: TkAgg

In[2]: import matplotlib.pyplot as plt

For this book, we will generally opt for %matplotlib inline:

In[3]: %matplotlib inline

After you run this command (it needs to be done only once per kernel/session), any cell within the notebook that creates a plot will embed a PNG image of the resulting graphic (Figure 4-1):

In[4]: import numpy as np
x = np.linspace(0, 10, 100)

fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');

One nice feature of Matplotlib is the ability to save figures in a wide variety of formats. You can save a figure using the savefig() command. For example, to save the previous figure as a PNG file, you can run this:

In[5]: fig.savefig('my_figure.png')

We now have a file called my_figure.png in the current working directory:

In[6]: !ls -lh my_figure.png
-rw-r--r--  1 jakevdp  staff    16K Aug 11 10:59 my_figure.png

In savefig(), the file format is inferred from the extension of the given filename. Depending on what backends you have installed, many different file formats are available. You can find the list of supported file types for your system by using the following method of the figure canvas object:

In[8]: fig.canvas.get_supported_filetypes()
Out[8]: {
   'eps': 'Encapsulated Postscript',
   'jpeg': 'Joint Photographic Experts Group',
   'jpg': 'Joint Photographic Experts Group',
   'pdf': 'Portable Document Format',
   'pgf': 'PGF code for LaTeX',
   'png': 'Portable Network Graphics',
   'ps': 'Postscript',
   'raw': 'Raw RGBA bitmap',
   'rgba': 'Raw RGBA bitmap',
   'svg': 'Scalable Vector Graphics',
   'svgz': 'Scalable Vector Graphics',
   'tif': 'Tagged Image File Format',
   'tiff': 'Tagged Image File Format'
}
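
Since savefig() infers the format from the file extension, one way to use the supported-types dictionary is to check an extension before saving. The sketch below assumes a throwaway figure and a temporary output directory. The candidate extensions and the variable names are arbitrary choices for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 0])

supported = fig.canvas.get_supported_filetypes()   # maps extension -> description
out_dir = tempfile.mkdtemp()                       # temporary directory for the example

saved = []
for ext in ("png", "pdf", "svg"):
    if ext in supported:                           # skip formats this backend cannot write
        path = os.path.join(out_dir, "my_figure." + ext)
        fig.savefig(path)                          # format inferred from the extension
        saved.append(path)
```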

Suggestion : 6

Other Matplotlib functions do not define marker size in this way: plt.scatter() takes s as the marker area in points squared, whereas plt.plot() takes markersize as the marker width in points. To set the best marker size for a scatter plot, draw it a few times with different s values.

The following code shows a minimal example of creating a scatter plot in Python.

import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 4, 8, 16, 32]

plt.plot(x, y, 'o')
plt.show()

First, let’s import the modules we’ll be using and load the dataset.

import matplotlib.pyplot as plt
import seaborn as sns

# Optional step
# Seaborn's default settings look much nicer than Matplotlib's
sns.set()

tips_df = sns.load_dataset('tips')

total_bill = tips_df.total_bill.to_numpy()
tip = tips_df.tip.to_numpy()

Let’s make a scatter plot of total_bill against tip. It’s very easy to do in matplotlib – use the plt.scatter() function. First, we pass the x-axis variable, then the y-axis one. We call the former the independent variable and the latter the dependent variable. A scatter graph shows what happens to the dependent variable (y) when we change the independent variable (x). 

plt.scatter(total_bill, tip)
plt.show()

To set the best marker size for a scatter plot, draw it a few times with different s values. 

# Small s
plt.scatter(total_bill, tip, s = 1)
plt.show()

A small number makes each marker small. Setting s=1 is too small for this plot and makes it hard to read. For some plots with a lot of data, setting s to a very small number makes it much easier to read. 

# Big s
plt.scatter(total_bill, tip, s = 100)
plt.show()
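
Rather than redrawing the plot once per value, the candidate sizes can be compared side by side. The sketch below is one way to do that; it uses synthetic data as a stand-in for total_bill and tip so that it runs without downloading the tips dataset, and the three s values are arbitrary choices.

```python
import random

import matplotlib.pyplot as plt

# Synthetic stand-in for total_bill and tip (an assumption, not the real dataset)
rng = random.Random(0)
x = [rng.uniform(3, 50) for _ in range(200)]
y = [0.15 * v + rng.gauss(0, 1) for v in x]

sizes = [1, 20, 100]
fig, axes = plt.subplots(1, len(sizes), figsize=(12, 4), sharex=True, sharey=True)
for ax, s in zip(axes, sizes):
    ax.scatter(x, y, s=s)                 # same data, different marker area
    ax.set_title("s = {}".format(s))
fig.tight_layout()
```

In a script, call plt.show() afterwards. With the real data, pass total_bill and tip in place of x and y.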