You can use the occurrence frequency of the x-values (or the y-values, for this particular data set), which you can obtain with the Counter class from the collections module. The frequencies then serve as a scaling factor for the marker sizes. The factor 200 is arbitrary; it just makes the size differences easy to see.
from collections import Counter
from matplotlib import pyplot as plt

a = [1, 1, 1, 1, 2, 2]
b = [2, 2, 2, 2, 1, 1]

# Scale each marker by how often its x-value occurs
counts = Counter(a)
weights = [200 * counts[x] for x in a]

plt.scatter(a, b, s=weights)
plt.show()
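The example above counts x-values only, which is enough here because every x is paired with a single y. A more general sketch, assuming you want one marker per distinct (x, y) pair, counts the pairs themselves. The helper names `marker_sizes` and `plot_counts` and the `scale` parameter are illustrative and not part of any library.

```python
from collections import Counter


def marker_sizes(xs, ys, scale=200):
    """Return the distinct (x, y) points and a marker size proportional to each count."""
    counts = Counter(zip(xs, ys))          # {(x, y): number of occurrences}
    px = [x for x, _ in counts]
    py = [y for _, y in counts]
    sizes = [scale * n for n in counts.values()]
    return px, py, sizes


def plot_counts(xs, ys, scale=200):
    """Draw one marker per distinct point, sized by how often it occurs."""
    import matplotlib.pyplot as plt        # imported lazily; only needed for drawing

    px, py, sizes = marker_sizes(xs, ys, scale)
    plt.scatter(px, py, s=sizes)
    plt.show()


a = [1, 1, 1, 1, 2, 2]
b = [2, 2, 2, 2, 1, 1]
px, py, sizes = marker_sizes(a, b)
print(px, py, sizes)  # [1, 2] [2, 1] [800, 400]
```

Calling `plot_counts(a, b)` would then draw two markers, the first twice the area of the second.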
Another option for visualising the distribution is a bar chart:
freqs = Counter(a)
plt.bar(freqs.keys(), freqs.values(), width=0.5)
plt.xticks(list(freqs.keys()))
plt.show()
You can try decreasing the marker size in your plot, so that the markers overlap less and the patterns become clearer.

Lastly, you can draw a marginal plot with seaborn to avoid overlapping in your graph.

You can use jitter when you have overlapping points; it makes the distribution easier to see. Seaborn has a function, stripplot(), that you can use for this purpose.

A density graph is a good alternative to an overplotted scatterplot, because it shows the relationship between the variables more clearly.
# Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Dataset: three Gaussian groups of 20,000 points each
df = pd.DataFrame({
    'x': np.random.normal(10, 1.2, 20000),
    'y': np.random.normal(10, 1.2, 20000),
    'group': np.repeat('A', 20000)
})
tmp1 = pd.DataFrame({
    'x': np.random.normal(14.5, 1.2, 20000),
    'y': np.random.normal(14.5, 1.2, 20000),
    'group': np.repeat('B', 20000)
})
tmp2 = pd.DataFrame({
    'x': np.random.normal(9.5, 1.5, 20000),
    'y': np.random.normal(15.5, 1.5, 20000),
    'group': np.repeat('C', 20000)
})
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.concat([df, tmp1, tmp2])

# Plot
plt.plot('x', 'y', "", data=df, linestyle='', marker='o')
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting looks like that:', loc='left')
plt.show()
# Libraries and dataset: same as in the first example (df holds groups A, B and C)

# Plot with small marker size
plt.plot('x', 'y', "", data=df, linestyle='', marker='o', markersize=0.7)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to reduce the dot size', loc='left')
plt.show()
# Libraries and dataset: same as in the first example (df holds groups A, B and C)

# Plot with transparency
plt.plot('x', 'y', "", data=df, linestyle='', marker='o', markersize=3, alpha=0.05, color="purple")

# Titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Try to use transparency', loc='left')
plt.show()
# Libraries and dataset: same as in the first example (df holds groups A, B and C)

# 2D density plot (fill=True replaces the deprecated shade=True)
sns.kdeplot(data=df, x="x", y="y", cmap="Reds", fill=True)
plt.title('Overplotting? Try 2D density graph', loc='left')
plt.show()
# Libraries and dataset: same as in the first example (df holds groups A, B and C)

# Sample 1000 random rows
df_sample = df.sample(1000)

# Make the plot with this subset
plt.plot('x', 'y', "", data=df_sample, linestyle='', marker='o')

# Titles
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Sample your data', loc='left')
plt.show()
# Libraries and dataset: same as in the first example (df holds groups A, B and C)

# Keep only the group to highlight
df_filtered = df[df['group'] == 'A']

# Plot the whole dataset in grey
plt.plot('x', 'y', "", data=df, linestyle='', marker='o', markersize=1.5, color="grey", alpha=0.3, label='other group')

# Add the group to study
plt.plot('x', 'y', "", data=df_filtered, linestyle='', marker='o', markersize=1.5, alpha=0.3, label='group A')

# Add titles and legend
plt.legend(markerscale=8)
plt.xlabel('Value of X')
plt.ylabel('Value of Y')
plt.title('Overplotting? Show a specific group', loc='left')
plt.show()
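Jitter, mentioned earlier as a remedy for overlapping points, can also be applied by hand before plotting. The following is a minimal sketch rather than what seaborn's stripplot() does internally; the `jitter` helper, its `width` and `seed` parameters, and the toy data are assumptions made for illustration.

```python
import random


def jitter(values, width=0.2, seed=0):
    """Return a copy of `values` with uniform noise in [-width, +width] added."""
    rng = random.Random(seed)              # seeded so the result is reproducible
    return [v + rng.uniform(-width, width) for v in values]


# Many identical points that would otherwise sit exactly on top of each other.
x = [1] * 50 + [2] * 50
y = [3] * 50 + [4] * 50

# Different seeds so the x and y offsets are not identical.
xj = jitter(x, seed=1)
yj = jitter(y, seed=2)
```

Passing `xj` and `yj` to `plt.scatter(xj, yj, alpha=0.5)` spreads each pile of points into a small cloud. Keep `width` well below the spacing between distinct values so the groups stay separable.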
Using the s argument, you can set the size of your markers, in points squared. If you want a marker 10 points high, choose s=100.

Using the marker argument and the right character code, you can choose whichever style you like. Here are a few of the common ones.

There's no way to specify multiple marker styles in a single scatter() call, but we can separate our data out into groups and plot each marker style separately. Here we chopped our data up into three equal groups.

The real power of the scatter() function comes out when we want to modify markers individually.
ax.scatter(x, y, s = 80)
sizes = np.random.sample(size=x.size)
ax.scatter(x, y, s=sizes)
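Because s is measured in points squared, converting between a desired marker height and s is a matter of squaring or taking a square root. The two helpers below are hypothetical names introduced only to make that arithmetic explicit.

```python
import math


def s_for_height(height_pt):
    """Value of `s` that gives a marker roughly `height_pt` points high."""
    return height_pt ** 2


def height_for_s(s):
    """Approximate marker height in points for a given `s`."""
    return math.sqrt(s)


print(s_for_height(10))            # 100
print(round(height_for_s(80), 2))  # 8.94
```

So the s=80 used above corresponds to markers a little under 9 points high.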
ax.scatter(x, y, marker = "v")
ax.scatter(x1, y1, marker = "o")
ax.scatter(x2, y2, marker = "x")
ax.scatter(x3, y3, marker = "s")
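The variables x1 to x3 and y1 to y3 above are the three groups the data was chopped into. One way to produce such groups is sketched below; `split_into_groups` is an illustrative helper, not a Matplotlib function, and it assumes plain Python lists.

```python
def split_into_groups(xs, ys, n_groups=3):
    """Split paired data into at most n_groups contiguous chunks of near-equal length."""
    size = -(-len(xs) // n_groups)         # ceiling division
    return [(xs[i:i + size], ys[i:i + size]) for i in range(0, len(xs), size)]


x = list(range(9))
y = [v * v for v in x]

groups = split_into_groups(x, y)
markers = ["o", "x", "s"]

# Each group gets its own marker style; here we only print the pairing.
for (gx, gy), marker in zip(groups, markers):
    print(marker, gx, gy)
```

In a real plot, the loop body would be `ax.scatter(gx, gy, marker=marker)`, one call per group.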
ax.scatter(x, y, c = "orange")
ax.scatter(x, y, c = x - y)
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape. We'll start by setting up the notebook for plotting and importing the functions we will use.

For even more possibilities, these character codes can be used together with line and color codes to plot points along with a line connecting them.

A second, more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function.

Let's show this by creating a random scatter plot with points of many colors and sizes. In order to better see the overlapping results, we'll also use the alpha keyword to adjust the transparency level.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
import numpy as np
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color = 'black');
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
plt.plot(x, y, '-ok');
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2);
plt.scatter(x, y, marker = 'o');
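The random scatter plot with per-point colors, sizes, and transparency described earlier could look roughly like the sketch below. The function names, the sample size, the size range, and the `viridis` colormap are assumptions chosen for illustration; the data is generated with the standard library so that only the drawing step needs Matplotlib.

```python
import random


def random_scatter_data(n=100, seed=0):
    """Generate n random points, each with a color value and a marker size."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    colors = [rng.random() for _ in range(n)]          # mapped through a colormap
    sizes = [1000 * rng.random() for _ in range(n)]    # marker size in points squared
    return x, y, colors, sizes


def draw_random_scatter(n=100, seed=0):
    """Plot the data with per-point color and size; alpha keeps overlaps visible."""
    import matplotlib.pyplot as plt   # imported lazily; only needed for drawing

    x, y, colors, sizes = random_scatter_data(n, seed)
    plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap="viridis")
    plt.colorbar()                    # show the color scale
    plt.show()


x, y, colors, sizes = random_scatter_data()
```

Calling `draw_random_scatter()` renders the figure. Unlike plt.plot, plt.scatter accepts a separate color and size for every point, which is what the c and s arguments carry here.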
At this point, any plt plot command will cause a figure window to open, and further commands can be run to update the plot. Some changes (such as modifying properties of lines that are already drawn) will not draw automatically; to force an update, use plt.draw(). Using plt.show() in Matplotlib mode is not required.

The simplest legend can be created with the plt.legend() command, which automatically creates a legend for any labeled plot elements (Figure 4-41).

Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as the shorthand used in plt.plot, outlined in “Simple Line Plots” and “Simple Scatter Plots”.

It’s important to note that this interface is stateful: it keeps track of the “current” figure and axes, which are where all plt commands are applied. You can get a reference to these using the plt.gcf() (get current figure) and plt.gca() (get current axes) routines.
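A short sketch of the stateful behaviour: after creating a figure and axes, plt.gcf() and plt.gca() return those same objects, and plt.legend() acts on the current axes. The non-interactive Agg backend is selected here only so the snippet runs without opening a window; that choice and the sample data are assumptions.

```python
import matplotlib
matplotlib.use("Agg")                 # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label="squares")

# pyplot tracks the most recently created figure and axes as "current".
same_fig = plt.gcf() is fig
same_ax = plt.gca() is ax

# plt.legend() applies to the current axes and picks up the labeled line.
legend = plt.legend()
labels = [text.get_text() for text in legend.get_texts()]

print(same_fig, same_ax, labels)  # True True ['squares']
```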
So, for example, you may have a file called myplot.py containing the following:
# ------- file: myplot.py ------
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
You can then run this script from the command-line prompt, which will result in a window opening with your figure displayed:
$ python myplot.py
It can be very convenient to use Matplotlib interactively within an IPython shell (see Chapter 1). IPython is built to work well with Matplotlib if you specify Matplotlib mode. To enable this mode, you can use the %matplotlib magic command after starting ipython:
In[1]: %matplotlib
Using matplotlib backend: TkAgg
In[2]: import matplotlib.pyplot as plt
For this book, we will generally opt for %matplotlib inline:
In[3]: %matplotlib inline
After you run this command (it needs to be done only once per kernel/session), any cell within the notebook that creates a plot will embed a PNG image of the resulting graphic (Figure 4-1):
In[4]: import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');
One nice feature of Matplotlib is the ability to save figures in a wide
variety of formats. You can save a figure using the savefig()
command. For example, to save the previous figure as a PNG file, you can
run this:
In[5]: fig.savefig('my_figure.png')
We now have a file called my_figure.png in the current working directory:
In[6]: !ls -lh my_figure.png
-rw-r--r--  1 jakevdp  staff    16K Aug 11 10:59 my_figure.png
In savefig(), the file format is inferred from the extension of the given filename. Depending on what backends you have installed, many different file formats are available. You can find the list of supported file types for your system by using the following method of the figure canvas object:
In[8]: fig.canvas.get_supported_filetypes()
Out[8]: {
'eps': 'Encapsulated Postscript',
'jpeg': 'Joint Photographic Experts Group',
'jpg': 'Joint Photographic Experts Group',
'pdf': 'Portable Document Format',
'pgf': 'PGF code for LaTeX',
'png': 'Portable Network Graphics',
'ps': 'Postscript',
'raw': 'Raw RGBA bitmap',
'rgba': 'Raw RGBA bitmap',
'svg': 'Scalable Vector Graphics',
'svgz': 'Scalable Vector Graphics',
'tif': 'Tagged Image File Format',
'tiff': 'Tagged Image File Format'
}
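To see the extension-based format selection in action, the sketch below saves one figure under several names and lets savefig() pick the format from each extension. The temporary directory, the chosen extensions, and the Agg backend are assumptions made so the example runs anywhere without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                 # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 0])

# Mapping of extension -> description, as in the listing above.
supported = fig.canvas.get_supported_filetypes()

out_dir = tempfile.mkdtemp()
saved = []
for ext in ("png", "pdf", "svg"):
    if ext in supported:
        path = os.path.join(out_dir, "my_figure." + ext)
        fig.savefig(path)             # format inferred from the file extension
        saved.append(path)

print([os.path.basename(p) for p in saved])
```

Checking `supported` before saving avoids an error on systems where a given format is unavailable.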
plt.scatter() defines marker size differently from most other Matplotlib functions. Its s argument is an area in points squared, whereas the markersize argument of plt.plot() is a width in points. A marker 5 points wide is therefore markersize=5 in plt.plot() but s=25 in plt.scatter().
The following code shows a minimal example of creating a scatter plot in Python.
import matplotlib.pyplot as plt
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 4, 8, 16, 32]
plt.plot(x, y, 'o')
plt.show()
First, let’s import the modules we’ll be using and load the dataset.
import matplotlib.pyplot as plt
import seaborn as sns

# Optional step: seaborn's default settings look much nicer than matplotlib's
sns.set()

tips_df = sns.load_dataset('tips')
total_bill = tips_df.total_bill.to_numpy()
tip = tips_df.tip.to_numpy()
Let's make a scatter plot of tip against total_bill. It's very easy to do in matplotlib: use the plt.scatter() function. First, we pass the x-axis variable, then the y-axis one. We call the former the independent variable and the latter the dependent variable. A scatter graph shows what happens to the dependent variable (y) when we change the independent variable (x).
plt.scatter(total_bill, tip)
plt.show()
To set the best marker size for a scatter plot, draw it a few times with different s values.
# Small s
plt.scatter(total_bill, tip, s=1)
plt.show()
A small number makes each marker small. Setting s=1 is too small for this plot and makes it hard to read. For some plots with a lot of data, setting s to a very small number makes it much easier to read.
# Big s
plt.scatter(total_bill, tip, s=100)
plt.show()