Advanced visualization tutorial with Seaborn

In this visualization tutorial you will learn how to use the Seaborn library to create advanced plots. If you are familiar with basic line, scatter and histogram plots, then this tutorial could help upgrade your visualization skills.

What is Datalore by JetBrains?

Datalore is an online Jupyter notebook with smart coding assistance hosted by JetBrains. Datalore comes with an intelligent code editor and powerful computation machines. Clean up data, produce interactive visualizations, and create machine learning and deep learning models faster.

Click the "Edit" button in the upper-right corner to run this notebook in Datalore.

Learn more about Datalore on datalore.jetbrains.com.

About the Dataset

In Datalore we've manually assembled a dataset of different GPUs and their technical details. Plotting with the seaborn library will help us explore the data and learn how this type of hardware has improved over time.

Dataset overview

The dataset contains 495 GPU models. It includes the following parameters:

GPU manufacturer
GPU class
Name
Year of release
Fab = fabrication process (nm). It defines the size of transistors in a processor. Transistors are the building blocks of any integrated circuit, GPUs and CPUs included.
Number of transistors (millions)
Die size
Memory size in megabytes
GFLOPS = billions of floating-point operations per second
TDP (thermal design power) = the maximum amount of heat generated by a computer chip or component

import urllib.request
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

urllib.request.urlretrieve('https://datalore-samples.s3-eu-west-1.amazonaws.com/tutorials/gpus.csv',
                           'gpus.csv')


df = pd.read_csv('gpus.csv')
df

Scatter plot, line plot and reg plot

Let's start by examining Moore's law, which says that the number of transistors in an integrated circuit doubles about every 2 years. Let's see whether it applies to GPUs as well!
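To make the doubling claim concrete, here is a minimal sketch in plain Python. The helper names moore_projection and doubling_time are illustrative only (they are not part of seaborn or the dataset), and the transistor counts are made-up round numbers.

```python
import math


def moore_projection(n0, years, doubling_period=2.0):
    """Transistor count after `years` if it doubles every `doubling_period` years."""
    return n0 * 2 ** (years / doubling_period)


def doubling_time(n_start, n_end, years):
    """Doubling period (in years) implied by growth from n_start to n_end over `years`."""
    return years / math.log2(n_end / n_start)


# Hypothetical chip with 1,000 mln transistors, projected 4 years ahead
print(moore_projection(1000, 4))      # 4000.0
# Hypothetical growth from 1,000 to 8,000 mln transistors over 9 years
print(doubling_time(1000, 8000, 9))   # 3.0
```

The second helper is handy for reading the plots: pick two points on a trend line and it returns the doubling period they imply.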

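On their own, sns.scatterplot() and sns.lineplot() look quite basic, but they become more useful when you combine them in one figure. We'll use a scatter plot to display the number of transistors in GPUs released from 2006 to 2021, and a line plot to trace the maximum transistor count for each year.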

To distinguish the different Fab categories, use the palette parameter of sns.scatterplot() with the predefined color set "Set1".

plt.figure(figsize=(15, 10))
ax = sns.scatterplot(x='Year',
                     y='Transistors (mln)',
                     hue='Fab',
                     legend='full',
                     data=df,
                     palette=sns.color_palette("Set1", n_colors=len(df.Fab.unique())))
max_transistors_per_year = df.groupby('Year')['Transistors (mln)'].max()
sns.lineplot(data=max_transistors_per_year,
             ax=ax.axes,
             color='black')
ax.set_xlim(2006, 2021)
plt.show()

Data insight #1

As we can see from the plot above, Moore's law holds for 2019-2020, but from 2006 to 2018 the growth was almost linear.

However, Moore's law almost holds for transistor density. In the plot below, the density doubles roughly every 3 years.

To display a smoothed trend for the transistor density we'll use sns.regplot(). Use the order parameter to estimate a polynomial regression.

plt.figure(figsize=(15, 10))
df['Transistors/mm2'] = df['Transistors (mln)'] / df['Die size']
ax = sns.scatterplot(x='Year',
                     y='Transistors/mm2',
                     hue='Fab',
                     legend='full',
                     data=df,
                     palette=sns.color_palette("Set1", n_colors=len(df.Fab.unique())))
ax = sns.regplot(x='Year',
                 y='Transistors/mm2',
                 data=df,
                 scatter=False,
                 ax=ax.axes,
                 order=4)
ax.set_xlim(2006, 2021)
ax.set_ylim(0, 70)
plt.show()
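According to the seaborn documentation, sns.regplot() with order greater than 1 estimates the polynomial with numpy.polyfit. The sketch below reproduces that fitting step on synthetic data; the quadratic trend and its coefficients are invented for illustration and are not taken from the GPU dataset.

```python
import numpy as np

# Synthetic data: an exact quadratic trend, y = 0.5*t^2 + 2*t + 1,
# where t is the number of years since 2006 (values invented for illustration)
years = np.arange(2006, 2022)
t = years - 2006
y = 0.5 * t**2 + 2.0 * t + 1.0

# Fit a polynomial of degree 2, analogous to order=2 in sns.regplot()
coeffs = np.polyfit(t, y, deg=2)   # coefficients, highest power first
trend = np.polyval(coeffs, t)      # fitted curve evaluated at each year

print(np.round(coeffs, 3))
```

A higher order follows the data more closely but can swing wildly near the edges of the year range, so it is worth comparing a few values before settling on one.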

Tip #1

If you forget about any of the method parameters, you can access the documentation directly in Datalore.

Use:

Hold Ctrl/Command while hovering over a method for brief documentation
F1 for full documentation

Displot

Another interesting topic is GPU power consumption and heating.

This dataset splits GPUs into three groups:

Consumer GPUs, e.g. Nvidia GeForce and AMD Radeon R series
Workstation GPUs such as Nvidia Quadro and AMD Radeon FirePro
Server GPUs, e.g. Nvidia P/V/T series and AMD Radeon Instinct

We will use sns.displot() to display the distribution of TDP (thermal design power) for the 3 GPU groups. With kde=True, this method shows both the binned distribution (a histogram) and a smoothed approximation line.

Passing col="Class" to sns.displot() draws one panel per GPU class. The panels share the x-axis by default, so the TDP scale is directly comparable across the 3 plots.

sns.despine() removes the top and right spines from a plot. If you add the left=True parameter, the left spine is removed as well.

Note: Do not use sns.distplot(); it has been deprecated since seaborn 0.11 in favor of sns.displot() and sns.histplot().

sns.set(style="white", palette="muted", color_codes=True)
sns.displot(data=df, x="TDP", hue="Class", col="Class", kde=True)
plt.show()

Data insight #2

While server GPUs are hotter and more powerful on average than consumer and workstation ones, it's interesting that there are a lot of low-power workstation GPUs. These are intended not for 3D graphics or parallel computing, but for managing display panels and video walls.

Violin plot

sns.violinplot() helps to compare distributions across several variables. In the next plot we'll see how the TDP distribution changes with the fabrication process for two manufacturers (Nvidia and AMD Radeon).

Violin settings:

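split=True draws one half-violin per manufacturer and joins the two halves into a single violin
bw is the scale factor to use when computing the kernel bandwidth
cut is the distance, in units of bandwidth, by which each violin extends past the extreme data points (cut=0 trims the violin to the observed range)
linewidth defines the contour line thickness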

Feel free to play with these parameters to understand in detail what they do.

sns.set()
_, ax = plt.subplots(figsize=(10, 7))
sns.violinplot(x='Fab', 
               y='TDP', 
               hue='Manufacturer', 
               data=df, 
               split=True, 
               bw=.5, 
               cut=0.3, 
               linewidth=1,
               palette=sns.color_palette(['green', 'red']))
ax.set(ylim=(0, 700))
plt.show()

Data insight #3

While there's no clear correlation between manufacturer and chip TDP, it's interesting to see that the same can be said of fabrication process and TDP.

Although the power consumption of individual transistors drops as they become smaller, the growing number of transistors on a chip compensates for this effect.

Swarm plot

We could split the same data across GPU classes instead of manufacturers.

Use sns.swarmplot() to draw a categorical scatterplot with non-overlapping points.

_, ax = plt.subplots(figsize=(21, 9))
sns.swarmplot(x='Fab', y='TDP', hue='Class', data=df, size=4)
plt.show()

Strip plot and point plot

sns.stripplot() draws a scatter plot where one variable is categorical. dodge=True separates the points of different hue levels (here, manufacturers).

sns.pointplot() shows an estimate of central tendency for the distribution.

Let's display the TDP distribution across classes and manufacturers, regardless of fabrication process.

_, ax = plt.subplots(figsize=(10, 7))
sns.despine(bottom=True, left=True)

sns.stripplot(x="TDP", y="Class", hue="Manufacturer",
              data=df, dodge=True, alpha=.55, zorder=1)

sns.pointplot(x="TDP", y="Class", hue="Manufacturer",
              data=df, dodge=.5, join=False, palette="dark",
              markers="d", scale=.75, ci=None)

handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[2:], labels[2:], title="Manufacturer",
          handletextpad=0, columnspacing=1,
          loc="best", ncol=2, frameon=True)

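By default, sns.pointplot() uses the mean as its estimate of central tendency. The markers in the plot above can therefore be cross-checked with a plain pandas group-by. The sketch below does this on a tiny invented table that reuses the dataset's column names; the TDP values are made up.

```python
import pandas as pd

# Tiny invented table with the same column names as the GPU dataset
demo = pd.DataFrame({
    "Class": ["Consumer", "Consumer", "Server", "Server", "Server"],
    "Manufacturer": ["AMD", "Nvidia", "AMD", "Nvidia", "Nvidia"],
    "TDP": [150, 250, 300, 250, 350],
})

# Mean TDP for every (Class, Manufacturer) pair: the values a point plot
# marks when it uses its default estimator, the mean
mean_tdp = demo.groupby(["Class", "Manufacturer"])["TDP"].mean()
print(mean_tdp)
```

Running the same group-by on the tutorial's df gives the exact numbers behind each diamond marker.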

Lm plot

sns.lmplot() plots the data and its regression model fit in one figure. Set the col parameter to separate the visualizations by class. col_wrap defines how many columns are displayed in one row.

sns.lmplot(x='Die size', y='TDP', data=df, hue='Class', col='Class', col_wrap=3)
plt.tight_layout()

Data insight #4

From the plots above we can see a roughly linear relationship between TDP and the physical amount of silicon, measured here by die size.
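A visual impression of linearity can be backed by a number, the Pearson correlation coefficient. The sketch below computes it with NumPy on synthetic data; the slope, offset and noise level are assumptions chosen only to mimic a linear trend, not values from the GPU dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in: TDP grows linearly with die size plus noise (invented numbers)
die_size = rng.uniform(100, 800, size=200)
tdp = 0.4 * die_size + 30 + rng.normal(0, 20, size=200)

# Pearson correlation coefficient: values near +1 indicate a strong
# increasing linear relationship
r = np.corrcoef(die_size, tdp)[0, 1]
print(f"Pearson r = {r:.2f}")
```

On the tutorial's dataframe, the equivalent check would be df['Die size'].corr(df['TDP']), optionally computed per class after a group-by.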

Reg plot

A combination of sns.lmplot() with fit_reg=False and sns.regplot() produces a scatter plot with one regression approximation line.

When we use hue="Fab" separation and fit_reg=True in lmplot, it produces a regression line for every Fab category, but in our case we need a single overall approximation line.

ax = sns.lmplot(x='Year', y='GFLOPS', data=df, hue='Fab', fit_reg=False)
ax.axes[0, 0].set_xlim((2006, 2021))
sns.regplot(x='Year', y='GFLOPS', data=df, scatter=False, ax=ax.axes[0, 0], order=3)
plt.show()

Let's compute an additional feature: the number of GFLOPS per million transistors:

df['GFLOPS/transistors'] = df['GFLOPS'] / df['Transistors (mln)']
df

Pair grid

sns.PairGrid() is used to display pairwise relationships using different kinds of plots.

In the image below you can see 3 types of plots: a scatter plot and 2 KDE plots with different settings. You can change the plot types in sns.PairGrid() with ax.map_upper(), ax.map_lower() and ax.map_diag().

ax = sns.PairGrid(df[['Year', 'Transistors/mm2', 'GFLOPS/transistors', 'TDP']], 
                 diag_sharey=False)
ax.map_upper(sns.scatterplot)
ax.map_lower(sns.kdeplot, colors="C0")
ax.map_diag(sns.kdeplot, lw=1, fill=True)
plt.show()

Data insight #5

The spike around 250 W TDP is caused by the popularity of this value among both AMD's and Nvidia's top-tier chips. However, Nvidia RTX 30 GPUs run much hotter than their predecessors: the RTX 3080 and 3090 come with 320 W and 350 W TDP respectively.

Data insight #6

While Nvidia chips are generally bigger than AMD's, in 2020 the gap between their chip sizes became bigger than ever when the new 54-billion-transistor A100 chip came out. Interestingly, AMD's smaller chips deliver more or less the same GFLOPS and TDP as their larger Nvidia counterparts.

Cat plot

sns.catplot() displays the relationship between categorical and numerical features.

The height parameter is the height of each facet, and aspect × height gives the width of each facet. kind="point" draws point estimates together with their confidence intervals.

ax = sns.catplot(x="Year", y="Transistors (mln)", hue="Manufacturer", height=6, aspect=2,
                 capsize=.2, kind="point", data=df)
ax.despine(left=True)
plt.show()
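The height and aspect arithmetic can be verified by asking matplotlib for the resulting figure size. This is a sketch on a tiny invented table; hue is left out so that the figure consists of a single bare facet, and the "Agg" backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Tiny invented table; only the figure geometry matters here
demo = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Transistors (mln)": [10000, 12000, 20000, 28000],
})

height, aspect = 6, 2
g = sns.catplot(x="Year", y="Transistors (mln)", kind="point",
                height=height, aspect=aspect, data=demo)

# Figure size in inches: width should equal aspect * height for a single facet
width_in, height_in = g.axes.flat[0].figure.get_size_inches()
print(width_in, height_in)
```

With several facets, the same rule applies to each facet, so the total figure width grows with the number of columns.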

sns.FacetGrid() is used to display conditional relationships. In the plot below we display histograms of die size across classes and manufacturers.

ax = sns.FacetGrid(df, col="Manufacturer", row="Class", margin_titles=True)
ax.map(plt.hist, "Die size")
plt.show()

Data insight #7

Another interesting observation from the plot below concerns the aforementioned A100. It has the same GFLOPS as the other top-tier GPUs of 2019 and 2020, despite having ~2x more transistors. It seems this overhead is caused by the novel Multi-Instance GPU technology, which allows the A100 to be partitioned into seven GPU instances.

Let's use plt.text() to mark A100 (PCIe and SXM4 versions) on the plot:

ax = sns.lmplot(x='Transistors (mln)', y='GFLOPS', data=df, hue='Year', fit_reg=False,
                palette=sns.color_palette('RdBu', n_colors=16), height=7)
sns.regplot(x='Transistors (mln)', y='GFLOPS', data=df, scatter=False, ax=ax.axes[0, 0], order=2)
plt.text(44000, 19000, 'A100 SXM4')
plt.text(45000, 13000, 'A100 PCIe')
ax.axes[0, 0].set_xlim((0, 57000))
plt.show()
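Hard-coded text coordinates such as those above have to be adjusted by hand whenever the data changes. An alternative sketch is to look up the rows to label and annotate them at their own coordinates. The table below is invented, and the "Name" column and its values are assumptions about how the GPUs are identified.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows; the "Name" column and all numbers are assumptions for illustration
demo = pd.DataFrame({
    "Name": ["GPU A", "GPU B", "GPU C"],
    "Transistors (mln)": [10000, 25000, 54000],
    "GFLOPS": [8000, 14000, 19000],
})

fig, ax = plt.subplots()
ax.scatter(demo["Transistors (mln)"], demo["GFLOPS"])

# Label selected GPUs at their own data coordinates, nudged 5 points to the right
to_label = demo[demo["Name"].isin(["GPU B", "GPU C"])]
for name, x, y in zip(to_label["Name"],
                      to_label["Transistors (mln)"],
                      to_label["GFLOPS"]):
    ax.annotate(name, xy=(x, y), xytext=(5, 0), textcoords="offset points")
```

The offset is given in points rather than data units, so the labels keep the same distance from their markers at any axis scale.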

Joint plot

sns.jointplot() displays the scatter plot with the linear approximation on the main plot and additional histograms for x and y axes.

Data insight #8

Another interesting observation is that GFLOPS per transistor has shown almost no growth over the years.

sns.jointplot(x='Year', y='GFLOPS/transistors', data=df,
              kind="reg", truncate=False, marginal_kws={'bins': 14},
              xlim=(2006, 2021),
              color="m", height=7)
plt.show()

Kde plot

sns.kdeplot() is another way to display how die size has changed over the years. The darkest areas mark the highest concentration of data points.

from matplotlib.ticker import MaxNLocator
plt.figure(figsize=(6, 6))
ax = sns.kdeplot(x='Year', y='Die size', data=df, fill=True)
ax.axes.xaxis.set_major_locator(MaxNLocator(integer=True))
plt.show()
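To see why dark areas correspond to dense data, it helps to compute a kernel density estimate by hand. The sketch below is a simplified one-dimensional version with a fixed, arbitrarily chosen bandwidth; sns.kdeplot() picks the bandwidth automatically and, in the plot above, works in two dimensions. The function name gaussian_kde_1d and the sample values are invented for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


# Invented die sizes clustered around two values
samples = np.array([100, 110, 120, 300, 310])
grid = np.linspace(-100, 500, 1201)
density = gaussian_kde_1d(samples, grid, bandwidth=20.0)

# The estimate peaks where the samples are most concentrated
print(grid[np.argmax(density)])
```

A smaller bandwidth produces sharper, more separated peaks, while a larger one blurs neighbouring clusters together.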

Pair plot

sns.pairplot() provides a comprehensive view of our resulting dataframe.

Tip #2

You can copy any output image by holding Shift and right-clicking and then selecting "Copy".

sns.pairplot(df, hue='Class')
plt.show()

sns.pairplot(df, hue='Manufacturer')
plt.show()

That's it for the Advanced visualization tutorial with Seaborn. We hope you are inspired to create advanced visualizations to better illustrate your work and communicate ideas.

Feel free to play with our GPU dataset available here. Please mention the Datalore team as the source of the Dataset. We would love to see and share your work with our community.
