We Downloaded 10,000,000 Jupyter Notebooks From GitHub – This Is What We Learned

Here’s how we used the hundreds of thousands of publicly accessible repos on GitHub to learn more about the current state of data science.

How to work with this notebook

If you want to edit a copy of this published notebook:

  • Click the "Edit a copy" button in the upper right-hand corner of the window.
  • Sign in to Datalore (you may need to sign up first).
  • Run the Setup section of this notebook to download the preprocessed data and import the packages.

This notebook contains 3 sheets:

  • Kernels
  • Notebook cells
  • Imports

💡 Important note

Though we have preprocessed the initial 5 TB of data down to 3 GB, the dataset still describes 10,000,000 files, so downloading and working with it can take some time. Using a basic machine on the Community plan, you'll be able to run the Kernels and Notebook cells sheets. To run the Imports sheet, you will need a large machine, which is only available on the Professional plan. Learn more at https://datalore.jetbrains.com/pricing/.

Setup

import json
from collections import defaultdict
from urllib.request import urlretrieve

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from tqdm import tqdm

sns.set()

The code below downloads preprocessed files from a public Amazon S3 bucket.

files = ['ntb_2020_consistency.csv', 'ntb_2020_from_mooc.csv', 'ntb_2020_imports.csv', 'ntb_2020_md_stats.csv',
         'ntb_2020_text_counts.csv', 'ntb_2020_versions.csv', '2019_imports_4128764_nbs.json']
with tqdm(total=len(files), desc='Downloading files') as pbar:
    for f in files:
        urlretrieve(f'https://github-notebooks-samples.s3-eu-west-1.amazonaws.com/{f}', f)
        pbar.update()
Downloading files: 100%|██████████| 7/7 [03:09<00:00, 27.02s/it]

The language of data science

Let's take a look at the programming languages and kernels used in notebooks from our sample.

versions_df = pd.read_csv('ntb_2020_versions.csv').drop(columns=['Unnamed: 0'])
versions_df.head()
ntb_versions = versions_df['version'].value_counts().to_frame().reset_index().rename(columns={'index': 'kernel', 'version': 'count'})[:10]
ntb_versions.head(10)

Despite the rapid growth in popularity of R and Julia in recent years, Python remains by an enormous margin the most commonly used language for writing code in Jupyter notebooks.

Other commonly used languages include Scala, Scilab, C++, Bash, MATLAB, and Java.

The chart below illustrates how common various programming languages were in the sample of notebooks we studied.

The “nan” value refers to notebooks for which the language metadata was not available.
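For context, these strings come straight from each notebook's JSON metadata (the kernelspec and language_info fields defined by the nbformat spec). Below is a minimal sketch of how one such value could be extracted from a single .ipynb file; the helper itself is ours and only illustrates the idea, it is not the original preprocessing code:

def language_version(path):
    # A .ipynb file is JSON; the language name and version live under metadata.language_info.
    with open(path, encoding='utf-8') as fp:
        meta = json.load(fp).get('metadata', {})
    info = meta.get('language_info', {})
    if 'name' not in info:
        return None  # such notebooks end up in the 'nan' bucket below
    # Keep only major.minor so that e.g. 'python 3.7.4' and 'python 3.7.0' collapse to 'python 3.7'.
    version = '.'.join(info.get('version', '').split('.')[:2])
    return f"{info['name'].lower()} {version}".strip()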

ntb_kernels = versions_df['version'].map(lambda x: str(x).split()[0]).value_counts().to_frame().reset_index().rename(columns={'index': 'kernel_type', 'version': 'count'})[:10]
plt.figure(figsize=(20, 10))
ax = sns.barplot(data=ntb_kernels, x='kernel_type', y='count', orient='v')
for idx, row in ntb_kernels.iterrows():
    ax.text(idx, row['count'] + 20000, row['count'], ha='center')
ax.set_title('Language distribution')
ax.set_xlabel('Language')
ax.set_ylabel('Number of notebooks')
plt.show()

Below you can see the distribution of Python and R notebooks broken down by language version:

plt.figure(figsize=(20,10))
ax = sns.barplot(data=ntb_versions, x='kernel', y='count', orient='v')
for idx, row in ntb_versions.iterrows():
    ax.text(idx, row['count'] + 20000, row['count'], ha='center')
ax.set_title('Kernel distribution')
ax.set_xlabel('Kernel')
ax.set_ylabel('Number of notebooks')
plt.show()
# will be used to count only Python 3.6+ notebooks
is_python_36_plus = versions_df['version'].map(lambda x: str(x) in ['python 3.6', 'python 3.7', 'python 3.8'])
python_2_count = versions_df['version'].map(lambda x: str(x).startswith('python 2')).sum()
python_2_count
1154495
python_3_count = versions_df['version'].map(lambda x: str(x).startswith('python 3')).sum()
python_3_count
7709913
not_python_count = versions_df['version'].map(lambda x: x and not str(x).startswith('python')).sum()
not_python_count
1050064
new_nbs_count = is_python_36_plus.sum()
new_nbs_count
6586485
del ntb_versions
del ntb_kernels
del versions_df

Consistency of notebooks

It’s a well-known problem that not all Jupyter notebooks can be reproduced. We investigated one part of this problem by looking at the order in which code cells were executed. If the code cells of a notebook were not originally executed in linear order, we can’t be sure that executing them linearly will reproduce the same results.

We categorized notebooks whose cells were executed in a non-linear order as “inconsistent”, and it turns out that 36% of the notebooks we investigated fell into this category.
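The preprocessing pipeline itself is not part of this notebook, but the idea behind the check is simple: a notebook counts as consistent if the recorded execution counts of its code cells increase monotonically from top to bottom. A minimal sketch of such a check (our own illustrative helper, not the original pipeline code):

def is_linearly_executed(notebook_json):
    # Collect execution counts of executed code cells, in document order.
    counts = [cell.get('execution_count')
              for cell in notebook_json.get('cells', [])
              if cell.get('cell_type') == 'code' and cell.get('execution_count') is not None]
    # Strictly increasing counts mean the cells were run top to bottom exactly once.
    return all(a < b for a, b in zip(counts, counts[1:]))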

consistency_df = pd.read_csv('ntb_2020_consistency.csv').drop(columns=['Unnamed: 0'])
consistency_df = consistency_df[is_python_36_plus]
consistency_df.head()
mooc_df = pd.read_csv('ntb_2020_from_mooc.csv').drop(columns=['Unnamed: 0'])
mooc_df = mooc_df[is_python_36_plus]
mooc_df.head()
print(f'{consistency_df["is_consistent"].sum()/len(consistency_df)*100:.2f}% of notebooks are consistent.')
print(f'At least {mooc_df["from_mooc"].sum()/len(mooc_df)*100:.2f}% of notebooks come from online courses.')
64.32% of notebooks are consistent.
At least 2.76% of notebooks come from online courses.
del mooc_df
markdown_df = pd.read_csv('ntb_2020_md_stats.csv').drop(columns=['Unnamed: 0'])
markdown_df = markdown_df[is_python_36_plus]
markdown_df.head()
text_df = pd.read_csv('ntb_2020_text_counts.csv').drop(columns=['Unnamed: 0'])
text_df = text_df[is_python_36_plus]
text_df.head()

Notebook cells and their contents

Some general statistics regarding notebook contents:

  • 71.90% of notebooks contain Markdown.
  • 42.13% of notebooks contain image outputs (plots).
  • 12.34% of notebooks contain LaTeX.
  • 19.77% of notebooks contain HTML.
  • 20.63% of notebooks contain code inside Markdown.
print(f'{len(text_df[text_df["md_chars_count"]>0])/len(text_df)*100:.2f}% of notebooks contain Markdown.')
print(f'{len(text_df[text_df["images_count"]>0])/len(text_df)*100:.2f}% of notebooks contain image outputs (plots).')
print(f'{markdown_df["contains_latex"].sum()/len(markdown_df)*100:.2f}% of notebooks contain LaTeX.')
print(f'{markdown_df["contains_html"].sum()/len(markdown_df)*100:.2f}% of notebooks contain HTML.')
print(f'{markdown_df["contains_code"].sum()/len(markdown_df)*100:.2f}% of notebooks contain code inside Markdown.')
71.90% of notebooks contain Markdown.
42.13% of notebooks contain image outputs (plots).
12.34% of notebooks contain LaTeX.
19.77% of notebooks contain HTML.
20.63% of notebooks contain code inside Markdown.
del markdown_df
text_df.quantile([.05, .15, .25, .5, .75, .9, .95])

Markdown is widely used in notebooks, but typically in small amounts: half of all notebooks contain fewer than 4 Markdown cells, while half contain more than 66 lines of code.

The charts below illustrate the distribution of Markdown and code cells across notebooks.

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['md_cells_count'] > 0) & (text_df['md_cells_count'] < 80)]['md_cells_count'],
         bins=79, density=False, log=False)
plt.title('Distribution of the number of Markdown cells')
plt.xlabel('Number of Markdown cells')
plt.ylabel('Number of notebooks')
plt.show()

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['md_lines_count'] != 0) & (text_df['md_lines_count'] < 300)]['md_lines_count'],
         bins=149, density=False, log=False)
plt.title('Distribution of the number of Markdown lines')
plt.xlabel('Number of Markdown lines')
plt.ylabel('Number of notebooks')
plt.show()

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['md_chars_count'] != 0) & (text_df['md_chars_count'] < 1500)]['md_chars_count'],
         bins=199, density=False, log=False)
plt.title('Distribution of the number of Markdown characters')
plt.xlabel('Number of Markdown characters')
plt.ylabel('Number of notebooks')
plt.show()

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['md_words_count'] != 0) & (text_df['md_words_count'] < 1500)]['md_words_count'],
         bins=199, density=False, log=False)
plt.title('Distribution of the number of Markdown words')
plt.xlabel('Number of Markdown words')
plt.ylabel('Number of notebooks')
plt.show()

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['code_cells_count'] != 0) & (text_df['code_cells_count'] < 100)]['code_cells_count'],
         bins=99, density=False, log=False)
plt.title('Distribution of the number of code cells')
plt.xlabel('Number of code cells in a notebook')
plt.ylabel('Number of notebooks')
plt.show()

And here is the distribution of the number of code lines. Although there are some outliers, such as notebooks with more than 25,000 lines of code, 95% of the notebooks contain fewer than 465 lines.

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['code_lines_count'] != 0) & (text_df['code_lines_count'] < 800)]['code_lines_count'],
         bins=199, density=False, log=False)
plt.title('Distribution of the number of code lines')
plt.xlabel('Number of code lines in a notebook')
plt.ylabel('Number of notebooks')
plt.show()

Plotting is also very popular. As mentioned above, 42% of notebooks contain plots, and 10% of these contain more than 8 plots.

plt.figure(figsize=(20, 10))
plt.hist(text_df[(text_df['images_count'] != 0) & (text_df['images_count'] < 50)]['images_count'],
         bins=48, density=False, log=False)
plt.title('Distribution of the number of plots')
plt.xlabel('Number of plots in a notebook')
plt.ylabel('Number of notebooks')
plt.show()

text_df.drop(columns=['md_cells_count', 'md_lines_count', 'md_chars_count', 'md_words_count',
                      'code_lines_count', 'code_chars_count', 'images_count'], inplace=True)

We looked at whether consistency depends on the number of Markdown and code cells. Unsurprisingly, we found that the fewer code and Markdown cells a notebook has, the more likely it is to be consistent.

plt.figure(figsize=(2010))
sns.kdeplot(data=text_df[text_df['code_cells_count'] < 100],
            x='code_cells_count', hue=consistency_df['is_consistent'],
            cumulative=False, common_norm=False, common_grid=False, gridsize=99)
plt.title('Consistency of notebooks')
plt.xlabel('Number of code cells in a notebook')
plt.show()
del text_df
del consistency_df

Top data science libraries

imports_df = pd.read_csv('ntb_2020_imports.csv').drop(columns=['Unnamed: 0'])
imports_df = imports_df[is_python_36_plus]
imports_df.head()
new_nbs_count = len(imports_df)
def parse_list_column(s):
    # The imports column stores Python lists serialized with single quotes;
    # convert them to valid JSON before parsing, and fall back to an empty
    # list for missing or malformed values (e.g. NaN).
    try:
        return json.loads(s.replace("'", '"'))
    except Exception:
        return []
# Top-level and submodule names of the Python 3 and Python 2 standard libraries,
# used later to separate built-in imports from third-party ones.
_py3_stdlib = '''
__future__ __main__ _dummy_thread _thread abc aifc argparse array ast asynchat asyncio asyncore atexit audioop
base64 bdb binascii binhex bisect builtins bz2 cProfile calendar cgi cgitb chunk cmath cmd code codecs codeop
collections collections.abc colorsys compileall concurrent.futures configparser contextlib copy copyreg crypt csv
ctypes curses curses.ascii curses.panel curses.textpad datetime dbm dbm.dumb dbm.gnu dbm.ndbm decimal difflib dis
distutils distutils.archive_util distutils.bcppcompiler distutils.ccompiler distutils.cmd distutils.command
distutils.command.bdist distutils.command.bdist_dumb distutils.command.bdist_msi distutils.command.bdist_packager
distutils.command.bdist_rpm distutils.command.bdist_wininst distutils.command.build distutils.command.build_clib
distutils.command.build_ext distutils.command.build_py distutils.command.build_scripts distutils.command.check
distutils.command.clean distutils.command.config distutils.command.install distutils.command.install_data
distutils.command.install_headers distutils.command.install_lib distutils.command.install_scripts
distutils.command.register distutils.command.sdist distutils.core distutils.cygwinccompiler distutils.debug
distutils.dep_util distutils.dir_util distutils.dist distutils.errors distutils.extension distutils.fancy_getopt
distutils.file_util distutils.filelist distutils.log distutils.msvccompiler distutils.spawn distutils.sysconfig
distutils.text_file distutils.unixccompiler distutils.util distutils.version doctest dummy_threading email
email.charset email.contentmanager email.encoders email.errors email.generator email.header email.headerregistry
email.iterators email.message email.mime email.parser email.policy email.utils encodings.idna encodings.mbcs
encodings.utf_8_sig ensurepip enum errno faulthandler fcntl filecmp fileinput fnmatch formatter fpectl fractions
ftplib functools gc getopt getpass gettext glob grp gzip hashlib heapq hmac html html.entities html.parser http
http.client http.cookiejar http.cookies http.server imaplib imghdr imp importlib importlib.abc importlib.machinery
importlib.util inspect io ipaddress itertools json json.tool keyword lib2to3 linecache locale logging
logging.config logging.handlers lzma macpath mailbox mailcap marshal math mimetypes mmap modulefinder msilib
msvcrt multiprocessing multiprocessing.connection multiprocessing.dummy multiprocessing.managers
multiprocessing.pool multiprocessing.sharedctypes netrc nis nntplib numbers operator optparse os os.path
ossaudiodev parser pathlib pdb pickle pickletools pipes pkgutil platform plistlib poplib posix pprint profile
pstats pty pwd py_compile pyclbr pydoc queue quopri random re readline reprlib resource rlcompleter runpy sched
select selectors shelve shlex shutil signal site smtpd smtplib sndhdr socket socketserver spwd sqlite3 ssl stat
statistics string stringprep struct subprocess sunau symbol symtable sys sysconfig syslog tabnanny tarfile
telnetlib tempfile termios test test.support textwrap threading time timeit tkinter tkinter.scrolledtext
tkinter.tix tkinter.ttk token tokenize trace traceback tracemalloc tty turtle turtledemo types typing unicodedata
unittest unittest.mock urllib urllib.error urllib.parse urllib.request urllib.response urllib.robotparser uu uuid
venv warnings wave weakref webbrowser winreg winsound wsgiref wsgiref.handlers wsgiref.headers
wsgiref.simple_server wsgiref.util wsgiref.validate xdrlib xml xml.dom xml.dom.minidom xml.dom.pulldom
xml.etree.ElementTree xml.parsers.expat xml.parsers.expat.errors xml.parsers.expat.model xml.sax xml.sax.handler
xml.sax.saxutils xml.sax.xmlreader xmlrpc.client xmlrpc.server zipapp zipfile zipimport zlib
'''.split()
_py2_stdlib = '''
AL BaseHTTPServer Bastion CGIHTTPServer Carbon.AE Carbon.AH Carbon.App Carbon.Appearance Carbon.CF Carbon.CG
Carbon.CarbonEvents Carbon.CarbonEvt Carbon.Cm Carbon.Components Carbon.ControlAccessor Carbon.Controls
Carbon.CoreFounation Carbon.CoreGraphics Carbon.Ctl Carbon.Dialogs Carbon.Dlg Carbon.Drag Carbon.Dragconst
Carbon.Events Carbon.Evt Carbon.File Carbon.Files Carbon.Fm Carbon.Folder Carbon.Folders Carbon.Fonts Carbon.Help
Carbon.IBCarbon Carbon.IBCarbonRuntime Carbon.Icns Carbon.Icons Carbon.Launch Carbon.LaunchServices Carbon.List
Carbon.Lists Carbon.MacHelp Carbon.MediaDescr Carbon.Menu Carbon.Menus Carbon.Mlte Carbon.OSA Carbon.OSAconst
Carbon.QDOffscreen Carbon.Qd Carbon.Qdoffs Carbon.Qt Carbon.QuickDraw Carbon.QuickTime Carbon.Res Carbon.Resources
Carbon.Scrap Carbon.Snd Carbon.Sound Carbon.TE Carbon.TextEdit Carbon.Win Carbon.Windows ColorPicker ConfigParser
Cookie DEVICE DocXMLRPCServer EasyDialogs FL FrameWork GL HTMLParser MacOS MimeWriter MiniAEFrame Nav
PixMapWrapper Queue SUNAUDIODEV ScrolledText SimpleHTTPServer SimpleXMLRPCServer SocketServer StringIO Tix Tkinter
UserDict UserList UserString W __builtin__ __future__ __main__ _winreg abc aepack aetools aetypes aifc al anydbm
applesingle argparse array ast asynchat asyncore atexit audioop autoGIL base64 bdb binascii binhex bisect bsddb
buildtools bz2 cPickle cProfile cStringIO calendar cd cfmfile cgi cgitb chunk cmath cmd code codecs codeop
collections colorsys commands compileall compiler compiler.ast compiler.visitor contextlib cookielib copy copy_reg
crypt csv ctypes curses curses.ascii curses.panel curses.textpad datetime dbhash dbm decimal difflib dircache dis
distutils distutils.archive_util distutils.bcppcompiler distutils.ccompiler distutils.cmd distutils.command
distutils.command.bdist distutils.command.bdist_dumb distutils.command.bdist_msi distutils.command.bdist_packager
distutils.command.bdist_rpm distutils.command.bdist_wininst distutils.command.build distutils.command.build_clib
distutils.command.build_ext distutils.command.build_py distutils.command.build_scripts distutils.command.check
distutils.command.clean distutils.command.config distutils.command.install distutils.command.install_data
distutils.command.install_headers distutils.command.install_lib distutils.command.install_scripts
distutils.command.register distutils.command.sdist distutils.core distutils.cygwinccompiler distutils.debug
distutils.dep_util distutils.dir_util distutils.dist distutils.emxccompiler distutils.errors distutils.extension
distutils.fancy_getopt distutils.file_util distutils.filelist distutils.log distutils.msvccompiler distutils.spawn
distutils.sysconfig distutils.text_file distutils.unixccompiler distutils.util distutils.version dl doctest
dumbdbm dummy_thread dummy_threading email email.charset email.encoders email.errors email.generator email.header
email.iterators email.message email.mime email.parser email.utils encodings.idna encodings.utf_8_sig ensurepip
errno exceptions fcntl filecmp fileinput findertools fl flp fm fnmatch formatter fpectl fpformat fractions ftplib
functools future_builtins gc gdbm gensuitemodule getopt getpass gettext gl glob grp gzip hashlib heapq hmac
hotshot hotshot.stats htmlentitydefs htmllib httplib ic icopen imageop imaplib imgfile imghdr imp importlib
imputil inspect io itertools jpeg json keyword lib2to3 linecache locale logging logging.config logging.handlers
macerrors macostools macpath macresource mailbox mailcap marshal math md5 mhlib mimetools mimetypes mimify mmap
modulefinder msilib msvcrt multifile multiprocessing multiprocessing.connection multiprocessing.dummy
multiprocessing.managers multiprocessing.pool multiprocessing.sharedctypes mutex netrc new nis nntplib numbers
operator optparse os os.path ossaudiodev parser pdb pickle pickletools pipes pkgutil platform plistlib popen2
poplib posix posixfile pprint profile pstats pty pwd py_compile pyclbr pydoc quopri random re readline resource
rexec rfc822 rlcompleter robotparser runpy sched select sets sgmllib sha shelve shlex shutil signal site smtpd
smtplib sndhdr socket spwd sqlite3 ssl stat statvfs string stringprep struct subprocess sunau sunaudiodev symbol
symtable sys sysconfig syslog tabnanny tarfile telnetlib tempfile termios test test.test_support textwrap thread
threading time timeit token tokenize trace traceback ttk tty turtle types unicodedata unittest urllib urllib2
urlparse user uu uuid videoreader warnings wave weakref webbrowser whichdb winsound wsgiref wsgiref.handlers
wsgiref.headers wsgiref.simple_server wsgiref.util wsgiref.validate xdrlib xml xml.dom xml.dom.minidom
xml.dom.pulldom xml.etree.ElementTree xml.parsers.expat xml.sax xml.sax.handler xml.sax.saxutils xml.sax.xmlreader
xmlrpclib zipfile zipimport zlib
'''.split()
builtin_imports = list(set(_py3_stdlib + _py2_stdlib))
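As an aside, on Python 3.10+ the interpreter exposes its own registry of top-level standard-library module names, which could be used to sanity-check a hand-maintained list like the one above. This is only a sketch: sys.stdlib_module_names covers just the running Python version, so the Python 2 names still have to be listed manually.

if hasattr(sys, 'stdlib_module_names'):  # available since Python 3.10
    missing = set(sys.stdlib_module_names) - {m.split('.')[0] for m in builtin_imports}
    print(f'{len(missing)} stdlib modules of this interpreter are missing from the manual list')
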
imports_df['imports'] = imports_df['imports'].apply(parse_list_column)  # pandas messes with lists when converting to/from CSV
imports_df['num_imports'] = imports_df['imports'].apply(len)
imports_df.head()
plt.figure(figsize=(20, 10))
plt.hist(imports_df[(imports_df['num_imports'] != 0) & (imports_df['num_imports'] < 50)]['num_imports'],
         bins=48, density=False, log=False)
plt.title('Distribution of the number of imports')
plt.xlabel('Number of imports in a notebook')
plt.ylabel('Number of notebooks')
plt.show()

imports_df.drop(columns=['num_imports'], inplace=True)
imports_count = defaultdict(int)
for imports in imports_df['imports']:
    imports = set(lib.split('.')[0] for lib in imports)
    for lib in imports:
        imports_count[lib] = imports_count.get(lib, 0) + 1
sorted_imports_count = sorted(imports_count.items(), key=lambda itm: itm[1], reverse=True)
sorted_imports_count = [x for x in sorted_imports_count if x[1] > 1000]
frequent_imports = list(x[0] for x in sorted_imports_count)
print(f'Got {len(sorted_imports_count)} unique libraries')
Got 607 unique libraries
sorted_imports_count = pd.DataFrame(sorted_imports_count, columns=['library', 'count'])

To help Datalore users get started quickly, we decided to preinstall the most commonly used Python packages. To choose them, we calculated statistics on the most frequently imported libraries.

It was no surprise to discover that 60% of notebooks import NumPy, while 47% import Pandas and Matplotlib.

The charts below illustrate the popularity of different packages in our notebook sample:

plt.figure(figsize=(25,10))
ax = sns.barplot(data=sorted_imports_count[:25], x='library', y='count', orient='v')
for idx, row in sorted_imports_count[:25].iterrows():
    ax.text(idx, row['count'] + 20000, f"{row['count']}\n({row['count']/new_nbs_count*100:.02f}%)", ha='center')
ax.set_title('Most popular Python libraries (built-ins included)')
ax.set_xlabel('Library')
ax.set_ylabel('Number of notebooks')
plt.show()
with open('2019_imports_4128764_nbs.json', 'r') as fp:
    old_imports_stats = json.load(fp)
old_imports_stats = pd.DataFrame(old_imports_stats, columns=['library', 'count'])
old_nbs_count = 4128764
plt.figure(figsize=(25, 10))
ax = sns.barplot(data=old_imports_stats[:25], x='library', y='count', orient='v')
for idx, row in old_imports_stats[:25].iterrows():
    ax.text(idx, row['count'] + 1000, f"{row['count']}\n({row['count'] / old_nbs_count * 100:.02f}%)", ha='center')
ax.set_title('Most popular 2019 Python libraries (built-ins included)')
ax.set_xlabel('Library')
ax.set_ylabel('Number of notebooks')
plt.show()
# Represent each notebook by the set of top-level packages it imports — each unique set is an "environment"
top_level_imports = [tuple(sorted(x)) for x in imports_df['imports']]
top_level_imports = [tuple(set(x.split('.')[0] for x in libs)) for libs in top_level_imports]
unique_imports_count = defaultdict(int)
for bag in top_level_imports:
    unique_imports_count[bag] += 1
sorted_unique_imports_count = sorted(unique_imports_count.items(), key=lambda itm: itm[1], reverse=True)
print(f'Notebooks w/o imports: {sorted_unique_imports_count[0][1]}')
Notebooks w/o imports: 916818
sorted_unique_imports_count = [('\n'.join(x), y) for x, y in sorted_unique_imports_count]
sorted_unique_imports_count[:10]
[('', 916818),
('pandas\nnumpy', 143806),
('pandas', 128702),
('numpy', 102773),
('sklearn\nseaborn\nnumpy\npandas\nmatplotlib', 100199),
('matplotlib\nnumpy', 97520),
('pandas\nmatplotlib\nnumpy\nsklearn', 87340),
('pandas\nmatplotlib\nnumpy', 86943),
('pandas\nmatplotlib\nnumpy\nseaborn', 61403),
('pandas\nnumpy\nsklearn', 45548)]
sorted_unique_imports_count = pd.DataFrame(sorted_unique_imports_count[1:], columns=['library', 'count'])
plt.figure(figsize=(25, 10))
ax = sns.barplot(data=sorted_unique_imports_count[:15], x='library', y='count', orient='v')
for idx, row in sorted_unique_imports_count[:15].iterrows():
    ax.text(idx, row['count'] + 1000, f"{row['count']}\n({row['count'] / new_nbs_count * 100:.02f}%)", ha='center')
ax.set_title('Most popular Python environments (built-ins included)')
ax.set_xlabel('Libraries')
ax.set_ylabel('Number of notebooks')
plt.show()
sorted_imports_count = sorted(imports_count.items(), key=lambda itm: itm[1], reverse=True)
sorted_imports_count = [x for x in sorted_imports_count if x[1] > 1000 and x[0] not in builtin_imports]
frequent_imports = list(x[0] for x in sorted_imports_count)
print(f'Got {len(sorted_imports_count)} unique libraries')
Got 503 unique libraries
sorted_imports_count = pd.DataFrame(sorted_imports_count, columns=['library', 'count'])
plt.figure(figsize=(25,10))
ax = sns.barplot(data=sorted_imports_count[:20], x='library', y='count', orient='v')
for idx, row in sorted_imports_count[:20].iterrows():
    ax.text(idx, row['count'] + 20000, f"{row['count']}\n({row['count']/(new_nbs_count - 1186901)*100:.02f}%)", ha='center')
ax.set_title('Most popular Python libraries (built-ins excluded)')
ax.set_xlabel('Library')
ax.set_ylabel('Number of notebooks')
plt.show()
old_imports_stats
old_imports_stats = old_imports_stats[old_imports_stats['library'].map(lambda x: x not in builtin_imports)].reset_index()
plt.figure(figsize=(25, 10))
ax = sns.barplot(data=old_imports_stats[:25], x='library', y='count', orient='v')
for idx, row in old_imports_stats[:25].iterrows():
    ax.text(idx, row['count'] + 1000, f"{row['count']}\n({row['count'] / old_nbs_count * 100:.02f}%)", ha='center')
ax.set_title('Most popular 2019 Python libraries (built-ins excluded)')
ax.set_xlabel('Library')
ax.set_ylabel('Number of notebooks')
plt.show()
del top_level_imports, unique_imports_count, sorted_unique_imports_count
top_level_imports = [tuple(sorted(x)) for x in imports_df['imports']]
top_level_imports = [tuple(set(x.split('.')[0] for x in libs)) for libs in top_level_imports]
top_level_imports = [tuple(x for x in libs if x not in builtin_imports) for libs in top_level_imports]
unique_imports_count = defaultdict(int)
for bag in top_level_imports:
    unique_imports_count[bag] += 1
sorted_unique_imports_count = sorted(unique_imports_count.items(), key=lambda itm: itm[1], reverse=True)
sorted_unique_imports_count = [('\n'.join(x), y) for x, y in sorted_unique_imports_count]
sorted_unique_imports_count[:10]
[('', 1186927),
('pandas\nnumpy', 192948),
('pandas', 180326),
('sklearn\nseaborn\nnumpy\npandas\nmatplotlib', 167154),
('numpy', 139884),
('matplotlib\nnumpy', 133750),
('pandas\nmatplotlib\nnumpy', 129441),
('pandas\nmatplotlib\nnumpy\nsklearn', 87340),
('pandas\nnumpy\nsklearn', 64296),
('pandas\nmatplotlib\nnumpy\nseaborn', 61403)]
sorted_unique_imports_count = pd.DataFrame(sorted_unique_imports_count[1:], columns=['library', 'count'])
plt.figure(figsize=(25,10))
ax = sns.barplot(data=sorted_unique_imports_count[:15], x='library', y='count', orient='v')
for idx, row in sorted_unique_imports_count[:15].iterrows():
    ax.text(idx, row['count'] + 1000, f"{row['count']}\n({row['count']/(new_nbs_count - 1186901)*100:.02f}%)", ha='center')
ax.set_title('Most popular Python environments (built-ins excluded)')
ax.set_xlabel('Libraries')
ax.set_ylabel('Number of notebooks')
plt.show()