Testing ipynb to markdown in distill

Python, Ipynb, Jupyter, Tutorials


Ani Ruhil
01-16-2021

Having migrated my site from hugo Academic to {{distill}} I wanted to see if I could fold-in python rendered notebooks with ease, and what the results would look like. Suffice it to say the process has been painless so far …

  1. create the usual Rmd for the post,
  2. export the ipynb notebook from jupyter as a markdown file (.md),
  3. copy-and-paste the md file’s content into the post’s Rmd file, and
  4. add some css.

I had some pretty long and wide tables in the notebook that did not fit within the page-width and so I had to use some css to create a scrollable div ([@braican's solution found here](https://stackoverflow.com/a/17451132)) and voila!

.scrollit {
  overflow: scroll;
  height: 250px;
}
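The idea is that any wide table in the post’s Rmd then gets wrapped in a div carrying the scrollit class, so the table scrolls inside a fixed-height box instead of spilling past the page margin.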

Lesson 01 … Welcome to pandas

We will start by importing some libraries we need and then play with some data to understand basic python commands. What data shall we work with? Well, let us pull down some data on crime incidents that were reported.

First we import a particular library called pandas (and numpy as well). In the command that follows, note that pd is just the alias that pandas assumes, so that we can type pd and have all the pandas commands at our disposal.

import pandas as pd   # pandas handles the data-frames; pd is the usual alias
import numpy as np    # numpy handles numerical routines; np is the usual alias

The crime incident reports data are available here and span multiple years, so we may end up working with only a single year’s worth of data, but for now we proceed by gathering everything.

In the command below, the key part is pd.read_csv(), and inside it is the URL for the comma-separated values (csv) file. Once the file is downloaded by pandas we save it in python under the name df.

Note that data sets / data files are usually referred to as data-frames in python, hence the name df.

df = pd.read_csv("https://tinyurl.com/y6hgsnkm")
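As a quick sanity check that the download worked (a small aside, not part of the original notebook), we can peek at the data-frame’s dimensions with .shape; given the counts reported by .describe() further below, this should come back as (515082, 17), i.e., 515,082 rows and 17 columns.

df.shape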

Let us look at the first 5 rows of data to get a feel for the layout. The command is .head(5); called with no number, .head() defaults to the first 5 rows anyway.

df.head(5)
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location
0 TESTTEST2 423 NaN ASSAULT - AGGRAVATED External 0 2019-10-16 00:00:00 2019 10 Wednesday 0 NaN RIVERVIEW DR NaN NaN (0.00000000, 0.00000000)
1 S97333701 3301 NaN VERBAL DISPUTE C6 915 0 2020-07-18 14:34:00 2020 7 Saturday 14 NaN MARY BOYLE WAY 42.330813 -71.051368 (42.33081300, -71.05136800)
2 S47513131 2647 NaN THREATS TO DO BODILY HARM E18 530 0 2020-06-24 10:15:00 2020 6 Wednesday 10 NaN READVILLE ST 42.239491 -71.135954 (42.23949100, -71.13595400)
3 I92102201 3301 NaN VERBAL DISPUTE E13 583 0 2019-12-20 03:08:00 2019 12 Friday 3 NaN DAY ST 42.325122 -71.107779 (42.32512200, -71.10777900)
4 I92097173 3115 NaN INVESTIGATE PERSON C11 355 0 2019-10-23 00:00:00 2019 10 Wednesday 0 NaN GIBSON ST 42.297555 -71.059709 (42.29755500, -71.05970900)

What about the last 10 rows of the data?

df.tail(10)
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat Long Location
515072 102095489 3115 NaN INVESTIGATE PERSON E18 520 0 2019-11-25 16:30:00 2019 11 Monday 16 NaN HYDE PARK AVE 42.256215 -71.124019 (42.25621500, -71.12401900)
515073 102091671 2647 NaN THREATS TO DO BODILY HARM B3 417 0 2019-11-12 12:00:00 2019 11 Tuesday 12 NaN MORA ST 42.282081 -71.073648 (42.28208100, -71.07364800)
515074 020224065 3018 NaN SICK/INJURED/MEDICAL - POLICE B2 282 0 2020-03-19 07:30:00 2020 3 Thursday 7 NaN WASHINGTON ST 42.353272 -71.173738 (42.35327200, -71.17373800)
515075 020202856 2672 NaN BIOLOGICAL THREATS B2 282 0 2020-03-19 08:30:00 2020 3 Thursday 8 NaN WARREN ST 42.328234 -71.083289 (42.32823400, -71.08328900)
515076 020063425 3114 NaN INVESTIGATE PROPERTY A7 21 0 2020-09-01 00:00:00 2020 9 Tuesday 0 NaN PARIS ST 42.374426 -71.035278 (42.37442600, -71.03527800)
515077 020062356 3115 NaN INVESTIGATE PERSON E18 520 0 2020-08-28 18:39:00 2020 8 Friday 18 NaN HYDE PARK AVE 42.256215 -71.124019 (42.25621500, -71.12401900)
515078 020054040 3501 NaN MISSING PERSON C11 0 2020-07-30 15:30:00 2020 7 Thursday 15 NaN GIBSON ST 42.297555 -71.059709 (42.29755500, -71.05970900)
515079 020046400 1501 NaN WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR… B2 330 0 2020-07-02 01:38:00 2020 7 Thursday 1 NaN PASADENA RD 42.305760 -71.083771 (42.30576000, -71.08377100)
515080 020038446 1501 NaN WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR… B2 300 0 2020-06-03 01:15:00 2020 6 Wednesday 1 NaN WASHINGTON ST 42.323807 -71.089150 (42.32380700, -71.08915000)
515081 020030892 540 NaN BURGLARY - COMMERICAL C11 380 0 2020-05-03 00:00:00 2020 5 Sunday 0 NaN GALLIVAN BLVD 42.283700 -71.047761 (42.28370000, -71.04776100)

Let us look at the contents of the data-frame …

| Column Name | Type | Description |
| --- | --- | --- |
| incident_num | varchar NOT NULL | Internal BPD report number |
| offense_code | varchar(25) NULL | Numerical code of offense description |
| Offense_Code_Group_Description | varchar(80) NULL | Internal categorization of offense_description |
| Offense_Description | varchar(80) NULL | Primary descriptor of incident |
| district | varchar NULL | What district the crime was reported in |
| reporting_area | varchar NULL | RA number associated with where the crime was reported from |
| shooting | char(1) NULL | Indicates whether a shooting took place |
| occurred_on | datetime2 NULL | Earliest date and time the incident could have taken place |
| UCR_Part | varchar NULL | Uniform Crime Reporting (UCR) Part number (1, 2, 3) |
| street | varchar NULL | Street name where the incident took place |
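If we prefer to pull this sort of information straight from the data-frame rather than the data dictionary, .info() lists every column along with its non-null count and data type; a minimal sketch:

df.info()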

Offense Codes are available here

We could also look at the offense codes by reading them in as a data-frame. This is an Excel file so we will have to switch to .read_excel()

offense_codes = pd.read_excel("https://tinyurl.com/y6hgsnkm")
print(offense_codes)
     CODE                                       NAME
0     612           LARCENY PURSE SNATCH - NO FORCE 
1     613                        LARCENY SHOPLIFTING
2     615    LARCENY THEFT OF MV PARTS & ACCESSORIES
3    1731                                     INCEST
4    3111                  LICENSE PREMISE VIOLATION
..    ...                                        ...
571  1806  DRUGS - CLASS B TRAFFICKING OVER 18 GRAMS
572  1807  DRUGS - CLASS D TRAFFICKING OVER 50 GRAMS
573  1610    HUMAN TRAFFICKING - COMMERCIAL SEX ACTS
574  2010                              HOME INVASION
575  1620  HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE

[576 rows x 2 columns]
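Since offense_codes maps each numeric CODE to a NAME, one thing we could do (a sketch, not something the original notebook does) is merge the lookup table into the incident data so every row carries a readable offense name. If a code appears more than once in the lookup table the merge will duplicate incident rows, so de-duplicating the codes first is a sensible precaution.

# keep one NAME per CODE, then attach it to every incident
codes_unique = offense_codes.drop_duplicates(subset = 'CODE')
df_named = df.merge(codes_unique, left_on = 'OFFENSE_CODE', right_on = 'CODE', how = 'left')
df_named[['OFFENSE_CODE', 'NAME']].head(5)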

The next step would be to see how many data points we have, and what the minimum, maximum, and average values are. This can be done with .describe()

df.describe()
OFFENSE_CODE YEAR MONTH HOUR Lat Long
count 515082.000000 515082.000000 515082.000000 515082.000000 485909.000000 485909.000000
mean 2333.275632 2017.542933 6.634194 13.079170 42.239043 -70.949353
std 1182.489822 1.543329 3.317964 6.347259 1.891645 3.060012
min 111.000000 2015.000000 1.000000 0.000000 -1.000000 -71.203312
25% 1102.000000 2016.000000 4.000000 9.000000 42.296861 -71.097465
50% 3005.000000 2018.000000 7.000000 14.000000 42.325029 -71.077723
75% 3201.000000 2019.000000 9.000000 18.000000 42.348312 -71.062562
max 3831.000000 2020.000000 12.000000 23.000000 42.395042 0.000000

By default the command will report the values with decimals but we may not want that. Decimals can be rounded or removed altogether as shown below.

df.describe().round(2)
OFFENSE_CODE YEAR MONTH HOUR Lat Long
count 515082.00 515082.00 515082.00 515082.00 485909.00 485909.00
mean 2333.28 2017.54 6.63 13.08 42.24 -70.95
std 1182.49 1.54 3.32 6.35 1.89 3.06
min 111.00 2015.00 1.00 0.00 -1.00 -71.20
25% 1102.00 2016.00 4.00 9.00 42.30 -71.10
50% 3005.00 2018.00 7.00 14.00 42.33 -71.08
75% 3201.00 2019.00 9.00 18.00 42.35 -71.06
max 3831.00 2020.00 12.00 23.00 42.40 0.00
df.describe().round(0)
OFFENSE_CODE YEAR MONTH HOUR Lat Long
count 515082.0 515082.0 515082.0 515082.0 485909.0 485909.0
mean 2333.0 2018.0 7.0 13.0 42.0 -71.0
std 1182.0 2.0 3.0 6.0 2.0 3.0
min 111.0 2015.0 1.0 0.0 -1.0 -71.0
25% 1102.0 2016.0 4.0 9.0 42.0 -71.0
50% 3005.0 2018.0 7.0 14.0 42.0 -71.0
75% 3201.0 2019.0 9.0 18.0 42.0 -71.0
max 3831.0 2020.0 12.0 23.0 42.0 0.0
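Note that .round(0) still prints a trailing .0 after every value. If we want the decimals gone entirely, one option (assuming the summary contains no missing values, which is the case here) is to cast the rounded summary to integers:

df.describe().round(0).astype(int)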

Note a few things in these summaries: the count for Lat and Long is smaller than for the other columns, so some incidents are missing coordinates, and the minimum Lat of -1 and the maximum Long of 0 are clearly not real Boston locations.

Say we want to restrict the dataframe just to 2020. How can we do that?

df20 = df[ df['YEAR'] == 2020 ]

Notice the sequence here: dataframe[ dataframe['column-name'] == somevalue ]. Pay attention to the double equal sign ==, which tests for equality; a single = would be an assignment.
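The same pattern extends to more than one condition: wrap each condition in parentheses and combine them with & (and) or | (or). For example, a small sketch (df_vd20 is just an illustrative name) that keeps only the 2020 verbal disputes, offense code 3301:

# rows where YEAR is 2020 AND the offense code is 3301
df_vd20 = df[ (df['YEAR'] == 2020) & (df['OFFENSE_CODE'] == 3301) ]
df_vd20.shape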

df20.describe()
OFFENSE_CODE YEAR MONTH HOUR Lat Long
count 63733.000000 63733.0 63733.000000 63733.000000 62200.000000 62200.000000
mean 2353.137323 2020.0 4.900554 12.923525 42.319872 -71.084193
std 1182.670996 0.0 2.561463 6.566899 0.032339 0.030578
min 111.000000 2020.0 1.000000 0.000000 42.181845 -71.203312
25% 1001.000000 2020.0 3.000000 9.000000 42.295353 -71.098579
50% 3005.000000 2020.0 5.000000 14.000000 42.321918 -71.078444
75% 3207.000000 2020.0 7.000000 18.000000 42.344561 -71.062000
max 3831.000000 2020.0 9.000000 23.000000 42.395041 -70.953726

At this point we might be curious to know what types of offenses are most often reported. Before we do that, however, let us also see how many unique values of OFFENSE_CODE there are.

df20['OFFENSE_CODE'].nunique()
130
df20['OFFENSE_CODE'].value_counts()
3301    6234
3115    5494
801     3908
3005    3227
3831    2700
        ... 
1603       2
3203       1
2672       1
990        1
2628       1
Name: OFFENSE_CODE, Length: 130, dtype: int64

So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from their list so we have no idea what it is!! That is a crime in itself.
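One way to check is to look the codes up in the offense_codes table we read in earlier; filtering on CODE should return the matching row, and an empty data-frame if the code really is absent (a small sketch, using 3301 and 3005 as the examples):

# 3301 should match a row; 3005 comes back empty if it is missing from the list
print(offense_codes[ offense_codes['CODE'] == 3301 ])
print(offense_codes[ offense_codes['CODE'] == 3005 ])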

# Just another way to accomplish the same thing but in a more complicated way.

df20.groupby('OFFENSE_CODE')['OFFENSE_CODE'].count().reset_index(name='count').sort_values(['count'], ascending = False) 
OFFENSE_CODE count
109 3301 6234
95 3115 5494
23 801 3908
82 3005 3227
129 3831 2700
... ... ...
86 3016 2
26 990 1
106 3203 1
76 2672 1
65 2628 1

130 rows × 2 columns

Not bad. I suppose that with fixed headers forced and some other aesthetic tuning of the table rendering, this could be a pretty efficient solution for adding python notebooks. I am really coming to love distill!!
