From an Attican Hollow: Testing ipynb to markdown in distill

Ani Ruhil

Having migrated my site from hugo Academic to {{distill}} I wanted to see if I could fold-in python rendered notebooks with ease, and what the results would look like. Suffice it to say the process has been painless so far …

create the usual Rmd for the post,
export the ipynb notebook from jupyter as a markdown file (.md), and
copy-and-paste md file’s content into the post’s Rmd file,
add some css.

I had some pretty long and wide tables in the notebook that did not fit within the page-width and so I had to use some css to create a scrollable div ([@braican's solution found here](https://stackoverflow.com/a/17451132)) and voila!

.scrollit {
    overflow:scroll;
    height:250px;
  }

Lesson 01 … Welcome to pandas

We will start with importing some libraries we need and then play with some data to understand basic python commands. What data shall we work with? Well, let us pull down some data on criminal incidences that were reported.

First we install a particular library called pandas and in the command that follows, note that pd is just the alias that pandas assumes so that we can type pd and have all the pandas commands at our disposal.

import pandas as pd 
import numpy as np

The crime incident reports data are available here and span multiple years so we may end up working only with 2019 data but for now we proceed by gathering everything.

In the command below, the key part is pd.read_csv() and inside it is the URL for the comma-separated variable file. Once the file is downloaded by pandas we are saving it in python with the name df

Note that data-sets, data-files are usually referred to as a data-frame in python and hence the alias of df.

df = pd.read_csv("https://tinyurl.com/y6hgsnkm")

Let us look at the first 5 rows of data to get a feel for the layout. The command is .head(5)

df.head(5)

	INCIDENT_NUMBER	OFFENSE_CODE	OFFENSE_CODE_GROUP	OFFENSE_DESCRIPTION	DISTRICT	REPORTING_AREA	OCCURRED_ON_DATE	YEAR	MONTH	DAY_OF_WEEK	HOUR	UCR_PART	STREET	Lat	Long	Location
0	TESTTEST2	423	NaN	ASSAULT - AGGRAVATED	External		2019-10-16 00:00:00	2019	10	Wednesday	0	NaN	RIVERVIEW DR	NaN	NaN	(0.00000000, 0.00000000)
1	S97333701	3301	NaN	VERBAL DISPUTE	C6	915	2020-07-18 14:34:00	2020	7	Saturday	14	NaN	MARY BOYLE WAY	42.330813	-71.051368	(42.33081300, -71.05136800)
2	S47513131	2647	NaN	THREATS TO DO BODILY HARM	E18	530	2020-06-24 10:15:00	2020	6	Wednesday	10	NaN	READVILLE ST	42.239491	-71.135954	(42.23949100, -71.13595400)
3	I92102201	3301	NaN	VERBAL DISPUTE	E13	583	2019-12-20 03:08:00	2019	12	Friday	3	NaN	DAY ST	42.325122	-71.107779	(42.32512200, -71.10777900)
4	I92097173	3115	NaN	INVESTIGATE PERSON	C11	355	2019-10-23 00:00:00	2019	10	Wednesday	0	NaN	GIBSON ST	42.297555	-71.059709	(42.29755500, -71.05970900)

What about the last 10 rows of the data?

df.tail(10)

	INCIDENT_NUMBER	OFFENSE_CODE	OFFENSE_CODE_GROUP	OFFENSE_DESCRIPTION	DISTRICT	REPORTING_AREA	OCCURRED_ON_DATE	YEAR	MONTH	DAY_OF_WEEK	HOUR	UCR_PART	STREET	Lat	Long	Location
515072	102095489	3115	NaN	INVESTIGATE PERSON	E18	520	2019-11-25 16:30:00	2019	11	Monday	16	NaN	HYDE PARK AVE	42.256215	-71.124019	(42.25621500, -71.12401900)
515073	102091671	2647	NaN	THREATS TO DO BODILY HARM	B3	417	2019-11-12 12:00:00	2019	11	Tuesday	12	NaN	MORA ST	42.282081	-71.073648	(42.28208100, -71.07364800)
515074	020224065	3018	NaN	SICK/INJURED/MEDICAL - POLICE	B2	282	2020-03-19 07:30:00	2020	3	Thursday	7	NaN	WASHINGTON ST	42.353272	-71.173738	(42.35327200, -71.17373800)
515075	020202856	2672	NaN	BIOLOGICAL THREATS	B2	282	2020-03-19 08:30:00	2020	3	Thursday	8	NaN	WARREN ST	42.328234	-71.083289	(42.32823400, -71.08328900)
515076	020063425	3114	NaN	INVESTIGATE PROPERTY	A7	21	2020-09-01 00:00:00	2020	9	Tuesday	0	NaN	PARIS ST	42.374426	-71.035278	(42.37442600, -71.03527800)
515077	020062356	3115	NaN	INVESTIGATE PERSON	E18	520	2020-08-28 18:39:00	2020	8	Friday	18	NaN	HYDE PARK AVE	42.256215	-71.124019	(42.25621500, -71.12401900)
515078	020054040	3501	NaN	MISSING PERSON	C11		2020-07-30 15:30:00	2020	7	Thursday	15	NaN	GIBSON ST	42.297555	-71.059709	(42.29755500, -71.05970900)
515079	020046400	1501	NaN	WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR…	B2	330	2020-07-02 01:38:00	2020	7	Thursday	1	NaN	PASADENA RD	42.305760	-71.083771	(42.30576000, -71.08377100)
515080	020038446	1501	NaN	WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR…	B2	300	2020-06-03 01:15:00	2020	6	Wednesday	1	NaN	WASHINGTON ST	42.323807	-71.089150	(42.32380700, -71.08915000)
515081	020030892	540	NaN	BURGLARY - COMMERICAL	C11	380	2020-05-03 00:00:00	2020	5	Sunday	0	NaN	GALLIVAN BLVD	42.283700	-71.047761	(42.28370000, -71.04776100)

Let us look at the contents of the data-frame …

Column Name	Description
[incident_num] varchar NOT NULL,	Internal BPD report number
[offense_code][varchar](25) NULL,	Numerical code of offense description
[Offense_Code_Group_Description][varchar](80) NULL,	Internal categorization of [offense_description]
[Offense_Description][varchar](80) NULL,	Primary descriptor of incident
[district] varchar NULL,	What district the crime was reported in
[reporting_area] varchar NULL,	RA number associated with the where the crime was reported from.
[shooting][char] (1) NULL,	Indicated a shooting took place.
[occurred_on] datetime2 NULL,	Earliest date and time the incident could have taken place
[UCR_Part] varchar NULL,	Universal Crime Reporting Part number (1,2, 3)
[street] varchar NULL,	Street name the incident took place

Offense Codes are available here

We could also look at the offense codes by reading them in as a data-frame. This is an Excel file so we will have to switch to .read_excel()

offense_codes = pd.read_excel("https://tinyurl.com/y6hgsnkm")

print(offense_codes)

     CODE                                       NAME
0     612           LARCENY PURSE SNATCH - NO FORCE 
1     613                        LARCENY SHOPLIFTING
2     615    LARCENY THEFT OF MV PARTS & ACCESSORIES
3    1731                                     INCEST
4    3111                  LICENSE PREMISE VIOLATION
..    ...                                        ...
571  1806  DRUGS - CLASS B TRAFFICKING OVER 18 GRAMS
572  1807  DRUGS - CLASS D TRAFFICKING OVER 50 GRAMS
573  1610    HUMAN TRAFFICKING - COMMERCIAL SEX ACTS
574  2010                              HOME INVASION
575  1620  HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE

[576 rows x 2 columns]

The next step would be to see how many data points we have, and what the minimum, maximum values, what is the average, etc. This can be done with .describe()

df.describe()

	OFFENSE_CODE	YEAR	MONTH	HOUR	Lat	Long
count	515082.000000	515082.000000	515082.000000	515082.000000	485909.000000	485909.000000
mean	2333.275632	2017.542933	6.634194	13.079170	42.239043	-70.949353
std	1182.489822	1.543329	3.317964	6.347259	1.891645	3.060012
min	111.000000	2015.000000	1.000000	0.000000	-1.000000	-71.203312
25%	1102.000000	2016.000000	4.000000	9.000000	42.296861	-71.097465
50%	3005.000000	2018.000000	7.000000	14.000000	42.325029	-71.077723
75%	3201.000000	2019.000000	9.000000	18.000000	42.348312	-71.062562
max	3831.000000	2020.000000	12.000000	23.000000	42.395042	0.000000

By default the command will report the values with decimals but we may not want that. Decimals can be rounded or removed altogether as shown below.

df.describe().round(2)

	OFFENSE_CODE	YEAR	MONTH	HOUR	Lat	Long
count	515082.00	515082.00	515082.00	515082.00	485909.00	485909.00
mean	2333.28	2017.54	6.63	13.08	42.24	-70.95
std	1182.49	1.54	3.32	6.35	1.89	3.06
min	111.00	2015.00	1.00	0.00	-1.00	-71.20
25%	1102.00	2016.00	4.00	9.00	42.30	-71.10
50%	3005.00	2018.00	7.00	14.00	42.33	-71.08
75%	3201.00	2019.00	9.00	18.00	42.35	-71.06
max	3831.00	2020.00	12.00	23.00	42.40	0.00

df.describe().round(0)

	OFFENSE_CODE	YEAR	MONTH	HOUR	Lat	Long
count	515082.0	515082.0	515082.0	515082.0	485909.0	485909.0
mean	2333.0	2018.0	7.0	13.0	42.0	-71.0
std	1182.0	2.0	3.0	6.0	2.0	3.0
min	111.0	2015.0	1.0	0.0	-1.0	-71.0
25%	1102.0	2016.0	4.0	9.0	42.0	-71.0
50%	3005.0	2018.0	7.0	14.0	42.0	-71.0
75%	3201.0	2019.0	9.0	18.0	42.0	-71.0
max	3831.0	2020.0	12.0	23.0	42.0	0.0

Note a few things here.

We have a total of 515082 incidents logged. But the latitude and longitude are availale for no more than 485909 incidents.

Say we want to restrict the dataframe just to 2020. How can we do that?

df20 = df[ df['YEAR'] == 2020 ]

Notice the sequence here dataframe[ dataframe['column-name'] == somevalue ] & pay attention to the double equal sign == which is a strict equality.

df20.describe()

	OFFENSE_CODE	YEAR	MONTH	HOUR	Lat	Long
count	63733.000000	63733.0	63733.000000	63733.000000	62200.000000	62200.000000
mean	2353.137323	2020.0	4.900554	12.923525	42.319872	-71.084193
std	1182.670996	0.0	2.561463	6.566899	0.032339	0.030578
min	111.000000	2020.0	1.000000	0.000000	42.181845	-71.203312
25%	1001.000000	2020.0	3.000000	9.000000	42.295353	-71.098579
50%	3005.000000	2020.0	5.000000	14.000000	42.321918	-71.078444
75%	3207.000000	2020.0	7.000000	18.000000	42.344561	-71.062000
max	3831.000000	2020.0	9.000000	23.000000	42.395041	-70.953726

At this point we might be curious to know what types of offenses are most often reported? Before we that, however, let us also see how many unique values of OFFENSE_CODE are there

df20['OFFENSE_CODE'].nunique()

df20['OFFENSE_CODE'].value_counts()

3301    6234
3115    5494
801     3908
3005    3227
3831    2700
        ... 
1603       2
3203       1
2672       1
990        1
2628       1
Name: OFFENSE_CODE, Length: 130, dtype: int64

So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from their list so we have no idea what it is!! That is a crime in itself.

# Just another way to accomplish the same thing but in a more complicated way.

df20.groupby('OFFENSE_CODE')['OFFENSE_CODE'].count().reset_index(name='count').sort_values(['count'], ascending = False)

	OFFENSE_CODE	count
109	3301	6234
95	3115	5494
23	801	3908
82	3005	3227
129	3831	2700
…	…	…
86	3016	2
26	990	1
106	3203	1
76	2672	1
65	2628	1

130 rows × 2 columns

Not bad. I suppose with forcing fixed headers and some other aesthetic tuning of the table rendering this could be a pretty efficient solution to add python notebooks. I am really coming to love distill!!

Testing ipynb to markdown in distill

Lesson 01 … Welcome to pandas

Reuse

Citation