Having migrated my site from hugo Academic to {{distill}} I wanted to see if I could fold-in python rendered notebooks with ease, and what the results would look like. Suffice it to say the process has been painless so far … create the usual Rmd for the post >>> export the ipynb notebook from jupyter as a markdown file (.md) >>> copy-and-paste md file’s content into the post’s Rmd file.
Having migrated my site from hugo Academic to {{distill}} I wanted to see if I could fold-in python rendered notebooks with ease, and what the results would look like. Suffice it to say the process has been painless so far …
css
.I had some pretty long and wide tables in the notebook that did not fit within the page-width and so I had to use some css to create a scrollable div
([@braican's solution found here](https://stackoverflow.com/a/17451132)) and voila!
.scrollit {
overflow:scroll;
height:250px;
}
We will start with importing some libraries we need and then play with some data to understand basic python commands. What data shall we work with? Well, let us pull down some data on criminal incidences that were reported.
First we install a particular library called pandas
and in the command that follows, note that pd
is just the alias that pandas assumes so that we can type pd
and have all the pandas
commands at our disposal.
import pandas as pd
import numpy as np
The crime incident reports data are available here and span multiple years so we may end up working only with 2019 data but for now we proceed by gathering everything.
In the command below, the key part is pd.read_csv()
and inside it is the URL for the comma-separated variable file. Once the file is downloaded by pandas
we are saving it in python with the name df
Note that data-sets, data-files are usually referred to as a data-frame
in python and hence the alias of df
.
= pd.read_csv("https://tinyurl.com/y6hgsnkm") df
Let us look at the first 5 rows of data to get a feel for the layout. The command is .head(5)
5) df.head(
INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TESTTEST2 | 423 | NaN | ASSAULT - AGGRAVATED | External | 0 | 2019-10-16 00:00:00 | 2019 | 10 | Wednesday | 0 | NaN | RIVERVIEW DR | NaN | NaN | (0.00000000, 0.00000000) | |
1 | S97333701 | 3301 | NaN | VERBAL DISPUTE | C6 | 915 | 0 | 2020-07-18 14:34:00 | 2020 | 7 | Saturday | 14 | NaN | MARY BOYLE WAY | 42.330813 | -71.051368 | (42.33081300, -71.05136800) |
2 | S47513131 | 2647 | NaN | THREATS TO DO BODILY HARM | E18 | 530 | 0 | 2020-06-24 10:15:00 | 2020 | 6 | Wednesday | 10 | NaN | READVILLE ST | 42.239491 | -71.135954 | (42.23949100, -71.13595400) |
3 | I92102201 | 3301 | NaN | VERBAL DISPUTE | E13 | 583 | 0 | 2019-12-20 03:08:00 | 2019 | 12 | Friday | 3 | NaN | DAY ST | 42.325122 | -71.107779 | (42.32512200, -71.10777900) |
4 | I92097173 | 3115 | NaN | INVESTIGATE PERSON | C11 | 355 | 0 | 2019-10-23 00:00:00 | 2019 | 10 | Wednesday | 0 | NaN | GIBSON ST | 42.297555 | -71.059709 | (42.29755500, -71.05970900) |
What about the last 10 rows of the data?
10) df.tail(
INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
515072 | 102095489 | 3115 | NaN | INVESTIGATE PERSON | E18 | 520 | 0 | 2019-11-25 16:30:00 | 2019 | 11 | Monday | 16 | NaN | HYDE PARK AVE | 42.256215 | -71.124019 | (42.25621500, -71.12401900) |
515073 | 102091671 | 2647 | NaN | THREATS TO DO BODILY HARM | B3 | 417 | 0 | 2019-11-12 12:00:00 | 2019 | 11 | Tuesday | 12 | NaN | MORA ST | 42.282081 | -71.073648 | (42.28208100, -71.07364800) |
515074 | 020224065 | 3018 | NaN | SICK/INJURED/MEDICAL - POLICE | B2 | 282 | 0 | 2020-03-19 07:30:00 | 2020 | 3 | Thursday | 7 | NaN | WASHINGTON ST | 42.353272 | -71.173738 | (42.35327200, -71.17373800) |
515075 | 020202856 | 2672 | NaN | BIOLOGICAL THREATS | B2 | 282 | 0 | 2020-03-19 08:30:00 | 2020 | 3 | Thursday | 8 | NaN | WARREN ST | 42.328234 | -71.083289 | (42.32823400, -71.08328900) |
515076 | 020063425 | 3114 | NaN | INVESTIGATE PROPERTY | A7 | 21 | 0 | 2020-09-01 00:00:00 | 2020 | 9 | Tuesday | 0 | NaN | PARIS ST | 42.374426 | -71.035278 | (42.37442600, -71.03527800) |
515077 | 020062356 | 3115 | NaN | INVESTIGATE PERSON | E18 | 520 | 0 | 2020-08-28 18:39:00 | 2020 | 8 | Friday | 18 | NaN | HYDE PARK AVE | 42.256215 | -71.124019 | (42.25621500, -71.12401900) |
515078 | 020054040 | 3501 | NaN | MISSING PERSON | C11 | 0 | 2020-07-30 15:30:00 | 2020 | 7 | Thursday | 15 | NaN | GIBSON ST | 42.297555 | -71.059709 | (42.29755500, -71.05970900) | |
515079 | 020046400 | 1501 | NaN | WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR… | B2 | 330 | 0 | 2020-07-02 01:38:00 | 2020 | 7 | Thursday | 1 | NaN | PASADENA RD | 42.305760 | -71.083771 | (42.30576000, -71.08377100) |
515080 | 020038446 | 1501 | NaN | WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TR… | B2 | 300 | 0 | 2020-06-03 01:15:00 | 2020 | 6 | Wednesday | 1 | NaN | WASHINGTON ST | 42.323807 | -71.089150 | (42.32380700, -71.08915000) |
515081 | 020030892 | 540 | NaN | BURGLARY - COMMERICAL | C11 | 380 | 0 | 2020-05-03 00:00:00 | 2020 | 5 | Sunday | 0 | NaN | GALLIVAN BLVD | 42.283700 | -71.047761 | (42.28370000, -71.04776100) |
Let us look at the contents of the data-frame …
Column Name | Description |
---|---|
[incident_num] varchar NOT NULL, | Internal BPD report number |
[offense_code][varchar](25) NULL, | Numerical code of offense description |
[Offense_Code_Group_Description][varchar](80) NULL, | Internal categorization of [offense_description] |
[Offense_Description][varchar](80) NULL, | Primary descriptor of incident |
[district] varchar NULL, | What district the crime was reported in |
[reporting_area] varchar NULL, | RA number associated with the where the crime was reported from. |
[shooting][char] (1) NULL, | Indicated a shooting took place. |
[occurred_on] datetime2 NULL, | Earliest date and time the incident could have taken place |
[UCR_Part] varchar NULL, | Universal Crime Reporting Part number (1,2, 3) |
[street] varchar NULL, | Street name the incident took place |
Offense Codes are available here
We could also look at the offense codes by reading them in as a data-frame. This is an Excel file so we will have to switch to .read_excel()
= pd.read_excel("https://tinyurl.com/y6hgsnkm") offense_codes
print(offense_codes)
CODE NAME
0 612 LARCENY PURSE SNATCH - NO FORCE
1 613 LARCENY SHOPLIFTING
2 615 LARCENY THEFT OF MV PARTS & ACCESSORIES
3 1731 INCEST
4 3111 LICENSE PREMISE VIOLATION
.. ... ...
571 1806 DRUGS - CLASS B TRAFFICKING OVER 18 GRAMS
572 1807 DRUGS - CLASS D TRAFFICKING OVER 50 GRAMS
573 1610 HUMAN TRAFFICKING - COMMERCIAL SEX ACTS
574 2010 HOME INVASION
575 1620 HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE
[576 rows x 2 columns]
The next step would be to see how many data points we have, and what the minimum, maximum values, what is the average, etc. This can be done with .describe()
df.describe()
OFFENSE_CODE | YEAR | MONTH | HOUR | Lat | Long | |
---|---|---|---|---|---|---|
count | 515082.000000 | 515082.000000 | 515082.000000 | 515082.000000 | 485909.000000 | 485909.000000 |
mean | 2333.275632 | 2017.542933 | 6.634194 | 13.079170 | 42.239043 | -70.949353 |
std | 1182.489822 | 1.543329 | 3.317964 | 6.347259 | 1.891645 | 3.060012 |
min | 111.000000 | 2015.000000 | 1.000000 | 0.000000 | -1.000000 | -71.203312 |
25% | 1102.000000 | 2016.000000 | 4.000000 | 9.000000 | 42.296861 | -71.097465 |
50% | 3005.000000 | 2018.000000 | 7.000000 | 14.000000 | 42.325029 | -71.077723 |
75% | 3201.000000 | 2019.000000 | 9.000000 | 18.000000 | 42.348312 | -71.062562 |
max | 3831.000000 | 2020.000000 | 12.000000 | 23.000000 | 42.395042 | 0.000000 |
By default the command will report the values with decimals but we may not want that. Decimals can be rounded
or removed altogether as shown below.
round(2) df.describe().
OFFENSE_CODE | YEAR | MONTH | HOUR | Lat | Long | |
---|---|---|---|---|---|---|
count | 515082.00 | 515082.00 | 515082.00 | 515082.00 | 485909.00 | 485909.00 |
mean | 2333.28 | 2017.54 | 6.63 | 13.08 | 42.24 | -70.95 |
std | 1182.49 | 1.54 | 3.32 | 6.35 | 1.89 | 3.06 |
min | 111.00 | 2015.00 | 1.00 | 0.00 | -1.00 | -71.20 |
25% | 1102.00 | 2016.00 | 4.00 | 9.00 | 42.30 | -71.10 |
50% | 3005.00 | 2018.00 | 7.00 | 14.00 | 42.33 | -71.08 |
75% | 3201.00 | 2019.00 | 9.00 | 18.00 | 42.35 | -71.06 |
max | 3831.00 | 2020.00 | 12.00 | 23.00 | 42.40 | 0.00 |
round(0) df.describe().
OFFENSE_CODE | YEAR | MONTH | HOUR | Lat | Long | |
---|---|---|---|---|---|---|
count | 515082.0 | 515082.0 | 515082.0 | 515082.0 | 485909.0 | 485909.0 |
mean | 2333.0 | 2018.0 | 7.0 | 13.0 | 42.0 | -71.0 |
std | 1182.0 | 2.0 | 3.0 | 6.0 | 2.0 | 3.0 |
min | 111.0 | 2015.0 | 1.0 | 0.0 | -1.0 | -71.0 |
25% | 1102.0 | 2016.0 | 4.0 | 9.0 | 42.0 | -71.0 |
50% | 3005.0 | 2018.0 | 7.0 | 14.0 | 42.0 | -71.0 |
75% | 3201.0 | 2019.0 | 9.0 | 18.0 | 42.0 | -71.0 |
max | 3831.0 | 2020.0 | 12.0 | 23.0 | 42.0 | 0.0 |
Note a few things here.
Say we want to restrict the dataframe just to 2020. How can we do that?
= df[ df['YEAR'] == 2020 ] df20
Notice the sequence here dataframe[ dataframe['column-name'] == somevalue ]
& pay attention to the double equal sign ==
which is a strict equality.
df20.describe()
OFFENSE_CODE | YEAR | MONTH | HOUR | Lat | Long | |
---|---|---|---|---|---|---|
count | 63733.000000 | 63733.0 | 63733.000000 | 63733.000000 | 62200.000000 | 62200.000000 |
mean | 2353.137323 | 2020.0 | 4.900554 | 12.923525 | 42.319872 | -71.084193 |
std | 1182.670996 | 0.0 | 2.561463 | 6.566899 | 0.032339 | 0.030578 |
min | 111.000000 | 2020.0 | 1.000000 | 0.000000 | 42.181845 | -71.203312 |
25% | 1001.000000 | 2020.0 | 3.000000 | 9.000000 | 42.295353 | -71.098579 |
50% | 3005.000000 | 2020.0 | 5.000000 | 14.000000 | 42.321918 | -71.078444 |
75% | 3207.000000 | 2020.0 | 7.000000 | 18.000000 | 42.344561 | -71.062000 |
max | 3831.000000 | 2020.0 | 9.000000 | 23.000000 | 42.395041 | -70.953726 |
At this point we might be curious to know what types of offenses are most often reported? Before we that, however, let us also see how many unique values of OFFENSE_CODE are there
'OFFENSE_CODE'].nunique() df20[
130
'OFFENSE_CODE'].value_counts() df20[
3301 6234
3115 5494
801 3908
3005 3227
3831 2700
...
1603 2
3203 1
2672 1
990 1
2628 1
Name: OFFENSE_CODE, Length: 130, dtype: int64
So code 3301 leads with 6234 reports in 2020, followed by code 3115, then 801, then 3005, and then 3831. Code 3005 is missing from their list so we have no idea what it is!! That is a crime in itself.
# Just another way to accomplish the same thing but in a more complicated way.
'OFFENSE_CODE')['OFFENSE_CODE'].count().reset_index(name='count').sort_values(['count'], ascending = False) df20.groupby(
OFFENSE_CODE | count | |
---|---|---|
109 | 3301 | 6234 |
95 | 3115 | 5494 |
23 | 801 | 3908 |
82 | 3005 | 3227 |
129 | 3831 | 2700 |
… | … | … |
86 | 3016 | 2 |
26 | 990 | 1 |
106 | 3203 | 1 |
76 | 2672 | 1 |
65 | 2628 | 1 |
130 rows × 2 columns
Not bad. I suppose with forcing fixed headers and some other aesthetic tuning of the table rendering this could be a pretty efficient solution to add python notebooks. I am really coming to love distill
!!
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ruhil (2021, Jan. 16). From an Attican Hollow: Testing ipynb to markdown in distill. Retrieved from https://aniruhil.org/posts/2021-01-16-testing-ipynb-to-markdown-in-distill/
BibTeX citation
@misc{ruhil2021testing, author = {Ruhil, Ani}, title = {From an Attican Hollow: Testing ipynb to markdown in distill}, url = {https://aniruhil.org/posts/2021-01-16-testing-ipynb-to-markdown-in-distill/}, year = {2021} }