An Analysis of Central Park Squirrels in 2018

Tiffanie Choi

Outline

1. Introduction

2. Data Collection

3. Data Visualization

4. Data Analysis

5. Machine Learning

6. Conclusion

1. Introduction

1A. Project Overview

The objective of this project is to analyze the 2018 Central Park squirrel sightings to find common patterns in squirrel traits, to check whether squirrel characteristics correlate with their actions and their interactions with humans, and to try predictive modeling on the Central Park squirrel data.

1B. Libraries

2. Data Collection

2A. Dataset Background Information

NYC OpenData and The Squirrel Census published the 2018 Central Park Squirrel Census data (https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw). The dataset describes each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and with humans.

2B. Data Preparation: Load the Data

Use pandas to read in the CSV file (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and save it into a dataframe.
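Here is a minimal sketch of the loading step. The file name squirrel_census.csv is an assumption (it stands for the CSV downloaded from the NYC OpenData page), and the small inline sample exists only so the snippet runs on its own. Its rows are made up, and the column names are assumed to match the headers of the real file.

```python
import io

import pandas as pd


def load_squirrels(source):
    """Read the census CSV from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# Made-up rows for illustration only; the column names are assumed to
# match the headers of the downloaded CSV.
SAMPLE_CSV = """X,Y,Unique Squirrel ID,Shift,Age,Primary Fur Color
-73.9561,40.7941,37F-PM-1014-03,PM,Adult,Gray
-73.9688,40.7837,21B-AM-1019-04,AM,Juvenile,Cinnamon
-73.9742,40.7755,11B-PM-1014-08,PM,,Black
"""

df = load_squirrels(io.StringIO(SAMPLE_CSV))
# With the real download (assumed file name):
# df = load_squirrels("squirrel_census.csv")
print(df.shape)
```

Empty cells, like the age in the last sample row, are read in as NaN, which is why an "undefined" category shows up later on.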

2B. Data Preparation: View the Data

Display the first 50 rows of the dataframe to see the columns and a sample of the data.

2B. Data Preparation: Clean the Data

We can see above that there are too many columns to display all at once. Let's drop the columns we will not be using so the dataframe is easier to view later on when we create models. Use df.drop (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to remove them. Then take a random sample of 500 entries so the markers do not crowd the map.
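The cleaning step could look roughly like this. The list of unused columns and the random_state value are hypothetical choices for illustration, and the toy dataframe stands in for the full dataset.

```python
import pandas as pd

# Toy frame standing in for the full dataset; all values are made up.
df = pd.DataFrame({
    "X": [-73.95 - i * 1e-4 for i in range(1000)],
    "Y": [40.79 - i * 1e-5 for i in range(1000)],
    "Age": ["Adult", "Juvenile"] * 500,
    "Color notes": [None] * 1000,
    "Lat/Long": ["placeholder"] * 1000,
})

# Hypothetical choice of unused columns; errors="ignore" skips any name
# that is not present instead of raising a KeyError.
UNUSED_COLUMNS = ["Color notes", "Lat/Long"]
df = df.drop(columns=UNUSED_COLUMNS, errors="ignore")

# A fixed seed draws the same 500 rows on every run.
sample = df.sample(n=500, random_state=42)
print(sample.shape)
```

Fixing the seed keeps the maps reproducible, since a different sample would place different markers.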

3. Data Visualization

3A. Visualize the area on a Map

Using folium, let's create a map centered on Central Park, NY at the latitude and longitude coordinates (40.785091, -73.968285).

Here is more info on this tool: http://python-visualization.github.io/folium/

3B. Visualize Squirrel Ages on a Map

Let's see which squirrel ages are common in this area. We can use numpy to list the unique age types, then use each value in that list as a key in a dictionary that translates the age type into its marker color and icon. We can also create a dictionary for each age type.

Marker Color and Icon Key: [Adult Squirrel: Blue color with triangle icon, Baby Squirrel: Green color with circle icon, Unknown Age: Red color with a question mark icon, Undefined Age: Purple color with an x icon]

We can observe that mainly adult squirrels are represented on the map above.
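The marker key above can be sketched as a lookup dictionary plus a function that adds one folium marker per sighting. The age labels ("Adult", "Juvenile", "?"), the glyphicon icon names, the zoom level, and the column names X, Y, and Age are all assumptions; check them against your dataframe before reusing this.

```python
import pandas as pd

CENTRAL_PARK = (40.785091, -73.968285)

# Age label -> (marker color, icon name). The age labels and the
# glyphicon icon names are assumptions.
AGE_STYLE = {
    "Adult": ("blue", "triangle-top"),
    "Juvenile": ("green", "record"),
    "?": ("red", "question-sign"),
    "Undefined": ("purple", "remove"),
}


def style_for(age):
    """Return the (color, icon) pair for an age; missing ages count as 'Undefined'."""
    key = "Undefined" if pd.isna(age) else age
    return AGE_STYLE.get(key, AGE_STYLE["Undefined"])


def build_age_map(df):
    """Plot one marker per sighting; expects 'Y' (lat), 'X' (lon), and 'Age' columns."""
    import folium  # imported here so the helpers above work without folium

    park_map = folium.Map(location=list(CENTRAL_PARK), zoom_start=14)
    for lat, lon, age in zip(df["Y"], df["X"], df["Age"]):
        color, icon = style_for(age)
        folium.Marker(
            location=[lat, lon],
            icon=folium.Icon(color=color, icon=icon),
        ).add_to(park_map)
    return park_map
```

In a notebook, calling build_age_map(sample) as the last expression of a cell displays the map.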

3C. Visualize Squirrel Fur Colors on a Map

Let's see which primary fur colors appear most in this area. We can use numpy to list the unique primary fur colors, then use each value in that list as a key in a dictionary that translates the fur color into its marker color. We can also create a dictionary for each primary fur color type.

Marker Color Key: [Gray Fur Squirrel: Gray Marker, Cinnamon Fur Squirrel: Beige Marker, Black Fur Squirrel: Black Marker, Undefined Fur Color: White Marker]

We can observe that mainly gray squirrels are represented on the map above.
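One detail worth sketching is how the unique fur colors get listed when some sightings have no color recorded. Missing values are stored as NaN, and sorting a mix of strings and NaN with np.unique can raise a TypeError, so this sketch assumes we first replace NaN with an explicit "Undefined" label. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample; a missing fur color is stored as NaN.
fur = pd.Series(
    ["Gray", "Cinnamon", np.nan, "Black", "Gray"], name="Primary Fur Color"
)

# Replace NaN with an explicit label so np.unique only has strings to sort.
labels = np.unique(fur.fillna("Undefined"))

# Fur color -> folium marker color, following the key above.
MARKER_COLORS = {
    "Gray": "gray",
    "Cinnamon": "beige",
    "Black": "black",
    "Undefined": "white",
}

# Every label found in the data should have a marker color.
missing = [label for label in labels if label not in MARKER_COLORS]
print(list(labels), missing)
```

Checking for labels without a marker color catches unexpected categories before the map is drawn.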

3D. Visualize Squirrel Sighting Times on a Map

Let's see which squirrel sighting time is more common, day or night. We can use numpy to list the unique sighting times, then use each value in that list as a key in a dictionary that translates the sighting time into its marker color and icon. We can also create a dictionary for each sighting time.

Marker Color and Icon Key: [Day Time Sighting: Orange color with a cloud icon, Night Time Sighting: Cadet blue color with a star icon]

We can observe a roughly even distribution of night and day time sightings on the map above.

4. Data Analysis

4A. Are there more adults or baby squirrels in Central Park, NY?

From the map in 3B, blue markers with the triangle icon outnumber every other marker, so from observation there are mostly adult squirrels rather than baby squirrels. We can also see that adult squirrels are mainly in the middle of Central Park, while baby squirrels are on the edge of the park.

Let's plot the entire dataset to see whether our random sample was a good representation of the whole dataset. We can use seaborn to create a catplot of all squirrels' ages.
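Besides plotting, one way to check the sample against the whole dataset is to compare the share of each age group in both. This is a sketch on made-up data; the column name Age and the "Undefined" label for missing ages are assumptions.

```python
import pandas as pd


def age_shares(frame):
    """Share of sightings per age label, counting missing ages as 'Undefined'."""
    return frame["Age"].fillna("Undefined").value_counts(normalize=True)


def plot_age_counts(frame):
    """Draw a count-style catplot of ages (requires seaborn)."""
    import seaborn as sns  # imported lazily so age_shares works without seaborn

    return sns.catplot(
        data=frame.fillna({"Age": "Undefined"}), x="Age", kind="count"
    )


# Made-up data: 80% adult, 20% juvenile.
full = pd.DataFrame({"Age": ["Adult"] * 800 + ["Juvenile"] * 200})
sample = full.sample(n=100, random_state=0)

comparison = pd.DataFrame(
    {"full": age_shares(full), "sample": age_shares(sample)}
)
print(comparison.round(2))
```

If the two columns are close, the sample on the map is a fair picture of the full dataset for this variable.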

More info on seaborn: https://seaborn.pydata.org/

This plot shows that more adult squirrels were sighted than baby squirrels. This may be because humans spot larger squirrels more easily than smaller ones.

4B. What is the geographic spread of squirrel fur color in Central Park, NY?

From the map in 3C, gray markers outnumber every other marker, so from observation most squirrels have a primary fur color of gray rather than any other color. We can also see that there is no geographic pattern or correlation between squirrel fur color and the location where they were seen.

Let's plot the entire dataset to see whether our random sample was a good representation of the whole dataset. We can use seaborn to create a catplot of all squirrels' primary fur colors.

This plot shows that more gray squirrels were sighted than squirrels of any other fur color. This may be because humans are more attentive to gray squirrels than to squirrels of other fur colors.

4C. What is the most common time for squirrel sightings in Central Park, NY?

From the map in 3D, we can see an even distribution of orange and blue markers. From observation alone, it is hard to tell whether there are more day time or night time sightings. There is no obvious trend or pattern, so we should plot all sighting times to find the answer, since the sample in the map does not represent the entire dataset well.

We can use seaborn to create a catplot of all squirrel sighting times.

This plot shows that there are more night time sightings than day time sightings. This may be because more humans are out at night than during the day, which fits the city's nickname, "the city that never sleeps."

4D. Is there a correlation between a squirrel's age and their actions (and/or interactions with humans)?

Since the map can only tell us so much about single variables, we should check whether any variables are correlated. We want to find out whether any squirrel characteristics correlate with their actions or with how they interact with humans.

My hypothesis is that older squirrels will be found running more than younger squirrels, since they most likely see humans as a threat.

Another hypothesis is that younger squirrels will approach humans more than older squirrels since they are seeking food from a source.

My last hypothesis is that older squirrels will be found foraging more than younger squirrels, since they are more independent in searching for food.

Let's create a correlation matrix to see whether our hypotheses hold.

Then we can create a countplot using seaborn to visualize the results.

More info on correlation matrix: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

This correlation matrix is hard to read as a table, so let's make a heatmap using seaborn to represent the correlation matrix for easier visual analysis.

Since the correlation matrix does not show age, let's visualize it differently. We can plot categorical data with seaborn.
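A sketch of the correlation matrix and heatmap steps on made-up True/False observations. The column names are assumed to match the activity and interaction columns of the dataset, and the heatmap settings are illustrative choices.

```python
import pandas as pd

# Made-up True/False observations; column names are assumed.
behavior = pd.DataFrame({
    "Running":    [True, False, True, False, True, False],
    "Foraging":   [False, True, False, True, False, True],
    "Approaches": [False, False, True, False, False, True],
})

# Cast booleans to 0/1 so every column is numeric. Text columns would
# have to be encoded as numbers first to appear in the matrix.
corr = behavior.astype(int).corr()
print(corr.round(2))


def plot_heatmap(matrix):
    """Render the matrix as an annotated heatmap (requires seaborn)."""
    import seaborn as sns  # imported lazily so corr can be computed without it

    return sns.heatmap(matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

In this toy data Running and Foraging are exact opposites, so their correlation is -1, which makes the color scale easy to sanity-check.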

Hypothesis 1: Older squirrels will be found running more than younger squirrels because they most likely see humans as a threat.

We cannot tell from the countplot whether adult or baby squirrels tend to run away from humans more, so let's calculate the ratio for both. For each age group, we find the total number of squirrels, then the number of runners, then divide the number of runners by the total to get the ratio.

As we can see, the results are close. The share of baby squirrels seen running was about three percentage points higher than the share of adult squirrels seen running. This suggests that a larger share of baby squirrels than adult squirrels saw humans as a threat in Central Park in 2018.

Therefore, my hypothesis was incorrect.
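The ratio described above can be sketched with a groupby. The data here is made up, and the column names Age and Running are assumptions.

```python
import pandas as pd

# Made-up sightings for illustration.
df = pd.DataFrame({
    "Age":     ["Adult", "Adult", "Adult", "Adult", "Juvenile", "Juvenile"],
    "Running": [True,    False,   False,   False,   True,       False],
})

totals = df.groupby("Age")["Running"].size()   # squirrels per age group
runners = df.groupby("Age")["Running"].sum()   # squirrels seen running per group
ratios = runners / totals                      # share of each group seen running

print(ratios)
```

Because True counts as 1, df.groupby("Age")["Running"].mean() gives the same ratios in one step. Comparing ratios rather than raw counts matters here because the age groups differ so much in size.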

Hypothesis 2: Younger squirrels will approach humans more than older squirrels because they are seeking food from a source.

From the countplot, it seems that more adult squirrels approach humans than baby squirrels do. Let's calculate the ratios to check whether our observational analysis was correct.

The results show that a higher percentage of baby squirrels approach humans than adult squirrels do. This may be because they have had less time in the world, which can lead them to trust others more for help, and because they depend on humans for food.

Therefore, my hypothesis was correct.

Hypothesis 3: Older squirrels will be found foraging more than younger squirrels because they are more independent in searching for food.

From the countplot, it seems that more adult squirrels were seen foraging than baby squirrels. Let's calculate the ratios to check whether our observational analysis was correct.

The results show that a much higher percentage of adult squirrels than baby squirrels were found foraging. This can be due to many reasons, one of which is from my hypothesis that adult squirrels are more independent than baby squirrels when it comes to survival.

5. Machine Learning

5A. Training the Data

Let's create a new dataframe with the variables we want to consider for our predictive model. Since we have qualitative data, we have to transform the non-numerical labels into numerical labels with scikit-learn's LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). We also have to scale our quantitative data by fitting and transforming it with RobustScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler.fit_transform). We will be using sklearn to train and build the models.

We can now separate the data into features and target values.

We can now split our data into two parts: training (70%) and testing (30%).
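A sketch of the encoding and scaling steps on a few made-up rows. The choice of columns (X, Y, Age, Primary Fur Color) is an assumption about which variables go into the model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Made-up rows; column names are assumed to match the dataset.
df = pd.DataFrame({
    "X": [-73.956, -73.969, -73.974, -73.960],
    "Y": [40.794, 40.784, 40.776, 40.790],
    "Age": ["Adult", "Juvenile", "Adult", "Adult"],
    "Primary Fur Color": ["Gray", "Cinnamon", "Gray", "Black"],
})

model_df = pd.DataFrame(index=df.index)

# One encoder per categorical column, kept so labels can be decoded later.
encoders = {}
for column in ["Age", "Primary Fur Color"]:
    encoders[column] = LabelEncoder()
    model_df[column] = encoders[column].fit_transform(df[column])

# RobustScaler centers each column on its median and scales by its
# interquartile range, so outliers have less influence.
scaled = RobustScaler().fit_transform(df[["X", "Y"]])
model_df["X"] = scaled[:, 0]
model_df["Y"] = scaled[:, 1]

print(model_df)
print(list(encoders["Primary Fur Color"].classes_))
```

LabelEncoder numbers the classes in sorted order, so the integers carry no meaning beyond identifying a category. Keeping the fitted encoders around lets us turn predictions back into color names with inverse_transform.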
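The 70/30 split could look like this with scikit-learn's train_test_split. The data is random and made up, and the stratify and random_state arguments are optional choices added here for illustration, not something the split requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up encoded data: 100 rows, 3 features, 3 fur-color classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
target = rng.integers(0, 3, size=100)

# test_size=0.3 gives the 70/30 split. stratify keeps the class shares
# similar in both parts, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42, stratify=target
)

print(X_train.shape, X_test.shape)
```

Stratifying can help when one class dominates, as gray does among the fur colors, because a purely random split may leave a rare class almost absent from the test set.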

We will need these methods below to calculate the model accuracy and root mean squared error.
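The two helper functions might be written as follows. The function names are hypothetical, and the tiny label lists are made up to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error


def model_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return accuracy_score(y_true, y_pred)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between the labels."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 0]
print(model_accuracy(y_true, y_pred))  # 3 of 4 predictions are correct
print(rmse(y_true, y_pred))            # sqrt((2 - 0)**2 / 4)
```

For encoded class labels, the root mean squared error depends on the arbitrary order in which the encoder numbered the categories, so accuracy is the easier of the two to interpret for a classifier.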

5B. Building the Model: K-Nearest Neighbors Classifier

We will be using a classifier since we are predicting primary fur color, which is a categorical variable, not a numerical one.

Based on the results, the model did well on both the training and test data. However, the training accuracy was 4% higher than the test accuracy, which suggests that we overfitted the data.
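A sketch of fitting and scoring a K-Nearest Neighbors classifier. The two-cluster data is made up and much easier to separate than real fur colors, and n_neighbors=5 is only scikit-learn's default, not a tuned value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: two well-separated clusters standing in for two fur colors.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
target = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)

# n_neighbors=5 is scikit-learn's default, used here as a placeholder.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Scoring on both parts is what lets us compare training and test accuracy to look for overfitting.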

5B. Building the Model: Decision Tree Classifier

Based on the results, the test accuracy was better than the previous model's; however, the training accuracy was one percentage point lower. There is not a big difference in performance between the training and test data, so the model is not overfitting. If anything it is underfitting, since both scores leave room for improvement. With better hyperparameters or different models, we can improve these accuracy scores. Overall, the decision tree classifier was a better model than the K-Nearest Neighbors classifier.
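To illustrate how a hyperparameter changes the gap between training and test accuracy, here is a sketch that fits decision trees of different depths. The data is random and made up, and the depth values are arbitrary choices, so the numbers say nothing about the squirrel dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
features = rng.normal(size=(300, 4))
# The label depends on the first feature plus noise, so no model is perfect.
target = (features[:, 0] + rng.normal(scale=0.8, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A training score far above the test score points to overfitting, while low scores on both point to underfitting. Trying a few depths like this is a simple first step before a fuller hyperparameter search.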

6. Conclusion

6A. Key Takeaways

Thank you for joining me for this tutorial! We learned how to load, view, and clean a dataset. We learned how to visualize a dataset on a map using folium with different icons and markers. We were then able to analyze the maps and check our results by plotting the full dataset, not just a sample, with seaborn. We plotted squirrel ages, fur colors, and sighting times! We were able to test our hypotheses about squirrel ages, fur color spread, sighting times, and whether a squirrel's age determines its actions, using computations and visual analysis. Lastly, we prepared categorical data and built a K-Nearest Neighbors Classifier and a Decision Tree Classifier to predict primary fur colors. Both models had an overall accuracy of about 80%, which is an average performance. We saw that the decision tree classifier was the better model overall for our data.

6B. Future Questions