Airbnb® is an American company operating an online marketplace for lodging, primarily vacation rentals.
The purpose of our study is to perform an exploratory data analysis of two datasets containing Airbnb® listings across 10 major cities.
We aim to use various data visualizations to gain valuable insight into pricing, the effects of COVID-19, and more!
# Importing Libraries
import pandas as pd
import numpy as np
from datetime import date
import geopandas as gpd
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
from warnings import filterwarnings
import csv
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import ipywidgets as widget
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import ngrams
from collections import Counter
from collections import OrderedDict
import re
filterwarnings("ignore")
With this guiding question, we will explore the relationship between chosen metrics and Airbnb® listing success. The specific metrics we will look at are location, price, review score, and the host's Superhost status. This is a very important question to ask, as high listing success is valuable to both hosts and Airbnb®. We will gain powerful insights into what people value when choosing an Airbnb®.
This question has evolved a little since our proposal. Initially, we were also looking into seasonal booking trends relative to COVID-19. However, we removed this from our analysis as we did not have enough COVID-19 data to study seasonal trends.
With this guiding question, we are going to explore how certain Airbnb® metrics have changed since COVID-19. This is a valuable insight as it shows exactly what areas of Airbnb® were impacted by COVID-19, so they can be targeted in attempting to return to normal levels.
With this guiding question, we will explore how Airbnb® trends vary by city. We are also including the average temperatures for each of the cities in our analysis, as we suspect this plays the largest role in booking trends.
This guiding question has also evolved since our proposal. Rather than studying what makes a host a "good host", we will look into which metrics vary between regular hosts and Airbnb®-defined Superhosts. Specifically, we compare average review scores, amenities, and response rates between regular hosts and Superhosts. This will provide powerful insight to hosts on what they should focus on, as well as to Airbnb® on what its Superhosts focus on most.
We are working with two datasets sourced from Kaggle (Bhat, 2021) and originally pulled from Airbnb's API (Airbnb, 2022). The data is available under a CC0: Public Domain license.
The first dataset contains information on over 250,000 listings in 10 major cities. It contains specific information on both the host and the listing, as well as review scores. The full unabbreviated set of columns can be seen in the head of Table 1 above.
Our second dataset contains the dates on which reviews came in for listings. This proved to be a useful dataset, as we realized we could use it to quantify the number of reviews a listing received. This was critical in our analyses.
airbnb_raw = pd.read_csv('https://media.githubusercontent.com/media/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/Raw%20Dataset%20from%20kaggle/Listings.csv', encoding='iso-8859-1')
airbnb_review_raw = pd.read_csv('https://media.githubusercontent.com/media/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/Raw%20Dataset%20from%20kaggle/Reviews.csv')
display(airbnb_raw.head())
display(airbnb_review_raw.head())
listing_id | name | host_id | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | ... | minimum_nights | maximum_nights | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | instant_bookable | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 281420 | Beautiful Flat in le Village Montmartre, Paris | 1466919 | 2011-12-03 | Paris, Ile-de-France, France | NaN | NaN | NaN | f | 1.0 | ... | 2 | 1125 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | f |
1 | 3705183 | 39 m² Paris (Sacré Cœur) | 10328771 | 2013-11-29 | Paris, Ile-de-France, France | NaN | NaN | NaN | f | 1.0 | ... | 2 | 1125 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | f
2 | 4082273 | Lovely apartment with Terrace, 60m2 | 19252768 | 2014-07-31 | Paris, Ile-de-France, France | NaN | NaN | NaN | f | 1.0 | ... | 2 | 1125 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | f |
3 | 4797344 | Cosy studio (close to Eiffel tower) | 10668311 | 2013-12-17 | Paris, Ile-de-France, France | NaN | NaN | NaN | f | 1.0 | ... | 2 | 1125 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | f |
4 | 4823489 | Close to Eiffel Tower - Beautiful flat : 2 rooms | 24837558 | 2014-12-14 | Paris, Ile-de-France, France | NaN | NaN | NaN | f | 1.0 | ... | 2 | 1125 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | f |
5 rows × 33 columns
listing_id | review_id | date | reviewer_id | |
---|---|---|---|---|
0 | 11798 | 330265172 | 2018-09-30 | 11863072 |
1 | 15383 | 330103585 | 2018-09-30 | 39147453 |
2 | 16455 | 329985788 | 2018-09-30 | 1125378 |
3 | 17919 | 330016899 | 2018-09-30 | 172717984 |
4 | 26827 | 329995638 | 2018-09-30 | 17542859 |
We first assigned proper data types to our columns. Many of the columns containing strings were typed as generic objects; although strings are stored as objects in pandas, for good measure we explicitly converted them to strings. All of the columns containing boolean values had those values listed as 't' or 'f', so in addition to converting these columns from objects to booleans, we mapped the values to the Python booleans True and False.
We also had to deal with the date values in the host_since column. We created a new column that contained the host_since values in DateTime format, as well as a column that only contained the host_since years in DateTime format. This was a preliminary wrangling step that proved to be useful in many parts of our analysis, as you will soon see.
# Correcting datatype for string columns (note: Pandas stores strings as objects)
airbnb_raw[['name', 'host_location', 'host_response_time',
'neighbourhood', 'district', 'city',
'property_type', 'room_type']] = airbnb_raw[['name','host_location',
'host_response_time','neighbourhood','district',
'city','property_type','room_type']].astype('str')
# Correcting labelling and datatype for boolean columns
airbnb_raw['host_is_superhost'] = airbnb_raw['host_is_superhost'].map({'t':True,'f':False}).astype(bool)
airbnb_raw['host_has_profile_pic'] = airbnb_raw['host_has_profile_pic'].map({'t':True,'f':False}).astype(bool)
airbnb_raw['host_identity_verified'] = airbnb_raw['host_identity_verified'].map({'t':True,'f':False}).astype(bool)
airbnb_raw['instant_bookable'] = airbnb_raw['instant_bookable'].map({'t':True,'f':False}).astype(bool)
# Creating 2 columns, one with host_since in DateTime format, and one with the year values of host_since in DateTime format
airbnb_raw['host_since_dt'] = pd.to_datetime(airbnb_raw['host_since'])
airbnb_raw['host_since_dt_year'] = airbnb_raw['host_since_dt'].apply(lambda x: str(x.year))
We then turned to our second reviews dataset (read in above), the head of which is shown below:
airbnb_review_raw.head()
listing_id | review_id | date | reviewer_id | |
---|---|---|---|---|
0 | 11798 | 330265172 | 2018-09-30 | 11863072 |
1 | 15383 | 330103585 | 2018-09-30 | 39147453 |
2 | 16455 | 329985788 | 2018-09-30 | 1125378 |
3 | 17919 | 330016899 | 2018-09-30 | 172717984 |
4 | 26827 | 329995638 | 2018-09-30 | 17542859 |
With only four columns, we assumed there would not be much cleaning and wrangling to do with this dataset. However, we underestimated how important it would be to our analysis, and the different ways in which we had to wrangle the data to merge the datasets proved to be a welcome challenge.
We created a new column containing the dates in DateTime, as well as a column containing the days since the last review. Creating this second column was an interesting puzzle. First, we took our newly created review_date_dt column, sorted it, and grouped the table by listing IDs. We then used the pandas function "shift" to create a new column, which contains these sorted DateTime values shifted one index down (meaning the top row in this new column contains a null value).
Now we have a column containing the sorted review dates, as well as a column containing the sorted review dates shifted by one index, with the table grouped by listing. To get the days since the last review, all we need to do is subtract one column from the other and store the result as a number of days in a new column.
airbnb_review_raw['review_date_dt'] = pd.to_datetime(airbnb_review_raw['date'])
# Days between consecutive reviews: sort by date, shift within each listing
# group, then take the absolute day difference
airbnb_review_raw['days_since_last_review'] = (
    airbnb_review_raw.sort_values('review_date_dt')
    .groupby('listing_id')['review_date_dt'].shift()
    - airbnb_review_raw['review_date_dt']
).dt.days.abs()
Our next step was binning the number of reviews for each listing into months. We did this by first creating a year_month column that contained each month in our dataset. This column contained months in the format YYYY_MM. We then created a separate dataframe that was grouped by listings and year_month bins. This new dataset contained counts of how many reviews took place in each year_month bin. We unstacked it to allow the data to display each month as a column. We now have a dataset that has listings with YEAR-MONTH bins, with each bin containing the number of reviews for that listing in that month.
At this stage we also filled our NA values with 0. This was especially important for our reviews dataset, as our analysis depended on knowing the number of reviews a listing received in a given month, even if that number was zero.
The reason we wrangled the data this way will become clear in our analyses, as these monthly bins provide an important metric in many of our visualizations (see the note below on using review counts as a proxy for bookings).
Note: The columns are abbreviated by ellipses, as we are analyzing over 147 separate months, each with its own column!
airbnb_review_raw['year_month'] = airbnb_review_raw['review_date_dt'].dt.strftime('%Y_%m')
airbnb_year_month = airbnb_review_raw.groupby(['listing_id','year_month'])['review_id'].count()
airbnb_year_month = airbnb_year_month.unstack(level=-1).fillna(0)
airbnb_year_month.head()
year_month | 2008_11 | 2009_01 | 2009_02 | 2009_04 | 2009_05 | 2009_06 | 2009_07 | 2009_08 | 2009_09 | 2009_10 | ... | 2020_06 | 2020_07 | 2020_08 | 2020_09 | 2020_10 | 2020_11 | 2020_12 | 2021_01 | 2021_02 | 2021_03 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
listing_id | |||||||||||||||||||||
2577 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2595 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2737 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2903 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 4.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3079 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 147 columns
Since our original reviews dataset only had dates for reviews that actually occurred, any months with no reviews at all were never created as columns in our airbnb_year_month dataframe (see the table above). Our data still needed to reflect that there were no reviews in these months, so we inserted zero-filled columns into the dataframe for each month with no data.
# Months with no reviews anywhere never became columns, so we reindex against
# the full 2008_01 to 2021_12 monthly range; the missing months (both those in
# the middle and those at either end) are filled with 0
all_months = pd.date_range('2008-01-01', '2021-12-31', freq='M').strftime('%Y_%m')
airbnb_year_month = airbnb_year_month.reindex(columns=all_months, fill_value=0)
Next, we created a separate data frame containing the minimum and maximum review dates and the counts of distinct review IDs and distinct reviewer IDs. We also added a column for the average days between reviews, thinking ahead to the analyses we would perform. We called this dataframe airbnb_review_info.
# Using the .agg function, we were able to add all the columns in one block of code
airbnb_review_info = airbnb_review_raw.groupby(['listing_id']).agg(
review_date_min=('review_date_dt', np.min),
review_date_max=('review_date_dt', np.max),
review_id_distinct_count=('review_id',lambda x: x.nunique()),
reviewer_id_distinct_count=('reviewer_id',lambda x: x.nunique()),
avg_no_of_day_btw_review=('days_since_last_review', np.nanmean), #ignore nan in calculating avg
).reset_index()
The penultimate step was merging the three data frames:
We merged them into a dataframe called airbnb_raw_plus_review one at a time, joining on listing ID with left joins.
After some final cleanup, we had our clean dataset for the analysis of our guiding questions. The head of the resulting data frame is displayed below (columns abbreviated).
airbnb_raw_plus_review = airbnb_raw.merge(airbnb_review_info, on='listing_id', how='left')
airbnb_raw_plus_review = airbnb_raw_plus_review.merge(airbnb_year_month, on='listing_id', how='left')
airbnb_raw_plus_review = airbnb_raw_plus_review.replace('nan', 'No Data') # Some columns had nan as a string
airbnb_raw_plus_review.head()
listing_id | name | host_id | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | ... | 2021_03 | 2021_04 | 2021_05 | 2021_06 | 2021_07 | 2021_08 | 2021_09 | 2021_10 | 2021_11 | 2021_12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 281420 | Beautiful Flat in le Village Montmartre, Paris | 1466919 | 2011-12-03 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 3705183 | 39 m² Paris (Sacré Cœur) | 10328771 | 2013-11-29 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
2 | 4082273 | Lovely apartment with Terrace, 60m2 | 19252768 | 2014-07-31 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 4797344 | Cosy studio (close to Eiffel tower) | 10668311 | 2013-12-17 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 4823489 | Close to Eiffel Tower - Beautiful flat : 2 rooms | 24837558 | 2014-12-14 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 208 columns
Finally, all that was left was to convert all prices to a common currency, in this case USD. We created a table of currency exchange rates to USD and divided each price by its respective exchange rate. The clean dataset is available in the DataSet folder as 'airbnb_raw_plus_review' (github).
exchange_table = [['Bangkok','THB', 37.89], ['Cape Town', 'ZAR', 17.92], ['Hong Kong', 'HKD',7.85], ['Istanbul', 'TRY',18.41],
['Mexico City', 'MXN',20.17], ['New York', 'USD',1], ['Paris', 'EUR',1.04], ['Rio de Janeiro', 'BRL',5.39],
['Rome', 'EUR',1.04], ['Sydney', 'AUD',1.54]]
exchange_table_df = pd.DataFrame(exchange_table, columns=['city', 'currency','currency_rate'])
airbnb_raw_plus_review = airbnb_raw_plus_review.merge(exchange_table_df, on='city', how='left')
airbnb_raw_plus_review['price_USD']=airbnb_raw_plus_review['price']/airbnb_raw_plus_review['currency_rate']
#airbnb_raw_plus_review.to_csv('airbnb_raw_plus_review.csv') #code to produce clean/merged csv for further use.
airbnb_raw_plus_review.head()
listing_id | name | host_id | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | ... | 2021_06 | 2021_07 | 2021_08 | 2021_09 | 2021_10 | 2021_11 | 2021_12 | currency | currency_rate | price_USD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 281420 | Beautiful Flat in le Village Montmartre, Paris | 1466919 | 2011-12-03 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | EUR | 1.04 | 50.961538 |
1 | 3705183 | 39 m² Paris (Sacré Cœur) | 10328771 | 2013-11-29 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | EUR | 1.04 | 115.384615
2 | 4082273 | Lovely apartment with Terrace, 60m2 | 19252768 | 2014-07-31 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | EUR | 1.04 | 85.576923 |
3 | 4797344 | Cosy studio (close to Eiffel tower) | 10668311 | 2013-12-17 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | EUR | 1.04 | 55.769231 |
4 | 4823489 | Close to Eiffel Tower - Beautiful flat : 2 rooms | 24837558 | 2014-12-14 | Paris, Ile-de-France, France | No Data | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | EUR | 1.04 | 57.692308 |
5 rows × 211 columns
A note about the data:
Our data did not contain information on how many bookings a listing received. Many of our analyses look into listing success, so we decided to quantify listing success as the number of reviews a listing has received. We discussed this at length and decided it was acceptable for the following reasons:
No analysis looks at the absolute number of bookings a listing received; we only ever compare listing success between listings.
The average number of bookings per review is most likely similar across Airbnb® (with some variation), so the two should correlate well for comparisons (a toy illustration follows below).
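To illustrate this assumption with a toy simulation (synthetic numbers, not our data): if each listing's bookings are a noisy constant multiple of its review count, ranking listings by reviews tracks ranking by bookings almost perfectly.
# Toy check of the booking-proxy assumption: bookings ~ k * reviews plus noise
rng = np.random.default_rng(0)
sim_reviews = pd.Series(rng.poisson(30, size=1000))
sim_bookings = (sim_reviews * rng.normal(1.6, 0.1, size=1000)).round()
# A rank correlation near 1 means comparing listings by reviews is equivalent to comparing by bookings
print(sim_bookings.corr(sim_reviews, method='spearman'))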
In looking at which metrics play a role in Airbnb® listing success, the first one that jumped out at us was location. (Our September 2022 survey of the Data 601 class revealed a lot of insights; the pie chart below shows one result from that survey, which sampled what people think.) It makes sense that certain vacation hubs would have more bookings than other places, and previous studies have found a strong correlation between listing revenue and location (Deboosere, Kerrigan, Wachsmuth and El-Geneidy, 2019). We compare districts within each city using the geopandas library. All GeoJSON files are available in the "GEOJSON Files" folder (github).
We read in the appropriate GeoJSON file and wrangled our data to align with it. The result is displayed below:
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/communes-75-paris.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf.set_index('nom', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Paris
paris_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Paris')].reset_index()
#combine the long and lat as coordinates
paris_df_map['coordinates'] = list(zip(paris_df_map["longitude"], paris_df_map["latitude"]))
paris_df_map['coordinates'] = paris_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(paris_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#count the number of reviews in each district via point-in-polygon containment
coordinates = paris_df_map['coordinates'].tolist()
count_of_review = paris_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['code'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]):
            if not np.isnan(count_of_review[i]):
                output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
output_dict_paris = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_paris.items(), columns=['code', 'review_count'])
cendf_2['nom'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='code', how='left')
cendf_3.set_index('nom', inplace=True)
# Choropleth map of Paris review counts via px.choropleth_mapbox()
cendf_4 = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4, geojson=cendf_4,
locations=cendf_4.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": 48.8566, "lon": 2.3522}, # Paris
mapbox_style="carto-positron",
opacity=0.75,
zoom=10,
title = 'Paris Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
The Eiffel Tower is in the 7th arrondissement, the Louvre is in the 1st, and the Arc de Triomphe sits where the 8th, 16th, and 17th meet. All of them are in the central part of Paris. The choropleth shows that a higher proportion of visitors stayed in the north and southwest of Paris.
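As an aside, the nested containment loop above could be replaced with a geopandas spatial join, which is typically much faster. This is only a sketch, assuming both layers share EPSG:4326 and a geopandas version that supports the predicate argument (0.10+):
# Equivalent district-level review counts via a spatial join (sketch)
joined = gpd.sjoin(locsdf, cendf_2[['code', 'geometry']], predicate='within')
review_by_district = joined.groupby('code')['review_id_distinct_count'].sum()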
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/rome-rioni_.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf.set_index('name', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Rome
rome_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Rome')].reset_index()
#combine the long and lat as coordinates
rome_df_map['coordinates'] = list(zip(rome_df_map["longitude"], rome_df_map["latitude"]))
rome_df_map['coordinates'] = rome_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(rome_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#count the number of reviews in each district via point-in-polygon containment
coordinates = rome_df_map['coordinates'].tolist()
count_of_review = rome_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['cartodb_id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]):
            if not np.isnan(count_of_review[i]):
                output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
output_dict_rome = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_rome.items(), columns=['cartodb_id', 'review_count'])
cendf_2['name'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='cartodb_id', how='left')
cendf_3.set_index('name', inplace=True)
# Choropleth map of Rome review counts via px.choropleth_mapbox()
cendf_4 = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4, geojson=cendf_4,
locations=cendf_4.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": 41.90, "lon": 12.48}, # Rome
mapbox_style="carto-positron",
opacity=0.75,
zoom=12,
title = 'Rome Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
The Colosseum (Ripa), the Pantheon (Pigna), and the Trevi Fountain (Trevi) are all located in the central area of Rome. A high proportion of visitors stayed in the east and the west. Vatican City is located to the west; it is not highlighted in the map above, as it is not part of the Rome rioni GeoJSON.
We can clearly see a difference in the listing success within one city. There are very dark areas next to very bright areas, indicating clear demarcations of high/low listing success zones.
The next metric we chose to look at was price; it obviously plays a role in choosing an Airbnb®. We aimed to study how many bookings fell into each price bin, so we placed all USD prices into $100-wide bins from $0 to $1,000. We obtained the following histogram:
# Work on a copy to avoid mutating the cleaned frame, then drop rows with missing values
priceSuccessData = airbnb_raw_plus_review.copy()
priceSuccessData = priceSuccessData.dropna()
priceSuccessData['Bins'] = pd.cut(priceSuccessData.price_USD, bins=[0,100,200,300,
400,500,600,700,
800,900,1000], labels = ['0:100','100:200','200:300','300:400',
'400:500','500:600','600:700','700:800',
'800:900','900:1000'])
px.histogram(priceSuccessData, x='Bins', title = "Histogram of Price Categories Across all Airbnbs Tested")
This histogram surprised us at first: most of the time when we look for Airbnbs, they are priced above $100 USD, and recent studies have also recorded an average price higher than this (Sainaghi, Abrate, and Mauri, 2021). Upon deeper analysis, we saw that average Airbnb prices differ greatly around the world, so a distribution like this is not unrealistic. A possible further analysis could separate listings by continent or country, showing true averages for different locations.
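As a minimal sketch of that follow-up (not part of our main analysis), median prices could be compared per city directly from the cleaned frame:
# Median listing price per city, as a first cut at location-specific averages
city_prices = (airbnb_raw_plus_review.groupby('city')['price_USD']
               .median().sort_values(ascending=False).reset_index())
px.bar(city_prices, x='city', y='price_USD', title='Median Listing Price (USD) by City')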
Finally, we chose to look into whether review scores or Superhost status play a role in listing success. To view these three variables together, we plotted scatterplots with review scores on the x-axis and review counts (our measure of listing success) on the y-axis, colouring the points by Superhost status. The results are displayed below:
graph = sns.FacetGrid(airbnb_raw_plus_review, col ='city', hue='host_is_superhost', col_wrap = 4, height=3.5, aspect = 1, legend_out = False, ylim =None)
graph.map(plt.scatter, 'review_scores_rating', 'review_id_distinct_count', edgecolor ="w")
graph.set(yticklabels=[])
plt.show()
These results show interesting findings. First, the Superhost score distribution is concentrated at the high end (strongly left-skewed), indicating Superhosts have higher review scores on average. This makes sense, as Airbnb considers review scores when deciding Superhost status.
The more interesting point was that in many cities, the total number of bookings earned by Superhosts was comparable to that of regular hosts. This is surprising because, in our data and worldwide, Superhosts make up only about 20% of all hosts on Airbnb, yet they earn a comparable number of bookings overall. Clearly, Superhosts are well labelled: per host, they perform much more strongly than regular hosts (a quick numeric check follows).
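A quick numeric check of these claims on the cleaned frame (a sketch, not a formal test):
# Share of hosts that are Superhosts, and the booking proxy per group
superhost_share = airbnb_raw_plus_review['host_is_superhost'].mean()
print(f"Superhost share: {superhost_share:.1%}")
print(airbnb_raw_plus_review.groupby('host_is_superhost')['review_id_distinct_count'].mean())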
Now, we aimed to investigate what the most common sets of words were in listing titles. We created visualizations of word bigrams and trigrams below:
airbnb_text_paris = airbnb_raw_plus_review.loc[airbnb_raw_plus_review['city'] == 'Paris']
airbnb_text_newyork = airbnb_raw_plus_review.loc[airbnb_raw_plus_review['city'] == 'New York']
airbnb_text_mexicocity = airbnb_raw_plus_review.loc[airbnb_raw_plus_review['city'] == 'Mexico City']
#function to get the text column 'name' of a city's listings ready for NLP
def NLP_prep(airbnb_raw_plus_review):
    #pulling relevant columns for this analysis
    listing_name = airbnb_raw_plus_review.loc[:, ['name', 'city']]
    #removing non-alphabetical characters from the text column, creating a new column 'name_str'
    listing_name['name_str'] = (listing_name['name'].fillna('').astype(str)
                                .str.replace(r'[^A-Za-z ]', '', regex=True)
                                .replace('', np.nan, regex=False))
    #drop nan rows from the name_str column
    listing_name = listing_name.dropna(subset=['name_str'])
    #keeping only the first occurrence of each word in every title
    listing_name['name_str_nodup'] = (listing_name['name_str'].str.split()
                                      .apply(lambda x: OrderedDict.fromkeys(x).keys())
                                      .str.join(' '))
    return listing_name['name_str_nodup']
listingtext = NLP_prep(airbnb_text_paris)
# NLP_prep()'s output is used as input for the NLP_airbnb() function which does word analysis.
def NLP_airbnb(listingtext):
    listingtext = " ".join(listingtext)
    #tokenize, lowercase, remove stopwords and non-alphabetic tokens, then lemmatize
    new_tokens = word_tokenize(listingtext)
    new_tokens = [t.lower() for t in new_tokens]
    #note: stopwords.words takes a list of languages to combine English and French stopwords
    new_tokens = [t for t in new_tokens if t not in stopwords.words(['english', 'french'])]
    new_tokens = [t for t in new_tokens if t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    new_tokens = [lemmatizer.lemmatize(t) for t in new_tokens]
    #count the frequencies of single words, bigrams and trigrams
    counted = Counter(new_tokens)
    counted_2 = Counter(ngrams(new_tokens, 2))
    counted_3 = Counter(ngrams(new_tokens, 3))
    #generate a separate dataframe for each type of combination
    word_freq = pd.DataFrame(counted.items(), columns=['word', 'frequency']).sort_values(by='frequency', ascending=False)
    word_pairs = pd.DataFrame(counted_2.items(), columns=['bigrams', 'frequency']).sort_values(by='frequency', ascending=False)
    trigrams = pd.DataFrame(counted_3.items(), columns=['trigrams', 'frequency']).sort_values(by='frequency', ascending=False)
    return word_freq, word_pairs, trigrams
#These are for Paris only
word1, bigrams1, trigrams1 = NLP_airbnb(listingtext)
#the same NLP_prep() function is reused for the New York listings
listingtext = NLP_prep(airbnb_text_newyork)
#the same NLP_airbnb() function does the word analysis; these are for New York only
word2, bigrams2, trigrams2 = NLP_airbnb(listingtext)
#For Paris Data
fig, axes=plt.subplots(1,1, figsize=(8,10))
sns.barplot(ax=axes,x='frequency',y='bigrams',data=bigrams1.head(30)).set_title("Paris Listing Title Bigrams")
plt.show()
fig, axes=plt.subplots(1,1, figsize=(8,10))
sns.barplot(ax=axes,x='frequency',y='trigrams',data=trigrams1.head(30)).set_title("Paris Listing Title Trigrams")
plt.show()
#For New York Data
fig, axes = plt.subplots(1,1,figsize=(8,10))
sns.barplot(ax=axes,x='frequency',y='bigrams',data=bigrams2.head(30)).set_title("New York Listing Title Bigrams")
plt.show()
fig, axes = plt.subplots(1,1,figsize=(8,10))
sns.barplot(ax=axes,x='frequency',y='trigrams',data=trigrams2.head(30)).set_title("New York Listing Title Trigrams")
plt.show()
The first trend we noticed was that in most cities the phrase "in the heart of" (in the local language) is common. Additionally, most hosts tend to put nearby landmarks in their title. Note that our analysis does not test whether hosts who use these words are more successful, only which words are used most commonly.
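A sketch of that missing step, using the Paris frame and a hypothetical phrase choice, would compare the booking proxy for titles containing a phrase against the rest:
# Mean review counts for Paris titles with/without a phrase ('heart of' is a hypothetical example)
mask = airbnb_text_paris['name'].str.contains('heart of', case=False, na=False)
print(airbnb_text_paris.groupby(mask)['review_id_distinct_count'].mean())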
Shifting to our second guiding question, we aim to analyze how COVID-19 impacted listing success. To do this, we first imported data on COVID-19 cases for the country of each of our major cities. We counted February 2020 as the start of COVID-19, as this is when cases started to ramp up in these cities. We plot a line graph of listing success over a bar graph of COVID-19 cases, with shaded regions marking key pandemic periods. The result can be seen below:
#Data cleaning for this part of the question
covid_data = pd.read_csv('https://media.githubusercontent.com/media/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/Covid%20Data/owid-covid-data.csv', encoding='iso-8859-1',
warn_bad_lines=True, error_bad_lines=False)
#get relevant columns from covid data
covid_data2 = covid_data.loc[:, ['location', 'date','new_cases'] ]
#wrangle data to get useful information for merging to airbnb data for each city:
#one helper sums each country's daily new cases into Year-month bins
def monthly_cases(location):
    df = covid_data2.loc[covid_data2['location'] == location].drop(['location'], axis=1)
    df['Year-month'] = pd.to_datetime(df['date']).dt.strftime('%Y_%m')
    return df.drop(['date'], axis=1).groupby(by='Year-month', as_index=False).sum()

covid_fr = monthly_cases('France')
covid_us = monthly_cases('United States')
covid_au = monthly_cases('Australia')
covid_it = monthly_cases('Italy')
covid_br = monthly_cases('Brazil')
covid_tu = monthly_cases('Turkey')
covid_mx = monthly_cases('Mexico')
covid_th = monthly_cases('Thailand')
covid_sa = monthly_cases('South Africa')
covid_hk = monthly_cases('Hong Kong')
#pulling relevant columns from main airbnb data required for guiding question 2 analysis.
pd.set_option('display.max_columns', None)
airbnb_covid = airbnb_raw_plus_review.loc[:, ['listing_id', 'city','2019_01','2019_02','2019_03','2019_04','2019_05',
'2019_06','2019_07','2019_08','2019_09','2019_10','2019_11','2019_12',
'2020_01','2020_02','2020_03','2020_04','2020_05','2020_06','2020_07',
'2020_08','2020_09','2020_10','2020_11','2020_12','2021_01','2021_02',
'2021_03','2021_04'] ]
#pulling information out of the airbnb data for each city: sum the monthly review
#counts ('transactions') across listings, then reshape the series into a dataframe
def city_transactions(city):
    totals = (airbnb_covid.loc[airbnb_covid['city'] == city]
              .drop(['city', 'listing_id'], axis=1).sum(axis='rows'))
    return totals.to_frame(name="Total transactions").rename_axis('Year-month').reset_index()

airbnb_covid_paris = city_transactions('Paris')
airbnb_covid_newyork = city_transactions('New York')
airbnb_covid_bangkok = city_transactions('Bangkok')
airbnb_covid_riodejaneiro = city_transactions('Rio de Janeiro')
airbnb_covid_sydney = city_transactions('Sydney')
airbnb_covid_istanbul = city_transactions('Istanbul')
airbnb_covid_rome = city_transactions('Rome')
airbnb_covid_hongkong = city_transactions('Hong Kong')
airbnb_covid_mexicocity = city_transactions('Mexico City')
airbnb_covid_capetown = city_transactions('Cape Town')
airbnbcovid_paris = airbnb_covid_paris.merge(covid_fr, on='Year-month', how='left').fillna(0)
airbnbcovid_newyork = airbnb_covid_newyork.merge(covid_us, on='Year-month', how='left').fillna(0)
airbnbcovid_bangkok = airbnb_covid_bangkok.merge(covid_th, on='Year-month', how='left').fillna(0)
airbnbcovid_riodejaneiro = airbnb_covid_riodejaneiro.merge(covid_br, on='Year-month', how='left').fillna(0)
airbnbcovid_sydney = airbnb_covid_sydney.merge(covid_au, on='Year-month', how='left').fillna(0)
airbnbcovid_istanbul = airbnb_covid_istanbul.merge(covid_tu, on='Year-month', how='left').fillna(0)
airbnbcovid_rome = airbnb_covid_rome.merge(covid_it, on='Year-month', how='left').fillna(0)
airbnbcovid_hongkong = airbnb_covid_hongkong.merge(covid_hk, on='Year-month', how='left').fillna(0)
airbnbcovid_mexicocity = airbnb_covid_mexicocity.merge(covid_mx, on='Year-month', how='left').fillna(0)
airbnbcovid_capetown = airbnb_covid_capetown.merge(covid_sa, on='Year-month', how='left').fillna(0)
#Plotting Cape Town data
fig = make_subplots(1,1)
fig.add_trace(go.Bar(x=airbnbcovid_capetown['Year-month'], y=(airbnbcovid_capetown['new_cases']/40),  # cases scaled by 1/40 to share the y-axis with transactions
name='Covid cases',
marker_color = 'skyblue',
opacity=0.4,
marker_line_color='rgb(8,48,107)',
marker_line_width=2))
fig.add_trace(go.Scatter(x=airbnbcovid_capetown['Year-month'], y=airbnbcovid_capetown['Total transactions'], name="Capetown Airbnb Transactions",
line_shape='linear', showlegend = True, mode='lines+markers', ))
fig.update_layout(
autosize=False,
width=1150,
height=700,
margin=dict(
l=50,
r=50,
b=100,
t=100,
pad=4))
fig.add_vrect(x0="2020_02", x1="2020_05",
annotation_text="Pandemic started", annotation_position="top left",
fillcolor="red", opacity=0.1, line_width=0)
fig.add_vrect(x0="2020_08", x1="2020_11",
annotation_text="Restrictions lifted", annotation_position="top left",
fillcolor="green", opacity=0.1, line_width=0)
fig.add_vrect(x0="2021_01", x1="2021_03",
annotation_text="More restrictions", annotation_position="top left",
fillcolor="blue", opacity=0.1, line_width=0)
fig.show()
# Plotting Paris data
fig = make_subplots(1,1)
fig.add_trace(go.Bar(x=airbnbcovid_paris['Year-month'], y=(airbnbcovid_paris['new_cases']/20),  # cases scaled by 1/20 to share the y-axis with transactions
name='Covid cases',
marker_color = 'skyblue',
opacity=0.4,
marker_line_color='rgb(8,48,107)',
marker_line_width=2))
fig.add_trace(go.Scatter(x=airbnbcovid_paris['Year-month'], y=airbnbcovid_paris['Total transactions'], name="Paris Airbnb Transactions",
line_shape='linear', showlegend = True, mode='lines+markers', ))
fig.update_layout(
autosize=False,
width=1150,
height=700,
margin=dict(
l=50,
r=50,
b=100,
t=100,
pad=4))
fig.add_vrect(x0="2020_03", x1="2020_05",
annotation_text="Pandemic started", annotation_position="top left",
fillcolor="red", opacity=0.1, line_width=0)
fig.add_vrect(x0="2020_08", x1="2021_04",
annotation_text="Restrictions lifted", annotation_position="top left",
fillcolor="green", opacity=0.1, line_width=0)
fig.show()
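An alternative to manually rescaling the case counts (the /40 and /20 factors above) is a true secondary y-axis; here is a sketch using plotly's secondary_y support for the Paris data:
# Same Paris plot with cases on their own axis instead of a scaling factor
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Bar(x=airbnbcovid_paris['Year-month'], y=airbnbcovid_paris['new_cases'],
                     name='Covid cases', opacity=0.4), secondary_y=True)
fig.add_trace(go.Scatter(x=airbnbcovid_paris['Year-month'], y=airbnbcovid_paris['Total transactions'],
                         name='Paris Airbnb Transactions', mode='lines+markers'), secondary_y=False)
fig.update_yaxes(title_text='Monthly reviews (booking proxy)', secondary_y=False)
fig.update_yaxes(title_text='New COVID-19 cases', secondary_y=True)
fig.show()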
We can clearly see a drop in bookings as the pandemic started. A rise began a few months later, and most cities experienced a second drop in early 2021, when another wave of COVID-19 hit (visible in the case counts), again pushing bookings down.
Studies have found that more people are choosing places to themselves over shared places post-COVID (Bagnera, S.M., Stewart, E. and Edition, S., 2020). We hypothesize that the share of bookings for entire places has increased since COVID-19 began.
# Data cleaning for this part of the question
#picking out relevant columns for this part's analysis
airbnb_roomtype = airbnb_raw_plus_review.loc[:, ['city','room_type','2019_01','2019_02','2019_03','2019_04','2019_05','2019_06','2019_07','2019_08','2019_09','2019_10','2019_11','2019_12','2020_01','2020_02','2020_03','2020_04','2020_05','2020_06','2020_07','2020_08','2020_09','2020_10','2020_11','2020_12','2021_01','2021_02','2021_03','2021_04'] ]
#Extract and sum transactions for the 14 months before the start of Covid (2019_01 to 2020_02) and the 14 months after (2020_03 to 2021_04)
b_covid_cols = airbnb_roomtype.columns.str.contains('2019|2020_01|2020_02', na=False)
b_covid_sum = airbnb_roomtype.loc[:, b_covid_cols].sum(axis=1)
a_covid_cols = airbnb_roomtype.columns.str.contains('2020_03|2020_04|2020_05|2020_06|2020_07|2020_08|2020_09|2020_10|2020_11|2020_12|2021_01|2021_02|2021_03|2021_04', na=False)
a_covid_sum = airbnb_roomtype.loc[:, a_covid_cols].sum(axis=1)
#Create dataframes from series objects and reindex
b_covid_sum2 = b_covid_sum.to_frame(name="Total transactions before Covid")
a_covid_sum2 = a_covid_sum.to_frame(name="Total transactions after Covid")
#Concat the dataframes with Airbnb room type, before/ after transactions count
airbnb_prop = pd.concat([airbnb_roomtype, b_covid_sum2, a_covid_sum2], axis=1)
airbnb_prop = airbnb_prop.loc[:, ['city','room_type','Total transactions before Covid','Total transactions after Covid'] ]
#wrangling data by city, room-type
airbnb_prop_paris = airbnb_prop.loc[airbnb_prop['city'] == 'Paris'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_newyork = airbnb_prop.loc[airbnb_prop['city'] == 'New York'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_bangkok = airbnb_prop.loc[airbnb_prop['city'] == 'Bangkok'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_riodejaneiro = airbnb_prop.loc[airbnb_prop['city'] == 'Rio de Janeiro'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_sydney = airbnb_prop.loc[airbnb_prop['city'] == 'Sydney'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_istanbul = airbnb_prop.loc[airbnb_prop['city'] == 'Istanbul'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_rome = airbnb_prop.loc[airbnb_prop['city'] == 'Rome'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_hongkong = airbnb_prop.loc[airbnb_prop['city'] == 'Hong Kong'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_mexicocity = airbnb_prop.loc[airbnb_prop['city'] == 'Mexico City'].drop(['city'], axis = 1).groupby(by='room_type').sum()
airbnb_prop_capetown = airbnb_prop.loc[airbnb_prop['city'] == 'Cape Town'].drop(['city'], axis = 1).groupby(by='room_type').sum()
#create proportions of total for each column
hongkong_bc = airbnb_prop_hongkong['Total transactions before Covid'].sum()
hongkong_ac = airbnb_prop_hongkong['Total transactions after Covid'].sum()
airbnb_prop_hongkong['Total transactions before Covid'] = airbnb_prop_hongkong['Total transactions before Covid'].div(hongkong_bc)
airbnb_prop_hongkong['Total transactions after Covid'] = airbnb_prop_hongkong['Total transactions after Covid'].div(hongkong_ac)
airbnb_prop_hongkong = airbnb_prop_hongkong.reset_index(level=0).rename(columns = {'room_type':'Room Type'})
#plotting data
px.bar(
data_frame = airbnb_prop_hongkong,
x = "Room Type",
y = ["Total transactions before Covid","Total transactions after Covid"],
opacity = 0.9,
orientation = "v",
barmode = 'group',
title='Hong Kong - Room type chosen before and after Covid',
)
Upon analysis we see that, though the data varies, more people are choosing an entire place and fewer people are choosing a private room post-COVID. This effect is especially pronounced in Hong Kong. We would expect this, as people are being more cautious about their health since COVID-19 (To, K.K. and Yuen, K.Y., 2020).
#Istanbul proportion
#create proportions of total for each column
istanbul_bc = airbnb_prop_istanbul['Total transactions before Covid'].sum()
istanbul_ac = airbnb_prop_istanbul['Total transactions after Covid'].sum()
airbnb_prop_istanbul['Total transactions before Covid'] = airbnb_prop_istanbul['Total transactions before Covid'].div(istanbul_bc)
airbnb_prop_istanbul['Total transactions after Covid'] = airbnb_prop_istanbul['Total transactions after Covid'].div(istanbul_ac)
airbnb_prop_istanbul = airbnb_prop_istanbul.reset_index(level=0).rename(columns = {'room_type':'Room Type'})
display(airbnb_prop_istanbul)
#plotting data
px.bar(
data_frame = airbnb_prop_istanbul,
x = "Room Type",
y = ["Total transactions before Covid","Total transactions after Covid"],
opacity = 0.9,
orientation = "v",
barmode = 'group',
title='Istanbul - Room type chosen before and after Covid',
)
Room Type | Total transactions before Covid | Total transactions after Covid | |
---|---|---|---|
0 | Entire place | 0.755190 | 0.807088 |
1 | Hotel room | 0.050473 | 0.026782 |
2 | Private room | 0.191255 | 0.163736 |
3 | Shared room | 0.003083 | 0.002394 |
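The per-city normalization and plotting above could be folded into one helper. A minimal sketch, assuming the un-normalized per-city frames created earlier (e.g. airbnb_prop_paris) are in scope:
# Normalize before/after transaction counts to proportions and plot them side by side
def plot_roomtype_proportions(prop_df, city_name):
    out = prop_df.copy()
    for col in ['Total transactions before Covid', 'Total transactions after Covid']:
        out[col] = out[col] / out[col].sum()
    out = out.reset_index().rename(columns={'room_type': 'Room Type'})
    return px.bar(out, x='Room Type',
                  y=['Total transactions before Covid', 'Total transactions after Covid'],
                  opacity=0.9, barmode='group',
                  title=f'{city_name} - Room type chosen before and after Covid')

plot_roomtype_proportions(airbnb_prop_paris, 'Paris')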
We now aimed to study how seasonal booking trends vary from city to city. To do this, we paired listing-success heat maps with average monthly temperatures. The results are visualized below:
paris_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Paris')].iloc[:,[*range(40,208)]]  # columns 40-207 are the monthly review-count bins
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value=[]
for ind, column in enumerate(paris_df.columns):
    value.append(paris_df[column].sum())
paris_df_2 = pd.DataFrame([Year_Month, value]).transpose()
paris_df_2[0] = pd.to_datetime(paris_df_2[0],format='%Y-%b')
paris_df_2['Year']=paris_df_2[0].dt.year
paris_df_2['Month']=paris_df_2[0].dt.month
paris_df_2['review_count']=paris_df_2[1].astype(int)
#Plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Paris - no. of transaction heatmap
#UEFA Euro 2016 (Jun to Jul 2016): flat index 101 = (2016-2008)*12 + 5, i.e. Jun 2016 in the 14x12 Year-by-Month label grid
the_label = pd.DataFrame(np.array(['']*101+['UEFA']+['']*(168-101-1)).reshape(14,12))
annot_kws = {"ha": 'left'}
b = sns.heatmap(paris_df_2.pivot("Year","Month","review_count"), cmap="rocket_r", fmt="s", annot_kws=annot_kws, annot=the_label, cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Paris - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':4.3,'Feb':4.6,'Mar':7.4,'Apr':10.7,'May':14.3,'Jun':17.7,'Jul':19.8,'Aug':19.4,'Sep':16.4,'Oct':12.6,'Nov':7.9,'Dec':4.8}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The peak seasons in Paris are Jun-Jul and Sep-Oct, when the weather is not cold. The UEFA Euro football tournament was hosted by France in 2016, but there was no significant surge in Airbnb demand.
#Rio_de_Janeiro - no. of transaction heatmap
#2014 World Cup in Brazil
#2016 Olympic
#flat indices 77 and 103 correspond to Jun 2014 (World Cup) and Aug 2016 (Olympics)
the_label = pd.DataFrame(np.array(['']*77+['World Cup']+['']*25+['Olympic']+['']*(168-77-1-1-25)).reshape(14,12))
Rio_de_Janeiro_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Rio de Janeiro')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value=[]
for ind, column in enumerate(Rio_de_Janeiro_df.columns):
    value.append(Rio_de_Janeiro_df[column].sum())
Rio_de_Janeiro_df_2 = pd.DataFrame([Year_Month, value]).transpose()
Rio_de_Janeiro_df_2[0] = pd.to_datetime(Rio_de_Janeiro_df_2[0],format='%Y-%b')
Rio_de_Janeiro_df_2['Year']=Rio_de_Janeiro_df_2[0].dt.year
Rio_de_Janeiro_df_2['Month']=Rio_de_Janeiro_df_2[0].dt.month
Rio_de_Janeiro_df_2['review_count']=Rio_de_Janeiro_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
annot_kws = {"ha": 'left'}
b = sns.heatmap(Rio_de_Janeiro_df_2.pivot("Year","Month","review_count"), cmap="rocket_r", fmt="s", annot_kws=annot_kws, annot=the_label, cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Rio_de_Janeiro - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':26.7,'Feb':27,'Mar':25.9,'Apr':24.3,'May':21.8,'Jun':20.8,'Jul':20.1,'Aug':20.9,'Sep':22.2,'Oct':23.7,'Nov':24.2,'Dec':25.8}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The Airbnb seasonal trend is not obvious for Rio de Janeiro. Some international events, such as the World Cup (2014) and the Olympics (2016), took place in Rio de Janeiro during this period.
#Rio_de_Janeiro - host_since join date trend
#2014 World Cup in Brazil
#2016 Olympic
Rio_de_Janeiro_df_join_date = pd.DataFrame(airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Rio de Janeiro')]['host_since'])
Rio_de_Janeiro_df_join_date['Year_join_date'] = pd.to_datetime(Rio_de_Janeiro_df_join_date['host_since'],format='%Y-%m-%d').dt.year.convert_dtypes()
Rio_de_Janeiro_df_join_date['Month_join_date'] = pd.to_datetime(Rio_de_Janeiro_df_join_date['host_since'],format='%Y-%m-%d').dt.month.convert_dtypes()
Rio_de_Janeiro_df_join_date_2 = Rio_de_Janeiro_df_join_date.groupby(['Year_join_date','Month_join_date']).agg(new_join_count=('host_since', np.count_nonzero)).reset_index()
plt.figure(figsize=(20, 8), dpi=80)
sns.set_theme()
sns.set(font_scale=2)
ax = sns.heatmap(Rio_de_Janeiro_df_join_date_2.pivot("Year_join_date","Month_join_date","new_join_count"), cmap="crest").set(title='Rio_de_Janeiro - no. of Airbnb new join heatmap')
plt.show()
On the supply side, new Airbnb hosts rushed in ahead of the World Cup (2014) and the Olympics (2016).
#Bangkok - no. of transaction heatmap
bangkok_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Bangkok')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value=[]
for ind, column in enumerate(bangkok_df.columns):
    value.append(bangkok_df[column].sum())
bangkok_df_2 = pd.DataFrame([Year_Month, value]).transpose()
bangkok_df_2[0] = pd.to_datetime(bangkok_df_2[0],format='%Y-%b')
bangkok_df_2['Year']=bangkok_df_2[0].dt.year
bangkok_df_2['Month']=bangkok_df_2[0].dt.month
bangkok_df_2['review_count']=bangkok_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Bangkok - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(bangkok_df_2.pivot("Year","Month","review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Bangkok - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':26,'Feb':27.4,'Mar':28.8,'Apr':29.9,'May':29.1,'Jun':28.5,'Jul':28,'Aug':27.8,'Sep':27.3,'Oct':27,'Nov':26.7,'Dec':25.9}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The peak season is in December and January. This is winter in many Northern Hemisphere countries, so people from the west might travel to Bangkok to escape the cold.
#Sydney - no. of transaction heatmap
Sydney_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Sydney')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value=[]
for ind, column in enumerate(Sydney_df.columns):
    value.append(Sydney_df[column].sum())
Sydney_df_2 = pd.DataFrame([Year_Month, value]).transpose()
Sydney_df_2[0] = pd.to_datetime(Sydney_df_2[0],format='%Y-%b')
Sydney_df_2['Year']=Sydney_df_2[0].dt.year
Sydney_df_2['Month']=Sydney_df_2[0].dt.month
Sydney_df_2['review_count']=Sydney_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Sydney - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(Sydney_df_2.pivot(index="Year", columns="Month", values="review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Sydney - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':22.8,'Feb':22.6,'Mar':21.3,'Apr':18.8,'May':15.8,'Jun':13.6,'Jul':12.7,'Aug':13.5,'Sep':16,'Oct':18,'Nov':19.7,'Dec':21.4}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The peak season runs from October to February, which is summer in Australia.
To conclude, while the cities' seasonal patterns themselves do not vary dramatically, more people clearly visit in the hotter months. This is especially visible in Rio de Janeiro, where the booking peaks track the Southern Hemisphere's reversed seasons.
Note that we have flagged major events taking place in these cities to account for data points that do not follow the seasonal pattern.
A survey conducted in class during the proposal presentation showed that Wifi is the most crucial amenity. In fact, Wifi is also the single most commonly offered Airbnb amenity.
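A quick frequency count over the raw amenities column supports this claim; a minimal sketch (Counter and re are imported above):
#sketch: rank amenities across all listings; Wifi is expected at the top
amenity_counts = Counter()
for entry in airbnb_raw['amenities'].dropna():
    amenity_counts.update(re.findall(r'"(.*?)"', entry))
print(amenity_counts.most_common(5))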
Next, we wanted to see if location affected the amenities offered. Below are a series of word clouds with amenities for each city:
#extract the amenity items from the lists
import re
cities = ['Paris', 'New York', 'Bangkok', 'Rio de Janeiro', 'Sydney',
          'Istanbul', 'Rome', 'Hong Kong', 'Mexico City', 'Cape Town']
# one list of amenity strings per city
amenities_by_city = {city: [] for city in cities}
amenities_list = airbnb_raw['amenities'].tolist()
city_list = airbnb_raw['city'].tolist()
for city, amenities in zip(city_list, amenities_list):
    if city in amenities_by_city:
        amenities_by_city[city].extend(re.findall(r'"(.*?)"', amenities))
# keep the original per-city list names used below
L_Paris = amenities_by_city['Paris']
L_NewYork = amenities_by_city['New York']
L_Bangkok = amenities_by_city['Bangkok']
L_RiodeJaneiro = amenities_by_city['Rio de Janeiro']
L_Sydney = amenities_by_city['Sydney']
L_Istanbul = amenities_by_city['Istanbul']
L_Rome = amenities_by_city['Rome']
L_HongKong = amenities_by_city['Hong Kong']
L_MexicoCity = amenities_by_city['Mexico City']
L_CapeTown = amenities_by_city['Cape Town']
# count the frequency of the items in an amenities list (dict)
def CountFrequency(my_list):
    freq = {}
    for item in my_list:
        freq[item] = freq.get(item, 0) + 1
    return freq
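Since Counter is already imported, the same frequency table can also be built in a single line; an equivalent sketch:
#sketch: CountFrequency via collections.Counter (same output as above)
def CountFrequencyFast(my_list):
    return dict(Counter(my_list))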
# Paris - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_Paris))
from matplotlib.pyplot import figure
figure(figsize = (10, 6), dpi = 80)
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Paris - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
Heating and Wifi are the most common amenities offered in Paris listings.
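The per-city word-cloud cells below all repeat the same steps, so a small helper (the name is illustrative) could render any city's cloud:
#sketch: one amenity word cloud per city, wrapping the repeated steps
def plot_amenity_cloud(amenity_list, city_name):
    wc = WordCloud(min_word_length=3, background_color='white')
    wc.generate_from_frequencies(CountFrequency(amenity_list))
    plt.figure(figsize=(10, 6), dpi=80)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'{city_name} - Airbnb amenities', fontdict={'fontsize': 20})
    plt.show()
#e.g. plot_amenity_cloud(L_NewYork, 'New York')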
# New York - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_NewYork))
from matplotlib.pyplot import figure
figure(figsize = (10, 6), dpi = 80)
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('New York - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
For New York, the amenities offered are quite diverse; Wifi and Long term stays allowed are the most common.
# Rio de Janeiro - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_RiodeJaneiro))
from matplotlib.pyplot import figure
figure(figsize = (10, 6), dpi = 80)
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Rio de Janeiro - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
In Rio de Janeiro, Kitchen and Wifi appear to be must-have items.
# Hong Kong - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_HongKong))
from matplotlib.pyplot import figure
figure(figsize = (10, 6), dpi = 80)
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Hong Kong - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
In Hong Kong, the amenity items are diverse. Long term stays allowed and Dedicated workspace may be the amenities that differentiate Airbnb listings from hotels.
To conclude, we can clearly see that Wifi and Kitchen are highly favoured in many cities. We also noticed that the Paris and Hong Kong clouds are very evenly distributed, indicating a broader range of amenities is offered in these cities. Additionally, work spaces are valued in places like Hong Kong, where hotels do not offer this service.
We aimed to determine whether the average review score differs between regular hosts and superhosts. We visualized this with a bar plot and a radial column chart:
reviewBar = pd.read_csv('https://media.githubusercontent.com/media/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/airbnb_raw_plus_review.csv')
reviewCols = reviewBar.loc[: , "review_scores_rating":"review_scores_value"]
reviewBar['avg_rev_score'] = reviewCols.mean(axis=1)
only_superhosts = reviewBar[reviewBar.host_is_superhost == True]
only_regularhosts = reviewBar[reviewBar.host_is_superhost == False]
superhost_mean = only_superhosts['avg_rev_score'].mean()
regularhost_mean = only_regularhosts['avg_rev_score'].mean()
reviewBarData = pd.DataFrame()
reviewBarData['Mean Review Score'] = (regularhost_mean, superhost_mean)
reviewBarData['Host Status'] = ('Regular Host', 'SuperHost')
px.bar(reviewBarData, x='Host Status', y='Mean Review Score')
The bar plot shows that there is not a big difference in average review score between regular hosts and superhosts. The radial plot shows that this holds true for each review category. This is what we expected, as most people give a full score on a review, barring any large issues.
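As a cross-check, the same comparison can be written as a single groupby over the avg_rev_score column defined above; a minimal sketch:
#sketch: mean overall and per-category review scores by superhost status
print(reviewBar.groupby('host_is_superhost')['avg_rev_score'].mean())
score_cols = reviewBar.loc[:, 'review_scores_rating':'review_scores_value'].columns
print(reviewBar.groupby('host_is_superhost')[list(score_cols)].mean())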
ratMean = only_superhosts['review_scores_rating'].mean()
accMean = only_superhosts['review_scores_accuracy'].mean()
clenMean = only_superhosts['review_scores_cleanliness'].mean()
checMean = only_superhosts['review_scores_checkin'].mean()
comMean = only_superhosts['review_scores_communication'].mean()
locMean = only_superhosts['review_scores_location'].mean()
valMean = only_superhosts['review_scores_value'].mean()
accMean1 = only_regularhosts['review_scores_accuracy'].mean()
clenMean1 = only_regularhosts['review_scores_cleanliness'].mean()
checMean1 = only_regularhosts['review_scores_checkin'].mean()
comMean1 = only_regularhosts['review_scores_communication'].mean()
locMean1 = only_regularhosts['review_scores_location'].mean()
valMean1 = only_regularhosts['review_scores_value'].mean()
fig = make_subplots(rows=2, cols=1, specs=[[{'type': 'polar'}]]*2)
fig.add_trace(go.Scatterpolar(
name = "Regular Host",
r = [accMean1, clenMean1, checMean1, comMean1, locMean1, valMean1],
theta = ["accMean", "clenMean", "checMean", "comMean", "locMean", "valMean"],
), 1, 1)
fig.add_trace(go.Scatterpolar(
name = "SuperHost",
r = [accMean, clenMean, checMean, comMean, locMean, valMean],
theta = ["accMean", "clenMean", "checMean", "comMean", "locMean", "valMean"],
), 2, 1)
fig.update_traces(fill='toself')
fig.update_layout(autosize=False, width=1000, height=1000, showlegend=True, margin={"l":0,"r":0,"t":0,"b":0})
fig.update_layout(
polar = dict(
radialaxis_angle = -45,
angularaxis = dict(
direction = "clockwise",
period = 6)
),
polar2 = dict(
radialaxis_angle = -45,
angularaxis = dict(
direction = "clockwise",
period = 6)
)
)
fig.show()
Finally, we looked into whether the number of amenities differs between hosts and superhosts, and we visualized this with a box plot:
boxData = pd.read_csv('https://media.githubusercontent.com/media/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/airbnb_raw_plus_review.csv')
boxData['Amenity Counts'] = boxData['amenities'].str.split(',').str.len()
boxData['host_is_superhost'] = boxData['host_is_superhost'].astype(str)
boxData.replace({'False': 'Regular Host', 'True': 'SuperHost'}, inplace=True)
boxData.rename(columns={'host_is_superhost':'Superhost Status'}, inplace=True)
fig = px.box(boxData, x = 'Superhost Status', y='Amenity Counts', width=1100, height=650, color='Superhost Status')
fig = fig.update_layout(showlegend=False)
fig.show()
We can see a slight difference in the number of amenities offered by hosts vs. superhosts: the third quartile for regular hosts does not reach the superhost median. Note that we are not looking at the quality of amenities, only the number offered, which is higher for superhosts.
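One caveat: splitting on commas can overcount amenities whose names themselves contain commas. A more robust count would use the same quoted-item regex as the word-cloud section; a sketch:
#sketch: count quoted amenity items instead of splitting on raw commas
boxData['Amenity Counts'] = boxData['amenities'].str.count(r'"[^"]*"')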
From our above analysis, we can see that location, price, and superhost status all play a large role in listing success (quantified by the number of reviews). We also saw that COVID-19 impacted both booking numbers and room type.
Our heatmaps showed that booking trends do not vary greatly from city to city beyond the influence of temperature. We also saw that amenity distributions vary between cities depending on their needs, but the rankings tend to stay the same, with Wifi at the top. Listing titles tend to share the same themes, with phrases like “in the heart of” and nearby landmarks.
Finally, we saw that hosts and superhosts do not differ as much as we first suspected: they have similar average review scores, both overall and in each category. However, superhosts do offer a higher number of amenities.
Airbnb is the future of travel accommodation, and insights into its key variables are extremely valuable. Our study was limited in that it did not include a large amount of post-COVID data; further studies should include more. In studying the Airbnb industry in these cities, we also gained a lot of valuable information about city trends in general, which future research could use as a tool.
Airbnb, 2022. Airbnb API.
Bagnera, S.M., Stewart, E. and Edition, S., 2020. Navigating hotel operations in times of COVID-19. Boston Hospitality Review, 1(7).
Bhat, M., 2021. Airbnb Listings & Reviews, electronic dataset, Kaggle, viewed 13 Sept. 2022, Dataset-link
Deboosere, R., Kerrigan, D., Wachsmuth, D. and El-Geneidy, A., 2019. Location, location and professionalization: a multilevel hedonic analysis of Airbnb listing prices and revenue. Regional Studies, Regional Science, 6(1), pp.143-156.
Sainaghi, R., Abrate, G. and Mauri, A., 2021. Price and RevPAR determinants of Airbnb listings: Convergent and divergent evidence. International Journal of Hospitality Management, 92, p.102709.
To, K.K. and Yuen, K.Y., 2020. Responding to COVID-19 in Hong Kong. Hong Kong Medical Journal.
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/new_york_neighbourhoods.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf['id'] = cendf.index
cendf.set_index('neighbourhood', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only New York
new_york_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='New York')].reset_index()
#combine the long and lat as coordinates
new_york_df_map['coordinates'] = list(zip(new_york_df_map["longitude"], new_york_df_map["latitude"]))
new_york_df_map['coordinates'] = new_york_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(new_york_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#for counting the number of reviews in each district
coordinates = new_york_df_map['coordinates'].tolist()
count_of_review = new_york_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_new_york = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_new_york.items(), columns=['id', 'review_count'])
cendf_2['neighbourhood'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='id', how='left')
cendf_3.set_index('neighbourhood', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_new_york = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_new_york, geojson=cendf_4_new_york,
locations=cendf_4_new_york.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": 40.7167, "lon": -74}, # New York
mapbox_style="carto-positron",
opacity=0.75,
zoom=9,
title = 'New York Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
Central Park, the Statue of Liberty and Times Square are in or around Manhattan, where a high proportion of visitors stayed, but the highest-density district is Bedford-Stuyvesant.
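As an aside, the nested containment loop above is O(listings x districts); geopandas can produce the same per-district totals with a vectorized spatial join. A sketch, assuming geopandas >= 0.10 (for the predicate argument) and that the GeoJSON file declares a CRS:
#sketch: per-district review totals via gpd.sjoin (same totals as output_dict, keyed by neighbourhood)
joined = gpd.sjoin(locsdf, cendf_2.to_crs(epsg=4326), how='inner', predicate='within')
review_totals = joined.groupby('index_right')['review_id_distinct_count'].sum()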
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/sydney.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf.set_index('name', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Sydney
sydney_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Sydney')].reset_index()
#combine the long and lat as coordinates
sydney_df_map['coordinates'] = list(zip(sydney_df_map["longitude"], sydney_df_map["latitude"]))
sydney_df_map['coordinates'] = sydney_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(sydney_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
coordinates = sydney_df_map['coordinates'].tolist()
count_of_review = sydney_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['cartodb_id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_sydney = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_sydney.items(), columns=['cartodb_id', 'review_count'])
cendf_2['name'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='cartodb_id', how='left')
cendf_3.set_index('name', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_sydney = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_sydney, geojson=cendf_4_sydney,
locations=cendf_4_sydney.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": -33.89, "lon": 151.210}, # Sydney
mapbox_style="carto-positron",
opacity=0.75,
zoom=12,
title = 'Sydney Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
The most popular Airbnb locations are in the north of the city, near the sea and the Sydney Opera House.
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/rio_neighbourhoods.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf['id'] = cendf.index
cendf.set_index('neighbourhood', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Rio de Janeiro
rio_de_janeiro_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Rio de Janeiro')].reset_index()
#combine the long and lat as coordinates
rio_de_janeiro_df_map['coordinates'] = list(zip(rio_de_janeiro_df_map["longitude"], rio_de_janeiro_df_map["latitude"]))
rio_de_janeiro_df_map['coordinates'] = rio_de_janeiro_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(rio_de_janeiro_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#for counting the number of reviews in each district
coordinates = rio_de_janeiro_df_map['coordinates'].tolist()
count_of_review = rio_de_janeiro_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_rio = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_rio.items(), columns=['id', 'review_count'])
cendf_2['neighbourhood'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='id', how='left')
cendf_3.set_index('neighbourhood', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_rio = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_rio, geojson=cendf_4_rio,
locations=cendf_4_rio.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": -22.908333, "lon": -43.4}, # Rio de Janeiro
mapbox_style="carto-positron",
opacity=0.75,
zoom=9,
title = 'Rio de Janeiro Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
Christ the Redeemer is near Copacabana, which is the most popular Airbnb area in Rio de Janeiro.
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/mexico_city_neighbourhoods.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf['id'] = cendf.index
cendf.set_index('neighbourhood', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Mexico City
Mexico_city_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Mexico City')].reset_index()
#combine the long and lat as coordinates
Mexico_city_df_map['coordinates'] = list(zip(Mexico_city_df_map["longitude"], Mexico_city_df_map["latitude"]))
Mexico_city_df_map['coordinates'] = Mexico_city_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(Mexico_city_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#for counting the number of reviews in each district
coordinates = Mexico_city_df_map['coordinates'].tolist()
count_of_review = Mexico_city_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_mexico_city = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_mexico_city.items(), columns=['id', 'review_count'])
cendf_2['neighbourhood'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='id', how='left')
cendf_3.set_index('neighbourhood', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_mexico_city = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_mexico_city, geojson=cendf_4_mexico_city,
locations=cendf_4_mexico_city.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": 19.3, "lon": -99.133209}, # Mexico City
mapbox_style="carto-positron",
opacity=0.75,
zoom=9,
title = 'Mexico City Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/cape_town_neighbourhoods.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf['id'] = cendf.index
cendf.set_index('neighbourhood', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Cape Town
Cape_town_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Cape Town')].reset_index()
#combine the long and lat as coordinates
Cape_town_df_map['coordinates'] = list(zip(Cape_town_df_map["longitude"], Cape_town_df_map["latitude"]))
Cape_town_df_map['coordinates'] = Cape_town_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(Cape_town_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#for counting the number of reviews in each district
coordinates = Cape_town_df_map['coordinates'].tolist()
count_of_review = Cape_town_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_cape_town = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_cape_town.items(), columns=['id', 'review_count'])
cendf_2['neighbourhood'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='id', how='left')
cendf_3.set_index('neighbourhood', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_cape_town = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_cape_town, geojson=cendf_4_cape_town,
locations=cendf_4_cape_town.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": -33.918861, "lon": 18.423300}, # Cape Town
mapbox_style="carto-positron",
opacity=0.75,
zoom=9,
title = 'Cape Town Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
The Cape of Good Hope is in the south, and the Kirstenbosch National Botanical Garden is near the popular areas in the north.
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/bangkok_neighbourhoods.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf['id'] = cendf.index
cendf.set_index('neighbourhood', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Bangkok
Bangkok_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Bangkok')].reset_index()
#combine the long and lat as coordinates
Bangkok_df_map['coordinates'] = list(zip(Bangkok_df_map["longitude"], Bangkok_df_map["latitude"]))
Bangkok_df_map['coordinates'] = Bangkok_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(Bangkok_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#for counting the number of reviews in each district
coordinates = Bangkok_df_map['coordinates'].tolist()
count_of_review = Bangkok_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_bangkok = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_bangkok.items(), columns=['id', 'review_count'])
cendf_2['neighbourhood'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='id', how='left')
cendf_3.set_index('neighbourhood', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_bangkok = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_bangkok, geojson=cendf_4_bangkok,
locations=cendf_4_bangkok.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": 13.668217, "lon": 100.614021}, # Bangkok
mapbox_style="carto-positron",
opacity=0.75,
zoom=9,
title = 'Bangkok Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
#read the GeoJson file
census_file = 'https://raw.githubusercontent.com/imadahmad97/EDA-of-Airbnb-Data/main/DataSets/GEOJSON%20Files/hong-kong_1150.geojson'
cendf = gpd.read_file(census_file)
# Let's index by district name
cendf.set_index('name', inplace=True)
cendf_2 = cendf
# Filter cleaned file to only Hong Kong
Hong_kong_df_map = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Hong Kong')].reset_index()
#combine the long and lat as coordinates
Hong_kong_df_map['coordinates'] = list(zip(Hong_kong_df_map["longitude"], Hong_kong_df_map["latitude"]))
Hong_kong_df_map['coordinates'] = Hong_kong_df_map['coordinates'].apply(Point)
locsdf = gpd.GeoDataFrame(Hong_kong_df_map, geometry='coordinates')
locsdf = locsdf.set_crs(epsg=4326)
#for counting the number of reviews in each district
coordinates = Hong_kong_df_map['coordinates'].tolist()
count_of_review = Hong_kong_df_map['review_id_distinct_count'].tolist()
geo = cendf_2['geometry'].tolist()
district_code = cendf_2['id'].tolist()
output_dict = dict()
for j in range(len(geo)):
    for i in range(len(coordinates)):
        if geo[j].contains(coordinates[i]) and not np.isnan(count_of_review[i]):
            output_dict[district_code[j]] = output_dict.get(district_code[j], 0) + count_of_review[i]
#print(output_dict)
output_dict_hong_kong = output_dict
#turn the dictionary to dataframe, then left join to the cendf file
output_dict_df = pd.DataFrame(output_dict_hong_kong.items(), columns=['id', 'review_count'])
cendf_2['name'] = cendf_2.index
cendf_3 = pd.merge(cendf_2, output_dict_df , on='id', how='left')
cendf_3.set_index('name', inplace=True)
# Choropleth map of review counts via px.choropleth_mapbox()
cendf_4_hong_kong = cendf_3.to_crs(epsg=4326)
# plot the map
fig = px.choropleth_mapbox(cendf_4_hong_kong, geojson=cendf_4_hong_kong,
locations=cendf_4_hong_kong.index,
color="review_count",
color_continuous_scale = 'YlGn',
center={"lat": 22.302711, "lon": 114.177216}, # Hong Kong
mapbox_style="carto-positron",
opacity=0.75,
zoom=9,
title = 'Hong Kong Review Density')
fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50},
autosize=True,
height=600 )
fig.show()
In Hong Kong, the most popular Airbnb district is central Kowloon, the most convenient base for travelling around the city.
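Since the read-filter-count-plot pipeline repeats for every city, it is a natural candidate for a parameterized helper built on the spatial-join shortcut sketched earlier. A sketch; the function name and district_id column are illustrative, and it assumes each GeoJSON declares a CRS and contains the given key column:
#sketch: one review-density choropleth per city
def review_density_map(geojson_url, city, center, key='neighbourhood', zoom=9):
    cen = gpd.read_file(geojson_url).to_crs(epsg=4326)
    cen['district_id'] = cen.index
    pts = airbnb_raw_plus_review[airbnb_raw_plus_review['city'] == city].reset_index()
    pts['coordinates'] = [Point(xy) for xy in zip(pts['longitude'], pts['latitude'])]
    pts = gpd.GeoDataFrame(pts, geometry='coordinates').set_crs(epsg=4326)
    joined = gpd.sjoin(pts, cen, how='inner', predicate='within')
    counts = joined.groupby('district_id')['review_id_distinct_count'].sum().reset_index(name='review_count')
    cen = cen.merge(counts, on='district_id', how='left').set_index(key)
    fig = px.choropleth_mapbox(cen, geojson=cen, locations=cen.index,
                               color='review_count', color_continuous_scale='YlGn',
                               center=center, mapbox_style='carto-positron',
                               opacity=0.75, zoom=zoom, title=f'{city} Review Density')
    fig.update_layout(margin={"r":50,"t":50,"l":50,"b":50}, autosize=True, height=600)
    fig.show()
#e.g. review_density_map(census_file, 'Hong Kong', center={"lat": 22.302711, "lon": 114.177216}, key='name')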
#New_york - no. of transaction heatmap
new_york_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='New York')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value = []
for column in new_york_df.columns:
    value.append(new_york_df[column].sum())
new_york_df_2 = pd.DataFrame([Year_Month, value]).transpose()
new_york_df_2[0] = pd.to_datetime(new_york_df_2[0],format='%Y-%b')
new_york_df_2['Year']=new_york_df_2[0].dt.year
new_york_df_2['Month']=new_york_df_2[0].dt.month
new_york_df_2['review_count']=new_york_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#New_york - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(new_york_df_2.pivot(index="Year", columns="Month", values="review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='New York - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':-1,'Feb':0,'Mar':4.1,'Apr':10.4,'May':16,'Jun':21.3,'Jul':24.5,'Aug':23.6,'Sep':20.1,'Oct':13.7,'Nov':7.7,'Dec':2.5}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The peak Airbnb season is in September and October. Christmas time is also busy.
#Istanbul - no. of transaction heatmap
Istanbul_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Istanbul')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value = []
for column in Istanbul_df.columns:
    value.append(Istanbul_df[column].sum())
Istanbul_df_2 = pd.DataFrame([Year_Month, value]).transpose()
Istanbul_df_2[0] = pd.to_datetime(Istanbul_df_2[0],format='%Y-%b')
Istanbul_df_2['Year']=Istanbul_df_2[0].dt.year
Istanbul_df_2['Month']=Istanbul_df_2[0].dt.month
Istanbul_df_2['review_count']=Istanbul_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Istanbul - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(Istanbul_df_2.pivot(index="Year", columns="Month", values="review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Istanbul - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':6,'Feb':6.5,'Mar':8.5,'Apr':12,'May':16.9,'Jun':21.7,'Jul':24.3,'Aug':24.6,'Sep':21.1,'Oct':16.4,'Nov':12.2,'Dec':8.1}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
Istanbul's popular season runs from August to October, in late summer, when temperatures start dropping below 20 degrees.
#Hong_Kong - no. of transaction heatmap
Hong_Kong_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Hong Kong')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value = []
for column in Hong_Kong_df.columns:
    value.append(Hong_Kong_df[column].sum())
Hong_Kong_df_2 = pd.DataFrame([Year_Month, value]).transpose()
Hong_Kong_df_2[0] = pd.to_datetime(Hong_Kong_df_2[0],format='%Y-%b')
Hong_Kong_df_2['Year']=Hong_Kong_df_2[0].dt.year
Hong_Kong_df_2['Month']=Hong_Kong_df_2[0].dt.month
Hong_Kong_df_2['review_count']=Hong_Kong_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Hong_Kong - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(Hong_Kong_df_2.pivot(index="Year", columns="Month", values="review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Hong Kong - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':15.6,'Feb':17.1,'Mar':19.6,'Apr':22.7,'May':25.4,'Jun':27.1,'Jul':27.6,'Aug':27.4,'Sep':26.6,'Oct':24.5,'Nov':21,'Dec':16.7}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
There is no obvious peak season for Hong Kong, but the social unrest that started in mid-2019 had a visible effect on the volume of visitors.
#Mexico_City - no. of transaction heatmap
Mexico_City_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Mexico City')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value = []
for column in Mexico_City_df.columns:
    value.append(Mexico_City_df[column].sum())
Mexico_City_df_2 = pd.DataFrame([Year_Month, value]).transpose()
Mexico_City_df_2[0] = pd.to_datetime(Mexico_City_df_2[0],format='%Y-%b')
Mexico_City_df_2['Year']=Mexico_City_df_2[0].dt.year
Mexico_City_df_2['Month']=Mexico_City_df_2[0].dt.month
Mexico_City_df_2['review_count']=Mexico_City_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Mexico_City - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(Mexico_City_df_2.pivot(index="Year", columns="Month", values="review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Mexico City - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':13.4,'Feb':15.1,'Mar':16.7,'Apr':18.4,'May':18.7,'Jun':17.6,'Jul':16.4,'Aug':16.5,'Sep':16,'Oct':15.1,'Nov':14.1,'Dec':13.7}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The peak season in Mexico City runs from October to February.
#Cape_Town - no. of transaction heatmap
Cape_Town_df = airbnb_raw_plus_review[(airbnb_raw_plus_review['city']=='Cape Town')].iloc[:,[*range(40,208)]]
Year_Month = pd.date_range('2008-01-01','2021-12-31', freq='M').strftime("%Y-%b").tolist()
value = []
for column in Cape_Town_df.columns:
    value.append(Cape_Town_df[column].sum())
Cape_Town_df_2 = pd.DataFrame([Year_Month, value]).transpose()
Cape_Town_df_2[0] = pd.to_datetime(Cape_Town_df_2[0],format='%Y-%b')
Cape_Town_df_2['Year']=Cape_Town_df_2[0].dt.year
Cape_Town_df_2['Month']=Cape_Town_df_2[0].dt.month
Cape_Town_df_2['review_count']=Cape_Town_df_2[1].astype(int)
#plot the heatmap
sns.set(font_scale=3)
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
#Cape_Town - no. of transaction heatmap
annot_kws = {"ha": 'left'}
b = sns.heatmap(Cape_Town_df_2.pivot(index="Year", columns="Month", values="review_count"), cmap="rocket_r", cbar_ax=cbar_ax, linewidths=2, ax=ax1).set(title='Cape Town - no. of transactions heatmap')
ax2.set_xlabel('Month')
#avg temp sourced from climate-data.org
avg_temp = {'Jan':20,'Feb':20.1,'Mar':18.9,'Apr':16.9,'May':15.1,'Jun':13.6,'Jul':13,'Aug':13,'Sep':14,'Oct':15.7,'Nov':17.1,'Dec':19}
avg_temp_df = pd.DataFrame(avg_temp.items(), columns=['Month', 'Temp degree in C'])
ax2.bar(avg_temp_df['Month'], avg_temp_df['Temp degree in C'], align='edge', color = 'grey')
ax2.set_xticklabels(avg_temp_df['Month'])
ax2.set_ylabel('Temp in C')
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
The peak season in Cape Town runs from October to March, which is summer in South Africa. Animal migrations take place from June to November.
# Bangkok - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_Bangkok))
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Bangkok - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
In Bangkok, Air conditioning, Long term stays allowed, and Shampoo are the most common amenities.
# Sydney - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_Sydney))
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Sydney - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
Sydney is similar to Rio de Janeiro: Kitchen and Wifi are the most common items. Smoke alarms are also notably common.
# Istanbul - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_Istanbul))
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Istanbul - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
Heating is very important in Istanbul.
# Rome - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_Rome))
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Rome - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
In Rome, Heating and Wifi are equally common.
# Mexico City - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_MexicoCity))
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Mexico City - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
Wifi and Essentials are equally common in Mexico City.
# Cape Town - create the WordCloud object
wordcloud = WordCloud(min_word_length =3,
background_color='white')
# generate the word cloud
wordcloud.generate_from_frequencies(CountFrequency(L_CapeTown))
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Cape Town - Airbnb amenities', fontdict = {'fontsize' : 20})
plt.show()
Wifi and Essentials are the most common in Cape Town. Free parking on premises is a notable extra.