The FUNDUS project of Urbanum Lab (a tech lab empowered by the interdisciplinary Urbanum Research Foundation), as its very title suggests, aims to lay the foundations for a globally applicable, open and accessible approach (comprising technological as well as methodological contributions) to assessing how economic factors of living (e.g., real estate prices) correlate with the quality of life in a given urban environment, as captured by environmental, infrastructural and other social indicators.
In this notebook, we focus on a simple, yet central question with manifold interpretations and consequences: do higher (average) real estate prices indicate a greener environment, less prone to the heat island phenomenon? (In simpler terms: does more expensive mean greener and cooler in the summer?)
We use Budapest as an example, and part of our data comes from our own scraper, which collects average property prices from online real estate listings.
In particular, the notebook is structured as follows:
before we conclude and summarize the future steps to take.
This notebook presents the first results of our exploratory data analysis. We would like to get a glimpse
into the data and its possible uses. First, you have to run the property price scraper; you can find more information on
obtaining the property price data in the repository of the project. We assume that you have a WEkEO account; if this is not
the case, please register here. It is good practice to install all the project
requirements into a separate Python 3.9.10 (or higher) virtual environment; use requirements.txt to install all dependencies. Last, you have
to configure your .hdarc
file with your WEkEO credentials. This article shows you how to do this.
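As a rough sketch only (the broker URL and the key names below are placeholders and may differ between hda versions; follow the linked article for the authoritative instructions), the credential file can be created like this:
# illustrative sketch: write a minimal ~/.hdarc with placeholder credentials
# (check the WEkEO/hda documentation for the exact keys and broker URL of your hda version)
from pathlib import Path

hdarc_path = Path.home() / ".hdarc"
if not hdarc_path.exists():
    hdarc_path.write_text(
        "url: https://wekeo-broker.apps.mercator.dpi.wekeo.eu/databroker\n"
        "user: YOUR_WEKEO_USERNAME\n"
        "password: YOUR_WEKEO_PASSWORD\n"
    )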
If you would like to adapt our notebook to a different municipality or a different time horizon, use WEkEO's online data exploration tool; a short sketch of editing the stored queries is given after the query listings below.
You might find the BoundingBox tool useful for obtaining the latitude and longitude coordinates of a given area.
Dataset queries were generated using the WEkEO online platform. The queries can be found in the data/jsons
folder:
{
"datasetId": "EO:CLMS:DAT:CGLS_GLOBAL_LAI300_V1_333M",
"dateRangeSelectValues": [
{
"name": "dtrange",
"start": "2022-06-01T00:00:00.000Z",
"end": "2022-06-30T23:59:59.999Z"
}
]
}
{
"datasetId": "EO:ESA:DAT:SENTINEL-3:SL_2_LST___",
"boundingBoxValues": [
{
"name": "bbox",
"bbox": [
18.99804053609134,
47.42120186691113,
19.190237776905892,
47.58048586099437
]
}
],
"dateRangeSelectValues": [
{
"name": "position",
"start": "2022-06-01T00:00:00.000Z",
"end": "2022-06-30T00:00:00.000Z"
}
],
"stringChoiceValues": [
{
"name": "productType",
"value": "LST"
},
{
"name": "timeliness",
"value": "Near+Real+Time"
},
{
"name": "orbitDirection",
"value": "ascending"
},
{
"name": "processingLevel",
"value": "LEVEL2"
}
]
}
{
"datasetId": "EO:CLMS:DAT:CGLS_GLOBAL_FCOVER300_V1_333M",
"dateRangeSelectValues": [
{
"name": "dtrange",
"start": "2022-06-01T00:00:00.000Z",
"end": "2022-06-30T23:59:59.999Z"
}
]
}
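To adapt the notebook to a different municipality or time horizon (as mentioned above), it should be enough to edit the bounding box and date range of these stored queries. A minimal sketch, assuming you have already looked up the new coordinates with the BoundingBox tool (the example coordinates and the output file name are placeholders):
# illustrative sketch: derive a query for another city / period from the stored Sentinel-3 LST query
# (the coordinates and the output file name below are placeholders)
import json

with open("../data/jsons/temperature.json") as infile:
    query = json.load(infile)

# the bounding box follows the stored query's order: [lon_min, lat_min, lon_max, lat_max]
query["boundingBoxValues"][0]["bbox"] = [16.18, 48.11, 16.58, 48.33]
query["dateRangeSelectValues"][0]["start"] = "2022-07-01T00:00:00.000Z"
query["dateRangeSelectValues"][0]["end"] = "2022-07-31T00:00:00.000Z"

with open("../data/jsons/temperature_vienna.json", "w") as outfile:
    json.dump(query, outfile, indent=2)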
The aggregated property price data can be found in the data/aggregated
folder. WARNING: check the README.md file of the scraper to get your own data. The notebook provides a novel methodology for exploratory data analysis in the field of urban digital geography, instantiated using a limited, yet appropriate example (Budapest, Hungary). At the end of this notebook, you will know:
Here, we import packages for the project.
!pip install h3 altair pydeck xarray hda
import json
import os
from functools import reduce
import h3
import altair as at
import numpy as np
import pandas as pd
import pydeck as pdk
import xarray as xr
from hda import Client
# instantiate the HDA client; credentials are read from your .hdarc file
c = Client(debug=True)
# Sentinel-3 land surface temperature (LST) query
with open("../data/jsons/temperature.json") as infile:
    query = json.load(infile)
matches = c.search(query)
# uncomment to download the matching products
# matches.download()
# leaf area index (LAI) query
c = Client(debug=True)
with open("../data/jsons/lai.json") as infile:
    query = json.load(infile)
matches = c.search(query)
# matches.download()
# fraction of vegetation cover (FCOVER) query
c = Client(debug=True)
with open("../data/jsons/fcover.json") as infile:
    query = json.load(infile)
matches = c.search(query)
# matches.download()
Your operating system and tools might be different; the tips below may be useful on a Linux machine:
- Move the downloaded zip archives into the data folder using the mv command (e.g., `mv "*.zip" ../data`).
- Create a subfolder for each dataset: `cd data; mkdir leaf_data temp_data fcover`.
- Unzip the LAI archives into the leaf_data folder (and, similarly, the FCOVER archives into the fcover folder).
- Unzip the temperature archives into the temp_data folder: `cd temp_data; unzip "*.zip"; rm "*.zip"`.

We scraped a Hungarian real estate listing site to get property prices in Budapest. The listing entries
were geocoded using the geocoder
package. The geo-coordinates were indexed using the H3 hexagonal
geospatial indexing system. You can check the resolution table of the cell
areas here. For more details, you can check
the repository of the scraper.
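As a quick illustration of the indexing step (the coordinate below is an arbitrary point in central Budapest, used for demonstration only), each latitude/longitude pair is mapped to one hexagon id per H3 resolution:
# illustrative example: index an arbitrary central-Budapest point at resolutions 5-8
import h3

lat, lon = 47.4979, 19.0402
for resolution in (5, 6, 7, 8):
    print(resolution, h3.geo_to_h3(lat, lon, resolution))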
The data looks like this:
df5 = pd.read_csv("../data/aggregated/l5.tsv", sep="\t")
df6 = pd.read_csv("../data/aggregated/l6.tsv", sep="\t")
df7 = pd.read_csv("../data/aggregated/l7.tsv", sep="\t")
df8 = pd.read_csv("../data/aggregated/l8.tsv", sep="\t")
df7.head()
The hexagons listed in these files constitute our area of interest.
The code below aggregates the average temperature data on various levels of H3 hashing and writes the results to a tsv file.
h3_l5 = set(df5["l5"])
h3_l6 = set(df6["l6"])
h3_l7 = set(df7["l7"])
h3_l8 = set(df8["l8"])
root_folder = "../data/temp_data"
dirs = [
os.path.join(root_folder, d)
for d in os.listdir(root_folder)
if os.path.isdir(os.path.join(root_folder, d))
]
def is_within_bounding_box(lat, long):
    # approximate bounding box of Budapest
    return 47.392134 < lat < 47.601216 and 18.936234 < long < 19.250031
latlong_temp = {}
for inpath in dirs:
# geodetic_tx.nc -> latitude_tx, longitude_tx
geodetic = xr.open_dataset(
filename_or_obj=os.path.join(inpath, "geodetic_tx.nc"), engine="netcdf4"
)
lat = geodetic.data_vars["latitude_tx"].to_numpy().flatten()
long = geodetic.data_vars["longitude_tx"].to_numpy().flatten()
# met_tx.nc -> temperature_tx
met_tx = xr.open_dataset(
filename_or_obj=os.path.join(inpath, "met_tx.nc"), engine="netcdf4"
)
temp = met_tx.data_vars["temperature_tx"].to_numpy().flatten()
    # LST_ancillary_ds.nc -> NDVI (empty, unfortunately)
lst = xr.open_dataset(
filename_or_obj=os.path.join(inpath, "LST_ancillary_ds.nc"), engine="netcdf4"
)
ndvi = lst.data_vars["NDVI"].to_numpy().flatten()
temp_data = zip(lat, long, temp)
temp_data = (e for e in temp_data if is_within_bounding_box(e[0], e[1]))
for e in temp_data:
k = (e[0], e[1])
if latlong_temp.get(k, False):
latlong_temp[k] = (latlong_temp[k] + e[2]) / 2
else:
latlong_temp[k] = e[2]
with open("../data/temp_budapest.tsv", "w") as outfile:
h = "lat\tlong\tcelsius\tl5\tl6\tl7\tl8\n"
outfile.write(h)
for k, v in latlong_temp.items():
l5 = h3.geo_to_h3(k[0], k[1], 5)
l6 = h3.geo_to_h3(k[0], k[1], 6)
l7 = h3.geo_to_h3(k[0], k[1], 7)
l8 = h3.geo_to_h3(k[0], k[1], 8)
if l5 in h3_l5 and l6 in h3_l6 and l7 in h3_l7 and l8 in h3_l8:
o = (
str(k[0])
+ "\t"
+ str(k[1])
+ "\t"
+ str(v - 273.15)
+ "\t"
+ l5
+ "\t"
+ l6
+ "\t"
+ l7
+ "\t"
+ l8
+ "\n"
)
outfile.write(o)
The code below computes the average LAI and assigns H3 hash codes to the values. The results will be saved into a tsv file.
root_folder = "../data/leaf_data"
fs = [
os.path.join(root_folder, f)
for f in os.listdir(root_folder)
if os.path.isfile(os.path.join(root_folder, f))
]
ll2lai = {}
for f in fs:
try:
ds = xr.open_dataset(filename_or_obj=os.path.join(f), engine="netcdf4")
lat = ds.data_vars["LAI"]["lat"].to_numpy()
lat = [e for e in lat if 47.392134 < e < 47.601216]
lon = ds.data_vars["LAI"]["lon"].to_numpy()
lon = [e for e in lon if 18.936234 < e < 19.250031]
time = ds.data_vars["LAI"]["time"].to_numpy()[0]
for i in range(len(lat)):
for j in range(len(lon)):
one_point = ds["LAI"].sel(lat=lat[i], lon=lon[i])
vals = one_point.values[0]
if ll2lai.get((lat[i], lon[j]), False):
ll2lai[(lat[i], lon[j])] = (ll2lai[(lat[i], lon[j])] + vals) / 2.0
else:
ll2lai[(lat[i], lon[j])] = vals
except Exception as exc1:
print(exc1)
continue
with open("../data/lai_budapest.tsv", "w") as outfile:
h = "lat\tlong\tlai\tl5\tl6\tl7\tl8\n"
outfile.write(h)
for k, v in ll2lai.items():
h5 = h3.geo_to_h3(k[0], k[1], 5)
h6 = h3.geo_to_h3(k[0], k[1], 6)
h7 = h3.geo_to_h3(k[0], k[1], 7)
h8 = h3.geo_to_h3(k[0], k[1], 8)
if h5 in h3_l5 and h6 in h3_l6 and h7 in h3_l7 and h8 in h3_l8:
o = (
str(k[0])
+ "\t"
+ str(k[1])
+ "\t"
+ str(v)
+ "\t"
+ str(h5)
+ "\t"
+ str(h6)
+ "\t"
+ str(h7)
+ "\t"
+ str(h8)
+ "\n"
)
outfile.write(o)
The code below computes the average FCOVER and assigns H3 hash codes to the values. The results will be saved into a tsv file.
root_folder = "../data/fcover"
fs = [
os.path.join(root_folder, f)
for f in os.listdir(root_folder)
if os.path.isfile(os.path.join(root_folder, f))
]
def is_within_bounding_box(lat, long):
    # approximate bounding box of Budapest (same as above)
    return 47.392134 < lat < 47.601216 and 18.936234 < long < 19.250031
ll2fcover = {}
for f in fs:
try:
ds = xr.open_dataset(filename_or_obj=os.path.join(f), engine="netcdf4")
lat = ds.data_vars["FCOVER"]["lat"].to_numpy()
lat = [e for e in lat if 47.392134 < e < 47.601216]
lon = ds.data_vars["FCOVER"]["lon"].to_numpy()
lon = [e for e in lon if 18.936234 < e < 19.250031]
time = ds.data_vars["FCOVER"]["time"].to_numpy()[0]
for i in range(len(lat)):
for j in range(len(lon)):
one_point = ds["FCOVER"].sel(lat=lat[i], lon=lon[i])
vals = one_point.values[0]
if ll2fcover.get((lat[i], lon[j]), False):
ll2fcover[(lat[i], lon[j])] = (
ll2fcover[(lat[i], lon[j])] + vals
) / 2.0
else:
ll2fcover[(lat[i], lon[j])] = vals
except Exception as exc1:
print(exc1)
continue
with open("../data/fcover_budapest.tsv", "w") as outfile:
h = "lat\tlong\tfcover\tl5\tl6\tl7\tl8\n"
outfile.write(h)
for k, v in ll2fcover.items():
h5 = h3.geo_to_h3(k[0], k[1], 5)
h6 = h3.geo_to_h3(k[0], k[1], 6)
h7 = h3.geo_to_h3(k[0], k[1], 7)
h8 = h3.geo_to_h3(k[0], k[1], 8)
if h5 in h3_l5 and h6 in h3_l6 and h7 in h3_l7 and h8 in h3_l8:
o = (
str(k[0])
+ "\t"
+ str(k[1])
+ "\t"
+ str(v)
+ "\t"
+ str(h5)
+ "\t"
+ str(h6)
+ "\t"
+ str(h7)
+ "\t"
+ str(h8)
+ "\n"
)
outfile.write(o)
df_price = pd.read_csv("../data/aggregated/l7.tsv", sep="\t")
df_price["normalized"] = 255 - (df_price["price"] / np.sqrt(np.sum(df_price["price"] ** 2)) * 1000)
layer = pdk.Layer(
"H3HexagonLayer",
df_price,
get_hexagon="l7",
auto_highlight=True,
# elevation_scale=10,
pickable=True,
# elevation_range=[min(df["price"]), max(df["price"])],
extruded=True,
coverage=0.8,
opacity=0.01,
get_fill_color="[255, normalized, 0]",
)
view_state = pdk.ViewState(
latitude=47.500000, longitude=19.040236, zoom=10.5, bearing=0, pitch=35
)
r = pdk.Deck(
layers=[layer],
initial_view_state=view_state,
tooltip={"text": "square meter price: {price}"},
)
r.to_html("../vizs/maps/prices_h7.html")
df = pd.read_csv("../data/temp_budapest.tsv", sep="\t")
df.fillna(0, inplace=True)
df_temp = df.groupby("l7").mean()
df_temp.reset_index(inplace=True, level=["l7"])
df_temp["rescaled"] = [255 - ((e**3)/100) for e in df_temp["celsius"]]
layer = pdk.Layer(
"H3HexagonLayer",
df_temp,
get_hexagon="l7",
auto_highlight=True,
pickable=True,
extruded=True,
coverage=0.8,
opacity=0.05,
get_fill_color="[255, rescaled, 0]",
)
view_state = pdk.ViewState(
latitude=47.500000, longitude=19.040236, zoom=10.5, bearing=0, pitch=35
)
r = pdk.Deck(
layers=[layer],
initial_view_state=view_state,
tooltip={"text": "temperature (celsius): {celsius}"},
)
r.to_html("../vizs/maps/temperature_h7.html")
df = pd.read_csv("../data/lai_budapest.tsv", sep="\t")
df.fillna(0, inplace=True)
df_lai = df.groupby("l7").mean()
df_lai.reset_index(inplace=True, level=["l7"])
layer = pdk.Layer(
"H3HexagonLayer",
df_lai,
get_hexagon="l7",
auto_highlight=True,
pickable=True,
extruded=True,
coverage=0.9,
opacity=0.05,
get_fill_color="[255, 255 - (lai * 100), 0]"
)
view_state = pdk.ViewState(
latitude=47.500000, longitude=19.040236, zoom=10.5, bearing=0, pitch=35
)
r = pdk.Deck(
layers=[layer],
initial_view_state=view_state,
tooltip={"text": "Leaf Area Index: {lai}"},
)
r.to_html("../vizs/maps/lai_h7.html")
df = pd.read_csv("../data/fcover_budapest.tsv", sep="\t")
df.fillna(0.0, inplace=True)
df_fcover = df.groupby("l7").mean()
df_fcover.reset_index(inplace=True, level=["l7"])
df_fcover["normalized"] = (df_fcover["fcover"] / np.sqrt(np.sum(df_fcover["fcover"] ** 5))) ** -2
df_fcover["normalized"][df_fcover["normalized"] == np.inf] = 255
layer = pdk.Layer(
"H3HexagonLayer",
df_fcover,
get_hexagon="l7",
auto_highlight=True,
pickable=True,
extruded=True,
coverage=0.8,
opacity=0.05,
get_fill_color="[255, normalized, 0]",
)
view_state = pdk.ViewState(
latitude=47.500000, longitude=19.040236, zoom=10.5, bearing=0, pitch=35
)
r = pdk.Deck(
layers=[layer],
initial_view_state=view_state,
tooltip={"text": "Fraction of Vegetation Cover: {fcover}"},
)
r.to_html("../vizs/maps/fcover_h7.html")
We would like to test the common belief that wealthier neighborhoods are greener and also enjoy a lower average temperature during summer. Since we have no data on wealth at this granularity, we use property square-meter prices as a proxy for wealth. This is a strong, yet rarely assessed assumption. In particular, we test if
are connected.
We present our findings as interactive visualizations. We start with a relatively fine H3 resolution (7), giving us a fairly tight coverage of the geographical area under investigation. However, since the geographical resolution might not match the resolution of the economic data, we also experiment with lower (coarser) H3 resolutions.
at.renderers.enable('default')
data_frames = [df_price, df_temp, df_lai, df_fcover]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['l7'],
how='outer'), data_frames)
df_merged.dropna(inplace=True)
df_merged.drop(columns=["normalized_x", "lat_x", "long_x", "rescaled", "lat_y",
"long_y", "lat", "long", "normalized_y"], inplace=True)
df_merged.head()
cor_data = (df_merged.corr().stack().reset_index().rename(columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'}))
cor_data['correlation_label'] = cor_data['correlation'].map('{:.2f}'.format) # Round to 2 decimal
cor_data
base = at.Chart(cor_data).encode(
x='variable2:O',
y='variable:O'
)
text = base.mark_text().encode(
text='correlation_label',
color=at.condition(
at.datum.correlation > 0.5,
at.value('white'),
at.value('black')
)
)
cor_plot = base.mark_rect().encode(
color='correlation:Q'
).properties(
width=700,
height=700
)
cor_plot + text
Since our price data is collected at the street-name level, a lower (coarser) H3 resolution may be more appropriate.
price_h3_df = pd.read_csv("../data/aggregated/l6.tsv", sep="\t")
temp_df = pd.read_csv("../data/temp_budapest.tsv", sep="\t")
lai_df = pd.read_csv("../data/lai_budapest.tsv", sep="\t")
fcover_df = pd.read_csv("../data/fcover_budapest.tsv", sep="\t")
temp_h3_df = temp_df.groupby("l6").mean()
temp_h3_df.reset_index(inplace=True, level=["l6"])
lai_h3_df = lai_df.groupby("l6").mean()
lai_h3_df.reset_index(inplace=True, level=["l6"])
fcover_h3_df = fcover_df.groupby("l6").mean()
fcover_h3_df.reset_index(inplace=True, level=["l6"])
h3_data_frames = [price_h3_df, temp_h3_df, lai_h3_df, fcover_h3_df]
df_merged_h3 = reduce(lambda left,right: pd.merge(left,right,on=['l6'],
how='outer'), h3_data_frames)
df_merged_h3.dropna(inplace=True)
df_merged_h3.drop(columns=["lat_x", "long_x", "lat_y", "long_y", "lat", "long"], inplace=True)
df_merged_h3.head()
cor_data_h3 = (df_merged_h3.corr().stack().reset_index().rename(
columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'}))
cor_data_h3['correlation_label'] = cor_data_h3['correlation'].map('{:.2f}'.format)
cor_data_h3
base = at.Chart(cor_data_h3).encode(
x='variable2:O',
y='variable:O'
)
text = base.mark_text().encode(
text='correlation_label',
color=at.condition(
at.datum.correlation > 0.5,
at.value('white'),
at.value('black')
)
)
cor_plot = base.mark_rect().encode(
color='correlation:Q'
).properties(
width=700,
height=700
)
cor_plot + text
price_h3_df = pd.read_csv("../data/aggregated/l5.tsv", sep="\t")
temp_df = pd.read_csv("../data/temp_budapest.tsv", sep="\t")
lai_df = pd.read_csv("../data/lai_budapest.tsv", sep="\t")
fcover_df = pd.read_csv("../data/fcover_budapest.tsv", sep="\t")
temp_h3_df = temp_df.groupby("l5").mean()
temp_h3_df.reset_index(inplace=True, level=["l5"])
lai_h3_df = lai_df.groupby("l5").mean()
lai_h3_df.reset_index(inplace=True, level=["l5"])
fcover_h3_df = fcover_df.groupby("l5").mean()
fcover_h3_df.reset_index(inplace=True, level=["l5"])
h3_data_frames = [price_h3_df, temp_h3_df, lai_h3_df, fcover_h3_df]
df_merged_h3 = reduce(lambda left,right: pd.merge(left,right,on=['l5'],
how='outer'), h3_data_frames)
df_merged_h3.dropna(inplace=True)
df_merged_h3.drop(columns=["lat_x", "long_x", "lat_y", "long_y", "lat", "long"], inplace=True)
df_merged_h3
cor_data_h3 = (df_merged_h3.corr().stack().reset_index().rename(
columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'}))
cor_data_h3['correlation_label'] = cor_data_h3['correlation'].map('{:.2f}'.format)
cor_data_h3
base = at.Chart(cor_data_h3).encode(
x='variable2:O',
y='variable:O'
)
text = base.mark_text().encode(
text='correlation_label',
color=at.condition(
at.datum.correlation > 0.5,
at.value('white'),
at.value('black')
)
)
cor_plot = base.mark_rect().encode(
color='correlation:Q'
).properties(
width=700,
height=700
)
cor_plot + text
In this part of the notebook, we demonstrate our unique approach to multi-domain data synthesis, while bringing in two further data sources with a distinct community-driven character.
OpenStreetMap is an open neogeography platform: the community defines the semantics of the map and populates it with POIs, i.e., data entries referring to actual places of a certain character. That character is captured by features and subfeatures, i.e., a more abstract and a more detailed categorization.
In the following, we create an integrated, multi-domain sociogeographical dataset by counting the number of POIs of certain categories.
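A minimal sketch of this counting step is shown below, assuming a hypothetical dataframe of geocoded POIs with 'lat', 'lon' and 'key' (feature) columns; the actual extraction lives in the project repository, so the column names and sample values here are illustrative only:
# illustrative sketch: count OSM POIs per feature ("key") and per H3 level-5 hexagon
# (pois is a hypothetical dataframe; the real data is prepared in the project repository)
import h3
import pandas as pd

pois = pd.DataFrame(
    {"lat": [47.4979, 47.5316], "lon": [19.0402, 19.0366], "key": ["amenity", "leisure"]}
)
pois["l5"] = [h3.geo_to_h3(lat, lon, 5) for lat, lon in zip(pois["lat"], pois["lon"])]
poi_counts = pois.groupby(["l5", "key"]).size().reset_index(name="0")
poi_counts.head()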
This observational dataset is further augmented with data from another community-driven platform: Járókelő is a civic engagement platform for reporting issues in the urban infrastructure and environment, and for following up on their status. We consider all cases (over 10,000) submitted in 2022, up to August 20.
As this is a Hungarian platform and, for the sake of compatibility, we kept the Hungarian category names as feature labels in the dataset, we provide their English translations here:
The following cells provide a data summary and a correlation matrix for OSM features (and subfeatures, respectively) and Járókelő entries, aggregated over H3 hexagons at resolution level 5 (the 'l5' hashes identify the hexagons). The merged table (the df_merged_h3 dataframe) has been collapsed for readability.
price_h3_df = pd.read_csv("../data/aggregated/l5.tsv", sep="\t")
temp_df = pd.read_csv("../data/temp_budapest.tsv", sep="\t")
lai_df = pd.read_csv("../data/lai_budapest.tsv", sep="\t")
fcover_df = pd.read_csv("../data/fcover_budapest.tsv", sep="\t")
temp_h3_df = temp_df.groupby("l5").mean()
temp_h3_df.reset_index(inplace=True, level=["l5"])
lai_h3_df = lai_df.groupby("l5").mean()
lai_h3_df.reset_index(inplace=True, level=["l5"])
fcover_h3_df = fcover_df.groupby("l5").mean()
fcover_h3_df.reset_index(inplace=True, level=["l5"])
osm_pois = pd.read_csv("../data/osm/key_l5.tsv", sep="\t")
osm_pois = osm_pois.pivot_table(values="0", index=osm_pois.l5, columns="key", aggfunc="first")
osm_pois.reset_index(inplace=True, level=["l5"])
osm_pois.fillna(0, inplace=True)
jarokelo = pd.read_csv("../data/jarokelo/jarokelo_l5.tsv", sep="\t")
jarokelo = jarokelo.pivot_table(values="0", index=jarokelo.l5, columns="Category", aggfunc="first")
jarokelo.reset_index(inplace=True, level=["l5"])
jarokelo.fillna(0, inplace=True)
h3_data_frames = [price_h3_df, temp_h3_df, lai_h3_df, fcover_h3_df, osm_pois, jarokelo]
df_merged_h3 = reduce(lambda left,right: pd.merge(left,right,on=['l5'],
how='outer'), h3_data_frames)
df_merged_h3.dropna(inplace=True)
df_merged_h3.drop(columns=["lat_x", "long_x", "lat_y", "long_y", "lat", "long"], inplace=True)
df_merged_h3
cor_data_h3 = (df_merged_h3.corr().stack().reset_index().rename(
columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'}))
cor_data_h3['correlation_label'] = cor_data_h3['correlation'].map('{:.2f}'.format)
cor_data_h3
base = at.Chart(cor_data_h3).encode(
x='variable2:O',
y='variable:O'
)
text = base.mark_text().encode(
text='correlation_label',
color=at.condition(
at.datum.correlation > 0.5,
at.value('white'),
at.value('black')
)
)
cor_plot = base.mark_rect().encode(
color='correlation:Q'
).properties(
width=700,
height=700
)
cor_plot + text
price_h3_df = pd.read_csv("../data/aggregated/l5.tsv", sep="\t")
temp_df = pd.read_csv("../data/temp_budapest.tsv", sep="\t")
lai_df = pd.read_csv("../data/lai_budapest.tsv", sep="\t")
fcover_df = pd.read_csv("../data/fcover_budapest.tsv", sep="\t")
temp_h3_df = temp_df.groupby("l5").mean()
temp_h3_df.reset_index(inplace=True, level=["l5"])
lai_h3_df = lai_df.groupby("l5").mean()
lai_h3_df.reset_index(inplace=True, level=["l5"])
fcover_h3_df = fcover_df.groupby("l5").mean()
fcover_h3_df.reset_index(inplace=True, level=["l5"])
osm_pois = pd.read_csv("../data/osm/value_l5.tsv", sep="\t")
osm_pois = osm_pois.pivot_table(values="0", index=osm_pois.l5, columns="value", aggfunc="first")
osm_pois.reset_index(inplace=True, level=["l5"])
osm_pois.fillna(0, inplace=True)
jarokelo = pd.read_csv("../data/jarokelo/jarokelo_l5.tsv", sep="\t")
jarokelo = jarokelo.pivot_table(values="0", index=jarokelo.l5, columns="Category", aggfunc="first")
jarokelo.reset_index(inplace=True, level=["l5"])
jarokelo.fillna(0, inplace=True)
h3_data_frames = [price_h3_df, temp_h3_df, lai_h3_df, fcover_h3_df, osm_pois, jarokelo]
df_merged_h3 = reduce(lambda left,right: pd.merge(left,right,on=['l5'],
how='outer'), h3_data_frames)
df_merged_h3.dropna(inplace=True)
df_merged_h3.drop(columns=["lat_x", "long_x", "lat_y", "long_y", "lat", "long"], inplace=True)
df_merged_h3
cor_data_h3 = (df_merged_h3.corr().stack().reset_index().rename(
columns={0: 'correlation', 'level_0': 'variable', 'level_1': 'variable2'}))
cor_data_h3['correlation_label'] = cor_data_h3['correlation'].map('{:.2f}'.format)
cor_data_h3
from altair import pipe, limit_rows, to_values
t = lambda data: pipe(data, limit_rows(max_rows=50000), to_values)
at.data_transformers.register('custom', t)
at.data_transformers.enable('custom')
base = at.Chart(cor_data_h3).encode(
x='variable2:O',
y='variable:O'
)
text = base.mark_text().encode(
text='correlation_label',
color=at.condition(
at.datum.correlation > 0.5,
at.value('white'),
at.value('black')
)
)
cor_plot = base.mark_rect().encode(
color='correlation:Q'
).properties(
width=10000,
height=10000
)
cor_plot + text
The present notebook fulfills a threefold purpose: a hands-on educational, an analytical and an integrational one.
The first part of the notebook presents a practical approach to aggregating and visualizing geosocial data using h3 and pydeck.
As for the technosocial analysis, we investigated the connection between property prices and environmental factors. In particular, we wanted to see whether owning a more expensive property in Budapest, Hungary correlates with a greener environment and a cooler summer. We used WEkEO data to obtain information on the environment, and we scraped property prices to supplement it. Interestingly, using lower (but still not too coarse) geographical resolutions provided us with some notable (even if not very surprising) correlations between more expensive properties and the greenness of their environment (using both the LAI and the FCOVER measure of greenness). However, we could not verify any significant direct connection between higher prices and more acceptable temperature conditions. This might be a first step towards an important future investigation: whether the consequences of climate change can be effectively tackled at the individual level by spending more money. (Maybe they cannot.)
The third aim of our project is to lay a foundation (another reason for the name FUNDUS) for a technosocial data integration and analysis platform. To this end, we have created a novel, multi-domain integration which combines, in a unique way, heterogeneous yet semantically interlinked data domains: FUNDUS becomes a representation of the networked society, fusing the economic, environmental, urban-geographical and civic aspects of our lives.
In the future, we would like to increase the time window to get more LAI and FCOVER data. We will also investigate the possibility of incorporating other datasets like OSM and of obtaining official statistics on house prices and/or income, widening the scope of the FUNDUS project to finally turn it into a full-fledged, environmentally and economically conscious quality-of-life assessment approach. As already mentioned, a particularly interesting research question arises from the present notebook: can money effectively counter the effects of the climate crisis in our direct surroundings? To learn more, we shall primarily proceed by expanding the geographical scope of the present notebook, and perhaps by considering further economic factors. Here, the scarcity of open data is still a major impediment, but we hope our endeavor can contribute to improving the situation.