Urban Greenspace and Asthma Prevalence¶

Get Started with Big Data Pipelines¶

Vegetation has the potential to provide many ecosystem services in urban areas, such as cleaner air and water and flood mitigation. However, the results are mixed on relationships between a simple measurement of vegetation cover (such as average NDVI, a measurement of vegetation health) and human health. We do, however, find relationships between landscape metrics that attempt to quantify the connectivity and structure of greenspace and human health. These types of metrics include mean patch size, edge density, and fragmentation.

In this notebook, you will write code to calculate patch, edge, and fragmentation statistics about urban greenspace in Chicago. These statistics should be reflective of the connectivity and spread of urban greenspace, which are important for ecosystem function and access. You will then use a linear model to identify statistically significant correlations between the distribution of greenspace and health data compiled by the US Center for Disease Control.

Read More¶

Check out this study by Tsai et al. 2019 on the relationship between edge density and life expectancy in Baltimore, MD. The authors also discuss the influence of scale (e.g.Ā the resolution of the imagery) on these relationships, which is important for this case study.

Full citation: Tsai, Wei-Lun, Yu-Fai Leung, Melissa R. McHale, Myron F. Floyd, and Brian J. Reich. 2019. ā€œRelationships Between Urban Green Land Cover and Human Health at Different Spatial Resolutions.ā€ Urban Ecosystems 22 (2): 315–24. https://doi.org/10.1007/s11252-018-0813-3.

Working with larger-than-memory (big) data¶

For this project, we are going to split up the green space (NDVI) data by census tract, because this matches the human health data from the CDC. If we were interested in the average NDVI at this spatial scale, we could easily use a source of multispectral data like Landsat (30m) or even MODIS (250m) without a noticeable impact on our results. However, because we need to know more about the structure of green space within each tract, we need higher resolution data. For that, we will access the National Agricultural Imagery Program (NAIP) data, which is taken for the continental US every few years at 1m resolution. That’s enough to see individual trees and cars! The main purpose of the NAIP data is, as the name suggests, agriculture. However, it’s also a great source for urban areas where lots is happening on a very small scale.

The NAIP data for the City of Chicago takes up about 20GB of space. This amount of data is likely to crash your kernel if you try to load it all in at once. It also would be inconvenient to store on your harddrive so that you can load it in a bit at a time for analysis. Even if you are using a computer that would be able to handle this amount of data, imagine if you were analysing the entire United States over multiple years!

To help with this problem, you will use cloud-based tools to calculate your statistics instead of downloading rasters to your computer or container. You can crop the data entirely in the cloud, thereby saving on your memory and internet connection, using rioxarray.

Check your work with testing!¶

This notebook does not have pre-built tests. You will need to write your own test code to make sure everything is working the way that you want. For many operations, this will be as simple as creating a plot to check that all your data lines up spatially the way you were expecting, or printing values as you go. However, if you don’t test as you go, you are likely to end up with intractable problems with the final code.

STEP 1: Set up your analysis¶

Try It

As always, before you get started:

  1. Import necessary packages
  2. Create reproducible file paths for your project file structure.
  3. To use cloud-optimized GeoTiffs, we recommend some settings to make sure your code does not get stopped by a momentary connection lapse – see the code cell below.
InĀ [60]:
!pip install sodapy
Requirement already satisfied: sodapy in c:\users\seism\miniconda3\envs\earth-analytics-python\lib\site-packages (2.2.0)
Requirement already satisfied: requests>=2.28.1 in c:\users\seism\miniconda3\envs\earth-analytics-python\lib\site-packages (from sodapy) (2.32.5)
Requirement already satisfied: charset_normalizer<4,>=2 in c:\users\seism\miniconda3\envs\earth-analytics-python\lib\site-packages (from requests>=2.28.1->sodapy) (3.4.3)
Requirement already satisfied: idna<4,>=2.5 in c:\users\seism\miniconda3\envs\earth-analytics-python\lib\site-packages (from requests>=2.28.1->sodapy) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\seism\miniconda3\envs\earth-analytics-python\lib\site-packages (from requests>=2.28.1->sodapy) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\seism\miniconda3\envs\earth-analytics-python\lib\site-packages (from requests>=2.28.1->sodapy) (2025.8.3)
InĀ [61]:
### Import libraries

### file paths and organization
import os
import pathlib
from pathlib import Path

### see warnings
import warnings

### deal with different data types
import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
import rioxarray as rxr
import rioxarray.merge as rxrmerge
import shapely
from shapely.geometry import box
import time

### plotting and visualization
import geoviews as gv
import holoviews as hv
import hvplot.pandas
import hvplot.xarray
from cartopy import crs as ccrs

### data exploration and visualiation
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats

### image processing
from scipy.ndimage import convolve
from scipy.ndimage import label

### working with data in the cloud
import pystac_client
from sodapy import Socrata

### progress bar
from tqdm.notebook import tqdm

### OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

import json
InĀ [62]:
### Create reproducible file paths
data_dir = os.path.join(

    ### home directory
    pathlib.Path.home(),

    ### Earth analytics folder
    'earth-analytics',
    'data',

    ## project directory
    'urban-green-big-data2',
)

### make the dir
os.makedirs(data_dir, exist_ok = True)
print(data_dir)
C:\Users\seism\earth-analytics\data\urban-green-big-data2
InĀ [63]:
### revent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

STEP 2: Create a site map¶

We’ll be using the Center for Disease Control (CDC) Places dataset for human health data to compare with vegetation metrics. CDC Places also provides some modified census tracts, clipped to the city boundary, to go along with the health data. We’ll start by downloading the matching geographic data, and then select the City of Chicago.

Try It
  1. Download the Census tract Shapefile that goes along with CDC Places
  2. Use a row filter to select only the census tracts in Chicago
  3. Use a conditional statement to cache your download. There is no need to cache the full dataset – stick with your pared down version containing only Chicago.
InĀ [64]:
### Sanity check
# tract_gdf.columns

### You were not kidding when you said it was difficult to get Portland Oregon
# tract_gdf["ST"].value_counts().head(20)
# tract_gdf.loc[tract_gdf["PlaceName"].astype(str).str.contains("Portland", na=False), ["PlaceName","ST"]].drop_duplicates()
InĀ [65]:
### Set up the census tract path

### define directory
tract_dir = os.path.join(data_dir, 'portland-tract')

### make the directory
os.makedirs(tract_dir, exist_ok = True)

### make path to directory
tract_path = os.path.join(tract_dir, 'portland-tract.shp')

### Download the census tracts (only once)
if not os.path.exists(tract_path):

    ### put in census url
    tract_url = ('https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip')

    ### read the shp file
    tract_gdf = gpd.read_file(tract_url)

    ### subset to Portland
    por_tract_gdf = tract_gdf[
    (tract_gdf["PlaceName"] == "Portland") &
    (tract_gdf["ST"].astype(str).str.zfill(2) == "41")
    ].copy()

    ### save it as a file
    por_tract_gdf.to_file(tract_path, index = False)

por_tract_gdf.shape
por_tract_gdf[["PlaceName","ST"]].drop_duplicates()
por_tract_gdf.geometry.isna().sum(), len(por_tract_gdf)
Out[65]:
(np.int64(0), 161)
InĀ [66]:
### Load in census tract data
por_tract_gdf = gpd.read_file(tract_path)
por_tract_gdf


### Site plot -- Census tracts with satellite imagery in the background
(
    por_tract_gdf

    ### set projection
    .to_crs(ccrs.Mercator())

    ### plot with satellite imagery in the background
    .hvplot(
        line_color = 'orange', fill_color = None,
        crs = ccrs.Mercator(), tiles = 'EsriImagery',
        frame_width = 600
    )
)
Out[66]:

STEP 3 - Access Asthma and Urban Greenspaces Data¶

Step 3a - Access human health data¶

The U.S. Center for Disease Control (CDC) provides a number of health variables through their Places Dataset that might be correlated with urban greenspace. For this assignment, start with adult asthma. Try to limit the data as much as possible for download. Selecting the state and county is a one way to do this.

Try It
  1. You can access Places data with an API, but as with many APIs it is easier to test out your search before building a URL. Navigate to the Places Census Tract Data Portal and search for the data you want.
  2. The data portal will make an API call for you, but there is a simpler, easier to read and modify way to form an API call. Check out to the socrata documentation to see how. You can also find some limited examples and a list of available parameters for this API on CDC Places SODA Consumer API Documentation.
  3. Once you have formed your query, you may notice that you have exactly 1000 rows. The Places SODA API limits you to 1000 records in a download. Either narrow your search or check out the &$limit= parameter to increase the number of rows downloaded. You can find more information on the Paging page of the SODA API documentation
  4. You should also clean up this data by renaming the 'data_value' to something descriptive, and possibly selecting a subset of columns.
InĀ [67]:
### debugging
# tmp = pd.read_csv(cdc_url)
# tmp.columns
InĀ [68]:
### Set up a path for the asthma data
cdc_path = os.path.join(data_dir, 'asthma.csv')

### Download asthma data (only once)
if not os.path.exists(cdc_path):

    cdc_url = (

        ### url for data
        "https://data.cdc.gov/resource/cwsq-ngmh.csv"

        ### parameter to filter on
        "?year=2023"
        "&stateabbr=OR"
        "&countyname=Multnomah"
        "&measureid=CASTHMA"
        "&$limit=1500"
    )

    cdc_df = (

        ### open it as a csv
        pd.read_csv(cdc_url)

        ### rename the cols
        .rename(columns = {
            'data_value': 'asthma',
            'low_confidence_limit' : 'asthma_ci_low',
            'high_confidence_limit': 'asthma_ci_high',
            'locationid': 'tract'
        })


        ### select columns I want to keep
        [[
            'year', 'tract', 
            'asthma', 'asthma_ci_low', 'asthma_ci_high',
            'data_value_unit', 'totalpopulation',
            'totalpop18plus'
        ]]
    )

    cdc_df['tract'] = cdc_df['tract'].astype(str)
    
    ### download to a csv
    cdc_df.to_csv(cdc_path, index = False)


### Load in asthma data
cdc_df = pd.read_csv(cdc_path, dtype={"tract": str})

### Preview asthma data
cdc_df
Out[68]:
year tract asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus
0 2023 41051005202 11.0 9.8 12.3 % 2937 2792
1 2023 41051002002 12.1 10.8 13.5 % 3692 3465
2 2023 41051003803 11.5 10.3 12.9 % 4385 3655
3 2023 41051001602 11.6 10.3 12.9 % 4616 3786
4 2023 41051006300 11.7 10.5 13.2 % 5616 4694
... ... ... ... ... ... ... ... ...
192 2023 41051009202 13.0 11.6 14.4 % 5459 4135
193 2023 41051002402 11.3 10.2 12.6 % 3830 3560
194 2023 41051008203 12.6 11.2 13.9 % 5608 4637
195 2023 41051004300 11.2 10.0 12.4 % 1176 953
196 2023 41051008800 12.2 10.9 13.6 % 4204 3462

197 rows Ɨ 8 columns

Step 3b - Join health data with census tract boundaries¶

Try It
  1. Join the census tract GeoDataFrame with the asthma prevalence DataFrame using the .merge() method.
  2. You will need to change the data type of one of the merge columns to match, e.g.Ā using .astype('int64')
  3. There are a few census tracts in Chicago that do not have data. You should be able to confirm that they are not listed through the interactive Places Data Portal. However, if you have large chunks of the city missing, it may mean that you need to expand the record limit for your download.
InĀ [69]:
### Sanity check
cdc_df["tract"].head()
cdc_df["tract"].str.len().value_counts()
Out[69]:
tract
11    197
Name: count, dtype: int64
InĀ [70]:
### Sanity check
por_tract_gdf["tract2010"].astype(str).str.len().value_counts()
Out[70]:
tract2010
11    161
Name: count, dtype: int64
InĀ [71]:
### Change tract identifier datatype for merging
por_tract_gdf["tract2010"] = por_tract_gdf["tract2010"].astype(str)
cdc_df["tract"] = cdc_df["tract"].astype(str)

### merge census data with geometry
tract_cdc_gdf = (
    por_tract_gdf
    .merge(cdc_df,
           
           ### specify cols to merge on
           left_on = 'tract2010',
           right_on = 'tract',

           ### use inner join
           how = 'inner'
           )
)
# tract_cdc_gdf
### Plot asthma data as chloropleth
(
    ### load basemap with satellite imagery
    gv.tile_sources.EsriImagery

    *

    ### map census tracts with asthma data
    gv.Polygons(

        ### reproject to align CRSs
        tract_cdc_gdf.to_crs(ccrs.Mercator()),

        ### select cols
        vdims = ['asthma', 'tract2010'],

        ### ensure CRSs align
        crs = ccrs.Mercator()
    ).opts(color = 'asthma', colorbar = True, tools = ['hover'], colorbar_opts={
            'title': 'Asthma prevalence (%)'})
).opts(width = 600, height = 600, xaxis = None, yaxis = None, title='Asthma Prevalence by Census Tract (Portland)')
Out[71]:
InĀ [86]:
### generate for the portfolio
asthma_map = (
    gv.tile_sources.EsriImagery
    *
    gv.Polygons(
        tract_cdc_gdf.to_crs(ccrs.Mercator()),
        vdims=['asthma', 'tract2010'],
        crs=ccrs.Mercator()
    ).opts(
        color='asthma',
        colorbar=True,
        tools=['hover'],
        colorbar_opts={'title': 'Asthma prevalence (%)'}
    )
).opts(
    width=500, height=500,
    xaxis=None, yaxis=None,
    title='Asthma Prevalence by Census Tract (Portland)'
)

asthma_map
Out[86]:
InĀ [87]:
hv.save(asthma_map, "portland_asthma_map.html")

Step 3c - Get data URLs for urban greenspace imagery¶

NAIP data are freely available through the Microsoft Planetary Computer SpatioTemporal Access Catalog (STAC).

Try It

Get started with STAC by accessing the planetary computer catalog with the following code:

e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
InĀ [13]:
### Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
Try It
  1. Using a loop, for each Census Tract:

    1. Use the following sample code to search for data, replacing the names with applicable values with descriptive names:

      search = e84_catalog.search(
          collections=["naip"],
          intersects=shapely.to_geojson(tract_geometry),
          datetime=f"{year}"
      )
    2. Access the url using search.assets['image'].href

  2. Accumulate the urls in a pd.DataFrame or dict for later

  3. Occasionally you may find that the STAC service is momentarily unavailable. You should include code that will retry the request up to 5 times when you get the pystac_client.exceptions.APIError.

Warning

As always – DO NOT try to write this loop all at once! Stick with one step at a time, making sure to test your work. You also probably want to add a break into your loop to stop the loop after a single iteration. This will help prevent long waits during debugging.

InĀ [14]:
### Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)

### Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'portland-ndvi-stats.csv')

### Check for existing data - do not access duplicate tracts

### initialize an empty list for the census tract IDs
downloaded_tracts = []

### write the conditional
if os.path.exists(ndvi_stats_path):

    ### if it exists, open the csvs in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)

    ### fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values

### if it doesn't exist, print statement
else:
    print('No census tracts downloaded so far')
No census tracts downloaded so far
InĀ [15]:
print("tract_latlong_gdf rows:", len(tract_latlong_gdf))
print("geometry missing:", tract_latlong_gdf.geometry.isna().sum())
print("CRS:", tract_latlong_gdf.crs)
tract_latlong_gdf rows: 125
geometry missing: 0
CRS: EPSG:4326
InĀ [16]:
print("ndvi_stats_path exists:", os.path.exists(ndvi_stats_path))
print("downloaded_tracts count:", len(downloaded_tracts))
print("downloaded_tracts sample:", downloaded_tracts[:5] if len(downloaded_tracts) else [])
ndvi_stats_path exists: False
downloaded_tracts count: 0
downloaded_tracts sample: []
InĀ [17]:
### run 1 iteration
i = 0
tract_values = tract_latlong_gdf.iloc[i]
tract = tract_values.tract2010
tract
Out[17]:
'41051000301'
InĀ [18]:
### check if stats already downloaded for this tract
if not (tract in downloaded_tracts):
    print('proceed with the STAC search')
proceed with the STAC search
InĀ [19]:
### define naip_search
naip_search = e84_catalog.search(

    ### we want naip data
    collections = ['naip'],

    ### 2021 naip imagery
    intersects = shapely.to_geojson(tract_values.geometry), 
    datetime = '2021'
)
InĀ [20]:
### Sanity check
items = list(naip_search.items())
len(items)
Out[20]:
0
InĀ [21]:
### Concerned that NAIP acquisition for Oregon might have been skipped in 2021, does NAIP return anything if I remove the year?

geom = tract_latlong_gdf.geometry.iloc[0]
geom_geojson = json.loads(shapely.to_geojson(geom))  # <-- dict, not a string

search = e84_catalog.search(
    collections=["naip"],
    intersects=geom_geojson,
    # datetime omitted
    limit=5
)

items = list(search.items())
print("items:", len(items))
if items:
    print("example datetime:", items[0].datetime)
    print("example id:", items[0].id)
items: 12
example datetime: 2022-07-11 16:00:00+00:00
example id: or_m_4512236_nw_10_030_20220711
InĀ [22]:
### Naip_search
list(naip_search.items())[:2]

### build a dataframe
scene_df = pd.DataFrame(dict(

    ### tract column
    tract = tract,

    ### date column
    data = [pd.to_datetime(scene.datetime).date()
            
            ### return the scenes (STAC items)
            for scene in naip_search.items()], 

    ### make rgbir_href
    rgbir_href = [scene.assets['image'].href for scene in naip_search.items()],

))

scene_df
Out[22]:
tract data rgbir_href
InĀ [23]:
### loop through all the tracts

### Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)

### Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'portland-ndvi-stats.csv')

### Check for existing data - do not access duplicate tracts
### initialize an empty list for the census tract IDs
downloaded_tracts = []

### write the conditional
if os.path.exists(ndvi_stats_path):

    ### if it exists, open the csvs in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)

    ### fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values

### if it doesn't exist, print statement
else:
    print('No census tracts downloaded so far')


### Loop through each census tract
### list to accumulate into
scene_dfs = []

print("tracts to process:", len(tract_latlong_gdf))
print("already downloaded:", len(downloaded_tracts))
print("sample tracts:", tract_latlong_gdf["tract2010"].head().tolist())

### loop through each tract value in the gdf
for i, tract_values in tqdm(tract_latlong_gdf.iterrows()):

    ### for i, extract the tract id
    tract = tract_values.tract2010

    ### Check if statistics are already downloaded for this tract
    if not (tract in downloaded_tracts):

        ### Repeat up to 5 times in case of a momentary disruption 
        i = 0 ### initialize retry counter
        retry_limit = 5 ### max number of tries
        while i < retry_limit:

            ### Try accessing the STAC
            try:

                ### Search for tiles 
                naip_search = e84_catalog.search(
                    collections = ['naip'],
                    intersects = shapely.to_geojson(tract_values.geometry),
                    datetime="2022"
                )

                ### Build dataframe with tracts and tile urls
                scene_dfs.append(pd.DataFrame(dict(

                    ### column called tract
                    tract = tract,
                    date = [pd.to_datetime(scene.datetime).date()
                            
                            for scene in naip_search.items()],

                    rgbir_href = [scene.assets['image'].href for scene in naip_search.items()],
                    
                )))

                ### exit retry loop open STAC request succeeds
                break

            ### Try again in case of an APIError
            except pystac_client.exceptions.APIError:
                print(
                    f'Could not connect with STAC server'
                    f'Retrying tract {tract}...')

                ### wait 2 seconds before retrying
                time.sleep(2)
                i += 1
                continue        

### Concatenate the url dataframes
if scene_dfs:

    ### concatenate
    scene_df = pd.concat(scene_dfs).reset_index(drop = True)

### otherwise, tell me it didn't work
else:
    scene_df = None

### Preview the url dataframe
scene_df
No census tracts downloaded so far
tracts to process: 125
already downloaded: 0
sample tracts: ['41051000301', '41051000302', '41051000401', '41051000402', '41051000501']
0it [00:00, ?it/s]
Out[23]:
tract date rgbir_href
0 41051000301 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
1 41051000301 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
2 41051000302 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
3 41051000302 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
4 41051000401 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
... ... ... ...
240 41051010200 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
241 41051980000 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
242 41051980000 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
243 41051980000 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
244 41051009605 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...

245 rows Ɨ 3 columns

InĀ [24]:
# write out scene_df as a csv (only if we actually built one)
if scene_df is not None and not scene_df.empty:
    scene_df.to_csv(ndvi_stats_path, index=False)
else:
    print("No new tracts to write (scene_df is empty/None).")
InĀ [25]:
### Sanity check
print("file exists:", os.path.exists(ndvi_stats_path))
if os.path.exists(ndvi_stats_path):
    import os
    print("file size (bytes):", os.path.getsize(ndvi_stats_path))
file exists: True
file size (bytes): 33343

Step 3d- Compute NDVI Statistics¶

Next, calculate some metrics to get at different aspects of the distribution of greenspace in each census tract. These types of statistics are called fragmentation statistics, and can be implemented with the scipy package. Some examples of fragmentation statistics are:

Percentage vegetation: The percentage of pixels that exceed a vegetation threshold (.12 is common with Landsat) Mean patch size

Mean patch size: The average size of patches, or contiguous area exceeding the vegetation threshold. Patches can be identified with the label function from scipy.ndimage

Edge density: The proportion of edge pixels among vegetated pixels. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.

What is convolution?

If you are familiar with differential equations, convolution is an approximation of the LaPlace transform.

For the purposes of calculating edge density, convolution means that we are taking all the possible 3x3 chunks for our image, and multiplying it by the kernel:

$$ \text{Kernel} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$

The result is a matrix the same size as the original, minus the outermost edge. If the center pixel is the same as the surroundings, its value in the final matrix will be $-8 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 0$. If it is higher than the surroundings, the result will be negative, and if it is lower than the surroundings, the result will be positive. As such, the edge pixels of our patches will be negative.

Try It
  1. Select a single row from the census tract GeoDataFrame using e.g.Ā .loc[[0]], then select all the rows from your URL DataFrame that match the census tract.
  2. For each URL in crop, merge, clip, and compute NDVI for that census tract
  3. Set a threshold to get a binary mask of vegetation
  4. Using the sample code to compute the fragmentation statistics. Feel free to add any other statistics you think are relevant, but make sure to include a fraction vegetation, mean patch size, and edge density. If you are not sure what any line of code is doing, make a plot or print something to find out! You can also ask ChatGPT or the class.
InĀ [26]:
### open df of NAIP imagery URLs
scene_df = pd.read_csv(ndvi_stats_path)
scene_df
Out[26]:
tract date rgbir_href
0 41051000301 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
1 41051000301 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
2 41051000302 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
3 41051000302 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
4 41051000401 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
... ... ... ...
240 41051010200 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
241 41051980000 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
242 41051980000 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
243 41051980000 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...
244 41051009605 2022-07-11 https://naipeuwest.blob.core.windows.net/naip/...

245 rows Ɨ 3 columns

InĀ [27]:
### Pull out first tract value
# tract = por_tract_gdf.loc[0].tract2010
# tract
### use a tract that we know has imagery
tract = scene_df["tract"].astype(str).iloc[0]
tract

### convert tract to string
tract = str(tract)
tract
Out[27]:
'41051000301'
InĀ [28]:
### Sanity check
print("tract value:", tract)
print("tract type:", type(tract))
print("scene_df tract dtype:", scene_df["tract"].dtype)
print("scene_df tract sample:", scene_df["tract"].head().tolist())
tract value: 41051000301
tract type: <class 'str'>
scene_df tract dtype: int64
scene_df tract sample: [41051000301, 41051000301, 41051000302, 41051000302, 41051000401]
InĀ [29]:
### grab first row of scene_df dataframe, select image column for matching rows
href_value = scene_df.loc[
    scene_df.tract.astype(str) == tract,

    ### column to extract
    "rgbir_href"
].iloc[0]

href_value
Out[29]:
'https://naipeuwest.blob.core.windows.net/naip/v002/or/2022/or_030cm_2022/45122/m_4512236_nw_10_030_20220711.tif'
InĀ [30]:
### process one scene
### open the raster
tile_da = rxr.open_rasterio(

    ### give it the url we want
    href_value,

    ### mask out NoData pixels
    masked = True

### remove dimensions of length = 1
).squeeze()

tile_da
Out[30]:
<xarray.DataArray (band: 4, y: 24560, x: 17740)> Size: 7GB
[1742777600 values with dtype=float32]
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * x            (x) float64 142kB 5.291e+05 5.291e+05 ... 5.344e+05 5.344e+05
  * y            (y) float64 196kB 5.039e+06 5.039e+06 ... 5.031e+06 5.031e+06
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_XRESOLUTION:     1
    TIFFTAG_YRESOLUTION:     1
    TIFFTAG_RESOLUTIONUNIT:  1 (unitless)
    AREA_OR_POINT:           Area
    scale_factor:            1.0
    add_offset:              0.0
xarray.DataArray
  • band: 4
  • y: 24560
  • x: 17740
  • ...
    [1742777600 values with dtype=float32]
    • band
      (band)
      int64
      1 2 3 4
      array([1, 2, 3, 4])
    • x
      (x)
      float64
      5.291e+05 5.291e+05 ... 5.344e+05
      array([529098.15, 529098.45, 529098.75, ..., 534419.25, 534419.55, 534419.85],
            shape=(17740,))
    • y
      (y)
      float64
      5.039e+06 5.039e+06 ... 5.031e+06
      array([5038787.85, 5038787.55, 5038787.25, ..., 5031420.75, 5031420.45,
             5031420.15], shape=(24560,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 10N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -123.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      GeoTransform :
      529098.0 0.3 0.0 5038788.0 0.0 -0.3
      array(0)
    • band
      PandasIndex
      PandasIndex(Index([1, 2, 3, 4], dtype='int64', name='band'))
    • x
      PandasIndex
      PandasIndex(Index([        529098.15, 529098.4500000001,         529098.75,
                     529099.05,         529099.35,         529099.65,
             529099.9500000001,         529100.25,         529100.55,
                     529100.85,
             ...
                     534417.15, 534417.4500000001,         534417.75,
                     534418.05,         534418.35,         534418.65,
             534418.9500000001,         534419.25,         534419.55,
                     534419.85],
            dtype='float64', name='x', length=17740))
    • y
      PandasIndex
      PandasIndex(Index([       5038787.85,        5038787.55,        5038787.25,
             5038786.949999999, 5038786.649999999,        5038786.35,
                    5038786.05,        5038785.75, 5038785.449999999,
             5038785.149999999,
             ...
                    5031422.85,        5031422.55,        5031422.25,
             5031421.949999999, 5031421.649999999,        5031421.35,
                    5031421.05,        5031420.75, 5031420.449999999,
             5031420.149999999],
            dtype='float64', name='y', length=24560))
  • TIFFTAG_XRESOLUTION :
    1
    TIFFTAG_YRESOLUTION :
    1
    TIFFTAG_RESOLUTIONUNIT :
    1 (unitless)
    AREA_OR_POINT :
    Area
    scale_factor :
    1.0
    add_offset :
    0.0
InĀ [31]:
### check out dimensions
tile_da.dims
Out[31]:
('band', 'y', 'x')
InĀ [32]:
print(scene_df.columns)
Index(['tract', 'date', 'rgbir_href'], dtype='object')
InĀ [33]:
tract_cdc_gdf
Out[33]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus
0 4159000 41051000301 41 Portland 4159000-41051000301 5041 POLYGON ((-13650305.147 5699042.371, -13650193... 2023 41051000301 12.1 10.8 13.5 % 5443 4908
1 4159000 41051000302 41 Portland 4159000-41051000302 6709 POLYGON ((-13650063.25 5697279.135, -13649943.... 2023 41051000302 11.3 10.1 12.5 % 7191 5476
2 4159000 41051000401 41 Portland 4159000-41051000401 3418 POLYGON ((-13648772.389 5699042.688, -13648767... 2023 41051000401 11.7 10.4 13.0 % 3746 3023
3 4159000 41051000402 41 Portland 4159000-41051000402 3477 POLYGON ((-13647657.747 5699037.289, -13647659... 2023 41051000402 11.6 10.3 12.9 % 3906 3188
4 4159000 41051000501 41 Portland 4159000-41051000501 3732 POLYGON ((-13646585.183 5696754.598, -13646694... 2023 41051000501 11.9 10.7 13.3 % 4134 3307
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
120 4159000 41051009804 41 Portland 4159000-41051009804 3183 POLYGON ((-13634931.924 5701311.194, -13634931... 2023 41051009804 13.1 11.8 14.6 % 3496 2786
121 4159000 41051009903 41 Portland 4159000-41051009903 391 POLYGON ((-13634999.94 5697453.141, -13635002.... 2023 41051009903 11.6 10.4 13.0 % 7482 5839
122 4159000 41051010200 41 Portland 4159000-41051010200 583 MULTIPOLYGON (((-13633882.74 5709928.305, -136... 2023 41051010200 12.1 10.9 13.5 % 7130 5680
123 4159000 41051980000 41 Portland 4159000-41051980000 8 POLYGON ((-13660405.502 5712550.235, -13660339... 2023 41051980000 11.7 10.4 13.2 % 55 53
124 4159000 41051009605 41 Portland 4159000-41051009605 0 POLYGON ((-13636161.226 5707323.831, -13636162... 2023 41051009605 12.6 11.4 14.1 % 5710 4409

125 rows Ɨ 15 columns

InĀ [34]:
tract_cdc_gdf["tract2010"].head()
Out[34]:
0    41051000301
1    41051000302
2    41051000401
3    41051000402
4    41051000501
Name: tract2010, dtype: object
InĀ [35]:
### Sanity check
tract in tract_cdc_gdf["tract2010"].astype(str).values
Out[35]:
True
InĀ [36]:
### get boundary of the tract we're working with
boundary = (
    tract_cdc_gdf

    ### deal with issues with integers
    .assign(tract2010 = lambda df: df["tract2010"].astype(str))

    ### use tract ID as the index
    .set_index("tract2010")

    ### grab the tract we are using
    .loc[[str(tract)]]

    ### make sure the CRS matc
    .to_crs(tile_da.rio.crs)

    ### grab geometry
    .geometry
)

boundary
Out[36]:
tract2010
41051000301    POLYGON ((529476.134 5037488.683, 529554.509 5...
Name: geometry, dtype: geometry
InĀ [37]:
### crop raster to the bounding box
crop_da = tile_da.rio.clip_box(

    ### get coordinates of the bounding box
    *boundary.envelope.total_bounds,

    ### let it expand a little
    auto_expand = True
)

crop_da
Out[37]:
<xarray.DataArray (band: 4, y: 6054, x: 2115)> Size: 205MB
[51216840 values with dtype=float32]
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * x            (x) float64 17kB 5.291e+05 5.291e+05 ... 5.297e+05 5.297e+05
  * y            (y) float64 48kB 5.038e+06 5.038e+06 ... 5.036e+06 5.036e+06
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_XRESOLUTION:     1
    TIFFTAG_YRESOLUTION:     1
    TIFFTAG_RESOLUTIONUNIT:  1 (unitless)
    AREA_OR_POINT:           Area
    scale_factor:            1.0
    add_offset:              0.0
xarray.DataArray
  • band: 4
  • y: 6054
  • x: 2115
  • ...
    [51216840 values with dtype=float32]
    • band
      (band)
      int64
      1 2 3 4
      array([1, 2, 3, 4])
    • x
      (x)
      float64
      5.291e+05 5.291e+05 ... 5.297e+05
      axis :
      X
      long_name :
      x coordinate of projection
      standard_name :
      projection_x_coordinate
      units :
      metre
      array([529098.15, 529098.45, 529098.75, ..., 529731.75, 529732.05, 529732.35],
            shape=(2115,))
    • y
      (y)
      float64
      5.038e+06 5.038e+06 ... 5.036e+06
      axis :
      Y
      long_name :
      y coordinate of projection
      standard_name :
      projection_y_coordinate
      units :
      metre
      array([5037519.75, 5037519.45, 5037519.15, ..., 5035704.45, 5035704.15,
             5035703.85], shape=(6054,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 10N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -123.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      GeoTransform :
      529098.0 0.29999999999997795 0.0 5037519.9 0.0 -0.30000000000006155
      array(0)
    • band
      PandasIndex
      PandasIndex(Index([1, 2, 3, 4], dtype='int64', name='band'))
    • x
      PandasIndex
      PandasIndex(Index([        529098.15, 529098.4500000001,         529098.75,
                     529099.05,         529099.35,         529099.65,
             529099.9500000001,         529100.25,         529100.55,
                     529100.85,
             ...
                     529729.65, 529729.9500000001,         529730.25,
                     529730.55,         529730.85,         529731.15,
             529731.4500000001,         529731.75,         529732.05,
                     529732.35],
            dtype='float64', name='x', length=2115))
    • y
      PandasIndex
      PandasIndex(Index([       5037519.75, 5037519.449999999, 5037519.149999999,
                    5037518.85,        5037518.55,        5037518.25,
             5037517.949999999, 5037517.649999999,        5037517.35,
                    5037517.05,
             ...
                    5035706.55,        5035706.25, 5035705.949999999,
             5035705.649999999,        5035705.35,        5035705.05,
                    5035704.75, 5035704.449999999, 5035704.149999999,
                    5035703.85],
            dtype='float64', name='y', length=6054))
  • TIFFTAG_XRESOLUTION :
    1
    TIFFTAG_YRESOLUTION :
    1
    TIFFTAG_RESOLUTIONUNIT :
    1 (unitless)
    AREA_OR_POINT :
    Area
    scale_factor :
    1.0
    add_offset :
    0.0
InĀ [38]:
### clip to exact geometry of the census tract
clip_da = crop_da.rio.clip(

    boundary,

    ### include all pixels touching the boundary
    all_touched = True)

clip_da
Out[38]:
<xarray.DataArray (band: 4, y: 4117, x: 2115)> Size: 139MB
array([[[ nan,  nan,  nan, ..., 200., 200., 201.],
        [ nan,  nan,  nan, ..., 192., 189., 189.],
        [ nan,  nan,  nan, ..., 156., 157.,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ..., 204., 207., 207.],
        [ nan,  nan,  nan, ..., 194., 197., 194.],
        [ nan,  nan,  nan, ..., 154., 154.,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ..., 195., 199., 199.],
        [ nan,  nan,  nan, ..., 185., 188., 186.],
        [ nan,  nan,  nan, ..., 146., 147.,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ..., 174., 180., 180.],
        [ nan,  nan,  nan, ..., 164., 169., 167.],
        [ nan,  nan,  nan, ..., 132., 133.,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]]],
      shape=(4, 4117, 2115), dtype=float32)
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * x            (x) float64 17kB 5.291e+05 5.291e+05 ... 5.297e+05 5.297e+05
  * y            (y) float64 33kB 5.037e+06 5.037e+06 ... 5.036e+06 5.036e+06
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_XRESOLUTION:     1
    TIFFTAG_YRESOLUTION:     1
    TIFFTAG_RESOLUTIONUNIT:  1 (unitless)
    AREA_OR_POINT:           Area
    scale_factor:            1.0
    add_offset:              0.0
xarray.DataArray
  • band: 4
  • y: 4117
  • x: 2115
  • nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan
    array([[[ nan,  nan,  nan, ..., 200., 200., 201.],
            [ nan,  nan,  nan, ..., 192., 189., 189.],
            [ nan,  nan,  nan, ..., 156., 157.,  nan],
            ...,
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
    
           [[ nan,  nan,  nan, ..., 204., 207., 207.],
            [ nan,  nan,  nan, ..., 194., 197., 194.],
            [ nan,  nan,  nan, ..., 154., 154.,  nan],
            ...,
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
    
           [[ nan,  nan,  nan, ..., 195., 199., 199.],
            [ nan,  nan,  nan, ..., 185., 188., 186.],
            [ nan,  nan,  nan, ..., 146., 147.,  nan],
            ...,
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
    
           [[ nan,  nan,  nan, ..., 174., 180., 180.],
            [ nan,  nan,  nan, ..., 164., 169., 167.],
            [ nan,  nan,  nan, ..., 132., 133.,  nan],
            ...,
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan],
            [ nan,  nan,  nan, ...,  nan,  nan,  nan]]],
          shape=(4, 4117, 2115), dtype=float32)
    • band
      (band)
      int64
      1 2 3 4
      array([1, 2, 3, 4])
    • x
      (x)
      float64
      5.291e+05 5.291e+05 ... 5.297e+05
      axis :
      X
      long_name :
      x coordinate of projection
      standard_name :
      projection_x_coordinate
      units :
      metre
      array([529098.15, 529098.45, 529098.75, ..., 529731.75, 529732.05, 529732.35],
            shape=(2115,))
    • y
      (y)
      float64
      5.037e+06 5.037e+06 ... 5.036e+06
      axis :
      Y
      long_name :
      y coordinate of projection
      standard_name :
      projection_y_coordinate
      units :
      metre
      array([5037490.05, 5037489.75, 5037489.45, ..., 5036255.85, 5036255.55,
             5036255.25], shape=(4117,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 10N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -123.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      GeoTransform :
      529098.0 0.29999999999997795 0.0 5037490.2 0.0 -0.29999999999995475
      array(0)
    • band
      PandasIndex
      PandasIndex(Index([1, 2, 3, 4], dtype='int64', name='band'))
    • x
      PandasIndex
      PandasIndex(Index([        529098.15, 529098.4500000001,         529098.75,
                     529099.05,         529099.35,         529099.65,
             529099.9500000001,         529100.25,         529100.55,
                     529100.85,
             ...
                     529729.65, 529729.9500000001,         529730.25,
                     529730.55,         529730.85,         529731.15,
             529731.4500000001,         529731.75,         529732.05,
                     529732.35],
            dtype='float64', name='x', length=2115))
    • y
      PandasIndex
      PandasIndex(Index([       5037490.05,        5037489.75, 5037489.449999999,
             5037489.149999999,        5037488.85,        5037488.55,
                    5037488.25, 5037487.949999999, 5037487.649999999,
                    5037487.35,
             ...
             5036257.949999999, 5036257.649999999,        5036257.35,
                    5036257.05,        5036256.75, 5036256.449999999,
             5036256.149999999,        5036255.85,        5036255.55,
                    5036255.25],
            dtype='float64', name='y', length=4117))
  • TIFFTAG_XRESOLUTION :
    1
    TIFFTAG_YRESOLUTION :
    1
    TIFFTAG_RESOLUTIONUNIT :
    1 (unitless)
    AREA_OR_POINT :
    Area
    scale_factor :
    1.0
    add_offset :
    0.0
InĀ [39]:
clip_da.shape
Out[39]:
(4, 4117, 2115)
InĀ [40]:
### do NDVI math
ndvi_da = (
    (clip_da.sel(band = 4) - clip_da.sel(band = 1))
    / (clip_da.sel(band = 4) + clip_da.sel(band = 1))
)

ndvi_da
Out[40]:
<xarray.DataArray (y: 4117, x: 2115)> Size: 35MB
array([[        nan,         nan,         nan, ..., -0.06951872,
        -0.05263158, -0.05511811],
       [        nan,         nan,         nan, ..., -0.07865169,
        -0.05586592, -0.06179775],
       [        nan,         nan,         nan, ..., -0.08333334,
        -0.08275862,         nan],
       ...,
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan]], shape=(4117, 2115), dtype=float32)
Coordinates:
  * x            (x) float64 17kB 5.291e+05 5.291e+05 ... 5.297e+05 5.297e+05
  * y            (y) float64 33kB 5.037e+06 5.037e+06 ... 5.036e+06 5.036e+06
    spatial_ref  int64 8B 0
xarray.DataArray
  • y: 4117
  • x: 2115
  • nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan
    array([[        nan,         nan,         nan, ..., -0.06951872,
            -0.05263158, -0.05511811],
           [        nan,         nan,         nan, ..., -0.07865169,
            -0.05586592, -0.06179775],
           [        nan,         nan,         nan, ..., -0.08333334,
            -0.08275862,         nan],
           ...,
           [        nan,         nan,         nan, ...,         nan,
                    nan,         nan],
           [        nan,         nan,         nan, ...,         nan,
                    nan,         nan],
           [        nan,         nan,         nan, ...,         nan,
                    nan,         nan]], shape=(4117, 2115), dtype=float32)
    • x
      (x)
      float64
      5.291e+05 5.291e+05 ... 5.297e+05
      axis :
      X
      long_name :
      x coordinate of projection
      standard_name :
      projection_x_coordinate
      units :
      metre
      array([529098.15, 529098.45, 529098.75, ..., 529731.75, 529732.05, 529732.35],
            shape=(2115,))
    • y
      (y)
      float64
      5.037e+06 5.037e+06 ... 5.036e+06
      axis :
      Y
      long_name :
      y coordinate of projection
      standard_name :
      projection_y_coordinate
      units :
      metre
      array([5037490.05, 5037489.75, 5037489.45, ..., 5036255.85, 5036255.55,
             5036255.25], shape=(4117,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 10N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -123.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 10N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-123],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26910"]]
      GeoTransform :
      529098.0 0.29999999999997795 0.0 5037490.2 0.0 -0.29999999999995475
      array(0)
    • x
      PandasIndex
      PandasIndex(Index([        529098.15, 529098.4500000001,         529098.75,
                     529099.05,         529099.35,         529099.65,
             529099.9500000001,         529100.25,         529100.55,
                     529100.85,
             ...
                     529729.65, 529729.9500000001,         529730.25,
                     529730.55,         529730.85,         529731.15,
             529731.4500000001,         529731.75,         529732.05,
                     529732.35],
            dtype='float64', name='x', length=2115))
    • y
      PandasIndex
      PandasIndex(Index([       5037490.05,        5037489.75, 5037489.449999999,
             5037489.149999999,        5037488.85,        5037488.55,
                    5037488.25, 5037487.949999999, 5037487.649999999,
                    5037487.35,
             ...
             5036257.949999999, 5036257.649999999,        5036257.35,
                    5036257.05,        5036256.75, 5036256.449999999,
             5036256.149999999,        5036255.85,        5036255.55,
                    5036255.25],
            dtype='float64', name='y', length=4117))
InĀ [41]:
### Skip this step if data are already downloaded 
if not scene_df is None:

    ### Get an example 
    ### pull out te identifier for an example tract
    # tract = por_tract_gdf.loc[0].tract2010
    scene_df["tract"] = scene_df["tract"].astype(str)
    tract = scene_df["tract"].iloc[0]

    ### subset scene_df to only include that tract
    # ex_scene_gdf = scene_df[scene_df.tract == tract]
    ex_scene_gdf = scene_df[scene_df["tract"] == tract]

    ### make a list to accumulate into
    tile_das = []

    ### loop through all the images for this tract
    for _, href_s in ex_scene_gdf.iterrows():

        ### open vsi connection to data
        tile_da = rxr.open_rasterio(

            ### location of the raster
            href_s.rgbir_href,

            ### deal with no data, remove extra dimension
            masked = True).squeeze()

        ### Crop data to the bounding box of the census tract
        ### make the bounding box
        boundary = (
            tract_cdc_gdf

            ### use tract id as index
            # .set_index('tract2010')
            .assign(tract2010=lambda df: df["tract2010"].astype(str))
            .set_index("tract2010")

            ### select geometry for the tract
            .loc[[tract]]

            ### reproject
            .to_crs(tile_da.rio.crs)

            ### extract geometry
            .geometry
        )

        ### crop to bounding box
        crop_da = tile_da.rio.clip_box(
 
            ### get coordinates for the bounding box
            *boundary.envelope.total_bounds,

            ### let expand
            auto_expand = True
        )

        ### Clip data to the boundary of the census tract
        clip_da = crop_da.rio.clip(
            boundary, all_touched = True)

        ### Compute NDVI
        ndvi_da = (
            (clip_da.sel(band = 4) - clip_da.sel(band = 1))
            / (clip_da.sel(band = 4) + clip_da.sel(band = 1))
        )

        ### Accumulate result
        tile_das.append(ndvi_da)

### check
print(f"Number of tiles in tiles_das: {len(tile_das)}")

### Merge data
scene_da = rxrmerge.merge_arrays(tile_das)
scene_da

### Mask vegetation
veg_mask = (scene_da > .3)


### Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()

patch_sizes
Number of tiles in tiles_das: 2
Out[41]:
array([10995,  3595,     1, ...,     4,     1,     1], shape=(65701,))
InĀ [42]:
### Sanity check
print("ex_scene_gdf rows:", len(ex_scene_gdf))
print("tiles created:", len(tile_das))
ex_scene_gdf rows: 2
tiles created: 2
InĀ [43]:
mean_patch_size
Out[43]:
np.float64(130.32430252203162)
InĀ [44]:
### Make kernel to calculate edge density
kernel = np.array([
    [1, 1, 1], 
    [1, -8, 1], 
    [1, 1, 1]])

### Calculate edge density
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size

edge_density
Out[44]:
np.float64(0.1273147138822625)

Repeat for all tracts¶

Try It
  1. Using a loop, for each Census Tract:

    1. Using a loop, for each data URL:

      1. Use rioxarray to open up a connection to the STAC asset, just like you would a file on your computer
      2. Crop and then clip your data to the census tract boundary > HINT: check out the .clip_box parameter auto_expand and the clip parameter all_touched to make sure you don’t end up with an empty array
      3. Compute NDVI for the tract
    2. Merge the NDVI rasters

    3. Compute:

      1. total number of pixels within the tract
      2. fraction of pixels with an NDVI greater than .12 within the tract (and any other statistics you would like to look at)
    4. Accumulate the statistics in a file for later

  2. Using a conditional statement, ensure that you do not run this computation if you have already saved values. You do not want to run this step many times, or have to restart from scratch! There are many approaches to this, but we actually recommend implementing your caching in the previous cell when you generate your dataframe of URLs, since that step can take a few minutes as well. However, the important thing to cache is the computation.

InĀ [45]:
### make csv for ndvi data
ndvi_stats_path_veg = os.path.join(data_dir, 'portland-ndvi-stats-veg-test.csv')
InĀ [46]:
### debug
print("crop shape:", crop_da.shape)
print("clip shape:", clip_da.shape)
crop shape: (4, 6054, 5586)
clip shape: (4, 6054, 5586)
InĀ [47]:
### debug
area_m2 = (
    tract_cdc_gdf
    .assign(tract2010=lambda d: d["tract2010"].astype(str))
    .set_index("tract2010")
    .loc[[str(tract)]]
    .to_crs(3857)
    .geometry.area
)
print("tract area m^2:", float(area_m2))
tract area m^2: 4736442.152396669
C:\Users\seism\AppData\Local\Temp\ipykernel_99948\219635894.py:10: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  print("tract area m^2:", float(area_m2))
InĀ [48]:
### Skip this step if data are already downloaded 
if not scene_df is None:

    scene_df["tract"] = scene_df["tract"].astype(str)
    tract_cdc_gdf["tract2010"] = tract_cdc_gdf["tract2010"].astype(str)
    
    ### Loop through the census tracts with URLs
    for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):

        ### Open all images for tract
        ### create a list to store NDVI arrays, with 1 array per NAIP image
        tile_das = []

        ### iterate over rows in tract-specific datafram
        for _, href_s in tract_date_gdf.iterrows():

            ### Open vsi connection to data
            tile_da = rxr.open_rasterio(

                ### mask and squeeze
                href_s.rgbir_href, masked = True).squeeze()
            
            ### Clip data
            ### get tract boundary
            boundary = (
                tract_cdc_gdf
                .set_index("tract2010")
                .loc[[tract]]
                .to_crs(tile_da.rio.crs)
                .geometry
            )

            ### crop to bounding box
            crop_da = tile_da.rio.clip_box(
                *boundary.envelope.total_bounds,
                auto_expand = True
            )

            ### downsample to fix out of memory error
            crop_da = crop_da.rio.reproject(crop_da.rio.crs, resolution=10)

            ### clip to actual tract geometry
            # clip_da = crop_da.rio.clip(boundary, all_touched = True)
            clip_da = crop_da.rio.clip([boundary.iloc[0]], all_touched=True)
                
            ### Compute NDVI
            # ndvi_da = ((clip_da.sel(band = 4) - clip_da.sel(band = 1))
            # / (clip_da.sel(band = 4) + clip_da.sel(band = 1)))
            ### safer, just because it keeps crashing after an hour
            nir = clip_da.sel(band=4).astype("float32")
            red = clip_da.sel(band=1).astype("float32")
            den = nir + red
            ndvi_da = (nir - red) / den.where(den != 0)

            ### Accumulate result
            tile_das.append(ndvi_da)

        ### Merge data
        scene_da = rxrmerge.merge_arrays(tile_das)

        ### Mask vegetation
        veg_mask = (scene_da > .3)

        ### Calculate statistics and save data to file
        ### count all valid pixels in the tract
        # total_pixels = scene_da.notnull().sum()
        total_pixels = int(scene_da.notnull().sum().values)

        ### count all vegetated pixels
        # veg_pixels = veg_mask.sum()
        veg_pixels = int(veg_mask.sum().values)
        
        ### identify vegetation patches
        # labeled_patches, num_patches = label(veg_mask)
        veg_arr = veg_mask.values.astype(bool)
        labeled_patches, num_patches = label(veg_arr)

        ### count patch pixels
        patch_sizes = np.bincount(labeled_patches.ravel())[1:]

        ### mean patch size
        # mean_patch_size = patch_sizes.mean()
        mean_patch_size = patch_sizes.mean() if len(patch_sizes) else np.nan

        ### Calculate edge density
        ### define the kernel
        kernel = np.array([
            [1, 1, 1], 
            [1, -8, 1], 
            [1, 1, 1]])
        
        ### detect boundaries between veg and non-veg
        # edges = convolve(veg_mask, kernel, mode='constant')
        
        ### calculate proportion of edge pixels by tract area
        # edge_density = np.sum(edges != 0) / veg_mask.size
        edges = convolve(veg_arr.astype(np.int8), kernel, mode="constant")
        edge_density = np.sum(edges != 0) / veg_arr.size

        ### Add a row to the statistics file for this tract
        ### create new datafram for the tract and append it to the csv
        pd.DataFrame(dict(

            ### store tract ID
            tract = [tract],

            ### store total pixels
            total_pixels = [int(total_pixels)],

            ### fraction of vegetated pixels 
            # frac_veg = [float(veg_pixels / total_pixels)],
            frac_veg = [float(veg_pixels / total_pixels) if total_pixels else np.nan],

            ### store mean patch size
            mean_patch_size = [mean_patch_size],

            ### edge density
            edge_density = [edge_density]

        ### write out the csv 
        )).to_csv(
            ndvi_stats_path_veg,

            ### append mode
            mode = 'a',

            ### get rid of row numbers
            index = False,

            ### write column names if the csv doesn't exist 
            header = (not os.path.exists(ndvi_stats_path_veg))
        )

### Re-load results from file
  0%|          | 0/125 [00:00<?, ?it/s]
InĀ [Ā ]:
### Understand warnings - no warnings in Portland
# ndvi_df = pd.read_csv(ndvi_stats_path_veg)
# ndvi_df[['mean_patch_size', 'frac_veg']].isna().sum()
InĀ [Ā ]:
### If needed, I am just going to park this here
### Since there is just one census tract with no vegetation patches, two choices, remove or reset from NaN, this is reset
# ndvi_df['mean_patch_size'] = ndvi_df['mean_patch_size'].fillna(0)
InĀ [49]:
### check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
Out[49]:
tract total_pixels frac_veg mean_patch_size edge_density
0 41051000301 24731 0.313412 6.658935 0.521101
1 41051000302 32476 0.400819 7.140428 0.767046
2 41051000401 13692 0.280456 3.626062 0.771442
3 41051000402 11869 0.234055 2.930380 0.760752
4 41051000501 11419 0.210439 2.508351 0.681798
... ... ... ... ... ...
120 41051009803 3337 0.202577 3.634409 0.336567
121 41051009804 11798 0.292507 4.733882 0.732260
122 41051009903 2449 0.246223 2.512500 0.350452
123 41051010200 301533 0.039293 11.414258 0.035563
124 41051980000 89236 0.042808 3.669549 0.071589

125 rows Ɨ 5 columns

InĀ [Ā ]:
### why do you have 1499 rows and I have 788 rows
# len(tract_cdc_gdf)
# tract_cdc_gdf['tract2010'].nunique()
InĀ [Ā ]:
### compare
# scene_df['tract'].nunique()

STEP 4 - Explore your data with plots¶

Chloropleth plots¶

Before running any statistical models on your data, you should check that your download worked. You should see differences in both median income and mean NDVI across the City.

Try It

Create a plot that contains:

  • 2 side-by-side chloropleth plots
  • Asthma prevelence on one plot and mean NDVI on the other
  • Make sure to include a title and labeled color bars
InĀ [51]:
### Correct datatype mismatch
tract_cdc_gdf["tract2010"] = tract_cdc_gdf["tract2010"].astype(str)
ndvi_stats_df["tract"] = ndvi_stats_df["tract"].astype(str)
InĀ [52]:
### Merge census data with geometry
por_ndvi_cdc_gdf = (

    ### grab census tract gdf
    tract_cdc_gdf

    ### merge with ndvi df
    .merge(
        ndvi_stats_df,

        ### match on the tract ID
        ### left:  tract2010
        ### right:  tract
        left_on = 'tract2010',
        right_on = 'tract',
        how = 'inner'
    )
)

### check
por_ndvi_cdc_gdf
Out[52]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract_x asthma asthma_ci_low asthma_ci_high data_value_unit totalpopulation totalpop18plus tract_y total_pixels frac_veg mean_patch_size edge_density
0 4159000 41051000301 41 Portland 4159000-41051000301 5041 POLYGON ((-13650305.147 5699042.371, -13650193... 2023 41051000301 12.1 10.8 13.5 % 5443 4908 41051000301 24731 0.313412 6.658935 0.521101
1 4159000 41051000302 41 Portland 4159000-41051000302 6709 POLYGON ((-13650063.25 5697279.135, -13649943.... 2023 41051000302 11.3 10.1 12.5 % 7191 5476 41051000302 32476 0.400819 7.140428 0.767046
2 4159000 41051000401 41 Portland 4159000-41051000401 3418 POLYGON ((-13648772.389 5699042.688, -13648767... 2023 41051000401 11.7 10.4 13.0 % 3746 3023 41051000401 13692 0.280456 3.626062 0.771442
3 4159000 41051000402 41 Portland 4159000-41051000402 3477 POLYGON ((-13647657.747 5699037.289, -13647659... 2023 41051000402 11.6 10.3 12.9 % 3906 3188 41051000402 11869 0.234055 2.930380 0.760752
4 4159000 41051000501 41 Portland 4159000-41051000501 3732 POLYGON ((-13646585.183 5696754.598, -13646694... 2023 41051000501 11.9 10.7 13.3 % 4134 3307 41051000501 11419 0.210439 2.508351 0.681798
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
120 4159000 41051009804 41 Portland 4159000-41051009804 3183 POLYGON ((-13634931.924 5701311.194, -13634931... 2023 41051009804 13.1 11.8 14.6 % 3496 2786 41051009804 11798 0.292507 4.733882 0.732260
121 4159000 41051009903 41 Portland 4159000-41051009903 391 POLYGON ((-13634999.94 5697453.141, -13635002.... 2023 41051009903 11.6 10.4 13.0 % 7482 5839 41051009903 2449 0.246223 2.512500 0.350452
122 4159000 41051010200 41 Portland 4159000-41051010200 583 MULTIPOLYGON (((-13633882.74 5709928.305, -136... 2023 41051010200 12.1 10.9 13.5 % 7130 5680 41051010200 301533 0.039293 11.414258 0.035563
123 4159000 41051980000 41 Portland 4159000-41051980000 8 POLYGON ((-13660405.502 5712550.235, -13660339... 2023 41051980000 11.7 10.4 13.2 % 55 53 41051980000 89236 0.042808 3.669549 0.071589
124 4159000 41051009605 41 Portland 4159000-41051009605 0 POLYGON ((-13636161.226 5707323.831, -13636162... 2023 41051009605 12.6 11.4 14.1 % 5710 4409 41051009605 116 0.215517 4.166667 0.741667

125 rows Ɨ 20 columns

InĀ [53]:
### Plot chloropleths with vegetation statistics

### make a plot_cloropleth function
def plot_chloropleth(gdf, title, **opts):

    ### docstring to describe the function
    """Generate a cloropleth with the given color column"""

    ### use geoviews to make a polygon object
    return gv.Polygons(

        ### use mercator
        gdf.to_crs(ccrs.Mercator()),

        ### set plot crs to mercator
        crs = ccrs.Mercator()

    ### drop x and y axis and add a legend
    ).opts(xaxis = None, yaxis = None, colorbar = True, title=title, **opts)
InĀ [54]:
### apply the function to Portland data
(
    plot_chloropleth(

        ### plot asthma by census tract
        por_ndvi_cdc_gdf, color = 'asthma', cmap = 'viridis', title='Asthma Prevalence', colorbar_opts={
            'title': 'Asthma prevalence (%)'
        })
    +

    ### add a second plot with vegetation edge density
    plot_chloropleth(
        por_ndvi_cdc_gdf, color = 'edge_density', cmap = 'Greens', title='Vegetation Edge Density', colorbar_opts={
            'title': 'Vegetation edge density'
        })
    )
    
Out[54]:
InĀ [88]:
### generate figure for portfolio
combo_map = (
    plot_chloropleth(
        por_ndvi_cdc_gdf,
        color="asthma",
        cmap="viridis",
        title="Asthma Prevalence",
        colorbar_opts={"title": "Asthma prevalence (%)"},
    )
    +
    plot_chloropleth(
        por_ndvi_cdc_gdf,
        color="edge_density",
        cmap="Greens",
        title="Vegetation Edge Density",
        colorbar_opts={"title": "Vegetation edge density"},
    )
).opts(width=500, height=250)

combo_map
Out[88]:
InĀ [89]:
hv.save(combo_map, "portland_asthma_edge_density.html")

STEP 5: Explore a linear ordinary least-squares regression¶

Model description¶

One way to find if there is a statistically significant relationship between asthma prevalence and greenspace metrics is to run a linear ordinary least squares (OLS) regression and measure how well it is able to predict asthma given your chosen fragmentation statistics.

Before fitting an OLS regression, you should check that your data are appropriate for the model.

Reflect and Respond

Write a model description for the linear ordinary least-squares regression that touches on:

  1. Assumptions made about the data
  2. What is the objective of this model? What metrics could you use to evaluate the fit?
  3. Advantages and potential problems with choosing this model.

I use an ordinary least squares (OLS) regression as a first pass to see whether vegetation fragmentation metrics are related to adult asthma prevalence across Chicago census tracts. This model assumes a roughly linear relationship between the vegetation variables and asthma prevalence, treats each census tract as an independent observation, and assumes the residuals are reasonably well behaved. It also assumes that the vegetation metrics I calculated capture meaningful differences in green space structure between neighborhoods.

The goal of this model is not to make strong predictions, but to explore whether vegetation characteristics help explain some of the spatial variation in asthma prevalence. I would evaluate the model using metrics like R² and mean absolute error, and by looking at residuals to see whether there are patterns the model is missing.

One advantage of using OLS is that it is simple and easy to interpret, which makes it useful for exploratory analysis. At the same time, asthma prevalence is influenced by many other factors that are not included here, and nearby census tracts are likely spatially related. Because of these limitations, the results should be interpreted as exploratory rather than causal.

Step 5a - Data preparation¶

When fitting statistical models, you should make sure that your data meet the model assumptions through a process of selection and/or transformation.

You can select data by:

  • Eliminating observations (rows) or variables (columns) that are missing data
  • Selecting a model that matches the way in which variables are related to each other (for example, linear models are not good at modeling circles)
  • Selecting variables that explain the largest amount of variability in the dependent variable.

You can transform data by:

  • Transforming a variable so that it follows a normal distribution. The log transform is the most common to eliminate excessive skew (e.g.Ā make the data symmetrical), but you should select a transform most suited to your data.
  • Normalizing or standardizing variables to, for example, eliminate negative numbers or effects caused by variables being in a different range.
  • Performing a principle component analysis (PCA) to eliminate multicollinearity among the predictor variables

Tip

Keep in mind that data transforms like a log transform or a PCA must be reversed after modeling for the results to be meaningful.

Try It
  1. Use the hvplot.scatter_matrix() function to create an exploratory plot of your data.
  2. Make any necessary adjustments to your data to make sure that they meet the assumptions of a linear OLS regression.
  3. Check if there are NaN values, and if so drop those rows and/or columns. You can use the .dropna() method to drop rows with NaN values.
  4. Explain any data transformations or selections you made and why
InĀ [55]:
### missing values
por_ndvi_cdc_gdf.isna().any()
Out[55]:
place2010          False
tract2010          False
ST                 False
PlaceName          False
plctract10         False
PlcTrPop10         False
geometry           False
year               False
tract_x            False
asthma             False
asthma_ci_low      False
asthma_ci_high     False
data_value_unit    False
totalpopulation    False
totalpop18plus     False
tract_y            False
total_pixels       False
frac_veg           False
mean_patch_size    False
edge_density       False
dtype: bool
InĀ [77]:
### Visualize distribution of data
### select variables
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']

### scatterplot
hvplot.scatter_matrix(

    por_ndvi_cdc_gdf
    [variables]
)
Out[77]:
InĀ [57]:
### histogram
### make facet
fig, axes = plt.subplots(2, 2, figsize = (12, 10))

### list veriables to plot
variables = ['frac_veg', 'asthma', 'mean_patch_size', 'edge_density']
titles = ['Vegetated fraction', 'Adult asthma rate', 'Mean patch size', 'Edge density']

### loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i//2, i%2]
    ax.hist(por_ndvi_cdc_gdf[var], bins = 10, color = 'gray', edgecolor = 'black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')

### adjust layers to prevent overlap
plt.tight_layout()
plt.show()
No description has been provided for this image
InĀ [58]:
### drop rows with missing observations
model_df = (
    por_ndvi_cdc_gdf

    ### make a copy
    .copy()

    ### subset to columns
    [['frac_veg', 'asthma', 'mean_patch_size', 'edge_density', 'geometry']]

    ### drop rows (observations) with missing values
    .dropna()
)

model_df
Out[58]:
frac_veg asthma mean_patch_size edge_density geometry
0 0.313412 12.1 6.658935 0.521101 POLYGON ((-13650305.147 5699042.371, -13650193...
1 0.400819 11.3 7.140428 0.767046 POLYGON ((-13650063.25 5697279.135, -13649943....
2 0.280456 11.7 3.626062 0.771442 POLYGON ((-13648772.389 5699042.688, -13648767...
3 0.234055 11.6 2.930380 0.760752 POLYGON ((-13647657.747 5699037.289, -13647659...
4 0.210439 11.9 2.508351 0.681798 POLYGON ((-13646585.183 5696754.598, -13646694...
... ... ... ... ... ...
120 0.292507 13.1 4.733882 0.732260 POLYGON ((-13634931.924 5701311.194, -13634931...
121 0.246223 11.6 2.512500 0.350452 POLYGON ((-13634999.94 5697453.141, -13635002....
122 0.039293 12.1 11.414258 0.035563 MULTIPOLYGON (((-13633882.74 5709928.305, -136...
123 0.042808 11.7 3.669549 0.071589 POLYGON ((-13660405.502 5712550.235, -13660339...
124 0.215517 12.6 4.166667 0.741667 POLYGON ((-13636161.226 5707323.831, -13636162...

125 rows Ɨ 5 columns

InĀ [78]:
### Perform variable transformation
### log of asthma rate
# model_df['log_asthma'] = np.log(model_df.asthma)

### log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)

model_df
Out[78]:
frac_veg asthma mean_patch_size edge_density geometry log_patch
0 0.313412 12.1 6.658935 0.521101 POLYGON ((-13650305.147 5699042.371, -13650193... 1.895960
1 0.400819 11.3 7.140428 0.767046 POLYGON ((-13650063.25 5697279.135, -13649943.... 1.965773
2 0.280456 11.7 3.626062 0.771442 POLYGON ((-13648772.389 5699042.688, -13648767... 1.288147
3 0.234055 11.6 2.930380 0.760752 POLYGON ((-13647657.747 5699037.289, -13647659... 1.075132
4 0.210439 11.9 2.508351 0.681798 POLYGON ((-13646585.183 5696754.598, -13646694... 0.919625
... ... ... ... ... ... ...
120 0.292507 13.1 4.733882 0.732260 POLYGON ((-13634931.924 5701311.194, -13634931... 1.554746
121 0.246223 11.6 2.512500 0.350452 POLYGON ((-13634999.94 5697453.141, -13635002.... 0.921278
122 0.039293 12.1 11.414258 0.035563 MULTIPOLYGON (((-13633882.74 5709928.305, -136... 2.434863
123 0.042808 11.7 3.669549 0.071589 POLYGON ((-13660405.502 5712550.235, -13660339... 1.300069
124 0.215517 12.6 4.166667 0.741667 POLYGON ((-13636161.226 5707323.831, -13636162... 1.427116

125 rows Ɨ 6 columns

InĀ [79]:
### Visualize transformed variables
hvplot.scatter_matrix(
    model_df
    [[
        'frac_veg',
        'edge_density',
        'log_patch',
        'asthma'
    ]]
)
Out[79]:
InĀ [80]:
### q-q plots
### set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'asthma']

### make q-q plot for each variables
plt.figure(figsize = (12, 10))
for i, var in enumerate(var_qq, 1):

    ### make 2x2 facet
    plt.subplot(2, 2, i)

    ### norm distribution
    stats.probplot(model_df[var], dist = "norm", plot = plt)

    ### add title
    plt.title(f'Q-Q plot for {var}')

### plot
plt.tight_layout()
plt.show()
No description has been provided for this image

Reflect and respond: Based on inspection of the histograms, several of the variables were not normally distributed, which is a concern for meeting the assumptions of a linear regression model. In particular, adult asthma prevalence and mean vegetation patch size were the most skewed. To address this, I applied a log transformation to both variables to reduce skewness and bring the distributions closer to normal.

After transformation, the variables more closely approximate normal distributions and are more suitable for use in an OLS framework. While the transformed variables are not perfectly normal, they are sufficiently improved to support an exploratory linear analysis. Given the goals of this model, to identify broad relationships rather than make precise predictions, the transformations are adequate for meeting the assumptions of a linear model.

Step 5b - Fit and Predict¶

If you have worked with statistical models before, you may notice that the scikitlearn library has a slightly different approach than many software packages. For example, scikitlearn emphasizes generic model performance measures like cross-validation and importance over coefficient p-values and correlation. The scikitlearn approach is meant to generalize more smoothly to machine learning (ML) models where the statistical significance is harder to derive mathematically.

Try It
  1. Use the scikitlearn documentation or ChatGPT as a starting point, split your data into training and testing datasets.
  2. Fit a linear regression to your training data.
  3. Use your fitted model to predict the testing values.
  4. Plot the predicted values against the measured values. You can use the following plotting code as a starting point.
InĀ [81]:
### Select predictor and outcome variables
X = model_df[['edge_density', 'log_patch']]
y = model_df[['asthma']]

### Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size = 0.33, random_state = 19
)

### Fit a linear regression
reg = LinearRegression()

### fit it to our data
reg.fit(X_train, y_train)

### Predict asthma values for the test dataset
y_test['pred_asthma'] = np.exp(reg.predict(X_test))

### real asthma rate for comparison
y_test['asthma'] = np.exp(y_test.asthma)

y_test
Out[81]:
asthma pred_asthma
27 73130.441833 121694.493606
99 198789.151143 136349.314210
36 98715.771011 173982.656275
2 120571.714986 141342.789221
77 219695.988672 116207.570247
105 162754.791419 120875.041480
124 296558.565298 135835.195944
118 198789.151143 146596.941060
78 89321.723361 97364.751829
16 98715.771011 172366.801852
30 80821.637540 106924.654909
95 162754.791419 182767.267020
122 179871.862254 121281.584296
18 109097.799277 144166.671629
113 400312.191330 122340.176642
8 147266.625241 145617.378311
65 44355.855130 84421.370313
84 98715.771011 109042.513801
13 219695.988672 149694.298701
61 162754.791419 166967.736950
83 54176.363797 96569.709963
46 296558.565298 161301.947040
37 98715.771011 152756.240989
104 268337.286521 123771.684495
70 49020.801136 92049.909520
123 120571.714986 180650.774690
111 242801.617498 100281.438743
14 147266.625241 167276.009377
76 54176.363797 81184.424663
28 147266.625241 157831.676096
20 120571.714986 152386.280398
101 133252.352946 164648.997001
114 179871.862254 108759.464545
47 242801.617498 173630.513186
119 442413.392009 164928.597895
102 120571.714986 147499.095097
107 147266.625241 151698.899208
29 198789.151143 192928.136665
82 198789.151143 120207.897561
6 133252.352946 151091.099066
64 89321.723361 168853.342763
90 296558.565298 116079.620834
InĀ [82]:
### Plot measured vs. predicted asthma prevalence with a 1-to-1 line
y_max = y_test.asthma.max()

(
    y_test
    .hvplot.scatter(x='asthma', 
                    y='pred_asthma',
                    xlabel = "Measured adult asthma prevalence",
                    ylabel = "Predicted adult asthma prevalence",
                    title = "Linear regression performance - testing data")
    .opts(aspect='equal', 
          xlim=(0, y_max), ylim=(0, y_max), 
          width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
Out[82]:

Step 5c - Explore spatial bias¶

We always need to think about bias, or systematic error, in model results. Every model is going to have some error, but we’d like to see that error evenly distributed. When the error is systematic, it can be an indication that we are missing something important in the model.

In geographic data, it is common for location to be a factor that doesn’t get incorporated into models. After all – we generally expect places that are right next to each other to be more similar than places that are far away (this phenomenon is known as spatial autocorrelation). However, models like this linear regression don’t take location into account at all.

Try It
  1. Compute the model error (predicted - measured) for all census tracts
  2. Plot the error as a chloropleth map with a diverging color scheme
  3. Looking at both of your error plots, what do you notice? What are some possible explanations for any bias you see in your model?
InĀ [Ā ]:
### Compute model error for all census tracts
### use the trained model to predict asthma prevalence for each census tract
model_df['pred_asthma'] = np.exp(reg.predict(X))

### calculate model error
model_df['error_asthma'] = model_df['pred_asthma'] - model_df['asthma']

### Plot error geographically as a chloropleth
(
    plot_chloropleth(model_df, color = 'error_asthma', title = 'Model error (predicted - measured)', cmap = 'RdBu', colorbar_opts={
            'title': 'Asthma prevalence error (%)'
        })

    ### set frame width and aspect ratio
    .opts(frame_width = 600, aspect = 'equal')
)
Out[Ā ]:
InĀ [91]:
error_map = (
    plot_chloropleth(
        model_df,
        color="error_asthma",
        title="Model error (predicted - measured)",
        cmap="RdBu",
        colorbar_opts={"title": "Asthma prevalence error (%)"},
    )
    .opts(width=500, height=400)
)

error_map
Out[91]:
InĀ [92]:
hv.save(error_map, "portland_model_error.html")