--- title: "collapse and sf" subtitle: "Fast Manipulation of Simple Features Data Frames" author: "Sebastian Krantz and Grant McDermott" date: "2024-04-19" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{collapse and sf} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This short vignette focuses on using *collapse* with the popular *sf* package by Edzer Pebesma. It shows that *collapse* supports easy manipulation of *sf* data frames, at computation speeds far above *dplyr*. *collapse* v1.6.0 added internal support for *sf* data frames by having most essential functions (e.g., `fselect/gv`, `fsubset/ss`, `fgroup_by`, `findex_by`, `qsu`, `descr`, `varying`, `funique`, `roworder`, `rsplit`, `fcompute`, ...) internally handle the geometry column. To demonstrate this, we can load a test dataset provided by *sf*: ```r library(collapse) library(sf) nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) options(sf_max_print = 3) nc # Simple feature collection with 100 features and 14 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 # 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0 # 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3 # 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6 # NWBIR79 geometry # 1 19 MULTIPOLYGON (((-81.47276 3... # 2 12 MULTIPOLYGON (((-81.23989 3... # 3 260 MULTIPOLYGON (((-80.45634 3... ``` ## Summarising sf Data Frames Computing summary statistics on *sf* data frames automatically excludes the 'geometry' column: ```r # Which columns have at least 2 non-missing distinct values varying(nc) # AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 # TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE # NWBIR74 BIR79 SID79 NWBIR79 # TRUE TRUE TRUE TRUE # Quick summary stats qsu(nc) # N Mean SD Min Max # AREA 100 0.1263 0.0492 0.042 0.241 # PERIMETER 100 1.673 0.4823 0.999 3.64 # CNTY_ 100 1985.96 106.5166 1825 2241 # CNTY_ID 100 1985.96 106.5166 1825 2241 # NAME 100 - - - - # FIPS 100 - - - - # FIPSNO 100 37100 58.023 37001 37199 # CRESS_ID 100 50.5 29.0115 1 100 # BIR74 100 3299.62 3848.1651 248 21588 # SID74 100 6.67 7.7812 0 44 # NWBIR74 100 1050.81 1432.9117 1 8027 # BIR79 100 4223.92 5179.4582 319 30757 # SID79 100 8.36 9.4319 0 57 # NWBIR79 100 1352.81 1975.9988 3 11631 # Detailed statistics description of each column descr(nc) # Dataset: nc, 14 Variables, N = 100 # ---------------------------------------------------------------------------------------------------- # AREA (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 77 0.13 0.05 0.04 0.24 0.48 2.5 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 0.04 0.06 0.06 0.09 0.12 0.15 0.2 0.21 0.24 # ---------------------------------------------------------------------------------------------------- # PERIMETER (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 96 1.67 0.48 1 3.64 1.48 5.95 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 1 1.09 1.19 1.32 1.61 1.86 2.2 2.72 3.2 # ---------------------------------------------------------------------------------------------------- # CNTY_ (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 100 1985.96 106.52 1825 2241 0.26 2.32 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 1826.98 1832.95 1837.9 1902.25 1982 2067.25 2110 2156.3 2238.03 # ---------------------------------------------------------------------------------------------------- # CNTY_ID (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 100 1985.96 106.52 1825 2241 0.26 2.32 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 1826.98 1832.95 1837.9 1902.25 1982 2067.25 2110 2156.3 2238.03 # ---------------------------------------------------------------------------------------------------- # NAME (character): # Statistics # N Ndist # 100 100 # Table # Freq Perc # Ashe 1 1 # Alleghany 1 1 # Surry 1 1 # Currituck 1 1 # Northampton 1 1 # Hertford 1 1 # Camden 1 1 # Gates 1 1 # Warren 1 1 # Stokes 1 1 # Caswell 1 1 # Rockingham 1 1 # Granville 1 1 # Person 1 1 # ... 86 Others 86 86 # # Summary of Table Frequencies # Min. 1st Qu. Median Mean 3rd Qu. Max. # 1 1 1 1 1 1 # ---------------------------------------------------------------------------------------------------- # FIPS (character): # Statistics # N Ndist # 100 100 # Table # Freq Perc # 37009 1 1 # 37005 1 1 # 37171 1 1 # 37053 1 1 # 37131 1 1 # 37091 1 1 # 37029 1 1 # 37073 1 1 # 37185 1 1 # 37169 1 1 # 37033 1 1 # 37157 1 1 # 37077 1 1 # 37145 1 1 # ... 86 Others 86 86 # # Summary of Table Frequencies # Min. 1st Qu. Median Mean 3rd Qu. Max. # 1 1 1 1 1 1 # ---------------------------------------------------------------------------------------------------- # FIPSNO (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 100 37100 58.02 37001 37199 -0 1.8 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 37002.98 37010.9 37020.8 37050.5 37100 37149.5 37179.2 37189.1 37197.02 # ---------------------------------------------------------------------------------------------------- # CRESS_ID (integer): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 100 50.5 29.01 1 100 0 1.8 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 1.99 5.95 10.9 25.75 50.5 75.25 90.1 95.05 99.01 # ---------------------------------------------------------------------------------------------------- # BIR74 (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 100 3299.62 3848.17 248 21588 2.79 11.79 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 283.64 419.75 531.8 1077 2180.5 3936 6725.7 11193 20378.22 # ---------------------------------------------------------------------------------------------------- # SID74 (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 23 6.67 7.78 0 44 2.44 10.28 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 0 0 0 2 4 8.25 15.1 18.25 38.06 # ---------------------------------------------------------------------------------------------------- # NWBIR74 (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 93 1050.81 1432.91 1 8027 2.83 11.84 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 1 9.95 39.2 190 697.5 1168.5 2231.8 3942.9 7052.84 # ---------------------------------------------------------------------------------------------------- # BIR79 (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 100 4223.92 5179.46 319 30757 2.99 13.1 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 349.69 539.3 675.7 1336.25 2636 4889 8313 14707.45 26413.87 # ---------------------------------------------------------------------------------------------------- # SID79 (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 28 8.36 9.43 0 57 2.28 9.88 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 0 0 1 2 5 10.25 21 26 38.19 # ---------------------------------------------------------------------------------------------------- # NWBIR79 (numeric): # Statistics # N Ndist Mean SD Min Max Skew Kurt # 100 98 1352.81 1976 3 11631 3.18 14.45 # Quantiles # 1% 5% 10% 25% 50% 75% 90% 95% 99% # 3.99 11.9 44.7 250.5 874.5 1406.75 2987.9 5090.5 10624.17 # ---------------------------------------------------------------------------------------------------- ``` ## Selecting Columns and Subsetting We can select columns from the *sf* data frame without having to worry about taking along 'geometry': ```r # Selecting a sequence of columns fselect(nc, AREA, NAME:FIPSNO) # Simple feature collection with 100 features and 4 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA NAME FIPS FIPSNO geometry # 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3... # 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3... # 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3... # Same using standard evaluation (gv is a shorthand for get_vars()) gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO")) # Simple feature collection with 100 features and 4 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA NAME FIPS FIPSNO geometry # 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3... # 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3... # 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3... ``` The same applies to subsetting rows (and columns): ```r # A fast and enhanced version of base::subset fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO) # Simple feature collection with 44 features and 4 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA NAME FIPS FIPSNO geometry # 1 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3... # 2 0.153 Northampton 37131 37131 MULTIPOLYGON (((-77.21767 3... # 3 0.153 Rockingham 37157 37157 MULTIPOLYGON (((-79.53051 3... # A fast version of `[` (where i is used and optionally j) ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO")) # Simple feature collection with 10 features and 4 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA NAME FIPS FIPSNO geometry # 1 0.114 Ashe 37009 37009 MULTIPOLYGON (((-81.47276 3... # 2 0.061 Alleghany 37005 37005 MULTIPOLYGON (((-81.23989 3... # 3 0.143 Surry 37171 37171 MULTIPOLYGON (((-80.45634 3... ``` This is significantly faster than using `[`, `base::subset()`, `dplyr::select()` or `dplyr::filter()`: ```r library(microbenchmark) library(dplyr) # Selecting columns microbenchmark(collapse = fselect(nc, AREA, NAME:FIPSNO), dplyr = select(nc, AREA, NAME:FIPSNO), collapse2 = gv(nc, c("AREA", "NAME", "FIPS", "FIPSNO")), sf = nc[c("AREA", "NAME", "FIPS", "FIPSNO")]) # Unit: microseconds # expr min lq mean median uq max neval # collapse 3.034 3.9565 5.19429 5.1865 5.6990 22.878 100 # dplyr 431.279 452.2915 505.29015 466.3750 493.8450 3356.342 100 # collapse2 2.665 3.4850 4.59610 4.4075 5.0635 14.391 100 # sf 105.165 114.1235 120.39732 118.0390 124.9270 156.497 100 # Subsetting microbenchmark(collapse = fsubset(nc, AREA > fmean(AREA), AREA, NAME:FIPSNO), dplyr = select(nc, AREA, NAME:FIPSNO) |> filter(AREA > fmean(AREA)), collapse2 = ss(nc, 1:10, c("AREA", "NAME", "FIPS", "FIPSNO")), sf = nc[1:10, c("AREA", "NAME", "FIPS", "FIPSNO")]) # Unit: microseconds # expr min lq mean median uq max neval # collapse 9.676 11.5825 15.01707 14.4730 16.8920 30.463 100 # dplyr 890.643 917.6415 1055.40970 941.7085 1009.7890 5546.685 100 # collapse2 2.829 3.5465 5.40585 4.8995 6.4165 20.541 100 # sf 176.997 187.6160 202.72286 200.7565 210.8220 340.464 100 ``` However, *collapse* functions don't subset the 'agr' attribute on selecting columns, which (if specified) relates columns (attributes) to the geometry, and also don't modify the 'bbox' attribute giving the overall boundaries of a set of geometries when subsetting the *sf* data frame. Keeping the full 'agr' attribute is not problematic for all practical purposes, but not changing 'bbox' upon subsetting may lead to too large margins when plotting the geometries of a subset *sf* data frame. One way to to change this is calling `st_make_valid()` on the subset frame; but `st_make_valid()` is very expensive, thus unless the subset frame is very small, it is better to use `[`, `base::subset()` or `dplyr::filter()` in cases where the bounding box size matters. ## Aggregation and Grouping The flexibility and speed of `collap()` for aggregation can be used on *sf* data frames. A separate method for *sf* objects was not considered necessary as one can simply aggregate the geometry column using `st_union()`: ```r # Aggregating by variable SID74 using the median for numeric and the mode for categorical columns collap(nc, ~ SID74, custom = list(fmedian = is.numeric, fmode = is.character, st_union = "geometry")) # or use is.list to fetch the geometry # Simple feature collection with 23 features and 15 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 SID74 NWBIR74 BIR79 # 1 0.0780 1.3070 1950.0 1950.0 Alleghany 37005 37073 37.0 487 0 0 40.0 594.0 # 2 0.0810 1.2880 1887.0 1887.0 Ashe 37009 37137 69.0 751 1 1 148.0 899.0 # 3 0.1225 1.6435 1959.5 1959.5 Caswell 37033 37078 39.5 1271 2 2 382.5 1676.5 # SID79 NWBIR79 geometry # 1 1 45 MULTIPOLYGON (((-83.69563 3... # 2 1 176 MULTIPOLYGON (((-80.02406 3... # 3 2 452 MULTIPOLYGON (((-77.16129 3... ``` *sf* data frames can also be grouped and then aggregated using `fsummarise()`: ```r nc |> fgroup_by(SID74) # Simple feature collection with 100 features and 14 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79 # 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 0 # 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 3 # 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 6 # NWBIR79 geometry # 1 19 MULTIPOLYGON (((-81.47276 3... # 2 12 MULTIPOLYGON (((-81.23989 3... # 3 260 MULTIPOLYGON (((-80.45634 3... # # Grouped by: SID74 [23 | 4 (4) 1-13] nc |> fgroup_by(SID74) |> fsummarise(AREA_Ag = fsum(AREA), Perimeter_Ag = fmedian(PERIMETER), geometry = st_union(geometry)) # Simple feature collection with 23 features and 3 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # SID74 AREA_Ag Perimeter_Ag geometry # 1 0 1.103 1.3070 MULTIPOLYGON (((-83.69563 3... # 2 1 0.914 1.2880 MULTIPOLYGON (((-80.02406 3... # 3 2 1.047 1.6435 MULTIPOLYGON (((-77.16129 3... ``` Typically most of the time in aggregation is consumed by `st_union()` so that the speed of *collapse* does not really become visible on most datasets. A faster alternative is to use *geos* (*sf* backend for planar geometries) or *s2* (*sf* backend for spherical geometries) directly: ```r # Using s2 backend: sensible for larger tasks nc |> fmutate(geometry = s2::as_s2_geography(geometry)) |> fgroup_by(SID74) |> fsummarise(AREA_Ag = fsum(AREA), Perimeter_Ag = fmedian(PERIMETER), geometry = s2::s2_union_agg(geometry)) |> fmutate(geometry = st_as_sfc(geometry)) # Simple feature collection with 23 features and 3 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: WGS 84 # First 3 features: # SID74 AREA_Ag Perimeter_Ag geometry # 1 0 1.103 1.3070 MULTIPOLYGON (((-83.69563 3... # 2 1 0.914 1.2880 MULTIPOLYGON (((-80.02406 3... # 3 2 1.047 1.6435 MULTIPOLYGON (((-77.16129 3... ``` In general, also upon aggregation with *collapse*, functions `st_as_sfc()`, `st_as_sf()`, or, in the worst case, `st_make_valid()`, may need to be invoked to ensure valid *sf* object output. Functions `collap()` and `fsummarise()` are attribute preserving but do not give special regard to geometry columns. One exception that both avoids the high cost of spatial functions in aggregation and any need for ex-post conversion/validation is aggregating spatial panel data over the time-dimension. Such panels can quickly be aggregated using `ffirst()` or `flast()` to aggregate the geometry: ```r # Creating a panel-dataset by simply duplicating nc for 2 different years pnc <- rowbind(`2000` = nc, `2001` = nc, idcol = "Year") |> as_integer_factor() pnc # Simple feature collection with 200 features and 15 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # Year AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 # 1 2000 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1 10 1364 # 2 2000 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0 10 542 # 3 2000 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5 208 3616 # SID79 NWBIR79 geometry # 1 0 19 MULTIPOLYGON (((-81.47276 3... # 2 3 12 MULTIPOLYGON (((-81.23989 3... # 3 6 260 MULTIPOLYGON (((-80.45634 3... # Aggregating by NAME, using the last value for all categorical data collap(pnc, ~ NAME, fmedian, catFUN = flast, cols = -1L) # Simple feature collection with 100 features and 15 fields # Geometry type: MULTIPOLYGON # Dimension: XY # Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965 # Geodetic CRS: NAD27 # First 3 features: # AREA PERIMETER CNTY_ CNTY_ID NAME NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 # 1 0.111 1.392 1904 1904 Alamance Alamance 37001 37001 1 4672 13 1243 5767 # 2 0.066 1.070 1950 1950 Alexander Alexander 37003 37003 2 1333 0 128 1683 # 3 0.061 1.231 1827 1827 Alleghany Alleghany 37005 37005 3 487 0 10 542 # SID79 NWBIR79 geometry # 1 11 1397 MULTIPOLYGON (((-79.24619 3... # 2 2 150 MULTIPOLYGON (((-81.10889 3... # 3 3 12 MULTIPOLYGON (((-81.23989 3... # Using fsummarise to aggregate just two variables and the geometry pnc_ag <- pnc |> fgroup_by(NAME) |> fsummarise(AREA_Ag = fsum(AREA), Perimeter_Ag = fmedian(PERIMETER), geometry = flast(geometry)) # The geometry is still valid... (slt = shorthand for fselect) plot(slt(pnc_ag, AREA_Ag)) ```