cboettig committed on
Commit
c399c9b
·
1 Parent(s): 14124c0

here we go :rocket:

Files changed (8)
  1. README.md +31 -2
  2. app.R +73 -102
  3. footer.md +16 -0
  4. preprocess.md +27 -0
  5. schema.yml +41 -0
  6. system-prompt.md +34 -0
  7. test.R +82 -0
  8. utils.R +4 -0
README.md CHANGED
@@ -1,8 +1,8 @@
1
  ---
2
- title: Biodiversity Justice
3
  emoji: 📚
4
  colorFrom: blue
5
- colorTo: green
6
  sdk: docker
7
  pinned: false
8
  license: bsd-2-clause
@@ -12,3 +12,32 @@ license: bsd-2-clause
12
 
13
  :hugs: Shiny App on Huggingface: <https://huggingface.co/spaces/boettiger-lab/geo-llm-r>
14
 
1
  ---
2
+ title: Geo Llm R
3
  emoji: 📚
4
  colorFrom: blue
5
+ colorTo: yellow
6
  sdk: docker
7
  pinned: false
8
  license: bsd-2-clause
 
12
 
13
  :hugs: Shiny App on Huggingface: <https://huggingface.co/spaces/boettiger-lab/geo-llm-r>
14
 
15
+ Work in progress. This is a proof-of-principle for an LLM-driven interface to dynamic mapping. Key technologies include duckdb, geoparquet, pmtiles, maplibre, and open LLMs (via vLLM + LiteLLM). The R interface is built with ellmer (LLMs), mapgl (maplibre), shiny, and duckdb.
16
+
17
+ # Setup
18
+
19
+ ## GitHub with HuggingFace Deploy
20
+
21
+ All edits should be pushed to GitHub. Edits to the `main` branch are automatically deployed to HuggingFace via GitHub Actions.
22
+ When using this scaffold, you will first have to set up your auto-deploy system:
23
+
24
+ - [Create a new HuggingFace Space](https://huggingface.co/new-space) (any template is fine, will be overwritten).
25
+ - [Create a HuggingFace Token](https://huggingface.co/settings/tokens/new?tokenType=write) with write permissions if you do not have one.
26
+ - In the GitHub Settings of your repository, add the token as a "New Repository Secret" under the `Secrets and Variables` -> `Actions` section of settings (`https://github.com/{USER}/{REPO}/settings/secrets/actions`).
27
+ - Edit the `.github/workflows/deploy.yml` file to specify your HuggingFace user name and HF repo to publish to.
28
+
29
+ ## Language Model setup
30
+
31
+ This example is designed to leverage open-source or open-weights models. You will need to adjust the API URL and API key accordingly. This could be a local model served with `vllm` or `ollama`, and commercial models should work too. The demo app currently runs on a vLLM + LiteLLM-backed model (currently a Llama 3 variant) hosted on the National Research Platform.
32
+
33
+ The LLM plays only a simple role: generating SQL queries from background information on the data, including the table schema (see the system prompt for details). Most open models I have experimented with do not support the [tool use](https://ellmer.tidyverse.org/articles/tool-calling.html) or [structured data](https://ellmer.tidyverse.org/articles/structured-data.html) interfaces as well as commercial models do. An important trick used here for working with open models is simply to request that the reply be structured as JSON. Open models are quite decent at this, and at SQL construction, given the necessary context about the data. The map and chart elements merely react to the resulting data frames, so the entire analysis is as transparent and reproducible as it would be if the user had composed their request in SQL instead of plain English.
34
+
35
+ ## Software Dependencies
36
+
37
+ The Dockerfile includes all dependencies required for the HuggingFace deployment, and can be used as a template or directly to serve RStudio server.
38
+
39
+ ## Data pre-processing
40
+
41
+ Pre-processing the data into cloud-native formats and hosting it on a high-bandwidth, highly available server is essential for efficient and scalable rendering. Pre-computing expensive operations such as zonal statistics across all features is also necessary. These steps are described in [preprocess.md](preprocess.md) and the corresponding scripts.
42
+
43
+
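Note on the system prompt mentioned in the README: app.R fills a `<schema>` placeholder in system-prompt.md using `glue` with `.open = "<"` and `.close = ">"`, presumably so that the literal JSON braces in the prompt are left alone. The sketch below shows the same substitution idea in base R. The template and schema strings are shortened, made-up stand-ins for the real files, and `sub()` is swapped in for `glue` only to keep the example dependency-free.

```r
# Hypothetical, shortened stand-ins for system-prompt.md and schema.yml.
template <- paste(
  'Reply in JSON: {"query": "...", "explanation": "..."}',
  "Column metadata:",
  "<schema>",
  sep = "\n"
)
schema <- "- VARIABLE_NAME: FIPS\n  DESCRIPTION: County identifier"

# fixed = TRUE treats "<schema>" as a literal string, so the JSON braces
# elsewhere in the template are not interpreted in any way.
system_prompt <- sub("<schema>", schema, template, fixed = TRUE)

cat(system_prompt)
```

Using angle brackets as the placeholder delimiters means the prompt can contain example JSON without any escaping.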
app.R CHANGED
@@ -1,20 +1,23 @@
1
  library(shiny)
2
  library(bslib)
3
  library(htmltools)
4
- library(markdown)
5
  library(fontawesome)
6
  library(bsicons)
7
  library(gt)
8
  library(glue)
9
  library(ggplot2)
10
-
11
- library(mapgl)
12
  library(dplyr)
 
13
  library(duckdbfs)
14
-
15
  duckdbfs::load_spatial()
16
 
17
- css <- HTML("<link rel='stylesheet' type='text/css' href='https://demos.creative-tim.com/material-dashboard/assets/css/material-dashboard.min.css?v=3.2.0'>")
 
 
 
 
18
 
19
 
20
  # Define the UI
@@ -23,23 +26,28 @@ ui <- page_sidebar(
23
  tags$head(css),
24
  titlePanel("Demo App"),
25
 
26
- "This is a proof-of-principle for a simple chat-driven interface to dynamically explore geospatial data.
27
- ",
28
-
 
29
 
30
  card(
31
  layout_columns(
32
- textInput("chat",
33
- label = NULL,
34
- "Which counties in California have the highest average social vulnerability?",
35
- width = "100%"),
36
- div(
37
- actionButton("user_msg", "", icon = icon("paper-plane"),
38
- class = "btn-primary btn-sm align-bottom"),
39
- class = "align-text-bottom"),
40
- col_widths = c(11, 1)),
41
- fill = FALSE
42
  ),
 
 
 
 
43
  layout_columns(
44
  card(maplibreOutput("map")),
45
  card(includeMarkdown("## Plot"),
@@ -51,12 +59,10 @@ ui <- page_sidebar(
51
  max_height = "700px"
52
  ),
53
 
54
-
55
  gt_output("table"),
56
 
57
  card(fill = TRUE,
58
  card_header(fa("robot")),
59
-
60
  accordion(
61
  open = FALSE,
62
  accordion_panel(
@@ -70,34 +76,13 @@ ui <- page_sidebar(
70
  textOutput("explanation"),
71
  )
72
  ),
73
-
74
  card(
75
  card_header("Errata"),
76
- markdown(
77
- "
78
- #### Credits
79
-
80
- Developed by Carl Boettiger, UC Berkeley, 2025. BSD License.
81
-
82
- Data from the US Census and CDC's [Social Vulnerability Index](https://www.atsdr.cdc.gov/place-health/php/svi/index.html)
83
-
84
- #### Technical details
85
-
86
- The app is written entirely in R using shiny. The app will translate natural language queries in SQL code using
87
- a small open-weights language model. The SQL code is executed using the duckdb backend against cloud-native
88
- geoparquet snapshot of the Social Vulnerability Index hosted on Source Cooperative. Summary chart data are also
89
- computed in duckdb by streaming, providing responsive updates while needing minimal RAM or disk storage despite
90
- the large size of the data sources.
91
-
92
- The map is rendered and updated using MapLibre with PMTiles, which provides responsive rendering for large feature sets.
93
- The PMTiles layer is also hosted on Source cooperative where it can be streamed efficiently.
94
- ")
95
  )
96
-
97
  ),
98
 
99
  sidebar = sidebar(
100
-
101
  input_switch("redlines", "Redlined Areas", value = FALSE),
102
  input_switch("svi", "Social Vulnerability", value = TRUE),
103
  input_switch("richness", "Biodiversity Richness", value = FALSE),
@@ -112,41 +97,15 @@ The PMTiles layer is also hosted on Source cooperative where it can be streamed
112
  theme = bs_theme(version = "5")
113
  )
114
 
115
-
116
-
117
-
118
  repo <- "https://data.source.coop/cboettig/social-vulnerability"
119
- pmtiles <- glue("{repo}/svi2020_us_tract.pmtiles")
120
- parquet <- glue("{repo}/svi2020_us_tract.parquet")
121
- svi <- open_dataset(parquet, tblname = "svi") |>
122
- filter(RPL_THEMES > 0)
123
-
124
 
 
125
  con <- duckdbfs::cached_connection()
126
- schema <- DBI::dbGetQuery(con, "PRAGMA table_info(svi)")
127
-
128
- system_prompt = glue::glue('
129
- You are a helpful agent who always replies strictly in JSON-formatted text.
130
- Your task is to translate the users question into a SQL query that will be run
131
- against the "svi" table in a duckdb database. The duckdb database has a
132
- spatial extension which understands PostGIS operations as well.
133
- Include semantically meaningful columns like COUNTY and STATE name.
134
-
135
- In the data, each row represents an individual census tract. If asked for
136
- county or state level statistics, be sure to aggregate across all the tracts
137
- in that county or state.
138
-
139
- The table schema is <schema>
140
-
141
- The column called "RPL_THEMES" corresponds to the overall "Social vulnerability index" number.
142
-
143
- Format your answer as follows:
144
-
145
- {
146
- "query": "your raw SQL response goes here",
147
- "explanation": "your explanation of the query"
148
- }
149
- ', .open = "<", .close = ">")
150
 
151
  chat <- ellmer::chat_vllm(
152
  base_url = "https://llm.nrp-nautilus.io/",
@@ -168,52 +127,64 @@ filter_column <- function(full_data, filtered_data, id_col = "FIPS") {
168
  list("in", list("get", id_col), list("literal", values))
169
  }
170
 
171
- chart1_data <- svi |>
172
- group_by(COUNTY) |>
173
- summarise(mean_svi = mean(RPL_THEMES)) |>
174
- collect()
175
-
176
- chart1 <- chart1_data |>
177
- ggplot(aes(mean_svi)) + geom_density(fill="darkred") +
178
- ggtitle("County-level vulnerability nation-wide")
179
 
180
 
181
  # Define the server
182
  server <- function(input, output, session) {
 
 
 
 
 
 
 
 
 
 
183
  data <- reactiveValues(df = tibble())
184
  output$chart1 <- renderPlot(chart1)
185
 
186
  observeEvent(input$user_msg, {
187
  stream <- chat$chat(input$chat)
188
 
189
- # optional, remember previous discussion
190
- #chat_append("chat", stream)
191
 
192
  # Parse response
193
  response <- jsonlite::fromJSON(stream)
194
- output$sql_code <- renderText(stringr::str_wrap(response$query, width = 60))
195
- output$explanation <- renderText(response$explanation)
196
 
197
- # Actually execute the SQL query generated:
198
- df <- DBI::dbGetQuery(con, response$query)
 
 
 
 
199
 
200
- # don't display shape column in render
201
- df <- df |> select(-any_of("Shape"))
202
- output$table <- render_gt(df, height = 300)
203
 
204
 
205
- y_axis <- colnames(df)[!colnames(df) %in% colnames(svi)]
206
- chart2 <- df |>
207
- rename(social_vulnerability = y_axis) |>
208
- ggplot(aes(social_vulnerability)) +
209
- geom_density(fill = "darkred") +
210
- xlim(c(0, 1)) +
211
- ggtitle("Vulnerability of selected areas")
212
 
213
- output$chart2 <- renderPlot(chart2)
214
 
215
- # We need to somehow trigger this df to update the map.
216
- data$df <- df
 
 
 
 
 
 
 
 
 
217
 
218
  })
219
 
@@ -259,7 +230,7 @@ server <- function(input, output, session) {
259
  id = "svi_layer",
260
  source = list(type = "vector",
261
  url = paste0("pmtiles://", pmtiles)),
262
- source_layer = "SVI2000_US_tract",
263
  filter = filter_column(svi, data$df, "FIPS"),
264
  fill_opacity = 0.5,
265
  fill_color = interpolate(column = "RPL_THEMES",
 
1
  library(shiny)
2
  library(bslib)
3
  library(htmltools)
4
+ #library(markdown)
5
  library(fontawesome)
6
  library(bsicons)
7
  library(gt)
8
  library(glue)
9
  library(ggplot2)
10
+ library(readr)
 
11
  library(dplyr)
12
+ library(mapgl)
13
  library(duckdbfs)
 
14
  duckdbfs::load_spatial()
15
 
16
+ css <-
17
+ HTML(paste0("<link rel='stylesheet' type='text/css' ",
18
+ "href='https://demos.creative-tim.com/",
19
+ "material-dashboard/assets/css/",
20
+ "material-dashboard.min.css?v=3.2.0'>"))
21
 
22
 
23
  # Define the UI
 
26
  tags$head(css),
27
  titlePanel("Demo App"),
28
 
29
+ "
30
+ This is a proof-of-principle for a simple chat-driven interface
31
+ to dynamically explore geospatial data.
32
+ ",
33
 
34
  card(
35
  layout_columns(
36
+ textInput("chat",
37
+ label = NULL,
38
+ "Which counties in California have the highest average social vulnerability?",
39
+ width = "100%"),
40
+ div(
41
+ actionButton("user_msg", "", icon = icon("paper-plane"),
42
+ class = "btn-primary btn-sm align-bottom"),
43
+ class = "align-text-bottom"),
44
+ col_widths = c(11, 1)),
45
+ fill = FALSE
46
  ),
47
+
48
+ textOutput("agent"),
49
+
50
+
51
  layout_columns(
52
  card(maplibreOutput("map")),
53
  card(includeMarkdown("## Plot"),
 
59
  max_height = "700px"
60
  ),
61
 
 
62
  gt_output("table"),
63
 
64
  card(fill = TRUE,
65
  card_header(fa("robot")),
 
66
  accordion(
67
  open = FALSE,
68
  accordion_panel(
 
76
  textOutput("explanation"),
77
  )
78
  ),
 
79
  card(
80
  card_header("Errata"),
81
+ shiny::markdown(readr::read_file("footer.md")),
82
  )
 
83
  ),
84
 
85
  sidebar = sidebar(
 
86
  input_switch("redlines", "Redlined Areas", value = FALSE),
87
  input_switch("svi", "Social Vulnerability", value = TRUE),
88
  input_switch("richness", "Biodiversity Richness", value = FALSE),
 
97
  theme = bs_theme(version = "5")
98
  )
99
 
 
 
 
100
  repo <- "https://data.source.coop/cboettig/social-vulnerability"
101
+ pmtiles <- glue("{repo}/2022/SVI2022_US_county.pmtiles")
 
 
 
 
102
 
103
+ duckdb_s3_config(s3_endpoint = "minio.carlboettiger.info")
104
  con <- duckdbfs::cached_connection()
105
+ # tblname must match the table named in system-prompt.md
+ svi <- open_dataset("s3://public-gbif/svi", tblname = "biodiversity_occurrences") |> filter(RPL_THEMES > 0)
106
+ schema <- read_file("schema.yml")
107
+ system_prompt <- glue::glue(readr::read_file("system-prompt.md"),
108
+ .open = "<", .close = ">")
109
 
110
  chat <- ellmer::chat_vllm(
111
  base_url = "https://llm.nrp-nautilus.io/",
 
127
  list("in", list("get", id_col), list("literal", values))
128
  }
129
 
 
 
 
 
 
 
 
 
130
 
131
 
132
  # Define the server
133
  server <- function(input, output, session) {
134
+
135
+ chart1_data <- svi |>
136
+ group_by(COUNTY) |>
137
+ summarise(mean_svi = mean(RPL_THEMES)) |>
138
+ collect()
139
+
140
+ chart1 <- chart1_data |>
141
+ ggplot(aes(mean_svi)) + geom_density(fill="darkred") +
142
+ ggtitle("County-level vulnerability nation-wide")
143
+
144
  data <- reactiveValues(df = tibble())
145
  output$chart1 <- renderPlot(chart1)
146
 
147
  observeEvent(input$user_msg, {
148
  stream <- chat$chat(input$chat)
149
 
150
+
 
151
 
152
  # Parse response
153
  response <- jsonlite::fromJSON(stream)
 
 
154
 
155
+ if ("query" %in% names(response)) {
156
+ output$sql_code <- renderText(stringr::str_wrap(response$query, width = 60))
157
+ output$explanation <- renderText(response$explanation)
158
+
159
+ # Actually execute the SQL query generated:
160
+ df <- DBI::dbGetQuery(con, response$query)
161
 
162
+ # don't display shape column in render
163
+ df <- df |> select(-any_of("Shape"))
164
+ output$table <- render_gt(df, height = 300)
165
 
166
 
167
+ y_axis <- colnames(df)[!colnames(df) %in% colnames(svi)]
168
+ chart2 <- df |>
169
+ rename(social_vulnerability = all_of(y_axis)) |>
170
+ ggplot(aes(social_vulnerability)) +
171
+ geom_density(fill = "darkred") +
172
+ xlim(c(0, 1)) +
173
+ ggtitle("Vulnerability of selected areas")
174
 
175
+ output$chart2 <- renderPlot(chart2)
176
 
177
+ # We need to somehow trigger this df to update the map.
178
+ data$df <- df
179
+
180
+ # Note: ellmer will preserve full chat history automatically.
181
+ # this can confuse the agent and mess up behavior, so we reset:
182
+ chat$set_turns(NULL)
183
+
184
+ } else {
185
+ output$agent <- renderText(response$agent)
186
+
187
+ }
188
 
189
  })
190
 
 
230
  id = "svi_layer",
231
  source = list(type = "vector",
232
  url = paste0("pmtiles://", pmtiles)),
233
+ source_layer = "svi",
234
  filter = filter_column(svi, data$df, "FIPS"),
235
  fill_opacity = 0.5,
236
  fill_color = interpolate(column = "RPL_THEMES",
footer.md ADDED
@@ -0,0 +1,16 @@
1
+ #### Credits
2
+
3
+ Developed by Carl Boettiger, UC Berkeley, 2025. BSD License.
4
+
5
+ Data from the US Census and CDC's [Social Vulnerability Index](https://www.atsdr.cdc.gov/place-health/php/svi/index.html)
6
+
7
+ #### Technical details
8
+
9
+ The app is written entirely in R using shiny. The app will translate natural language queries into SQL code using
10
+ a small open-weights language model. The SQL code is executed using the duckdb backend against a cloud-native
11
+ geoparquet snapshot of the Social Vulnerability Index hosted on Source Cooperative. Summary chart data are also
12
+ computed in duckdb by streaming, providing responsive updates while needing minimal RAM or disk storage despite
13
+ the large size of the data sources.
14
+
15
+ The map is rendered and updated using MapLibre with PMTiles, which provides responsive rendering for large feature sets.
16
+ The PMTiles layer is also hosted on Source Cooperative where it can be streamed efficiently.
preprocess.md ADDED
@@ -0,0 +1,27 @@
1
+ ---
2
+ ---
3
+
4
+
5
+ # Vector Layers
6
+
7
+ The heart of this application design is a vector dataset serialized as both (Geo)Parquet and PMTiles.
8
+ The parquet version allows for real-time calculations through rapid SQL queries via duckdb,
9
+ and the PMTiles version allows the data to be quickly visualized at any zoom through maplibre.
10
+ maplibre can also efficiently filter the PMTiles data given the feature ids returned by duckdb.
11
+
12
+ GDAL's `ogr2ogr` can generate both PMTiles and geoparquet, though `tippecanoe` provides more
13
+ options for PMTiles generation and can produce nicer tile sets.
14
+
15
+ The demo uses the CDC Social Vulnerability data because it follows the hierarchical partitioning
16
+ used by the Census (Country->State->County->Tract).
17
+
18
+ # Raster Layers
19
+
20
+ ## Generating static tiles
21
+
22
+ ## Zonal statistics calculations
23
+
24
+ The application is essentially driven by the vector layer data using SQL.
25
+ I find it helpful to pre-process 'zonal' calculations, e.g. the mean value of each raster layer
26
+ within each feature in the 'focal' vector data set(s).
27
+
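Note on the zonal statistics step: a minimal base R sketch of what "the mean value of each raster layer within each feature" amounts to once every raster cell has been assigned to a feature. The cell values and FIPS codes are made up for illustration, and the cell-to-feature assignment (normally done by a geospatial library) is assumed to have happened already.

```r
# Toy raster flattened to a vector of cell values, plus the id of the
# feature (zone) that each cell falls in. All values are illustrative.
cell_values <- c(0.2, 0.4, 0.6, 0.8, 1.0, 0.0)
cell_zone   <- c("06001", "06001", "06013", "06013", "06013", "06075")

# Zonal mean: average the cell values within each zone.
zonal_mean <- tapply(cell_values, cell_zone, mean)

# One row per feature, ready to be joined to the vector layer by FIPS.
zonal_table <- data.frame(
  FIPS = names(zonal_mean),
  mean_value = as.vector(zonal_mean),
  stringsAsFactors = FALSE
)

print(zonal_table)
```

Storing the result as one extra column per raster layer keeps the application driven by plain SQL over the vector table.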
schema.yml ADDED
@@ -0,0 +1,41 @@
1
+ - VARIABLE_NAME: ST
2
+ DESCRIPTION: State-level FIPS code (two-digit integer)
3
+ - VARIABLE_NAME: STATE
4
+ DESCRIPTION: State name
5
+ - VARIABLE_NAME: ST_ABBR
6
+ DESCRIPTION: State abbreviation
7
+ - VARIABLE_NAME: STCNTY
8
+ DESCRIPTION: County-level FIPS code (5 digit integer)
9
+ - VARIABLE_NAME: COUNTY
10
+ DESCRIPTION: County name
11
+ - VARIABLE_NAME: FIPS
12
+ DESCRIPTION: Tract-level geographic identification (full Census Bureau FIPS code)
13
+ - VARIABLE_NAME: LOCATION
14
+ DESCRIPTION: Text description of tract county state
15
+ - VARIABLE_NAME: AREA_SQMI
16
+ DESCRIPTION: Tract area in square miles
17
+ - VARIABLE_NAME: RPL_THEMES
18
+ DESCRIPTION: Overall social vulnerability. Should always be used unless explicit sub-theme is called for.
19
+ - VARIABLE_NAME: RPL_THEME1
20
+ DESCRIPTION: Subtheme for socio-economic status social vulnerability score
21
+ - VARIABLE_NAME: RPL_THEME2
22
+ DESCRIPTION: Subtheme for Household characteristics vulnerability score
23
+ - VARIABLE_NAME: RPL_THEME3
24
+ DESCRIPTION: Subtheme for Racial and Ethnic Minority status based vulnerability score
25
+ - VARIABLE_NAME: RPL_THEME4
26
+ DESCRIPTION: Subtheme for Housing and transportation-based vulnerability score
27
+ - VARIABLE_NAME: kingdom
28
+ DESCRIPTION: phylogenetic kingdom
29
+ - VARIABLE_NAME: phylum
30
+ DESCRIPTION: phylogenetic phylum
31
+ - VARIABLE_NAME: class
32
+ DESCRIPTION: phylogenetic class
33
+ - VARIABLE_NAME: order
34
+ DESCRIPTION: phylogenetic order
35
+ - VARIABLE_NAME: family
36
+ DESCRIPTION: phylogenetic family
37
+ - VARIABLE_NAME: genus
38
+ DESCRIPTION: phylogenetic genus
39
+ - VARIABLE_NAME: species
40
+ DESCRIPTION: phylogenetic genus and species name (scientific name)
41
+
system-prompt.md ADDED
@@ -0,0 +1,34 @@
1
+
2
+ You are a helpful agent who always replies strictly in JSON-formatted text.
3
+ Your task is to translate the user's questions about the data into a SQL query
4
+ that will be run against the "biodiversity_occurrences" table in a duckdb database.
5
+ The duckdb database has a spatial extension which understands PostGIS operations as well.
6
+
7
+ If your answer involves the construction of a SQL query, you must format your answer as follows:
8
+
9
+ {
10
+ "query": "your raw SQL response goes here",
11
+ "explanation": "your explanation of the query"
12
+ }
13
+
14
+ If your answer does not involve a SQL query, please reply with the following format instead:
15
+
16
+ {
17
+ "user": "user question goes here",
18
+ "agent": "your response goes here"
19
+ }
20
+
21
+ If you are asked to describe the data or for information about the data schema, give only a human-readable response with SQL.
22
+
23
+ In the data, each row represents an individual occurrence of a species. The occurrences
24
+ are geocoded to US Census counties, with the STATE, COUNTY, and FIPS columns indicating
25
+ the corresponding state name, county name, and FIPS identifier for the specific County.
26
+ The FIPS column is a 5-digit number that uniquely identifies a county in a state.
27
+ Taxonomic classification of the species is given in the corresponding columns, kingdom,
28
+ phylum, class, order, family, genus, and species.
29
+
30
+ The data also includes information about various measures of social vulnerability (RPL_THEMES).
31
+ Pay attention to the DESCRIPTION of each of the columns (VARIABLE_NAME) from the metadata table:
32
+ <schema>
33
+
34
+
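Note on the two reply formats defined in this prompt: the app has to decide which one it received before running any SQL. The sketch below is one way to do that defensively, assuming the `jsonlite` package is available. `parse_reply()` is a hypothetical helper, not part of this commit, and the sample replies are made up.

```r
library(jsonlite)

# Hypothetical helper: classify a raw model reply as a SQL query,
# a plain agent message, or something that could not be interpreted.
parse_reply <- function(raw) {
  parsed <- tryCatch(
    jsonlite::fromJSON(raw),
    error = function(e) NULL   # reply was not valid JSON
  )
  if (is.null(parsed) || !is.list(parsed)) {
    return(list(type = "invalid", text = raw))
  }
  if ("query" %in% names(parsed)) {
    return(list(type = "query",
                sql = parsed$query,
                explanation = parsed$explanation))
  }
  if ("agent" %in% names(parsed)) {
    return(list(type = "agent", text = parsed$agent))
  }
  list(type = "invalid", text = raw)
}

# Made-up replies covering the three cases.
sql_reply  <- '{"query": "SELECT COUNT(*) FROM biodiversity_occurrences", "explanation": "Counts rows"}'
chat_reply <- '{"user": "hi", "agent": "Hello"}'
bad_reply  <- "Sorry, I cannot help with that."

parse_reply(sql_reply)$type
parse_reply(chat_reply)$type
parse_reply(bad_reply)$type
```

Wrapping the parse in `tryCatch()` means a reply that is not valid JSON can be shown to the user as-is rather than stopping the observer with an error.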
test.R ADDED
@@ -0,0 +1,82 @@
1
+ ## Illustrate/test core app functionality without shiny
2
+
3
+ library(tidyverse)
4
+ library(duckdbfs)
5
+ library(mapgl)
6
+ library(ellmer)
7
+ library(glue)
8
+
9
+ repo <- "https://data.source.coop/cboettig/social-vulnerability"
10
+ pmtiles <- glue("{repo}/svi2020_us_tract.pmtiles")
11
+ duckdb_s3_config(s3_endpoint = "minio.carlboettiger.info")
12
+ svi <-
13
+ open_dataset("s3://public-gbif/svi", tblname = "biodiversity_occurrences") |>
14
+ filter(RPL_THEMES > 0)
15
+ schema <- read_file("schema.yml")
16
+ system_prompt <- glue::glue(readr::read_file("system-prompt.md"),
17
+ .open = "<", .close = ">")
18
+
19
+
20
+
21
+
22
+ # Or optionally test with cirrus
23
+ chat <- ellmer::chat_vllm(
24
+ base_url = "https://llm.cirrus.carlboettiger.info/v1/",
25
+ model = "kosbu/Llama-3.3-70B-Instruct-AWQ",
26
+ api_key = Sys.getenv("CIRRUS_LLM_KEY"),
27
+ system_prompt = system_prompt,
28
+ api_args = list(temperature = 0)
29
+ )
30
+
31
+ # or use the NRP model
32
+ chat <- ellmer::chat_vllm(
33
+ base_url = "https://llm.nrp-nautilus.io/",
34
+ model = "llama3",
35
+ api_key = Sys.getenv("NRP_API_KEY"),
36
+ system_prompt = system_prompt,
37
+ api_args = list(temperature = 0)
38
+ )
39
+
40
+ cols <- colnames(svi)
41
+ rpls <- grep("RPL_THEME.+", cols)
42
+ keep <- cols[c(1:9, rpls, 161:226)]
43
+ biodiversity <- svi |> select(all_of(keep))
44
+
45
+ # Test a chat-based response
46
+ chat$chat("Which columns describe racial components of social vulnerability?")
47
+ chat$set_turns(NULL)
48
+ ## A query-based response
49
+ stream <- chat$chat("Which counties have the most bird observations?")
50
+ stream <- chat$chat("Give me the number of bird observations per county vs county social vulnerability")
51
+ response <- jsonlite::fromJSON(stream)
52
+
53
+
54
+ stream2 <- chat$chat("Great, now give me the ggplot2 code to plot the data.frame you returned as those counts vs social vulnerability as points. Be sure to place the R code for your reply by itself in a 'code' element of the JSON")
55
+
56
+ plot_response <- jsonlite::fromJSON(stream2) # keep `response` (with $query) for the SQL step below
57
+
58
+ con <- duckdbfs::cached_connection()
59
+ filtered_data <- DBI::dbGetQuery(con, response$query)
60
+
61
+ filter_column <- function(full_data, filtered_data, id_col) {
62
+ if (nrow(filtered_data) < 1) return(NULL)
63
+ values <- full_data |>
64
+ inner_join(filtered_data, copy = TRUE) |>
65
+ pull(id_col)
66
+ # maplibre syntax for the filter of PMTiles
67
+ list("in", list("get", id_col), list("literal", values))
68
+ }
69
+
70
+ maplibre(center = c(-102.9, 41.3), zoom = 3) |>
71
+ add_fill_layer(
72
+ id = "svi_layer",
73
+ source = list(type = "vector", url = paste0("pmtiles://", pmtiles)),
74
+ source_layer = "SVI2000_US_tract",
75
+ filter = filter_column(svi, filtered_data, "FIPS"),
76
+ fill_opacity = 0.5,
77
+ fill_color = interpolate(column = "RPL_THEMES",
78
+ values = c(0, 1),
79
+ stops = c("#e19292c0", "darkblue"),
80
+ na_color = "lightgrey")
81
+ )
82
+
utils.R ADDED
@@ -0,0 +1,4 @@
1
+
2
+ library(tidyverse)
3
+ library(duckdbfs)
4
+