here we go :rocket:
- README.md +31 -2
- app.R +73 -102
- footer.md +16 -0
- preprocess.md +27 -0
- schema.yml +41 -0
- system-prompt.md +34 -0
- test.R +82 -0
- utils.R +4 -0
README.md
CHANGED
@@ -1,8 +1,8 @@
|
|
1 |
---
|
2 |
-
title:
|
3 |
emoji: π
|
4 |
colorFrom: blue
|
5 |
-
colorTo:
|
6 |
sdk: docker
|
7 |
pinned: false
|
8 |
license: bsd-2-clause
|
@@ -12,3 +12,32 @@ license: bsd-2-clause
|
|
12 |
|
13 |
:hugs: Shiny App on Huggingface: <https://huggingface.co/spaces/boettiger-lab/geo-llm-r>
|
14 |
|
|
1 |
---
|
2 |
+
title: Geo Llm R
|
3 |
emoji: π
|
4 |
colorFrom: blue
|
5 |
+
colorTo: yellow
|
6 |
sdk: docker
|
7 |
pinned: false
|
8 |
license: bsd-2-clause
|
|
|
12 |
|
13 |
:hugs: Shiny App on Huggingface: <https://huggingface.co/spaces/boettiger-lab/geo-llm-r>
|
14 |
|
15 |
+
Work in progress. This is a proof-of-principle for an LLM-driven interface to dynamic mapping. Key technologies include duckdb, geoparquet, pmtiles, maplibre, open LLMs (via VLLM + LiteLLM). R interface through ellmer (LLMs), mapgl (maplibre), shiny, and duckdb.
|
16 |
+
|
17 |
+
# Setup
|
18 |
+
|
19 |
+
## GitHub with HuggingFace Deploy
|
20 |
+
|
21 |
+
All edits should be pushed to GitHub. Edits to the `main` branch are automatically deployed to HuggingFace via GitHub Actions.
|
22 |
+
When using this scaffold, you will first have to set up your auto-deploy system:
|
23 |
+
|
24 |
+
- [Create a new HuggingFace Space](https://huggingface.co/new-space) (any template is fine; it will be overwritten).
|
25 |
+
- [Create a HuggingFace Token](https://huggingface.co/settings/tokens/new?tokenType=write) with write permissions if you do not have one.
|
26 |
+
- In the GitHub Settings of your repository, add the token as a "New Repository Secret" under the `Secrets and Variables` -> `Actions` section of settings (`https://github.com/{USER}/{REPO}/settings/secrets/actions`).
|
27 |
+
- Edit the `.github/workflows/deploy.yml` file to specify your HuggingFace user name and HF repo to publish to.
|
28 |
+
|
29 |
+
## Language Model setup
|
30 |
+
|
31 |
+
This example is designed to leverage open-source or open-weights models. You will need to adjust the API URL and API key accordingly. This could be a local model served with `vllm` or `ollama`, and commercial models should work too. The demo app currently runs on a vLLM+LiteLLM-backed model, at present a Llama3 variant, hosted on the National Research Platform.
|
32 |
+
|
33 |
+
The LLM plays only a simple role: generating SQL queries from background information on the data, including the table schema (see the system prompt for details). Most open models I have experimented with do not support the [tool use](https://ellmer.tidyverse.org/articles/tool-calling.html) or [structured data](https://ellmer.tidyverse.org/articles/structured-data.html) interfaces nearly as well as commercial models. An important trick used here for working with open models is simply to request that the reply be structured as JSON. Open models are quite decent at this, and at SQL construction, given the necessary context about the data. The map and chart elements merely react to the resulting data frames, so the entire analysis is as transparent and reproducible as it would be if the user had composed their request in SQL instead of plain English.
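A minimal sketch of the JSON trick described above, assuming the `jsonlite` package is installed. The `reply` string is a made-up stand-in for what a model might return, and the SQL inside it is illustrative only; the `query`, `explanation`, and `agent` field names follow the system prompt in this repository.

```r
library(jsonlite)

# Hypothetical model reply; in the app this string comes from the chat call.
reply <- '{
  "query": "SELECT COUNTY, AVG(RPL_THEMES) AS mean_svi FROM svi GROUP BY COUNTY",
  "explanation": "Average social vulnerability per county."
}'

response <- fromJSON(reply)

if ("query" %in% names(response)) {
  # A SQL reply: the query string would be handed to duckdb for execution.
  cat("SQL:", response$query, "\n")
} else {
  # A plain conversational reply with no SQL to run.
  cat("Agent:", response$agent, "\n")
}
```

Branching on the presence of `query` keeps the app from trying to execute a conversational reply as SQL.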
|
34 |
+
|
35 |
+
## Software Dependencies
|
36 |
+
|
37 |
+
The Dockerfile includes all dependencies required for the HuggingFace deployment, and can be used as a template or directly to serve RStudio server.
|
38 |
+
|
39 |
+
## Data pre-processing
|
40 |
+
|
41 |
+
Pre-processing the data into cloud-native formats and hosting data on a high-bandwidth, highly available server is essential for efficient and scalable rendering. Pre-computing expensive operations such as zonal statistics across all features is also necessary. These steps are described in [preprocess.md](preprocess.md) and corresponding scripts.
|
42 |
+
|
43 |
+
|
app.R
CHANGED
@@ -1,20 +1,23 @@
|
|
1 |
library(shiny)
|
2 |
library(bslib)
|
3 |
library(htmltools)
|
4 |
-
library(markdown)
|
5 |
library(fontawesome)
|
6 |
library(bsicons)
|
7 |
library(gt)
|
8 |
library(glue)
|
9 |
library(ggplot2)
|
10 |
-
|
11 |
-
library(mapgl)
|
12 |
library(dplyr)
|
|
|
13 |
library(duckdbfs)
|
14 |
-
|
15 |
duckdbfs::load_spatial()
|
16 |
|
17 |
-
css <-
|
|
|
|
|
|
|
|
|
18 |
|
19 |
|
20 |
# Define the UI
|
@@ -23,23 +26,28 @@ ui <- page_sidebar(
|
|
23 |
tags$head(css),
|
24 |
titlePanel("Demo App"),
|
25 |
|
26 |
-
"
|
27 |
-
|
28 |
-
|
|
|
29 |
|
30 |
card(
|
31 |
layout_columns(
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
-
|
36 |
-
|
37 |
-
|
38 |
-
|
39 |
-
|
40 |
-
|
41 |
-
|
42 |
),
|
|
|
|
|
|
|
|
|
43 |
layout_columns(
|
44 |
card(maplibreOutput("map")),
|
45 |
card(includeMarkdown("## Plot"),
|
@@ -51,12 +59,10 @@ ui <- page_sidebar(
|
|
51 |
max_height = "700px"
|
52 |
),
|
53 |
|
54 |
-
|
55 |
gt_output("table"),
|
56 |
|
57 |
card(fill = TRUE,
|
58 |
card_header(fa("robot")),
|
59 |
-
|
60 |
accordion(
|
61 |
open = FALSE,
|
62 |
accordion_panel(
|
@@ -70,34 +76,13 @@ ui <- page_sidebar(
|
|
70 |
textOutput("explanation"),
|
71 |
)
|
72 |
),
|
73 |
-
|
74 |
card(
|
75 |
card_header("Errata"),
|
76 |
-
markdown(
|
77 |
-
"
|
78 |
-
#### Credits
|
79 |
-
|
80 |
-
Developed by Carl Boettiger, UC Berkeley, 2025. BSD License.
|
81 |
-
|
82 |
-
Data from the US Census and CDC's [Social Vulnerability Index](https://www.atsdr.cdc.gov/place-health/php/svi/index.html)
|
83 |
-
|
84 |
-
#### Technical details
|
85 |
-
|
86 |
-
The app is written entirely in R using shiny. The app will translate natural language queries in SQL code using
|
87 |
-
a small open-weights language model. The SQL code is executed using the duckdb backend against cloud-native
|
88 |
-
geoparquet snapshot of the Social Vulnerability Index hosted on Source Cooperative. Summary chart data are also
|
89 |
-
computed in duckdb by streaming, providing responsive updates while needing minimal RAM or disk storage despite
|
90 |
-
the large size of the data sources.
|
91 |
-
|
92 |
-
The map is rendered and updated using MapLibre with PMTiles, which provides responsive rendering for large feature sets.
|
93 |
-
The PMTiles layer is also hosted on Source cooperative where it can be streamed efficiently.
|
94 |
-
")
|
95 |
)
|
96 |
-
|
97 |
),
|
98 |
|
99 |
sidebar = sidebar(
|
100 |
-
|
101 |
input_switch("redlines", "Redlined Areas", value = FALSE),
|
102 |
input_switch("svi", "Social Vulnerability", value = TRUE),
|
103 |
input_switch("richness", "Biodiversity Richness", value = FALSE),
|
@@ -112,41 +97,15 @@ The PMTiles layer is also hosted on Source cooperative where it can be streamed
|
|
112 |
theme = bs_theme(version = "5")
|
113 |
)
|
114 |
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
repo <- "https://data.source.coop/cboettig/social-vulnerability"
|
119 |
-
pmtiles <- glue("{repo}/
|
120 |
-
parquet <- glue("{repo}/svi2020_us_tract.parquet")
|
121 |
-
svi <- open_dataset(parquet, tblname = "svi") |>
|
122 |
-
filter(RPL_THEMES > 0)
|
123 |
-
|
124 |
|
|
|
125 |
con <- duckdbfs::cached_connection()
|
126 |
-
|
127 |
-
|
128 |
-
system_prompt
|
129 |
-
|
130 |
-
Your task is to translate the users question into a SQL query that will be run
|
131 |
-
against the "svi" table in a duckdb database. The duckdb database has a
|
132 |
-
spatial extension which understands PostGIS operations as well.
|
133 |
-
Include semantically meaningful columns like COUNTY and STATE name.
|
134 |
-
|
135 |
-
In the data, each row represents an individual census tract. If asked for
|
136 |
-
county or state level statistics, be sure to aggregate across all the tracts
|
137 |
-
in that county or state.
|
138 |
-
|
139 |
-
The table schema is <schema>
|
140 |
-
|
141 |
-
The column called "RPL_THEMES" corresponds to the overall "Social vulnerability index" number.
|
142 |
-
|
143 |
-
Format your answer as follows:
|
144 |
-
|
145 |
-
{
|
146 |
-
"query": "your raw SQL response goes here",
|
147 |
-
"explanation": "your explanation of the query"
|
148 |
-
}
|
149 |
-
', .open = "<", .close = ">")
|
150 |
|
151 |
chat <- ellmer::chat_vllm(
|
152 |
base_url = "https://llm.nrp-nautilus.io/",
|
@@ -168,52 +127,64 @@ filter_column <- function(full_data, filtered_data, id_col = "FIPS") {
|
|
168 |
list("in", list("get", id_col), list("literal", values))
|
169 |
}
|
170 |
|
171 |
-
chart1_data <- svi |>
|
172 |
-
group_by(COUNTY) |>
|
173 |
-
summarise(mean_svi = mean(RPL_THEMES)) |>
|
174 |
-
collect()
|
175 |
-
|
176 |
-
chart1 <- chart1_data |>
|
177 |
-
ggplot(aes(mean_svi)) + geom_density(fill="darkred") +
|
178 |
-
ggtitle("County-level vulnerability nation-wide")
|
179 |
|
180 |
|
181 |
# Define the server
|
182 |
server <- function(input, output, session) {
|
|
183 |
data <- reactiveValues(df = tibble())
|
184 |
output$chart1 <- renderPlot(chart1)
|
185 |
|
186 |
observeEvent(input$user_msg, {
|
187 |
stream <- chat$chat(input$chat)
|
188 |
|
189 |
-
|
190 |
-
#chat_append("chat", stream)
|
191 |
|
192 |
# Parse response
|
193 |
response <- jsonlite::fromJSON(stream)
|
194 |
-
output$sql_code <- renderText(stringr::str_wrap(response$query, width = 60))
|
195 |
-
output$explanation <- renderText(response$explanation)
|
196 |
|
197 |
-
|
198 |
-
|
|
|
|
|
|
|
|
|
199 |
|
200 |
-
|
201 |
-
|
202 |
-
|
203 |
|
204 |
|
205 |
-
|
206 |
-
|
207 |
-
|
208 |
-
|
209 |
-
|
210 |
-
|
211 |
-
|
212 |
|
213 |
-
|
214 |
|
215 |
-
|
216 |
-
|
|
217 |
|
218 |
})
|
219 |
|
@@ -259,7 +230,7 @@ server <- function(input, output, session) {
|
|
259 |
id = "svi_layer",
|
260 |
source = list(type = "vector",
|
261 |
url = paste0("pmtiles://", pmtiles)),
|
262 |
-
source_layer = "
|
263 |
filter = filter_column(svi, data$df, "FIPS"),
|
264 |
fill_opacity = 0.5,
|
265 |
fill_color = interpolate(column = "RPL_THEMES",
|
|
|
1 |
library(shiny)
|
2 |
library(bslib)
|
3 |
library(htmltools)
|
4 |
+
#library(markdown)
|
5 |
library(fontawesome)
|
6 |
library(bsicons)
|
7 |
library(gt)
|
8 |
library(glue)
|
9 |
library(ggplot2)
|
10 |
+
library(readr)
|
|
|
11 |
library(dplyr)
|
12 |
+
library(mapgl)
|
13 |
library(duckdbfs)
|
|
|
14 |
duckdbfs::load_spatial()
|
15 |
|
16 |
+
css <-
|
17 |
+
HTML(paste0("<link rel='stylesheet' type='text/css' ",
|
18 |
+
"href='https://demos.creative-tim.com/",
|
19 |
+
"material-dashboard/assets/css/",
|
20 |
+
"material-dashboard.min.css?v=3.2.0'>"))
|
21 |
|
22 |
|
23 |
# Define the UI
|
|
|
26 |
tags$head(css),
|
27 |
titlePanel("Demo App"),
|
28 |
|
29 |
+
"
|
30 |
+
This is a proof-of-principle for a simple chat-driven interface
|
31 |
+
to dynamically explore geospatial data.
|
32 |
+
",
|
33 |
|
34 |
card(
|
35 |
layout_columns(
|
36 |
+
textInput("chat",
|
37 |
+
label = NULL,
|
38 |
+
"Which counties in California have the highest average social vulnerability?",
|
39 |
+
width = "100%"),
|
40 |
+
div(
|
41 |
+
actionButton("user_msg", "", icon = icon("paper-plane"),
|
42 |
+
class = "btn-primary btn-sm align-bottom"),
|
43 |
+
class = "align-text-bottom"),
|
44 |
+
col_widths = c(11, 1)),
|
45 |
+
fill = FALSE
|
46 |
),
|
47 |
+
|
48 |
+
textOutput("agent"),
|
49 |
+
|
50 |
+
|
51 |
layout_columns(
|
52 |
card(maplibreOutput("map")),
|
53 |
card(includeMarkdown("## Plot"),
|
|
|
59 |
max_height = "700px"
|
60 |
),
|
61 |
|
|
|
62 |
gt_output("table"),
|
63 |
|
64 |
card(fill = TRUE,
|
65 |
card_header(fa("robot")),
|
|
|
66 |
accordion(
|
67 |
open = FALSE,
|
68 |
accordion_panel(
|
|
|
76 |
textOutput("explanation"),
|
77 |
)
|
78 |
),
|
|
|
79 |
card(
|
80 |
card_header("Errata"),
|
81 |
+
shiny::markdown(readr::read_file("footer.md")),
|
|
82 |
)
|
|
|
83 |
),
|
84 |
|
85 |
sidebar = sidebar(
|
|
|
86 |
input_switch("redlines", "Redlined Areas", value = FALSE),
|
87 |
input_switch("svi", "Social Vulnerability", value = TRUE),
|
88 |
input_switch("richness", "Biodiversity Richness", value = FALSE),
|
|
|
97 |
theme = bs_theme(version = "5")
|
98 |
)
|
99 |
|
|
|
|
|
|
|
100 |
repo <- "https://data.source.coop/cboettig/social-vulnerability"
|
101 |
+
pmtiles <- glue("{repo}/2022/SVI2022_US_county.pmtiles")
|
|
|
|
|
|
|
|
|
102 |
|
103 |
+
duckdb_s3_config(s3_endpoint = "minio.carlboettiger.info")
|
104 |
con <- duckdbfs::cached_connection()
|
105 |
+
svi <- open_dataset("s3://public-gbif/svi", tblname = "svi") |> filter(RPL_THEMES > 0)
|
106 |
+
schema <- read_file("schema.yml")
|
107 |
+
system_prompt <- glue::glue(readr::read_file("system-prompt.md"),
|
108 |
+
.open = "<", .close = ">")
|
|
109 |
|
110 |
chat <- ellmer::chat_vllm(
|
111 |
base_url = "https://llm.nrp-nautilus.io/",
|
|
|
127 |
list("in", list("get", id_col), list("literal", values))
|
128 |
}
|
129 |
|
|
130 |
|
131 |
|
132 |
# Define the server
|
133 |
server <- function(input, output, session) {
|
134 |
+
|
135 |
+
chart1_data <- svi |>
|
136 |
+
group_by(COUNTY) |>
|
137 |
+
summarise(mean_svi = mean(RPL_THEMES)) |>
|
138 |
+
collect()
|
139 |
+
|
140 |
+
chart1 <- chart1_data |>
|
141 |
+
ggplot(aes(mean_svi)) + geom_density(fill="darkred") +
|
142 |
+
ggtitle("County-level vulnerability nation-wide")
|
143 |
+
|
144 |
data <- reactiveValues(df = tibble())
|
145 |
output$chart1 <- renderPlot(chart1)
|
146 |
|
147 |
observeEvent(input$user_msg, {
|
148 |
stream <- chat$chat(input$chat)
|
149 |
|
150 |
+
|
|
|
151 |
|
152 |
# Parse response
|
153 |
response <- jsonlite::fromJSON(stream)
|
|
|
|
|
154 |
|
155 |
+
if ("query" %in% names(response)) {
|
156 |
+
output$sql_code <- renderText(stringr::str_wrap(response$query, width = 60))
|
157 |
+
output$explanation <- renderText(response$explanation)
|
158 |
+
|
159 |
+
# Actually execute the SQL query generated:
|
160 |
+
df <- DBI::dbGetQuery(con, response$query)
|
161 |
|
162 |
+
# don't display shape column in render
|
163 |
+
df <- df |> select(-any_of("Shape"))
|
164 |
+
output$table <- render_gt(df, height = 300)
|
165 |
|
166 |
|
167 |
+
y_axis <- colnames(df)[!colnames(df) %in% colnames(svi)]
|
168 |
+
chart2 <- df |>
|
169 |
+
rename(social_vulnerability = all_of(y_axis)) |>
|
170 |
+
ggplot(aes(social_vulnerability)) +
|
171 |
+
geom_density(fill = "darkred") +
|
172 |
+
xlim(c(0, 1)) +
|
173 |
+
ggtitle("Vulnerability of selected areas")
|
174 |
|
175 |
+
output$chart2 <- renderPlot(chart2)
|
176 |
|
177 |
+
# We need to somehow trigger this df to update the map.
|
178 |
+
data$df <- df
|
179 |
+
|
180 |
+
# Note: ellmer will preserve full chat history automatically.
|
181 |
+
# this can confuse the agent and mess up behavior, so we reset:
|
182 |
+
chat$set_turns(NULL)
|
183 |
+
|
184 |
+
} else {
|
185 |
+
output$agent <- renderText(response$agent)
|
186 |
+
|
187 |
+
}
|
188 |
|
189 |
})
|
190 |
|
|
|
230 |
id = "svi_layer",
|
231 |
source = list(type = "vector",
|
232 |
url = paste0("pmtiles://", pmtiles)),
|
233 |
+
source_layer = "svi",
|
234 |
filter = filter_column(svi, data$df, "FIPS"),
|
235 |
fill_opacity = 0.5,
|
236 |
fill_color = interpolate(column = "RPL_THEMES",
|
footer.md
ADDED
@@ -0,0 +1,16 @@
|
|
1 |
+
#### Credits
|
2 |
+
|
3 |
+
Developed by Carl Boettiger, UC Berkeley, 2025. BSD License.
|
4 |
+
|
5 |
+
Data from the US Census and CDC's [Social Vulnerability Index](https://www.atsdr.cdc.gov/place-health/php/svi/index.html)
|
6 |
+
|
7 |
+
#### Technical details
|
8 |
+
|
9 |
+
The app is written entirely in R using shiny. The app will translate natural language queries into SQL code using
|
10 |
+
a small open-weights language model. The SQL code is executed using the duckdb backend against a cloud-native
|
11 |
+
geoparquet snapshot of the Social Vulnerability Index hosted on Source Cooperative. Summary chart data are also
|
12 |
+
computed in duckdb by streaming, providing responsive updates while needing minimal RAM or disk storage despite
|
13 |
+
the large size of the data sources.
|
14 |
+
|
15 |
+
The map is rendered and updated using MapLibre with PMTiles, which provides responsive rendering for large feature sets.
|
16 |
+
The PMTiles layer is also hosted on Source Cooperative where it can be streamed efficiently.
|
preprocess.md
ADDED
@@ -0,0 +1,27 @@
|
|
1 |
+
---
|
2 |
+
---
|
3 |
+
|
4 |
+
|
5 |
+
# Vector Layers
|
6 |
+
|
7 |
+
The heart of this application design is a vector dataset serialized as both (Geo)Parquet and PMTiles.
|
8 |
+
The parquet version allows for real-time calculations through rapid SQL queries via duckdb,
|
9 |
+
and the PMTiles version allows the data to be quickly visualized at any zoom through maplibre.
|
10 |
+
maplibre can also efficiently filter the PMTiles data given the feature ids returned by duckdb.
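The filter handed to maplibre is an ordinary nested list. A small sketch of the expression the app builds, using made-up FIPS codes as placeholders for the ids a duckdb query would return; `make_id_filter` is a hypothetical helper name, not a function from this repository.

```r
# Build a MapLibre "in" filter expression from a vector of feature ids.
make_id_filter <- function(ids, id_col = "FIPS") {
  # With no ids there is nothing to filter on.
  if (length(ids) < 1) return(NULL)
  list("in", list("get", id_col), list("literal", ids))
}

ids <- c("06001", "06013", "06075")  # placeholder ids for illustration
flt <- make_id_filter(ids)
str(flt)
```

Only the ids cross from duckdb to the map, so the geometries themselves never need to be pulled into R.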
|
11 |
+
|
12 |
+
GDAL (`ogr2ogr`) can generate both PMTiles and geoparquet, though `tippecanoe` provides more
|
13 |
+
options for PMTiles generation and can produce nicer tile sets.
|
14 |
+
|
15 |
+
The demo uses the CDC Social Vulnerability data because it is built on the hierarchical partitioning
|
16 |
+
used by the Census (Country->State->County->Tract).
|
17 |
+
|
18 |
+
# Raster Layers
|
19 |
+
|
20 |
+
## Generating static tiles
|
21 |
+
|
22 |
+
## Zonal statistics calculations
|
23 |
+
|
24 |
+
The application is essentially driven by the vector layer data using SQL.
|
25 |
+
I find it helpful to pre-process 'zonal' calculations, e.g. the mean value of each raster layer
|
26 |
+
within each feature in the 'focal' vector data set(s).
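Conceptually, a zonal mean is a group-by over raster cells labelled with the feature they fall in. A toy base-R sketch with made-up values follows; the `pixels` table and its numbers are assumptions for illustration, and extracting cell values from a real raster would need a raster library, which is not shown here.

```r
# Toy pixel table: each row is one raster cell, tagged with the id of the
# vector feature (zone) it falls in. All values are made up.
pixels <- data.frame(
  FIPS  = c("06001", "06001", "06013", "06013", "06013"),
  value = c(10, 20, 1, 2, 3)
)

# Zonal mean: one row per feature, ready to join onto the vector table.
zonal <- aggregate(value ~ FIPS, data = pixels, FUN = mean)
names(zonal)[2] <- "mean_value"
print(zonal)
```

Storing the result as an extra column on the vector table lets later SQL queries use the raster summary without touching the raster again.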
|
27 |
+
|
schema.yml
ADDED
@@ -0,0 +1,41 @@
|
|
1 |
+
- VARIABLE_NAME: ST
|
2 |
+
DESCRIPTION: State-level FIPS code (two-digit integer)
|
3 |
+
- VARIABLE_NAME: STATE
|
4 |
+
DESCRIPTION: State name
|
5 |
+
- VARIABLE_NAME: ST_ABBR
|
6 |
+
DESCRIPTION: State abbreviation
|
7 |
+
- VARIABLE_NAME: STCNTY
|
8 |
+
DESCRIPTION: County-level FIPS code (5 digit integer)
|
9 |
+
- VARIABLE_NAME: COUNTY
|
10 |
+
DESCRIPTION: County name
|
11 |
+
- VARIABLE_NAME: FIPS
|
12 |
+
DESCRIPTION: Tract-level geographic identification (full Census Bureau FIPS code)
|
13 |
+
- VARIABLE_NAME: LOCATION
|
14 |
+
DESCRIPTION: Text description of tract county state
|
15 |
+
- VARIABLE_NAME: AREA_SQMI
|
16 |
+
DESCRIPTION: Tract area in square miles
|
17 |
+
- VARIABLE_NAME: RPL_THEMES
|
18 |
+
DESCRIPTION: Overall social vulnerability. Should always be used unless explicit sub-theme is called for.
|
19 |
+
- VARIABLE_NAME: RPL_THEME1
|
20 |
+
DESCRIPTION: Subtheme for socio-economic status social vulnerability score
|
21 |
+
- VARIABLE_NAME: RPL_THEME2
|
22 |
+
DESCRIPTION: Subtheme for Household characteristics vulnerability score
|
23 |
+
- VARIABLE_NAME: RPL_THEME3
|
24 |
+
DESCRIPTION: Subtheme for Racial and Ethnic Minority status based vulnerability score
|
25 |
+
- VARIABLE_NAME: RPL_THEME4
|
26 |
+
DESCRIPTION: Subtheme for Housing and transportation-based vulnerability score
|
27 |
+
- VARIABLE_NAME: kingdom
|
28 |
+
DESCRIPTION: phylogenetic kingdom
|
29 |
+
- VARIABLE_NAME: phylum
|
30 |
+
DESCRIPTION: phylogenetic phylum
|
31 |
+
- VARIABLE_NAME: class
|
32 |
+
DESCRIPTION: phylogenetic class
|
33 |
+
- VARIABLE_NAME: order
|
34 |
+
DESCRIPTION: phylogenetic order
|
35 |
+
- VARIABLE_NAME: family
|
36 |
+
DESCRIPTION: phylogenetic family
|
37 |
+
- VARIABLE_NAME: genus
|
38 |
+
DESCRIPTION: phylogenetic genus
|
39 |
+
- VARIABLE_NAME: species
|
40 |
+
DESCRIPTION: phylogenetic genus and species name (scientific name)
|
41 |
+
|
system-prompt.md
ADDED
@@ -0,0 +1,34 @@
|
|
1 |
+
|
2 |
+
You are a helpful agent who always replies strictly in JSON-formatted text.
|
3 |
+
Your task is to translate the user's questions about the data into a SQL query
|
4 |
+
that will be run against the "biodiversity_occurrences" table in a duckdb database.
|
5 |
+
The duckdb database has a spatial extension which understands PostGIS operations as well.
|
6 |
+
|
7 |
+
If your answer involves the construction of a SQL query, you must format your answer as follows:
|
8 |
+
|
9 |
+
{
|
10 |
+
"query": "your raw SQL response goes here",
|
11 |
+
"explanation": "your explanation of the query"
|
12 |
+
}
|
13 |
+
|
14 |
+
If your answer does not involve a SQL query, please reply with the following format instead:
|
15 |
+
|
16 |
+
{
|
17 |
+
"user": "user question goes here",
|
18 |
+
"agent": "your response goes here"
|
19 |
+
}
|
20 |
+
|
21 |
+
If you are asked to describe the data or for information about the data schema, give only a human-readable response with SQL.
|
22 |
+
|
23 |
+
In the data, each row represents an individual occurrence of a species. The occurrences
|
24 |
+
are geocoded to US Census counties, with the STATE, COUNTY, and FIPS columns indicating
|
25 |
+
the corresponding state name, county name, and FIPS identifier for the specific County.
|
26 |
+
The FIPS column is a 5-digit number that uniquely identifies a county in a state.
|
27 |
+
Taxonomic classification of the species is given in the corresponding columns, kingdom,
|
28 |
+
phylum, class, order, family, genus, and species.
|
29 |
+
|
30 |
+
The data also includes information about various measures of social vulnerability (RPL_THEMES).
|
31 |
+
Pay attention to the DESCRIPTION of each of the columns (VARIABLE_NAME) from the metadata table:
|
32 |
+
<schema>
|
33 |
+
|
34 |
+
|
test.R
ADDED
@@ -0,0 +1,82 @@
|
|
1 |
+
## Illustrate/test core app functionality without shiny
|
2 |
+
|
3 |
+
library(tidyverse)
|
4 |
+
library(duckdbfs)
|
5 |
+
library(mapgl)
|
6 |
+
library(ellmer)
|
7 |
+
library(glue)
|
8 |
+
|
9 |
+
repo <- "https://data.source.coop/cboettig/social-vulnerability"
|
10 |
+
pmtiles <- glue("{repo}/svi2020_us_tract.pmtiles")
|
11 |
+
duckdb_s3_config(s3_endpoint = "minio.carlboettiger.info")
|
12 |
+
svi <-
|
13 |
+
open_dataset("s3://public-gbif/svi", tblname = "biodiversity_occurrences") |>
|
14 |
+
filter(RPL_THEMES > 0)
|
15 |
+
schema <- read_file("schema.yml")
|
16 |
+
system_prompt <- glue::glue(readr::read_file("system-prompt.md"),
|
17 |
+
.open = "<", .close = ">")
|
18 |
+
|
19 |
+
|
20 |
+
|
21 |
+
|
22 |
+
# Or optionally test with cirrus
|
23 |
+
chat <- ellmer::chat_vllm(
|
24 |
+
base_url = "https://llm.cirrus.carlboettiger.info/v1/",
|
25 |
+
model = "kosbu/Llama-3.3-70B-Instruct-AWQ",
|
26 |
+
api_key = Sys.getenv("CIRRUS_LLM_KEY"),
|
27 |
+
system_prompt = system_prompt,
|
28 |
+
api_args = list(temperature = 0)
|
29 |
+
)
|
30 |
+
|
31 |
+
# or use the NRP model
|
32 |
+
chat <- ellmer::chat_vllm(
|
33 |
+
base_url = "https://llm.nrp-nautilus.io/",
|
34 |
+
model = "llama3",
|
35 |
+
api_key = Sys.getenv("NRP_API_KEY"),
|
36 |
+
system_prompt = system_prompt,
|
37 |
+
api_args = list(temperature = 0)
|
38 |
+
)
|
39 |
+
|
40 |
+
cols <- colnames(svi)
|
41 |
+
rpls <- grep("RPL_THEME.+", cols)
|
42 |
+
keep <- cols[c(1:9, rpls, 161:226)]
|
43 |
+
biodiversity <- svi |> select(all_of(keep))
|
44 |
+
|
45 |
+
# Test a chat-based response
|
46 |
+
chat$chat("Which columns describe racial components of social vulnerability?")
|
47 |
+
chat$set_turns(NULL)
|
48 |
+
## A query-based response
|
49 |
+
stream <- chat$chat("Which counties have the most bird observations?")
|
50 |
+
stream <- chat$chat("Give me the number of bird observations per county vs county social vulnerability")
|
51 |
+
response <- jsonlite::fromJSON(stream)
|
52 |
+
|
53 |
+
|
54 |
+
stream2 <- chat$chat("Great, now give me the ggplot2 code to plot the data.frame you returned as those counts vs social vulnerability as points. Be sure to place the R code for your reply by itself in a 'code' element of the JSON")
|
55 |
+
|
56 |
+
response <- jsonlite::fromJSON(stream2)
|
57 |
+
|
58 |
+
con <- duckdbfs::cached_connection()
|
59 |
+
filtered_data <- DBI::dbGetQuery(con, response$query)
|
60 |
+
|
61 |
+
filter_column <- function(full_data, filtered_data, id_col) {
|
62 |
+
if (nrow(filtered_data) < 1) return(NULL)
|
63 |
+
values <- full_data |>
|
64 |
+
inner_join(filtered_data, copy = TRUE) |>
|
65 |
+
pull(id_col)
|
66 |
+
# maplibre syntax for the filter of PMTiles
|
67 |
+
list("in", list("get", id_col), list("literal", values))
|
68 |
+
}
|
69 |
+
|
70 |
+
maplibre(center = c(-102.9, 41.3), zoom = 3) |>
|
71 |
+
add_fill_layer(
|
72 |
+
id = "svi_layer",
|
73 |
+
source = list(type = "vector", url = paste0("pmtiles://", pmtiles)),
|
74 |
+
source_layer = "SVI2000_US_tract",
|
75 |
+
filter = filter_column(svi, filtered_data, "FIPS"),
|
76 |
+
fill_opacity = 0.5,
|
77 |
+
fill_color = interpolate(column = "RPL_THEMES",
|
78 |
+
values = c(0, 1),
|
79 |
+
stops = c("#e19292c0", "darkblue"),
|
80 |
+
na_color = "lightgrey")
|
81 |
+
)
|
82 |
+
|
utils.R
ADDED
@@ -0,0 +1,4 @@
|
|
1 |
+
|
2 |
+
library(tidyverse)
|
3 |
+
library(duckdbfs)
|
4 |
+
|