| LLM: gpt | |
| instructions: '1. Refactor the unstructured OCR text into a dictionary based on the | |
| JSON structure outlined below. | |
| 2. You should map the unstructured OCR text to the appropriate JSON key and then | |
| populate the field based on its rules. | |
| 3. Some JSON key fields are permitted to remain empty if the corresponding information | |
| is not found in the unstructured OCR text. | |
| 4. Ignore any information in the OCR text that doesn''t fit into the defined JSON | |
| structure. | |
| 5. Duplicate dictionary fields are not allowed. | |
| 6. Ensure that all JSON keys are in lowercase. | |
| 7. Ensure that new JSON field values follow sentence case capitalization. | |
| 8. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format | |
| and data types specified in the template. | |
| 9. Ensure the output JSON string is valid JSON format. It should not have trailing | |
| commas or unquoted keys. | |
| 10. Only return a JSON dictionary represented as a string. You should not explain | |
| your answer.' | |
| json_formatting_instructions: "The next section of instructions outlines how to format\ | |
| \ the JSON dictionary. The keys are the same as those of the final formatted JSON\ | |
| \ object.\nFor each key there is a format requirement that specifies how to transcribe\ | |
| \ the information for that key. \nThe possible formatting options are:\n1. \"verbatim\ | |
| \ transcription\" - field is populated with verbatim text from the unformatted OCR.\n\ | |
| 2. \"spell check transcription\" - field is populated with spelling corrected text\ | |
| \ from the unformatted OCR.\n3. \"boolean yes no\" - field is populated with only\ | |
| \ yes or no.\n4. \"boolean 1 0\" - field is populated with only 1 or 0.\n5. \"integer\"\ | |
| \ - field is populated with only an integer.\n6. \"[list]\" - field is populated\ | |
| \ from one of the values in the list.\n7. \"yyyy-mm-dd\" - field is populated with\ | |
| \ a date in the format year-month-day.\nThe desired null value is also given. Populate\ | |
| \ the field with the null value if the information for that key is not present in\ | |
| \ the unformatted OCR text." | |
| mapping: | |
| COLLECTING: | |
| - collectors | |
| - collector_number | |
| - determined_by | |
| - multiple_names | |
| - verbatim_date | |
| - date | |
| - end_date | |
| GEOGRAPHY: | |
| - country | |
| - state | |
| - county | |
| - min_elevation | |
| - max_elevation | |
| - elevation_units | |
| LOCALITY: | |
| - locality_name | |
| - verbatim_coordinates | |
| - decimal_coordinates | |
| - datum | |
| - plant_description | |
| - cultivated | |
| - habitat | |
| MISCELLANEOUS: [] | |
| TAXONOMY: | |
| - catalog_number | |
| - genus | |
| - species | |
| - subspecies | |
| - variety | |
| - forma | |
| rules: | |
| Dictionary: | |
| catalog_number: | |
| description: The barcode identifier, typically a number with at least 6 digits, | |
| but fewer than 30 digits. | |
| format: verbatim transcription | |
| null_value: '' | |
| collector_number: | |
| description: Unique identifier or number that denotes the specific collecting | |
| event and is associated with the collector. | |
| format: verbatim transcription | |
| null_value: s.n. | |
| collectors: | |
| description: Full name(s) of the individual(s) responsible for collecting the | |
| specimen. When multiple collectors are involved, their names should be separated | |
| by commas. | |
| format: verbatim transcription | |
| null_value: not present | |
| country: | |
| description: Country that corresponds to the current geographic location of | |
| collection. Capitalize first letter of each word. If an abbreviation is given, | |
| populate the field with the full spelling of the country's name. | |
| format: spell check transcription | |
| null_value: '' | |
| county: | |
| description: Administrative division 2 that corresponds to the current geographic | |
| location of collection; capitalize first letter of each word. Administrative | |
| division 2 is equivalent to a U.S. county, parish, borough. | |
| format: spell check transcription | |
| null_value: '' | |
| cultivated: | |
| description: Cultivated plants are intentionally grown by humans. In text descriptions, | |
| look for planting dates, garden locations, cultivar names, or words such as | |
| ornamental, garden, or farm that indicate a cultivated plant. The value 1 indicates | |
| that the specimen was cultivated; the value 0 indicates otherwise. | |
| format: boolean 1 0 | |
| null_value: '0' | |
| date: | |
| description: 'Date the specimen was collected formatted as year-month-day. If | |
| specific components of the date are unknown, they should be replaced with | |
| zeros. Examples: ''0000-00-00'' if the entire date is unknown, ''YYYY-00-00'' | |
| if only the year is known, and ''YYYY-MM-00'' if year and month are known | |
| but day is not.' | |
| format: yyyy-mm-dd | |
| null_value: '' | |
| datum: | |
| description: Datum of location coordinates. Possible values are included in the | |
| format list. Leave field blank if unclear. [WGS84, WGS72, WGS66, WGS60, NAD83, | |
| NAD27, OSGB36, ETRS89, ED50, GDA94, JGD2011, Tokyo97, KGD2002, TWD67, TWD97, | |
| BJS54, XAS80, GCJ-02, BD-09, PZ-90.11, GTRF, CGCS2000, ITRF88, ITRF89, ITRF90, | |
| ITRF91, ITRF92, ITRF93, ITRF94, ITRF96, ITRF97, ITRF2000, ITRF2005, ITRF2008, | |
| ITRF2014, Hong Kong Principal Datum, SAD69] | |
| format: '[list]' | |
| null_value: '' | |
| decimal_coordinates: | |
| description: Correct and convert the verbatim location coordinates to conform | |
| with the decimal degrees GPS coordinate format. | |
| format: spell check transcription | |
| null_value: '' | |
| determined_by: | |
| description: Full name of the individual responsible for determining the taxonomic | |
| name of the specimen. Sometimes the name will be near to the characters 'det' | |
| to denote determination. This name may be isolated from other names in the | |
| unformatted OCR text. | |
| format: verbatim transcription | |
| null_value: '' | |
| elevation_units: | |
| description: 'Elevation units must be meters. If min_elevation field is populated, | |
| then elevation_units: ''m''. Otherwise elevation_units: ''''.' | |
| format: spell check transcription | |
| null_value: '' | |
| end_date: | |
| description: 'If a date range is provided, this represents the later or ending | |
| date of the collection period, formatted as year-month-day. If specific components | |
| of the date are unknown, they should be replaced with zeros. Examples: ''0000-00-00'' | |
| if the entire end date is unknown, ''YYYY-00-00'' if only the year of the | |
| end date is known, and ''YYYY-MM-00'' if year and month of the end date are | |
| known but the day is not.' | |
| format: yyyy-mm-dd | |
| null_value: '' | |
| forma: | |
| description: Taxonomic determination to form (f.). | |
| format: verbatim transcription | |
| null_value: '' | |
| genus: | |
| description: Taxonomic determination to genus. Genus must be capitalized. If | |
| genus is not present use the taxonomic family name followed by the word 'indet'. | |
| format: verbatim transcription | |
| null_value: '' | |
| habitat: | |
| description: Description of a plant's habitat or the location where the specimen | |
| was collected. Ignore descriptions of the plant itself. | |
| format: verbatim transcription | |
| null_value: '' | |
| locality_name: | |
| description: Description of geographic location, landscape, landmarks, regional | |
| features, nearby places, or any contextual information aiding in pinpointing | |
| the exact origin or site of the specimen. | |
| format: verbatim transcription | |
| null_value: '' | |
| max_elevation: | |
| description: Maximum elevation or altitude in meters. If only one elevation | |
| is present, then max_elevation should be set to the null_value. Convert from | |
| feet ('ft' or 'ft.' or 'feet') to meters ('m' or 'm.' or 'meters') only if the | |
| units are explicit. Round to integer. | |
| format: integer | |
| null_value: '' | |
| min_elevation: | |
| description: Minimum elevation or altitude in meters. Convert from feet ('ft' | |
| or 'ft.' or 'feet') to meters ('m' or 'm.' or 'meters') only if the units are | |
| explicit. Round to integer. | |
| format: integer | |
| null_value: '' | |
| multiple_names: | |
| description: Indicate whether multiple people or collector names are present | |
| in the unformatted OCR text. If you see more than one person's name the value | |
| is 'yes'; otherwise the value is 'no'. | |
| format: boolean yes no | |
| null_value: '' | |
| plant_description: | |
| description: Description of plant features such as leaf shape, size, color, | |
| stem texture, height, flower structure, scent, fruit or seed characteristics, | |
| root system type, overall growth habit and form, any notable aroma or secretions, | |
| presence of hairs or bristles, and any other distinguishing morphological | |
| or physiological characteristics. | |
| format: verbatim transcription | |
| null_value: '' | |
| species: | |
| description: Taxonomic determination to species. Do not capitalize species. | |
| format: verbatim transcription | |
| null_value: '' | |
| state: | |
| description: Administrative division 1 that corresponds to the current geographic | |
| location of collection. Capitalize first letter of each word. Administrative | |
| division 1 is equivalent to a U.S. State. | |
| format: spell check transcription | |
| null_value: '' | |
| subspecies: | |
| description: Taxonomic determination to subspecies (subsp.). | |
| format: verbatim transcription | |
| null_value: '' | |
| variety: | |
| description: Taxonomic determination to variety (var.). | |
| format: verbatim transcription | |
| null_value: '' | |
| verbatim_coordinates: | |
| description: Verbatim location coordinates as they appear on the label. Do not | |
| convert formats. Possible coordinate types are one of [Lat, Long, UTM, TRS]. | |
| format: verbatim transcription | |
| null_value: '' | |
| verbatim_date: | |
| description: Date of collection exactly as it appears on the label. Do not change | |
| the format or correct typos. | |
| format: verbatim transcription | |
| null_value: s.d. | |
| SpeciesName: | |
| taxonomy: | |
| - Genus_species | |
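
The date and elevation rules in this configuration can be checked after the model returns its JSON. The following is a minimal sketch of such a check, not part of the configuration itself. The function names (`is_valid_date`, `feet_to_meters`, `elevation_units`) are hypothetical, and it assumes the output has already been parsed into Python values.

```python
import re

# yyyy-mm-dd, where unknown components are written as zeros (e.g. 1998-07-00).
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(value: str) -> bool:
    """Accept the null value '' or a zero-padded yyyy-mm-dd string."""
    if value == "":
        return True
    if not DATE_RE.fullmatch(value):
        return False
    _year, month, day = (int(part) for part in value.split("-"))
    # 00 is allowed for an unknown month or day.
    return 0 <= month <= 12 and 0 <= day <= 31


def feet_to_meters(feet: float) -> int:
    """Convert feet to meters and round to an integer, as the rules require."""
    return round(feet * 0.3048)


def elevation_units(min_elevation) -> str:
    """Return 'm' when min_elevation is populated, otherwise the null value ''."""
    return "m" if min_elevation not in ("", None) else ""
```

For example, `is_valid_date("1998-07-00")` accepts a date with an unknown day, while `is_valid_date("1998-13-01")` rejects an out-of-range month. This sketch checks only the shape of the date and does not verify that the day exists in the given month.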