Spaces: jpwahle/field-time-diversity


jpwahle committed on Jun 15

Commit 3c7ff65 (1 parent: 0bf2b65)

Should work again with 429 handling and updated key

Files changed (5)

README.md +6 -4
data/nlp_papers_citation_age.txt +0 -0
main.py +60 -48
plots.py +9 -10
s2.py +87 -42

README.md CHANGED

@@ -1,10 +1,12 @@
 ---
-title: Field Diversity
-emoji: 🚀
 colorFrom: pink
-colorTo: red
 sdk: docker
-pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Citation Field x Age
+emoji: 🌈
 colorFrom: pink
+colorTo: indigo
 sdk: docker
+app_file: app.py
+pinned: true
+app_port: 7860
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

data/nlp_papers_citation_age.txt ADDED

The diff for this file is too large to render.

main.py CHANGED

@@ -9,10 +9,15 @@ import gradio as gr
 from aclanthology import determine_page_type
 from plots import generate_cfdi_plot, generate_maoc_plot
-from s2 import (check_s2_id_type, compute_stats_for_acl_author,
-                compute_stats_for_acl_paper, compute_stats_for_acl_venue,
-                compute_stats_for_pdf, compute_stats_for_s2_author,
-                compute_stats_for_s2_paper)
 def return_clear():
@@ -21,7 +26,7 @@ def return_clear():
     Returns:
         None
     """
-    return None, None, None, None, None, None, None, None
 def create_compute_stats(submit_type=None):
@@ -92,7 +97,7 @@ def plot_and_return_stats(
     plot_cfdi = generate_cfdi_plot(cfdi, compute_type)
     # Generate cadi plot
-    # plot_maoc = generate_maoc_plot(maoc, compute_type)
     # Get top 3 most cited fields
     top_fields_text = "\n".join(
@@ -103,72 +108,82 @@ def plot_and_return_stats(
             )[:3]
         ]
     )
-    cfdi = round(cfdi, 3)
     # Get most common oldest papers
-    # oldest_paper_text = "".join(
-    #     f"[{str(year)}] {title}" + "\n"
-    #     for year, title in sorted(year_title_dict.items())[:3]
-    # )
     return (
         title_authors,
         num_references,
         top_fields_text,
-        # oldest_paper_text,
         cfdi,
-        # cadi,
         plot_cfdi,
-        # plot_maoc,
     )
-with gr.Blocks(
-    theme=gr.themes.Soft()
-) as demo:
     with gr.Row():
         gr.Markdown(
             """
-            # Citation Field Diversity Calculator
-            Welcome to this interactive demo to analyze the field diversity aspect of your citational practice. This tool will enable you to reflect on a critical aspect:
             - By whom am I influenced? Which fields heavily inform and shape the research trajectory of my works?
-            In addition, you will be able to analyze how the above compares to the average paper or author. The results you will receive cannot be categorized into “good” or “bad”. Instead, they are meant to raise self-awareness about one’s citational diversity and reflect on it. The results might bring you to further questions, such as:
-            - Am I reading widely across fields?
             - Should I expand my literature search to include works from other fields?
-            Using citations as a tangible marker of influence, our demo provides empirical insights into the influence of papers across fields.
             ## What is Citation Field Diversity?
-            Field diversity is a measure of the variety of research fields that a paper or an author draws upon. A high field diversity indicates that the work draws from various distinct research fields, demonstrating a multidisciplinary influence on that work or author.
-            ## What is the Citation Field Diversity Index (CFDI) and how is it calculated?
-            The calculation of Field Diversity involves extracting all the references of a paper, categorizing them into distinct study fields, and determining the proportion of each study field over all the references. The Citation Field Diversity Index (CFDI) is then computed by applying the Gini Index on these proportions.
-            For more details, please refer to Eq. 3 in [this paper](https://aclanthology.org/2023.acl-long.341/).
             """
         )
         gr.Markdown(
             """
-            ## How do I Interpret CFDI?
-            Higher values of CFDI indicate a greater diversity of a paper in terms of the fields it cites, signifying a multidisciplinary influence. On the other hand, lower values signify a lower diversity, indicating that citations are more concentrated in specific fields.
             ## How can I use this demo?
-            There are three ways for you to compute the field diversity for papers:
             1. **Semantic Scholar ID**: Enter the Semantic Scholar ID of a **paper** or **author** and click the *"Compute"* button.
             2. **ACL Anthology Link**: Paste the ACL Anthology link of a **paper**, **venue**, or **author** and click the *"Compute"* button.
             3. **PDF File**: Upload your **paper** PDF and click the *"Compute"* button.
-            To retrieve the **Semantic Scholar ID** for a paper such as "The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research," search the paper on Semantic Scholar [here](https://www.semanticscholar.org/paper/The-Elephant-in-the-Room%3A-Analyzing-the-Presence-of-Abdalla-Wahle/587ffdfd7229e8e0dbc5250b44df5fad6251f6ad) and use the last part of the URL. The Semantic Scholar ID (SSID) for this paper is: **587ffdfd7229e8e0dbc5250b44df5fad6251f6ad**.
             To get an ACL Anthology link, you can go to any ACL Anthology paper, author or proceedings page and just copy and paste the url. For example:
             - https://aclanthology.org/2023.acl-long.1/
@@ -197,9 +212,7 @@ with gr.Blocks(
                 with gr.Row():
                     acl_submit_btn = gr.Button("Compute")
             with gr.TabItem("PDF File"):
-                pdf_file = gr.File(
-                    file_types=[".pdf"], label="Upload your paper PDF"
-                )
                 with gr.Row():
                     file_submit_btn = gr.Button("Compute")
     with gr.Row():
@@ -209,13 +222,13 @@ with gr.Blocks(
     with gr.Row():
         num_ref = gr.Textbox(label="Number of references", lines=3)
         top_field_list = gr.Textbox(label="Top 3 fields cited:", lines=3)
-        # top_age_list = gr.Textbox(label="Top 3 oldest papers cited:", lines=3)
     with gr.Row():
         cfdi = gr.Textbox(label="CFDI")
-        # cadi = gr.Textbox(label="CADI")
     with gr.Row():
         cfdi_plot = gr.Plot(label="Citation Field Diversity")
-        # cadi_plot = gr.Plot(label="Citation Age Diversity")
     with gr.Row():
         clear_btn = gr.Button("Clear")
@@ -225,11 +238,11 @@ with gr.Blocks(
             title,
             num_ref,
             top_field_list,
-            # top_age_list,
             cfdi,
-            # cadi,
             cfdi_plot,
-            # cadi_plot,
         ],
     )
@@ -256,16 +269,15 @@ with gr.Blocks(
             title,
             num_ref,
             top_field_list,
-            # top_age_list,
             cfdi,
-            # cadi,
             cfdi_plot,
-            # cadi_plot,
             s2_id,
             acl_link,
             pdf_file,
         ],
     )
-demo.queue(concurrency_count=3)
-demo.launch(server_port=7860, server_name="0.0.0.0")

 from aclanthology import determine_page_type
 from plots import generate_cfdi_plot, generate_maoc_plot
+from s2 import (
+    check_s2_id_type,
+    compute_stats_for_acl_author,
+    compute_stats_for_acl_paper,
+    compute_stats_for_acl_venue,
+    compute_stats_for_pdf,
+    compute_stats_for_s2_author,
+    compute_stats_for_s2_paper,
+)
 def return_clear():
     Returns:
         None
     """
+    return None, None, None, None, None, None, None, None, None, None, None
 def create_compute_stats(submit_type=None):
     plot_cfdi = generate_cfdi_plot(cfdi, compute_type)
     # Generate cadi plot
+    plot_maoc = generate_maoc_plot(maoc, compute_type)
     # Get top 3 most cited fields
     top_fields_text = "\n".join(
             )[:3]
         ]
     )
     # Get most common oldest papers
+    oldest_paper_text = "".join(
+        f"[{str(year)}] {title}" + "\n"
+        for year, title in sorted(year_title_dict.items())[:3]
+    )
+    # Round CFDI and CADI
+    cfdi = round(cfdi, 3)
+    cadi = round(cadi, 3)
     return (
         title_authors,
         num_references,
         top_fields_text,
+        oldest_paper_text,
         cfdi,
+        cadi,
         plot_cfdi,
+        plot_maoc,
     )
+with gr.Blocks(theme=gr.themes.Soft()) as demo:
     with gr.Row():
         gr.Markdown(
             """
+            # Citation Age and Field Diversity Calculator
+            <div align="center">
+                <img src="https://onedrive.live.com/embed?resid=684CB5200DB6B388%21682618&authkey=%21AILbTZikzXAbAyc&width=1310&height=728" />
+            </div>
+            Welcome to this interactive demo to analyze various aspects of your citational diversity. This tool will enable you to reflect on two critical aspects:
             - By whom am I influenced? Which fields heavily inform and shape the research trajectory of my works?
+            - How far back in time do I cite? What are critical works (present and past) that shape my research?
+            In addition, you will be able to analyze how the above compares to the average paper or author. The results you will receive cannot be categorized into “good” or “bad”. Instead, they are meant to raise self-awareness about one’s citational diversity and reflect on it. The results might bring you to further questions, such as:
+            - Am I reading widely across fields and time?
             - Should I expand my literature search to include works from other fields?
+            - Are there ideas rooted in the past that can be used in an innovative way?
+            Using citations as a tangible marker of influence, our demo provides empirical insights into the influence of papers across fields and time.
             ## What is Citation Field Diversity?
+            Field diversity is a measure of the variety of research fields that a paper or an author draws upon. A high field diversity indicates that the work draws from various distinct research fields, demonstrating a multidisciplinary influence on that work or author.
+            ## What is Citation Age Diversity?
+            Citation age is a measure of how far back in time a paper cites other papers. A high citation age shows that the work draws from past works, while a low citation age indicates that mostly recent work has influenced that paper.
             """
         )
         gr.Markdown(
             """
+            ## What are the Citation Field Diversity Index (CFDI) and Citation Age Diversity Index (CADI) and how are they calculated?
+            The calculation of Field Diversity involves extracting all the references of a paper, categorizing them into distinct study fields, and determining the proportion of each study field over all the references. The Citation Field Diversity Index (CFDI) is then computed by applying the Gini Index on these proportions.
+            Calculating CADI is similar to CFDI but instead of determining the proportion of each study field, we determine the proportion of citation ages. If we take a paper from 2020 that cites two papers, one from 2010 and one from 1990, the citation ages are 10 and 30, respectively. The CADI is then computed by applying the Gini Index on these ages.
+            For more details, please refer to Eq. 3 in [this paper](https://aclanthology.org/2023.acl-long.341/) and Eq. 4 in [this paper](https://arxiv.org/).
+            ## How do I Interpret CFDI and CADI?
+            For both indices, higher values indicate a greater diversity of an NLP paper (in terms of how far back it cites and in the fields it cites). On the other hand, lower values signify a lower diversity, indicating that citations are more concentrated in specific fields and time ranges.
             ## How can I use this demo?
+            There are three ways to compute the field and age diversity for papers:
             1. **Semantic Scholar ID**: Enter the Semantic Scholar ID of a **paper** or **author** and click the *"Compute"* button.
             2. **ACL Anthology Link**: Paste the ACL Anthology link of a **paper**, **venue**, or **author** and click the *"Compute"* button.
             3. **PDF File**: Upload your **paper** PDF and click the *"Compute"* button.
+            To retrieve the **Semantic Scholar ID** for a paper such as "The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research," search the paper on Semantic Scholar [here](https://www.semanticscholar.org/paper/The-Elephant-in-the-Room%3A-Analyzing-the-Presence-of-Abdalla-Wahle/587ffdfd7229e8e0dbc5250b44df5fad6251f6ad) and use the last part of the URL. The Semantic Scholar ID (SSID) for this paper is: **587ffdfd7229e8e0dbc5250b44df5fad6251f6ad**.
             To get an ACL Anthology link, you can go to any ACL Anthology paper, author or proceedings page and just copy and paste the url. For example:
             - https://aclanthology.org/2023.acl-long.1/
                 with gr.Row():
                     acl_submit_btn = gr.Button("Compute")
             with gr.TabItem("PDF File"):
+                pdf_file = gr.File(file_types=[".pdf"], label="Upload your paper PDF")
                 with gr.Row():
                     file_submit_btn = gr.Button("Compute")
     with gr.Row():
     with gr.Row():
         num_ref = gr.Textbox(label="Number of references", lines=3)
         top_field_list = gr.Textbox(label="Top 3 fields cited:", lines=3)
+        top_age_list = gr.Textbox(label="Top 3 oldest papers cited:", lines=3)
     with gr.Row():
         cfdi = gr.Textbox(label="CFDI")
+        cadi = gr.Textbox(label="CADI")
     with gr.Row():
         cfdi_plot = gr.Plot(label="Citation Field Diversity")
+        cadi_plot = gr.Plot(label="Citation Age Diversity")
     with gr.Row():
         clear_btn = gr.Button("Clear")
             title,
             num_ref,
             top_field_list,
+            top_age_list,
             cfdi,
+            cadi,
             cfdi_plot,
+            cadi_plot,
         ],
     )
             title,
             num_ref,
             top_field_list,
+            top_age_list,
             cfdi,
+            cadi,
             cfdi_plot,
+            cadi_plot,
             s2_id,
             acl_link,
             pdf_file,
         ],
     )
+demo.launch(server_port=7860, server_name="127.0.0.1")
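
The Markdown added in main.py describes CFDI as a Gini index over field proportions and CADI as a Gini index over citation ages. The diff imports `calculate_gini` and `calculate_gini_simpson` from `metrics`, but their bodies are not shown here. The sketch below therefore assumes the standard textbook definitions (Gini-Simpson for categorical proportions, Gini coefficient for numeric ages). The function names `gini_simpson` and `gini_coefficient` are hypothetical and may not match what `metrics.py` does.

```python
from collections import Counter


def gini_simpson(labels):
    """Gini-Simpson index: 1 - sum(p_i^2) over category proportions.

    Assumed form for a field-diversity score; not taken from metrics.py.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return None
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def gini_coefficient(values):
    """Gini coefficient: mean absolute difference divided by twice the mean.

    Assumed form for an age-diversity score; not taken from metrics.py.
    """
    n = len(values)
    if n == 0:
        return None
    mean = sum(values) / n
    if mean == 0:
        return None
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * n * mean)


# Example from the demo text: a 2020 paper citing works from 2010 and 1990.
citation_ages = [2020 - 2010, 2020 - 1990]  # ages 10 and 30
age_diversity = gini_coefficient(citation_ages)

# Two fields cited equally often.
field_diversity = gini_simpson(["Computer Science", "Computer Science",
                                "Linguistics", "Linguistics"])
```

Under these assumed definitions, the two-reference example from the text gives an age diversity of 0.25, and two equally cited fields give a field diversity of 0.5.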

plots.py CHANGED

@@ -22,16 +22,15 @@ mean_cfdi = papers_df["incoming_diversity"].mean()
 # Compute the mean CADI
 mean_citation_ages = []
-# Commenting out the old code
-#|# Open the file and read the content in a list
-#|with open(
-#|    os.path.join(dirname, "data/nlp_papers_citation_age.txt"),
-#|    "r",
-#|    encoding="utf-8",
-#|) as filehandle:
-#|    for line in filehandle:
-#|        temp = float(line[:-1])
-#|        mean_citation_ages.append(temp)
 def generate_cfdi_plot(input_cfdi, compute_type="paper"):

 # Compute the mean CADI
 mean_citation_ages = []
+# Open the file and read the content in a list
+with open(
+    os.path.join(dirname, "data/nlp_papers_citation_age.txt"),
+    "r",
+    encoding="utf-8",
+) as filehandle:
+    for line in filehandle:
+        temp = float(line[:-1])
+        mean_citation_ages.append(temp)
 def generate_cfdi_plot(input_cfdi, compute_type="paper"):
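
The restored loader in plots.py parses one float per line with `float(line[:-1])`, which drops the last character even when the final line has no trailing newline. A minimal sketch of a more tolerant reader follows. The helper name `read_float_lines` is hypothetical, and the one-value-per-line file layout is an assumption, since the data file's diff is not rendered.

```python
def read_float_lines(path):
    """Read one float per line, ignoring blank lines and surrounding whitespace."""
    values = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()  # removes "\n" if present, keeps all digits
            if not text:
                continue  # skip empty lines instead of raising ValueError
            values.append(float(text))
    return values
```

Using `strip()` rather than slicing keeps the last digit intact when the file does not end with a newline.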

s2.py CHANGED

@@ -5,6 +5,7 @@
 import asyncio
 import datetime
 import os
 from collections import Counter
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import List, Tuple
@@ -12,8 +13,12 @@ from typing import List, Tuple
 import aiohttp
 import requests
-from aclanthology import (async_match_acl_id_to_s2_paper, extract_author_info,
-                          extract_paper_info, extract_venue_info)
 from metrics import calculate_gini, calculate_gini_simpson
 from pdf import parse_pdf_to_artcile_dict
@@ -34,21 +39,47 @@ def get_or_create_eventloop():
             return asyncio.get_event_loop()
-def send_s2_request(request_url):
     """
     Sends a GET request to the specified URL with the S2 API key in the headers.
     Args:
         request_url (str): The URL to send the request to.
     Returns:
-        requests.Response: The response object returned by the request.
     """
-    return requests.get(
-        request_url,
-        headers={"x-api-key": os.environ["s2apikey"]},
-        timeout=10,
-    )
 def check_s2_id_type(semantic_scholar_id):
@@ -64,9 +95,9 @@ def check_s2_id_type(semantic_scholar_id):
         if the ID is not valid for either a paper or an author.
     """
     # First, check if it's a paper ID
-    paper_response = requests.get(
         f"https://api.semanticscholar.org/v1/paper/{semantic_scholar_id}",
-        timeout=5,
     )
     # If the response status code is 200, it means the ID is valid for a paper
@@ -74,17 +105,19 @@ def check_s2_id_type(semantic_scholar_id):
         return "paper", None
     # Next, check if it's an author ID
-    author_response = requests.get(
         f"https://api.semanticscholar.org/v1/author/{semantic_scholar_id}",
-        timeout=5,
     )
     # If the response status code is 200, it means the ID is valid for an author
     return (
         "author",
-        author_response.json()["name"]
-        if author_response.status_code == 200
-        else "invalid",
     )
@@ -167,29 +200,18 @@ def compute_stats_for_references(s2_ref_paper_keys, year):
         year_ref for year_ref in reference_year_list if year_ref is not None
     ]
     reference_title_list = [
-        title_ref
-        for title_ref in reference_title_list
-        if title_ref is not None
     ]
     # Count references
     num_references = len(reference_year_list)
     # Flatten list and count occurrences
-    fields_of_study_counts = dict(
-        Counter(
-            [
-                field
-                for field in reference_fos_list
-            ]
-        )
-    )
     # Citation age list
     aoc_list = [
-        year - year_ref
-        for year_ref in reference_year_list
-        if year_ref and year
     ]
     if not aoc_list:
         return None, None, None, None, None, None
@@ -253,9 +275,7 @@ def compute_stats_for_s2_paper(ssid_paper_id):
             result["year"],
             result["authors"],
         )
-        title_authors = (
-            title + "\n" + ", ".join([author["name"] for author in authors])
-        )
         (
             num_references,
@@ -304,12 +324,14 @@ def compute_stats_for_acl_paper(url):
     Returns:
         dict: A dictionary containing statistics for the paper, or None if the paper was not found.
     """
     if paper_info := extract_paper_info(url):
         loop = get_or_create_eventloop()
         # Match paper ID to Semantic Scholar ID
         s2_paper = loop.run_until_complete(
             async_match_acl_id_to_s2_paper(paper_info["acl_id"])
         )
         return compute_stats_for_s2_paper(s2_paper["paperId"])
     return None
@@ -420,19 +442,44 @@ def compute_stats_for_multiple_s2_papers(
     )
-async def send_s2_async_request(url):
     """
-    Sends an asynchronous request to the specified URL and returns the response as a JSON object.
     Args:
         url (str): The URL to send the request to.
     Returns:
-        dict: The response from the URL as a JSON object.
     """
-    async with aiohttp.ClientSession() as session:
-        async with session.get(url) as response:
-            return await response.json()
 async def match_title_to_s2_paper(title, authors=None):
@@ -447,9 +494,7 @@ async def match_title_to_s2_paper(title, authors=None):
         str or None: Returns the S2 paper ID if found, otherwise None.
     """
     # Send a request to the Semantic Scholar API to search for the paper by its title
-    search_url = (
-        f"http://api.semanticscholar.org/graph/v1/paper/search?query={title}"
-    )
     # Send request
     response = await send_s2_async_request(search_url)

 import asyncio
 import datetime
 import os
+import time  # Added to implement retry delays
 from collections import Counter
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import List, Tuple
 import aiohttp
 import requests
+from aclanthology import (
+    async_match_acl_id_to_s2_paper,
+    extract_author_info,
+    extract_paper_info,
+    extract_venue_info,
+)
 from metrics import calculate_gini, calculate_gini_simpson
 from pdf import parse_pdf_to_artcile_dict
             return asyncio.get_event_loop()
+def send_s2_request(request_url, max_retries: int = 3):
     """
     Sends a GET request to the specified URL with the S2 API key in the headers.
+    If the request is rate-limited (HTTP 429), it will be retried after the
+    delay specified in the ``Retry-After`` header. A maximum of
+    ``max_retries`` additional attempts will be made before the response is
+    returned as-is.
     Args:
         request_url (str): The URL to send the request to.
+        max_retries (int, optional): Maximum number of retries after a 429
+            response. Defaults to 3.
     Returns:
+        requests.Response: The final response object.
     """
+    for attempt in range(max_retries + 1):
+        response = requests.get(
+            request_url,
+            headers={"x-api-key": os.environ["s2apikey"]},
+            timeout=10,
+        )
+        # Return early if not rate-limited or retries exhausted
+        if response.status_code != 429 or attempt == max_retries:
+            return response
+        print(response.status_code)
+        print(response.headers)
+        print(response.text)
+        # Respect the Retry-After header if present
+        retry_after_header = response.headers.get("Retry-After", "3")
+        try:
+            wait_seconds = int(retry_after_header)
+        except ValueError:
+            # Header could be an HTTP-date; fall back to 3 seconds
+            wait_seconds = 3
+        time.sleep(wait_seconds)
 def check_s2_id_type(semantic_scholar_id):
         if the ID is not valid for either a paper or an author.
     """
     # First, check if it's a paper ID
+    paper_response = send_s2_request(
         f"https://api.semanticscholar.org/v1/paper/{semantic_scholar_id}",
+        max_retries=3,
     )
     # If the response status code is 200, it means the ID is valid for a paper
         return "paper", None
     # Next, check if it's an author ID
+    author_response = send_s2_request(
         f"https://api.semanticscholar.org/v1/author/{semantic_scholar_id}",
+        max_retries=3,
     )
     # If the response status code is 200, it means the ID is valid for an author
     return (
         "author",
+        (
+            author_response.json()["name"]
+            if author_response.status_code == 200
+            else "invalid"
+        ),
     )
         year_ref for year_ref in reference_year_list if year_ref is not None
     ]
     reference_title_list = [
+        title_ref for title_ref in reference_title_list if title_ref is not None
     ]
     # Count references
     num_references = len(reference_year_list)
     # Flatten list and count occurrences
+    fields_of_study_counts = dict(Counter([field for field in reference_fos_list]))
     # Citation age list
     aoc_list = [
+        year - year_ref for year_ref in reference_year_list if year_ref and year
     ]
     if not aoc_list:
         return None, None, None, None, None, None
             result["year"],
             result["authors"],
         )
+        title_authors = title + "\n" + ", ".join([author["name"] for author in authors])
         (
             num_references,
     Returns:
         dict: A dictionary containing statistics for the paper, or None if the paper was not found.
     """
+    print(extract_paper_info(url))
     if paper_info := extract_paper_info(url):
         loop = get_or_create_eventloop()
         # Match paper ID to Semantic Scholar ID
         s2_paper = loop.run_until_complete(
             async_match_acl_id_to_s2_paper(paper_info["acl_id"])
         )
+        print(s2_paper)
         return compute_stats_for_s2_paper(s2_paper["paperId"])
     return None
     )
+async def send_s2_async_request(url, max_retries: int = 3):
     """
+    Sends an asynchronous GET request to the specified URL and returns the
+    response body as JSON.
+    Similar to :pyfunc:`send_s2_request`, this helper transparently retries
+    when the Semantic Scholar API responds with HTTP 429. The delay before
+    retrying is taken from the ``Retry-After`` header.
     Args:
         url (str): The URL to send the request to.
+        max_retries (int, optional): Maximum number of retries after a 429
+            response. Defaults to 3.
     Returns:
+        dict: The response parsed as JSON.
     """
+    headers = {"x-api-key": os.environ.get("s2apikey", "")}
+    timeout = aiohttp.ClientTimeout(total=10)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        for attempt in range(max_retries + 1):
+            async with session.get(url, headers=headers) as response:
+                if response.status != 429 or attempt == max_retries:
+                    return await response.json()
+                print(response.status)
+                print(response.headers)
+                print(await response.text())
+                retry_after_header = response.headers.get("Retry-After", "3")
+                try:
+                    wait_seconds = int(retry_after_header)
+                except ValueError:
+                    wait_seconds = 3
+                await asyncio.sleep(wait_seconds)
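
Both request helpers fall back to 3 seconds when `Retry-After` is not an integer, and the comment in `send_s2_request` notes that the header could be an HTTP-date. The sketch below shows one way to handle both forms with the standard library. The helper name `parse_retry_after` and its `default` and `now` parameters are hypothetical and not part of this commit.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=3.0, now=None):
    """Return the number of seconds to wait for a Retry-After header value.

    Accepts either delay-seconds ("5") or an HTTP-date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). Falls back to `default` otherwise.
    """
    if value is None:
        return default
    value = value.strip()
    try:
        # Delay-seconds form; clamp negatives to zero.
        return max(0.0, float(int(value)))
    except ValueError:
        pass
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: use the fallback delay.
        return default
    if retry_at.tzinfo is None:
        # Treat a naive timestamp as UTC (an assumption of this sketch).
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```

The result could be passed to `time.sleep` or `asyncio.sleep` in place of the integer-only parsing. Whether the Semantic Scholar API ever sends the date form is not stated in the source.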
 async def match_title_to_s2_paper(title, authors=None):
         str or None: Returns the S2 paper ID if found, otherwise None.
     """
     # Send a request to the Semantic Scholar API to search for the paper by its title
+    search_url = f"http://api.semanticscholar.org/graph/v1/paper/search?query={title}"
     # Send request
     response = await send_s2_async_request(search_url)
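
The search URL above interpolates the raw title into the query string, so titles containing characters such as `&`, `?`, or `:` may be misread by the server. A minimal sketch of building the same URL with percent-encoding follows. The helper name `build_search_url` is hypothetical, and using `https` instead of the `http` scheme in the diff is an assumption.

```python
from urllib.parse import urlencode

# Endpoint path taken from the diff; the https scheme is an assumption.
S2_SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(title):
    """Return the paper-search URL with the title safely percent-encoded."""
    return f"{S2_SEARCH_ENDPOINT}?{urlencode({'query': title})}"
```

The returned string can be passed to `send_s2_async_request` unchanged, since that helper only needs a complete URL.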