import streamlit as st
import pandas as pd
import altair as alt

# Load the 2015 salary report (main dataset) and the 2014 report (contextual dataset).
original_data = pd.read_csv('https://huggingface.co/spaces/fallinginfall65/final_project/resolve/main/2015_salary_reporting.csv')
data_2014 = pd.read_csv('https://huggingface.co/spaces/fallinginfall65/final_project/resolve/main/2014_salary_repo.csv')

# Change from 2015 actual pay to the 2016 budgeted salary, per row.
original_data["Pay Difference"] = original_data["2016 Budgeted Salary"] - original_data["2015 Total Pay"]
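
# --- Illustrative sketch only: not part of the original app and never called below. ---
# The write-up notes that one data point has a negative pay difference because it
# has no 2016 budgeted salary. Assuming (this is an assumption, not confirmed by the
# data) that such rows store the missing budget as 0, a helper like the hypothetical
# `pay_difference_ignoring_missing_budget` could treat those rows as missing instead.
def pay_difference_ignoring_missing_budget(
    frame, budget_col="2016 Budgeted Salary", pay_col="2015 Total Pay"
):
    """Return budget minus pay, with zero budgets treated as missing (NaN)."""
    # Series.where keeps values where the condition holds and puts NaN elsewhere.
    budget = frame[budget_col].where(frame[budget_col] != 0)
    return budget - frame[pay_col]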

# Average every numeric column per job title and keep the 20 highest-paid titles.
numeric_data = original_data.select_dtypes(include=['float64', 'int64'])
top_20_jobs = (
    original_data.groupby("Current Job Title")[numeric_data.columns]
    .mean()
    .sort_values("2015 Pay", ascending=False)
    .head(20)
    .reset_index()
)

# Keep only the rows whose job title is among those top 20 titles.
data = original_data[original_data["Current Job Title"].isin(top_20_jobs["Current Job Title"])]

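# Attach the 2015 budgeted salary from the 2014 report by job title.
# The join can produce several rows per title, so keep only the first row for each.
merged_filtered_data = pd.merge(
    data,
    data_2014[['Current Job Title', '2015 Budgeted Salary']],
    on='Current Job Title',
    how='inner'
).drop_duplicates(subset=['Current Job Title'])

# Actual 2015 pay minus the salary that was budgeted for 2015.
merged_filtered_data['Difference'] = (
    merged_filtered_data['2015 Total Pay'] - merged_filtered_data['2015 Budgeted Salary']
)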
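
# --- Illustrative sketch only: not part of the original app and never called below. ---
# The merge above joins on job title and then keeps the first row per title.
# A possible alternative, assuming one averaged budget figure per title is acceptable,
# is to collapse the 2014 report to one row per title before merging.
# `mean_budget_by_title` is a hypothetical helper name.
def mean_budget_by_title(
    frame, title_col="Current Job Title", value_col="2015 Budgeted Salary"
):
    """Return one row per job title holding the mean of `value_col`."""
    return frame.groupby(title_col, as_index=False)[value_col].mean()
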
st.title("Switching Gears?")
st.subheader("Team Member:")
st.markdown(
    """
    Ethan Meng
    """
)
st.subheader("Description")
st.markdown(
    """
    The dataset is small, so I uploaded it to Hugging Face to host it. To use the 
    dashboard, select a bar to show the data points for the department you are 
    interested in. The data is filtered to the 20 job titles with the highest average "2015 Pay" 
    in the dataset. Some rows share the same "Job Title" and "Department", so a selected 
    department can show multiple data points. The purpose of the 
    visualization is to show the average total pay in 2015 for each department and to compare it with the 
    budgeted salary for 2016.

    Link to dataset: https://data.illinois.gov/dataset/5cb07295-93e7-46cc-858f-09e8c849802e/resource/4173bb1c-006f-4fd5-8ec5-302ed7c52560/download/2015_salary_reporting.csv
    """
)
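st.subheader("Visualization")
# Clicking a bar selects its department; the selection highlights the bar and filters the scatter plot.
department_selection = alt.selection_point(fields=["Department Location"])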

bar_chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("mean(2015 Total Pay):Q", title="Average 2015 Total Pay"),
    y=alt.Y("Department Location:N", title="Department Location", sort="-x"),
    color=alt.condition(
        department_selection,
        alt.Color("Department Location:N", title="Department"),
        alt.value("lightgray")
    ),
    tooltip=[
        alt.Tooltip("Department Location:N", title="Department"),
        alt.Tooltip("mean(2015 Total Pay):Q", title="Average Total Pay"),
    ]
).add_params(
    department_selection
).properties(
    title="Average 2015 Total Pay by Department Location",
    width=700,
    height=400
)

scatter_chart = alt.Chart(data).mark_circle(size=100).encode(
    x=alt.X("Current Job Title:N", title="Job Title"),
    y=alt.Y("Pay Difference:Q", title="Pay Difference"),
    color=alt.Color("Department Location:N", title="Department"),
    tooltip=[
        alt.Tooltip("First Name:N", title="First Name"),
        alt.Tooltip("Last Name:N", title="Last Name"),
        alt.Tooltip("Pay Difference:Q", title="Pay Difference"),
        alt.Tooltip("Department Location:N", title="Department"),
        alt.Tooltip("Current Job Title:N", title="Job Title"),
    ]
).transform_filter(
    department_selection
).properties(
    title="Pay Difference (Filtered by Department)",
    width=700,
    height=400
)

combined_chart = alt.vconcat(bar_chart, scatter_chart).configure_title(
    fontSize=30
).configure_axis(
    labelFontSize=18,
    titleFontSize=18
).configure_legend(
    titleFontSize=18,
    labelFontSize=16
)

st.altair_chart(combined_chart, use_container_width=False)

st.subheader("Contextual Dataset")
st.markdown(
    """
    The contextual dataset that I included is also from data.illinois.gov. Its structure is almost the same as the main dataset, but it covers 2014.

    Link to dataset: https://data.illinois.gov/dataset/e30b5cb2-c1e8-428c-ae64-546498276690/resource/0a6f537a-a233-48f5-9967-34e54e2eaa79/download/2014_salary_repo
    """
)

difference_chart = alt.Chart(merged_filtered_data).mark_bar(color='orange').encode(
    x=alt.X('Current Job Title', sort='-y', title='Job Titles'),
    y=alt.Y('Difference', title='Difference in Salary'),
    tooltip=['Current Job Title', 'Difference']
).properties(
    title='Difference Between 2015 Total Pay and Budgeted Salary',
    width=800,
    height=400
).configure_axis(
    labelAngle=90
)

# Repeat the top-20 filtering for the 2014 report, using separate names so the 2015 results are not overwritten.
numeric_data_2014 = data_2014.select_dtypes(include=['float64', 'int64'])
top_20_jobs_2014 = (
    data_2014.groupby("Current Job Title")[numeric_data_2014.columns]
    .mean()
    .sort_values("2014 Pay", ascending=False)
    .head(20)
    .reset_index()
)

new_data_2014 = data_2014[data_2014["Current Job Title"].isin(top_20_jobs_2014["Current Job Title"])]

bar_chart2 = alt.Chart(new_data_2014).mark_bar().encode(
    x=alt.X("mean(2014 Total Pay):Q", title="Average 2014 Total Pay"),
    y=alt.Y("Department Location:N", title="Department Location", sort="-x"),
    color=alt.condition(
        department_selection,
        alt.Color("Department Location:N", title="Department"),
        alt.value("lightgray")
    ),
    tooltip=[
        alt.Tooltip("Department Location:N", title="Department"),
        alt.Tooltip("mean(2014 Total Pay):Q", title="Average Total Pay"),
    ]
).add_params(
    department_selection
).properties(
    title="Average 2014 Total Pay by Department Location",
    width=800,
    height=400
)

st.altair_chart(difference_chart)
st.altair_chart(bar_chart2)
st.markdown(
    """
    I generated the graphs myself, and the code is linked below.

    Link to the code: https://huggingface.co/spaces/fallinginfall65/fp3.1.1/blob/main/app.py
    """
)
st.subheader("Write Up")
st.markdown(
    """
    The dataset contains two categorical fields: "Job Title" and "Department Location". I filtered the dataset so that only the twenty job titles with the highest average 2015 pay are presented.
    After taking this subset, I realized that some rows share the same job title and department location, so a scatter plot is the best way to look deeper into the data.
    The bar plot at the top shows the average 2015 total pay for each department, sorted so that the user can read the distribution easily. I used the average because departments have different numbers of data points,
    which makes the comparison fairer for departments that have only one data point. Users can select a single bar to look deeper into the data points from that department.

    The scatter plot provides detail for the bar graph. Once the user selects a bar, it shows only the data points from that department, arranged by job title. The user can
    hover over a data point to see its first name, last name, pay difference, department, and job title. As the scatter plot shows, almost every
    data point has a positive pay difference. The one exception is a data point that does not have a 2016 budgeted salary.

    Moreover, the contextual visualization applies the same approach to the 2014 and 2015 datasets. The contextual dataset also serves to verify that the 2015 budgeted salary does not differ much from
    the 2015 total pay. The first bar graph shows that most of the data has a difference of 0, and the only exception is due to a missing actual figure. The second bar graph sorts the average 2014 total pay, which shows
    how the ordering of departments differs between years. It is slightly different from the ordering of the 2015 average total pay, which would be interesting to look into further.
    """
)