import streamlit as st
import pandas as pd
import altair as alt

original_data = pd.read_csv('https://huggingface.co/spaces/fallinginfall65/final_project/resolve/main/2015_salary_reporting.csv')
data_2014 = pd.read_csv('https://huggingface.co/spaces/fallinginfall65/final_project/resolve/main/2014_salary_repo.csv')

# Pay difference: 2016 budgeted salary minus 2015 total pay, per row.
original_data["Pay Difference"] = original_data["2016 Budgeted Salary"] - original_data["2015 Total Pay"]
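
# A hedged sketch for spotting rows whose negative "Pay Difference" comes from a
# missing budget figure rather than a real pay cut. Treating 0 or NaN as
# "missing" is an assumption about how the source file encodes absent values,
# and the helper name is hypothetical (it is not part of the original app).
import pandas as pd  # repeated here only so the sketch stands alone


def rows_missing_budget(df, budget_col="2016 Budgeted Salary"):
    """Return the rows whose budget column is NaN or 0."""
    budget = pd.to_numeric(df[budget_col], errors="coerce")
    return df[budget.isna() | (budget == 0)]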

# Average the numeric columns per job title and keep the 20 highest-paid titles.
numeric_data = original_data.select_dtypes(include=['float64', 'int64'])
top_20_jobs = (
    original_data.groupby("Current Job Title")[numeric_data.columns]
    .mean()
    .sort_values("2015 Pay", ascending=False)
    .head(20)
    .reset_index()
)

data = original_data[original_data["Current Job Title"].isin(top_20_jobs["Current Job Title"])]
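
# A minimal sketch of the top-N filter above, written as a reusable helper so
# the same logic could serve both the 2015 and the 2014 data. The function name
# and parameters are hypothetical (not part of the original app); it assumes
# `value_col` is numeric and that both columns exist in the frame.
import pandas as pd  # repeated here only so the sketch stands alone


def filter_top_titles(df, title_col, value_col, n=20):
    """Keep rows whose title is among the n titles with the highest mean value."""
    top_titles = (
        df.groupby(title_col)[value_col]
        .mean()
        .nlargest(n)
        .index
    )
    return df[df[title_col].isin(top_titles)]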

# Attach the 2015 budgeted salary from the 2014 file, keeping one row per job title.
merged_filtered_data = pd.merge(
    data,
    data_2014[['Current Job Title', '2015 Budgeted Salary']],
    on='Current Job Title',
    how='inner'
).drop_duplicates(subset=['Current Job Title'])

merged_filtered_data['Difference'] = (
    merged_filtered_data['2015 Total Pay'] - merged_filtered_data['2015 Budgeted Salary']
)
st.title("Switching Gears?")
st.subheader("Team Member:")
st.markdown(
    """
    Ethan Meng
    """
)
st.subheader("Description")
st.markdown(
    """
    The dataset is small, so it is hosted on HuggingFace. Select a bar in the dashboard to show
    the data points for the department you are interested in. The data is filtered to the top 20
    job titles by "2015 Pay". Some rows share the same "Job Title" and "Department", so a
    department can show multiple data points when it is selected. The purpose of the
    visualization is to show the average total pay in 2015 for each department and to compare the
    changes in wages in 2016.

    Link to dataset: https://data.illinois.gov/dataset/5cb07295-93e7-46cc-858f-09e8c849802e/resource/4173bb1c-006f-4fd5-8ec5-302ed7c52560/download/2015_salary_reporting.csv
    """
)
st.subheader("Visualization")
# Clicking a bar selects a department; the scatter plot filters on this selection.
department_selection = alt.selection_point(fields=["Department Location"])

bar_chart = alt.Chart(data).mark_bar().encode(
    x=alt.X("mean(2015 Total Pay):Q", title="Average 2015 Total Pay"),
    y=alt.Y("Department Location:N", title="Department Location", sort="-x"),
    color=alt.condition(
        department_selection,
        alt.Color("Department Location:N", title="Department"),
        alt.value("lightgray")
    ),
    tooltip=[
        alt.Tooltip("Department Location:N", title="Department"),
        alt.Tooltip("mean(2015 Total Pay):Q", title="Average Total Pay"),
    ]
).add_params(
    department_selection
).properties(
    title="Average 2015 Total Pay by Department Location",
    width=700,
    height=400
)

scatter_chart = alt.Chart(data).mark_circle(size=100).encode(
    x=alt.X("Current Job Title:N", title="Job Title"),
    y=alt.Y("Pay Difference:Q", title="Pay Difference"),
    color=alt.Color("Department Location:N", title="Department"),
    tooltip=[
        alt.Tooltip("First Name:N", title="First Name"),
        alt.Tooltip("Last Name:N", title="Last Name"),
        alt.Tooltip("Pay Difference:Q", title="Pay Difference"),
        alt.Tooltip("Department Location:N", title="Department"),
        alt.Tooltip("Current Job Title:N", title="Job Title"),
    ]
).transform_filter(
    department_selection
).properties(
    title="Pay Difference (Filtered by Department)",
    width=700,
    height=400
)

combined_chart = alt.vconcat(bar_chart, scatter_chart).configure_title(
    fontSize=30
).configure_axis(
    labelFontSize=18,
    titleFontSize=18
).configure_legend(
    titleFontSize=18,
    labelFontSize=16
)

st.altair_chart(combined_chart, use_container_width=False)

st.subheader("Contextual Dataset")
st.markdown(
    """
    The contextual dataset that I included is from data.illinois.gov. Its structure is almost the same as the main dataset, but it covers 2014.

    Link to dataset: https://data.illinois.gov/dataset/e30b5cb2-c1e8-428c-ae64-546498276690/resource/0a6f537a-a233-48f5-9967-34e54e2eaa79/download/2014_salary_repo
    """
)

difference_chart = alt.Chart(merged_filtered_data).mark_bar(color='orange').encode(
    x=alt.X('Current Job Title', sort='-y', title='Job Titles'),
    y=alt.Y('Difference', title='Difference in Salary'),
    tooltip=['Current Job Title', 'Difference']
).properties(
    title='Difference Between 2015 Total Pay and Budgeted Salary',
    width=800,
    height=400
).configure_axis(
    labelAngle=90
)

# Repeat the top-20 filter for the 2014 data (this reuses the variable names from above).
numeric_data = data_2014.select_dtypes(include=['float64', 'int64'])
top_20_jobs = (
    data_2014.groupby("Current Job Title")[numeric_data.columns]
    .mean()
    .sort_values("2014 Pay", ascending=False)
    .head(20)
    .reset_index()
)

new_data_2014 = data_2014[data_2014["Current Job Title"].isin(top_20_jobs["Current Job Title"])]

bar_chart2 = alt.Chart(new_data_2014).mark_bar().encode(
    x=alt.X("mean(2014 Total Pay):Q", title="Average 2014 Total Pay"),
    y=alt.Y("Department Location:N", title="Department Location", sort="-x"),
    color=alt.condition(
        department_selection,
        alt.Color("Department Location:N", title="Department"),
        alt.value("lightgray")
    ),
    tooltip=[
        alt.Tooltip("Department Location:N", title="Department"),
        alt.Tooltip("mean(2014 Total Pay):Q", title="Average Total Pay"),
    ]
).add_params(
    department_selection
).properties(
    title="Average 2014 Total Pay by Department Location",
    width=800,
    height=400
)

st.altair_chart(difference_chart)
st.altair_chart(bar_chart2)
st.markdown(
    """
    I generated the graphs myself, and the code is linked below.

    Link to the code: https://huggingface.co/spaces/fallinginfall65/fp3.1.1/blob/main/app.py
    """
)
st.subheader("Write Up")
st.markdown(
    """
    The dataset contains two categorical fields, "Job Title" and "Department Location". I filtered the dataset so that only the top twenty job titles, ranked by average 2015 pay, are shown.
    After taking this subset, I realized that some rows share the same job title and department location, so a scatter plot is the best way to look deeper into the data.
    The bar plot at the top shows the average 2015 total pay for each department, sorted so that the user can read the distribution easily. I used the average because departments have different numbers of data points,
    and an average is fairer to departments that have only one data point. Users can select a single bar to look deeper into the data points from that department.

    The scatter plot adds detail to the bar graph. Once the user selects a bar, it shows only the data points from that department, arranged by job title. The user can
    hover over a data point to see its details: first name, last name, pay difference, department, and job title. As the scatter plot shows, almost every
    data point has a positive y-value. The one exception is a data point that does not have a 2016 budgeted salary.

    Moreover, the contextual visualization applies the same approach to the 2014 and 2015 datasets. The contextual dataset also serves to verify that the 2015 budgeted salary does not differ much from the
    2015 total pay. The first bar graph shows that most job titles have a difference of 0, and the single exception is due to a missing actual figure. The second bar graph shows the average 2014 total pay in sorted order,
    which makes it possible to compare the ordering of departments with 2015. The ordering is slightly different from that of the 2015 average total pay, which would be interesting to look into further.
    """
)