Exploratory Data Analysis Example: Complete with a Jupyter Notebook, Slide Deck, and README.

Jesse Villines
Feb 15, 2021

In this post, I share an example Exploratory Data Analysis (EDA) project. The purpose of this project is to illustrate various techniques and visualizations and to practice using Jupyter Notebooks and several Python libraries, most notably Pandas. If you are like me and very familiar with Excel, projects like this are invaluable for learning how to manipulate, clean, and examine data using Python. There is a substantial learning curve, but after a few weeks of practice the benefits and reproducibility of Python will become very apparent and you will be reluctant to use Excel again. This project is built in Jupyter Notebooks and illustrates an intro-level EDA of ACT/SAT and CAASPP data for the state of California. I hope you enjoy following along and find this workbook useful in your journey towards becoming a Python whiz!

GitHub Repository: https://github.com/jesservillines/Introduction_to_Exploratory_Data_Analysis

Contents:

Problem Statement
Executive Summary
Data Dictionary
Conclusions and Recommendations

Problem Statement

The state of California is responsible for allocating 58% of the $97.2 billion of 2018–19 funding for California’s public schools (source). This project aims to provide guidance for the State of California’s education budget and make recommendations on the best way to track student success. California performs worse than the national average on the SAT and wants to improve student performance. This study examines local factors associated with sub-benchmark standardized test scores. The state wants to examine the nationally administered ACT and SAT in addition to the state-administered CAASPP (California Assessment of Student Performance and Progress). As colleges trend away from using the ACT/SAT (source), California wants to make sure that it has viable ways to measure student outcomes. The goal is to optimize the effectiveness of each dollar of education spending by the state.

In pursuit of this goal, the state wants to identify barriers to a student’s academic preparation for higher education. This analysis briefly compares California’s ACT/SAT results to the nation’s, then compares the national standardized college readiness tests to the state-administered CAASPP as viable measures of student outcomes. Finally, it examines how college readiness varies across California’s counties in order to summarize possible causes of poor performance and make suggestions for improvement.

I begin by comparing California’s ACT/SAT test results to the nation’s. Here we find that California’s record in preparing our students for college entrance exams is mixed compared to the national average, depending on the test.

For the ACT, our students score, on average, half a standard deviation above the mean. For the SAT, our students score one standard deviation below the mean. In addition, preference for the ACT or the SAT varies across districts and schools in California, but overall the SAT is favored. Performance on the SAT is an area for improvement.
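A statement like "half a standard deviation above the mean" is a z-score. The snippet below is a minimal sketch of that calculation, assuming the comparison is made against a set of state-level average scores. The function name `z_score` and the numbers are hypothetical and are not taken from the project's data.

```python
import pandas as pd


def z_score(value: float, reference: pd.Series) -> float:
    """Return how many standard deviations `value` lies from the mean of `reference`."""
    # ddof=0 treats `reference` as the full population being compared against.
    return (value - reference.mean()) / reference.std(ddof=0)


# Hypothetical state-level average scores, for illustration only.
state_means = pd.Series(
    {"State A": 20.0, "State B": 22.0, "State C": 24.0, "State D": 18.0, "State E": 21.0}
)

# A state averaging 22.0 sits above the mean of this made-up reference group.
example = z_score(22.0, state_means)
print(f"z-score: {example:+.2f}")
```

A positive value means the state scores above the reference mean, and a negative value means it scores below.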

Given the provided California SAT/ACT data sets, there is no way to tell if a student took both the ACT and SAT, only the rates at which students take either test. In order to avoid double-counting a student who took both the ACT and SAT, I filtered the data set to include only the scores for the test that was most taken in each county and district. I also removed records that were empty (schools that did not take the test in question) and schools with anonymized records (fewer than 15 tests taken). I also focused on only 11th-grade students from the CAASPP dataset. This project measures the percentage of students in a county who did not meet the benchmark score on either test. A benchmark score is a score that, when met or exceeded, is deemed acceptable for college readiness. Here are the benchmark scores for the ACT/SAT/CAASPP.
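The filtering steps described above can be sketched in Pandas roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the column names (`county`, `test`, `num_tested`, `num_met_benchmark`), the helper `sub_benchmark_rate`, and the sample rows are all hypothetical, and the sketch works at the county level only.

```python
import pandas as pd

MIN_TESTS = 15  # records with fewer tests taken are treated as anonymized


def sub_benchmark_rate(df: pd.DataFrame) -> pd.Series:
    """Percent of test takers per county below benchmark, using each county's most-taken test."""
    # Drop empty records (no test takers reported) and anonymized records.
    df = df.dropna(subset=["num_tested", "num_met_benchmark"])
    df = df[df["num_tested"] >= MIN_TESTS]

    # Total test takers and benchmark-meeting students per county and test.
    totals = df.groupby(["county", "test"], as_index=False)[
        ["num_tested", "num_met_benchmark"]
    ].sum()

    # Keep only the most-taken test in each county to avoid double counting.
    top = totals.loc[totals.groupby("county")["num_tested"].idxmax()].set_index("county")

    pct_below = 100 * (1 - top["num_met_benchmark"] / top["num_tested"])
    return pct_below.rename("pct_below_benchmark")


# Hypothetical school-level records, for illustration only.
records = pd.DataFrame({
    "county": ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "test": ["SAT", "SAT", "ACT", "ACT", "SAT", "SAT"],
    "num_tested": [100, 10, 40, 50, None, 30],
    "num_met_benchmark": [60, 5, 30, 20, None, 15],
})

rates = sub_benchmark_rate(records)
print(rates)
```

In this made-up data, the SAT is the most-taken test in Alpha and the ACT in Beta, so each county contributes a single rate.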

The ACT measures a student’s performance in the following subjects: English, mathematics, reading, and science (source). The SAT is split into two sections, Math and Evidence-Based Reading and Writing (source). These tests are used by college admission departments in determining acceptance and financial aid for a student at their respective school. Look here for more information about the ACT and here for more information about the SAT.

In addition to the nationally administered ACT and SAT, I look at the state-administered California Assessment of Student Performance and Progress (CAASPP) test. Similar to the ACT/SAT, the CAASPP measures a student’s performance in the academic categories of English, math, and science. Because the CAASPP is administered in grades 4–8 and in 11th grade, more frequently and at younger ages than the ACT/SAT, its scores could be useful as an early indicator of a student’s track towards college preparedness (source). This is an area for additional research.

The geographic distribution of benchmark scores is similar for both the national tests and the state test, suggesting that there are specific local factors that influence test scores.

This study then examined the distribution of median household income and the poverty rate among those under 18 throughout the state and found that both trend with standardized academic test results: there is a strong correlation between these socioeconomic factors and test results. This study does not examine causality but suggests this as an area for additional research.
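One simple way to quantify a relationship like this is a Pearson correlation between county-level socioeconomic measures and the sub-benchmark rate. The sketch below uses made-up values and hypothetical column names. It only illustrates the calculation and does not reproduce the project's results.

```python
import pandas as pd

# Hypothetical county-level values, for illustration only.
counties = pd.DataFrame(
    {
        "median_household_income": [45000, 52000, 61000, 78000, 95000],
        "poverty_rate_under_18": [30.0, 26.0, 21.0, 14.0, 9.0],
        "pct_below_benchmark": [68.0, 63.0, 55.0, 44.0, 35.0],
    },
    index=["County A", "County B", "County C", "County D", "County E"],
)

# Pearson correlation of each socioeconomic measure with the sub-benchmark rate.
correlations = counties.corr()["pct_below_benchmark"].drop("pct_below_benchmark")
print(correlations)
```

In this toy data, income moves opposite to the sub-benchmark rate (negative correlation) and child poverty moves with it (positive correlation). As noted above, a correlation alone says nothing about causality.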

I take a look at a combination of national and state-specific data.

Data Sets used

Data Dictionary

Conclusions and Recommendations:

There is a link between local factors and student outcomes on standardized academic assessments. When allocating state funds to school districts these local factors need to be considered. Student outcomes are a local problem and need local solutions.

The distribution of sub-benchmark students varies widely by county. Counties with the highest rates of sub-benchmark students were sub-benchmark for all examined assessment tests. This suggests that national and state academic assessment tests favor and disfavor the same bins of students. I then examined the bins of sub-benchmark students for shared local factors. This local analysis shows a strong link between median household income, poverty rates among those under 18, and a student’s result on academic assessment tests. If median household income is lower than average in a county, test results are more likely to be sub-benchmark. If rates of poverty among those under 18 are higher, test results are more likely to be sub-benchmark.
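To check whether the same counties struggle on every test, one option is to take the counties with the highest sub-benchmark rate on each test and intersect those sets. The helper `shared_worst_counties`, the column names, and the numbers below are hypothetical. This is a sketch of the idea rather than the notebook's method.

```python
import pandas as pd


def shared_worst_counties(rates: pd.DataFrame, n: int) -> set:
    """Counties that are among the n highest sub-benchmark rates on every test column."""
    worst_per_test = [set(rates[col].nlargest(n).index) for col in rates.columns]
    return set.intersection(*worst_per_test)


# Hypothetical percent of students below benchmark per county and test.
rates_by_test = pd.DataFrame(
    {
        "ACT": [70.0, 65.0, 40.0, 35.0, 50.0],
        "SAT": [72.0, 60.0, 45.0, 30.0, 55.0],
        "CAASPP": [68.0, 66.0, 38.0, 42.0, 50.0],
    },
    index=["County A", "County B", "County C", "County D", "County E"],
)

overlap = shared_worst_counties(rates_by_test, n=2)
print(sorted(overlap))
```

A large overlap between the per-test sets is consistent with the tests disfavoring the same groups of students.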

Recommendations:

  • Provide schools with free, nutritious lunches. Access to healthy food can improve academic performance (source).
  • Update computer labs in lower-income areas where students may not have computers available at home.
  • Create scalable tutoring software for academic assessment tests.
  • Look at the CAASPP instead of the ACT/SAT as a measure of whether students are meeting preparedness benchmarks.

Suggestions for continued analysis include:

  • English Learners per capita. A non-native English speaker may have additional trouble on the English portions of academic assessment tests. If there are more English learners, language tutors could be a better use of educational funds than academic tutoring software.
  • Crime rates per capita. High crime rates in home neighborhoods could distract students from studying. Providing healthy outlets such as additional after-school programs or library study hours could be helpful.
  • Student Drop-out Rates.
  • Farmed acres per capita. California is a major agricultural producer (source). Are the counties with more sub-benchmark students more agricultural? If so, can we design class projects around harvest schedules, when students may be assisting parents with the harvest?

Jesse Villines

I’m a healthcare data scientist with a passion for mining golden insights from complex datasets.