Overview of This Blog

This week, we had two lessons: one for 2.2 and one for 2.3. This blog includes the hacks that we were assigned to do for 2.3, which was all about extracting information from data. I plan to use this as one of my study tools as we begin to prepare for the AP Exam on May 8.

Early Seed Award

The secret seed award for 2.3 was simply having the data files added to our repository before the tech talk was given. Below are the files that we had to add, along with some of their contents:

from PIL import Image
proof1 = Image.open('../images/datafile.png')
proof2 = Image.open('../images/gradefile.png')

print("DATA.CSV FILE")
display(proof1)

print("GRADE.JSON FILE")
display(proof2)
DATA.CSV FILE
GRADE.JSON FILE

AP Prep

2.3 College Board Practice Problems

(1) A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

The researcher has access to a database with the following information about each student.

Last name

First name

Grade level (9, 10, 11, or 12)

Grade point average (on a 0.0 to 4.0 scale)

The researcher also has access to another database with the following information about each student.

First name

Last name

Number of absences from school

Number of late arrivals to school

Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?

(A) Students who have the same name may be confused with each other.

(B) Students who have the same grade point average may be confused with each other.

(C) Students who have the same grade level may be confused with each other.

(D) Students who have the same number of absences may be confused with each other.

Correct Answer: A
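
To see why (A) is the problem, here is a minimal sketch in Pandas. The student names and values are made up for illustration, and joining the two databases on first and last name is only my assumption about how the researcher might combine them.

```python
import pandas as pd

# Hypothetical records: two different students share the name "Alex Kim".
grades = pd.DataFrame({
    "last": ["Kim", "Kim", "Lopez"],
    "first": ["Alex", "Alex", "Maria"],
    "grade_level": [9, 12, 10],
    "gpa": [3.9, 2.4, 3.1],
})
absences = pd.DataFrame({
    "first": ["Alex", "Alex", "Maria"],
    "last": ["Kim", "Kim", "Lopez"],
    "absences": [1, 14, 3],
})

# Without a unique ID, the only keys the two sources share are the names.
merged = grades.merge(absences, on=["first", "last"])
print(merged)
```

The two students named Alex Kim come out as four merged rows instead of two, because every grade record matches every absence record with the same name. There is no way to tell which number of absences belongs to which GPA.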

(2) A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

(A) A computer program cannot combine data from different files.

(B) Different counties may organize data in different ways.

(C) The number of counties is too large for the program to process.

(D) The total number of rows of data is too large for the program to process.

Correct Answer: B
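
A small sketch of what (B) can look like in practice. The two county tables, their column names, and the pollution measure are all hypothetical; the point is only that each source needs its own mapping before the data can be combined.

```python
import pandas as pd

# Hypothetical files from two counties that record the same facts differently.
county_a = pd.DataFrame({"County": ["A"], "PM2.5 (ug/m3)": [12.0]})
county_b = pd.DataFrame({"county_name": ["B"], "pm25": ["9.5"]})  # stored as text

# Each source is renamed onto one shared set of column names.
a = county_a.rename(columns={"County": "county", "PM2.5 (ug/m3)": "pm25"})
b = county_b.rename(columns={"county_name": "county"})
b["pm25"] = pd.to_numeric(b["pm25"])  # convert the text value to a number

combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

Combining the files is the easy part (so A is wrong), and a few thousand counties is not a lot of data for a program (so C and D are unlikely). Writing a mapping like this for every county that organizes its data differently is where the real work would be.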

(3) A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Select two answers.

(A) Users might attempt to use the Web site to search for multiple cities.

(B) Users might enter abbreviations for the names of cities.

(C) Users might misspell the name of the city.

(D) Users might be slow at typing a city name in the text field.

Correct Answers: B and C
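
The sketch below shows why (B) and (C) each need their own handling. The city list, the abbreviation table, and the resolve_city function are hypothetical names I made up for this example; difflib is part of the Python standard library.

```python
import difflib

# Hypothetical data for illustration only.
KNOWN_CITIES = ["San Diego", "Los Angeles", "San Francisco"]
ABBREVIATIONS = {"sd": "San Diego", "la": "Los Angeles", "sf": "San Francisco"}

def resolve_city(user_input):
    """Return the best-matching known city, or None if nothing is close."""
    cleaned = user_input.strip().lower()
    # Challenge (B): abbreviations need an explicit lookup table.
    if cleaned in ABBREVIATIONS:
        return ABBREVIATIONS[cleaned]
    # Challenge (C): misspellings need approximate (fuzzy) matching.
    lowered = {city.lower(): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(cleaned, list(lowered), n=1, cutoff=0.8)
    return lowered[matches[0]] if matches else None

print(resolve_city("LA"))
print(resolve_city("San Deigo"))
print(resolve_city("Paris"))
```

With this sketch, "LA" resolves through the abbreviation table and "San Deigo" through fuzzy matching, while a name that is not close to any known city returns None. Typing speed (D) never shows up in the code at all, which is why it is not a processing challenge.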

(4) A database of information about shows at a concert venue contains the following information.

Name of artist performing at the show

Date of show

Total dollar amount of all tickets sold

Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

(A) Average ticket price

(B) Length of the show in minutes

(C) Start time of the show

(D) Total dollar amount of food and drinks sold during the show

Correct Answer: A
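
The reasoning behind (A) is simple arithmetic: attendance is roughly the total ticket dollars divided by the average ticket price. The artists and dollar amounts below are made up, and estimated_attendance is a hypothetical helper name.

```python
# Hypothetical shows: (artist, total ticket dollars, average ticket price).
shows = [
    ("Artist X", 50_000, 100),
    ("Artist Y", 40_000, 25),
]

def estimated_attendance(total_sales, average_price):
    """Tickets sold = total dollars / average dollars per ticket."""
    return total_sales / average_price

attendance = {name: estimated_attendance(total, avg) for name, total, avg in shows}
best = max(attendance, key=attendance.get)
print(attendance)
print(best)
```

In this example the artist with the lower total sales actually has the higher attendance (1,600 tickets versus 500), which is why the total dollar amount alone is not enough to answer the question.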

(5) A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

(A) The average number of hours per day that the car is in use

(B) The car’s average speed on a particular day

(C) The distance the car traveled on a particular day

(D) The number of bicycles the car passed on a particular day

Correct Answer: D

(6) A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.

On average, how long does homework take you each night (in minutes)?

On average, how long do you study for each test (in minutes)?

Do you enjoy the subject material of this class (yes or no)?

Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?

II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?

III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?

(A) I only

(B) III only

(C) I and II

(D) I and III

Correct Answer: C

from PIL import Image
print("SCORE: ")
score = Image.open('../images/extractinginfoscore.png')

display(score)
SCORE: 

As shown in the image above, I earned a 6/6 (100%) on the Extracting Information from Data quiz, which suggests that I have a solid understanding of what we learned this week. I had no trouble with any of the questions, and none of them required much thought before I arrived at the correct answer. Even so, one quiz is not enough to prove that I am strong on this type of question, so I plan to keep practicing similar problems and to revisit this quiz, along with the other College Board tests and quizzes we have taken, as study material for the AP exam.

2.3 Notes

  • Pandas
    • Library in Python that is used to explore, clean, and process data
    • A data table in Pandas is called a DataFrame; readers such as pd.read_json() load a data set into one
    • Before analysis, it is important to check what data needs to be cleaned, such as missing, invalid, or inaccurate values
    • Data cleaning removes or corrects those values so that the data analyzed is accurate and reliable
    • When working with data, users must have the right tools to process and analyze the data
    • Combining data from different sources can be a challenging task
    • Pandas provides several features for extracting information from a data set, including extracting columns, sorting values, and filtering data
    • Extract columns with df[['column_name']]
    • Sort rows with df.sort_values(by=['column_name'])
    • Select or filter rows based on a condition with df[df.column_name > value]
    • Select the rows with the maximum or minimum value of a column by comparing against df.column_name.max() or df.column_name.min()

2.3 Observations

In this lecture, several code cells were used to demonstrate how to use Pandas to extract information from a data set. The next part of this blog shows each code cell along with my observations about what it does. In the first code cell, the following code was used to import the Pandas library:

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

This code cell had to be run before any of the cells that follow it; otherwise, those cells would fail with an error stating that "pd" is not defined. The second code cell is shown below:

df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75

This code cell used the pd.read_json() function to read a JSON file into a DataFrame and the print() function to display it, which makes it possible to check what data needs to be cleaned. The output shows that the data set contains invalid and inconsistent entries: a Student ID of "nil", a Year in School of 20, years written as "Junior" and "9th Grade" instead of numbers, and a repeated row for student 469. The third code cell is shown below:

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75

This code cell demonstrated how to extract columns from a data set using the df[['column_name']] indexing syntax. The expression df[['GPA']] extracted the 'GPA' column, and df[['Student ID','GPA']].to_string(index=False) extracted the 'Student ID' and 'GPA' columns and left the index out of the printed output. The fourth code segment is shown below:

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75

This code cell demonstrated how to sort values in a data set using the df.sort_values(by=['column_name']) method. The call df.sort_values(by=['GPA']) sorted the data set by 'GPA' in ascending order, which is the default, while df.sort_values(by=['GPA'], ascending=False) sorted it by 'GPA' in descending order. The fifth code segment is shown below:

print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

This code cell demonstrated how to filter a data set by putting a condition inside the brackets, as in df[df.column_name > value]. The expression df[df.GPA > 3.00] kept only the rows with a 'GPA' value greater than 3.00. The sixth and final code segment is shown below:

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75

This code cell is fairly self-explanatory. df.GPA.max() finds the highest value in the GPA column, and the filter df[df.GPA == df.GPA.max()] returns the row (or rows) with that GPA. The second print statement does the same thing with df.GPA.min() to return the row with the lowest GPA.
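
The max and min filters in the last cell can be looked at a little more closely. The sketch below uses a small made-up table (not the grade.json data) with a tie for the highest GPA, to show that the equality filter keeps every tied row, while idxmax() points at only the first one.

```python
import pandas as pd

# Small hypothetical table with a tie for the highest GPA.
df = pd.DataFrame({
    "Student ID": [101, 102, 103, 104],
    "GPA": [3.2, 3.9, 2.5, 3.9],
})

# The comparison builds a True/False mask, one value per row.
mask = df.GPA == df.GPA.max()
top_rows = df[mask]  # keeps every row tied for the maximum
print(top_rows)

# idxmax() returns the index label of the first maximum only.
first_top_id = df.loc[df.GPA.idxmax(), "Student ID"]
print(first_top_id)
```

Which one to use depends on the question being asked: the filter answers "which students have the highest GPA?", while idxmax() answers "where is the first row with the highest GPA?".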

2.3 Reflection

Overall, Pandas is a powerful tool for analyzing data in Python, as it provides features for extracting columns, sorting values, and filtering data. Before analyzing a data set, however, it is important to check what data needs to be cleaned, since the results are only as accurate as the data they come from.
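
The lecture cells only displayed the grade data, so here is a rough sketch of what cleaning it could look like. The rows are copied from the printed grade.json output above, but the value types in the real file, the choice to map "Junior" to 11, and the rule that valid grade levels are 9 through 12 are all my assumptions.

```python
import pandas as pd

# Rows copied from the printed grade.json output; the value types
# (text vs. numbers) in the real file are an assumption.
df = pd.DataFrame({
    "Student ID": [123, 246, 578, 469, 324, 313, 145, 167, 235, "nil", 469, 456],
    "Year in School": [12, 10, 12, 11, "Junior", 20, 12, 10, "9th Grade", 9, 11, 10],
    "GPA": [3.57, 4.00, 2.78, 3.45, 4.75, 3.33, 2.95, 3.90, 3.15, 2.80, 3.45, 2.75],
})

# Turn IDs into numbers; anything unparseable (like "nil") becomes NaN.
df["Student ID"] = pd.to_numeric(df["Student ID"], errors="coerce")

# Normalize year labels. Mapping "Junior" to 11 is an assumption.
year_text = df["Year in School"].astype(str).replace({"Junior": "11"})
df["Year in School"] = pd.to_numeric(
    year_text.str.extract(r"(\d+)")[0], errors="coerce"
)

# Keep rows with a valid ID and a grade level from 9 to 12, then drop repeats.
clean = df.dropna(subset=["Student ID"])
clean = clean[clean["Year in School"].between(9, 12)]
clean = clean.drop_duplicates().astype({"Student ID": int})
print(clean.to_string(index=False))
```

This leaves nine of the original twelve rows: the "nil" ID, the grade level of 20, and the repeated row for student 469 are removed. I left the 4.75 GPA alone, because whether it is valid depends on the GPA scale, which the data set does not state.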

Pandas Application (Two Data Sets)

Ace Attorney Data Set

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd
df = pd.read_json('files/aceattorney.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
                 name    sales
0    Original Trilogy  1000000
1      Ace Attorney 1   796000
2      Apollo Justice   660000
3      Ace Attorney 2   600000
4      Ace Attorney 3   580000
5          Chronicles   500000
6      Dual Destinies   448000
7       Investigation   400000
8   Spirit of Justice   343000
9    Investigations 2   275000
10               DGS1   241000
11               DGS2   149000
print(df[['sales']])

print()

#try two columns and remove the index from print statement
print(df[['name','sales']].to_string(index=False))

# index=False leaves the index values out of the printed output
      sales
0   1000000
1    796000
2    660000
3    600000
4    580000
5    500000
6    448000
7    400000
8    343000
9    275000
10   241000
11   149000

             name   sales
 Original Trilogy 1000000
   Ace Attorney 1  796000
   Apollo Justice  660000
   Ace Attorney 2  600000
   Ace Attorney 3  580000
       Chronicles  500000
   Dual Destinies  448000
    Investigation  400000
Spirit of Justice  343000
 Investigations 2  275000
             DGS1  241000
             DGS2  149000
print(df[df.sales == df.sales.max()])
print()
print(df[df.sales == df.sales.min()])
               name    sales
0  Original Trilogy  1000000

    name   sales
11  DGS2  149000

Planets of the Solar System Data Set

df = pd.read_json('files/planets.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
   Planets  distance_from_the_sun  distance_from_the_earth
0  Mercury               36295000             1.272500e+08
1    Venus               67004000             1.219300e+08
2    Earth               92351000                      NaN
3     Mars              152890000             1.174200e+08
4  Jupiter              460300000             5.438000e+08
5   Saturn              911470000             9.972700e+08
6   Uranus             1826800000             1.878700e+09
7  Neptune             2779600000             2.871800e+09
8    Pluto             3700000000             3.284200e+09
print(df[['Planets']])

print()

#try two columns and remove the index from print statement
print(df[['Planets','distance_from_the_earth']].to_string(index=False))

# index=False leaves the index values out of the printed output
   Planets
0  Mercury
1    Venus
2    Earth
3     Mars
4  Jupiter
5   Saturn
6   Uranus
7  Neptune
8    Pluto

Planets  distance_from_the_earth
Mercury              127250000.0
  Venus              121930000.0
  Earth                      NaN
   Mars              117420000.0
Jupiter              543800000.0
 Saturn              997270000.0
 Uranus             1878700000.0
Neptune             2871800000.0
  Pluto             3284200000.0
print(df[df.distance_from_the_earth == df.distance_from_the_earth.max()])
print(df[df.distance_from_the_earth == df.distance_from_the_earth.min()])
  Planets  distance_from_the_sun  distance_from_the_earth
8   Pluto             3700000000             3.284200e+09
  Planets  distance_from_the_sun  distance_from_the_earth
3    Mars              152890000              117420000.0
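
One thing that stands out in the planets output is that Earth's distance_from_the_earth is NaN (missing). The sketch below uses three rows copied from that output to show how Pandas treats the missing value in the max/min filter; it is a small illustration, not part of the original lecture code.

```python
import pandas as pd

# Three rows copied from the planets output above.
df = pd.DataFrame({
    "Planets": ["Venus", "Earth", "Mars"],
    "distance_from_the_earth": [121930000.0, float("nan"), 117420000.0],
})

col = df.distance_from_the_earth
smallest = col.min()             # NaN is skipped by default (skipna=True)
missing_count = col.isna().sum() # how many values are missing
closest = df[col == smallest]    # a NaN row never matches an == comparison

print(smallest)
print(missing_count)
print(closest)
```

Because min() and max() skip NaN and a comparison with NaN is always False, the Earth row never shows up in either filter. That is convenient here, but it also means missing data can go unnoticed unless it is checked for with something like isna().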