2.3 Extracting Information from Data
from PIL import Image
proof1 = Image.open('../images/datafile.png')
proof2 = Image.open('../images/gradefile.png')
print("DATA.CSV FILE")
display(proof1)
print("GRADE.JSON FILE")
display(proof2)
2.3 College Board Practice Problems
(1) A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.
The researcher has access to a database with the following information about each student.
Last name
First name
Grade level (9, 10, 11, or 12)
Grade point average (on a 0.0 to 4.0 scale)
The researcher also has access to another database with the following information about each student.
First name
Last name
Number of absences from school
Number of late arrivals to school
Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?
(A) Students who have the same name may be confused with each other.
(B) Students who have the same grade point average may be confused with each other.
(C) Students who have the same grade level may be confused with each other.
(D) Students who have the same number of absences may be confused with each other.
Correct Answer: A
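To see why the missing unique ID matters, here is a minimal sketch of combining the two databases by name in Pandas. The student names and values are hypothetical and made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical records: two different students share the name "Alex Kim".
grades = pd.DataFrame({
    "first_name": ["Alex", "Alex", "Sam"],
    "last_name": ["Kim", "Kim", "Lee"],
    "gpa": [3.9, 2.1, 3.0],
})
absences = pd.DataFrame({
    "first_name": ["Alex", "Alex", "Sam"],
    "last_name": ["Kim", "Kim", "Lee"],
    "absences": [1, 12, 4],
})

# Joining on names pairs every "Alex Kim" in one table with every
# "Alex Kim" in the other, so the result has more rows than students.
merged = grades.merge(absences, on=["first_name", "last_name"])
print(merged)
print(len(merged))  # 5 rows for only 3 students
```

With a unique ID column in both tables, the merge could use that column instead and each student would match exactly one record.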
(2) A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?
(A) A computer program cannot combine data from different files.
(B) Different counties may organize data in different ways.
(C) The number of counties is too large for the program to process.
(D) The total number of rows of data is too large for the program to process.
Correct Answer: B
(3) A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?
Select two answers.
(A) Users might attempt to use the Web site to search for multiple cities.
(B) Users might enter abbreviations for the names of cities.
(C) Users might misspell the name of the city.
(D) Users might be slow at typing a city name in the text field.
Correct Answers: B and C
(4) A database of information about shows at a concert venue contains the following information.
Name of artist performing at the show
Date of show
Total dollar amount of all tickets sold
Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?
(A) Average ticket price
(B) Length of the show in minutes
(C) Start time of the show
(D) Total dollar amount of food and drinks sold during the show
Correct Answer: A
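The reasoning behind answer A is that attendance can be estimated by dividing the total ticket sales by the average ticket price. A quick sketch with made-up numbers (both values are assumptions for illustration):

```python
# Hypothetical figures for one show
total_ticket_sales = 45000.00    # total dollar amount of all tickets sold
average_ticket_price = 75.00     # the additional piece of information

# Estimated number of tickets sold, which approximates attendance
estimated_attendance = total_ticket_sales / average_ticket_price
print(estimated_attendance)  # 600.0
```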
(5) A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?
(A) The average number of hours per day that the car is in use
(B) The car’s average speed on a particular day
(C) The distance the car traveled on a particular day
(D) The number of bicycles the car passed on a particular day
Correct Answer: D
(6) A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.
On average, how long does homework take you each night (in minutes)?
On average, how long do you study for each test (in minutes)?
Do you enjoy the subject material of this class (yes or no)?
Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?
I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?
II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?
III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?
(A) I only
(B) III only
(C) I and II
(D) I and III
Correct Answer: C
from PIL import Image
print("SCORE: ")
score = Image.open('../images/extractinginfoscore.png')
display(score)
As shown in the image above, I earned a 6/6 (100%) on the Extracting Information from Data quiz, which suggests that I have a good understanding of what we learned this week. I had no trouble answering any of the questions, and none of them required much thought before I arrived at the correct answer. While I am happy with this score, I should still refer back to this quiz, along with the other College Board tests and quizzes we have taken, since they will all be useful study tools for the AP exam. I should also keep practicing questions like these, because a single quiz is not enough to show that I am consistently strong on this topic.
2.3 Notes
- Pandas
- Library in Python that is used to explore, clean, and process data
- Data table in Pandas is called a DataFrame
- Important to check what data needs to be cleaned, such as missing, invalid, or inaccurate data.
- Pandas provides several features for extracting information from a data set, including extracting columns, sorting values, and filtering data
- When working with data, users must have the right tools to process and analyze the data
- Combining data from different sources can be a challenging task
- Data cleaning is an important process that helps to remove missing, invalid, and inaccurate data
- A DataFrame is the Pandas structure that holds a data set as a table of rows and columns
- When analyzing data, we can extract columns by indexing a DataFrame with a list of column names, such as df[['GPA']]
- We can sort data using the DataFrame sort_values method
- Indexing a DataFrame with a condition, such as df[df.GPA > 3.00], selects only the rows that meet that condition
- Comparing a column against its max() or min() selects the rows with the highest or lowest value in that column
- Good data cleaning practices are important to ensure that the data analyzed is accurate and reliable.
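As a rough illustration of the cleaning step in these notes, the sketch below removes missing and invalid entries from a small table. The data and the "Year in School" column name are assumptions made up for this example; they are not the contents of the lecture's grade.json file.

```python
import pandas as pd

# Hypothetical raw records: one GPA is the text "nil" and one
# grade level (20) is outside the valid 9 to 12 range.
raw = pd.DataFrame({
    "Student ID": [123, 246, 578, 469],
    "Year in School": [12, 10, 20, 11],
    "GPA": [3.57, "nil", 2.78, 3.45],
})

# Convert GPA to numbers; anything that is not numeric becomes NaN.
raw["GPA"] = pd.to_numeric(raw["GPA"], errors="coerce")

# Keep only rows with a GPA present and a grade level from 9 to 12.
clean = raw[raw["GPA"].notna() & raw["Year in School"].between(9, 12)]
print(clean)
```

Dropping bad rows is only one option. Depending on the data set, it may be better to correct the values at the source or to flag them for review.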
2.3 Observations
In this lecture, several code blocks were used to demonstrate how to use Pandas to extract information from a data set. The rest of this section describes what each code block does and what I observed when running it. In the first code cell, the following code was used to import the Pandas library:
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd
This code cell had to be run before any of the cells following it, as skipping it would result in a NameError stating that "pd" is not defined. The second code cell is shown below:
df = pd.read_json('files/grade.json')
print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
This code cell demonstrated how to check what data needs to be cleaned. The pd.read_json() function read a JSON file into a DataFrame, and the print() function displayed the DataFrame. The output showed that the data set contains invalid entries that need to be cleaned, namely the "nil" value and the student listed in the 20th grade. The third code cell is shown below:
print(df[['GPA']])
print()
#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
This code cell demonstrated how to extract columns from a data set by indexing the DataFrame with a list of column names, as in df[['column_name']]. The first print statement displayed only the 'GPA' column. The second displayed the 'Student ID' and 'GPA' columns, and calling to_string(index=False) left the index out of the printed output. The fourth code segment is shown below:
print(df.sort_values(by=['GPA']))
print()
#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
This code cell demonstrated how to sort a data set using the df.sort_values(by=['column_name']) method. The first print statement sorted the data set by 'GPA' in ascending order, which is the default, while the second passed ascending=False to sort by 'GPA' in descending order. The fifth code segment is shown below:
print(df[df.GPA > 3.00])
This code cell demonstrated how to filter a data set with a condition, as in df[df.column_name > value]. The expression df.GPA > 3.00 produces a True or False value for each row, and indexing the DataFrame with it keeps only the rows where 'GPA' is greater than 3.00. The sixth and final code segment is shown below:
print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
This code cell is fairly self-explanatory. The first print statement displays every row whose 'GPA' equals the highest (max) GPA in the column, and the second displays every row whose 'GPA' equals the lowest (min) GPA. If several students are tied for the highest or lowest GPA, all of their rows are shown.
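To check how the max comparison behaves when there is a tie, here is a small sketch on a made-up DataFrame. The name df_demo and its values are hypothetical and are not from the lecture's data. It also shows idxmax, a Pandas method that returns only the first row with the highest value.

```python
import pandas as pd

# Hypothetical data with two students tied for the highest GPA
df_demo = pd.DataFrame({
    "Student ID": [101, 102, 103, 104],
    "GPA": [3.2, 3.9, 2.5, 3.9],
})

# Comparing against max() keeps every row tied for the highest GPA.
top_rows = df_demo[df_demo["GPA"] == df_demo["GPA"].max()]
print(top_rows)

# idxmax() gives the index label of the first maximum only,
# so loc returns a single row even when there is a tie.
first_top = df_demo.loc[df_demo["GPA"].idxmax()]
print(first_top)
```

Which approach fits better depends on whether tied rows should all be reported or only one of them.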
2.3 Reflection
Overall, Pandas is a powerful tool for analyzing data in Python, as it provides several features for extracting information from a data set, including extracting columns, sorting values, and filtering data. Before analyzing a data set, however, it is important to check what data needs to be cleaned, because missing or invalid values will carry through to the results.
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd
df = pd.read_json('files/aceattorney.json')
print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
print(df[['sales']])
print()
#try two columns and remove the index from print statement
print(df[['name','sales']].to_string(index=False))
# index=False leaves the index values out of the printed output
print(df[df.sales == df.sales.max()])
print()
print(df[df.sales == df.sales.min()])
df = pd.read_json('files/planets.json')
print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
print(df[['Planets']])
print()
#try two columns and remove the index from print statement
print(df[['Planets','distance_from_the_earth']].to_string(index=False))
# index=False leaves the index values out of the printed output
print(df[df.distance_from_the_earth == df.distance_from_the_earth.max()])
print(df[df.distance_from_the_earth == df.distance_from_the_earth.min()])