Data Science 101 — CodingLab
Session 1
1st: import pandas as pd
2nd: Put the data into a list/dictionary
3rd:
df = pd.DataFrame(data, columns=['….', '….', '….'])
df
*note: Dictionary key (column name) - value (data) pairs
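For illustration, a minimal sketch with made-up data (the name/age/salary column names here are hypothetical):
import pandas as pd

# dictionary: each key is a column name, each value is that column's data
data = {'name': ['Amy', 'Ben', 'Cara'],
        'age': [21, 25, 23],
        'salary': [3000, 3500, 3200]}

df = pd.DataFrame(data, columns=['name', 'age', 'salary'])
df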
Uploading files:
from google.colab import files
uploaded = files.upload()
Reading files:
import pandas as pd
df = pd.read_csv('unit05-data.csv')
df
3. Data Processing
Extracting Row from DataFrame:
df = pd.DataFrame(data, columns=['name', 'age', 'salary'])
df.loc[0] → 0 refers to the row's index label
Extracting Column from DataFrame:
df = pd.DataFrame(data, columns=['name', 'age', 'salary'])
df['name'] → means the 'name' column
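Putting the two together, a runnable sketch reusing the hypothetical name/age/salary data from above:
import pandas as pd

data = {'name': ['Amy', 'Ben', 'Cara'], 'age': [21, 25, 23], 'salary': [3000, 3500, 3200]}
df = pd.DataFrame(data, columns=['name', 'age', 'salary'])

df.loc[0]    # the row labelled 0, returned as a Series
df['name']   # the 'name' column, returned as a Series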
Grouping & Aggregation:
class_score = df.groupby('Class').agg({'Score': ['mean', 'min', 'max']})
class_score
Grouping by Multiple Columns:
class_score = df.groupby(['Class', 'Age']).agg({'Score': ['mean', 'min', 'max']})
class_score
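A runnable sketch of both groupings, assuming a small made-up table with Class, Age, and Score columns:
import pandas as pd

df = pd.DataFrame({'Class': ['A', 'A', 'B', 'B'],
                   'Age':   [15, 16, 15, 16],
                   'Score': [80, 90, 70, 60]})

# one grouping column: mean/min/max Score per Class
df.groupby('Class').agg({'Score': ['mean', 'min', 'max']})

# two grouping columns: mean/min/max Score per (Class, Age) pair
df.groupby(['Class', 'Age']).agg({'Score': ['mean', 'min', 'max']})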
Filtering: Select Rows Based on Value of Column:
year_2002 = df[(df['year'] == 2002)]
year_2002
Filtering does not change the original DataFrame; to see it again, just type: print(df)
Select Rows Whose Column Value Does NOT Equal a Specific Value:
year_not_2002 = df[(df['year'] != 2002)]
year_not_2002
Rows whose year is not 2002 will be displayed.
There are two ways to refer to a column: df['year'] and df.year.
df['…'] is the safer method, and the only option when the header is long or contains spaces. df.year is faster to type, but it only works when the column name is a valid Python identifier.
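A runnable sketch of the filters above, using a made-up year/medals table:
import pandas as pd

df = pd.DataFrame({'year': [1998, 2002, 2008], 'medals': [5, 12, 9]})

df[df['year'] == 2002]   # only the 2002 row
df[df['year'] != 2002]   # every other row
df.year                  # dot notation: same column as df['year']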
Select Not Null:
df_not_null = df[df.year.notnull()]
df_not_null
Select rows based on a list:
years = [2002,2008]
selection = df[df.year.isin(years)]
selection
Select Rows Based on Values Not in List:
The opposite uses the negation operator ~
years = [2002,2008]
selection = df[~df.year.isin(years)]
selection
Select Rows Using Multiple Conditions:
To combine more than one condition, use & instead of Python's 'and' (and | instead of 'or'), and wrap each condition in parentheses, as in the sketch below.
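A minimal sketch combining two conditions with & (made-up data):
import pandas as pd

df = pd.DataFrame({'year': [1998, 2002, 2008], 'medals': [5, 12, 9]})

# each condition is wrapped in parentheses; & means AND, | means OR
selection = df[(df['year'] >= 2000) & (df['medals'] > 10)]
selection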
Note for using print():
Concatenate DataFrames by Rows:
Concatenate by Column:
axis=1
'NaN' means empty/null. Because df2 only has 3 rows of data, the remaining rows were displayed as 'NaN'.
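A sketch of both directions with made-up frames; df2 is deliberately shorter so the NaN padding shows:
import pandas as pd

df1 = pd.DataFrame({'name': ['Amy', 'Ben', 'Cara', 'Dan', 'Eve']})
df2 = pd.DataFrame({'score': [80, 90, 70]})

pd.concat([df1, df1])          # stack by rows (the default, axis=0)
pd.concat([df1, df2], axis=1)  # side by side; rows 3 and 4 of 'score' become NaN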
Merge DataFrames:
Add a new column 'id' to serve as the merge key.
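A minimal merge sketch, assuming two hypothetical frames that share an 'id' column:
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Amy', 'Ben', 'Cara']})
right = pd.DataFrame({'id': [1, 2, 3], 'score': [80, 90, 70]})

pd.merge(left, right, on='id')  # join the two DataFrames on the shared 'id' key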
Pivoting in Pandas:
aggfunc → aggregation function; by default it calculates the mean/average
index → which column becomes the rows of the pivot table
values → which column's data you want to display in the table
fill_value → displays the given value (e.g. '0') instead of 'NaN'
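A sketch showing the parameters on a made-up Class/Subject/Score table:
import pandas as pd

df = pd.DataFrame({'Class': ['A', 'A', 'B'],
                   'Subject': ['Math', 'Science', 'Math'],
                   'Score': [80, 90, 70]})

pd.pivot_table(df,
               index='Class',      # rows of the pivot table
               columns='Subject',  # columns of the pivot table
               values='Score',     # data to aggregate
               aggfunc='mean',     # the default aggregation
               fill_value=0)       # class B has no Science score, so show 0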
Set Missing Values To 0: use fill_value=0 (above) or .fillna(0) (see Data Cleaning below).
Output DataFrame to csv:
.to_csv
r → the r prefix makes a raw string, so special characters in the file path (e.g. backslashes) are read literally
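A minimal sketch (the file name is made up; the raw-string path in the comment is Windows-style):
import pandas as pd

df = pd.DataFrame({'year': [2002, 2008], 'medals': [12, 9]})

# on Windows a raw string keeps backslashes literal, e.g. r'C:\data\output.csv'
df.to_csv('output.csv', index=False)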
4. Data Cleaning
Missing Data: Either remove data or use imputation
i) remove data
By default, dropna() removes only rows that contain missing values.
dropna(axis=1) removes columns instead.
**To read Excel files using pandas: pd.read_excel('file name.xls')
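A sketch of both dropna variants on a made-up table with one missing age:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['Amy', 'Ben'], 'age': [21, np.nan]})

df.dropna()        # drops the row containing NaN (the default)
df.dropna(axis=1)  # drops the 'age' column instead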
ii) Data Imputation
Removing data may be convenient, but it may also delete important information in other columns of the dropped rows.
Imputation means replacing missing values with substitute values.
.fillna(0)
df['Age'].mean() == df.Age.mean()
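A minimal imputation sketch on a made-up Age column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [21, np.nan, 25]})

df['Age'].fillna(0)                 # replace missing ages with 0
df['Age'].fillna(df['Age'].mean())  # or impute with the column mean (23.0 here)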
Plotting Histogram:
Analysis can be quantitative (t-test, z-test, chi-square, ANOVA test) or graphical (the plots below).
Plotting Boxplot:
px.box(df, x=…, y=…)
Plotting Bar Graphs:
A bar graph has spaces between the bars, while histogram bars stick together.
Usually used for categorical data.
px.bar(df, x=…, y=…)
Plotting Pie-Chart:
px.pie(df, values=…, names=…)
values → the data values for each slice
names → the category labels for the slices
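A sketch of all four plots, assuming plotly is installed; px.data.tips() is a sample dataset bundled with plotly:
import plotly.express as px

df = px.data.tips()  # sample dataset with 'total_bill' and 'day' columns

px.histogram(df, x='total_bill').show()              # bars stick together
px.box(df, x='day', y='total_bill').show()           # one box per category
px.bar(df, x='day', y='total_bill').show()           # bars have spaces
px.pie(df, values='total_bill', names='day').show()  # one slice per day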
Session 2
Descriptive statistics summarises data collected from a population through numerical calculations, graphs, or tables.
Inferential statistics uses sample data taken from a population to make inferences/predictions.
A confidence interval tells you the chance that a value falls within a certain range, e.g. a 95% CI means there is a 5% chance a value will not fall within the range.
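A minimal sketch of computing a 95% CI for a mean, assuming scipy is available (the data values are made up):
import numpy as np
import scipy.stats as st

data = [1.5, 1.6, 1.7, 1.6, 1.8, 1.4, 1.65, 1.55]

# 95% CI for the mean, based on the normal distribution
st.norm.interval(0.95, loc=np.mean(data), scale=st.sem(data))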
Hypothesis Testing (z-test)
The higher the confidence level, the higher the z-score. (E.g. you can be very confident that the mean height of people falls within the range 1.4 m to 1.8 m, which is highly probable, but a higher confidence level also means a higher margin of error and thus a higher z-score.)
one-tailed test, two-tailed test
level of significance (rejection level for a two-tailed test) == alpha level
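A tiny sketch illustrating how the critical z-score grows with the confidence level (two-tailed):
import scipy.stats as st

for conf in [0.90, 0.95, 0.99]:
    z = st.norm.ppf(1 - (1 - conf) / 2)  # two-tailed critical z-score
    print(conf, round(z, 3))             # 1.645, 1.96, 2.576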
A z-test compares the mean of one sample against a hypothesised value.
A t-test compares the means of two samples.
An ANOVA test (Analysis of Variance) compares the means of more than two samples.
A chi-square test compares categorical variables.
Statistically significant means you are able to get the same result when you repeat the test on another sample (it must produce the same result when tested over and over again).
The larger the sample size (i.e. 30 samples and above), the more likely you will get a normal distribution.
Outliers: normally we keep them in the data unless we are really sure the outlier is due to an error (e.g. recorded wrongly).
To determine whether data is normal:
- Central limit theorem (sample size > 30)
- Chi-square test for normality
- Plot a graph and see whether its shape follows a normal distribution
The IQR (interquartile range) lets you identify outliers, commonly values more than 1.5 × IQR beyond the quartiles; see the sketch below.
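A sketch of the 1.5 × IQR rule on made-up numbers:
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 40])  # 40 looks like an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s[(s < lower) | (s > upper)]  # flags the value 40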
Session 3
Scipy — https://docs.scipy.org/doc/
Statsmodel z-test — https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html?highlight=ztest
When do we use z-test?
The population standard deviation is known.
Data should be normally distributed (i.e. sample size > 30 → central limit theorem).
statsmodels gives you the p-value and z-statistic for the entire sample.
scipy.stats gives you a z-score for each individual value in the sample.
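A sketch of both, with made-up data and H0: mean == 3:
import numpy as np
from statsmodels.stats.weightstats import ztest
from scipy import stats

sample = np.array([2.8, 3.1, 3.0, 2.9, 3.2, 3.1, 2.7, 3.0])

zstat, pval = ztest(sample, value=3)  # one z-statistic + p-value for the whole sample
print(zstat, pval)

print(stats.zscore(sample))  # a z-score for each individual value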
When do we use t-test?
When the population variance/SD is unknown.
When determining the difference between the means of 2 groups of samples.
Data should be normally distributed.
For small sample sizes (< 30).
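A minimal two-sample t-test sketch with made-up groups:
from scipy import stats

group_a = [85, 90, 78, 92, 88]
group_b = [80, 75, 82, 79, 77]

tstat, pval = stats.ttest_ind(group_a, group_b)  # independent two-sample t-test
print(tstat, pval)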
Use the ANOVA test when comparing data from more than two groups.
ANOVA test == F-test
F-value = 7.1210194… (it plays the same role as the t-score or z-score)
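A one-way ANOVA sketch with three made-up groups:
from scipy import stats

g1 = [85, 90, 88]
g2 = [75, 70, 72]
g3 = [60, 65, 62]

fval, pval = stats.f_oneway(g1, g2, g3)  # compares the means of all three groups
print(fval, pval)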
The chi-square test is for data with categories.
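A chi-square sketch on a made-up 2x2 contingency table of counts:
from scipy import stats

table = [[10, 20], [30, 40]]  # rows and columns are categories, cells are counts

chi2, pval, dof, expected = stats.chi2_contingency(table)
print(chi2, pval)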
The p-value is the probability of obtaining test results at least as extreme as the observed result (i.e. in the tails of the bell curve). A lower p-value means rejecting the null hypothesis (i.e. < 0.05).
rvalue → the correlation coefficient of your values
matplotlib → draw graphs