Introduction

Pandas is an open-source data manipulation and analysis library. It provides data structures for efficiently storing and manipulating large datasets, along with tools for working with structured data.

import pandas as pd

Here is a simple example of creating a DataFrame using pandas:

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

print(df)
#      Name  Age           City
# 0   Alice   25       New York
# 1     Bob   30  San Francisco
# 2  Charlie   35    Los Angeles

print(type(df)) # <class 'pandas.core.frame.DataFrame'>
print(list(df.columns)) # ['Name', 'Age', 'City']

It is quite easy to create DataFrames with pandas, for example from a two-dimensional list:

import pandas as pd
two_dimensional_list = [["a", 50, 86], ["b", 89, 31], ["c", 68, 91], ["d", 88, 75]]
my_df = pd.DataFrame(two_dimensional_list)
print(my_df)

output

   0   1   2
0  a  50  86
1  b  89  31
2  c  68  91
3  d  88  75

If you do not define the row and column names, they are generated automatically as 0, 1, 2, 3, …

You can define the names like this:

import pandas as pd
two_dimensional_list = [["a", 50, 86], ["b", 89, 31], ["c", 68, 91], ["d", 88, 75]]
my_df = pd.DataFrame(two_dimensional_list, columns=["name", "english_score", "math_score"], index=["a", "b", "c", "d"])
print(my_df)

output

  name  english_score  math_score
a    a             50          86
b    b             89          31
c    c             68          91
d    d             88          75

To check the data type of each column, use dtypes:

print(my_df.dtypes)
# name             object
# english_score     int64
# math_score        int64
# dtype: object

Data Frame

A DataFrame can contain a variety of data types, but all values within a single column should share the same data type.
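
As a small illustration of the one-dtype-per-column idea, the sketch below builds a hypothetical frame in which one column mixes numbers and text. The column names and values are made up for this example.

```python
import pandas as pd

# Hypothetical data: "mixed" holds both numbers and text
df = pd.DataFrame({
    "clean": [1, 2, 3],
    "mixed": [1, "two", 3],
})

print(df.dtypes)
# "clean" is stored as int64, while "mixed" falls back to the generic object dtype

# A column with a single numeric dtype supports arithmetic on all rows at once
doubled = df["clean"] * 2
print(doubled.tolist())  # [2, 4, 6]
```

Keeping a column to one dtype is what makes whole-column operations like the multiplication above possible.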

You can also create a DataFrame from a dictionary of lists, NumPy arrays, or pandas Series:

import numpy as np
import pandas as pd

names = ['a', 'b', 'c', 'd']
english_scores = [50, 89, 68, 88]
math_scores = [86, 31, 91, 75]

dict1 = {
    'name': names,
    'english_score': english_scores,
    'math_score': math_scores
}

dict2 = {
    'name': np.array(names),
    'english_score': np.array(english_scores),
    'math_score': np.array(math_scores)
}

dict3 = {
    'name': pd.Series(names),
    'english_score': pd.Series(english_scores),
    'math_score': pd.Series(math_scores)
}

# same outputs
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df3 = pd.DataFrame(dict3)

print(df1)

output

  name  english_score  math_score
0    a             50          86
1    b             89          31
2    c             68          91
3    d             88          75

or from a list of dictionaries:

import numpy as np
import pandas as pd

my_list = [
    {'name': 'dongwook', 'english_score': 50, 'math_score': 86},
    {'name': 'sineui', 'english_score': 89, 'math_score': 31},
    {'name': 'ikjoong', 'english_score': 68, 'math_score': 91},
    {'name': 'yoonsoo', 'english_score': 88, 'math_score': 75}
]

# columns= sets the column order explicitly; without it, recent pandas versions keep
# the key order of the dictionaries, while older versions could sort the columns alphabetically
df = pd.DataFrame(my_list, columns=["english_score", "math_score", "name"])
print(df)

output

   english_score  math_score      name
0             50          86  dongwook
1             89          31    sineui
2             68          91   ikjoong
3             88          75   yoonsoo

Common dtypes used in pandas:

dtype        description
int64        integer
float64      floating-point number
object       string (or mixed Python objects)
bool         boolean
datetime64   date and time
category     categorical value
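
To make the list above concrete, here is a minimal sketch that builds one column per dtype family. The column names and values are hypothetical; the datetime64 and category columns are produced by explicit conversion with pd.to_datetime and astype.

```python
import pandas as pd

# Hypothetical sample data, one column per dtype family
df = pd.DataFrame({
    "count": [1, 2, 3],                # int64
    "price": [0.5, 1.5, 2.5],          # float64
    "in_stock": [True, False, True],   # bool
    "added": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "grade": ["a", "b", "a"],
})

# Text columns can be converted to more specific dtypes
df["added"] = pd.to_datetime(df["added"])      # datetime64
df["grade"] = df["grade"].astype("category")   # category

print(df.dtypes)
```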

Read CSV File

Using pandas, you can quite easily read CSV files. Suppose data/csvfile.csv contains the following:

,released,display,memory,version,Face ID
iPhone 7,2016-09-16,4.7,2GB,iOS 10.0,No
iPhone 7 Plus,2016-09-16,5.5,3GB,iOS 10.0,No
iPhone 8,2017-09-22,4.7,2GB,iOS 11.0,No
iPhone 8 Plus,2017-09-22,5.5,3GB,iOS 11.0,No
iPhone X,2017-11-03,5.8,3GB,iOS 11.1,Yes
iPhone XS,2018-09-21,5.8,4GB,iOS 12.0,Yes
iPhone XS Max,2018-09-21,6.5,4GB,iOS 12.0,Yes

import pandas as pd
iphone_df = pd.read_csv("data/csvfile.csv")
print(iphone_df)

output

      Unnamed: 0    released  display memory   version Face ID
0       iPhone 7  2016-09-16      4.7    2GB  iOS 10.0      No
1  iPhone 7 Plus  2016-09-16      5.5    3GB  iOS 10.0      No
2       iPhone 8  2017-09-22      4.7    2GB  iOS 11.0      No
3  iPhone 8 Plus  2017-09-22      5.5    3GB  iOS 11.0      No
4       iPhone X  2017-11-03      5.8    3GB  iOS 11.1     Yes
5      iPhone XS  2018-09-21      5.8    4GB  iOS 12.0     Yes
6  iPhone XS Max  2018-09-21      6.5    4GB  iOS 12.0     Yes

When you use the read_csv() function, the first row is treated as the header. If the CSV file has no header, pass header=None:

iphone_df = pd.read_csv("data/csvfile.csv", header=None)
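
A minimal sketch of the header=None case follows. It reads hypothetical CSV text from an in-memory io.StringIO buffer instead of a file so that it runs on its own; the names= parameter of read_csv is shown as one way to label the columns at read time.

```python
import io
import pandas as pd

# Hypothetical CSV text with no header row
csv_text = "a,50,86\nb,89,31\nc,68,91\n"

# header=None: the first line is data, and the columns are numbered 0, 1, 2
no_header_df = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header_df.columns.tolist())  # [0, 1, 2]

# names= supplies column labels while reading
named_df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["name", "english_score", "math_score"],
)
print(named_df.shape)  # (3, 3)
```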

You may also notice the column named Unnamed: 0. In the CSV file, the first field of the header row is empty, which is why pandas labels that column Unnamed: 0. To use the first column as the index instead, pass index_col=0:

iphone_df = pd.read_csv("data/csvfile.csv", index_col=0)

Then you get this DataFrame:

                 released  display memory   version Face ID
iPhone 7       2016-09-16      4.7    2GB  iOS 10.0      No
iPhone 7 Plus  2016-09-16      5.5    3GB  iOS 10.0      No
iPhone 8       2017-09-22      4.7    2GB  iOS 11.0      No
iPhone 8 Plus  2017-09-22      5.5    3GB  iOS 11.0      No
iPhone X       2017-11-03      5.8    3GB  iOS 11.1     Yes
iPhone XS      2018-09-21      5.8    4GB  iOS 12.0     Yes
iPhone XS Max  2018-09-21      6.5    4GB  iOS 12.0     Yes

Indexing

You can access the data in a DataFrame in several ways.

Getting a single value

iphone_df.loc["iPhone 7", "released"]
# 2016-09-16

Getting an entire row

iphone_df.loc["iPhone 7"]
# or
iphone_df.loc["iPhone 7", :]

# released        2016-09-16
# display             4.7
# memory               2GB
# version        iOS 10.0
# Face ID            No
# Name: iPhone 7, dtype: object

Getting an entire column

iphone_df["display"]
# or
iphone_df.loc[:, "display"]

# iPhone 7         4.7
# iPhone 7 Plus    5.5
# iPhone 8         4.7
# iPhone 8 Plus    5.5
# iPhone X         5.8
# iPhone XS        5.8
# iPhone XS Max    6.5
# Name: display, dtype: float64

Getting multiple rows

iphone_df.loc[["iPhone 7", "iPhone 7 Plus"]]

output

                 released  display memory   version Face ID
iPhone 7       2016-09-16      4.7    2GB  iOS 10.0      No
iPhone 7 Plus  2016-09-16      5.5    3GB  iOS 10.0      No

Getting multiple columns works the same way, and the shortcut form does not need loc: pass a list of column names inside the square brackets.
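
A short sketch of selecting several columns at once. To stay self-contained it rebuilds a three-row, three-column excerpt of the iPhone data from an in-memory string (the variable csv_text is only for this example).

```python
import io
import pandas as pd

# Excerpt of the iPhone CSV, read from memory so the example runs on its own
csv_text = """,released,display,memory
iPhone 7,2016-09-16,4.7,2GB
iPhone 8,2017-09-22,4.7,2GB
iPhone X,2017-11-03,5.8,3GB
"""
iphone_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# A list of column names inside the brackets selects several columns
subset = iphone_df[["released", "memory"]]
# The equivalent loc form
same_subset = iphone_df.loc[:, ["released", "memory"]]

print(subset.columns.tolist())     # ['released', 'memory']
print(subset.equals(same_subset))  # True
```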

Getting multiple rows and columns using slicing

iphone_df.loc["iPhone 7":"iPhone X", "released":"memory"]

output

                 released  display memory
iPhone 7       2016-09-16      4.7    2GB
iPhone 7 Plus  2016-09-16      5.5    3GB
iPhone 8       2017-09-22      4.7    2GB
iPhone 8 Plus  2017-09-22      5.5    3GB
iPhone X       2017-11-03      5.8    3GB
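
One detail worth knowing about slicing: a label slice with loc includes both endpoints, while a position slice with iloc excludes the stop position. The sketch below shows this on a small in-memory excerpt of the iPhone data (csv_text is only for this example).

```python
import io
import pandas as pd

csv_text = """,released,display,memory
iPhone 7,2016-09-16,4.7,2GB
iPhone 8,2017-09-22,4.7,2GB
iPhone X,2017-11-03,5.8,3GB
"""
iphone_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Label slices include the last label on both axes
by_label = iphone_df.loc["iPhone 7":"iPhone 8", "released":"display"]
print(by_label.index.tolist())    # ['iPhone 7', 'iPhone 8']
print(by_label.columns.tolist())  # ['released', 'display']

# Position slices stop before the end position, like Python lists
by_position = iphone_df.iloc[0:2, 0:2]
print(by_label.equals(by_position))  # True
```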

Getting data using boolean conditions

# string comparison is case-sensitive: the data contains "Yes", not "YES"
condition = (iphone_df["display"] > 5) & (iphone_df["Face ID"] == "Yes")
print(condition)
# iPhone 7         False
# iPhone 7 Plus    False
# iPhone 8         False
# iPhone 8 Plus    False
# iPhone X          True
# iPhone XS         True
# iPhone XS Max     True
# dtype: bool

iphone_df.loc[condition]

output

                 released  display memory   version Face ID
iPhone X       2017-11-03      5.8    3GB  iOS 11.1     Yes
iPhone XS      2018-09-21      5.8    4GB  iOS 12.0     Yes
iPhone XS Max  2018-09-21      6.5    4GB  iOS 12.0     Yes
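
Conditions can also be combined with | (or) and negated with ~ (not). Each comparison needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch on three rows of the iPhone data, read from an in-memory string:

```python
import io
import pandas as pd

# Three rows taken from the iPhone CSV shown earlier
csv_text = """,display,Face ID
iPhone 7,4.7,No
iPhone 8 Plus,5.5,No
iPhone X,5.8,Yes
"""
iphone_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# | means "or": small display OR Face ID support
small_or_face_id = (iphone_df["display"] < 5) | (iphone_df["Face ID"] == "Yes")
print(iphone_df.loc[small_or_face_id].index.tolist())  # ['iPhone 7', 'iPhone X']

# ~ negates a condition: models without Face ID
no_face_id = ~(iphone_df["Face ID"] == "Yes")
print(iphone_df.loc[no_face_id].index.tolist())  # ['iPhone 7', 'iPhone 8 Plus']
```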

Indexing Table

Here is a table of indexing syntax

Indexing by Name

                           Basic Form                           Shortcut Form
Single row by name         df.loc["row4"]
List of row names          df.loc[["row4", "row5", "row3"]]
Slicing row names          df.loc["row2":"row5"]                df["row2":"row5"]
Single column by name      df.loc[:, "col1"]                    df["col1"]
List of column names       df.loc[:, ["col4", "col6", "col3"]]  df[["col4", "col6", "col3"]]
Slicing column names       df.loc[:, "col2":"col5"]

Indexing by Position

                           Basic Form                           Shortcut Form
Single row by position     df.iloc[8]
List of row positions      df.iloc[[4, 5, 3]]
Slicing row positions      df.iloc[2:5]                         df[2:5]
Single column by position  df.iloc[:, 3]
List of column positions   df.iloc[:, [3, 5, 6]]
Slicing column positions   df.iloc[:, 3:7]
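
The equivalences in the table can be checked on a small frame. The sketch below assumes a hypothetical 5x4 DataFrame whose labels follow the table's naming (row1 to row5, col1 to col4), so the exact labels and positions differ from the ones in the table.

```python
import pandas as pd

# Hypothetical 5x4 frame; the cell at rowR, colC holds the value R * 10 + C
df = pd.DataFrame(
    [[r * 10 + c for c in range(1, 5)] for r in range(1, 6)],
    index=[f"row{r}" for r in range(1, 6)],
    columns=[f"col{c}" for c in range(1, 5)],
)

# Basic and shortcut forms select the same data
print(df.loc[:, "col1"].equals(df["col1"]))                      # True
print(df.loc[:, ["col3", "col1"]].equals(df[["col3", "col1"]]))  # True
print(df.loc["row2":"row4"].equals(df["row2":"row4"]))           # True
print(df.iloc[1:4].equals(df[1:4]))                              # True

# loc uses labels and iloc uses positions; both reach the same cell here
print(df.loc["row3", "col2"], df.iloc[2, 1])  # 32 32
```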

Handling DataFrame

Modify

# modify one element
iphone_df.loc['iPhone 7', 'memory'] = '2.5GB'

# modify one row (one value per column, in column order)
iphone_df.loc['iPhone 8'] = ['2015-09-22', 4.7, '2.5GB', 'iOS 11.0', 'No']

# modify one column (the list needs one value per row; the rest is elided here)
iphone_df['display'] = ['4.5 in', '4.7 in', ...]
iphone_df['Face ID'] = 'Yes' # sets every row to 'Yes'

# modify multiple rows (row labels are case-sensitive)
iphone_df.loc[['iPhone 7', 'iPhone 8']] = 'a'
iphone_df.loc['iPhone 7':'iPhone 8'] = 'a'

Add, Delete

Add

# appended as a new last row
iphone_df.loc['iPhone XR'] = ['2017-11-03', 5.8, '3GB', 'iOS 11.0', 'Yes']

# appended as a new last column; every row gets the value 'Apple'
iphone_df['Company'] = 'Apple'

Delete

# delete the selected row
iphone_df.drop('iPhone XR', axis='index', inplace=True)

# delete the selected column
iphone_df.drop('Company', axis='columns', inplace=True)

If inplace=False (the default), the original DataFrame is not affected and drop() returns a new DataFrame.
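
The difference between the two inplace settings can be seen in a minimal sketch. The two-row frame below is hypothetical and only reuses the score column names from earlier in this note.

```python
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame(
    {"english_score": [50, 89], "math_score": [86, 31]},
    index=["a", "b"],
)

# Default (inplace=False): drop returns a new frame and df keeps all rows
without_b = df.drop("b", axis="index")
print(df.shape)         # (2, 2)
print(without_b.shape)  # (1, 2)

# inplace=True: df itself is changed and the call returns None
result = df.drop("math_score", axis="columns", inplace=True)
print(result)               # None
print(df.columns.tolist())  # ['english_score']
```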

Rename index/column

# this returns a new DataFrame; the original is unchanged (remaining columns elided)
iphone_df.rename(columns={'released': 'Released', 'display': 'Display', ...})

# this modifies the original DataFrame
iphone_df.rename(columns={'released': 'Released', 'display': 'Display', ...}, inplace=True)

# give the index a name
iphone_df.index.name = 'Model Name'
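
As a runnable sketch of the same idea, rename also accepts an index= mapping for row labels. The one-row frame and the new label 'iPhone Seven' are hypothetical and serve only to show the call.

```python
import pandas as pd

# Hypothetical one-row frame using two columns from the iPhone data
df = pd.DataFrame(
    {"released": ["2016-09-16"], "display": [4.7]},
    index=["iPhone 7"],
)

# Without inplace=True, rename returns a new frame and leaves df unchanged
renamed = df.rename(
    columns={"released": "Released"},
    index={"iPhone 7": "iPhone Seven"},
)

print(df.columns.tolist())       # ['released', 'display']
print(renamed.columns.tolist())  # ['Released', 'display']
print(renamed.index.tolist())    # ['iPhone Seven']
```

Assigning the result back (df = df.rename(...)) is an alternative to inplace=True.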

Big DataFrame

Get Data From Top or Bottom

The outputs below are from iphone_df read without index_col, so the model names appear in the Unnamed: 0 column.

iphone_df.head(3) # first 3 rows

output

      Unnamed: 0    released  display memory   version Face ID
0       iPhone 7  2016-09-16      4.7    2GB  iOS 10.0      No
1  iPhone 7 Plus  2016-09-16      5.5    3GB  iOS 10.0      No
2       iPhone 8  2017-09-22      4.7    2GB  iOS 11.0      No

iphone_df.tail(3) # last 3 rows

output

      Unnamed: 0    released  display memory   version Face ID
4       iPhone X  2017-11-03      5.8    3GB  iOS 11.1     Yes
5      iPhone XS  2018-09-21      5.8    4GB  iOS 12.0     Yes
6  iPhone XS Max  2018-09-21      6.5    4GB  iOS 12.0     Yes

iphone_df.shape # (number of rows, number of columns)

Get Information of the DataFrame

iphone_df.info()

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Unnamed: 0   7 non-null      object
 1   released     7 non-null      object
 2   display      7 non-null      float64
 3   memory       7 non-null      object
 4   version      7 non-null      object
 5   Face ID      7 non-null      object
dtypes: float64(1), object(5)
memory usage: 520.0+ bytes

iphone_df.describe()

describe() returns summary statistics for the numeric columns only.

        display
count  7.000000
mean   5.500000
std    0.640312
min    4.700000
25%    5.100000
50%    5.500000
75%    5.800000
max    6.500000
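
Each row of the describe() output corresponds to a Series method, which the sketch below checks on the display values from the iPhone CSV. The memory values are included only to show describe(include="all"), which also summarizes non-numeric columns.

```python
import pandas as pd

# display and memory values copied from the iPhone CSV shown earlier
df = pd.DataFrame({
    "display": [4.7, 5.5, 4.7, 5.5, 5.8, 5.8, 6.5],
    "memory": ["2GB", "3GB", "2GB", "3GB", "3GB", "4GB", "4GB"],
})

stats = df["display"].describe()

# Each entry matches the corresponding Series method
print(round(stats["mean"], 6), round(df["display"].mean(), 6))
print(round(stats["std"], 6), round(df["display"].std(), 6))
print(round(stats["25%"], 6), round(df["display"].quantile(0.25), 6))

# include="all" adds count, unique, top and freq for text columns
summary_all = df.describe(include="all")
print(summary_all.loc["unique", "memory"])  # 3
```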

Sort the DataFrame

iphone_df.sort_values(by='memory', ascending=True, inplace=True)
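
A few more sorting variations, sketched on a hypothetical three-row excerpt: descending order, several sort keys, and a reminder that a text column such as memory is compared as strings rather than as numbers.

```python
import pandas as pd

# Hypothetical three-row excerpt of the iPhone data
df = pd.DataFrame(
    {"display": [5.8, 4.7, 5.5], "memory": ["3GB", "2GB", "3GB"]},
    index=["iPhone X", "iPhone 7", "iPhone 7 Plus"],
)

# Without inplace=True, sort_values returns a new sorted frame
largest_first = df.sort_values(by="display", ascending=False)
print(largest_first.index.tolist())  # ['iPhone X', 'iPhone 7 Plus', 'iPhone 7']

# Several keys, each with its own direction
multi = df.sort_values(by=["memory", "display"], ascending=[True, False])
print(multi.index.tolist())  # ['iPhone 7', 'iPhone X', 'iPhone 7 Plus']

# Text is compared character by character, so "16GB" sorts before "2GB"
sizes = pd.Series(["2GB", "16GB"]).sort_values()
print(sizes.tolist())  # ['16GB', '2GB']
```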