Pandas
Frequently used code for pandas related code snippets
Last updated
Was this helpful?
Frequently used code for pandas related code snippets
Last updated
Was this helpful?
Was this helpful?
plot columns
df.plot(figsize=(20,10))
fill na
dt.fillna(0)
Reading data from Postgresql
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql://user@localhost:5432/mydb")
df = pd.read_sql_query("select * from table", con=engine)
Add rows
dt.append({"col1":[1,2,3], "coln":[4,5,6]})
pivoting dataframe
df.pivot(columns="col_with_columns",values="col_with_values")
getting index of max/min values
df['column'].idxmax()
Count instances of values in column
df['target'].value_counts().sort_values(ascending=False)
Filtering by column's value len
df[df['itemsets'].map(len) > 2]
Iterate by rows
for index, row in df.iterrows():
print(row['c1'], row['c2'])
Subsetting columns
df[['col1', 'col2']]
or by regex pattern
df.filter(regex=("d.*"))
Parse datetimes
# errors = ['coerce','raise','ignore'] # default: 'raise'
df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
Group by dates
df[df.date.dt.day == 8] #(or hour or minute)
Plot evolution over time
%matplotlib inline
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(18, 13), dpi= 80, facecolor='w', edgecolor='k')
ax = plt.gca()
hist_dt[hist_dt["target_community"]==0].groupby([hist_dt.created_at.dt.day,hist_dt.created_at.dt.hour]).size().plot(ax=ax)
# if multiple samples: hist_dt[hist_dt["target_community"]==1].groupby([hist_dt.created_at.dt.day,hist_dt.created_at.dt.hour]).size().plot(ax=ax)
# if not the subsetting part could be removed
plt.xlabel('Dia,Hora de publicación')
plt.ylabel('Cantidad de tweets')
New column A based on column B
df['Discounted_Price'] = df.apply(lambda row: row.Cost -
(row.Cost * 0.1), axis = 1)
#if row.property == 1 else None # for conditional setting
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
To select rows whose column value is in another column array
df.apply(lambda x: x['Responsibility Type'] in x['Roles'], axis=1)
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
To select rows whose value is not in some_values
df.loc[~df['column_name'].isin(some_values)]
Improve query efficiency by setting an index
df = df.set_index(['colname'])
Order columns lexicographically
df = df.reindex(sorted(df.columns), axis=1)
Remove duplicates (distinct)
df.drop_duplicates()# subset=["col1", "col2"] use only those columns for distinction
Check if a field exists
df[df['column'].isnull()]
#or
df[df['column'].notnull()]
Replacing column values
w['female'] = w['female'].map({'female': 1, 'male': 0})
#or
dt.loc[dt['ccLeadtime'] == -1, 'ccLeadtime'] = 0
One-hot encoding
dfDummies = pd.get_dummies(df['categorical'], prefix = 'category')
df = pd.concat([df, dfDummies], axis=1)
Rename column
df = df.rename(columns={"A": "a", "B": "c"})
Rename index names
df.index.names = ['new_name']
Remove column from list
df.drop(['pop'], axis=1)
Join dataframes
pd.merge(df1,df2,on='key')
Iterate by groups
grouped = df.groupby('A')
for name, group in grouped:
Group data by year
data.groupby(data.date.dt.year)
Count unique by group
df.groupby('location')['user'].nunique()
Collect list in Group by
df.groupby('a')['b'].apply(list).reset_index(name='new')
Get normalized values from groupby count
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
a = df2.groupby('subset_product')['prod_count'].transform('sum')
df2['prod_count'] = df2['prod_count'].div(a)
Generate csv from dataframe
df.to_csv('filename.csv',index=False)
Sum to date column
df['date'] = df['created'] + pd.to_timedelta(7, unit='d')
Replace character in column
df["column"] = df["column"].str.replace('foo', 'bar')