Posted By: Anonymous
I am trying to find the number of times a certain value appears in one column.
I have made the dataframe with
data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to find the number of times something appears in a column. How is this done?
I thought it was the code below, which looks in the education column and counts the number of times
9th appears. The error is what I get when I run the code.
missing2 = df.education.value_counts()['9th']
print(missing2)
You can create a subset of the data with your condition and then use shape or len:
print(df)
  col1 education
0    a       9th
1    b       9th
2    c       8th

print(df.education == '9th')
0     True
1     True
2    False
Name: education, dtype: bool

print(df[df.education == '9th'])
  col1 education
0    a       9th
1    b       9th

# shape is a (rows, columns) tuple, so take the first element for the row count
print(df[df.education == '9th'].shape[0])
2

print(len(df[df['education'] == '9th']))
2
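For a self-contained version of the same idea, here is a minimal sketch that rebuilds the three-row frame from this answer and counts the matches in a few equivalent ways. The `.get('9th', 0)` call is an addition that is not in the original post; it is one way to avoid a KeyError when the value never appears in the column, which may or may not be the error the question ran into.

```python
import pandas as pd

# Example data copied from the answer above
df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'education': ['9th', '9th', '8th']})

# Boolean mask: True where education equals '9th'
mask = df['education'] == '9th'

count_len = len(df[mask])          # rows in the filtered subset
count_shape = df[mask].shape[0]    # first element of the (rows, columns) tuple
count_sum = int(mask.sum())        # True counts as 1, so the sum is the count

# value_counts() returns a Series indexed by value; .get() supplies a
# default instead of raising KeyError when the value is absent
count_vc = int(df['education'].value_counts().get('9th', 0))
missing = int(df['education'].value_counts().get('12th', 0))

print(count_len, count_shape, count_sum, count_vc, missing)
```

All four counts should agree, and the lookup for a value that is not in the column falls back to 0.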
Performance is interesting: the fastest solution is to compare the NumPy array and sum the result:
import string

import numpy as np
import pandas as pd
import perfplot

np.random.seed(123)

# Each kernel counts the rows where education == 'a' in a different way
def shape(df):
    return df[df.education == 'a'].shape[0]

def len_df(df):
    return len(df[df['education'] == 'a'])

def query_count(df):
    return df.query('education == "a"').education.count()

def sum_mask(df):
    return (df.education == 'a').sum()

def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()

# Build a one-column frame of n random ASCII letters
def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
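If perfplot is not installed, a rough comparison is possible with the standard-library timeit module. This is only a sketch: the frame size (100,000 rows) and the repeat count (20) are arbitrary choices, not values from the benchmark above, and the timings will vary with the machine and the pandas version. It also checks that the approaches return the same count, which the benchmark skips by passing equality_check=False.

```python
import string
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
letters = list(string.ascii_letters)
# Arbitrary size chosen for illustration
df = pd.DataFrame(np.random.choice(letters, size=100_000), columns=['education'])

# Three of the counting approaches from the benchmark, wrapped as callables
candidates = {
    'len_df': lambda: len(df[df['education'] == 'a']),
    'sum_mask': lambda: (df['education'] == 'a').sum(),
    'sum_mask_numpy': lambda: (df['education'].values == 'a').sum(),
}

# Every approach should report the same number of matches
counts = {name: int(func()) for name, func in candidates.items()}
assert len(set(counts.values())) == 1

# Total time for 20 calls of each approach, printed fastest first
timings = {name: timeit.timeit(func, number=20) for name, func in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds:.4f}s for 20 runs, count={counts[name]}')
```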