python - Memory usage when creating a term density matrix from a pandas DataFrame
I have a DataFrame which I save to/read from a CSV file, and I want to create a term density matrix from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped the code in a function:
    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.sparse import coo_matrix, csc_matrix, hstack
    import pandas as pd
    import numpy as np

    countvec = CountVectorizer()

    def df2tdm(df, titleColumn, placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a DataFrame with the
        term density matrix of the words appearing in titleColumn.
        Inputs: df, a DataFrame containing titleColumn and placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and one column for each word
            appearing in df.titleColumn
        Credits:
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        '''
        tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(),
                              columns=countvec.get_feature_names())
        tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
        return tdm_df
which returns the TDM as a DataFrame, for example:
    df = pd.DataFrame({'title': ['delicious boiled egg', 'fried egg ', 'potato salad',
                                 'split orange', 'something else'],
                       'page': [1, 1, 2, 3, 4]})
    print df.head()
    tdm_df = df2tdm(df, 'title', 'page')
    tdm_df.head()

       boiled  delicious  egg  else  fried  orange  potato  salad  something  \
    0       1          1    1     0      0       0       0      0          0
    1       0          0    1     0      1       0       0      0          0
    2       0          0    0     0      0       0       1      1          0
    3       0          0    0     0      0       1       0      0          0
    4       0          0    0     1      0       0       0      0          1

       split  page
    0      0     1
    1      0     1
    2      0     2
    3      1     3
    4      0     4
This implementation suffers from bad memory scaling: when I use a DataFrame that occupies 190 kB saved as UTF-8, the function uses ~200 MB to create the TDM DataFrame. When the CSV file is 600 kB, the function uses 700 MB, and when the CSV is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, and it is considerably slower:
    def df2tdm_sparse(df, titleColumn, placementColumn):
        '''
        Takes in a DataFrame with at least two columns, and returns a DataFrame with the
        term density matrix of the words appearing in titleColumn. This implementation
        uses sparse DataFrames.
        Inputs: df, a DataFrame containing titleColumn and placementColumn among other columns
        Outputs: tdm_df, a DataFrame containing placementColumn and one column for each word
            appearing in df.titleColumn
        Credits:
        https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
        https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
        https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
        '''
        pm = df[[placementColumn]].values
        tm = countvec.fit_transform(df[titleColumn])  # .toarray()
        m = csc_matrix(hstack([pm, tm]))
        dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel())
                                    for i in np.arange(m.shape[0])])
        dfout.columns = [placementColumn] + countvec.get_feature_names()
        return dfout
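(Editor's note: on pandas 0.25 or later, where pd.SparseDataFrame no longer exists, the sparse accessor can build the frame directly from the scipy matrix, avoiding the per-row toarray() loop above. A minimal sketch under those assumptions; df2tdm_sparse_accessor is an illustrative name, and newer scikit-learn renames get_feature_names to get_feature_names_out:

    import pandas as pd
    from scipy.sparse import csc_matrix, hstack
    from sklearn.feature_extraction.text import CountVectorizer

    def df2tdm_sparse_accessor(df, titleColumn, placementColumn):
        countvec = CountVectorizer()
        # Term matrix stays a scipy sparse matrix throughout.
        tm = countvec.fit_transform(df[titleColumn])
        m = csc_matrix(hstack([df[[placementColumn]].values, tm]))
        cols = [placementColumn] + list(countvec.get_feature_names())
        # from_spmatrix wraps the sparse matrix without densifying any rows.
        return pd.DataFrame.sparse.from_spmatrix(m, columns=cols)
)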
Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit-learn, e.g. here.
I think the problem might be in the conversion from the sparse matrix to a sparse DataFrame.
Try this function (or something similar):
    def sparsematrixtosparsedf(xsparsematrix):
        import numpy as np
        import pandas as pd
        def elementstona(x):
            # Turn zeros into NaN so they match the SparseSeries fill value
            # and are not stored explicitly.
            x[x == 0] = np.nan
            return x
        xdf1 = pd.SparseDataFrame([pd.SparseSeries(elementstona(xsparsematrix[i].toarray().ravel()))
                                   for i in np.arange(xsparsematrix.shape[0])])
        return xdf1
You can see that it reduces the size by checking the DataFrame's density attribute:

    df1.density
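For example, applied to the term matrix from the question (a sketch; it assumes df and countvec from the question are in scope, and df1/tm are illustrative names):

    # Build the sparse term matrix as in the question, then convert it.
    tm = countvec.fit_transform(df['title'])   # scipy sparse matrix
    df1 = sparsematrixtosparsedf(tm)
    df1.columns = countvec.get_feature_names()
    print df1.density                          # fraction of entries actually stored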
I hope this helps.