python - Memory usage in creating Term Density Matrix from pandas DataFrame

I have a dataframe which I save to and read from a csv file, and I want to create a Term Density Matrix dataframe from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    countvec = CountVectorizer()
    from scipy.sparse import coo_matrix, csc_matrix, hstack

    def df2tdm(df,titlecolumn,placementcolumn):
        '''
        Takes in a dataframe with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in titlecolumn

        Inputs: df, a dataframe containing titlecolumn, placementcolumn among other columns
        Outputs: tdm_df, a dataframe containing placementcolumn and columns with all the words appearing in df.titlecolumn

        Credits:
        '''
        # .toarray() materializes the full dense count matrix (rows x vocabulary)
        # get_feature_names() is the older scikit-learn API; scikit-learn >= 1.0 offers get_feature_names_out()
        tdm_df = pd.DataFrame(countvec.fit_transform(df[titlecolumn]).toarray(), columns=countvec.get_feature_names())
        tdm_df = tdm_df.join(pd.DataFrame(df[placementcolumn]))
        return tdm_df

which returns the TDM as a dataframe, for example:

    df = pd.DataFrame({'title':['delicious boiled egg','fried egg ', 'potato salad', 'split orange','something else'], 'page':[1, 1, 2, 3, 4]})
    print(df.head())
    tdm_df = df2tdm(df,'title','page')
    tdm_df.head()

       boiled  delicious  egg  else  fried  orange  potato  salad  something  \
    0       1          1    1     0      0       0       0      0          0
    1       0          0    1     0      1       0       0      0          0
    2       0          0    0     0      0       0       1      1          0
    3       0          0    0     0      0       1       0      0          0
    4       0          0    0     1      0       0       0      0          1

       split  page
    0      0     1
    1      0     1
    2      0     2
    3      1     3
    4      0     4

This implementation suffers from bad memory scaling: when I use a dataframe which occupies 190 kB saved as utf8, the function uses ~200 MB to create the TDM dataframe. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
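To see why the dense version can blow up like this, here is a rough back-of-the-envelope sketch. The helper names and the corpus numbers (10,000 titles, 20,000 distinct words, about 4 words per title) are hypothetical and are not taken from my data; the sketch only assumes 8-byte counts and 4-byte indices in a CSR layout, and ignores any pandas overhead.

```python
def dense_matrix_bytes(n_docs, vocab_size, itemsize=8):
    """Approximate size of a dense count matrix: one cell per (document, term) pair."""
    return n_docs * vocab_size * itemsize


def csr_matrix_bytes(n_docs, nnz, itemsize=8, indexsize=4):
    """Approximate size of a CSR matrix: data + column indices + row pointers."""
    return nnz * (itemsize + indexsize) + (n_docs + 1) * indexsize


# Hypothetical corpus, for illustration only
n_docs = 10_000
vocab_size = 20_000
nnz = n_docs * 4  # roughly 4 distinct words per title

dense = dense_matrix_bytes(n_docs, vocab_size)
sparse = csr_matrix_bytes(n_docs, nnz)
print(dense, sparse)
```

The dense size grows with rows times vocabulary, and the vocabulary itself tends to grow with the number of rows, so the memory use of `.toarray()` climbs much faster than the size of the csv file.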

I also made an implementation using sparse matrices and sparse dataframes (below), but the memory usage is pretty much the same, and it is considerably slower

    import numpy as np

    def df2tdm_sparse(df,titlecolumn,placementcolumn):
        '''
        Takes in a dataframe with at least two columns, and returns a dataframe with the term density matrix
        of the words appearing in titlecolumn. This implementation uses sparse dataframes.

        Inputs: df, a dataframe containing titlecolumn, placementcolumn among other columns
        Outputs: tdm_df, a dataframe containing placementcolumn and columns with all the words appearing in df.titlecolumn

        Credits:
        '''
        pm = df[[placementcolumn]].values
        tm = countvec.fit_transform(df[titlecolumn])#.toarray()
        m = csc_matrix(hstack([pm,tm]))
        # each row is densified and wrapped in a SparseSeries, one row at a time
        # (pd.SparseDataFrame / pd.SparseSeries exist only in pandas < 1.0)
        dfout = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) for i in np.arange(m.shape[0]) ])
        dfout.columns = [placementcolumn]+countvec.get_feature_names()
        return dfout

Any suggestions on how to improve the memory usage? I wonder if this is related to the memory issues of scikit, e.g. here

I think the problem might be the conversion from the sparse matrix to the sparse data frame.

Try this function (or something similar)

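    def sparsematrixtosparsedf(xsparsematrix):
        import numpy as np
        import pandas as pd
        def elementstona(x):
            # cast to float first: an integer count array cannot hold NaN
            x = x.astype(float)
            x[x==0] = np.nan
            return x
        # pd.SparseDataFrame / pd.SparseSeries exist only in pandas < 1.0
        xdf1 = pd.SparseDataFrame([pd.SparseSeries(elementstona(xsparsematrix[i].toarray().ravel())) for i in np.arange(xsparsematrix.shape[0]) ])
        return xdf1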

You can see that it reduces the size by checking the density of the resulting sparse dataframe.


I hope this helps.

