Memory usage when creating a Python dict from XML using pattern


I have a largeish XML file (40 MB) and use the following function to parse it into a dict:

    import re

    import pandas as pd
    from pattern import web

    def get_title_year(xml, low, high):
        """
        Given an XML document, extract the title and year of each article.
        Inputs: xml (XML string); low, high (integers) defining the beginning
        and ending year of the record to follow.
        """
        dom = web.Element(xml)
        result = {'title': [], 'publication year': []}
        for article in dom.by_tag('article'):
            # The year is the first double-quoted value inside the cpyrt element.
            year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
            if low < year < high:
                result['title'].append(article.by_tag('title')[0].content)
                result['publication year'].append(year)
        return result

    ty_dict = get_title_year(pr_file, 1912, 1970)
    ty_df = pd.DataFrame(ty_dict)
    print(ty_df.head())

       publication year                                              title
    0              1913  velocity of electrons in photo-electri...
    1              1913  announcement of transfer of review ...
    2              1913  diffraction , secondary radiation elect...
    3              1913      on comparative absorption of γ , x rays
    4              1913             study of resistance of carbon contacts

When I run this, I end up using 2.5 GB of RAM! Two questions:

Where is all the RAM being used? It is not the dictionary or the DataFrame: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.

Also, the RAM is not released after the function finishes. Is this normal? I have never paid attention to Python memory usage in the past, so I cannot say.

This answers the part about releasing the memory at the end of the function. See Wojciech Walczak's comment and the link above! I am posting my code here because I found that in my case (Ubuntu 12.04) putting the p.join() statement before the assignment ty_dict = q.get() (as in the original link) caused the code to deadlock, see here.

    import re
    from multiprocessing import Process, Queue

    from pattern import web

    def get_title_year(xml, low, high, q):
        """
        Given an XML document, extract the title and year of each article.
        Inputs: xml (XML string); low, high (integers) defining the beginning
        and ending year of the record to follow; q (Queue) receiving the result.
        """
        dom = web.Element(xml)
        result = {'title': [], 'publication year': []}
        for article in dom.by_tag('article'):
            # The year is the first double-quoted value inside the cpyrt element.
            year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
            if low < year < high:
                result['title'].append(article.by_tag('title')[0].content)
                result['publication year'].append(year)
        q.put(result)

    q = Queue()
    p = Process(target=get_title_year, args=(pr_file, 1912, 1970, q))
    p.start()
    ty_dict = q.get()  # fetch the result before join(); the reverse order deadlocked
    p.join()
    if p.is_alive():
        p.terminate()

With this version, the memory is released to the OS at the end of the statement.
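As for the first question (where the RAM goes), one way to investigate is to compare the size of the XML text with the size of the parsed tree. The sketch below is only an illustration under stated assumptions: it needs Python 3 for tracemalloc, it uses the standard library's minidom as a stand-in for pattern's parser, and the synthetic document (including the date attribute on cpyrt) is made up for the example. Note that tracemalloc reports Python-level allocations, not the resident size of the process, so it will not show whether memory goes back to the OS.

```python
import gc
import tracemalloc
from xml.dom import minidom

# Synthetic stand-in document. The tag names follow the post, but the
# exact layout (e.g. the "date" attribute) is an assumption.
articles = "".join(
    '<article><title>Title {0}</title><cpyrt date="{1}"/></article>'.format(
        i, 1900 + i % 80
    )
    for i in range(2000)
)
xml_text = "<root>" + articles + "</root>"

tracemalloc.start()

# Parsing builds a full in-memory tree with several objects per element.
dom = minidom.parseString(xml_text)
tree_bytes, _ = tracemalloc.get_traced_memory()

# Keep only the strings we need, then drop the tree.
titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]
dom.unlink()   # break parent/child reference cycles inside the tree
del dom
gc.collect()
kept_bytes, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()

print("XML text:            %d characters" % len(xml_text))
print("parsed tree:         %d bytes" % tree_bytes)
print("after dropping tree: %d bytes" % kept_bytes)
```

If the tree turns out to be many times larger than the text, that would suggest that the parsed document, rather than the resulting dict or DataFrame, accounts for most of the usage, though the ratio for pattern's parser may differ from minidom's.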

