Memory usage when creating a Python dict from XML using Pattern
I have a largish XML file (40 MB) and use the following function to parse it into a dict:
def get_title_year(xml, low, high):
    """Given an XML document, extract the title and year of each article.

    Inputs: xml (XML string); low, high (integers) defining the
    beginning and ending year of the records to keep.
    """
    dom = web.Element(xml)
    result = {'title': [], 'publication year': []}
    for article in dom.by_tag('article'):
        year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    return result

ty_dict = get_title_year(pr_file, 1912, 1970)
ty_df = pd.DataFrame(ty_dict)
print ty_df.head()

   publication year                                            title
0              1913  velocity of electrons in photo-electri...
1              1913  announcement of transfer of review ...
2              1913  diffraction and secondary radiation of elect...
3              1913  on the comparative absorption of γ and x rays
4              1913  a study of the resistance of carbon contacts
When I run this, I end up using 2.5 GB of RAM! Two questions:

1. Where is the RAM being used? It can't be the dictionary or the DataFrame themselves: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.

2. The RAM is also not released after the function finishes. Is that normal? I have never paid attention to Python's memory usage before, so I can't say.
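One likely culprit is that the parse builds the whole DOM in memory at once. As a point of comparison, here is a minimal streaming sketch (Python 3 syntax) using the standard library's `xml.etree.ElementTree.iterparse` instead of Pattern's `web.Element`; the tag names mirror the question, but the `<cpyrt>` element is simplified to hold a bare year rather than the quoted string the original `re.split('"', ...)` expects:

```python
import xml.etree.ElementTree as ET
from io import StringIO

def get_title_year_streaming(xml_string, low, high):
    """Stream through the XML, keeping only matching titles/years.

    Each <article> element is cleared as soon as it has been
    processed, so the parsed tree never holds the whole document.
    """
    result = {'title': [], 'publication year': []}
    for event, elem in ET.iterparse(StringIO(xml_string), events=('end',)):
        if elem.tag == 'article':
            year = int(elem.find('cpyrt').text)
            if low < year < high:
                result['title'].append(elem.find('title').text)
                result['publication year'].append(year)
            elem.clear()  # drop the element's children immediately
    return result

sample = """<root>
  <article><title>A</title><cpyrt>1913</cpyrt></article>
  <article><title>B</title><cpyrt>1980</cpyrt></article>
</root>"""
print(get_title_year_streaming(sample, 1912, 1970))
# → {'title': ['A'], 'publication year': [1913]}
```

This keeps peak memory proportional to one `<article>` rather than the full 40 MB document, though it does not by itself answer the question of why memory is not returned to the OS afterwards.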
This answers the part about releasing memory at the end of the function; see Wojciech Walczak's comment and the link above! I am posting the code here because I found that in my case (Ubuntu 12.04), putting the p.join() statement before the assignment ty_dict = q.get() (as in the original link) caused the code to deadlock; see here. (This matches the warning in the multiprocessing programming guidelines: a child that has put items on a queue will not terminate until the buffered items are flushed, so joining it before draining the queue can deadlock.)
from multiprocessing import Process, Queue

def get_title_year(xml, low, high, q):
    """Given an XML document, extract the title and year of each article.

    Inputs: xml (XML string); low, high (integers) defining the
    beginning and ending year of the records to keep.
    """
    dom = web.Element(xml)
    result = {'title': [], 'publication year': []}
    for article in dom.by_tag('article'):
        year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    q.put(result)

q = Queue()
p = Process(target=get_title_year, args=(pr_file, 1912, 1970, q))
p.start()
ty_dict = q.get()
p.join()
if p.is_alive():
    p.terminate()
With this version, the memory is released to the OS at the end of the statement.
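The trick generalizes: do the memory-hungry work in a child process and pass back only the small result, so that everything the child allocated is reclaimed by the OS when it exits. A minimal, self-contained sketch (Python 3 syntax; the names `heavy_work` and `run_in_subprocess` are illustrative, not from the original post):

```python
from multiprocessing import Process, Queue

def heavy_work(n, q):
    """Stand-in for the XML parsing: build a big temporary
    structure in the child, then put only the small result
    on the queue."""
    big = list(range(n))   # large allocation lives only in the child
    q.put(sum(big))        # only the small result crosses the pipe

def run_in_subprocess():
    q = Queue()
    p = Process(target=heavy_work, args=(1_000_000, q))
    p.start()
    result = q.get()       # drain the queue BEFORE join()
    p.join()               # safe now: no buffered items remain
    return result

if __name__ == '__main__':
    print(run_in_subprocess())  # → 499999500000
```

Note the ordering: `q.get()` comes before `p.join()`, for the same deadlock reason as above.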