Memory usage when creating a Python dict from XML using Pattern
I have a largish XML file (40 MB) and use the following function to parse it into a dict:
def get_title_year(xml, low, high):
    """Given an XML document, extract the title and year of each article.

    Inputs: xml (XML string); low, high (integers) defining the
    beginning and ending year of the records to keep.
    """
    dom = web.Element(xml)
    result = {'title': [], 'publication year': []}
    for article in dom.by_tag('article'):
        year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    return result

ty_dict = get_title_year(pr_file, 1912, 1970)
ty_df = pd.DataFrame(ty_dict)
print ty_df.head()

   publication year                                            title
0              1913  velocity of electrons in photo-electri...
1              1913  announcement of transfer of review ...
2              1913  diffraction and secondary radiation of elect...
3              1913  on the comparative absorption of γ and x rays
4              1913  a study of the resistance of carbon contacts
When I run this, I end up using 2.5 GB of RAM! Two questions:

1. Where is the RAM being used? It can't be the dictionary or the DataFrame themselves: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.

2. The RAM is also not released after the function finishes. Is that normal? I have never paid attention to Python's memory usage before, so I can't say.
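One likely culprit is that the parse builds the whole DOM in memory at once. As a point of comparison, here is a minimal streaming sketch (Python 3 syntax) using the standard library's `xml.etree.ElementTree.iterparse` instead of Pattern's `web.Element`; the tag names mirror the question, but the `<cpyrt>` element is simplified to hold a bare year rather than the quoted string the original `re.split('"', ...)` expects:

```python
import xml.etree.ElementTree as ET
from io import StringIO

def get_title_year_streaming(xml_string, low, high):
    """Stream through the XML, keeping only matching titles/years.

    Each <article> element is cleared as soon as it has been
    processed, so the parsed tree never holds the whole document.
    """
    result = {'title': [], 'publication year': []}
    for event, elem in ET.iterparse(StringIO(xml_string), events=('end',)):
        if elem.tag == 'article':
            year = int(elem.find('cpyrt').text)
            if low < year < high:
                result['title'].append(elem.find('title').text)
                result['publication year'].append(year)
            elem.clear()  # drop the element's children immediately
    return result

sample = """<root>
  <article><title>A</title><cpyrt>1913</cpyrt></article>
  <article><title>B</title><cpyrt>1980</cpyrt></article>
</root>"""
print(get_title_year_streaming(sample, 1912, 1970))
# → {'title': ['A'], 'publication year': [1913]}
```

This keeps peak memory proportional to one `<article>` rather than the full 40 MB document, though it does not by itself answer the question of why memory is not returned to the OS afterwards.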
This answers the part about releasing memory at the end of the function; see Wojciech Walczak's comment and the link above! I am posting the code here because I found that in my case (Ubuntu 12.04), putting the p.join() statement before the assignment ty_dict = q.get() (as in the original link) caused the code to deadlock; see here. (This matches the warning in the multiprocessing programming guidelines: a child that has put items on a queue will not terminate until the buffered items are flushed, so joining it before draining the queue can deadlock.)
from multiprocessing import Process, Queue

def get_title_year(xml, low, high, q):
    """Given an XML document, extract the title and year of each article.

    Inputs: xml (XML string); low, high (integers) defining the
    beginning and ending year of the records to keep.
    """
    dom = web.Element(xml)
    result = {'title': [], 'publication year': []}
    for article in dom.by_tag('article'):
        year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    q.put(result)

q = Queue()
p = Process(target=get_title_year, args=(pr_file, 1912, 1970, q))
p.start()
ty_dict = q.get()
p.join()
if p.is_alive():
    p.terminate()
With this version, the memory is released to the OS at the end of the statement.
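The trick generalizes: do the memory-hungry work in a child process and pass back only the small result, so that everything the child allocated is reclaimed by the OS when it exits. A minimal, self-contained sketch (Python 3 syntax; the names `heavy_work` and `run_in_subprocess` are illustrative, not from the original post):

```python
from multiprocessing import Process, Queue

def heavy_work(n, q):
    """Stand-in for the XML parsing: build a big temporary
    structure in the child, then put only the small result
    on the queue."""
    big = list(range(n))   # large allocation lives only in the child
    q.put(sum(big))        # only the small result crosses the pipe

def run_in_subprocess():
    q = Queue()
    p = Process(target=heavy_work, args=(1_000_000, q))
    p.start()
    result = q.get()       # drain the queue BEFORE join()
    p.join()               # safe now: no buffered items remain
    return result

if __name__ == '__main__':
    print(run_in_subprocess())  # → 499999500000
```

Note the ordering: `q.get()` comes before `p.join()`, for the same deadlock reason as above.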