
AS OF MAY 2008, THIS BLOG IS NO LONGER BEING UPDATED.
Visit the new blog at: http://coreygoldberg.blogspot.com



 Wednesday, November 07, 2007

Python - Processing Large Text Files One Line At A Time

I want to process some very large text files one line at a time.  Normally when I process text files, I slurp them into a list using the readlines() method.   However, sometimes the files are huge and it isn't feasible or optimal to read the entire content into memory upfront.   In this case, it makes sense to process them one line at a time.

The best solution I can come up with is this:


fh = open('foo.txt', 'r')
line = fh.readline()
while line:
    # do something here
    line = fh.readline()

It doesn't feel very pythonic/idiomatic.  Anyone have a better solution?


Update
Thanks to the comments below, I found a few different ways to do it. The best and most Pythonic way seems to be this:


for line in open('foo.txt', 'r'):
    # do something here
    pass

Python file objects support the iterator protocol, so you can just open it and go.   This is the same as using a while loop and calling readline() but more compact.
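One thing the one-liner leaves open is when the file gets closed. Here is a minimal sketch of the same iteration wrapped in a with block, which works on Python 2.6 and later. The function name count_nonblank_lines and the temporary sample file are made up for illustration and are not part of the original post.

```python
import os
import tempfile


def count_nonblank_lines(path):
    """Iterate over a file lazily and count its non-blank lines."""
    count = 0
    # The with block closes the file even if the loop body raises.
    with open(path, 'r') as fh:
        for line in fh:  # the file object yields one line per iteration
            if line.strip():
                count += 1
    return count


# Hypothetical sample file, written only so the sketch can be run as-is.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\n\nsecond\nthird\n')
    sample_path = tmp.name

nonblank = count_nonblank_lines(sample_path)
os.remove(sample_path)
print(nonblank)
```

Only one line is held in memory at a time, so the same loop should behave the same way on a file far larger than this sample.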

Comments [7]
Wednesday, November 07, 2007 3:08:52 PM (Eastern Standard Time, UTC-05:00)
This is the quintessential way to read files in python:

reader = file( 'foo.txt' )

# read the first line
headers = reader.next()

for line in reader:
    line = line.strip()

Wednesday, November 07, 2007 3:19:29 PM (Eastern Standard Time, UTC-05:00)
for line in open('somefile').xreadlines()
Wednesday, November 07, 2007 4:01:41 PM (Eastern Standard Time, UTC-05:00)
xreadlines has been deprecated since Python 2.3. (http://www.python.org/doc/2.3/lib/module-xreadlines.html)

The "for line in reader" approach is the way to go.
Wednesday, November 07, 2007 4:03:24 PM (Eastern Standard Time, UTC-05:00)
Python file objects support the iterator protocol, so you can just open it and go:

for line in open('somefile','r'):
    print line


This is the same as using a while loop and calling readline() but more compact.
Thursday, November 08, 2007 1:31:12 AM (Eastern Standard Time, UTC-05:00)
Fredrik Lundh does some optimising of line-by-line processing of very large text files as part of his Wide-Finder work here: http://effbot.org/zone/wide-finder.htm

- Paddy.
Friday, November 09, 2007 7:00:57 AM (Eastern Standard Time, UTC-05:00)
You might also find the fileinput module useful (http://docs.python.org/lib/module-fileinput.html).
Sunday, November 11, 2007 1:48:58 AM (Eastern Standard Time, UTC-05:00)
I see you've already got good solutions from the other comments. Just commenting on the 'for line in ...' idiom: IIRC, I read in the Python Cookbook that this idiom (which I think has been available since Python 2.2 or so) is also faster, maybe because it has been optimized internally. BTW (off topic, except that it's also about performance), the same book has a fascinating section by Tim Peters about how the internal Python sort() method has been heavily optimized as well. Worth a read.

Also, just saw your ystockquote Python module - going to check it out - thanks!

Vasudev Ram
Dancing Bison Enterprises
Biz site: http://www.dancingbison.com
Blog on software innovation:
http://jugad.livejournal.com
PDF creation toolkit (in Python):
http://www.dancingbison.com/products.html
Comments are closed.