Re: Python Gotcha

From: charlie strauss <cems_at_earthlink_dot_net>
Date: Mon Oct 02 2006 - 23:10:15 CDT

-----Original Message-----
>From: "Reg. Charney" <>
>Sent: Oct 2, 2006 6:46 PM
>To: Open Voting Consortium discussion list <>
>Subject: Re: [OVC-discuss] Python Gotcha
>I work with Python and your analysis is technically correct. However, it
>is unusual to intentionally create millions of objects without deleting
>any of them. For one thing, it eats up memory that can be used for other
>things. In your case, creating millions of objects is logically correct
>and you do want to disable automatic garbage collection (GC) -- so this
>is not a band aid.

Well yes that was my point. Since I encountered in a context I thought might occur in vote counting I thought it worth mentioning as a gotcha to watch for.

is a million unreasonable?
imagine you have a ballot with 100 lines of text, and there are ten attributes of each line to record.
Imagine you have 2000 ballots. That's minimally >2 million objects to store, before your program does anything at all, which presumably will cause generation of many more objects.

>Having said that, I am concerned that you are keeping all of that
>information in memory. If there is a power failure or a software bug,
>all the information that you have collected could be lost. I would have
>thought that you would want to write each ballot out to non-volatile
>memory or backing storage, so that only one ballot, at most, would be
>lost due to power outage or some component failure.

Don't fret. First my program is a toy demo. And second, it's not preserving any vote record but rather doing statistical analysis of previously recorded vote files. Reading them all in and comparing them requires they all be in memory.

I don't need any problem solving help here since I know the solution to my specific problems. But if you are curious to see the issue, here's an example below.

I've tested this on every kind of platform (win, mac, linux) and many versions of Python. All display it but there's
some importan variations that may make the bug seem worse or better.
it shows up most easily on slow, large memory machines.
 if your machine is faster than my 800mhz G4 then the time gaps are smaller at first so you have to set
MACHINERES smaller or wait longer to see the events. Windows machines can't time less than 0.17 second gaps.
Also if you don't have a lot of memory then if you let it run to completion your machine may go into VM swap, which of course is an entirely unrelated and acceptable reason for a time gap.

the following code is designed to exhibit this error why specifically minimizing excess allocations and deallocations.

import sys
from time import time

class Foo(object):
     def __init__(me,nfoo):
         me.memory = [1]*nfoo

class Bar(object):
     def __init__(me,nbar,nfoo):
         tgx.set_tag("Bar") # timer

         b = [None]*nbar
         for i in xrange(nbar):
             b[i]=Foo(nfoo) # make a foo, add it to list
             tgx.jump("Bar"+`i`) #timer

         me.b = b #store the list in my instance memory

# MACHINERES = 0.5 # was used on my 800mhz G4.
MACHINERES = 0.17 # your machine is probably faster.

# just a utility class to time things.
class gtime:
     def __init__(me,f=sys.stderr):
         me.last = time()
         me.tag = ""
         me.ticks = 0

     def set_tag(me,tag):
         me.tag = tag
         me.ticks = 0

     def reset(me):
         me.ticks = 0

     def jump(me,tag="NONE",allowed_gap=MACHINERES):
         me.t = time()
         me.ticks +=1
         if me.t-me.last>allowed_gap:
             print >>me.f,"Big Gap:",me.t-me.last,\
             "seconds ",me.tag,tag,me.ticks
         me.last = time()

tgx = gtime() # instance of the timer

# main loop
nfoo = 10 # foo could be a parsed line on a ballot for example
nbar = 100 # bar could be a ballot with 100 lines to parse

ad_nauseum = 20000
# this number is absurdly large. It's chose just to one can see this
# happen multiple times and watch the gaps grow. I expect
# you will press control-C way before reaching the end of this.

final = [None]*ad_nauseum

for i in xrange(ad_nauseum ):
     if i%100 == 0 :print >>sys.stderr,"Bars Made: ",i
     final[i] = Bar(nbar,nfoo)

sample Output:

Bars Made: 0
Bars Made: 100
Bars Made: 200
Bars Made: 300
Bars Made: 400
Bars Made: 500
Bars Made: 600
Bars Made: 700
Bars Made: 800
Bars Made: 900
Bars Made: 1000
Bars Made: 1100
Bars Made: 1200
Bars Made: 1300
Bars Made: 1400
Bars Made: 1500
Big Gap: 0.618862867355 seconds Bar Bar5 6
Bars Made: 1600
Bars Made: 1700
Bars Made: 1800
Big Gap: 0.748915195465 seconds Bar Bar76 77
Bars Made: 1900
Bars Made: 2000
Bars Made: 2100
Big Gap: 0.809149980545 seconds Bar Bar45 46
Bars Made: 2200
Bars Made: 2300
Bars Made: 2400
Big Gap: 0.898494958878 seconds Bar Bar15 16
Bars Made: 2500
Bars Made: 2600
Bars Made: 2700
Big Gap: 1.01110696793 seconds Bar Bar86 87
Bars Made: 2800
Bars Made: 2900
Bars Made: 3000
Big Gap: 1.12396192551 seconds Bar Bar55 56
Bars Made: 3100
Bars Made: 3200
Bars Made: 3300
Big Gap: 1.39006495476 seconds Bar Bar25 26
Bars Made: 3400
Bars Made: 3500
Bars Made: 3600
Big Gap: 1.55699706078 seconds Bar Bar96 97
Bars Made: 3700
Bars Made: 3800
Bars Made: 3900
Big Gap: 1.49929594994 seconds Bar Bar65 66
Bars Made: 4000
Bars Made: 4100
Bars Made: 4200
Big Gap: 1.64840602875 seconds Bar Bar35 36
Bars Made: 4300
Bars Made: 4400
Bars Made: 4500
Big Gap: 1.728484869 seconds Bar Bar5 6
Bars Made: 4600
Bars Made: 4700
Bars Made: 4800

>If you wish, you can send me directly, as opposed to the whole list, a
>zip/tarball of your source for me to look at.
>Reg. Charney
> Charlie Strauss wrote:
>> The fabulous pythonistas that program for OVC probably are well aware
>> of this one, but I thought I'd share this just in case it was
>> overlooked.
>> I wrote a toy program to simulate creating or reading in ballots into
>> memory and then processing them en-mass and got subtly bitten by
>> Garbage Collection in a place that I'd guess would affect any similar
>> program.
>> To be specific, I had a structure that looked like this:
>> each race-object held ten integers.
>> each ballot-object contained 100 race objects.
>> After I serially instantiated about 1500 ballot-objects then for
>> every 350 ballots I created the computer would pause for a moment.
>> The duration of the pauses grew from half a second up to almost ten
>> seconds.
>> After looking into this I found this is normal behavior in any python
>> program cause by the garbage collection.
>> So anyhow the long and the short is that every 350 ballots I created
>> GC fired and then scanned every single object. This is a very slow
>> process when you have a lot of these. In this case, for every
>> thousand ballots I had 100 races each holding ten objects. So that's
>> 1 million objects for 1000 ballots.
>> the GC was by orders of magnitude slower than all the operations I
>> wanted to do on the ballots! Indeed it was scanning 100% of the
>> memory so it got slower and slower every 350 ballots.
>> The solution was simple. disable GC.
>> Sorry if this is too much of a python-newbie observation, but I was
>> really surprised that I hit this problem at such a low number of
>> ballots. What was subtle about it was that since gc is silent, you
>> might easily over look it, since it's not a noticeable time lag at
>> low ballot numbers.
>> Note: in its default setting, GC works like this, every time you
>> create 700 new objects without deleting any, the GC scans all the new
>> objects to see if they contain cyclic references and are otherwise
>> unreachable. These are promoted to older objects, and every time you
>> create ten new older objects GC fires again and promotes these to
>> "oldest" objects, which also get scanned when the allocation minus
>> deallocation excceds ten.
>> Those trip points can be moved but that's a bandaid. unless you have
>> a data structure that can contain cyclic references I don't think gc
>> should ever be needed.
>> _______________________________________________
>> OVC-discuss mailing list
>OVC-discuss mailing list

OVC-discuss mailing list
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
Received on Tue Oct 31 23:17:03 2006

This archive was generated by hypermail 2.1.8 : Tue Oct 31 2006 - 23:17:10 CST