kmod's blog

19May/20

Python performance: it’s not just the interpreter

I have a particular view of Python performance from my experience on the Pyston project, and since this view is somewhat nonstandard I wanted to take some time to explain it and give a motivating example.

A common explanation for Python's low performance is that it is an interpreted language. In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark. Instead we'll see that dynamic features -- particularly, features inside the runtime -- are to blame.

People often consider the interpreter and runtime together since they are tightly linked and are included together, but in the context of Python optimization they are somewhat different. There isn't an official delineation between the two, but in my mind the Python interpreter is the code in ceval.c which evaluates and dispatches Python opcodes. (You may also consider the parser and compiler as part of the interpreter, but they are typically ignored.) The runtime is the code that implements Python semantics. It surprises some people that the interpreter knows relatively little about how Python objects behave: it simply knows that a BINARY_ADD opcode corresponds to a call to the runtime PyNumber_Add function. (It does have some fast paths for common cases, but those still largely call into the runtime for their implementation.)

There have been many attempts to optimize the Python interpreter, which aim to eliminate the bytecode dispatching and temporary variable management overheads. In contrast, this post is about the potential gains from optimizing the execution of Python semantics.

tldr

In this particular microbenchmark, the time goes to:

  • 31% argument passing
  • 31% "real work"
  • 13% int boxing
  • 11% interpreter overhead
  • 8% recursion-limit checking
  • 6% dynamic type checking

[I've excluded some "easily avoidable" overheads from this list.]

I picked a particular microbenchmark that I knew had inefficient argument passing, but it's still interesting to see that that cost is almost 3x the cost of using an interpreter. I was also surprised that recursion-limit checking played such a role, since it is not often mentioned.

I'm calling argument passing a "dynamic feature" because in this case it is completely avoidable. Python argument passing can be expensive because it requires packing up arguments into a tuple allocated for this purpose, as well as interpretation of a format string to determine how to parse the arguments on the callee side.

The microbenchmark

I've collected these numbers from a simple microbenchmark which I pulled from a real-world use case (django templating); I'm not claiming that it's particularly representative, just that it's illustrative and an interesting data point.

Typically a program's performance is analyzed by running a profiler to see what parts of the program take the most time. I'm going to take a bit of a different approach: I'm going to iteratively optimize the benchmark and see how much each optimization helps. I believe this gives a much clearer view of the cost of each feature that we optimize away, as well as providing hints as to what it would take to do these optimizations automatically.

The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we'll get into. The motivation for this is to imagine that you have a template with many variables in it, all of which need to be converted to strings for rendering.

Here's the code:

def main():
    for j in range(20):
        for i in range(1000000):
            str(i)

main()

(You can find all the versions on my github)

The doubly-nested loops will come in handy later.

In terms of methodology, I'm simply running all of these benchmarks on my development machine, doing three runs of each benchmark and reporting the median. The performance differences are pretty stark, so we don't need extremely precise or accurate benchmarking, and I kept things simple. Unless mentioned otherwise, I'm running all benchmarks via my Ubuntu-provided python3 binary, which is Python 3.7.3.

Results

On my machine, the benchmark takes 2.33s

We can do a simple optimization: referencing str every iteration of the loop forces the interpreter to do a somewhat-expensive global variable lookup, which is slower than a local variable lookup. We can cache the str object into a local variable, and this brings the benchmark down to 2.07s.
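
Here's a sketch of what that looks like (the exact version is in the linked repo; the local variable name is my own):

def main():
    _str = str          # cache the global lookup into a cheap local
    for j in range(20):
        for i in range(1000000):
            _str(i)

main()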

The next goal is to move the for loop out of Python and into C, and in this case we can do that with the map() function. The benchmark is now

for j in range(20):
    list(map(str, range(1000000)))

This version does more work since it is creating a list of all the strings. Perhaps surprisingly, the wins from removing the interpreter almost outweigh this extra work, and this new version comes in at 2.11s.

As a point of reference, if we create the same list via a list comprehension, the benchmark takes 2.34s. The optimization from 2.34s to 2.11s by using map lets us calculate the "interpreter overhead" as 10% of the program's execution. 10% is large in many situations, but is not nearly enough to explain Python's reputation for being slow.
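
For reference, the list-comprehension version I'm comparing against is simply (my transcription of the variant described above):

for j in range(20):
    [str(i) for i in range(1000000)]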

To proceed further, we're going to need to move into C extension territory. I ran Cython (a Python->C converter) on the previous benchmark, and it runs in exactly the same amount of time: 2.11s. I wrote a simplified C extension in 36 lines compared to Cython's 3600, and it too runs in 2.11s. [The variance in the benchmark runs is about 0.03s, so getting the same exact median time seems lucky.]

For reference here's the C code.

for (int i = 0; i < 20; i++) {
    PyObject* a = PyTuple_Pack(1, PyLong_FromLong(1000000));
    PyObject* r = PyObject_Call((PyObject*)&PyRange_Type, a, NULL);
    Py_DECREF(a);
    a = PyTuple_Pack(2, (PyObject*)&PyUnicode_Type, r);
    Py_DECREF(r);
    PyObject* m = PyObject_Call((PyObject*)&PyMap_Type, a, NULL);
    Py_DECREF(a);
    a = PyTuple_Pack(1, m);
    Py_DECREF(m);
    PyObject* l = PyObject_Call((PyObject*)&PyList_Type, a, NULL);
    Py_DECREF(a);
    Py_DECREF(l);
}

The next thing I did was to eliminate the list again, and simply iterate through the map object. This brings the time down to 1.86s.

Then I included and inlined the map iteration function. This didn't affect the runtime, which is now 1.87s.

The next thing to optimize is the feature that map() can take a variable number of arguments, which slows down its processing slightly. Hard-coding that this map has exactly one argument reduces the time slightly to 1.82s.

The next thing I did was to hoist some of the map iteration code out of the loop: map_next() typically has to reread the relevant variables from memory every iteration, so I thought doing that once outside the loop would help. Surprisingly, the runtime is the same: 1.82s.

Next I copied and inlined _PyObject_FastCallDict and rangeiter_next. Surprisingly, switching to the copied version of _PyObject_FastCallDict slowed the program down noticeably, to 1.88s. My hypothesis is that this is because my extension module doesn't have PGO enabled, whereas I believe the Ubuntu Python build does have PGO enabled, so the copied code now has fewer optimizations.

Next I optimized _PyObject_FastCallDict for this particular case, removing some checks that are constant once we know the function we are calling. This simulates static typing, or speculative type inference in the case of a JIT. This didn't make much of a difference: 1.87s.

Now we're getting to the heart of the problem. str() is slow because it's not a function: it's a type. And types in Python have complex calling behavior, since they execute Python's two-phase construction, allowing overrides at each step. This is rather unfortunate for common type conversions, since they do not do much work but invoke fairly expensive constructor behavior. I inlined type_call as well as unicode_new, and the performance is roughly the same: 1.86s.
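
To illustrate the two-phase construction: calling any type goes through machinery that lets subclasses hook both phases, and a plain str(i) has to allow for that possibility. A minimal example (my own, not code from the benchmark):

class LoudStr(str):
    def __new__(cls, value):
        print("__new__ runs first")    # phase 1: construct the object
        return super().__new__(cls, value)

    def __init__(self, value):
        print("__init__ runs second")  # phase 2: initialize it

s = LoudStr(42)   # both hooks fire; str(42) has to pay for the possibility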

Now that we have all the code in one place, we can see both the argument packing and parsing+unpacking. Python has a traditional calling convention, and a faster newer calling convention. Unfortunately calling a type falls back to the slower calling convention, requiring the allocation of an args tuple, and parsing it on the receiving side. I optimized the current benchmark by hard-coding the argument parsing. This had a large impact: the runtime is now 1.46s.

I inlined PyObject_Str, and this again slowed performance down: 1.52s. Again, I think this is due to the lack of PGO.

Next I optimized based on the assumption that str(int) returns a string object. This reduced the runtime to 1.40s.

Now that we've removed enough code, we can finally make another big optimization: no longer allocating the args tuple. This cuts runtime down to 1.15s.

Next I hoisted the recursion-checking code out of the loop: 0.98s

Including and inlining long_to_decimal_string: 0.94s

Converting the range iteration to a C for loop: 0.91s

Removing the unicode conversion (the real work): 0.27s

Not allocating a Python int for each iteration of the loop: 0.0s

So that's everything

For reference: NodeJS takes 1.38s to run an equivalent benchmark. And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.

To be continued...

I believe that these findings point towards a certain direction for Python optimization. I have some ideas in this space and hopefully you'll hear more about them on this blog!

Filed under: Pyston
22Jan/19

PoolJacking: easy 51% attacks against Bitcoin and Ethereum

With the recent Ethereum Classic double-spend attack, I decided to finally write this post about an existing vulnerability in Bitcoin and Ethereum. The attack vector has been used in the past to steal a small amount of Bitcoin, but it can be used more cleverly to pull off 51% attacks and double-spend an unbounded amount of cryptocurrency.

This blog post talks about Bitcoin, but Ethereum and some other coins have the exact same vulnerability.

Background: Stratum

Bitcoin has a lot of parts of it that are secure. And it has some parts that are not secure. Unfortunately, having lots of secure parts doesn't count for much if you have parts that are not secure.

In particular, the Bitcoin ecosystem contains a protocol called Stratum, which coordinates miners and their mining pools. Since 85% of mining happens via pools, 85% of the hashrate is dependent on Stratum's security.

Stratum is notoriously insecure. Not in the sense that there are clever attacks on it, but in the sense that it contains zero security measures. It's like letting anyone with an armored truck drive up to a bank and load up the truck with money: if the request is formatted right, Stratum will let it through. Stratum has been attacked in the past by an ISP employee to steal mining rewards from their network, which is a clever, although limited, way of abusing the lack of security. You can do much more damage than that.

The ISP employee simply redirected where the mining rewards went. But with Stratum, you also control what block the miners are mining. This means that if you can intercept Stratum, you can control the entire hashpower of a mining pool. Not only does this still steal the mining rewards, but if done on a large enough scale it allows 51% attacks, which are devastating to the network as a whole.

I call this PoolJacking: using Stratum to hijack an entire pool at a time, and then controlling their hashpower to rewrite the blockchain.

It's not a total break of Bitcoin's security, but PoolJacking allows attackers to double-spend, DDOS the entire network, or simply collect $5MM a day in mining rewards.

It's exploitable

This attack relies on intercepting specific internet traffic. There are a few ways to do it, and easiest among them is BGP hijacking (another is DNS poisoning). BGP hijacking allows an attacker to use BGP, the core internet routing protocol, to redirect where traffic is sent, and already happens thousands of times a day.

BGP access is limited, but still broadly available. There are 62,970 licenses that have been given out to send BGP updates, and there are 16k "Network Engineers" on LinkedIn, most of whom presumably have this access. It's conservative to say that there are 10k people who have the ability to pull off this attack.

Worryingly, these people are distributed throughout the world, all located in different legal and political regimes. It would be easy for Pakistan to intercept Bitcoin traffic in the same way that it intercepted YouTube traffic in 2008. I'm no expert, but it looks like North Korea has the level of access to pull this off as well.

The worst part is that while this attack is detectable, there is not much that can be done about it. An attacker only needs about 2 hours to execute a double-spend, giving a very small window for people to notice and raise confirmation requirements. Even if the attack is noticed, it will take a very long time to roll out the required protocol fixes, since this will require updating the firmware on hardware miners (or more realistically, waiting for the next generation of miners).

 

Stratum's weakness is already known to the community, yet there has been no progress in replacing it with something secure. I hope this blog post reveals just how damaging the vulnerability can be, and brings attention to this lingering problem.

Filed under: bitcoin
25Jun/17

Update on NumPy acceleration

I've been looking into accelerating NumPy using TensorFlow, and the results have been pretty interesting. Here's a quick update.

tldr: TensorFlow adds a lot of overhead and doesn't speed up CPU execution, making converting NumPy->TensorFlow less promising. TensorFlow's ability to target GPUs, though, makes up for the overheads, but not all NumPy programs are GPU-friendly.

 

Overview

The thesis of this project is that TensorFlow (TF) implements many of the sorts of optimizations that could speed up NumPy, such as GPU offload and loop fusion. So my idea has been to take the work of the TF team and use it to speed up vanilla (non-AI / DL) NumPy workloads. This is difficult because TF has a restricted programming model compared to NumPy, so the crux of the project is building a tool that can automatically convert from NumPy to TF.

The problem can be broken up into two parts: the first is the "frontend", or understanding NumPy code enough to translate it. The second is the "backend", where we emit an efficient TF program.

Coming into the project, I thought the work would be in the frontend. Understanding NumPy code, or Python code in general, is a very tough problem, but I think I have a promising dynamic analysis approach that could crack this for many kinds of programs. As I'll talk about in the rest of this post, though, this didn't end up being the project blocker. I'm happy with the dynamic analysis system I wrote, but I'll have to leave that to a future blog post.

k-means

The first challenge in a project like this is to set up a good goal. It's easy to find some code that your project can help, but to be more than an intellectual curiosity it would have to help out "real" code.

I looked around to find a benchmark program that I could use as a yardstick. I settled on scikit-learn, and specifically their k-means implementation. k-means is a clustering algorithm, and Wikipedia has a good overview.

I immediately ran into a few issues trying to speed up scikit-learn's k-means algorithm:

  • scikit-learn uses Elkan's algorithm, which uses a neat triangle-inequality optimization to eliminate much of the work that has to be done. This is a pretty clever idea, and is helpful when run on a CPU, but unfortunately for this project it means that the algorithm is not amenable to vectorization or GPU offload. Drat.
  • Even scikit-learn's implementation of the straightforward algorithm is hand-optimized Cython code. This makes it fast, but also not quite suitable for Python-analysis tools, since it is not quite Python.

I converted their straightforward Cython implementation to Python so that it could run through my system, but the bar has become quite high: not only does my tool need to match the low-level optimizations that Cython does, but it needs to beat the higher-level algorithmic optimizations as well.
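
For reference, here is roughly what the straightforward (non-Elkan) inner loop looks like in plain NumPy -- my own sketch of the textbook algorithm, not the scikit-learn or converted code:

import numpy as np

def assign_and_accumulate(X, centers):
    # X: (n, d) points; centers: (k, d) current centroids
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)          # nearest center for each point

    sums = np.zeros_like(centers)
    counts = np.zeros(len(centers))
    np.add.at(sums, labels, X)             # scatter-add points into their clusters
    np.add.at(counts, labels, 1)
    return labels, sums, counts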

Translating k-means to TF

Although the overall goal is to automatically convert NumPy to TF, I started by doing the conversion manually to see the resulting performance. I had to start with just the inner loop, because the outer loop uses some functionality that was not present in TF. TF has a function (scatter_add) that exactly implements the semantics of the inner loop, so I thought it would have an optimized implementation that would be very fast.
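
For concreteness, the mapping onto scatter_add looks roughly like this in the TF 1.x API of the time -- shapes and names are my own illustration, not the experiment's code:

import numpy as np
import tensorflow as tf   # written against the TF 1.x API

k, n, d = 8, 1000, 16
X = tf.constant(np.random.rand(n, d).astype(np.float32))
labels = tf.constant(np.random.randint(0, k, size=n))

# scatter_add accumulates each row of X into the row of centers_sum selected
# by the corresponding label -- the semantics of the k-means inner loop.
centers_sum = tf.Variable(tf.zeros([k, d]))
update = tf.scatter_add(centers_sum, labels, X)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)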

In fact, it was not. It was much slower. This was for a few reasons:

  • TF makes a lot of copies. The situation (at least at the time) was bad enough that it gives a bit of credibility to the rumor that the public TF code is crippled compared to what Google uses internally.
  • scatter_add supports arbitrary-rank tensors and indices, whereas the scikit-learn version hard-codes the fact that it is a rank-2 matrix.
  • It looks like TF uses an abstraction internally where they first record the updates to execute, and then execute them. Since scatter_add is (or should be) entirely memory-bound, this is a very expensive overhead.

I fixed the first issue, but the other two would be relatively hard to solve. It looks like TF will not be able to beat even moderately hand-optimized Cython code, at least on CPUs.

GPU

TensorFlow still has an ace up its sleeve: GPU offload. This is what makes the restrictive programming model worthwhile in the first place. And as promised, it was simply a matter of changing some flags, and I had k-means running on the GPU.

But again, it was not any faster. The main problem was that the copy-elision I had done for the CPU version did not apply to CPU->GPU offload. I had to change my TF code to keep the input matrix on the GPU; this is an entirely valid optimization, but would be extremely hard for a conversion tool to determine to be safe, so I feel like it's not quite fair.
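
The change amounts to making the input GPU-resident up front instead of feeding it from host memory every step -- roughly like this in TF 1.x terms (names are my own; this is a sketch, not the actual experiment code):

import numpy as np
import tensorflow as tf   # TF 1.x-style API

X_host = np.random.rand(1000000, 16).astype(np.float32)

# Storing the input in a Variable created on the GPU means it is copied
# host->device once, at initialization time, rather than on every step.
with tf.device("/gpu:0"):
    X_gpu = tf.Variable(X_host, trainable=False)

with tf.Session() as sess:
    sess.run(X_gpu.initializer)    # the single upload
    # ... ops that read X_gpu now run without re-copying the data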

Once I fixed that, if I cranked up the problem size a lot, I started to see speedups! Nothing amazing (about 25% faster), and I didn't pursue this too far since the setup was starting to feel pretty synthetic to me. But it was quite nice to see that even on a memory-bound program -- which GPUs can help with, but not as much as they can help with compute-bound programs -- there was a speedup from running on the GPU. Perhaps on other algorithms the speedup would be even greater.

Takeaways

Even though the end result wasn't that impressive, I think a lot was learned from this experiment:

  • Using a very-popular algorithm as a benchmark has drawbacks. It hides a large benefit of an automated tool like this, which is to help speed up programs that have not already been hand-tuned.
  • TensorFlow is not a great fit for memory-bound computations, but it can still speed them up by using a GPU.
  • TensorFlow does not automatically make algorithms GPU-friendly.

I don't know when I'll come back to this project, mostly because I am less convinced that it will be worthwhile. But if it does turn out that people really want their NumPy code accelerated, I'll take a look again since I think this approach is promising.

11Apr/17

Monitor crashes

I've gotten a new problem in my life: my monitor has started crashing. To be fair, the steps that cause it are fairly esoteric (using the USB ports, then switching the video input), but the failure mode is pretty remarkable: the monitor becomes completely unresponsive. As in, I can't switch the video mode again. And most remarkably, I can't turn the monitor off. The power button becomes unresponsive, and I have to power cycle the monitor by unplugging it.

And this isn't a cheap monitor either, this is a high-end Dell monitor. As a software engineer I usually feel pretty confident in our profession, but this was certainly a reminder about how even things taken for granted still have software and can fail!

27Feb/17

Persuasiveness and selection bias

I happened to be watching the Oscars last night, and I was pretty shocked to see the mistake with the Best Picture award. Thinking back on it, this is a bit surprising to me: many things are happening that should be more "shocking" (all the craziness in Washington) but don't seem to affect me the same way.

I think this comes down to selection bias: the internet has made it so much easier to find extreme examples that the impact of them is dulled. In contrast, seeing something for yourself -- such as watching the Oscars mistake live -- has a realness to it that is much more impactful. Maybe another way of putting it is that it has become much easier to cherry-pick examples now.

I thought of some other examples of this: I don't feel very persuaded when someone says to me "there was a paper that shows X", because there's probably also a paper that shows the opposite of X. Similarly, quoting an "expert" on something doesn't mean that much to me anymore either. Particularly when their qualification is simply "[subject] expert", but even quotes from generally-respected people don't have that much impact, since I'm sure someone else famous said the opposite.

 

Maybe this is all wishful thinking. There's the meme that "a terrorist attack is more frightening than X even though X kills more people", and if true it is fairly opposite to what I'm saying here. And I don't really know how to solve the selection bias problem -- words seem to hold less value in the new internet regime where anyone can say anything they want, and it's not clear what to replace words with. Or maybe this whole thing is just me being a bit jaded. Either way, it will be interesting to see how society ends up adapting.

1Feb/17

Personal thoughts about Pyston’s outcome

I try to not read HN/Reddit too much about Pyston, since while there are certainly some smart and reasonable people on there, there also seem to be quite a few people with axes to grind (*cough cough* Python 3). But there are some recurring themes I noticed in the comments about our announcement about Pyston's future, so I wanted to try to talk about some of them. I'm not really aiming to change anyone's mind, but since I haven't really talked through our motivations and decisions for the project, I wanted to make sure to put them out there.

Why we built a JIT

Let's go back to 2013 when we decided to do the project: CPU usage at Dropbox was an increasingly large concern. Despite the common wisdom that "Python is IO-bound", requests to the Dropbox website were spending around 90% of their time on the webserver CPU, and we were buying racks of webservers at a worrying pace.

At a technical level, the situation was tricky, because the CPU time was spread around in many areas, with the hottest areas accounting for a small (single-digit?) percentage of the entire request. This meant that potential solutions would have to apply to large portions of the codebase, as opposed to something like trying to Cython-ize a small number of functions. And unfortunately, PyPy was not, and still is not, close to the level of compatibility to run a multi-million-LOC codebase like Dropbox's, especially with our heavy use of extension modules.

So, we thought (and I still believe) that Dropbox's use-case falls into a pretty wide gap in the Python-performance ecosystem, of people who want better performance but who are unable or unwilling to sacrifice the ecosystem that led them to choose Python in the first place. Our overall strategy has been to target the gap in the market, rather than trying to compete head-to-head with existing solutions.

And yes, I was excited to have an opportunity to tackle this sort of problem. I think I did as good a job as I could to discount that, but it's impossible to know what effect it actually had.

Why we started from scratch

Another common complaint is that we should have at least started with PyPy or CPython's codebase.

For PyPy, it would have been tricky, since Dropbox's needs are both philosophically and technically opposed to PyPy's goals. We needed a high level of compatibility and reasonable performance gains on complex, real-world workloads. I think this is a case that PyPy has not been able to crack, and in my opinion is why they are not enjoying higher levels of success. If this was just a matter of investing a bit more into their platform, then yes it would have been great to just "help make PyPy work a bit better". Unfortunately, I think their issues (lack of C extension support, performance reliability, memory usage) are baked into their architecture. My understanding is that a "PyPy that is modified to work for Dropbox" would not look much like PyPy in the end.

For CPython, this was more of a pragmatic decision. Our goal was always to leverage CPython as much as we could, and now in 2017 I would recklessly estimate that Pyston's codebase is 90% CPython code. So at this point, we are clearly a CPython-based implementation.

My opinion is that it would have been very tough to start out this way. The CPython codebase is not particularly amenable to experimentation in these fundamental areas. And for the early stages of the project, our priority was to validate our strategies. I think this was a good choice because our initial strategy (using LLVM to make Python fast) did not work, and we ended up switching gears to something much more successful.

But yes, along the way we did reimplement some things. I think we did a good job of understanding that those things were not our value-add and of treating them appropriately. I still wonder if there were ways we could have avoided more of the duplicated effort, but it's not obvious to me how we could have done so.

Issues people don't think about

It's an interesting phenomenon that people feel very comfortable having strong opinions about language performance without having much experience in the area. I can't judge, because I was in this boat -- I thought that if web browsers made JS fast, then we could do the same thing and make Python fast. So instead of trying to squelch the "hey they made Lua fast, that means Lua is better!" opinions, I'll try to just talk about what makes Python hard to run quickly (especially as compared to less-dynamic languages like JS or Lua).

The thing I wish people understood about Python performance is that the difficulties come from Python's extremely rich object model, not from anything about its dynamic scopes or dynamic types. The problem is that every operation in Python will typically have multiple points at which the user can override the behavior, and these features are used, often very extensively. Some examples are inspecting the locals of a frame after the frame has exited, mutating functions in-place, or even something as banal as overriding isinstance. These are all things that we had to support, that are used enough that we have to support them efficiently, and that don't have analogs in less-dynamic languages like JS or Lua.
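
Here's what a couple of those look like in code; these are my own toy illustrations, not Dropbox code:

import sys

def leaky():
    secret = 42
    return sys._getframe()

frame = leaky()
print(frame.f_locals["secret"])    # frame locals are still inspectable after the call returns

class AlwaysMatches(type):
    def __instancecheck__(cls, obj):
        return True                # changes what isinstance() means for this type

class Anything(metaclass=AlwaysMatches):
    pass

print(isinstance("hello", Anything))   # True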

On the flip side, the issues with Python compatibility are also quite different than most people understand. Even the smartest technical approaches will have compatibility issues with codebases the size of Dropbox. We found, for example, that there are simply too many things that will break when switching from refcounting to a tracing garbage collector, or even switching the dictionary ordering. We ended up having to re-do our implementations of both of these to match CPython's behavior exactly.

Memory usage is also a very large problem for Python programs, especially in the web-app domain. This is, unintuitively, driven in part by the GIL: while a multi-process approach will be conceptually similar to a multi-threaded approach, the multi-process approach uses much more memory. This is because Python cannot easily share its memory between different processes, both for logistical reasons, but also for some deeper reasons stemming from reference counting. Regardless of the exact reasons, there are many parts of Dropbox that are actually memory-capacity-bound, where the key metric is "requests per second per GB of memory". We thought a 50% speed increase would justify a 2x memory increase, but this is worse in a memory-bound service. Memory usage is not something that gets talked about that often in the Python space (except for MicroPython), and would be another reason that PyPy would struggle to be competitive for Dropbox's use-case.

 

So again, this post is me trying to explain some of the decisions we made along the way, and hopefully stay away from being too defensive about it. We certainly had our share of bad bets and schedule overruns, and if I were to do this all over again my plan would be much better the second time around. But I do think that most of our decisions were defensible, which is why I wanted to take the time to talk about them.

Filed under: Pyston
17Jan/17

NumPy to Theano / TensorFlow: Yea or Nay?

Hey all, I'm investigating an idea and it's gotten to the point that I'd like to solicit feedback. The idea is to use Theano or TensorFlow to accelerate existing NumPy programs. The technical challenges here are pretty daunting, but I feel like I have a decent understanding of its feasibility (I have a prototype that I think is promising). The other side of the equation is how valuable this would be. The potential benefits seem very compelling (cross-op optimizations, GPU and distributed execution "for free"), and I've heard a lot of people ask for better NumPy performance. The worrying thing, though, is that I haven't been able to find anyone willing to share their code or workflow. Not that I'm blaming anyone, but that situation makes me worried about the demand for something like this.

So, what do you think, would this be valuable or useful? Is it worth putting more time into this? Or will it be just another NumPy accelerator that doesn't get used? If you have any thoughts, or want to chime in about your experiences with NumPy performance, I'd definitely be interested to hear about it in the comments.

15Jan/17

Amazon-Walmart arbitrage

I recently ordered some junk food from Amazon, despite my wife's objections. I ordered it from an Amazon Market (aka third party) seller since that was the choice picked by Amazon for one-click ordering.

The food arrived, and the interesting thing is that it arrived in a Walmart box, with a Walmart packing slip. Evidently, someone savvy recognized that the Walmart price was lower than the Amazon price, and undercut Amazon's price using Walmart as the fulfillment. I was pretty annoyed to have been caught by this, but at the same time I have to respect that they pulled this off, and that I got the food cheaper than if they hadn't done this.

Anyway, just thought that it is interesting that people are out there doing this!

3Oct/16

What does this print, #2

I meant to post more of these, but here's one for fun:

class A(object):
    def __eq__(self, rhs):
        return True

class B(object):
    def __eq__(self, rhs):
        return False

print A() in [B()]
print B() in [A()]

Maybe not quite as surprising once you see the results and think about it, but getting this wrong was the source of some strange bugs in Pyston.

28Jul/16

Stack vs Register bytecodes for Python

There seems to be a consensus that register bytecodes are superior to stack bytecodes. I don't quite know how to cite "common knowledge", but doing a google search for "Python register VM" or "stack vs register vm" supports the fact that many people believe this. There was a comment on this blog to this effect as well.

Anyway, regardless of whether it truly is something that everyone believes or not, I thought I'd add my two cents. Pyston uses a register bytecode for Python, and I wouldn't say it's as great as people claim.

Lifetime management for refcounting

Why? One of the commonly-cited reasons that register bytecodes are better is that they don't need explicit push/pop instructions. I'm not quite sure I agree that you don't need push instructions -- you still need an equivalent "load immediate into register". But the more interesting one (at least for this blog post) is pop.

The problem is that in a reference-counted VM, we need to explicitly kill registers. While the Python community has made great strides to support deferred destruction, there is still code (especially legacy code) that relies on immediate destruction. In Pyston, we've found that it's not good enough to just decref a register the next time it is set: we need to decref a register the last time it is used. This means that we had to add explicit "kill flags" to our instructions that say which registers should be killed as a result of the instruction. In certain cases we need to add explicit "kill instructions" whose only purpose is to kill a register.
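
For context, here's the classic kind of code that relies on immediate destruction (a generic illustration, not Pyston code): if the temporary isn't decrefed at its last use, the file stays open longer than the author expected.

def read_config(path):
    # The temporary file object is never named; CPython closes the file as
    # soon as the object's refcount drops to zero, right after read() returns.
    return open(path).read()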

In the end it's certainly manageable. But because we use a register bytecode, we need to add explicit lifetime management, whereas in a stack bytecode you get that for free.

 

I don't think it's a huge deal either way, because I don't think interpretation overhead is the main factor in Python performance, and a JIT can smooth over the differences anyway. But the lifetime-management aspect was a surprise to me and I thought I'd mention it.