Surprising Python

I haven’t written anything on Python here in a good while. But that doesn’t mean I haven’t been busy wrestling with it. I’ll need to take a look at my Perforce changelists over the last months and take stock. In the meantime, I’d like to rant a bit about a most curious peculiarity of Python that I came across a while back.

The story

Here at CCP we are increasingly, as is the trend nowadays, crunching numbers on the backend to figure out statistics and trends in how our games are being played by our users, and how our servers and clients are performing.  We have a whole team of data enthusiasts using a number of tools for that purpose, both off the shelf, open source, and our own stuff.

One day, a colleague of mine came to me with an interesting problem.  He was running a series of Python scripts on a Hadoop installation to do number crunching.  He had a bunch of Python scripts and modules in a folder, running a script that then imported from a sibling module, like this:

# script1.py
import tools
tools.setup()

# script2.py
import tools, script1
tools.setup()
script1.main()

etc.  The script files were in a common folder, and the Hadoop job would run them in a straightforward manner:

/usr/bin/python jobdir/script1.py

The problem my colleague was having was that none of the imports would resolve when the job was run from Hadoop.  When he ran it manually on the unix box from the command line, all was well, but in the context of Hadoop, it was broken.  It was as though the script folder wasn’t in the search path.  He had discovered a workaround: manually adding the current directory to sys.path and making sure to cd into the job folder first:

cd jobdir
/usr/bin/python script1.py

But we were still both stumped as to why things were simply not working. By printing out sys.path we could see some sort of temporary Hadoop folder in there, which would be consistent with a per-job-instance invocation. But this path entry didn’t contain “jobdir” in it. It took us a while to figure out what was happening.

The setup

So, we started looking more closely at how the job was run.  This Hadoop version (Cloudera CDH4) would create, for each job instance, a copy of the job’s home directory.  So, simplifying the details, let’s say that we had this structure here:

/hadoop/jobs/job1/
    jobdir/
        script1.py
        script2.py
        tools.py

Then Hadoop would set up, for each instance, a temporary image:

/tmp/hadoop/tmpjobs/12345/
    jobdir/
        script1.py
        script2.py
        tools.py

Now, this temporary image would actually be a linktree.  This is a directory structure that contains real directories but virtual files, each file a symbolic link to the original.  Linktrees are well known things and often used to make cheap copies of file structures.  Mercurial, for instance, uses linktrees when cloning repositories on unix systems, or so I am led to believe.
But unusually in this case, the linktree did not link to the original files in /hadoop/jobs/job1, but to a generic caching structure:

/tmp/hadoop/tmpjobs/12345/
    jobdir/
        script1.py  -> /hadoop/filecache/ffee/script1.py
        script2.py  -> /hadoop/filecache/1234/script2.py
        tools.py    -> /hadoop/filecache/abcd/tools.py

The Hadoop shell command was being run with a current directory of, say, “/tmp/hadoop/tmpjobs/12345” but when we examined sys.path of the running script, astonishingly, we found that the folder “/hadoop/filecache/ffee/” was in the path!

The explanation

It turns out that Python, when adding the “home” folder of the running script to sys.path, decides to resolve any symbolic links in the path, and uses the “dereferenced” path as the search path it puts into sys.path.

Cloudera was creating a virtual copy of the workspace, using symbolic links, but each link pointed to an abstract file cache, where each file could reside in its own unique directory. So, while the linktree was correct, the underlying hierarchy used to store the files that made up the image of the workspace was nothing like the original.
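The effect can be reproduced in miniature without Hadoop.  What follows is only a sketch: the directory names are stand-ins modelled on the simplified layout above, the function name is made up for illustration, and it assumes a unix-like system where symbolic links can be created.  It builds a one-file "cache", a linktree pointing into it, runs the linked script, and reports what that script saw as sys.path[0].

```python
import os
import subprocess
import sys
import tempfile


def run_symlinked_script():
    """Run a script through a symlink and report the sys.path[0] it sees.

    Returns (reported_dir, real_cache_dir, apparent_job_dir).
    """
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical stand-ins for the file cache and the per-job linktree.
        cache_dir = os.path.join(root, "filecache", "ffee")
        job_dir = os.path.join(root, "tmpjobs", "12345", "jobdir")
        os.makedirs(cache_dir)
        os.makedirs(job_dir)

        # The real file lives in the cache and simply prints sys.path[0].
        real_script = os.path.join(cache_dir, "script1.py")
        with open(real_script, "w") as f:
            f.write("import sys\nprint(sys.path[0])\n")

        # The job directory only holds a symbolic link to it.
        link = os.path.join(job_dir, "script1.py")
        os.symlink(real_script, link)

        # Invoke the script by its apparent (linktree) name.
        result = subprocess.run(
            [sys.executable, link],
            capture_output=True,
            text=True,
            check=True,
        )
        reported = result.stdout.strip()
        # realpath() here only normalises the temp root for comparison.
        return reported, os.path.realpath(cache_dir), os.path.realpath(job_dir)


if __name__ == "__main__":
    reported, cache, job = run_symlinked_script()
    print("sys.path[0] seen by script:", reported)
    print("cache directory:           ", cache)
    print("apparent job directory:    ", job)
```

On the CPython builds I have tried, the reported directory matches the cache directory rather than the job directory, which is exactly what bit us.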

The defect

As a unix veteran, I’m familiar with symbolic links.  I know how they work and  I also know that using symbolic links is supposed to be transparent to your application.  So, I filed a defect with bugs.python.org.  And promptly, the bug was brushed off as a feature.

You see, this particular behaviour was actually designed to facilitate a peculiar use-case for some users in the Unix world.  The use case is having separate “applications“, if you will, each residing in their own folder, but then creating a “script” folder somewhere with convenient symlinks to the “main” scripts of each application.  Like this:

/app1/
    app1.py
    applibrary1.py
/app2/
    app2.py
    anotherapplibrary2.py
/script/
    app1.py -> /app1/app1.py
    app2.py -> /app2/app2.py

In order to enable people to create a script shortcut folder like this, Python was actively dereferencing the real folder of the script it was running to give it access to its libraries, rather than adding the folder where the file appeared to reside.   What it does is:

  1. Take the provided filename of the script and call the realpath() API to get the physical location
  2. Extract the directory path of that and put it at the front of sys.path
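In Python terms, those two steps amount to roughly the following.  This is a sketch of the behaviour, not the interpreter's actual startup code (which is written in C); both function names are invented here for illustration.  The second function shows the alternative I would have expected, which leaves symbolic links alone.

```python
import os


def resolved_script_dir(script_filename):
    """Roughly what Python does: dereference symlinks, then take the directory."""
    real_location = os.path.realpath(script_filename)  # step 1
    return os.path.dirname(real_location)              # step 2


def apparent_script_dir(script_filename):
    """The alternative: the directory the script appears to live in."""
    # abspath() normalises the path but does not follow symbolic links.
    return os.path.dirname(os.path.abspath(script_filename))
```

With the shortcut folder above, resolved_script_dir("/script/app1.py") would give /app1, while apparent_script_dir("/script/app1.py") would give /script.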

And this is surprising.  Because the defect was actually a feature, even though a very obscure one, it won’t get fixed.  The recommended workaround is to manually modify the scripts to tweak sys.path, as we originally did when we encountered this glitch.
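One way to write that tweak, as a minimal sketch: the helper name is hypothetical, and it assumes the script is started by file name so that sys.argv[0] holds the path as the user typed it.  It would go at the top of each script, before any sibling imports.

```python
import os
import sys


def add_apparent_script_dir(argv0, search_path):
    """Put the directory the script appears to live in at the front of search_path.

    argv0 is the script name as given on the command line; symbolic links
    in the file name are deliberately not resolved.
    """
    # abspath() makes a relative name absolute using the current directory;
    # it does not dereference a symlinked script file.
    apparent_dir = os.path.dirname(os.path.abspath(argv0))
    if apparent_dir not in search_path:
        search_path.insert(0, apparent_dir)
    return apparent_dir


if __name__ == "__main__":
    add_apparent_script_dir(sys.argv[0], sys.path)
    # Sibling imports such as "import tools" would follow from here.
```

Passing the list in, rather than touching sys.path inside the helper, keeps it easy to try out without disturbing the running interpreter.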

Surprising? Quite.

It is my opinion that this behaviour violates the principle of least surprise.  Symbolic links are usually the domain of the file system and they are designed to construct an apparent file structure out of a real one.  Applications that look beneath the apparent towards the real are usually limited to utilities such as file system tools.  Not user applications.  User applications should believe the illusion that the user and the operating system presents to them.

The way that Python adds the directory of the script file to sys.path is akin to the way other programming utilities work.  The C Preprocessor, for instance, allows the #include “foo” syntax to include from the directory of the file doing the include.  The apparent directory.  Because if the file doing the include actually is a symlink pointing to a file in another directory, the preprocessor doesn’t look there.

To attach semantic meaning to symbolic links in the filenames, even if it is convenient to a group of users, is surprising to the rest of the users.  And it violates the concept of layering where the user and operating system produce the file structure, and the application consumes it.

Summary

Python has a very odd special case where it interprets the real location (as opposed to apparent location) of the script file it is executing as a place to add to sys.path.  Thus it is attaching semantics to symbolic links and precluding the user, in some cases, from presenting a virtual hierarchy of files to Python for execution.  Strange stuff.

4 thoughts on “Surprising Python”

  1. I agree it’s weird, but Guido has a point there:
    not breaking the contract with users (the same approach as the Linux kernel).

    The last time Python broke it was with strings in Python 3, and the debacle about that isn’t finished to this day.

    • It’s a misconceived contract though, IMHO, and I would like to see us take steps to correct our mistakes. There are ways to do it, optional flags, for example.
      But I’m not going to fight that battle, it is a relatively minor issue as it is. The point of my post was to point out a very surprising, and very obscure, feature. No one seemed to even know about it, except the BDFL 🙂

  2. Let me argue that it’s not an obscure feature but a widely applicable unixy behavior.
    Consider *any* program whose installation consists of more than one file. One (at least) is an executable, which you would like to install somewhere on $PATH (/usr/bin, a personal ~/bin etc.). The other supporting files (whether data or dynamically loaded code) do not belong there. How will the executable find them?

    One approach is hardcoding: something like “./configure --prefix=/usr; make” creates an executable that always looks under /usr/lib/foo and/or /usr/share/foo. A common variant is not compiling the path into the main binary but generating a small wrapper script that knows the path.

    But that’s not good enough.
    (1) If you’re developing the program, you want to test it without installing in a centralized location.
    (2) The same applies if you’re a user who just checked it out from version control. You might prefer to just run it from there: less of a mess than installing, and you can just “git pull” the newest version without rebuilding (esp. if it’s purely interpreted code).
    (3) Remote execution (like Hadoop) is easier if your program can run unmodified from any location.

    The obvious — unavoidable — solution is the executable inspecting argv[0], e.g.
    do_something $(dirname $0)/supporting-file

    But again how do you install it into $PATH without dragging the supporting files? Copying or hardlinking doesn’t leave a trace of where to look for the supporting files, and has to be repeated if you modify/upgrade the program. But symlinking is perfect! For it to work, the program must do:
    do_something $(dirname $(readlink -f $0))/supporting-file
    Indeed I’ve seen this pattern in many programs, and it is precisely what Python automates when filling in sys.path[0]!

    [P.S. There is also the self-extracting archive approach, which gives you a standalone fully relocatable binary. This is handy for remote deployment, but it’s pretty ugly… Python has builtin zipimport allowing you to avoid extraction — but only for pure python, extension modules must be extracted since you can’t dlopen() from memory.]

    =>
    IMHO the Right solution would be for Hadoop to use a hardlink tree rather than a symlink tree. Hardlinks are truly transparent, so python, or any other app, would not break on them.
    [I believe most uses of linktrees to save space, including mercurial clones, employ hardlinks. Mostly because they’re symmetric – you can delete the original tree and the clone would not be broken.]

    • “But again how do you install it into $PATH “
      The answer is that you don’t. You add the directory to $PATH.
      Symbolic links aren’t “shortcuts”. Shortcuts need help from the shell to manually translate them to the proper place. The “dereferencing” of the link should not be the job of the executable. It is a terrible, terrible, hack if the program starts performing the job of the shell for the convenience of a subset of the users.

      You mention a number of programs that have this property. I’m curious as to what they are. I gave a pretty good counter-example with the world’s most ubiquitous programming language.

      I think that if Python, or any language, decides to perform this sort of shortcut patching for the shell, it should at least be enabled via an optional argument.

      As for hard links, nobody in their right mind uses hard links after symbolic links were invented 🙂
