I haven’t written anything on Python here in a good while. But that doesn’t mean I haven’t been busy wrestling with it. I’ll need to take a look at my Perforce changelists over the last few months and take stock. In the meantime, I’d like to rant a bit about a most curious peculiarity of Python that I came across a while back.
Here at CCP we are increasingly, as is the trend nowadays, crunching numbers on the backend to figure out statistics and trends in how our games are being played by our users, and how our servers and clients are performing. We have a whole team of data enthusiasts using a number of tools for that purpose: off-the-shelf products, open source software, and our own stuff.
One day, a colleague of mine came to me with an interesting problem. He was running a series of Python scripts on a Hadoop installation to do number crunching. He had a bunch of Python scripts and modules in a folder, running a script that then imported from a sibling module, like this:
# script1.py
import tools
tools.setup()
# script2.py
import tools, script1
tools.setup()
script1.main()
etc. The script files were in a common folder, and the Hadoop job would run them in a straightforward manner, with a command along the lines of:
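/usr/bin/python jobdir/script1.py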
The problem my colleague was having was that none of the imports would resolve when the job was run from Hadoop. When he ran it manually on the unix box from the command line, all was well, but in the context of Hadoop, it was broken. It was as though the script folder wasn’t in the search path. He had discovered a workaround: manually adding the current directory to sys.path and making sure to cd into the job folder first:
cd jobdir
/usr/bin/python script1.py
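The sys.path half of that workaround was a line or two at the top of each script, something like this:

import os
import sys

# Trust the current working directory (which the wrapper has just
# cd'd into) rather than whatever folder Python decides the script
# lives in.
sys.path.insert(0, os.getcwd())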
But we were still both stumped as to why things were simply not working. By printing out sys.path we could see some sort of temporary Hadoop folder in there, which would be consistent with a per-job-instance invocation. But this path entry didn’t contain “jobdir” in it. It took us a while to figure out what was happening.
So, we started looking a bit better at how the job was run. This Hadoop version (Cloudera CDH4) would create, for each job instance, a copy of the job’s home directory. So, simplifying the details, let’s say that we had this structure here:
/hadoop/jobs/job1/
    jobdir/
        script1.py
        script2.py
        tools.py
Then Hadoop would set up, for each instance, a temporary image:
/tmp/hadoop/tmpjobs/12345/
    jobdir/
        script1.py
        script2.py
        tools.py
Now, this temporary image would actually be a linktree. This is a directory structure that contains real directories but virtual files, each file a symbolic link to the original. Linktrees are well-known things and often used to make cheap copies of file structures. Mercurial, for instance, uses linktrees when cloning repositories on unix systems, or so I am led to believe.
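To make the idea concrete, here is a rough sketch of how one might build such a linktree in Python (mirror_tree is just an illustrative name of mine, not anything Hadoop or Mercurial actually ship):

import os

def mirror_tree(src, dst):
    # Recreate the directory structure of src under dst, but make
    # each file a symbolic link back to its original in src.
    for root, dirs, files in os.walk(src):
        target = os.path.join(dst, os.path.relpath(root, src))
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in files:
            os.symlink(os.path.join(root, name),
                       os.path.join(target, name))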
But unusually in this case, the linktree did not link to the original files in /hadoop/jobs/job1, but to a generic caching structure:
/tmp/hadoop/tmpjobs/12345/
    jobdir/
        script1.py -> /hadoop/filecache/ffee/script1.py
        script2.py -> /hadoop/filecache/1234/script2.py
        tools.py -> /hadoop/filecache/abcd/tools.py
The Hadoop shell command was being run with a current directory of, say, “/tmp/hadoop/tmpjobs/12345”, but when we examined sys.path of the running script, astonishingly, we found that the folder “/hadoop/filecache/ffee/” was in the path!
It turns out that Python, when adding the “home” folder of the running script to sys.path, decides to resolve any symbolic links in the path, and uses the “dereferenced” path as the entry it adds to sys.path.
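The behaviour is easy to reproduce on any unix box. Here is a self-contained demonstration (the file and folder names are mine, not anything from the Hadoop setup):

import os
import subprocess
import sys
import tempfile

root = tempfile.mkdtemp()
real = os.path.join(root, "real")
link = os.path.join(root, "link")
os.mkdir(real)
os.mkdir(link)

# A one-line script that reports where Python thinks it lives
with open(os.path.join(real, "show.py"), "w") as f:
    f.write("import sys; print(sys.path[0])\n")
os.symlink(os.path.join(real, "show.py"),
           os.path.join(link, "show.py"))

# Run the script through the symlink: it prints .../real, not .../link
subprocess.call([sys.executable, os.path.join(link, "show.py")])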
Cloudera was creating a virtual copy of the workspace, using symbolic links, but each link pointed to an abstract file cache, where each file could reside in its own unique directory. So, while the linktree was correct, the underlying hierarchy used to store the files that made up the image of the workspace was nothing like the original.
As a unix veteran, I’m familiar with symbolic links. I know how they work and I also know that using symbolic links is supposed to be transparent to your application. So, I filed a defect with bugs.python.org. And promptly, the bug was brushed off as a feature.
You see, this particular behaviour was actually designed to facilitate a peculiar use case for some users in the Unix world. The use case is having separate “applications”, if you will, each residing in their own folder, but then creating a “script” folder somewhere with convenient symlinks to the “main” scripts of each application. Like this:
/app1/
    app1.py
    applibrary1.py
/app2/
    app2.py
    anotherapplibrary2.py
/script/
    app1.py -> /app1/app1.py
    app2.py -> /app2/app2.py
In order to enable people to create a script shortcut folder like this, Python was actively dereferencing the symlink to find the real folder of the script it was running, giving the script access to its libraries, rather than adding the folder where the file appeared to reside. What it does is this (see the sketch after the list):
- Take the provided filename of the script and call the realpath() API to get the physical location
- Extract the directory path of that and add it to sys.path
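In Python terms the effect is roughly this; a sketch of the logic, not the interpreter’s actual startup code:

import os.path
import sys

script = sys.argv[0]                         # e.g. /script/app1.py, a symlink
real_dir = os.path.dirname(os.path.realpath(script))  # -> /app1
sys.path.insert(0, real_dir)                 # this entry ends up first in sys.path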
And this is surprising. Because the defect was actually a feature, even though a very obscure one, it won’t get fixed. The recommended workaround is to manually modify the scripts to tweak sys.path, as we originally did when we encountered this glitch.
It is my opinion that this behaviour violates the principle of least surprise. Symbolic links are usually the domain of the file system, and they are designed to construct an apparent file structure out of a real one. Applications that look beneath the apparent towards the real are usually limited to utilities such as file system tools. Not user applications. User applications should believe the illusion that the user and the operating system present to them.
The way that Python adds the directory of the script file to sys.path is akin to the way other programming utilities work. The C preprocessor, for instance, allows the #include "foo" syntax to include from the directory of the file doing the include. The apparent directory. Because if the file doing the include is actually a symlink pointing to a file in another directory, the preprocessor doesn’t look there.
To attach semantic meaning to symbolic links in the filenames, even if it is convenient to a group of users, is surprising to the rest of the users. And it violates the concept of layering where the user and operating system produce the file structure, and the application consumes it.
Python has a very odd special case where it interprets the real location (as opposed to the apparent location) of the script file it is executing as a place to add to sys.path. Thus it attaches semantics to symbolic links and, in some cases, precludes the user from presenting a virtual hierarchy of files to Python for execution. Strange stuff.