<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Codegrind &#187; Python</title>
	<atom:link href="http://jordanovski.com/tag/python/feed" rel="self" type="application/rss+xml" />
	<link>http://jordanovski.com</link>
	<description>Homepage of Dusko Jordanovski</description>
	<lastBuildDate>Fri, 18 Jun 2010 17:37:08 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Some problem solving and how it&#8217;s easier with python</title>
		<link>http://jordanovski.com/some-problem-solving-and-how-its-easier-with-python</link>
		<comments>http://jordanovski.com/some-problem-solving-and-how-its-easier-with-python#comments</comments>
		<pubDate>Mon, 23 Mar 2009 02:12:48 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[text comparison]]></category>

		<guid isPermaLink="false">http://jordanovski.com/?p=111</guid>
		<description><![CDATA[There's a project I work on that required me to make an import utility for a CRM. The import should get a comma separated values file of clients and information about clients, and save it to the database. The database is split across several tables, so in the `clients` table I normally don't keep the [...]]]></description>
			<content:encoded><![CDATA[<p>There's a project I work on that required me to make an import utility for a CRM. The import should get a comma separated values file of clients and information about clients, and save it to the database. The database is split across several tables, so in the `clients` table I normally don't keep the name of the company, but just a foreign key. Now, our client is not very good with numbers and she needed to import files in which she could enter the name of the company instead of the database ID. A spreadsheet row representing a client looks like this:</p>
<pre><code style="font-family: monaco,consolas,monospace;">FirstName | LastName | Email               | Company
John      | Doe      | johndoe@example.com | Coca Cola
</code></pre>
<p>But the database row in the `clients` table looks like this:</p>
<pre><code style="font-family: monaco,consolas,monospace;">first_name | last_name | email               | company_id
John       | Doe       | johndoe@example.com | 2
</code></pre>
<p>What I need to do is search for the company named 'Coca Cola' in the `companies` table and replace the name with its ID. This is all fine except for one problem - typos. Moreover, the user could write "Apple Computer Inc." instead of "Apple Inc.". So I needed a way to compare the input strings with the ones in the database.</p>
<p>After poking around I found out about the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> between strings, but that solved only half of my problems - the typo part. The distance would be very small between "Apple" and "Aple" but very big between "ACME International Inc." and "International ACME Inc.", and the latter two are obviously the same.</p>
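<p>For reference, here is a minimal sketch of the Levenshtein distance using the classic row-by-row dynamic programming approach. The function name levenshtein is my own choice for illustration, not something taken from the importer. It counts single-character insertions, deletions and substitutions.</p>

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # At the start of each outer iteration, previous[j] is the distance
    # between the first i-1 characters of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


typo = levenshtein("Apple", "Aple")
reordered = levenshtein("ACME International Inc.", "International ACME Inc.")
```

<p>Here typo is 1 (one missing letter), while reordered comes out much larger even though only the word order changed.</p>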
<p>I devised the following method to compare entries:</p>
<ol>
<li>Split up the terms by words and eliminate blanks</li>
<li>Get the Levenshtein distance between each word from the first term and each word from the second term. Comparing "Apple Computer Inc." with "Apple Inc." for example, will give a matrix of 6 distances. <img class="size-full wp-image-112 aligncenter" style="margin-top: 5px; margin-bottom: 5px;" title="lev_matrix" src="http://jordanovski.com/wp-content/uploads/lev_matrix.png" alt="lev_matrix" width="347" height="227" /></li>
<li>Get the shortest term (the one with fewer words, not the one with fewer characters). It has 2 words in this case. Then choose the smallest value from each <em>row</em>. When you pick the smallest <em>row</em> value, you cannot pick any more values from that <em>column</em>. This means that the word in the <em>column</em> is the best match for some word in the <em>rows</em>.</li>
<li>Add these values up and add the difference between the word count of the 2 terms - and you have a score for the similarity of the terms. If the score is zero, they are the same. We are adding +1 for each extra word, but this can be weighted if needed. The point is that we don't care much for extra words since company names can have many words in them, but they are often called by one or two words.</li>
</ol>
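<p>The four steps above can be sketched roughly like this. It is only an illustration under my own naming (greedy_score and levenshtein are hypothetical helpers, not the importer's real code), and it weights every extra word as +1, as described in step 4.</p>

```python
def levenshtein(a, b):
    """Edit distance between two words (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def greedy_score(term_a, term_b):
    # Step 1: split into words; split() without arguments drops blanks.
    # The term with fewer words supplies the rows.
    rows, cols = sorted((term_a.split(), term_b.split()), key=len)
    # Step 2: distance between every row word and every column word.
    matrix = [[levenshtein(r, c) for c in cols] for r in rows]
    # Step 3: each row takes its smallest value among the unused columns
    # (the first such column wins on ties).
    free = set(range(len(cols)))
    total = 0
    for row in matrix:
        best = min(sorted(free), key=lambda j: row[j])
        total += row[best]
        free.remove(best)
    # Step 4: add one point for every extra word in the longer term.
    return total + len(cols) - len(rows)
```

<p>With this sketch, greedy_score("Apple Computer Inc.", "Apple Inc.") gives 1: both words of the shorter term match exactly, and one point is added for the extra word.</p>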
<p style="text-align: left;">But there is a problem with step 3. If a column holds the lowest value for more than one row, we always give it to the first row, and that is not always the best choice. For instance, matching "Fast Cats" with "Fats Cats" (notice the typo) gets a total score of 4 if we match <em>Cats</em> to <em>Fats</em> (distance 1) and <em>Fast</em> to <em>Cats</em> (distance 3), which is wrong. The score is 2 if we match <em>Fast</em> to <em>Fats</em> and <em>Cats</em> to <em>Cats</em>, which is the intended solution.</p>
<p style="text-align: left;"><img class="aligncenter size-full wp-image-113" style="margin:5px;" title="fast_cats" src="http://jordanovski.com/wp-content/uploads/fast_cats.png" alt="fast_cats" width="207" height="97" /></p>
<p style="text-align: left;">So to be sure we have the best match, we always need the lowest sum in which every row and every column is used at most once. One solution is to generate all permutations of the words in the <em>columns</em>, pair each of them with a single fixed ordering of the words in the <em>rows</em>, and see which pairing has the lowest score. If there are fewer rows than columns, we need all permutations <strong>P(n,k)</strong> of the words in the <em>columns</em>, where <strong>n</strong> is the number of columns and <strong>k</strong> is the number of rows. This is an O(n!) algorithm but it's the best that I could think of - practically the same problem as finding every possible way to place 8 rooks on a chessboard without any of them attacking each other.</p>
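<p>As a rough sketch of this exhaustive search, the following uses itertools.permutations from the Python standard library to enumerate the P(n, k) candidates. The names best_score and levenshtein are hypothetical helpers chosen for illustration, and extra words are again weighted as +1 each.</p>

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance between two words (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def best_score(term_a, term_b):
    # The term with fewer words supplies the rows (k words),
    # the other one supplies the columns (n words).
    rows, cols = sorted((term_a.split(), term_b.split()), key=len)
    k = len(rows)
    # Try every ordered choice of k column words, pair it with the row
    # words in their original order, and keep the lowest sum of distances.
    best = min(
        sum(levenshtein(r, c) for r, c in zip(rows, chosen))
        for chosen in permutations(cols, k)
    )
    # One extra point per unmatched word in the longer term.
    return best + len(cols) - k
```

<p>Unlike picking row by row, this gives the same score regardless of which term comes first: best_score("Fats Cats", "Fast Cats") and best_score("Fast Cats", "Fats Cats") are both 2.</p>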
<p style="text-align: left;">And finally, here is the part where we get to write some code. I need a function that can calculate all permutations consisting of <strong>k</strong> elements out of a larger set consisting of <strong>n</strong> elements (<strong>k</strong> &lt;= <strong>n</strong>).</p>
<p style="text-align: left;">I decided first to write the algorithm in Python because it's cleaner and easier to think in, and then to rewrite it in PHP. The first attempt was really, really sucky and I won't talk about it because I'm a bit embarrassed. But I wasn't aware of a neat thing that Python has: the <strong>yield</strong> statement. The darn thing can be written in 6 lines with it:</p>
<pre><code style="font-family: monaco,consolas,monospace;">def permutations(the_set, n):
  if n==0:
    yield []
  else:
    for i in range( len( the_set ) ):
      for x in permutations( the_set[0:i] + the_set[i+1:], n-1 ):
        yield [the_set[i]]+x
</code></pre>
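<p>To see what the generator buys you, here is a small usage sketch. It repeats the function so that it runs on its own (using range so it also works on Python 3); the sample word list and the variable names are just assumptions for the demo.</p>

```python
from itertools import islice


def permutations(the_set, n):
    if n == 0:
        yield []
    else:
        for i in range(len(the_set)):
            for x in permutations(the_set[0:i] + the_set[i+1:], n - 1):
                yield [the_set[i]] + x


words = ["Apple", "Computer", "Inc."]
gen = permutations(words, 2)       # nothing has been computed yet
first = next(gen)                  # computes only the first permutation
next_two = list(islice(gen, 2))    # pulls two more on demand
remaining = list(gen)              # exhausts whatever is left
```

<p>Each call to next() resumes the function right after the last yield, so a caller that stops early never pays for the permutations it did not ask for.</p>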
<p>I will go into the yield statement later, maybe I will extend this post, but for now, I'll say that it allows you to make a function that calculates the permutations on the fly, without storing them in a huge list and then returning the list. It sort of lazy-loads the list of permutations when needed. There is no such thing in PHP (as far as I know). So here's my best shot at the function in PHP:</p>
<pre><code style="font-family: monaco,consolas,monospace;">function permutations( $array, $size )
{
  $result = array();
  $x = count($array);
  for( $i=0; $i&lt;$x; $i++ ) {
    $copy = $array; // copy: array_splice gets the arg by reference
    $item = array_splice( $copy, $i, 1 );
    if( $size == 1 )
      $result[] = $item;
    else {
      $rest = permutations( $copy , $size - 1 );
      foreach( $rest as $r )
        $result[] = array_merge($item, $r);
    }
  }
  return $result;
}
</code></pre>
<p>There really are excessive parts in the PHP code, like storing the final result and, more importantly, copying the array each time because array_splice takes the array argument by reference and modifies it (so much for orthogonality). Plus it's twice as long as the Python code and half as readable.</p>
<p>Anyway, to get back to my original problem - the solution worked in terms of accuracy (at least for the first few test cases), but I fear it's going to be slow for large datasets. I have around 7 fields to compare with each respective table of the database, each table having 100 records on average; each record is 3 words long on average, which gives 6 permutations per comparison. Importing a list of 1000 clients would require 1000*7*100*6 = 4,200,000 comparisons, plus 700,000 calls to the permutations function (not counting the recursive calls :). I still think that it's better than hammering the database with 7000 fulltext search queries, not to mention moving the database tables to MyISAM and indexing a bunch of fields. After all, it's an import. I could put in one of those useless progress indicators like when you're starting Windows.</p>
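<p>The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are mine, and it assumes both terms in every comparison have exactly 3 words, so that P(3,3) = 3! = 6.</p>

```python
from math import factorial

clients = 1000            # rows in the imported file
fields = 7                # fields matched against a lookup table
records_per_table = 100   # average records per lookup table
words_per_record = 3      # average words per record (assumed for both terms)

# P(3, 3) = 3! permutations to score for each pair of terms.
perms_per_comparison = factorial(words_per_record)

# One top-level call to the permutations function per pair of terms.
permutation_calls = clients * fields * records_per_table

# Every permutation of every pair has to be scored once.
word_set_comparisons = permutation_calls * perms_per_comparison
```

<p>That gives 700,000 top-level calls and 4,200,000 scored permutations, matching the numbers in the paragraph above.</p>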
]]></content:encoded>
			<wfw:commentRss>http://jordanovski.com/some-problem-solving-and-how-its-easier-with-python/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Configuring Django to work with your OS X (Leopard) Apache</title>
		<link>http://jordanovski.com/configuring-django-to-work-with-your-mac-os-x-apache</link>
		<comments>http://jordanovski.com/configuring-django-to-work-with-your-mac-os-x-apache#comments</comments>
		<pubDate>Fri, 27 Feb 2009 01:15:05 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Django]]></category>
		<category><![CDATA[Apache]]></category>
		<category><![CDATA[Leopard]]></category>
		<category><![CDATA[mod_python]]></category>
		<category><![CDATA[OS X]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://jordanovski.com/?p=68</guid>
		<description><![CDATA[I hope that I finally got it right, since I can see the admin interface and the media files are being served by the same development server as the site.  The machine is an Intel MacBook running OS X 10.5.6 and python 2.6.1  I suggest reading the official Django documentation on setting it [...]]]></description>
			<content:encoded><![CDATA[<p>I hope that I finally got it right, since I can see the admin interface and the media files are being served by the same development server as the site. The machine is an Intel MacBook running OS X 10.5.6 and Python 2.6.1. I suggest reading the official <a href="http://docs.djangoproject.com/en/dev/howto/deployment/modpython/">Django documentation</a> on setting it up with mod_python first. I hope that this article can fill in the gaps. Remember to change the paths and names to the ones that you use.</p>
<h3>Configure the virtual hosts</h3>
<p>In this case 'mysite' is the name of the virtual host and 'my_site' is the name of both the project and the server root directory. The server root is my /Users/discodancer/Dev/my_site directory.</p>
<pre><code style="font-family:monaco, consolas, monospace">&lt;VirtualHost *:80&gt;
  ServerAdmin jordanovskid@gmail.com
  DocumentRoot "/Users/discodancer/Dev/my_site"
  ServerName mysite
  ServerAlias mysite
  ErrorLog "/private/var/log/apache2/my_site-error_log"
  CustomLog "/private/var/log/apache2/my_site-access_log" common

  &lt;Directory "/Users/discodancer/Dev/my_site"&gt;
    Options FollowSymLinks MultiViews Includes
    AllowOverride All
    Order allow,deny
    Allow from all
  &lt;/Directory&gt;
  &lt;Location "/"&gt;
    SetHandler mod_python
    SetEnv DJANGO_SETTINGS_MODULE my_site.settings
    PythonHandler django.core.handlers.modpython
    PythonPath sys.path+['/Users/discodancer/Dev/']
  &lt;/Location&gt;

  # Do not use python interpreter for /media
  &lt;Location "/media"&gt;
    SetHandler none
  &lt;/Location&gt;

  # Do not use python interpreter for images
  &lt;LocationMatch "\.(jpg|gif|png)$"&gt;
    SetHandler None
  &lt;/LocationMatch&gt;
&lt;/VirtualHost&gt;</code></pre>
<p>Then, to allow serving of media files, you need to make a symlink from django's contrib/admin/media directory to your project. The apache user normally does not have privileges to the django installation, so you need to do this.</p>
<pre><code style="font-family:monaco, consolas, monospace">ln -s /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/contrib/admin/media \
  /Users/discodancer/Dev/my_site/media</code></pre>
<p>(The command is long, so it is split with a trailing backslash; you can paste it as is or type it on a single line :)  Then make a file apache_settings.py in your project directory/server root and paste these lines in it:</p>
<pre><code style="font-family:monaco, consolas, monospace">import os
os.environ['PYTHON_EGG_CACHE'] = '/Users/discodancer/Temp'</code></pre>
<p>The path in my case is writable by the webserver (anyone for that matter). Finally add these 2 lines in the apache httpd.conf file. They will tell apache to load the settings from the file you just created.</p>
<pre><code style="font-family:monaco, consolas, monospace">PythonInterpreter my_site
PythonImport /Users/discodancer/Dev/my_site/apache_settings.py my_site</code></pre>
<p>Restart the web server.  I suppose you already know, but the apache httpd.conf file can be found in /etc/apache2/httpd.conf and the virtual hosts file can be found in /etc/apache2/extra/httpd-vhosts.conf. This should work :) at least it did for me.  One more note: at the moment of writing there is no current MySQLdb module for Python 2.6. I am using the one that works with Python 2.5, and each time I import it, it throws a warning that the sets module is deprecated. Just ignore this, it didn't cause me any trouble. If someone can explain what it really means, I'd be grateful.</p>
]]></content:encoded>
			<wfw:commentRss>http://jordanovski.com/configuring-django-to-work-with-your-mac-os-x-apache/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
