wwd.ca

 

my little blog of no importance...

PyCon 2009, day 1

I am very happy to have come to PyCon for the third year in a row. This gathering of python geeks is a very good conference to go to. The mix of people, from basement hobbyist kids to corporate types, is great.

I got to PyCon by getting up at 3:40am this morning and taking the 5:44 flight from Montréal. The hotel is much nicer than last year; i got an awesome room on the 10th floor, with a balcony! So far the conference itself is pretty much like last year, with the talk and speaker quality varying wildly. Lunch was pleasantly improved, and served, not a buffet. The wifi pretty much just worked for me, and still works quite well, but again there are many... less technically-inclined people who keep creating ad hoc networks with the same name as the conf network. Still, it's working a lot better than last year, so far. In general, many kudos to the organizers: the number of things that need to go right for such a conference is incredible, and the registration fees are ridiculously low, yet the result is an extremely well organized conference.

Overall, i'd say that if you want to watch a few presentations, from what i saw today i'd recommend you watch the python namespaces, automated qa, coverage and the everyblock talks. The videos are surely not up yet, but i'm assured they will be at some point.

Another nice addition is that the schedule has diamond ratings which give a rough idea of the technical level.

Here's a dump of point-form notes with some commentary from the talks i attended.


Lightning Talks

These are talks strictly limited to less than 5 minutes, where it's very easy to get a slot. I have to say that most aren't really that interesting, but sometimes you do catch something that piques your curiosity enough that you want to follow up on it.

Cassandra distributed DB

Seems like something i'll want to follow up on.

pycon talk page

About Python Namespaces

This was a very cool talk, precisely the type of talks i want to hear, personally. It talked about how namespaces work in python, and, in explaining that, how python code is compiled then executed. It included some neat, usable tricks involving closures or early binding; for instance, i feel stupid for writing:


foo_re = re.compile("<some complicated regex>")
if foo_re.match(s):
    bar

when:

is_a_foo = re.compile("<some complicated regex>").match
if is_a_foo(s):
    bar

is sooo much nicer.
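
For the record, here is a runnable version of the trick. It is only a sketch: the date regex and the describe() helper are made up for illustration and did not come from the talk.

```python
import re

# Bind the compiled pattern's match method to a name once, at import time.
# The pattern (an ISO-style date) is just a stand-in for "some complicated regex".
is_a_date = re.compile(r"\d{4}-\d{2}-\d{2}$").match


def describe(s):
    # is_a_date(s) returns a match object (truthy) or None (falsy)
    if is_a_date(s):
        return "date"
    return "not a date"
```

The name now reads like the question being asked, and the attribute lookup on the compiled pattern happens once instead of on every call.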

pycon talk page

Using Windmill

Windmill is a very nice browser-side test framework for web apps that lets you easily write tests in python or javascript and automatically run them in many different browsers. It's also what we started using for our automated frontend testing.

  • very monolithic codebase atm; some windmill 2 work has started on a new codebase that would fix that, but that's on a 1-to-2-year timeline
  • canonical (?) supposedly to push some patches for ssl support soon?
  • js tests are faster
  • windmill shell
  • #windmill on freenode, http://www.getwindmill.com, http://trac.windmill.com

pycon talk page

Building a Simple Configuration-driven Web Testing Framework With Twill

Apologies for the near-uselessly succinct summary on this one, but apart from the link to PageObjects which were kind of neat, i can't say that i enjoyed this one much.

pycon talk page

Strategies For Testing Ajax Web Applications

This was basically about one man's approach to testing ajax apps. It was mostly pointing out the different tools, and his five-pronged approach to testing ajax applications:

  1. Test Data Handlers
    • this is just testing your backend hooks, e.g. with django unittests like we do
  2. Test JavaScript
    • the idea here is to use something like rhino which is a no-browser js engine, to functionally test the javascript. I'm not sure that's worth it, and i'm not sure this would work when you start sourcing all sorts of external js.
  3. Isolate UI For Testing
    • fake the backend server so that you can just look at the generated UI
  4. Automate UI Tests
  5. Gridify your test suite

I'm not sure about points 2 and 3. Point 1 is basic backend testing, which is necessary and can run without a browser (and run faster), so it's definitely something you want to do. Then you need automated testing that'll check that when you fetch page a, and then press button b, div c contains text bar and button bob is hidden.

The problem is with making sure the UI was drawn properly. You can take screenshots and automatically compare them with a stored picture. But UIs change in subtle and less-subtle ways all the time, so this would be too expensive to keep up to date. A good system would store the new screenshot and, if it's unlike the old one, tell you "test failed" and show you the new screenshot; you'd then be able to say with one click "no, this is good", and it would store the file as the new baseline. Texttest, a rather neat sw test framework, does something like this but with text output. Alas, i don't think there's anything like that for ajax web apps.

The last point, which we don't do either, also makes a lot of sense, but there's nothing easy to use with windmill yet. None of this was part of his presentation; this is just my own little commentary on the subject.
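
The "compare with the baseline, approve with one click" idea can be sketched in a few lines. This is a minimal sketch under loud assumptions: the function names are hypothetical, and it compares files byte for byte, which is far stricter than what a real screenshot comparison would want (no tolerance for anti-aliasing or font differences).

```python
import hashlib
import os
import shutil


def _digest(path):
    # Hash the file contents so two files can be compared cheaply
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def check_against_baseline(new_path, baseline_path):
    """Return True if the new file is identical to the stored baseline."""
    if not os.path.exists(baseline_path):
        return False  # no baseline yet: a human has to approve the first one
    return _digest(new_path) == _digest(baseline_path)


def approve(new_path, baseline_path):
    """The 'no, this is good' click: promote the new file to baseline."""
    shutil.copyfile(new_path, baseline_path)
```

A failing check would show the new screenshot to a person, who either fixes the code or calls approve().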

pycon talk page

Building an Automated QA Infrastructure using Open-Source Python Tools

Overview of many tools used for automating your QA systems:

  • running unit tests on every checkin
  • blaming devs automatically when they push broken code
  • running code coverage tools to ensure everything does get tested
  • etc.

This was a very good talk; he polled the audience several times to show roughly what proportion of people write tests, automate them, etc. The results were quite encouraging for the state of software, actually, with many hands going up each time (and considering a lot of people here don't actually write much code). I hope it'll have convinced many to automate more, include coverage testing, etc. - whatever tool they use for it. Fortunately for us, we have simon who has implemented a very nice QA infrastructure at akoha, and we keep automating more.

He covered buildbot (which is what we use at akoha) in some more detail, developing a fairly complete example.

pycon talk page

Coverage testing, the good and the bad

Coverage testing is important. A coverage tool will tell you which lines in your code were actually exercised during your tests. For example it'll tell you that for this code:

if foo:
    a = bar()
else:
    a = bob()

your tests always go through the a = bar() line and never actually reach the else branch. It'll tell you that you should probably add a test that gets to that line as well. It'll also tell you that though you spent a lot of time writing tests, you've been mostly testing the same lines of code over and over - you have redundant tests, and changing one of those lines will break many tests.
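
To make the idea concrete, here is a toy line tracer built on sys.settrace, the standard hook for this kind of tool. It is a minimal sketch, not how coverage.py or figleaf are actually implemented; lines_executed() and pick() are hypothetical names, and pick() mirrors the if/else above with strings in place of bar() and bob().

```python
import sys


def lines_executed(func, *args):
    """Run func(*args) and return the set of executed line offsets,
    counted from the function's 'def' line."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        # Only record 'line' events that belong to the function under test
        if frame.f_code is code and event == "line":
            seen.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(old)  # always restore whatever tracer was there before
    return seen


def pick(foo):
    if foo:          # offset 1
        a = "bar"    # offset 2
    else:
        a = "bob"    # offset 4
    return a         # offset 5
```

If the tests only ever call pick(True), offset 4 never shows up in the result, which is exactly the "you never reach the else branch" report.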

This was a very good talk. The presenter is the author of coverage.py, one of several python code coverage tools. We tried that one, but it was super slow - a test suite that normally takes ~1hr would take a good 6-10h with coverage.py, so we went with figleaf. But he did say he's actively working on the speed problem!

  • One reason for the slowdown vs figleaf is that figleaf automatically excludes all the python library code from its data gathering, unless you explicitly include it, and you can exclude other code as well (such as, in our case, all of django - the django guys have their own unit tests!), which dramatically reduces the amount of code that needs to be traced. In the case of coverage.py, on the other hand, regex exclusions are only considered in the reporting phase.

    • A better exclusion mechanism is also coming soon to a coverage.py near you.

More notes from the talk:

  • don't target a specific coverage percentage; you want as much as possible, within reason - it all depends on the project. Not all of them will require 100%.

  • There are different kinds of code coverage; all we have at the moment in python is just statement coverage: has this line been tested or not?

    • we don't have branch coverage or path coverage in python yet!

      if a:
          foo()
      bar()
      

      you might have written one test which goes through all 3 lines, but you also need to test that if the if condition is false, your code still works. That's branch coverage. He gave many more excellent examples of the different types of coverage and why they're important.

    • the point is that you can't look at the output of figleaf or coverage.py and see 100% and think "alright! we're covered!" - that's not true. You still need to look at your code.
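
Here is a small made-up case of 100% statement coverage hiding a bug, in the same shape as the if a: foo() / bar() snippet. The function and its coupon argument are hypothetical, purely for illustration.

```python
def total_due(order_total, coupon=None):
    if coupon:
        discount = coupon["amount"]
    return order_total - discount  # bug: 'discount' is unbound if the 'if' was skipped


# This single check runs all three lines of total_due: 100% statement coverage.
assert total_due(100, {"amount": 10}) == 90


def crashes_without_coupon():
    """Exercise the branch the check above never took."""
    try:
        total_due(100)
    except UnboundLocalError:
        return True
    return False
```

A statement coverage report would be all green after the first check, yet crashes_without_coupon() returns True: only a test of the false branch finds the problem.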

pycon talk page

Building tests for large, untested codebases

This was about testing the pygr library, code that normally uses lots of data (like 100GB). The pygr code was open sourced so it would get used more. It seems to be a fairly specific piece of software for bioinformatics, but also seems like a pretty neat thing even though i will probably never use it.

Now the idea is that this codebase was written by other people, is well-written but fairly complex, and you want to test it, document it, and understand it. Probably not in that order, but that's kind of the point: testing and using code coverage tools will help you understand, expand, debug the code.

The fact that all we have at our disposal is statement coverage, as opposed to branch coverage and such, restricts the usefulness of coverage. Someone asked at the end what might be the easiest next step to take in coverage testing so it does more. The speaker suggested that, in python 2.6 at least, it should be easy to walk the AST of your code to check that each branch is actually covered.
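
The speaker showed no code for this, so the following is only a guess at what a first step could look like: use the standard ast module to list every if statement along with the first line of each of its branches. branch_points() and SAMPLE are hypothetical names.

```python
import ast


def branch_points(source):
    """Return (if_line, first_true_line, first_false_line_or_None) tuples."""
    points = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If):
            true_line = node.body[0].lineno
            # An 'if' without else has an empty orelse: the false branch is
            # an implicit fall-through that a line tracer cannot see.
            false_line = node.orelse[0].lineno if node.orelse else None
            points.append((node.lineno, true_line, false_line))
    return sorted(points, key=lambda p: p[0])


SAMPLE = """\
if a:
    foo()
bar()
if b:
    x = 1
else:
    x = 2
"""
```

Combining this list with the set of executed lines from a tracer shows which explicit branches were taken; the entries with None are the ones statement coverage can never vouch for.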

The speaker was actually very interesting and lively, but i can't say that the talk was very interesting in and of itself.

pycon talk page

Behind the scenes of EveryBlock.com

This talk was very interesting to me as it talked about problems we've both encountered and thought a lot about with no real good solution... so pardon the length of my notes on it. The slides aren't available online yet, but most of it was spoken anyways, so hopefully these notes will help someone else.

Adrian Holovaty, a core django developer, started everyblock.com, basically a cool local news / information mashup application written in python using django, which shows what's going on around your house (in your block!):

  • news happening around you
  • crime around you
  • photos
  • street closures
  • new businesses
  • many, many more things...

It is a smörgåsbord of links: everything is linked together and you can browse by any type of data.

Django was made for this type of stuff originally, so that's bound to be a good fit. Adrian talked about how that site works. They will make the code open-source on June 30th so others can imitate that site. In fact, the grant that funded this project obligates them to!

The problem with this - one we've faced before and in some ways still do at akoha but with different data types - is that at a basic level you have news items, but these can link to theft events (which will have their own field types: had_arrests, type of crime, ...) or a restaurant inspection (violation, passed or failed, ...). You face the problem that if you create a schema which can express news items, restaurant violations or crimes, it'll work tomorrow but you'll face a schema migration every time you want to tweak the fields or add a type of news event.

So one approach is called entity-attribute-value (EAV): a single data table using lowest-common-denominator column types, with the columns:

newsitem
att_name
att_value

So the idea is that if you want to store data about newsitem 35, which has attributes foo='a' and bar='b', you'd store 2 rows in this table:

35,'foo','a'
35,'bar','b'

but that leads to much duplication (as a given field name is likely to come back over and over), many JOIN clauses as you write complicated search queries, humongous tables, and problems with the type of the att_value (you have to use string for it, but then you'll need to convert to/from other data types, and serialize things into it, ...).
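
The JOIN problem is easy to see with an in-memory sqlite3 database. This is a sketch, not EveryBlock's code: the table name attribute and newsitem 36 are my own additions, so there is something to filter out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribute (newsitem INTEGER, att_name TEXT, att_value TEXT)"
)
conn.executemany(
    "INSERT INTO attribute VALUES (?, ?, ?)",
    [
        (35, "foo", "a"),
        (35, "bar", "b"),
        (36, "foo", "a"),  # hypothetical second item
        (36, "bar", "z"),
    ],
)

# Finding the items where foo='a' AND bar='b' takes one self-join per attribute.
rows = conn.execute(
    """
    SELECT f.newsitem
    FROM attribute AS f
    JOIN attribute AS b ON b.newsitem = f.newsitem
    WHERE f.att_name = 'foo' AND f.att_value = 'a'
      AND b.att_name = 'bar' AND b.att_value = 'b'
    """
).fetchall()
matching_ids = [r[0] for r in rows]
```

Every extra attribute in the search adds another join against the same ever-growing table, and every value is a string, whatever it really is.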

Another solution, kind of in between the previous two, and the one they eventually settled on, is to have an attribute table with a few generic columns of different types, such as:

id
string01
string02
bool01
int01
int02
float01

Now, that works, but it makes the schema opaque: for a 'crime', what does the bool01 column refer to? So, in what i think is the breakthrough here, they created a Schema table, which lists what each generic column means for a given data type. Each news item has a column telling you what type it is, and the attributes table has a row (i guess it could have more if you tweaked this) pointing to that news item. If you want to know what bool01 means in the attributes row of a given news item, you look up the news type and the generic column name (bool01) in the schema table to get the attribute name.

So then, leveraging that schema table, they also created a nice helper in the model manager that allows you to filter by, e.g., had_arrests, instead of bool01. It makes the code a lot more readable, which is usually a big problem when you go with a solution of that type.
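
Here is how i understand the combination of generic columns, a schema table and a friendly filter helper, sketched with sqlite3 rather than django models. Every table name, column name and row is an assumption for illustration; the real code may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schema_field (item_type TEXT, name TEXT, real_column TEXT);
CREATE TABLE newsitem (id INTEGER PRIMARY KEY, item_type TEXT, title TEXT);
CREATE TABLE attribute (newsitem_id INTEGER, string01 TEXT, bool01 INTEGER);
INSERT INTO schema_field VALUES ('crime', 'had_arrests', 'bool01');
INSERT INTO schema_field VALUES ('crime', 'crime_type', 'string01');
INSERT INTO schema_field VALUES ('inspection', 'passed', 'bool01');
INSERT INTO newsitem VALUES (1, 'crime', 'Bike stolen');
INSERT INTO newsitem VALUES (2, 'crime', 'Burglary');
INSERT INTO newsitem VALUES (3, 'inspection', 'Diner inspected');
INSERT INTO attribute VALUES (1, 'theft', 0);
INSERT INTO attribute VALUES (2, 'burglary', 1);
INSERT INTO attribute VALUES (3, NULL, 1);
""")


def filter_by(item_type, **friendly):
    """Return titles of item_type items matching e.g. had_arrests=True."""
    where, params = ["n.item_type = ?"], [item_type]
    for name, value in friendly.items():
        row = conn.execute(
            "SELECT real_column FROM schema_field WHERE item_type = ? AND name = ?",
            (item_type, name),
        ).fetchone()
        if row is None:
            raise KeyError(name)  # this type has no such attribute
        # real_column comes from our own schema table, never from user input
        where.append("a.%s = ?" % row[0])
        params.append(value)
    sql = (
        "SELECT n.title FROM newsitem n "
        "JOIN attribute a ON a.newsitem_id = n.id "
        "WHERE " + " AND ".join(where) + " ORDER BY n.id"
    )
    return [r[0] for r in conn.execute(sql, params)]
```

Calling filter_by("crime", had_arrests=True) reads like the domain, while the query underneath still hits the generic bool01 column. Note that bool01 means had_arrests for a crime and passed for an inspection, which is the whole point of the schema table.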

The schema description model also has an is_filter column: if true, the site will make that attribute clickable, and you'll be able to see all the other news items of that type around you that have the same value for that field.

And of course they added a nice templatetag which you can use to say render this news item: the tag will figure out the type, and look up a type-specific html template to render the item properly.

He suggested taking a look at the code for databrowse for a few of these techniques.

Now, getting the data is also a problem! For news stories, how do you get the meat of an article out of the huge polluted web pages of news media? You look at the different pages of that site, and figure out automatically what changes between them - it'll be the individual news item. Look up wrapper induction and templatemaker.
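
The "figure out what changes between pages" idea can be illustrated with the standard difflib module. To be clear, this is not templatemaker's API or algorithm, just a toy sketch of the principle; changing_parts() and the two sample pages are made up.

```python
import difflib
import re


def tokenize(page):
    # Split into tags and words so the shared page chrome lines up token by token
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def changing_parts(page_a, page_b):
    """Return the runs of tokens in page_b that are not shared with page_a."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            parts.append(" ".join(b[j1:j2]))
    return parts


PAGE_1 = "<h1>Daily News</h1> <div>Fire on Main Street</div> <p>Copyright 2009</p>"
PAGE_2 = "<h1>Daily News</h1> <div>Council approves new budget</div> <p>Copyright 2009</p>"
```

Whatever stays identical across pages is the template; whatever changing_parts() returns is a candidate for the story itself. Real pages are much messier, so treat this as the idea only.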

For other types, such as flickr, you can scrape the sites (using, e.g., BeautifulSoup) or use their APIs (flickr has a great API).

And, finally, the bigger problem is the government data. Most of that isn't available on web sites in any usable way, so they're trying to convince the powers-that-be to dedicate some effort to making that data available and searchable. For some of the data, there is a very basic interface (here's an example).

Someone asked whether they tried or considered schema-less databases (such as CouchDB), which would solve the unknown-schema problem. But it creates other problems:

  • no geospatial data indexing, which they use a lot
  • difficulty linking
  • performance

They haven't had to do much scaling yet, but are starting to look at sharding, multiple-database support in django, etc.

A very cool thing Adrian said was that government agencies shouldn't be in the business of presenting the data: creating nice dynamic websites to search, view and enter that data. They should do what only they can do, like arrest the bad guys, and then make the data available through some automatic means so that people like Adrian can create much better interfaces to that data, and tie it to all the other related data, something that those agencies would never get to anyways.

pycon talk page

by wiswaud on 27 March 2009
Tags: chicago, django, english, pycon, python
