Continuous Integration and Parallel Python testing on Heroku.


Or how to make your tests at least 10 times faster using parallel testing on Heroku.

In this post I’ll show you how to deploy a very simple Python web app with very long tests, and how to speed up those tests significantly. If you are already familiar with Heroku and just want to get straight to the point, go directly to part 3.

It’s testing time.

For my last project (a web scraping API), we decided to host part of our infrastructure on Heroku. The reason was simple: neither my co-founder nor I were very good at the ops side of dev, so we chose the simplest, most time-efficient way to deploy our app: Heroku. Prior to this we had had a mixed experience with AWS, in particular EBS.

Make no mistake, this simplicity comes at a price, and Heroku is crazy expensive, but their free plan is very good for side projects such as a Twitch SMS notifier 😎.

As I said, I’ve been using Heroku for quite a while, and since the beginning we have used its lightweight and simple CI integration, which automatically deploys our application every time we push, if and only if all our tests pass.

Nothing new under the sun here.

In this post you will see how to easily deploy a Heroku application and set up continuous integration. But more importantly, you will see how to parallelize tests.

Again, if you are already familiar with deploying a Heroku application and setting up continuous integration, go directly to the part about parallelising the tests.

First, deploy an app on Heroku:

If you don’t already have one, you need to create a Heroku account. You also need to download and install the Heroku client. I’ve provided a test project on GitHub; do not hesitate to check it out if you need help bootstrapping this tutorial.

You can clone this repo, cd into it and run `heroku create --app <app name>`. If you go to your Heroku dashboard, you’ll see your new app.

OK, now comes the interesting part: go to your dashboard, click on the name of your newly created app, and open the “Deploy” panel.

We will now link this Heroku app with your GitHub repo. This is rather simple: click on “GitHub” in the “Deployment method” section, add your repo in the “App connected to GitHub” section, and don’t forget to click “Enable automatic deploys” in the “Automatic deploys” section.

Once everything is set up, it should look a little bit like this:

If you go to “Settings -> Domains” you should see the domain where your app is live.

OK, so now your app is live, and every time you push to GitHub, a new deploy will take place.

Then add tests and CI:

In order to run tests on Heroku, all you have to do is click on “Wait for CI to deploy” in the Deploy section of your app.

You also need to add your application to a Heroku pipeline.

Doing this is really easy: just go to the Deploy tab of your application and create a new pipeline with the name of your choice.

You now have access to the Pipeline view, where you can click on your previously deployed app.

Go to the Tests tab, link your GitHub repo, and click on “Enable Heroku CI”. Be careful: this option costs $10 a month.

Let’s go back to our code. The test file is already written, so all you have to do to trigger the magic is push to master.

`git commit --allow-empty -m "Trigger heroku" && git push origin master`

And now, the app won’t deploy right away: Heroku will wait for the tests to pass before deploying. You can check what’s going on behind the curtain in the Tests tab.

The command that is run during the test phase is defined in the app.json file.

As you can see, tests are now run sequentially on Heroku. If you look at the `slow-tests.py` file, you will see that I defined my tests using `pytest.mark.parametrize`, which lets me generate multiple tests from a single line:

import time

import pytest

@pytest.mark.parametrize("wait_time", [5] * 20)
def test_slow(wait_time):
    time.sleep(wait_time)
    assert True

This decorator means that the test will be run 20 times, with `wait_time=5`.
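To make the cost of this suite concrete, here is a small sketch of the arithmetic behind the parameter list. The variable names are only illustrative and do not come from the test project.

```python
# Same parameter list as the one passed to pytest.mark.parametrize.
wait_times = [5] * 20

# pytest generates one test per element, and each test sleeps for its value.
number_of_tests = len(wait_times)
total_sleep_seconds = sum(wait_times)

print(number_of_tests)      # 20
print(total_sleep_seconds)  # 100
```

Run one after the other, the tests therefore spend 100 seconds sleeping, before counting any setup time.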

As you can see in Heroku, this test suite is (artificially) rather slow:

Here, 7 is just the build number.

Parallelising tests on Heroku

As stated in the Heroku documentation, Heroku makes it easy to parallelise tests. To launch your tests on multiple dynos at the same time, you just have to tweak your app.json file a little:

{
  "environments": {
    "test": {
      "scripts": {
        "test-setup": "pip install -r requirements.txt",
        "test": "pytest --tap-stream slow-tests.py"
      },
      "formation": {
        "test": {
          "quantity": 12
        }
      }
    }
  },
  "buildpacks": [{ "url": "heroku/python" }]
}

The quantity key tells Heroku how many dynos to run your tests on. From now on, pushing to master will launch the tests on 12 dynos. But stopping here won’t make your tests faster, because the whole test suite will run on each of the 12 dynos, whereas what we want is to run 1/12 of the tests on each dyno.

It is actually easy to check:

The tests ran on 12 dynos, but were not much faster. So now comes the tricky and unfortunately poorly documented part: how do we tell Heroku to run 1/12 of the test suite on each of the 12 dynos?

Splitting up tests

To do this we will use two environment variables set by Heroku and accessible on each dyno: CI_NODE_TOTAL and CI_NODE_INDEX. The first one gives the number of dynos the tests run on, and the second one gives the index of the current dyno, starting at 0.

Let’s see how to use them. pytest lets you modify the list of test items that will be executed after collection, through the `pytest_collection_modifyitems` hook. To implement this hook, just put this snippet of code in a conftest.py file.

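import os


def pytest_collection_modifyitems(items, config):
    # Both variables are set by Heroku CI; the defaults keep local runs unchanged.
    ci_node_total = int(os.getenv("CI_NODE_TOTAL", 1))
    ci_node_index = int(os.getenv("CI_NODE_INDEX", 0))
    # Keep only this node's share of the tests, updating the list in place.
    items[:] = [
        item
        for index, item in enumerate(items)
        if index % ci_node_total == ci_node_index
    ]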

This hook is used to modify, in place, the test items that are going to be run. It does not return anything, which is why you have to update the list in place. Mutating an argument is usually an example of what not to do, but that is not the subject of this post.
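As a possible variant, and not what the test project does, the hook can also tell pytest which items it dropped by calling the `pytest_deselected` hook. This is only a sketch, assuming you want the dropped tests to be counted as deselected by pytest's reporting; whether that count is visible depends on the reporter you use.

```python
import os


def pytest_collection_modifyitems(config, items):
    node_total = int(os.getenv("CI_NODE_TOTAL", "1"))
    node_index = int(os.getenv("CI_NODE_INDEX", "0"))

    selected, deselected = [], []
    for position, item in enumerate(items):
        if position % node_total == node_index:
            selected.append(item)
        else:
            deselected.append(item)

    if deselected:
        # Let pytest know these items were skipped on purpose on this node.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

The selection rule is the same modulo test as in the snippet from the post; only the reporting call is added.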

You have to keep in mind that this snippet is run on every test node. On every test node, CI_NODE_TOTAL is the same and CI_NODE_INDEX is different, so by only keeping tests whose `index` in `items` modulo CI_NODE_TOTAL equals CI_NODE_INDEX we ensure 2 things:

  • every node runs about 1/`CI_NODE_TOTAL` of the tests
  • every test originally in items ends up being run exactly once

If it is not clear, imagine that I have 24 tests in items, `[t1, t2, …, t24]`, and 12 dynos. This snippet of code, executed on the first dyno (CI_NODE_INDEX=0), will update the items variable such that at the end of pytest_collection_modifyitems we have items = [t1, t13].
On the second dyno (CI_NODE_INDEX=1), we have items = [t2, t14], and so on.
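The 24-test example can be checked outside pytest with a small simulation. The `shard` helper below is a hypothetical stand-in that applies the same modulo rule as the hook to plain strings; it is not part of pytest or Heroku.

```python
def shard(items, node_total, node_index):
    """Keep the items whose position modulo node_total equals node_index."""
    return [
        item
        for position, item in enumerate(items)
        if position % node_total == node_index
    ]


tests = [f"t{n}" for n in range(1, 25)]  # t1 ... t24
node_total = 12

# One shard per dyno, indexed from 0 like CI_NODE_INDEX.
shards = [shard(tests, node_total, node_index) for node_index in range(node_total)]

print(shards[0])   # ['t1', 't13']
print(shards[1])   # ['t2', 't14']
print(shards[11])  # ['t12', 't24']

# Every test lands in exactly one shard.
assert sorted(t for s in shards for t in s) == sorted(tests)
```

Because the split only looks at positions, it balances the number of tests per dyno, not their duration.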

And here is what happens on Heroku once we push:

As you can see, we did not manage to divide the time by 12. The reason is simple: each dyno takes about 30 seconds to boot, and this time cannot be reduced. But we did manage to divide the time by 2 and, more importantly, we can parallelise our tests on up to 32 dynos, so there is plenty of room for improvement.
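To see why the boot time caps the speed-up, here is a back-of-the-envelope model. It is only a sketch under simplifying assumptions: a fixed 30-second boot per dyno, 20 tests of 5 seconds each, and nothing else (no `test-setup` step, no queueing). The function name and the model itself are mine, not something Heroku provides.

```python
import math


def estimated_wall_time(n_tests, seconds_per_test, dynos, boot_seconds):
    """Boot time plus the run time of the dyno that receives the most tests."""
    tests_on_busiest_dyno = math.ceil(n_tests / dynos)
    return boot_seconds + tests_on_busiest_dyno * seconds_per_test


for dynos in (1, 12, 32):
    print(dynos, estimated_wall_time(20, 5, dynos, 30))
# 1 130
# 12 40
# 32 35
```

The absolute numbers will not match a real build, since the model ignores everything except booting and sleeping, but it shows the trend: past a certain number of dynos, the fixed boot time dominates and adding more dynos barely helps.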

Thank you for reading

I had trouble finding documentation about parallelising tests on Heroku in Python, so I really hope you liked this post and that it will speed up your deployments on Heroku. All sources are freely available on GitHub.

I frequently blog about Python and web scraping. I recently wrote a Python web-scraping guide that got some nice attention on Reddit 😎, do not hesitate to check it out.

You can follow me on Twitter so you don’t miss any of my next blog posts.




