Machine learning is exciting, but the work is complex and difficult. It typically involves a lot of manual lifting — assembling workflows and pipelines, setting up data sources, and shunting back and forth between on-prem and cloud-deployed resources.
The more tools you have in your belt to ease that job, the better. Thankfully, Python is a giant tool belt of a language that’s widely used in big data and machine learning. Here are five Python libraries that help relieve the heavy lifting for those trades.
PyWren, a simple package with a powerful premise, lets you run Python-based scientific computing workloads as multiple instances of AWS Lambda functions. A profile of the project describes PyWren as using AWS Lambda as a giant parallel processing system, tackling projects that can be sliced and diced into little tasks that don’t need a lot of memory or storage to run.
One downside is that Lambda functions can’t run for more than 300 seconds. But if you have a job that takes only a few minutes to complete and need to run it thousands of times across a data set, PyWren may be a good option for parallelizing that work in the cloud at a scale unavailable on user hardware.
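The kind of job described above is a map of one small, independent function over many inputs. The sketch below shows that shape using only the standard library's `concurrent.futures` as a local stand-in. It is not PyWren's API, and the `simulate_trial` workload and `run_all` helper are made up for illustration; in the cloud, each call would run in its own Lambda invocation rather than a local thread.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def simulate_trial(seed):
    """A small, independent task: estimate pi from 10,000 random points.

    Hypothetical workload. Each call needs little memory and no shared state,
    which is the kind of job that fits a short-lived cloud function.
    """
    rng = random.Random(seed)
    n = 10_000
    inside = sum(
        1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n


def run_all(seeds, max_workers=8):
    # Local stand-in for a serverless map: one call per input,
    # with results returned in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_trial, seeds))


if __name__ == "__main__":
    estimates = run_all(range(100))
    print(sum(estimates) / len(estimates))
```

Because every call depends only on its own seed, the work can be split across as many workers as are available without any coordination between them.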
One common question about TensorFlow: How can I make use of the models I train without using TensorFlow itself?
Tfdeploy is a partial answer to that question. It exports a trained TensorFlow model to “a simple NumPy-based callable,” meaning the model can be used in Python with Tfdeploy and the NumPy math-and-stats library as the only dependencies. Most of the operations you can perform in TensorFlow can also be performed in Tfdeploy, and you can extend the behaviors of the library by way of standard Python metaphors (such as overloading a class).
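To make "a simple NumPy-based callable" concrete, here is a rough sketch of what such a function might look like for a tiny one-layer model. It does not use Tfdeploy or its API. The `make_model` helper, the weights, and the layer shape are all invented for illustration; a real export would carry the values learned in TensorFlow.

```python
import numpy as np


def make_model(weights, bias):
    """Return a plain Python callable that evaluates softmax(x @ W + b)."""
    weights = np.asarray(weights, dtype=float)
    bias = np.asarray(bias, dtype=float)

    def predict(x):
        logits = np.asarray(x, dtype=float) @ weights + bias
        # Subtract the row-wise max before exponentiating for numerical stability.
        shifted = logits - logits.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)

    return predict


# Made-up parameters standing in for trained values.
model = make_model(weights=[[1.0, -1.0], [0.5, 0.5]], bias=[0.0, 0.1])
probs = model([[2.0, 1.0]])  # one input row, two class probabilities
```

The point of the pattern is that the caller needs only NumPy and the stored parameters at prediction time, not the training framework.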
Now the bad news: Tfdeploy doesn’t support GPU acceleration, if only because NumPy doesn’t do that. Tfdeploy’s creator suggests using the project as a possible replacement.
Writing batch jobs is generally only one part of processing heaps of data; you also have to string all the jobs together into something resembling a workflow or a pipeline. Luigi, created by Spotify and named for the Nintendo plumber, was built to “address all the plumbing typically associated with long-running batch processes.”
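The "plumbing" in question is mostly dependency handling: run upstream jobs first, and skip anything whose output already exists. The toy runner below sketches that idea with the standard library only. It is not the API of any real workflow library, and the `Task`, `Extract`, `Sort`, and `build` names are hypothetical.

```python
import os
import tempfile


class Task:
    """Minimal task: declares upstream tasks, an output path, and the work to do."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def complete(self):
        # A task counts as done once its output file exists.
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class Extract(Task):
    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def run(self):
        with open(self.requires()[0].output()) as f:
            numbers = sorted(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write("\n".join(str(n) for n in numbers) + "\n")


def build(task, log):
    """Run upstream tasks first, then the task itself; skip anything already done."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        log = []
        build(Sort(workdir), log)
        print(log)
```

Treating "output exists" as "task complete" is what lets a failed pipeline be rerun without repeating the steps that already succeeded.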
Kubelib provides a set of Pythonic interfaces to Kubernetes, originally to aid with Jenkins scripting. But it can be used without Jenkins as well, and it can do everything exposed through the kubectl CLI or the Kubernetes API.
Let’s not forget about PyTorch, an addition to the Python world that implements the Torch machine learning framework. PyTorch doesn’t only port Torch to Python; it also adds many other conveniences, such as GPU acceleration and a library that allows multiprocessing to be done with shared memory (for partitioning jobs across multiple cores). Best of all, it can provide GPU-powered replacements for some of the unaccelerated functions in NumPy.