PyData Berlin 2015

After visiting the PyData Berlin 2015 event, which took place on 29-30 May at Betahaus, I wrote down some links as a reminder and then decided to put them here alongside some thoughts on the topics covered.

As stated on the meeting site, the talks were 'related to the use of Python in data management and analysis' and came from a variety of fields, with an emphasis on applications of machine learning.

At the 2014 PyData Berlin event, Travis Oliphant gave a talk during which 'PyData: the First 20 Years' was summarized on a slide. That story began with basic matrix calculation packages which were precursors of the current standard NumPy and extensions like SciPy and matplotlib. With pandas and several machine learning packages, e.g. scikit-learn, the PyData ecosystem rivals R and interacts with Java-driven 'big data' solutions centered on apache.org - as described e.g. here.

Topics presented and discussed in Berlin this year were commented on in the 'official' #PyDataBerlin Twitter feed, and the videos are online.

The keynote by Matthew Rocklin from Continuum Analytics was about Dask, which seems to be a rather elegant approach to getting 'instant' multi-core performance for a large subset of NumPy functionality through parallel processing. This topic usually leads to hints about the limitations of multi-threaded data management due to the GIL, and consequently one of Matthew's messages was to get rid of it in the main parts of the PyData module ecosystem (with the nogil statement in Cython).
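
To illustrate the idea, here is a minimal sketch with dask.array (not code from the keynote): a NumPy array is split into chunks, and NumPy-style expressions are built lazily as a task graph and evaluated in parallel.

```python
import numpy as np
import dask.array as da

# Wrap a NumPy array in a chunked dask array; each chunk can be processed
# by a separate core
x = np.random.random((10000, 10000))
dx = da.from_array(x, chunks=(1000, 1000))

# Familiar NumPy-style expression, built lazily as a task graph ...
result = (dx + dx.T).mean(axis=0)

# ... and evaluated in parallel only when compute() is called
print(result.compute()[:5])
```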

Valentin Haenel talked about Blosc and related higher-level packages (see the links on his homepage). Especially the 'columnar data container' bcolz could be considered a light-weight alternative to the HDF5 file format (actually, PyTables also offers a Blosc-based compression filter).
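
A minimal sketch of what such a container looks like (array size and directory name chosen just for illustration):

```python
import numpy as np
import bcolz

# Store a large NumPy array as a chunked, Blosc-compressed carray on disk
a = np.arange(10000000)
ca = bcolz.carray(a, rootdir='example_carray', mode='w')
ca.flush()

print(ca)             # repr shows compression ratio and chunk layout
print(ca[1000:1005])  # slicing decompresses only the chunks it needs
```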

Claas Abert's talk about the numerical treatment of PDEs with NumPy gave an impression of one 'classic' use case for PyData packages, i.e. physical simulation of micromagnetic problems with finite-difference and finite-element methods. The finite-element package is based on FEniCS. Number crunching also involves optimization of Python code, and an overview of the possibilities was given. Thanks for the hint about the new JIT compiler HOPE!
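
As a toy illustration of the finite-difference approach with NumPy (not the micromagnetic model from the talk), here is an explicit time-stepping scheme for the 1D heat equation:

```python
import numpy as np

# Explicit finite-difference time stepping for u_t = alpha * u_xx
nx, alpha, dx, dt = 101, 1.0, 0.01, 2e-5   # chosen so the scheme stays stable
u = np.zeros(nx)
u[nx // 2] = 1.0                           # initial heat spike in the middle

for _ in range(500):
    # vectorized second derivative on the interior points
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])

print(u.max(), u.sum())
```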

Mobile app marketing was one example of the 'new' 'Big Data' and its challenges. Nakul Selvaraj from Trademob explained the demands of real-time monitoring, and his colleague Tobias Kuhn gave some insights into statistical methods, with specific algorithms like t-digest used e.g. for tracking anomalies, or how to decompose trends on top of seasonal variations.
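
A minimal sketch of the latter idea, using statsmodels' seasonal_decompose on synthetic data (the actual tooling and metrics from the talk are not shown here):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic daily metric: slow upward trend plus weekly seasonality
idx = pd.date_range('2015-01-01', periods=120, freq='D')
values = 10 + 0.05 * np.arange(120) + 2 * np.sin(2 * np.pi * np.arange(120) / 7)
series = pd.Series(values, index=idx)

# Split into trend, seasonal and residual components; anomalies show up
# as large residuals once trend and seasonality are removed
result = seasonal_decompose(series)
print(result.trend.dropna().head())
print(result.resid.dropna().abs().max())
```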

There were two talks from Pivotal folks. Cloud Foundry was mentioned as their PaaS product, and it is interesting to see which open source software is used @Pivotal. Smart GPS tracking of cars was given as one example of IoT. In the study presented by Ronert Obst, Random Forests were used to learn and classify possible driven routes. Although Spark usage was mentioned here and in other talks, Flink was also hinted at as an alternative with a different memory model and the ability to process data streams.
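
As a rough sketch of that kind of classification with scikit-learn (features, labels and sizes are made up here, not the data from the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up trip features (e.g. aggregated GPS statistics) and route labels
rng = np.random.RandomState(0)
X = rng.rand(500, 6)                 # 500 trips, 6 derived features each
y = rng.randint(0, 3, size=500)      # 3 candidate routes

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:400], y[:400])            # train on the first 400 trips
print(clf.score(X[400:], y[400:]))   # accuracy on the held-out 100 trips
```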

On Friday the presentation session ended with a 'Get Together', but one should not forget to mention the (en)lightning talks just before it. I pick two: Matthew Rocklin gave a short example showing the benefit of pandas' category dtype, and Tadej Štajner presented automatic machine learning with some code. So how different will future PyData events be, if one takes a statement on the AutoML site too seriously: 'taking the human expert out of the loop'?
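
The gist of the category example can be sketched in a few lines: a column of repeated string labels stored as a categorical uses integer codes plus a small lookup table, which saves memory and speeds up grouping and sorting.

```python
import numpy as np
import pandas as pd

# One million repeated string labels, as plain objects vs. as categories
labels = pd.Series(np.random.choice(['berlin', 'london', 'paris'], size=1000000))
as_category = labels.astype('category')

print(labels.memory_usage(deep=True))       # object dtype: one string per row
print(as_category.memory_usage(deep=True))  # category: int codes + 3 strings
```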

It was not only a presentation about how ascribe uses Python modules, like transactions as 'Bitcoin for Humans'; in his keynote Trent McConaghy also gave an interesting view on how blockchains and a suitable protocol (Spool) under the hood can help e.g. artists keep control of the intellectual property of digital objects.

The keynote by Felix Wick from Blue Yonder contained a lot of information, not only about the technical things a data scientist might or should know when doing e.g. predictive analytics. In that line of work it might not be bad to know a bit about hype cycles - but do not care too much. And you should really have a look at the video, because it is almost a lecture, giving a nice introduction to 'data science' in our Big Data world (be it a peak or not ...).

Peadar Coyle very nicely presented an interesting example of how sports analytics - in this case the Six Nations rugby championship - can be done by applying Bayesian statistical models with PyMC. Someone in the audience mentioned that it could already be worth switching to the successor PyMC 3 for Markov chain Monte Carlo fitting. This notebook seems to be a good tutorial for learning or migration purposes.
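
For orientation, a minimal PyMC 3 model (not the rugby model from the talk): inferring the mean of noisy observations via MCMC.

```python
import numpy as np
import pymc3 as pm

# Toy data: noisy observations around an unknown mean
observed = np.random.normal(loc=3.0, scale=1.0, size=100)

with pm.Model():
    mu = pm.Normal('mu', mu=0.0, sd=10.0)                # prior on the mean
    pm.Normal('obs', mu=mu, sd=1.0, observed=observed)   # likelihood
    trace = pm.sample(1000)                              # MCMC sampling

print(trace['mu'].mean())   # posterior mean should be close to 3.0
```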

The second presentation by the Blue Yonder folks was an introduction to PySpark DataFrame objects, which are designed to make use of - for instance - columnar data stored in Parquet format on Hadoop HDFS clusters. It was concluded that such an API for distributed file access is too costly in terms of efficiency compared to NumPy arrays or pandas DataFrames when the data fits on one machine.
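
A minimal sketch of that usage pattern (written against the newer SparkSession entry point; the HDFS path and the 'country' column are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet-example').getOrCreate()

# Read a columnar Parquet dataset from HDFS (placeholder path) into a
# distributed DataFrame and run a simple aggregation on it
df = spark.read.parquet('hdfs:///data/events.parquet')
df.printSchema()
df.groupBy('country').count().show()

spark.stop()
```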

Alejandro C. Bahnsen presented his work CostCla, which exemplifies the usage of scikit-learn classifiers for financial topics like credit scoring.

Brian Carter (IBM Software Group) gave an overview of web text mining processes. In this field the NLTK module seems to be the standard for natural language processing with Python.
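
A tiny NLTK example of what such processing typically starts with (tokenization and part-of-speech tagging; the required data packages are downloaded on first use):

```python
import nltk

# One-time downloads of the tokenizer and tagger models
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

tokens = nltk.word_tokenize("PyData Berlin brings Python and data people together.")
print(nltk.pos_tag(tokens))   # e.g. [('PyData', 'NNP'), ('Berlin', 'NNP'), ...]
```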

When dealing with the similarity of words, as Miguel Fernando Cabrera (TrustYou) did with hotel reviews, word2vec provides the metrics needed.
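
A minimal sketch with gensim's word2vec implementation (toy sentences instead of real hotel reviews, using the gensim API of that time):

```python
from gensim.models import Word2Vec

# Toy 'reviews'; a real model would be trained on many thousands of
# tokenized sentences
sentences = [
    ['clean', 'room', 'friendly', 'staff'],
    ['dirty', 'room', 'rude', 'staff'],
    ['friendly', 'staff', 'great', 'breakfast'],
    ['great', 'location', 'clean', 'room'],
]

model = Word2Vec(sentences, size=20, window=2, min_count=1, seed=1)
print(model.most_similar('staff'))   # cosine-similarity neighbours of 'staff'
```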

I didn't attend the tutorial sessions, so it will help to have a look at the respective videos if one wants to learn and better understand e.g. how interactive graphics can be created with Bokeh (for instance within IPython notebooks), how Docker fits in between tools like virtualenv and 'normal' virtual machines, and what one could use instead of %timeit for code profiling.
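
One such alternative is the cProfile module from the standard library, which reports where the time is spent per function rather than just the total runtime; a minimal sketch:

```python
import cProfile
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(1000000)
profiler.disable()

# Print the five most time-consuming calls
pstats.Stats(profiler).sort_stats('cumulative').print_stats(5)
```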

Someone referred to the Velox model server, and I also noted Luigi for building pipelines of batch jobs - if you remember the context, talk or discussion in which those came up, feel free to message me @RaPrism.
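
To give an idea of Luigi, here is a minimal two-task pipeline (file names and contents are made up): each task declares its dependencies and output targets, so steps that already completed are skipped on re-runs.

```python
import luigi

class Extract(luigi.Task):
    """Write some raw numbers to a file (stand-in for a real extraction step)."""
    def output(self):
        return luigi.LocalTarget('raw.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('1\n2\n3\n')

class Summarize(luigi.Task):
    """Sum the numbers produced by Extract."""
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget('summary.txt')

    def run(self):
        with self.input().open() as f:
            total = sum(int(line) for line in f)
        with self.output().open('w') as f:
            f.write(str(total))

if __name__ == '__main__':
    luigi.build([Summarize()], local_scheduler=True)
```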
