After visiting the PyData Berlin 2015 event, which took place on
29-30 May at Betahaus, I wrote down some links as a reminder, and
then decided to put them here along with some thoughts on the
topics covered.
As announced on the meeting site, the talks were 'related to the
use of Python in data management and analysis' and came from a variety
of fields, but with an emphasis on applications of machine learning.
At the 2014 PyData Berlin event there was a talk by Travis
Oliphant, during which 'PyData: the First 20 Years' was also
summarized on a slide. This story began with basic matrix calculation
packages that were precursors of the current standard NumPy and of
extensions like SciPy and matplotlib. With pandas and several machine
learning packages, e.g. scikit-learn, the PyData ecosystem rivals R
and interacts with large Java-driven 'big data' solutions centered on
apache.org - as described e.g. here.
The topics presented and discussed in Berlin this year were commented
on in the 'official' #PyDataBerlin twitter feeds, and videos are
on-line.
The keynote by Matthew Rocklin from Continuum Analytics was about
Dask, which seems to be a quite elegant approach to getting multi-core
performance 'instantly' for a large subset of NumPy functionality
through parallel processing. This topic usually leads to remarks about
the limitations that the GIL imposes on multi-threaded data
management, and consequently one of Matthew's messages was to get rid
of it in the main parts of the PyData module ecosystem (with the nogil
statement in Cython).
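The idea of splitting an array into blocks and reducing them in
parallel can be sketched without Dask itself. The following is only an
illustration using NumPy and the standard library, not Dask's API; the
function name and the chunk count are made up for this example. It
relies on the fact that many NumPy operations release the GIL
internally, so threads can work on different chunks at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunked_sum_of_squares(x, n_chunks=4, max_workers=4):
    """Sum x**2 by reducing each chunk of x in its own thread.

    Hypothetical helper for illustration; Dask automates this kind of
    blocked computation for a much larger part of the NumPy API.
    """
    chunks = np.array_split(x, n_chunks)

    def partial(chunk):
        # The heavy lifting happens inside NumPy's compiled loops.
        return float(np.square(chunk).sum())

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial, chunks))

    # Combine the per-chunk results into the final answer.
    return sum(partials)


x = np.arange(1_000_000, dtype=float)
print(chunked_sum_of_squares(x))
```

Whether threads actually speed this up depends on the array size and
on how long each NumPy call runs without holding the GIL.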
Valentin Haenel talked about Blosc and related higher-level packages
(see the links on his homepage). In particular, the 'columnar data
container' bcolz could be considered a light-weight alternative to the
HDF5 file format (PyTables also offers a Blosc-based compression
filter).
Claas Abert's talk about the numerical treatment of PDEs with NumPy
gave an impression of one 'classic' use case for PyData packages,
i.e. physical simulations of micromagnetic problems by
finite-difference and finite-element methods. The finite-element
package is based on FEniCS. Number crunching also involves
optimization of Python code, and an overview of the possibilities was
given. Thanks for the hint about a new JIT compiler: HOPE!
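To give a flavour of what a finite-difference scheme looks like in
NumPy, here is a minimal sketch. It does not reproduce the
micromagnetic code from the talk; it solves the much simpler 1D heat
equation u_t = alpha * u_xx with fixed boundary values, and all
function names and parameters are chosen for this example only.

```python
import numpy as np


def heat_step(u, alpha, dx, dt):
    """One explicit finite-difference step of u_t = alpha * u_xx."""
    u_new = u.copy()
    # Second derivative via the central difference on interior points;
    # the two boundary values are left unchanged (Dirichlet condition).
    u_new[1:-1] = u[1:-1] + alpha * dt / dx**2 * (
        u[2:] - 2.0 * u[1:-1] + u[:-2]
    )
    return u_new


def simulate(n_points=51, n_steps=200, alpha=1.0):
    """Evolve an initial sine profile and return (x, u, elapsed_time)."""
    x = np.linspace(0.0, 1.0, n_points)
    dx = x[1] - x[0]
    # The explicit scheme is only stable for dt <= dx**2 / (2 * alpha).
    dt = 0.4 * dx**2 / alpha
    u = np.sin(np.pi * x)
    for _ in range(n_steps):
        u = heat_step(u, alpha, dx, dt)
    return x, u, n_steps * dt


x, u, t = simulate()
exact = np.exp(-np.pi**2 * t) * np.sin(np.pi * x)
print("max deviation from analytic solution:", np.abs(u - exact).max())
```

The whole update is one vectorized slice expression, which is the kind
of code that the optimization tools mentioned in the talk try to speed
up further.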
Mobile app marketing was one example of 'new' 'Big Data' and its
challenges. Nakul Selvaraj from Trademob explained the demands of
real-time monitoring, and his colleague Tobias Kuhn gave some insights
into statistical methods and specific algorithms like t-digest, used
e.g. for tracking anomalies, and into how to decompose trends on top
of seasonal variations.
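As an illustration of separating a trend from seasonal variation, here
is a minimal sketch of a classical additive decomposition with a
moving average. This is a textbook technique and not necessarily the
method used at Trademob; the function name and the toy weekly pattern
are assumptions made for this example.

```python
import numpy as np


def decompose_additive(y, period):
    """Split y into trend + seasonal + residual (odd period assumed)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    y = np.asarray(y, dtype=float)
    half = period // 2

    # Trend: centered moving average over one full period.
    # The first and last `half` points have no full window and stay NaN.
    trend = np.full(y.shape, np.nan)
    kernel = np.ones(period) / period
    trend[half:len(y) - half] = np.convolve(y, kernel, mode="valid")

    # Seasonal part: mean detrended value for each phase of the period,
    # shifted so that the seasonal component averages to zero.
    detrended = y - trend
    phase_means = np.array(
        [np.nanmean(detrended[k::period]) for k in range(period)]
    )
    phase_means -= phase_means.mean()
    seasonal = phase_means[np.arange(len(y)) % period]

    residual = y - trend - seasonal
    return trend, seasonal, residual


# Toy data: linear growth plus a made-up weekly pattern.
t = np.arange(70)
weekly = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -2.0, 1.0])
y = 0.5 * t + weekly[t % 7]
trend, seasonal, residual = decompose_additive(y, period=7)
print(seasonal[:7])
```

Large values in the residual are then natural candidates for anomaly
checks, since they are explained by neither trend nor season.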
There were two talks from Pivotal folks. Cloud Foundry was mentioned
as their PaaS product, and it's interesting to see what open source is
used @Pivotal. Smart GPS tracking of cars was given as one example of
IoT. In this study, presented by Ronert Obst, Random Forests were used
to learn and classify possible driven routes. Although Spark usage was
mentioned here and in other talks, Flink was also pointed out as an
alternative with a different memory model and the ability to process
data streams.
On Friday the presentation session ended with a 'Get Together', but
one should not forget to mention the (en)lightning talks just
before. I pick two: Matthew Rocklin gave a short example showing the
benefit of pandas' dtype category, and Tadej Štajner presented
automatic machine learning with some code. So how different will
future PyData events be, if one takes one statement on the AutoML site
too seriously: 'taking the human expert out of the loop'?
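Coming back to the category dtype from the first of these lightning
talks: the benefit is easiest to see on a column with many rows but
few distinct values. The snippet below is my own small sketch, not the
example shown in the talk, and the city names are made up.

```python
import pandas as pd

n = 100_000
# A string column with only four distinct values, repeated many times.
cities = pd.Series(["Berlin", "Hamburg", "Munich", "Cologne"] * (n // 4))

# As a categorical, each distinct string is stored once and every row
# holds only a small integer code pointing to it.
as_category = cities.astype("category")

plain_bytes = cities.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("plain strings:", plain_bytes, "bytes")
print("category:     ", category_bytes, "bytes")
print("categories:   ", list(as_category.cat.categories))
print("code dtype:   ", as_category.cat.codes.dtype)
```

Besides saving memory, operations like groupby can work on the integer
codes instead of comparing strings.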
In his keynote, Trent McConaghy not only presented how ascribe uses
Python modules, like transactions ('Bitcoin for Humans'), but also
gave an interesting view on how blockchains and a suitable protocol
(Spool) under the hood can help e.g. artists to keep control of the
intellectual property in digital objects.
The keynote by Felix Wick from Blue Yonder contained a lot of
information, and not only about the technical things a data scientist
might or should know when doing e.g. predictive analytics. In a data
scientist's life it might not be bad to know a bit about hype cycles -
but do not care too much. You should really have a look at the video,
because it is almost a lecture, giving a nice introduction to 'data
science' in our Big Data world (be it a peak or not ...).
Peadar Coyle very nicely presented an interesting example of how
sports analytics - in this case for the 6 Nations rugby tournament -
can be done by applying Bayesian statistical models with
PyMC. Someone in the audience mentioned that it could already be worth
switching to the successor PyMC 3 to perform Markov chain Monte Carlo
fitting. This notebook seems to be a good tutorial for learning or
migration purposes.
The second presentation by Blue Yonder folks was an introduction to
PySpark DataFrame objects, which were created to make use of - for
instance - columnar data stored on Hadoop HDFS clusters in Parquet
format. It was concluded that such an API for distributed file access
is too costly in terms of efficiency compared to NumPy arrays or
pandas DataFrames when the data fits on one machine.
Alejandro C. Bahnsen presented his work CostCla, which exemplifies the
use of scikit-learn classification on financial topics like credit
scoring.
Brian Carter (IBM Software Group) gave an overview of web text mining
processes. In this field the NLTK module seems to be the standard for
language processing with Python.
When dealing with the similarity of words, as Miguel Fernando Cabrera (TrustYou) did with hotel reviews, word2vec provides the metrics needed.
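The usual metric on word vectors is cosine similarity. Here is a
minimal sketch with NumPy; the three-dimensional vectors below are
invented toy values, whereas real word2vec embeddings are learned from
a corpus and typically have a few hundred dimensions. The helper names
are also just for this example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(
        others, key=lambda w: cosine_similarity(vectors[word], vectors[w])
    )


# Hypothetical toy embeddings, not taken from any trained model.
vectors = {
    "clean": np.array([0.9, 0.1, 0.0]),
    "spotless": np.array([0.8, 0.2, 0.1]),
    "noisy": np.array([0.0, 0.2, 0.9]),
}

print(most_similar("clean", vectors))
print(cosine_similarity(vectors["clean"], vectors["noisy"]))
```

Because only the angle matters, words can be compared regardless of
the length of their vectors.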
I didn't attend the tutorial sessions, so it will help to have a look
at the respective videos if one wants to learn and better understand
e.g. how interactive graphics can be created with Bokeh, for instance
within IPython notebooks, how Docker can be seen as sitting in between
tools like virtualenv and 'normal' virtual machines, and what one
could use instead of %timeit for code profiling.
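Since I missed the tutorial, I cannot say which profiling tools were
recommended there. As one option from the standard library, cProfile
shows where the time goes per function, while timeit only gives a
single total. The functions below are made up for this sketch.

```python
import cProfile
import io
import pstats
import timeit


def build_list(n):
    """Create the squares of 0..n-1."""
    return [i * i for i in range(n)]


def total(values):
    """Add up all values."""
    return sum(values)


def work(n=100_000):
    """Toy workload consisting of two separate steps."""
    return total(build_list(n))


# timeit: one wall-clock number for the whole call.
elapsed = timeit.timeit(work, number=5)
print("5 runs took %.4f s" % elapsed)

# cProfile: per-function statistics for the same workload.
profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

In an IPython notebook the %prun magic gives a similar report without
the boilerplate.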
Someone referred to the velox model server, and I also noted Luigi for
creating pipelines of batch jobs - if you remember the context, talk
or discussion in which those were mentioned, feel free to message me
@RaPrism.