Big Data

By Ulrich Hoffmann | Ulrich Hoffmann created a magazine on Flipboard.

NVIDIA Talks Up Numba For GPGPU Computing With Python

Numba is designed to allow high-performance, JIT-compiled Python code, aiming for C/C++ levels of performance while using LLVM for optimizations …

Apache Showdown: Flink vs. Spark

Saiki is Zalando’s next generation data integration and distribution platform in a world of microservices. Saiki ingests data generated by …

How Apache Flink™ handles backpressure - data Artisans

People often ask us how Flink deals with backpressure effects. The answer is simple: Flink does not use any sophisticated mechanism, because it does …

Patterns for Streaming Realtime Analytics

We gave a tutorial at DEBS 2015 (9th ACM International Conference on Distributed Event-Based Systems), describing a set of realtime analytics …

Workshop: Mind blown: Crafting a Distributed Data Science Pipeline using Spark, Cassandra, Akka and the Spark Notebook | SkillsCast | 10th December 2015

Get your hands dirty with distributed tools. During these two hours we’ll have a quick overview on …

Flink 0.10: A significant step forward in open source stream processing - data Artisans

We are delighted to see that the Apache Flink™ community has announced the availability of Apache Flink™ 0.10. The 0.10 release is one of the largest …

How-to: Build a Complex Event Processing App on Apache Spark and Drools

Combining CDH with a business execution engine can serve as a solid foundation for complex event processing on big data. Event processing involves …

Saturday Morning Video: Statistical Learning with Big Data by Trevor Hastie

Slides: Statistical Learning with Big Data by Trevor Hastie …

Strata NYC 2015 - Supercharging R with Apache Spark

R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases from statistical inference to data visualization. However, handling large or distributed data with R is challenging. Hence R is used along with …

How we selected Apache Flink™ as our Stream Processing Framework at the Otto Group Business Intelligence Department - data Artisans

This is a guest post written by Christian Kreutzfeldt (@mnxfst) and Alexander Kolb (@lofifnc) from the Otto Group Business Intelligence Department. …

Distributed Stream and Graph Processing with Apache Flink

Apache Flink is a top-level Apache project that unifies distributed stream and batch processing. At the core of Apache Flink is a streaming …

6 Ways To Ask Smarter Questions Of Big Data

To drive more value out of your big data, you have to start with the right questions. Here are six strategies for improving the quality of the …

Mastering Hadoop MapReduce Software Framework To Sort & Manage Large Amounts Of Data

Big data challenges modern businesses daily as data sets become larger and impossible for existing software to read. Enter Hadoop MapReduce, a …

Real Time Analytics With Spark Streaming and Cassandra

Spark Streaming is a good tool for rolling up transaction data into summaries as it enters the system. When paired with an easily idempotent data store …

Scientific method: Statistical errors

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. For a brief moment in 2010, Matt Motyl was on …

15 Most Read Data Science Articles in 2015. So far …

We've compiled the latest set of "most read" articles from the Data Science Weekly Newsletter. This is what is most popular thus far in 2015 - a mix …

Analyzing Flight Data: A Gentle Introduction to GraphX in Spark

Basics of GraphX: GraphX, as you might have guessed, is built upon this basic paradigm of graph theory. It's awesome. Requirements: Now this tutorial is …

Basic MCMC and Bayesian statistics in... BASIC! - Publishable Stuff

The BASIC programming language was at one point the most widespread programming language. Many home computers in the 80s came with BASIC (like the …

Big Data VR app allows researchers to 'browse' genomes

Earlier this year, Epic Games (the folks that made the Unreal Engine) held a $20,000 competition that challenged VR companies to create programs that could help users better tackle the valuable, albeit unwieldy, figures in Big Data sets. For its entry into "The Big Data VR Challenge" Hammerhead VR …

Data science blogs

A curated list of data science blogs
• A Blog From a Human-engineer-being http://www.erogol.com/ (RSS)
• Aakash Japi http://aakashjapi.com/ (RSS)
• Adit …

Pandashells

Introduction: For decades, system administrators, dev-ops engineers and data analysts have been piping textual data between unix tools like grep, awk, …

Pipes.Tutorial

Introduction: The pipes library decouples stream processing stages from each other so that you can mix and match diverse stages to produce useful …

Merging datasets using graph analytics

Imagine it: Just when you’ve perfectly decorated your living room, you receive a beautiful gift from your boss. It’s a life-size golden statue of a …

Ibis Project Blog

This year, I collaborated with members of the Apache Impala (incubating) team at Cloudera to create a new C++ library to eventually become a faster, …

Interactive Audience Analytics With Spark and HyperLogLog

At Collective we are working not only on cool things like Machine Learning and Predictive Modeling, but also on reporting that can be tedious and …

Data flow vs. procedural programming: How to put your algorithms into Flink by Mikio Braun 23.06.15

Four things to know about reliable Spark Streaming