Hadoop

Hortonworks plays balancing act in the cloud

Among the big elephant platform providers, Hortonworks remains just about the last one not shy of having Hadoop associated with its name. But after …

Big Data

vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers) We here at Win-Vector LLC have some really big news we …

Machine Learning

Performance cost of HBase table with a high number of versions per row?

We are implementing an HBase storage mechanism which will have one table that will make use of a (String) row key and (long) timestamp to maintain …

Databases

Save and load model in Apache Spark

I'm trying to implement a simple model to detect anomalies based on a Gaussian distribution, but I don't know if there is a way to save the model like …

Test Data

Force Apache Flink to execute at a given point

It is my understanding that Apache Flink does not actually run the operations that you ask it to until the result of those operations is needed for …

Ubiquitous Computing

Linear Regression, with Map-Reduce

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers) Sometimes, with big data, matrices are too big …

MapReduce

How to view Apache Parquet file in Windows?

I couldn't find any plain English explanations regarding Apache Parquet files. Such as:
• What are they?
• Do I need Hadoop or HDFS to view/create/store …

Big Data

How to name outputs of Kafka-HDFS-Ingestion job containing Apache Kafka topic names in Apache Gobblin?

I have tested Gobblin with Hadoop and Apache Kafka using the Kafka-HDFS-Ingestion job. The example is available here. In Kafka, I have 2 topics and I can …

Linux

Online time series anomaly detection with Apache Spark

We have a data pipeline system: Apache Kafka -> Spark Streaming -> Spark MLlib. The data consumed is time series data (e.g. each record is in the form …

Apache Spark

Getting null pointer exception when running saveAsNewAPIHadoopDataset in scala spark2 to hbase

I am saving an RDD of Puts to HBase using saveAsNewAPIHadoopDataset. Below is my job creation and submission. This is my job creation function. This function …

Ubiquitous Computing

Hortonworks Data Platform turns 3.0; new cloud partnerships announced

HDP hits magic v3 milestone, gets containers and new OLAP-y version of Hive. Hortonworks announces revamped Microsoft Azure partnership and team-ups …

Big Data

DBA to Big Data DBA | Oracle Magazine

Turn your DBA skills into big data skills. By Michelle Malcher. May/June 2014. If you are a DBA dealing with big databases, you might think that you …

Big Data

Compute euclidean distance between two DenseMatrix in Scala

I have two DenseMatrix objects and I want to compute the Euclidean distance between each pair of points with Scala. Thanks

Apache Spark

Scala: how to find the max of Integers in a List[Row]

Let the list be … How do I get the maximum value in Scala? The list is List[Row], not List[Int].

Apache Spark

Updating grouper column based on specific column value in Scala/Apache Spark

Here is what I am trying to accomplish using Spark or Scala + Spark: Each instance of "A", in column_1, signifies the start of a new group up until the …

Fish

Apache Spark - Tweets Processing

Given a huge dataset of tweets I need to:
• extract and count the hashtags.
• extract and count the emoticons/emojis.
• extract and count the words …

Apache Spark

Bulk loading timestamp sensitive data HBase

We have a lot of historical data that we need to migrate into HBase. The setup of our HBase is that the (timestamp) versioning is relevant and using …

HBase

Where is scala on node with spark-shell installed?

I have Apache Spark installed on a cluster. I can run spark-shell on the cluster master node. So it means Scala is installed on this …

Ubiquitous Computing

How to make HDFS work in docker swarm

I am having trouble making my HDFS setup work in Docker Swarm. To understand the problem I've reduced my setup to the minimum:
• 1 physical machine
• 1 …

Microservices

"On Client" DataTable SQL Query

I'm running a SQL query against several instances of our WCMS software that all have a fairly uniform schema but live on unconnected SQL Server …

SQL Servers

Data science moves away from big data towards containers and cloud

Traditional Hadoop-style big data is giving way to cloud and container solutions like Docker, according to the results of a new survey. The State of …

Artificial Intelligence

Build a MapReduce flow in Elixir

Giving the Elephant Some Elixir. MapReduce is a common Big Data pattern for analyzing a data set concurrently. This tutorial will introduce you to …

MapReduce

Optimizing Apache Spark JDBC with SQL Server using boundary limits

We have a framework that uses Apache Spark to get data from SQL Server using Spark SQL. You can see a sample of the query below. Here I have …

SQL Servers

hadoop streaming failed with error code 1 in RHadoop

I am working with RHadoop using the following code: When I run this code, I get an error: packageJobJar: [/tmp/hadoop-unjar7138506441946536619/] [] …

Databases

Write spark dataframe into existing parquet hive table

I want to write a Spark DataFrame into an existing Parquet Hive table. I am able to do it using df.write.mode("append").insertInto("myexistinghivetable") but …

Ubiquitous Computing

Spark-submit -class command not found?

I am running a project with Kafka and Apache Spark. To run my Kafka stream I am running this command from within the project: However, I simply get the …

Scala

I want to learn about Hadoop admin. Are there any good books or tutorials?

I want to become a Hadoop admin. Are there any good books or tutorials? And what are the day-to-day activities that Hadoop admins do? Do they develop …

Tutorials

Awesome design patterns

A curated list of software and architecture related design patterns. Software design pattern - A general, reusable solution to a commonly occurring …

Microservices

Cassandra vs HBase Consistency Model

How is Cassandra's eventual consistency model different from HBase? It seems Facebook moved from Cassandra to HBase because of consistency issues. Which …

HBase

Show progress of Java 8 stream processing

I have a Stream processing a few million elements. The Map-Reduce algorithm behind it takes a few milliseconds, so task completion takes about …

MapReduce