Faster generalised linear models in largeish data
A method for faster generalized linear models on large datasets using a single database query and one Newton-Raphson iteration.
A tutorial on installing OmniSci (formerly MapD) using Docker and loading data for GPU-accelerated SQL analytics and visualization.
A technical guide on installing and configuring Oracle GoldenGate for Big Data with Kafka Connect and Confluent Platform.
A list of 19 Apache Kafka-related technical sessions at Oracle OpenWorld, JavaOne, and Oak Table World 2017 conferences.
A personal reflection on the trade-offs between convenience and privacy in an era of AI, IoT, and pervasive data collection.
Explains improvements in joblib's compressed persistence for Python, focusing on reduced memory usage and single-file storage for large numpy arrays.
Technical guide on building a real-time Twitter sentiment analysis system using Apache Kafka and Storm.
Explains Lambda Architecture for Big Data, combining batch processing (Hadoop) and real-time stream processing (Spark, Storm) to handle large datasets.
A tutorial on building data pipelines using Microsoft Azure Data Factory, covering ingestion, transformation, and orchestration.
A reflection on the challenges of data science in academia, discussing the 'brain drain' of data skills and the need for systemic change.
A data engineer shares five practical lessons and performance tips for working with Apache Hive, focusing on common pitfalls and optimizations.
How to fix MongoDB Connector for Hadoop authentication errors by granting the user the clusterManager role.
An explanation of Microsoft Azure HDInsight, a managed Apache Hadoop service for processing big data on Azure.
Final tutorial on analyzing airline data with Hadoop using Hive for SQL queries and Pig for scripting, covering setup and basic analytics.
Explores how the demand for big data skills in industry is draining talent from academic science, threatening research.
A tutorial on using Apache Hive to create tables and views from data loaded into a Hadoop cluster, continuing a multi-part series.
Explains how to parallelize QR decomposition for linear models on big data using R's biglm package and incremental merging.
A practical guide introducing Hadoop's ecosystem and setting up a proof-of-concept cluster on Amazon EC2 using Cloudera for big data processing.
A guide to installing and using R on Amazon EC2 instances to overcome in-memory limitations for big data analysis.
Announcement for DevNexus 2013, a Java/JVM technology conference in Atlanta, featuring sessions on cloud, mobile, web, and more.