IBM is using Apache Spark to analyse radio signals for signs of extra-terrestrial intelligence.
Speaking at Apache: Big Data Europe, Anjul Bhambrhi, vice president of big data products at IBM, talked about how the firm has thrown its weight behind Spark.
“We think of [Spark] as the analytics operating system. Never before have so many capabilities come together on one platform,” Bhambrhi said.
Spark is a key project because of its speed and ease of use, and because it integrates seamlessly with other open-source components, Bhambrhi explained.
“Spark is speeding up even MapReduce jobs, even though they are batch oriented, by two to six times. It’s making developers more productive, enabling them to build applications in less time and with fewer lines of code,” she claimed.
She revealed IBM is working with Nasa and Seti to analyse radio signals for signs of extra-terrestrial intelligence, using Spark to process the 60Gbit of data generated per second by various receivers.
Other applications IBM is working on with Spark include genome sequencing for personalised medicine via the Adam project at UC Berkeley in California, and early detection of conditions such as diabetes by analysing patient medical data.
“At IBM, we are certainly sold on Spark. It forms part of our big data stack, but most importantly we are contributing to the community by enhancing it,” Bhambrhi said.
The Apache: Big Data Europe conference also saw Canonical founder Mark Shuttleworth outline some of the key problems in starting a big data project, such as simply finding engineers with the skills needed just to build the infrastructure for operating tools such as Hadoop.
“Analytics and machine learning are the next big thing, but the problem is there are just not enough ‘unicorns’, the mythical technologists who know everything about everything,” he explained in his keynote address, adding that the blocker is often just getting the supporting infrastructure up and running.
Shuttleworth went on to demonstrate how the Juju service orchestration tool developed by Canonical could solve this problem. Juju enables users to describe the end configuration they want, and will automatically provision the servers and software and configure them as required.
This could be seen as a pitch for Juju, but Shuttleworth’s message was that the open-source community is delivering tools that can manage the underlying infrastructure so that users can focus on the application itself.
“The value creators are the guys around the outside who take the big data store and do something useful with it,” he said.
“Juju enables them to start thinking about the things they need for themselves and their customers in a tractable way, so they don’t need to go looking for those unicorns.”
The Apache community is working on a broad range of projects, many of which are focused on specific big data problems, such as Flume for handling large volumes of log data or Flink, another processing engine that, like Spark, is designed to replace MapReduce in Hadoop deployments.
While not an entirely unique concept, a new decentralized service is beta testing a peer-to-peer network that would “rent” unused space from your PC’s hard drive as part of a cloud service to store files from other users.
The service, called Storj, uses the network and end-to-end encryption to allow the transfer of data to and from your computer’s drive. Your hard drive is literally used to store other people’s data.
During a crowdsourcing campaign last year, Storj garnered 910 Bitcoins valued at $461,802, according to the CoinDesk Bitcoin Price Index.
Users who rent out space on their hard drives earn “Storjcoin X” (SJCX), a form of currency that can be used to purchase capacity on Storj’s “DriveShare” service.
Users earn SJCX by selling excess hard drive space through DriveShare, and can spend it to purchase space on the Storj Metadisk network using the company’s file sharing app.
Users who want to store files on the peer-to-peer network simply drag and drop them into the Metadisk app, where they’re then listed for viewing or retrieval. If a user wants to share a file with someone else, they simply click on a “copy URL” icon and send along the resulting URL.
The peer-to-peer cloud storage network allows users to transfer and share data without relying on a third-party data provider. Storj claims that by removing any form of central controls, it eliminates most traditional data failures and outages, “as well as significantly increasing security, privacy and data control.”
The service works by first uploading a file-sharing application onto a user’s computer then breaking file data into small 8MB or 32MB blocks, or “shards,” as Storj calls them. Each block of data is encrypted with a unique hash, and then the pieces are distributed throughout the cloud network, according to a white paper the company published on its peer-to-peer storage technology.
The file blocks get distributed throughout the network on nodes called “DriveShares” located all over the world.
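The sharding step described above can be illustrated with a minimal Python sketch. It is not Storj's code: the function names, the round-robin placement policy and the node names are hypothetical, SHA-256 is assumed as the fingerprint, and the encryption step is left out entirely.

```python
import hashlib
from typing import Dict, List

SHARD_SIZE = 8 * 1024 * 1024  # 8MB, one of the two shard sizes quoted in the article


def split_into_shards(data: bytes, shard_size: int = SHARD_SIZE) -> List[bytes]:
    """Cut the data into consecutive fixed-size shards; the last one may be shorter."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def fingerprint(shard: bytes) -> str:
    """SHA-256 hex digest, used here as a stand-in identifier for a shard."""
    return hashlib.sha256(shard).hexdigest()


def assign_to_nodes(shards: List[bytes], nodes: List[str]) -> Dict[str, str]:
    """Map each shard fingerprint to a node name, round-robin (placeholder policy)."""
    return {fingerprint(s): nodes[i % len(nodes)] for i, s in enumerate(shards)}
```

A real network would encrypt each shard before it leaves the owner's machine and would choose nodes on availability and reputation rather than in rotation.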
Storj uses hash trees, also known as Merkle trees, to verify the contents of a file after it has been broken up into blocks, which form the “leaves” beneath a master or root hash.
Storj periodically runs cryptographic checks on the integrity and availability of a file, and offers direct rewards to those maintaining the file.
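How a Merkle tree lets a single block be checked against a root hash can be sketched in a few lines of Python. This is a generic illustration, not Storj's implementation: SHA-256 and the rule of pairing an odd node with itself are assumptions, and all function names are made up for the example.

```python
import hashlib
from typing import List, Tuple


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(blocks: List[bytes]) -> List[List[bytes]]:
    """Return every level of the tree, from the leaf hashes up to the single root."""
    if not blocks:
        raise ValueError("at least one block is required")
    level = [sha256(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:  # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(blocks: List[bytes]) -> bytes:
    return build_levels(blocks)[-1][0]


def proof_for(blocks: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag is True if the sibling is on the right."""
    proof = []
    for level in build_levels(blocks)[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):  # last node of an odd level is its own sibling
            sibling = index
        proof.append((level[sibling], index % 2 == 0))
        index //= 2
    return proof


def verify(block: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root from one block and its sibling hashes."""
    node = sha256(block)
    for sibling, sibling_is_right in proof:
        node = sha256(node + sibling) if sibling_is_right else sha256(sibling + node)
    return node == root
```

The point of the structure is that a verifier holding only the root hash can check one block using a handful of sibling hashes, without downloading the rest of the file.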
AMD announced that it would demonstrate the first implementation of Apache Hadoop on an ARM Cortex-A57 part at the JavaOne conference.
The chip in question is of course an A-series Opteron. AMD recently announced the Opteron A1100 and it is the company’s first ARM-based server part.
The presentation was delivered by AMD corporate fellow Leendert van Doorn and Henrik Stahl, VP of Java product management and IoT at Oracle.
“This demonstration showcases AMD’s leadership in the development of a robust, standards-based ecosystem for ARM servers,” said van Doorn. “Servers powered by AMD Opteron A-Series processors are well-suited for Hadoop, offering an efficient scale-out compute platform that can also double as an economical persistent storage platform.”
The demo showed an A1100 dev platform running Apache Hadoop on the Oracle JDK. AMD said it would continue its collaboration with ARM, Oracle, Red Hat, Linaro and SUSE in order to boost ARM development in the server space.
Apache Spark, a high-speed analytics engine for the Hadoop distributed processing framework, is now available to plug into the YARN resource management tool.
This development means that it can now be easily deployed along with other workloads on a Hadoop cluster, according to Hadoop specialist Hortonworks.
Released as version 1.0.0 at the end of May, Apache Spark is a high-speed engine for large-scale data processing, created with the aim of being much faster than Hadoop’s better-known MapReduce function, but for more specialised applications.
Hortonworks vice president of Corporate Strategy Shaun Connolly told The INQUIRER, “Spark is a memory-oriented system for doing machine learning and iterative analytics. It’s mostly used by data scientists and high-end analysts and statisticians, making it a sub-segment of Hadoop workloads but a very interesting one, nevertheless.”
As a relatively new addition to the Hadoop suite of tools, Spark is getting a lot of interest from developers using the Scala language to perform analysis on data in Hadoop for customer segmentation or other advanced analytics techniques such as clustering and classification of datasets, according to Connolly.
With Spark certified as YARN-ready, enterprise customers will be able to run memory and CPU-intensive Spark applications alongside other workloads on a Hadoop cluster, rather than having to deploy them in a separate cluster.
“Since Spark has requirements that are much heavier on memory and CPU, YARN-enabling it will ensure that the resources of a Spark user don’t dominate the cluster when SQL or MapReduce users are running their application,” Connolly explained.
Meanwhile, Hortonworks is also collaborating with Databricks, a firm founded by the creators of Apache Spark, in order to ensure that new tools and applications built on Spark are compatible with all implementations of it.
“We’re working to ensure that Apache Spark and its APIs and applications maintain a level of compatibility, so as we deliver Spark in our Hortonworks Data Platform, any applications will be able to run on ours as well as any other platform that includes the technology,” Connolly said.
The Apache Software Foundation released an advisory warning that a patch issued in March for a zero-day vulnerability in Apache Struts did not fully fix the bug. A patch for the patch is in development and will likely be released within the next 72 hours.
Rene Gielen of the Apache Struts team said that once the release is available, all Struts 2 users are strongly recommended to update their installations. ASF provided a temporary mitigation that users are urged to apply. On March 2, a patch was made available for a ClassLoader vulnerability in Struts up to version 126.96.36.199. All an attacker had to do was manipulate the ClassLoader via request parameters. However, Apache admitted that its fix was insufficient to repair the vulnerability. An attacker exploiting the vulnerability could also cause a denial-of-service condition on a server running Struts 2.
“The default upload mechanism in Apache Struts 2 is based on Commons FileUpload version 1.3 which is vulnerable and allows DoS attacks. Additional ParametersInterceptor allows access to ‘class’ parameter which is directly mapped to getClass() method and allows ClassLoader manipulation.”
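The general idea behind blocking this kind of attack, rejecting request parameters whose names reach the ‘class’ property, can be sketched in Python. This is an illustration of the concept only: it is not the mitigation published by the Struts team and not the pattern Struts uses, and the function names and the segment-splitting rule are assumptions made for the example.

```python
import re
from typing import Dict

# Hypothetical rule: split a parameter name into path segments on dots,
# brackets and quotes, e.g. "a.b['c']" -> ["a", "b", "c"].
_SEGMENT_SPLIT = re.compile(r"[.\[\]'\"]+")


def is_suspicious(param_name: str) -> bool:
    """True if any path segment of the name is exactly 'class' (any letter case)."""
    segments = [s for s in _SEGMENT_SPLIT.split(param_name) if s]
    return any(s.lower() == "class" for s in segments)


def filter_params(params: Dict[str, str]) -> Dict[str, str]:
    """Drop parameters whose names would navigate into the 'class' property."""
    return {k: v for k, v in params.items() if not is_suspicious(k)}
```

Matching whole segments rather than substrings keeps ordinary names such as "classification" from being rejected.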
It will be the third time that Struts has been updated this year. In February, the Apache Struts team urged developers to upgrade Struts 2-based projects to use a patched version of the Commons FileUpload library to prevent denial-of-service attacks.
Database company SkySQL has announced a $20m round of funding to develop MariaDB, the open source fork of the MySQL database.
The Series B funding round was led by Intel Capital to support developing the MariaDB database into “a world-class database to challenge American rivals such as IBM and Oracle”.
Since the merger of SkySQL, founded by ex-members of the MySQL team, and MariaDB architect Monty Program back in April, the new company has been seeking ways to develop the open source project, including increased back end support and scalability of the MariaDB server software.
Other investors in the consortium are California Technology Ventures, Finnish Industry Investment, Open Ocean Capital and Spintop Private Partners alongside the lead California investors.
“Adoption of the MariaDB database server has grown explosively in the last year,” said SkySQL CEO Patrik Sallner. “With the help of our loyal user base, we have built up significant market share when compared to other open source database technologies.”
Sallner added, “For large-scale internet players like Google and Wikipedia, MariaDB database server delivers clear benefits over existing relational databases.
“With this funding we plan to deliver commercial solutions that make it even easier for enterprises of any size to run MariaDB databases at scale.”
Since its formation in 2010, SkySQL has attracted some blue chip clients including Craigslist, EA, HP and Disney.
Intel has released its Apache Hadoop distribution, claiming significant performance benefits through its hardware and software optimisation.
Intel’s push into the datacentre has largely been visible with its Xeon chips but the firm works pretty hard on software as well, including contributing to open source projects such as the Linux kernel and Apache’s Hadoop to ensure that its chips win benchmark tests.
Now Intel has released its Apache Hadoop distribution, the third major revision of its work on Hadoop, citing significant performance benefits and claiming it will open source much of its work and push it back upstream into the Hadoop project.
According to Intel, most of the work it has done in its Hadoop distribution is open source. However, the firm said it will retain the source code for the Intel Manager for Apache Hadoop, the cluster management part of the distribution. Intel said it will use this to offer support services to datacentres that deploy large Hadoop clusters.
Boyd Davis, VP and GM of Intel’s Datacentre Software Division, said, “People and machines are producing valuable information that could enrich our lives in so many ways, from pinpoint accuracy in predicting severe weather to developing customised treatments for terminal diseases. Intel is committed to contributing its enhancements made to use all of the computing horsepower available to the open source community to provide the industry with a better foundation from which it can push the limits of innovation and realise the transformational opportunity of big data.”
Intel trotted out some impressive industry partners that it has been working with on the Hadoop distribution and while the firm’s direct income from the Hadoop distribution will come from support services, the indirect income from Xeon chip sales is likely what Intel is most looking towards as Hadoop adoption grows to manage the extremely large data sets that the industry calls “big data”.
Big Blue wants to take on competitors such as Oracle and Hewlett Packard by offering a cheap and cheerful Power Systems server and storage product range.
Rod Adkins, a senior vice president in IBM’s Systems & Technology Group, said the company was rolling out new servers based on its Power architecture, with the Power Express 710 starting at $5,947. He said that the 710 is priced competitively against commodity hardware from Oracle and HP.
Adkins added that IBM is expanding its Power and Storage Systems business into SMB and growth markets. The product launches on Tuesday. IBM said it will start delivering by February 20.
Adding to an already considerable set of cloud IT offerings, Amazon has unveiled a hosted data warehouse service called Redshift, pitching it as a lower-cost alternative to on-premise data warehouse deployments.
“Anyone who has used a traditional old-guard data warehouse solution knows that it is really expensive and complicated to manage,” said Andy Jassy, senior vice president of Amazon Web Services, who announced the new offering at the company’s AWS re:Invent conference being held this week in Las Vegas. In contrast, Redshift “is about the tenth of a cost of [a] traditional data warehouse,” Jassy said. “It automates the deployment and administration and works with popular business intelligence tools.”
A limited preview version of the service is now available. Amazon said it will launch the service commercially in early 2013.
Redshift works with a number of business intelligence (BI) applications, including software packages from Microstrategy, SAP, IBM and Jaspersoft. Users would use one of these BI packages to parse data in the Amazon cloud, using PostgreSQL drivers along with ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) APIs.
Users can store up to 1.6 petabytes across as many as 100 nodes of either 2 terabytes or 16 terabytes each. The data will be stored in a columnar format so the “queries will be much faster,” Jassy said.
Amazon will offer the service on a pay-as-you-go billing basis, or for slightly less expensive rates by reserving the service ahead of time. Prices start at US$0.85 per hour for ad-hoc querying and decline from there for greater usage. On the whole, the service could cost as little as $1,000 per year per terabyte of data, compared to an average cost of $19,000 to $25,000 per terabyte per year to maintain data warehouse operations in-house, Jassy noted.
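The capacity and cost figures quoted above can be checked with some simple arithmetic. The short Python sketch below uses only the numbers given in the article; the constant and function names are made up for illustration, decimal units (1 petabyte = 1,000 terabytes) are assumed, and the $1,000 figure is the best case Jassy cited rather than a guaranteed price.

```python
# Figures quoted in the article
MAX_NODES = 100
LARGE_NODE_TB = 16
REDSHIFT_COST_PER_TB_YEAR = 1_000               # "as little as", in US dollars
IN_HOUSE_COST_PER_TB_YEAR = (19_000, 25_000)    # quoted average range


def max_capacity_tb(node_size_tb: int, nodes: int = MAX_NODES) -> int:
    """Total capacity of a cluster made up of identically sized nodes."""
    return node_size_tb * nodes


def yearly_cost(terabytes: int, cost_per_tb_year: int) -> int:
    """Yearly cost of holding the given number of terabytes."""
    return terabytes * cost_per_tb_year


capacity_tb = max_capacity_tb(LARGE_NODE_TB)     # 1,600 TB, i.e. 1.6 petabytes
redshift_yearly = yearly_cost(capacity_tb, REDSHIFT_COST_PER_TB_YEAR)
in_house_yearly = tuple(yearly_cost(capacity_tb, c) for c in IN_HOUSE_COST_PER_TB_YEAR)
```

On these figures a full 1.6 petabyte cluster would cost about $1.6m a year on Redshift at the best-case rate, against $30.4m to $40m a year at the in-house rates Jassy quoted.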
Dell is offering access to its Zinc ARM-based server to the Apache Software Foundation for development and testing purposes.
Dell had already shown off its Copper ARM-based server earlier this year and said it intends to bring ARM servers to market “at the appropriate time”. Now the firm has allowed the Apache Software Foundation access to another Calxeda ARM-based server codenamed Zinc.
Dell’s decision to give the Apache Software Foundation access to the hardware is not surprising as it is the organisation that oversees development of the popular Apache HTTPD, Hadoop and Cassandra software products, all applications that are widely regarded as perfect for ARM-based servers. The firm said its Zinc server is accessible to all Apache projects for the development and porting of applications.
Forrest Norrod, VP and GM of Server Solutions at Dell, said, “With this donation, Dell is further working hand-in-hand with the community to enable development and testing of workloads for leading-edge hyperscale environments. We recognize the market potential for ARM servers, and with our experience and understanding of the market, are enabling developers with systems and access as the ARM server market matures.”
Dell didn’t give any technical details on its Zinc server and said it won’t be generally available. However, the firm reiterated its goal of bringing ARM-based servers to the market, though given that it is trying to help the Apache Foundation, a good indicator of ARM server viability will be when the Apache web server project has been ported to the ARM architecture and has matured to production status.
Java developers looking for a mobile-friendly platform could be happy with the next release of IBM’s Websphere Application Server, which is aimed at offering a lighter, more dynamic version of the app middleware.
Shown off at the IBM Impact show in Las Vegas on Tuesday, Websphere Application Server 8.5, codenamed Liberty, has a footprint of just 50MB. This makes it small enough to run on machines such as the Raspberry Pi, according to Marie Wieck, GM for IBM Application and Infrastructure Middleware.
Updates and bug fixes can also be done on the fly with no need to take down the server, she added.
The Liberty release will be launched this quarter, and already has 6,000 beta users, according to Wieck.
John Rymer of Forrester said that the compact and dynamic nature of the new version of Websphere Application Server could make it a tempting proposition for Java developers.
“If you want to install version seven or eight, it’s a big piece of software requiring a lot of space and memory. The installation and configuration is also tricky,” he explained.
“Java developers working in the cloud and on mobile were moving towards something like Apache Tomcat. It’s very light, starts up quickly and you can add applications without having to take the system down. IBM didn’t have anything to respond to that, and that’s what Liberty is.”
For firms needing to update applications three times a year, for example, the dynamic capability of Liberty will make it a much easier process.
“If developers want to run Java on a mobile device, this is good,” Rymer added.
The new features are also backwards compatible, meaning current Websphere users will be able to take advantage of the improvements.
However, IBM could still have difficulty competing in the app server space on a standalone basis, according to Rymer.
“Red Hat JBoss costs considerably less, and there’s been an erosion for IBM as it’s lost customers to Red Hat and Apache. Liberty might have an effect here,” he said.
“But IBM wins where the customer isn’t just focused on one product. It will never compete on price, but emphasises the broader values of a platform or environment.”
IBM will be demoing Websphere running on Raspberry Pi at Impact today.
The open source Apache Hadoop project has reached the milestone of its first full release after six years of development. Hadoop is a software framework for reliable, scalable and distributed computing under a free licence. Apache describes it as “a foundation of cloud computing”.
“This release is the culmination of a lot of hard work and cooperation from a vibrant Apache community group of dedicated software developers and committers that has brought new levels of stability and production expertise to the Hadoop project,” said Arun Murthy, VP of Apache Hadoop.
“Hadoop is becoming the de facto data platform that enables organizations to store, process and query vast torrents of data, and the new release represents an important step forward in performance, stability and security,” he added.
Apache Hadoop allows for the distributed processing of large data sets, often petabytes, across clusters of computers using a simple programming model.
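The programming model Hadoop is best known for is MapReduce, and its shape can be shown with a single-process word count in Python. This is only a sketch of the idea and does not use Hadoop or its APIs: the function names are made up for the example, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between the two phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, the same two functions can in principle be spread across as many machines as the data requires.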
The Hadoop framework is used by some big name organisations including Amazon, Ebay, IBM, Apple, Facebook and Yahoo.
Yahoo has significantly contributed to the project and hosts the largest Hadoop production environment with more than 42,000 nodes.
Jay Rossiter, SVP of the cloud platform group at Yahoo, said, “Apache Hadoop will continue to be an important area of investment for Yahoo. Today Hadoop powers every click at Yahoo, helping to deliver personalized content and experiences to more than 700 million consumers worldwide.”