• Using NiFi to write to HDFS on the Hortonworks Sandbox

    The [Hortonworks Sandbox](http://hortonworks.com/sandbox/) is designed to be a self-contained learning platform for HDP. It runs in a VM, with all the components of HDP installed in a very small environment. Obviously there are some compromises inherent in squeezing a powerful clustered computing system into a single VM on your laptop, but it remains a powerful tool for trying out the functionality that HDP has to offer.

    Hortonworks DataFlow (HDF) is a new tool which provides a simple means of ingesting data into HDP and other platforms. In this tutorial I’m going to show you how to hook up an instance of HDF running locally, or in a VM, to a remote instance running within the sandbox. That remote instance then has easy access to services inside the sandbox such as HDFS, HBase, Solr and Kafka.

    Why not just go direct to these services? In VM environments there are often a range of port forwarding and routing issues to sort out before you can reach them. You would also have to download client configurations to link your machine to the sandbox, which can over-complicate the learning process.

    So let’s start by putting HDF onto the sandbox. [Ali Bajwa](https://github.com/abajwa-hw) provides an excellent NiFi service for Ambari, along with [instructions to install it on the sandbox](https://github.com/abajwa-hw/ambari-nifi-service). There are also some great first-step tutorials on building flows with the newly installed NiFi on your sandbox.

    Enabling remote connections (site-2-site)

    Now let’s look at how to connect up a NiFi instance running outside the sandbox. First we need to enable the remote port on our sandbox instance. To do this, open Ambari and find the NiFi service on the left:

    Then find the Advanced nifi-properties-env section:

    In this section, enable remote site-2-site by specifying a port in the nifi.remote.input.socket.port property. Since we’re just using a sandbox, we’re also going to turn off encryption of the site-2-site channel by setting nifi.remote.input.secure=false.
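
    Once applied, the relevant entries in nifi.properties will look something like this (9091 is just the port this tutorial assumes; any free port will do):

    nifi.remote.input.socket.port=9091
    nifi.remote.input.secure=false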

    Don’t forget to restart your NiFi service in Ambari to apply these changes.

    What this does is establish the port for the data channel used by the NiFi site-2-site protocol. This is separate from the API control channel that serves the NiFi GUI (9090 by default).

    If you’re using NAT networking (for example with the VirtualBox sandbox), you will also need to forward the remote data port on the VM. To do this, go to the network settings of your virtual machine:

    And add two port forwarding rules: one for port 9090 (the default NiFi GUI and API port) and one for 9091 (the data channel for the NiFi site-2-site protocol).
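
    If you prefer the command line, rules like the following should achieve the same thing on a running VM (the VM name here is an assumption; check yours with VBoxManage list vms):

    VBoxManage controlvm "Hortonworks Sandbox" natpf1 "nifi-gui,tcp,,9090,,9090"
    VBoxManage controlvm "Hortonworks Sandbox" natpf1 "nifi-s2s,tcp,,9091,,9091"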

    Once we’ve got the configuration in place, we can create a flow on the sandbox with an input port for the remote connection, and a PutHDFS processor to write out the data. Of course, if we were doing this properly, we would include a MergeContent processor before the PutHDFS to ensure we’re not writing too many small files to HDFS, but for the purposes of this demonstration this will do fine.

    Setting up the remote collector

    Now let’s go to another NiFi instance running directly on our host machine (you could of course use another VM, as long as you have VM-to-VM networking enabled). For this one we’re just going to put together a simple flow to demonstrate the remote link.

    Drag a Remote Process Group onto the canvas, and give it the URL of the sandbox NiFi interface (note this is exactly the URL you would type into a web browser, NOT the port you set in the remote settings).
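
    With the NAT rules above in place, that URL from the host machine will be something like:

    http://localhost:9090/nifi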

    Right-click on this to Enable Transmission, and let the remote process group fetch the list of input ports on the other side.

    We can now connect data into one of those ports:

    Before data will flow, we also need to turn on the remote port. Right-click on the remote process group, and bring up the remote ports dialog. Here you should have a switch to turn on the remote port that will be receiving data.

    Note you can also tune the number of connections between the two. Between one and four usually makes sense.

    You should now be able to see data flowing from your local NiFi on the laptop into the NiFi instance running in the sandbox. The sandbox instance will be accepting that data and writing it into HDFS.

    This provides a simple example of how the remote site-2-site protocol is set up. The pattern proves extremely powerful when collecting data from remote sites or servers, and as a means of communicating between different HDF instances. Remember, HDF is a two-way data flow engine after all. The protocol also provides the means to configure secure two-way authenticated SSL, supports compression on the wire, and transfers both the data payload and the FlowFile attributes between NiFis.

    more...
  • How to listen for HTTPS with ListenHTTP in NiFi

    NiFi provides a way of listening for HTTP requests on a port on a NiFi node. The ListenHTTP processor feeds the content of each request into the rest of the flow as a FlowFile. By default it provides a plain-text HTTP service, but you can also configure the processor to provide an SSL endpoint.
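
    As a quick sketch, assuming ListenHTTP’s Listening Port is set to 8081 (an arbitrary choice) and it keeps its default Base Path of contentListener, you can push a FlowFile in from the command line:

    curl -d 'hello nifi' http://localhost:8081/contentListener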

    more...
  • Ambari only serves gzip

    By default Ambari only serves up gzip-encoded resources. This is of course the right thing to do. However, sometimes the realities of a corporate network mean old proxies that do odd things. This will stop Ambari from working, and frankly makes it look really weird (just like any SPA without its styles and scripts).

    To get Ambari to serve plain-encoded versions as well, just run:

    for a in /usr/lib/ambari-server/web/{javascripts,stylesheets}/*.gz; do gzip -dc "$a" > "${a%.gz}"; done

    on your ambari-server. This will decompress the gzip files to generate plain-text encoded versions of them, which Ambari’s Spring content negotiation will then serve in the absence of gzip in the Accept-Encoding header. Problem solved.
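
    You can verify the fix by explicitly refusing gzip (the hostname and file name below are just examples):

    curl -sI -H 'Accept-Encoding: identity' http://ambari-server:8080/stylesheets/app.css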

    See my Stack Overflow post on this.

    more...
  • Machine Learning without the PhD in the Cloud with Azure ML

    This talk was given at CloudBurst 2014 in Stockholm, Sweden.

    I promise to put some more info up here later!

    Slides

    more...
  • NoSQL Matters Dublin 2014

    Many thanks to the organisers, speakers and all the sponsors of NoSQL Matters Dublin.

    more...
  • Getting your Big Data on with HDInsight

    This webinar was delivered on 26 Aug 2014, to the Florida Azure User Group.

    more...
  • Know your data lineage

    An academic paper without the footnotes isn’t an academic paper. Journalists wouldn’t base a news article on facts that they can’t verify. So why would anyone publish reports without being able to say where the data has come from and be confident of its quality: in other words, without knowing its lineage (sometimes referred to as ‘provenance’ or ‘pedigree’)?

    more...
  • Riding the Elephant - Hadoop 2.0

    Hadoop is about so much more than MapReduce these days. This talk is a discussion of some of the new frameworks around Hadoop, and how you can make the best use of Hadoop 2.0 components like YARN, as well as new complementary tools like Spark and Mesos.

    more...
  • Finding (and using!) Big Data in your business

    This is the backup and resources page for my talk on the cultural, business, and technology aspects of making the most of Big Data, particularly Hive and Hadoop.

    more...
  • When to NoSQL and when to know SQL

    With NoSQL, NewSQL and plain old SQL, there are so many tools around it’s not always clear which is the right one for the job.

    more...
  • Quick Starting your HDInsight Cluster with Node CLI

    HDInsight is Microsoft’s Hadoop PaaS offering on Azure. Microsoft have partnered with Hortonworks to bring a hosted version of the Hortonworks Data Platform to the Azure service, which is great, but there’s more. The really clever thing they’ve done is to essentially replace (or at least sideline) the HDFS part of Hadoop and insert Azure Blob Storage in its place. This means you don’t need your cluster to be persistent: all your data lives off the cluster, but you still get the benefits of distributed compute, and pretty close to the benefits of data locality that you get with regular Hadoop. This lets you use the nice cheap redundant storage all the time, and only pay for huge compute nodes when you need them. But how do you manage turning clusters on and off all the time?

    more...
  • MongoDB-Hadoop connector on Azure HDInsight

    Recently, MongoDB brought out a connector that lets you use Pig, Hive and core MapReduce in Hadoop to operate on Mongo-sourced data.

    Someone was asking on Stack Overflow about connecting Mongo data to HDInsight, so I thought I’d make it work.

    more...
  • NDC London 2013 - MongoDB for C# developers

    I’ve recently been speaking at a number of conferences about MongoDB and how C# developers can make the best of it.

    more...
  • HDFS Explorer

    Apologies, I've learnt that since I left Red Gate, they have stopped supporting this product line, so this page really only serves as an archive. Please let me know if you are still looking for this functionality in the comments below. If enough people are interested, I may put together an alternative solution.

    We’ve been working for a few months at Red Gate on some new tools for querying data in Hadoop, and specifically Hive. One of the biggest problems we’ve had with our testing is finding a decent way to manage files in HDFS. So we’ve made a free utility to help manage files in HDFS clusters.

    more...
  • Changing video playback speed in Vimeo

    I watch a lot of conference session videos, lectures, and screencasts. I don’t have a lot of time. What I tend to do as a result is listen to them at 1.5 or 2 times the speed. That way I can get the content without the pauses, the ums, and the waiting around for everybody to catch up.

    more...
  • Personal Tech Radar

    Many people have come across the ThoughtWorks Tech Radar concept. It’s a great collection of the technologies, practices and tools that ThoughtWorks consultants recommend, and the stage they feel those technologies are at. We recently started putting one of these together at [Red Gate](http://dev.red-gate.com), as a way to guide learning and development, and a point of reference for technology choices when we spin up new projects. Even in its early days, it has already helped my team make some quick choices, and saved time on architecture validation, which has gone straight back into product value.

    more...
  • DDD North 2013 - MongoDB for C# developers

    On Saturday Oct 12, I delivered a presentation at DDD North (which was an excellent community-run conference in Sunderland) on MongoDB. Here are a few of the things I mentioned in my talk. Thanks to everyone who came along, and for all the great questions. Hope you enjoy getting started with MongoDB, and let me know how you get on with it.

    more...
  • Looking at the outliers in the crowdsourced postcodes

    A few weeks ago, I published some data on the accuracy of crowdsourced postcode data.

    This weekend I had a brief quiet moment, so I thought I'd try out a few visualisations to see if there was any particular grouping around the inaccuracies. There doesn't seem to be anything systematic.

    more...
  • Hive InputFormat to connect Hadoop to Azure Tables

    I recently came across a Stack Overflow post asking whether it was possible to run Hadoop jobs using Azure Table Storage as a data source. The user had been going through an ETL process to get data out of the Azure Tables store and into Hadoop for further processing. This seems unnecessary. I also needed a way to run SQL-like queries on top of very large Azure Diagnostics logs, and fancied writing a custom InputFormat anyway, so I put together a quick Hive Storage Handler which allows you to query Azure Tables directly in Hive.

    more...
  • Static search for Jekyll with lunr

    I have recently converted my blog to run on Jekyll, partly because I was fed up with the weight of WordPress just to run a simple text delivery system, and partly because I didn’t need the security hassle of a painted target for script kiddies on my domain.

    more...
  • Embedding Rmarkdown in a Jekyll site

    I'm a big fan of Rmarkdown, which I came across through the very handy R IDE, RStudio. Since I've recently converted this blog to use Jekyll, which is also markdown-based, I thought I'd have a go at combining the two. There are various ways to do this, but the approach below, a plugin creating a new Liquid tag, worked best for me.

    more...
  • How accurate are crowdsourced postcodes?

    A little while ago at the Cambridge Enterprise Search meetup, Nick Burch of Quanticate shared with us the fantastic story of http://www.npemap.org.uk/. This was set up in the days when the CodePoint dataset, containing the location of every postcode in the UK, was tightly guarded and ludicrously expensive. I also remember it being a pain to deal with: the pack of CDs would turn up, and the test server would spend its next few days chugging away rebuilding all the indexing.

    more...
  • Glimpse Plugins

    I recently spoke at the Norwich Developers meetup about Glimpse and how easy it is to write plugins for it.

    more...
  • Bath Node Copter 2013

    All the code on GitHub

    more...
  • Injecting dependencies into a JPA EntityListener

    It is not possible to inject Spring-managed beans into a JPA EntityListener class. This is because the JPA listener mechanism is designed around a stateless class, so the methods are effectively static and non-context-aware.

    In Hibernate at least this is not strictly true: if you were to record an entity in a field of the listener in one method (and you shouldn't), it's still sat there the next time the method comes along.

    more...
  • OS X for the Coding. Linux for deploying.

    I just came across an excellent article on setting up Xcode, and a Mac environment generally, to point at an SSHFS disk within a VMware Fusion box. All this was to the end of using a nice interface to code against a nice clean Unix platform; laudable goals indeed.

    more...
  • Cloud security and management portals

    There’s a lot of talk these days about cloud security. The usual, sensible, and frankly right answer is: don’t be silly. With the right combination of VLANs, firewalls, proper machine isolation in the hypervisor, and all the normal things you should be doing with a server anyway (hardening, patching, and those other things sysadmins like spending their evenings and weekends on)… yeah, all that stuff. We’re covered (well, close enough).

    more...
  • Physical security matters!

    Here is an account of some interesting experiments in using very sensitive microphones or voltage meters to crack RSA.

    more...
  • Advice for AJAX programmers, the partners they love, and the browser they hate

    I just came across this fantastic article, which has a number of things to contribute. Firstly, how to get that crock-of-a-browser IE to work.

    Secondly, it has some lovely advice for partners on how to deal with us web developers when we’re not going to bed until this works.

    more...
  • How I learned to let go, and got back on the rails

    Many, many years ago I produced a website in Rails. It was OK. I certainly didn’t totally hate it. That was back in the very early days of Rails 1. The website did ticketing for a one-off event, and it ran quite nicely once I’d fought a bit with Apache. Once the event was over and the hangover cleared (I got paid in tickets), I put Rails back in a box marked ‘for hipsters and web-designers’ and went quietly on my way back to PHP, Java, and all that lot.

    more...
  • Ruby on Rails and ExtJS 4 Data Model

    I’ve been playing around a lot with Rails 3 recently, and since I’ve spent many years working with the ExtJS platform from Sencha, and spend most of my working day with it, I thought it was about time I looked at combining the two. With the release of ExtJS 4, there are some extremely cool data and client-side model tools.

    more...
  • Rails conf 2011

    Streaming it live is great. I’m quite impressed at how well I can watch a conference in Baltimore from the corner of a desk in London and actually feel like it’s working.

    more...
  • Quick PhoneGap build gotcha

    It seems I will be blogging a lot more about mobile development, having now ended up becoming an overnight iOS and Android developer. I’m loving both platforms, but for the simple stuff I’m also loving PhoneGap, the HTML-and-JavaScript-with-a-library framework that targets both.

    more...
  • Cambridge Startup Weekend Winners

    So we’ve finally made it to the end of an intense weekend. There were some absolutely fantastic ideas, and some excellent presentations at the end of it all. With many very deserving winners, in all sorts of categories, I’m very pleased to announce that my team managed to scoop Best Healthcare Application.

    Now to sleep.

    more...
  • Cambridge Startup Weekend Pitching Begins

    I’ve blogged before about the Cambridge Startup Weekend, and am pleased to see that the organisers have done a fantastic PR job and sold out completely. One of the most interesting pieces of social PR work I saw in the run-up was a series of Twitter competitions to pitch, or anti-pitch, ideas and win a ticket. This produced some fun, some sarcasm, and some great ideas in assorted measure.

    more...
  • Cambridge Startup Weekend

    So, Startup Weekend is coming to Cambridge and I have to say that I am really looking forward to it. There is something about the idea of spending 54 hours making a new business happen which takes me back to my early days, my younger days, and even my student Varsity days. I have to say I’m curious to see if the intensity still works as well as it always did when press deadline, or just a project timescale kept me up all night.

    more...
  • ISIN Validation with Javascript

    As part of a recent integration project, I found that I needed to use ISIN codes as a common key. ISINs are an internationally recognised way of labelling exchange-listed securities. As with many things, my first call for an overview was Wikipedia, although Investopedia and the Motley Fool are fantastic sites for finding out more about finance industry concepts.

    more...
