
Working with Multiple Cloud Providers – Part 2 – Getting Data Into BigQuery

In this post, I described how we might attempt to help Santa and his delivery drivers to deliver presents to every child in the world, using the combined power of Google and Microsoft.

In this, the second part of the series (there will be one more), I’m going to describe how we might set up a GCP pipeline that feeds that data into BigQuery (Google’s big data warehouse offering). We’ll first set up BigQuery, then the PubSub topic, and finally the Dataflow job, ready for Part 3, which will join the two systems together.

BigQuery

Once you navigate to the BigQuery section of the GCP console, you’ll be able to create a Dataset:

You can now set up a new table. As this is an illustration, we’ll keep it as simple as possible, but you can see that this might be much more complex:

One thing to bear in mind about BigQuery, and cloud data storage in general, is that it often makes sense to de-normalise your data – storage is often much cheaper than CPU time.
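Incidentally, if you’d rather script this than click through the console, the dataset and table can also be created from C#. Here’s a minimal sketch using the Google.Cloud.BigQuery.V2 NuGet package – the project, dataset, table and field names are just my illustrative ones:

using Google.Cloud.BigQuery.V2;

public static class CreateSantaTable
{
    public static void Run()
    {
        // Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key,
        // and that "test-project-123456" is your GCP project id (illustrative).
        BigQueryClient client = BigQueryClient.Create("test-project-123456");

        // Create the dataset if it doesn't already exist.
        BigQueryDataset dataset = client.GetOrCreateDataset("santa");

        // Keep the schema deliberately simple, mirroring the Azure table in Part 1.
        var schema = new TableSchemaBuilder
        {
            { "childName", BigQueryDbType.String },
            { "present", BigQueryDbType.String }
        }.Build();

        dataset.GetOrCreateTable("deliveries", schema);
    }
}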

PubSub

Now that we have somewhere to put the data, we could simply have the Azure function write the data straight into BigQuery. However, we might then run into problems if the data flow suddenly spiked. For this reason, Google recommends the use of PubSub as a shock absorber.

Let’s create a PubSub topic. I’ve written about this in more detail here:
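If you’d rather not use the console or the Cloud Shell, the topic can also be created programmatically. Here’s a minimal sketch using the Google.Cloud.PubSub.V1 NuGet package that features in the Pub/Sub guide further down this page – the project and topic names are illustrative:

using Google.Cloud.PubSub.V1;

public static class CreateTopic
{
    public static void Run()
    {
        // Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key.
        PublisherClient publisher = PublisherClient.Create();

        // "test-project-123456" and "santa-deliveries" are illustrative names.
        publisher.CreateTopic(new TopicName("test-project-123456", "santa-deliveries"));
    }
}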

DataFlow

The last piece of the jigsaw is Dataflow. Dataflow can be used for much more complex tasks than simply taking data from one place and putting it in another, but in this case, that’s all we need. Before we can set up a new Dataflow job, we’ll need to create a storage bucket:

We’ll create the bucket as Regional for now:

Remember that the bucket name must be globally unique (so no-one can ever pick pcm-data-flow-bucket again!).

Now, we’ll move on to the Dataflow job itself. We get a number of Dataflow templates out of the box, and we’ll use one of those. Let’s launch Dataflow from the console:

Here we create a new Dataflow job:

We’ll pick “PubSub to BigQuery”:

You’ll then be asked for the name of the topic (which was created earlier) and the storage bucket (again, created earlier); your form should look broadly like this when you’re done:

I strongly recommend specifying a maximum number of workers, at least while you’re testing.

Testing

Finally, we’ll test it. PubSub allows you to publish a message:

Next, visit the Dataflow job to see what’s happening:

Looks interesting! Finally, in BigQuery, we can see the data:
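If you’d rather check from code than from the console, a quick query via the same Google.Cloud.BigQuery.V2 package will do the same job – again, the project, dataset and table names are just my illustrative ones:

using System;
using Google.Cloud.BigQuery.V2;

public static class CheckDeliveries
{
    public static void Run()
    {
        BigQueryClient client = BigQueryClient.Create("test-project-123456");

        // Standard SQL query against the illustrative santa.deliveries table.
        var results = client.ExecuteQuery(
            "SELECT childName, present FROM santa.deliveries LIMIT 10",
            parameters: null);

        foreach (var row in results)
        {
            Console.WriteLine($"{row["childName"]} - {row["present"]}");
        }
    }
}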

Summary

We now have the two separate cloud systems functioning independently. Step three will be to join them together.

Working with Multiple Cloud Providers – Part 1 – Azure Function

Regular readers (if there are such things for this blog) may have noticed that I’ve recently been writing a lot about two main cloud providers. I won’t link to all the articles, but if you’re interested, a quick search for either Azure or Google Cloud Platform will yield several results.

Since it’s Christmas, I thought I’d do something a bit different and try to combine them. This isn’t completely frivolous; both have advantages and disadvantages: GCP is very geared towards big data, whereas the Azure Service Fabric provides a lot of functionality that might fit well with a much smaller LOB app.

So, what if we had the following scenario:

Santa has to deliver presents to every child in the world in one night. Santa is only one man* and Google tells me there are 1.9B children in the world, so he contracts out a series of delivery drivers. That means around 79M deliveries every hour, assuming each delivery driver can work 24 hours**. If each driver can manage, say, 100 deliveries per hour, we need around 790,000 drivers. Every delivery driver has an app that links to their depot, recording deliveries, schedules, etc.

That would be a good app to write in, say, Xamarin, and maybe have an Azure service running it; here’s the obligatory box diagram:

The service might talk to the service bus, control stock, send e-mails – all kinds of LOB jobs. Now, I’m not saying for a second that Azure can’t cope with this, but what if we suddenly want all of these instances to feed metrics into a single data store? There are 190*** countries in the world; if each has a depot, then there are ~416K messages / hour going into each Azure service, but 79M / hour going into a single DB. Because it’s Christmas, let’s assume that Azure can’t cope with this, or that GCP is a little cheaper at this scale, or that we have some Hadoop jobs that we’d like to run on the data. In theory, we can link these systems, which might look something like this:

So, we have multiple instances of the Azure architecture, and they all feed into a single GCP service.
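For anyone who wants to check my arithmetic, here’s the back-of-envelope version of the numbers above:

using System;

public static class SantaMaths
{
    public static void Main()
    {
        // Back-of-envelope figures from the scenario above.
        double children = 1_900_000_000;            // ~1.9B children
        double hoursAvailable = 24;                 // the brandy really does help
        double deliveriesPerDriverPerHour = 100;
        double depots = 190;                        // one per country

        double deliveriesPerHour = children / hoursAvailable;                  // ~79M
        double driversNeeded = deliveriesPerHour / deliveriesPerDriverPerHour; // ~790K
        double messagesPerDepotPerHour = deliveriesPerHour / depots;           // ~416K

        Console.WriteLine($"{deliveriesPerHour:N0} deliveries / hour, " +
                          $"{driversNeeded:N0} drivers, " +
                          $"{messagesPerDepotPerHour:N0} messages / hour / depot");
    }
}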

Disclaimer

At no point during this post will I attempt to publish 79M records / hour to GCP BigQuery. Neither will any Xamarin code be written or demonstrated – you have to use your imagination for that bit.

Proof of Concept

Given the disclaimer I’ve just made, calling this a proof of concept seems a little disingenuous; but let’s imagine that we know that the volumes aren’t a problem and concentrate on how to link these together.

Azure Service

Let’s start with the Azure service. We’ll create an Azure function that accepts an HTTP message, updates a DB, and then posts a message to Google PubSub.

Storage

For the purpose of this post, let’s store our individual instance data in Azure Table Storage. I might come back at a later date and work out how and whether it would make sense to use CosmosDB instead.

We’ll set up a new table called Delivery:
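If you’d rather create the table from code than through the portal or Storage Explorer, a quick sketch using the WindowsAzure.Storage package might look like this – the connection string is whatever points at your own storage account:

using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class CreateDeliveryTable
{
    public static async Task RunAsync(string connectionString)
    {
        // Parse the storage connection string (e.g. the one from local.settings.json).
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);

        // Get a table client and create the Delivery table if it doesn't already exist.
        CloudTableClient tableClient = account.CreateCloudTableClient();
        CloudTable table = tableClient.GetTableReference("Delivery");
        await table.CreateIfNotExistsAsync();
    }
}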

Azure Function

Now that we have somewhere to store the data, let’s create an Azure Function App that updates it. In this example, we’ll create a new Function App from VS:

In order to test this locally, change local.settings.json to point to your storage location described above.
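For reference, mine looked broadly like this – the santa_azure_table_storage key matches the Connection name used by the table binding in the code below; swap the connection strings for your own (or for the storage emulator):

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "AzureWebJobsDashboard": "UseDevelopmentStorage=true",
    "santa_azure_table_storage": "DefaultEndpointsProtocol=https;AccountName=<your-account>;AccountKey=<your-key>"
  }
}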

And here’s the code to update the table:

    public static class DeliveryComplete
    {
        [FunctionName("DeliveryComplete")]
        public static HttpResponseMessage Run(
            [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)]HttpRequestMessage req, 
            TraceWriter log,
            // Output binding: anything added to this collector is written to the
            // Delivery table, using the connection string named in app settings.
            [Table("Delivery", Connection = "santa_azure_table_storage")] ICollector<TableItem> outputTable)
        {
            log.Info("C# HTTP trigger function processed a request.");

            // Parse the query parameters.
            string childName = req.GetQueryNameValuePairs()
                .FirstOrDefault(q => string.Compare(q.Key, "childName", true) == 0)
                .Value;

            string present = req.GetQueryNameValuePairs()
                .FirstOrDefault(q => string.Compare(q.Key, "present", true) == 0)
                .Value;

            if (string.IsNullOrEmpty(childName) || string.IsNullOrEmpty(present))
            {
                return req.CreateResponse(HttpStatusCode.BadRequest,
                    "Please pass childName and present on the query string");
            }

            var item = new TableItem()
            {
                childName = childName,
                present = present,
                // For this illustration, key on the child's name, partitioned by initial.
                RowKey = childName,
                PartitionKey = childName.First().ToString()
            };

            outputTable.Add(item);

            return req.CreateResponse(HttpStatusCode.OK);
        }

        public class TableItem : TableEntity
        {
            public string childName { get; set; }
            public string present { get; set; }
        }
    }

Testing

There are two ways to test this. The first is to just press F5: that will launch the function as a local service, and you can use Postman or similar to test it. The alternative is to deploy to the cloud; if you choose the latter, then your local.settings.json will not come with you, so you’ll need to add an app setting:

Remember to save this setting; otherwise, you’ll get an error saying that it can’t find your setting, and you won’t be able to work out why – ask me how I know!
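If Postman isn’t your thing, a quick console app will exercise the local function just as well; here’s a sketch, assuming the default local port of 7071 – the child and present values are just examples (against the deployed version you’d use the Azure URL and function key instead):

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class TestDeliveryComplete
{
    public static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Call the locally hosted function with the two query parameters it expects.
            var response = await client.PostAsync(
                "http://localhost:7071/api/DeliveryComplete?childName=Gabriel&present=Lego",
                content: null);

            Console.WriteLine(response.StatusCode);
        }
    }
}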

Now, if you run a test …

You should be able to see your table updated (shown here using Storage Explorer):

Summary

We now have a working Azure function that updates a storage table with some basic information. In the next post, we’ll create a GCP service that pipes all this information into BigQuery, and then link the two systems.

Footnotes

* Remember, all the guys in Santa suits are just helpers.
** That brandy you leave out really hits the spot!
*** I just Googled this – it seems a bit low to me, too.


A C# Programmer’s Guide to Google Cloud Pub Sub Messaging

The Google Cloud Platform provides a Publish / Subscribe system called ‘PubSub’. In this post I wrote a basic guide on setting up RabbitMQ, and here I wrote about ActiveMQ. In this post I wrote about using the Azure messaging system. Here, I’m going to give an introduction to using the GCP PubSub system.

Introduction

The above systems that I’ve written about in the past are fully featured (yes, including Azure) message bus systems. While the GCP offering is a Message Bus system of sorts, it is definitely lacking some of the features of the other platforms. I suppose this stems from the fact that, in the GCP case, it serves a specific purpose, and is heavily geared toward that purpose.

Other messaging systems do offer the Pub / Sub model. The idea is that you create a topic, and anyone that’s interested can subscribe to it. Once you’ve subscribed, you’re guaranteed* to get at least one delivery of each published message. You can also, kind of, simulate a message queue, because more than one subscriber can take messages from a single subscription.

Pre-requisites

If you want to follow along with the post, you’ll need to have a GCP subscription, and a GCP project configured.

Topics

In order to set up a new topic, we’re going to navigate to the PubSub menu in the console (you may be prompted to Enable PubSub when you arrive).

As you can see, you’re inundated with choice here. Let’s go for “Create a topic”:

Cloud Shell

You’ve now created a topic; however, that isn’t the only way to do it. Google are big on using the Cloud Shell, so you can also create a topic from there; in order to do so, select the Cloud Shell icon:

Once you get the cloud shell, you can use the following command**:

gcloud beta pubsub topics create "test"

Subscriptions and Publishing

You can publish a message now if you like; either from the console:

Or from the Cloud Shell:

gcloud beta pubsub topics publish "test" "message"

Both will successfully publish a message that will get delivered to all subscribers. The problem is that you haven’t created any subscribers yet, so it just dissipates into the ether***.

You can see there are no subscriptions, because the console tells you****:

Let’s create one:

Again, you can create a subscription from the cloud shell:

gcloud beta pubsub subscriptions create "mysubscription" --topic "test"

So, we now have a subscription, and a message.

Consuming messages

In order to consume messages in this instance, let’s create a little cloud function. I’ve previously written about creating these here. Instead of creating an HTTP trigger, this time we’re going to create a function that reacts to something on a Cloud Pub/Sub topic:

Select the relevant topic; the default code just writes the message text out to the console, so that’ll do:

/**
 * Triggered from a message on a Cloud Pub/Sub topic.
 *
 * @param {!Object} event The Cloud Functions event.
 * @param {!Function} callback The callback function.
 */
exports.subscribe = function subscribe(event, callback) {
  // The Cloud Pub/Sub Message object.
  const pubsubMessage = event.data;

  // We're just going to log the message to prove that
  // it worked.
  console.log(Buffer.from(pubsubMessage.data, 'base64').toString());

  // Don't forget to call the callback.
  callback();
};

So, now we have a subscription:

Let’s see what happens when we artificially push a message to it.

If we now have a look at the Cloud Function, we can see that something has happened:

And if we select “View Logs”, we can see what:

It worked! Next…

Create Console App

Now that we have something that will react to a message, let’s try to generate one programmatically, in C#, from a console app. Obviously, the first thing to do is to install a NuGet package that isn’t yet past the beta stage:

Install-Package Google.Cloud.PubSub.V1 -Pre

Credentials

In this post I described how you might create a credentials file. You’ll need to do that again here (and, I think anywhere that you want to access GCP from outside of the cloud).

In APIs & Services, select “Create credentials”:

Again, select a JSON file:

The following code publishes a message to the topic:

static async Task Main(string[] args)
{
    // Point the client libraries at the service account key downloaded above.
    Environment.SetEnvironmentVariable(
        "GOOGLE_APPLICATION_CREDENTIALS",
        Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "my-credentials-file.json"));
 
    // Log gRPC output to the console, which helps when diagnosing permission issues.
    GrpcEnvironment.SetLogger(new ConsoleLogger());
 
    string projectId = "test-project-123456";
    var topicName = new TopicName(projectId, "test");
 
    // SimplePublisher handles batching and sending messages to the topic for us.
    SimplePublisher simplePublisher = await SimplePublisher.CreateAsync(topicName);
    string messageId = await simplePublisher.PublishAsync("test message");
    Console.WriteLine($"Published message {messageId}");

    await simplePublisher.ShutdownAsync(TimeSpan.FromSeconds(15));
}

And we can see that message in the logs of the cloud function:
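Incidentally, you don’t have to consume the messages with a cloud function: the same beta NuGet package can subscribe directly from C# as well. A rough sketch, assuming the same credentials set-up and the subscription created earlier:

using System;
using System.Threading;
using System.Threading.Tasks;
using Google.Cloud.PubSub.V1;

public static class ConsoleConsumer
{
    public static async Task Main()
    {
        var subscriptionName = new SubscriptionName("test-project-123456", "mysubscription");

        // SimpleSubscriber handles the pull / ack loop for us.
        SimpleSubscriber subscriber = await SimpleSubscriber.CreateAsync(subscriptionName);

        Task started = subscriber.StartAsync((PubsubMessage message, CancellationToken cancel) =>
        {
            // Just log the message body, as the cloud function does.
            Console.WriteLine(message.Data.ToStringUtf8());
            return Task.FromResult(SimpleSubscriber.Reply.Ack);
        });

        // Listen for a little while, then shut down cleanly.
        await Task.Delay(TimeSpan.FromSeconds(30));
        await subscriber.StopAsync(CancellationToken.None);
        await started;
    }
}

Run the publisher again while this is listening and you should see the message logged in both places.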

Permissions

Unless you choose otherwise, the service account will look something like this:

The Editor permission that it gets by default is a sort of God permission. This can be fine-grained by removing it and selecting specific permissions instead; in this case, Pub/Sub -> Publisher. It’s worth bearing in mind that as soon as you remove all permissions, the account is removed, so try to maintain at least one permission (Project Browser seems suitably innocuous).

Footnotes

* Google keeps messages for up to 7 days, so the guarantee has a time limit.

** gcloud may need to be initialised. If it does then:

gcloud init
gcloud components install beta

*** This is a big limitation. Whilst topic subscriptions in other systems do work like this, those systems also give you the option of a queue – i.e. a place for messages to live when no-one is listening for them.

**** If you create a subscription in the Cloud Shell, it will not show in the console until you F5 (there may be a timeout, but I didn’t wait that long). The problem here is that F5 messes up the shell window.
