AWS Re:Invent 2018 – Recap

In November 2018 I attended my first AWS Re:Invent convention in Vegas. It was an amazing experience, which I highly recommend to anyone working with AWS (don't we all?).

The sheer size of the convention (50K people!), the volume of sessions and products, and above all the amazing diversity of occupations and fields people came from were mind-blowing.

Following is a short recap of the lessons I learned my first time at AWS Re:Invent – from registration to what not to miss (and what you can feel free to miss) at the event, and how to survive it.

Question 1: How much does it cost?

The registration fee is $1,800, and staying in a nice hotel for 6 nights was ~$800. In addition, you have 6 days of not getting any work done, plus flights.

Question 2: Is it worth it?

For me, as a DevOps/Developer with our entire fleet hosted on AWS – totally. But I must say that the most valuable things I learned at Re:Invent were not so much the AWS services tutorials/announcements, but the sessions where professionals from around the world shared their experiences with moving/building/expanding to AWS infrastructure. If you want to convince your supervisor why it's good, AWS even has a ready-made justification letter 🙂

Question 3: OK, I'm in. Should I register for sessions in advance? How? Where?

So, registration for sessions is really the weak side of the Re:Invent convention. The website looks like a relic from the 1990s, searching is very hard and unintuitive, and the calendar option only appeared a week after registration opened – seriously, terrible.

Once you get over your shock that this is the entry point to the largest developer convention I know of, a few tips:

  • Most big sessions are held more than once, so you can find them on other days/venues
  • A lot of sessions are broadcast live in different venues (and even in the same venue) – so if you couldn't find a seat, you still have a chance to see it.
  • The system won't let you schedule 2 sessions less than 30 minutes apart if they are in different venues – take that into consideration.
  • Registration for sessions fills up fast. I had all my desired sessions open in different tabs, and the moment registration opened I clicked "Register" on each one of them – and still didn't get a seat in some.
  • New sessions and additional screenings are added all the time during the convention, and people replace and free their seats. Keep your “favourite” lists and check daily if something interesting has opened up.
  • Session levels – anything lower than 300 is very basic. Only go if it's something totally new for you / you're new to AWS
  • Session types:
    • Workshops – vary significantly in their value: some of them are really good, but in most you're just following a GitHub-hosted tutorial while 2 AWS personnel go around assisting with technical issues. I must say most of the workshops weren't very valuable to me
    • Chalk talks – Most chalk talks I've been to had 2 very experienced engineers sharing their experiences on various topics. These were some of my best sessions.

Question 4: What to see?

Re:Invent really has a lot of extracurricular activities (bar crawl, races, 4k runs, the expo, and so forth). I admit I haven't attended most of them – I arrived late Sunday night and had a 10-hour time difference to get over, so most nights I was in a zombie state, and I'm not a very good networker. If you are (and you're not jet-lagged to death) – go!

The Expo: I guess you've heard all the urban legends of the wonderful land of the expo, where swag is abundant and freely given. Well, it's true, a lot of things are freely given, but you will have to stand in line for hours for some tech-labeled socks. For the really valuable things, you'd have to compete with people, register to listen to some sales pitch you're not interested in, and generally waste your time. My recommendation – skip it. If you have a free hour at the Venetian, go have a look – but trust me, no need to plan your schedule around it.

The Quad, however, is waaaay more interesting. You get a chance to play with and build robotic Legos and other things!

re:Play Party: I've only been to one, but I must say it wasn't that impressive. I mean, go – it's already paid for in your ticket, but let's say I had no remorse about leaving early…

Question 5: What to wear? Where to eat? How to get around?

Unless you're presenting something – sneakers, jeans, and a t-shirt. Get a light jacket for the over-conditioned lecture halls and the rides between venues, but most of the time the temperature is really office-like. (You'll spend most of your time indoors anyway.)

The food halls are enormous, but the food is really good – they always have gluten-free choices, btw! – and food and drinks are abundant, to the point where you get a snack when getting off the shuttles. I haven't been to any of the breakfasts, only lunches, but I guess the standard is the same. Basically, you'll only have to pay for dinner yourself.

Getting around the venues is extremely easy with the shuttles. Before I arrived I heard from a lot of people that in previous years the shuttles were really bad, and that I should rely on Uber for mobility – but at least this year, I can attest that the shuttles were frequent, fast and convenient.

General Tips

  • DO NOT buy coffee at Starbucks. They have (good) coffee/tea/soda stands everywhere around the lecture halls. Save your money and time (the queues are infinite)
  • Constantly fill your water bottle (they have refill stands everywhere)
  • Carry a lip balm on your person. Vegas is dry as hell.
  • There are power outlets literally everywhere, and the wifi was surprisingly good.

How to handle runtime exceptions in Kafka Streams

We've been using Kafka Streams (1.1.0, Java) as the backbone of our μ-services architecture. We've switched to Streams mainly because we wanted the exactly-once processing guarantee.
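For context, this is more or less how that guarantee is switched on when building the streams client – a minimal sketch, where the application id, bootstrap servers and topology are placeholders for our own:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// the application id also serves as the consumer group id (placeholder name)
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-service");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
// the flag that buys us the exactly-once processing guarantee
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

// assuming `topology` was built with a StreamsBuilder elsewhere
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();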

Lately we've been having several runtime exceptions that killed the entire Streams library and, with it, our μ-service.

So the main question was – is this the way to go? After some back and forth, we realized that the best way to answer this was by checking:

  1. What would Kafka do?
  2. Do we still keep our exactly-once promise?

What does Kafka do?

This document describes the Kafka Streams architecture design. After the description of the StreamThread, StreamTask and StandbyTask, there's a discussion about exception handling, the gist of which is as follows:

First, we can distinguish between recoverable and fatal exceptions. Recoverable exception should be handled internally and never bubble out to the user. For fatal exceptions, Kafka Streams is doomed to fail and cannot start/continue to process data. […] We should never try to handle any fatal exceptions but clean up and shutdown

So, if Kafka threw a Throwable at us, it basically means that the library is doomed to fail, and won’t be able to process data. In our case, since the entire app is built around Kafka, this means killing the entire μ-service, and letting the deployment mechanism just re-deploy another one automatically.

Do we still keep our exactly-once promise?

Now we’re faced with the question whether or not this behaviour might harm our hard-earned exactly-once guarantee.
To answer that question, we first need to understand when the exactly-once is applicable.

exactly-once is applicable from the moment we're inside the stream – meaning, our message has arrived at the first topic, T1. So everything that happens before that is irrelevant: the producer who pushed the message to T1 in the first place could have failed just before sending it, and the message would never arrive (so not even at-least-once holds) – this is something we probably need to handle, but it doesn't have anything to do with streams.

Now, let’s say our message, M, is already inside topic T1. Hooray!

Now we can fail either before reading it, while processing it, or after pushing it.

  • If we failed before reading it, we’re fine. The μ-service will go up again, will use the same appId, and we’ll read the message.
  • If we read it and failed before we even started processing it, we’ll never send the offset commit, so again, we’re fine.
  • If we failed during processing it, again, we’ll never reach the point of updating the offsets (because we commit the processed message together with the consumer offset – so if one didn’t happen, neither did the other)
  • If we failed after sending it – again, we’re fine: even if we didn’t get the ack, both the consumer offset and the new transformed/processed message are out.

Uncaught Exception Handlers

Kafka Streams has 2 (non-overlapping) ways to handle uncaught exceptions:

  • KafkaStreams::setUncaughtExceptionHandler – this function allows you to register an uncaught exception handler, but it will not prevent the stream from dying: it's only there to allow you to add behaviour to your app in case such an exception indeed happens. This is a good way to inform the rest of your app that it needs to shut itself down / send a message somewhere (see the sketch after this list).

  • ProductionExceptionHandler – You can implement this interface and add it via the properties: StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG – but in this case you will need to decide whether the stream can keep going or not. I feel this requires a very deep understanding of the streams' internals, and I'm still not sure when exactly you would want that.
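To make the first option concrete, here's roughly the pattern we ended up with – a sketch, assuming a standard logger and that killing the process is what you want:

streams.setUncaughtExceptionHandler((thread, throwable) -> {
    // nothing to salvage at this point - log, notify whoever needs notifying,
    // and take the whole μ-service down
    logger.error("Kafka Streams died with an uncaught exception", throwable);
    // exit from a separate thread, so we don't block the dying stream thread
    // while the JVM runs its shutdown hooks
    new Thread(() -> System.exit(1)).start();
});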

Conclusion

For us, using k8s deployments, with several pods of each service being automatically scaled all the time, the best way to handle runtime/unchecked exceptions is to make sure our app goes down together with the Kafka Streams library (using KafkaStreams::setUncaughtExceptionHandler), and to let the deployment service take care of starting the app again.

Node JS for backend programmers

I started working in a new and exciting start-up, and the thought of using Java for development crushed my soul, so I began shopping around for a fast-development, fast-deployment, easy-ramp-up and heavily supported language. I wanted something that would be easy for new developers to learn, and most importantly, something that I'd have fun writing in.

NodeJS was definitely not my first (or my second, for that matter) option: I'm a backend kind of guy. I've done some web, but I would hardly claim expertise in web and UI. When I first heard of NodeJS, which was described to me as "Javascript for backend", I was hardly excited about it. I'd had some experience with JS, and that experience hardly made me want to keep using it, let alone write my backend with it — it had none of the order or the elegance of what I used to think of as "real languages".

However, since the name NodeJS kept coming up from people and colleagues whose opinions I really appreciate, I’ve decided I can’t keep ignoring it and I have to look into it before I dismiss it.

To make a long story short, I now use NodeJS as part of the production environment, and I find it flexible, easy to deploy and very well supported by the community. However, it doesn't fit everything and everyone — especially if you have a CPU-intensive app.

Bottom line — it's a very efficient functional language with non-blocking I/O that runs on a single thread, on a single CPU (but you can start several apps and communicate between them), and with a large and thriving community.

So, what is NodeJS, anyway?

Basically, NodeJS is a framework designed around an event-driven, non-blocking I/O model, based on JavaScript with I/O libraries written in C++. There are 2 main reasons why JavaScript was chosen as the programming language of NodeJS:

  • Google's V8 engine (the JavaScript interpreter in Chrome)
  • The fact that it had no I/O modules, so the I/O could be implemented anew to support the goal of non-blocking I/O.

The fact that it's running on Google's V8 also ensures that as long as Google keeps improving Chrome's performance, the NodeJS platform will keep advancing as well.

What does it do, and how?

Well, from my POV, the main deal with Node is that everything is non-blocking, except your code. What does that mean? It means that every call to an I/O resource (disk, DB, web resource or what have you) is non-blocking. How can it be non-blocking, you ask? Everything is based on callbacks.

So, for example, if you're working with MongoDB, this is how a query looks:

var onDBQueryReturn = function(err, results) {
    console.log("Found " + JSON.stringify(results) + " users");
};

console.log("Calling query");
db.usersCollection.find({'_id': "1234"}, onDBQueryReturn);
console.log("Called query");

The output will be:

Calling query
Called query
Found {"_id": "1234", "name": "User1"} users

Now, this might make perfect sense for people who are experienced with functional programming, but it's kind of a strange behaviour for people accustomed to Java, for example. In Node (much like in other functional programming languages), a function is a first-class object, just like a string — so it can easily be passed as an argument. What happens here is that when the call to the DB is completed, the callback function is called and executed.

The main thing to remember is that NodeJS runs on a single CPU. Because all its I/O-intensive calls are non-blocking, it can gain high efficiency managing everything else on a single CPU, because it never hangs waiting on I/O.

What is it like to write a backend app with NodeJS?

In one word — easy. In several words — easy until you get to parallelism.

Once you get the hang of the weird scoping issues of JavaScript, Node can be treated like any other language, with a non-customary-yet-powerful I/O mechanism. Multithreading, however, is not as simple as in other languages — but keep in mind that the concept of non-blocking I/O makes the need for multithreading much less pressing than you might think.

In addition, Node has a very thriving community — which means that everything you could possibly want has already been developed, tried and honed. You can use npm, the Node Package Manager (Node's equivalent of the maven repo, I would say), for easy access to everything. One of the most interesting pages is the Most depended upon modules.

Multithreading in Node

There is nothing parallel in Node. It might look like a major quirk, but if you think about it, given that there's no I/O blocking, the decision to make it single-threaded actually makes everything a lot easier — you don't really need multithreading for the kind of apps Node was designed for if you don't hang on I/O, and it relieves you from the need to think about parallel access to variables and other objects.

In spite of all the above, I feel strange when I can't run things in parallel, and sometimes it is kind of limiting, which is why Node comes with cluster capabilities — i.e., running several Node processes and communicating between them over sockets — but it's still experimental. Other options are fork(), exec() and spawn(), which are all different ways of spawning a ChildProcess.
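For a taste of the cluster module, here's the canonical example, adapted from the official docs – the master forks a worker per CPU, and the workers all share the same port:

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
    // the master does no real work - it only forks the workers
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }
} else {
    // each worker is a separate process with its own event loop,
    // and they all share port 8000
    http.createServer(function(req, res) {
        res.end('Handled by worker ' + cluster.worker.id);
    }).listen(8000);
}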

Should I use it or not?

The short answer is, as always, it depends.

If you feel at home with functional programming and you're running an I/O-intensive app — which is the classic use case for a website — then by all means, do. The community is very vibrant and up to date, and deployment is a breeze (especially with hosting services like Nodejitsu and codeship.io).

If you're running a CPU-intensive application, if you don't trust dynamic typing, or if you don't like functional programming — don't use it. Nice as it is, I don't think it offers anything that couldn't be achieved using Ruby or Python (or even Scala, for that matter).

A few last tips

(in no special order)

IDE

IDEs are a kind of religion, but I've been using WebStorm and I like it a lot. It has a 30-day trial and costs $50 for a personal licence (or $99 for a commercial one), and I think they even provide it for free for open-source projects.

Testing

It's very easy to make mistakes, especially in a dynamic language. Familiarize yourself with Mocha unit testing, and integrate it into your project from day 1.
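For reference, a minimal Mocha test looks like this (essentially the classic example from their docs):

var assert = require('assert');

describe('Array', function() {
    describe('#indexOf()', function() {
        // mocha reads almost like plain English: describe a unit, assert on it
        it('should return -1 when the value is not present', function() {
            assert.equal([1, 2, 3].indexOf(4), -1);
        });
    });
});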

Coordinating tasks

Sometimes you want to run several processes one after another, or you might want to run a function after everything is done, and so forth. There are several packages for that exact purpose, but the most widely used, and my personal favourite, is Async.
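As a quick illustration, here's a sketch of async.series, which runs the tasks one after the other and hands the collected results to a final callback:

var async = require('async');

async.series([
    function(callback) {
        // first task - completes before the next one starts
        callback(null, 'one');
    },
    function(callback) {
        callback(null, 'two');
    }
], function(err, results) {
    // results is now ['one', 'two']
    console.log(results);
});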

More resources

Understanding the node.js event loop
Multiple processes in nodejs and sharing ports
Multithreaded Node.JS

AWS Redundant Docker logs

Our AWS ElasticBeanstalk-deployed services run on Docker.

Our dockerrun.aws.json defines both the log path and the log config:

"LogPath": "/var/lib/docker/containers/<containerId>/<containerId>-json.log",
"LogConfig": {
        "Type": "json-file",
        "Config": {
            "max-file": "4",
            "max-size": "250m"
        }
    },
}

And indeed, the log files are there, and they rotate according to size (250m) with a max of 4 files.

However, we have more log files, and it's not totally clear who is writing them. They are found at /var/log/containers/ – and they are causing us trouble, because they can grow to huge sizes and choke up the host.

These are not the files that are read when running docker logs <containerId>, because I've deleted them and docker logs still works. When I delete the files at /var/lib/docker/containers/, on the other hand, docker logs returns empty.

Apple Provisioning Hell

We're developing an Ionic app for iOS and Android, and handling the Apple provisioning/certificates/APN profiles is hell. Just pure hell.

One of the most annoying messages is "Your account already has a valid iOS distribution certificate". If my account already has a valid iOS dist. cert, why won't you just get it and build the freaking thing?

After wasting craploads of time on this nonsense, I finally found Sigh. They had me at the first sentence:

Because you would rather spend your time building stuff than fighting provisioning

Just install it: sudo gem install sigh

And then run it: sigh

And that's it. It will get, fix, download and install all the provisioning profiles, and you'll save yourself hours of tweaking around with Apple's weird licensing logic.

Sigh is also part of an open-source toolbox called fastlane, which aims to facilitate iOS development, testing, installation and deployment.

Monitoring CloudWatch statistics using Grafana, InfluxDB and Telegraf

We've started checking out monitoring solutions for our AWS-based infrastructure, and we want it to be not-that-expensive and to monitor both infrastructure (CPU, I/O, network…) and application statistics.

We've looked into several options, and we're currently narrowing it down to Grafana-InfluxDB-Telegraf.

The idea is as follows: use Telegraf to pull CloudWatch statistics from Amazon, save them into InfluxDB, and use Grafana to present these statistics and manage alerts and notifications.

(Why not just use the Grafana CloudWatch plugin? Because it doesn't support notifications, sadly.)

Set up the environment

To test everything, we’ve set up a docker env:

Create a network

docker network create monitoring

The Grafana docker

docker run -d -p 3000:3000 --name grafana --net=monitoring -v $PWD:/var/lib/grafana -e "GF_SECURITY_ADMIN_PASSWORD=secret" grafana/grafana

The Influx docker

docker run -p 8086:8086 -d --name influxdb --net=monitoring -v $PWD:/var/lib/influxdb influxdb

Important! Add 127.0.0.1 influxdb to your hosts file (see Sanities for further explanation)

The Kapacitor docker

We're running it on the monitoring network (--net=container:influxdb would also work), with our kapacitor.conf mounted into the container:

docker run -p 9092:9092 -d --name=kapacitor -h kapacitor --net=monitoring -e KAPACITOR_INFLUXDB_0_URLS_0=http://influxdb:8086 -v $PWD/kapacitor.conf:/etc/kapacitor/kapacitor.conf:ro kapacitor

The Telegraf docker

First we need to generate a config file for our needs, so:

docker run --rm telegraf --input-filter cloudwatch --output-filter influxdb config > telegraf.conf

And then we need to fix the region, credentials, and so on (not a lot of changes).
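For reference, the relevant bits of the generated telegraf.conf end up looking roughly like this – a sketch, where the region, period and database name are assumptions you should adjust to your own setup:

[[inputs.cloudwatch]]
  # AWS region to pull the metrics from (assumption - use your own)
  region = "us-east-1"
  # granularity of the CloudWatch metrics, and how often to collect them
  period = "5m"
  delay = "5m"
  interval = "5m"
  # the CloudWatch namespace to monitor
  namespace = "AWS/EC2"

[[outputs.influxdb]]
  # our influxdb docker, resolved over the "monitoring" network
  urls = ["http://influxdb:8086"]
  database = "telegraf"

Then run the docker: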

docker run -d --name=telegraf --net=monitoring -v $PWD/telegraf-aws-influx.conf:/etc/telegraf/telegraf.conf:ro telegraf

and

docker logs -f telegraf

Let’s monitor!

So we have all the services up — Grafana, InfluxDB and Telegraf. By now, Telegraf should be pulling data from AWS CloudWatch and storing it inside InfluxDB. So now we need to hook Grafana up to that data stream!

Create a new DataSource from your InfluxDB, with db = telegraf (you'll have to input it in the DataSource page), and call it influx_cloudwatch.

Create a new Dashboard with your influx_cloudwatch data source, and create a new Graph.

Entities problem

As you might have noticed, we now have all these metrics, but we have a problem: we want to monitor our performance by application, and most of the data is available to us with only instanceIds (and these are not fixed, because we use ElasticBeanstalk).

Some of the data measurements, like AWS/ECS, are available with a clusterName tag, which is a bit similar to our application name ("awseb-${appName}-rAnd0mNum"), but the AWS/EC2 instances only come with an AutoscalingGroup tag, which is not very indicative of our application names (awseb-r-x5r778aw-stack-AWSEBAutoScalingGroup-123Z3RRAFF86). So, we need to find a normal way to add the application name to both the EC2 data and the ECS data, so we can build something that makes sense.

So we’re using Kapacitor:

To add a script: kapacitor define ${scriptName} -type stream -tick ${scriptFileName}.tick -dbrp kapacitor_example.autogen

To write to the db, use the InfluxDBOut node: https://docs.influxdata.com/kapacitor/v1.3/nodes/influx_d_b_out_node/
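For illustration, a TICK script along these lines can copy the EC2 measurements into a new, app-tagged measurement – a sketch, where the measurement names and the tag value are assumptions based on our setup:

// tag every incoming EC2 point with our application name,
// and write it back to influx as a new measurement
stream
    |from()
        .measurement('cloudwatch_aws_ec2')
    |default()
        .tag('app_name', 'my-app')
    |influxDBOut()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cloudwatch_aws_ec2_tagged')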

Let’s Alert!

In order to add alerts to our graphs, we first need to add Alert channels. In Grafana, go to Alerting, and add an alerting channel. The easiest one imho is the Telegram alert channel.

Just install Telegram (on your local machine and on your mobile phone), and then go to the BotFather. Create a bot according to the instructions, and after you create it, run /token. It will give you the Bot API Token. The next thing you need is your chatId. To get that, just go to get_id_bot, which will give you your Chat Id.

That's all you need. Now you can go to one of the graphs, hit 'Alerts', and from there 'Notifications'.

Sanities

If things don’t seem to work:

First, log in to your Grafana docker and run curl -G 'http://influxdb:8086/query?pretty=true' --data-urlencode "q=SHOW DATABASES". If the query passes, you can access InfluxDB from the Grafana docker. Now run the same thing from the Telegraf docker.

Also, the reason you should add "influxdb" to your hosts file is this: when we run all the dockers in the same network, they can access each other seamlessly. However, when we open Grafana using our local browser and try to add influxdb as a Data Source, it is all done on the client — which is the host(!) of the dockers. So it doesn't know what "influxdb" is. That's why we add it to the hosts file.