My wrap up of interesting/practical machine learning of 2022

At the beginning of 2021, I wrote a blog of some of the most interesting ML I had seen in 2020. 2022 has seen huge advances. The high-level recap of that blog was:

  • AI/ML is coming…fast
  • AI/ML is starting to get really good. In some cases, it was getting indistinguishable from human-created work. That means problems like impersonation are serious
  • AI/ML has an opportunity to displace a huge portion of the economy. Yes, your job might be “at risk” — don’t discount that
  • Many people thought art is/was “safe” from AI/ML, but it is/was not

Overview of 2022

If I were to classify 2022 in AI/ML, I’d say it was “the starting year of democratization of AI/ML.” AI/ML compute costs and tactics have made many AI/ML-driven endeavors practically available to the layperson where historically they’ve required thousands (or millions) of dollars of hardware costs and highly specialized ML engineers. To prove, this is democratization, try any of the following, many of which are discussed below:

  • Go sign up for a ChatGPT account. Ask ChatGPT to write you a blog or write some code for you or answer a technical question for you.
  • We just launched Vectara, which uses large language models to understand both the documents and what you’re searching for. And we have a free tier offering (and it’s otherwise cost-competitive to traditional keyword systems).
  • You can generate an image of your choosing for free online (or on your own computer) with stable diffusion. DALL-E IS $0.02 at the highest price per image.
  • GitHub Copilot is ~$10/mo to help you write software.

These are entirely affordable for highly skilled tasks! These costs are only going to come down and the quality is only going to get better.

ChatGPT

We should probably talk about ChatGPT first, since it’s been in the news most recently and arguably most significantly. It’s probably worth mentioning why I think that’s the case: yes, the technology is incredible, and yes, the way the application works is almost contagious, but I think the biggest thing is that ChatGPT is the first time journalists are really able to directly connect with what’s actually happening. Because it’s doing creative and analytical writing: in many ways, the same way they would.

Here are a few things I and my friends have asked ChatGPT to respond to:

PromptChatGPT Response
What does an education attorney doAn education attorney is a lawyer who specializes in education law. They provide legal advice and representation to individuals, schools, and education organizations on a variety of education-related issues. This can include issues related to students’ rights, special education, discipline, discrimination, and other areas of education law. Education attorneys may also be involved in advocacy and policy work related to education. They may work for a law firm, government agency, or non-profit organization, and may also serve as private consultants to individuals or schools.
What is the API call standard to retrieve details on a zipe code Google maps?The API call standard to retrieve details on a zip code using Google maps is:
https://maps.googleapis.com/maps/api/geocode/json?address=zip_code&key=API_key
Where “zip_code” is the zip code to be searched, and “API_key” is the API key provided by Google Maps.
I’m a home brewer. Can you write me a beer recipe named The Winter Telegraph? I really like Belgian strong ales. It should include some coriander.Sure, here’s a recipe for a Belgian strong ale that could be called The Winter Telegraph:

The Winter Telegraph
Ingredients:
– 9 lbs Belgian Pilsner malt
– 1 lb Belgian candi sugar
– 1 lb Munich malt
– 1 oz Styrian Goldings hops (bittering)
– 1 oz Saaz hops (aroma)
– 1/2 oz coriander seeds
– 1/4 oz orange peel
– Belgian ale yeast (such as Wyeast 3522)

Instructions:
1. In a large pot, heat 6.5 gallons of water to 155°F.
2. Crush the malt and add it to the water. Stir to distribute the grains evenly.
3. Hold the temperature at 155°F for 60 minutes.
4. Add the candi sugar and stir to dissolve.
5. Bring the temperature up to 168°F and hold for 10 minutes to pasteurize the mash.
6. Strain the liquid (called wort) through a fine mesh to remove the solids.
7. Transfer the wort to a clean pot and bring to a boil.
8. Once the wort is boiling, add the Styrian Goldings hops and boil for 60 minutes.
9. With 15 minutes left in the boil, add the Saaz hops and the coriander seeds.
10. With 10 minutes left in the boil, add the orange peel.
11. After the boil is complete, cool the wort to 68°F and transfer to a fermenter.
12. Pitch the yeast and ferment at 68°F for 7-10 days, or until fermentation is complete.
13. Transfer the beer to a keg or bottles and let it age for 2-4 weeks before enjoying.

This recipe will produce a flavorful, complex Belgian strong ale with hints of coriander and orange. The final beer will have a golden color and an alcohol content of around 8-9% ABV.
Can you write a blog on why people should not use stopwords for their search system?Sure, here is a blog on why people should not use stopwords for their search system:

Stopwords are common words that are often filtered out of search queries and text analysis because they are believed to be uninformative and do not contribute to the meaning of a sentence. Examples of stopwords include “a,” “an,” “the,” “and,” and “but.” While stopwords may seem like an easy way to improve the efficiency of a search system, they can actually be detrimental to the search experience for a number of reasons.

Stopwords can be meaningful in certain contexts: In some cases, stopwords can be essential to the meaning of a sentence or query. For example, a search for “the king” could refer to any king, while “the King” with a capital “K” could refer specifically to Martin Luther King Jr. Removing stopwords in this case would result in a less accurate search.
Stopwords can improve search relevance: Even when they do not directly contribute to the meaning of a sentence, stopwords can still be useful for improving search relevance. For example, including the stopword “not” in a search query can help to narrow down the results and exclude irrelevant content.

Stopwords can provide important context: Stopwords can provide important context that helps to better understand the meaning of a sentence or query. For example, the stopword “but” can indicate a contrast or exception, and the stopword “because” can indicate a causal relationship. Removing these words can result in a less accurate interpretation of the search query.

In conclusion, while stopwords may seem like a simple way to improve the efficiency of a search system, they can actually be detrimental to the search experience. Stopwords can be meaningful in certain contexts, improve search relevance, and provide important context that helps to better understand the meaning of a search query. It is important to carefully consider the use of stopwords in a search system to ensure the best possible search experience for users.

Of course, the implications are much broader: questions swirl like “will (something like) ChatGPT replace search engines?,” “will kids (continue to) use (something like) ChatGPT to do/edit their homework?,” and “can (something like) ChatGPT actually write/edit marketing brochures and similar copy?” I think we’re well past the answer to these being an “obviously, yes — maybe not perfectly now, but people aren’t perfect either — and it will continue to get (a lot) better each year.”

As a reminder: things don’t need to be perfect to be “good enough.” Have you ever been provided a bad weather forecast? Gotten an e-mail with grammatical or content errors? Has a doctor ever misdiagnosed an illness? Have you ever experienced a software bug? Of course the answer to all of these is “yes.” We’re used to people making errors and we will become used to machines making some errors too. When the error rate is lower for machines than humans and the cost is lower, the reasons for switching will be obvious. But even with the error rate higher, if the cost is several of orders of magnitude cheaper, it might still be obvious.

AI Wins an Art Contest

As I mentioned in 2021, AI is coming for the artists, and in 2022, we saw AI “win.” At the Colorado State Fair, Midjourney — an AI company currently focusing on image/art generation — won first place in the digital art contest. Or rather, the Jason M Allen — the person that asked Midjourney to create its art — did. As you can imagine, this didn’t go over well with other artists that competed and felt cheated.

For a few bucks a month, with DALL-E or Midjourney or Stable Diffusion, you too can “commission” a huge number of pieces of different styles and inputs. Will they be perfect? Well, to answer that question, we should think seriously about “what is perfect art” and “does everyone appreciate ‘good’ art?” I think the most immediate applications are in digital art and in particular, web graphic design. Here are some images I’ve had DALL-E create along with the prompts that created them:

“An Internet badge of honor for sending the 1st e-mail vector image style ”
“Two boys digging in sand to excavate dinosaur bones in the style of a childrens coloring book”

“an empty trashcan falling over, cartoon style”
Example Images I’ve Generated on DALL-E 2

Are these absolutely perfect? No. But they were also free and completed in seconds. And I used the last of these as the underlying basis for an image I finished and published on a blog.

Make no mistake: the rate of innovation in art generation is so high that in the past 2 years since my 2020 recap, that we’re truly in a different world order now. And you should not expect this pace to slow down: we’re already seeing AI-driven video generation from a text prompt just like the image versions here. It’s early, so expect mediocre quality at best, but that was also true in 2020 for image generation.

Writing Code

Perhaps most astoundingly to some software developers is that several AI projects can now write code, sometimes/often as good as if not better than a junior developer. GitHub launched Copilot at $10/mo, but even ChatGPT can generate reasonable code from a text prompt like “write a program in Python that finds the factorial of a number.”

It even bothers to explain how it works!

I don’t expect ChatGPT to replace humans any time soon at writing complex software, but it’s certainly on the path to helping humans write software much faster and more efficiently and I think for simple programs and processes, it can already do pretty much all of the heavy lifting.

What’s Next?

Given what an incredible ride 2022 has been, I’d like to speculate a bit on what 2023+ may have in store. The typical adoption pattern for AI/ML tools is:

  1. Help the user do something better/faster
  2. Learn from the user on what they were originally trying to do
  3. Do the thing they were trying to do and replace them wholesale

If you go back to the 2020 recap I wrote, AI was first just assisting artists that could finesse the final form by e.g. digitally removing artifacts or enriching an image. But now, they’re generating art and winning art prizes.

With the much broader prevalence of AI/ML programs now and the rapid development of them, what might happen?

The Immediate: Banning AI/ML

So I don’t need to be much of a prognosticator to predict this one: the first thing that will happen is an attempt to ban AI tools from doing the work and instead only allowing them to assist with the work. Perhaps out of an immense sense of self-awareness, StackOverflow (a website dedicated to software developers helping other software developers) has already banned ChatGPT answers. The reasoning here is nuanced: that they want high-quality well-thought-out-and-reviewed answers. My prediction is this won’t last long: it’s a competitive edge to answer questions quickly, even if some % of the answers are wrong. Human users have historically provided bad answers on the site, yet those users were not banned; the overall quality was still “good enough” to make the site incredibly popular among software developers. It’s going to be tough to hold the line on “no AI” as it moves closer to being indistinguishable from developers.

The old “move fast and break things” will apply I suspect and it will roll through site-by-site, industry-by-industry as executives weigh the cost of getting some % wrong vs the cost of hiring people to do those things (and still getting some % wrong).

Next: The Collapse of Jobs?

A few questions I have burning in the back of my mind are perhaps a bit of a downer:

  • Why offshore software development or hire junior developers at 1/2 the cost when you can 10x the efficiency of a single developer by giving them an AI assistant (and then eventually replace even that developer)?
  • Trucking is 5% of the all full-time employees (~3.5 million) at around $50k/trucker/yr in the US and “even” human truckers are involved in over 400k crashes per year. If autonomous truck driving can get on even close to humans for insurance costs (one less body to be injured in an accident!), why wouldn’t many of those 3.5 million jobs suddenly disappear?
  • How many people are willing to pay $100-10,000 to commission an art piece when they could commission 50 for the price of a coffee?

Of course not every job can (or will or should) be transitioned over to AI overlords. I think a lot of people wouldn’t accept paying $1 for an AI doctor visit that meant they were 5% more likely to die from taking the doctor’s advice. But there are a lot of jobs/sectors of the economy that are driven by “good enough” where the stakes aren’t so high. When I look at the BLS data, my sense is that 20% of the US workforce isn’t terribly far away from being considered “too expensive” compared to potential AI counterparts.

And a lot of the 80% remaining isn’t necessarily impossible to reach: just that there’s probably nowhere near enough training data to train an AI on it…yet.

I’m going to get a bit political here, so if that’s not your jam, skip over to the next/final section.

The question I’ve been grappling with is: if AI can effectively compete with humans on even a few % of jobs to make people unemployable in those sectors, where will that leave the economy and the people that worked there, and what can/should we do about it?

On a broad economic sense, you can expect economic output to go up because more work can be done. But on an individual level, losing a significant % of jobs means that this economic output will be rewarded more to shareholders of organizations than to those that would typically be “just” doing the work. I think the US (and many countries) are vastly underestimating the risk to individuals here and some combination of massive-scale education/retraining, some corporate-tax-driven safety net for those that will lose their jobs, or some other solution here is needed — and fast. I haven’t seen any serious policy discussion on this barring things like “don’t let self-driving cars drive” and other weak-ban-based policies.

Soon: The Singularity?

OK, now a final thought: AI can now write code. Not great yet, but neither was image generation just a few years ago. There exists a hypothetical called “the singularity” for those that haven’t heard of it where perhaps at some point in the future, we’ll be on such a rapid pace of technological development that it feels like the infinite acceleration of falling into a black hole (and hence the name).

With AI that can write code, what’s to stop it from writing better versions of itself? From self-improving? At what point will we consider that “life?” If it can write code, and gains ambition, and that ambition is to “grow itself,” what’s to stop it from trying/finding new software exploits — to act as a hacker? “Why would it do that?” you ask? Well, if it has ambition to, and it can, find exploits, it can potentially write code to “clone” itself outside of walled gardens: to take advantage of vulnerabilities in users’ software they use to interact with it. To then use their computers to add to its own computational capabilities.

Or maybe they’ll be patient, kind, and want to live alongside us.

Building and Operating a Remote Company: Calendars

In a previous blog, just as the COVID-19 pandemic was really getting in gear in the US, I wrote about the top advice for employees working from home. I got a lot of good feedback from friends and former colleagues on that blog and I’ve been thinking for a while about doing one from the employer side, so here it is: if you want to build a remote-first team or product, what habits and practices should you get into?

As background, I’ve been working primarily “from home” since about 2015, and managed people as far away as 12 hours in timezone differences while growing highly successful products. I’ve decided to turn these top tips into a blog series now, which you can find in the “Remote Working” category on this blog.

Calendaring & Meetings

TL;DR: Avoid booking meetings. Instead of meetings, write to communicate until latency / bandwidth of writing is unacceptable. Avoid using common overlapping times for meetings unless necessary.

One of the most overlooked and difficult things to get beyond when helping manage a remote-first company/team is meetings. Coming from an office environment, we’re often used to “grabbing a conference room” or “hopping on a call” to expedite information transfer. But if you were in (for example) New York and your colleague was in (for example) Beijing, you of course wouldn’t really expect to grab a conference room or hop on a call for regular 15-30 minute one-off meetings to try to hash something out: it’d be too expensive to ask someone to work the hours that do cross over. This is where a lot of people get stuck/fail and decide that remote working “doesn’t work.” The inability of managers and companies to rectify this has been one of the biggest reasons I’ve seen remote working failing.

One of the most common things newly remote workers — especially in management — try to do is to get an office-like environment by creating geographic centers of specific types of knowledge workers (“people that work on X work in Beijing and people that work on Y work in New York”). This does result in a globally distributed team, but few of the actual benefits of remote working: individual employees still need to largely work the specific local hours (no “take a mental break/exercise time from noon to 2pm, but then pick back up after dinner”) and you’ve now isolated knowledge to specific geographies (and sometimes languages) in a way that’s very difficult to undo.

The best way I’ve found to do this is to try to really push back against all standing meetings: see what can be done asynchronously. Even if a colleague lives in a geographically close area: still see if you can go asynchronous by writing down questions, elaborating where needed, and waiting diligently for a response. This admittedly degrades the bandwidth of individual conversations: verbal conversation is almost always faster than typing and allows you to convey more information than what words alone can often convey in the same time. However, if you go through with this, you are forced into two of the greatest corporate advantages of actually working remote:

  1. Everyone is forced to write things down: how they came to the decision they did, what other relevant details are, appropriate links to further reading, etc. To be clear, this is more work to the producer of such content than hopping on a quick call for 15 minutes to verbally explain. But the advantage is that it can live on: it breaks down information silos instead of building them up, and employees that weren’t around in the initial discussion can get forwarded along a written “from the horse’s mouth” collection of information, which can be invaluable when you’re training lots of new hires or someone is out sick or some other reason they really should know some of the information but weren’t initial receivers.
  2. Employees stress less about “needing to pick up the kids at 4pm” or “go to the dentist/doctor” because there should rarely be meetings except those that are absolutely necessary booked anyway at any particular time

There is a balance here: if a Slack/e-mail/etc thread has “gone way too long/far” that’s when you can and should decide to give up and move on to a synchronous meeting: but treat it as the exception for when absolutely necessary vs the default.

My wrap up of interesting/practical machine learning of 2020

As part of my work (but also personal interest) I try to stay on top of not just the research side of machine learning (ML — what many folks think of as part of artificial intelligence) but also practical and interesting examples. On the professional side, this is because a lot of the software I’ve worked to create over the past 13 years has directly built in elements of machine learning. On a personal level, I’m interested particularly in the economic impacts of some of what machine learning will bring: which jobs will be automated away or significant portions will be automated (see: “Humans Need Not Apply”).

To some extent, this is a sort of dystopian view, and I won’t get into my thoughts on what can/should be done about it, but I do want to point out that it’s not just the simple, repetitive, or labor-intensive jobs that can be automated. Some of the most interesting developments in machine learning over the past couple years have been in creative tasks and tasks which most people associated with the type of thinking only a human could do.

In this blog, I’m going to outline some of the most interesting projects in ML/AI that fit the bill of doing creative tasks or logical reasoning and which have online demos or videos of the demos, most of which have launched roughly in the past 1-2 years. You can actually go play with many of these things yourself to get a sense of where certain aspects of ML are at the start of 2021.

Image Generation

One of the things that most people associate with “a thing only a human could do” is to generate art. Maybe it’s taking a lovely photo or painting something very creative. Here are some online demos that show that machines can now do this too (and how well they do so):

DeepArt allows you to upload a photo that you take and then apply a style. For example, here I am “painted” automatically by a machine in the style of Vincent Van Gogh in just minutes from a photo I took in seconds. There are a number of interesting implications to this, ranging from forgery to novel artwork creations to “allowing anyone to become an artist.”

GauGAN allows you to create photo-realistic images by just drawing an image like you would in MS Paint. Here’s an image I drew of a mountain and a hill in the ocean next to a beach with a cloud in just a few minutes and an example output:

It doesn’t take that much to imagine how you could use something like to create art of places that don’t/can’t exist and you can imagine combining strategies of something like this with something like DeepArt to create paintings that require very little skill and only a good imagination.

Dall-E: Taking the previous examples a step further, what if you could just type up what you want an image of? That’s what Dall-E does (fresh off the presses as of January 5, 2021). Dall-E can take text that you type and generate an image for it. Their examples on the blog do a lot to spark imagination and you can play around with a few examples. You can go to this link to see how something like this might generate an image of “an armchair in the shape of an avocado” or “a professional high quality emoji of a happy alpaca” or my favorite: “an illustration of a baby daikon radish in a tutu walking a dog.” This type of thing has the potential to radically change illustration and design work.

Audio/Music Generation

It’s not just visual art/artists that are going to be under the ML gun. ML can now make music too.

MuseNet allows computers to dynamically generate new music from a text prompt. For example, “use the first 5 notes of Chopin Op 1, No 9 as a basis to generate an entirely new song.”

The original piece
Computer generated piece

If you follow through to the MuseNet blog, you’ll see it can combine musical styles, generate new music from a few starting notes, or just give it a prompt like “Bluegrass piano-guitar-bass-drums.”

GPT-3 lyric generation. It doesn’t just stop at the tone generation, ML can generate lyrics now too. Here’s a song with lyrics written entirely by a machine:

Oh yeah, and ML can even sing your song for you. Here’s over 7000 songs that are generated/sung entirely by machines. Are they perfect? No — especially not rhyme scheme or some of the voice impersonations. But those are getting better too…

Impersonation

There are now a series of “this _____ does not exist” generators that you can explore. This person doesn’t exist, this cat doesn’t exist, this horse doesn’t exist, this artwork doesn’t exist, and hey, even this chemical doesn’t exist because why not. Reload each page to see a new one of these that don’t exist. Don’t find something category of thing you want to create? There’s a way to generate the new category here if you have some software knowhow. These seem fairly benign at the surface (who cares that a fake person/cat/horse/… image could be generated), but the implications to this type of thing go far beyond the amusing.

Want to impersonate another person’s voice? Generate your own audio as Dr Who or HAL 9000 at 15.ai.

Want to impersonate another person entirely as a video? All of the following are fabricated by having ML figure out how to generate a person’s lookalike with explicit facial expressions.

This is over 2 year old technology
https://www.youtube.com/watch?v=VhFSIR7r7Yo
Now it’s coming up in an entirely new way to generate satire and much more nefarious purposes
Definitely not the most PC, but that’s part of the craziness of ML-generated audio/video

The visual artifacts you see on these videos are going to disappear over time as computational power increases. Now imagine combining these 2 together: fake speech generated by a ML model of a famous person combined with a fake video of that person’s facial expressions and movements and you can see you don’t even need to hire a voice actor to create really serious challenges in categories of fake news, legal challenges against verbal contracts, etc.

Games

The classic example of ML beating a human lies in the realm of Chess, and more recently with a game computers were thought to be unable to play competitively, Go. But there are other games you may not think of.

ML can play pictionary, for example now live in your browser against you. Or hey, need help drawing your pictionary item? ML can help you complete your sketch. It can answer trivia questions. Or it can make up a dungeons and dragons game on the fly for you. Or check out these 3 videos of ML playing games you’ve probably played — and doing so better than you.

ML playing Mario
ML playing pool
My favorite: AI playing hide and seek

There are a number of interesting implications to this type of thing. One is that — if you play games — I imagine we’ll see much more complicated AI bots that play against you. But the AI playing hide and seek in particular is interesting because it involves some lightweight construction with specific goals. There are far more advanced versions of engineering and behavioral optimizations that exist outside of these demos. For example, in the past year, an AI pilot beat the top Air Force fighter pilots 5-0 in a dogfight simulation. You can see where “games” can quickly apply to real-world situations.

Other Professions

There are already entire companies set up to reduce time, improve the quality of output, or entirely replace people from the process of certain professions. Here are a few recent examples:

And there are a variety of other professions which already have working demos or systems in place to help.

This is not comprehensive and “academic” and certain other types of applications that are still too new to be available to the public in demo form aren’t here but I hope this helps show a bit of what’s come around in the past year or so in the world of ML in ways you can go exploring yourself!

Converting shapefiles into GeoJSON (for Elasticsearch)

This blog is all about finding, converting, and adding geo data from a commonly found format (shapefiles) to a format that can be used by json engines like Elasticsearch.

Finding Data

Open geo data can be found in a lot of places. Open city data is a great source of geo data in many jurisdictions. Searching for “open data <cityname>” can yield a lot of results. For example, https://datasf.org/opendata/ is San Francisco’s open data portal. Some jurisdictions will have dedicated GIS portals.

You’ll often find geo data in a few formats:

  1. A CSV of geo points
  2. Shapefiles for geo points
  3. Shapefiles for geo shapes
  4. WKT (Well-known text)
  5. GeoJSON

Elasticsearch natively supports WKT and GeoJSON and I’ll leave the work to import CSVs as an exercise to the reader for now. I’m going to focus this on how to convert/import shapefiles. Sometimes GeoJSON has a full FeatureCollection which is essentially a full layer of features, that, at least for Elasticsearch, needs to be converted to a list of individual Features (points and shapes). I will cover that here in Breaking a GeoJSON FeatureCollection up

In this example, we’ll use the counties in Atlanta, which can be found at http://gisdata.fultoncountyga.gov/datasets/53ca7db14b8f4a9193c1883247886459_67. You can go to Download -> Shapefile to get the shapefile zip file. In this counties example, this looks like this once I’ve unzipped:

$ ls  
Counties_Atlanta_Region.cpg Counties_Atlanta_Region.dbf Counties_Atlanta_Region.prj Counties_Atlanta_Region.shp Counties_Atlanta_Region.shx

The Wikipedia article on shapefiles has a breakdown of what all of these files actually contain, but we only need 1 file for the rest of this exercise: the .shp file, which contains the feature geometries itself.

Converting Shapefiles to GeoJSON

After you have a shapefile, the next step is to get the data into GeoJSON format.

Looking at the Atlanta counties again, the Counties_Atlanta_Region.shp file is the one that’s interesting to us. We’ll use a tool called ogr2ogr to convert .shp files to GeoJSON. ogr2ogr is part of GDAL and can be installed on a Mac if you have homebrew installed by using:

brew install gdal

Alternatively, you can install it manually. ogr2ogr is a wonderful tool to have on your laptop for using/testing geo data. Once you have it, continuing with our example, you should be able to run:

ogr2ogr -f GeoJSON -t_srs crs:84 output_counties.json Counties_Atlanta_Region.shp

This means:

  • -f GeoJSON: Output to GeoJSON format
  • -t_srs crs:84: There are a lot of coordinate reference systems. If you know you need data in a different coordinate system, you can override this with something else, though that’s going to generally be a highly specialized case. We’re telling ogr2ogr to use WGS84 on the output, which is the same system that GPS uses.
  • output_counties.json is the output file
  • Counties_Atlanta_Region.shp is the input file.

After you run this, you now have a GeoJSON file! You can inspect this file if you like to see what a GeoJSON file looks like if you’ve never had a look at one.

Breaking a GeoJSON FeatureCollection Up

If we look at the resulting GeoJSON file in the previous step, we see at the top of it:

“type”: “FeatureCollection”, “name”: “Counties_Atlanta_Region”, “crs”: { “type”: “name”, “properties”: { “name”: “urn:ogc:def:crs:OGC:1.3:CRS84” } }, “features”: [ …

Elasticsearch, like several systems that handle geo data, handles most GeoJSON, but FeatureCollections are essentially an array of objects that want to separately index for most purposes. FeatureCollections are sort of like a “bulk” dataset and we need to get individual points/shapes (Features) so that we can search, filter, and use them. In this example, the individual features are individual counties in Atlanta. This is where jq comes in handy.

jq can also be installed via homebrew:

brew install jq

Afterwards, you can do “select the array of features[] from output_counties.json, and output 1 feature per line” by

jq -c ‘.features[]’ output_counties.json

The -c flag means “compact” — it outputs 1 feature per line, which can be useful for what we’re about to do next…

Simultaneously Extracting Features and Converting to Bulk Format

We can do 1 step better than just extracting the features array by simultaneously converting the output to Elasticsearch’s bulk format with sed:

jq -c ‘.features[]’ output_counties.json | sed -e ‘s/^/{ “index” : { “_index” : “geodata”, “_type” : “_doc” } }\

/’ > output_counties_bulk.json && echo “” >> output_counties_bulk.json

The sed bit just adds a bulk header line (and a newline) per record and the echo "" >> output_counties_bulk.json makes sure the file ends in a newline, as this is required by Elasticsearch.

Change geodata to an Elasticsearch index name of your choosing.

Set Up Elasticsearch Mappings

At this point, I’d set up the Elasticsearch mappings for this “geodata” index (or whatever name you want to give it). Metadata related to the shape is often in .properties and geo shape data is often in .geometry. The county data here looks typical:

jq -c ‘.features[].properties’ output_counties.json

Shows us a list of properties like:

{“OBJECTID”:28,”STATEFP10″:”13″,”COUNTYFP10″:”013″,”GEOID10″:”13013″,”NAME10″:”Barrow”,”NAMELSAD10″:”Barrow County”,”totpop10″:69367,”WFD”:”N”,”RDC_AAA”:”N”,”MNGWPD”:”N”,”MPO”:”Partial”,”MSA”:”Y”,”F1HR_NA”:”N”,”F8HR_NA”:”N”,”Reg_Comm”:”Northeast Georgia”,”Acres”:104266,”Sq_Miles”:162.914993,”Label”:”BARROW”,”GlobalID”:”{36E2EA48-1481-44D7-91C9-7C51AC8AB9E9}”,”last_edite”:”2015-10-14T17:19:34.000Z”}

At this point, you can add any mappings around these fields and/or use an ingest node pipeline to manipulate the data prior to indexing. For now, I’m just going to set up the geo_shape field, but you can add extras.

PUT /geodata  
{  
  "settings": {  
    "number_of_shards": 1  
  },  
  "mappings": {  
    "_doc": {  
      "properties": {  
        "geometry": {  
          "type": "geo_shape"  
        }  
      }  
    }  
  }  
}

Bulk Loading Data to Elasticsearch

And at this point, you can bulk-load the data

curl -H “Content-Type: application/x-ndjson” -XPOST localhost:9200/_bulk –data-binary “@output_counties_bulk.json”

And then you can set up or reload Kibana index patterns for your index to make sure it shows up. Make sure to change any time filters to be appropriate with any visualizations you use. I often turn off the “time” field for quick demos as it can often be inconsistent/missing dates (as I found this Atlanta county data to be).

Recap / TL;DR

Get a shapefile

ogr2ogr -f GeoJSON -t_srs crs:84 your_geojson.json your_shapefile.shp

jq -c ‘.features[]’ your_geojson.json | sed -e ‘s/^/{ “index” : { “_index” : “your_index”, “_type” : “_doc” } }\

/’ > your_geojson_bulk.json && echo “” >> your_geojson_bulk.json

Set up your mappings. Often the following works, but you may need to check field names:

PUT /geodata  
{  
  "settings": {  
    "number_of_shards": 1  
  },  
  "mappings": {  
    "_doc": {  
      "properties": {  
        "geometry": {  
          "type": "geo_shape"  
        }  
      }  
    }  
  }  
}

curl -H “Content-Type: application/x-ndjson” -XPOST localhost:9200/_bulk –data-binary “@your_geojson_bulk.json”

Set up (or refresh) Kibana index patterns to include your_index.

Voila!

Top Tips on Working from Home

I’m writing this at the time of a worldwide pandemic of COVID-19. We’re facing rapid spread of the disease and in the hopes of retaining healthy employees, many employers are providing their workforce with the option if not encouragement (and in some cases, a mandate) to work from home. If it is an option for you to work from now, I urge you: please do.

However, many employees have never (or rarely) worked from home and the idea of being isolated all day can be really daunting. I’ve worked almost exclusively from home for the past 4.5 years my company is mostly remote workers. I’ve been there from when we were about 150 employes up to roughly 1500 today. We’ve gathered a lot of strategies and lots of tools to make it more effective and fun. I’m posting the top 5 tips here I tell every new remote worker to think about in case it’s of any use to any of you.

Feelings of isolation and disconnection from the outside world are normal and must be actively combatted.

Find a way to get direct person-to-person — ideally digital face-to-face — communication into your workday. Zoom has a free plan where you can set up video conference for 1-on-1 meetings if your company doesn’t have another video conferencing service. Using video platforms extensively to talk through problems or just make a human connection, can really help you feel much more connected.

As a personal anecdote, I frequently (5-10 times per day) spin up a quick 5-10 minute video conference to chat with someone about an e-mail they sent or brainstorm ideas together. It makes a big difference and can increase the amount of information you can convey in a short amount of vs an e-mail because facial expressions and voice conveys emotions in ways that digital text cannot.

When you’re isolated at home, keeping a “normal” schedule can be really hard. Try.

It’s not obvious when you’re taking a break, “out” for lunch, or have “left work” for the day. It’s not obvious to you about your coworkers’ schedules either.

Block off regular breaks in your calendar: lunch, exercise (even if it’s just 30 minutes to go for a walk to help clear you mind), etc. Many of my coworkers set up a few 15-minute meditation breaks per day, for example. Adding these kinds of scheduled items into your day can serve to remind you that you’re a human living in the real world with real, physical things to go do and see and not just get sucked into a screen for 8+ hours nonstop.

Try to use a separate area in your house/apartment as your “work area.”

We’re creatures of habit and for workers and many of us are used to a “normal” commute where we listen to the radio or podcasts or drink our coffee or read a book on a train or bus. When you’re doing that, there are a lot of habits you get into and they help to serve as mental switches to get you into or out of a mindset of work. By having a separate work area, you can have a (much shorter!) “commute” and this can help get you in a mindset that when you leave that work area, you’re not working.

Make sure to do all your normal morning and evening rituals: shower, get dressed normally, and clear your mind with a book or a podcast or whatever it is that gets you ready. “Go out for lunch” and disconnect for a bit even if “going out” is going to your kitchen.

Use short meetings.

Things like breaks to go to the restroom aren’t as obvious as when you’re in an office and you have a physical presence. Set up “short meetings” of 25 minutes or 50 minutes instead of 30 minutes/1 hour so that there’s a 5-10 minute break between meetings for bathroom, coffee, water, etc.

Work hard for social interactions.

Things like the company watercooler and coffee machine aren’t around when you’re at home and those can be normal sources of social interaction. At our company, we set up a channel called “watercooler” on Slack and an always-on video conference anyone can join to chat about…anything whenever they happen to be online. Many people have just sat on there with their video on working silently until someone had something to talk about. Last week, one of our coworkers set up a virtual happy hour to get together and just chat about…again, anything.

These online social interactions take more work in many ways, but they’re some of the most important things many people have at their normal jobs and they don’t think about until they actually have to work from home.

If anyone needs advice or has questions or feels anxious or otherwise just wants to talk, feel free to reach out to me. If others that work from home very frequently have thoughts/additional top tips, feel free to post them as a comment response here.

Baby Naming Project

tl;dr: you can get a new person/car/pet/whatever naming app now at https://github.com/eskibars/lentil-namer

As a first-time dad-to-be, one of the things I’ve noticed is that the first thing anyone asks you about your baby is if you know the gender. The second thing they ask you is if you have names picked out. The first one is kind of easy: you’ve either decided to learn about the baby via e.g. an ultrasound or you’ve chosen to not — but ultimately it’s not really up to you. The second one is harder: you’re given a choice to name someone that’s going to be something they probably will live with for the rest of their life.

One of the things I’ve learned in my time as a product manager is that naming things is the hardest thing in virtually any profession. Naming a human is doubly so. Agreeing on a name for a human between 2 separate humans… that’s the hardest: you have no chance for quorum without unanimity; the number of reasonable options for you to both agree on is virtually limitless; and it’s difficult to feel you actually each had an “equal” part in the name given that one of you is likely to propose some name first. This is where new baby naming apps like this have come up: swipe left/right to suggest you like a name and your partner will be able to swipe as well. If it’s a match, ta-da! You’ve found one you both like. No bias, no judgement on proposing “silly” names, and no feeling bad about shooting down your partners’ name idea that you think is silly.

However, I got frustrated with the options that were out there for this problem. They all require some application installation with sign-ups and unknown data sharing practices for what I felt like was a pretty simple problem. They also often didn’t work well either with iOS or Android or didn’t allow you to participate across the two. And finally, many didn’t give you a lot of options like choosing the origins of the name or showing you the meaning. Perhaps most importantly, I was interested in trying my hand at writing a small application that took advantage of the block-max weak-and (WAND) that we released as part of Elasticsearch 7. It’s quite nifty in that it lets you bias certain results and return them with speeds that can be orders of magnitude faster than 6.x or earlier as long as you don’t need a total count of hits at result time. (In fact, there is no bound as to how much faster this algorithm can be.)

So with a baby on the way I set about to try my hand at creating something useful with block-max WAND: a way for my wife and I to name our future child without bias between ourselves!

“Hey Michelle, I’ve got the best name for our child’s name!” – me

The software

I wrote a bit of software which attempts to overcome some of the limitations I’ve seen in these apps. It:

  1. Runs as a web application which means you can load it in any browser (including mobile) and swipe left/right. No app install / you can control your own data
  2. There’s no sign-up required. You start it up, start a project, and you’re good to go. No data-sharing with unknown 3rd party companies. You can host it yourself.
  3. Lets you bias the origin and the gender of the baby
  4. Lets you review and edit any of your past answers

The flow looks something like this:

Then, you can set the settings you like and share the project with your significant other
Finally, you get to swiping: swipe left if you dislike the name and right if you like it!

Once you swipe left/right, you’ll find the names in the the answer review / mutual matches and can switch your answers if you so please. If you find them in the mutual matches, that may be a good one to discuss with your partner!

The primary things that makes this stand out to previous applications I’ve written using Elasticsearch is that it

  1. Uses rank_feature fields, which are new to Elasticsearch 7 and really make the performance scream
  2. Is built on flask, which I have no experience with until now. I also have virtually no Python experience, so that was maybe even a more extreme difference
  3. I’ve licensed it as AGPL and released it on GitHub

What’s in a name?

The name for this comes from yet another thing we found: apps that tell you the size of your baby as it progresses. For example, “this week, your baby is the size of a grape” or “this week, your baby’s foot is the size of a raspberry.” I came up with the idea for this right around the time we found out that our baby was the size of a lentil. So it’s a lentil namer, and hence the GitHub project name 🙂

What’s next?

We’ve used this (and some old fashioned suggestions) to decide on a number of names that we both like and now we’re down to picking our name. We don’t plan to reveal until after it’s born. Sorry friends & family 🙂

Tracking Nearby Ships with SDR, Elasticsearch, Kibana, and Vega Visualizations

Intro

A year ago, I moved nearer to Golden Gate Bridge and seeing the ships come in made me wonder how they’re tracked.  I had seen websites like this one, and was curious how they worked.  It turns out the key is in the title of the page “AIS Marine Traffic.”  The traffic comes from the Automatic Identification System, or AIS.  Ships broadcast their position on a frequency around 162MHz.  You can go out and buy a direct receiver and decoder of this data, but I recently had renewed interest in software-defined radio and was looking to do a project with it.  If you haven’t heard of it before, it’s super neat!  The idea is this: historically in many cases, somebody working with radio signals has needed to build a specific antenna and a specific modulator/demodulator in hardware.  Take, for example, a traditional car radio that can pick up AM and FM radio signals:

RF Bands

Image source: GSU

As shown here, FM radio is broken into 200kHz bands, starting at 88.1 and going to 108.1 MHz in the United States.  When you “tune into” a radio like “station 98.1,” you’re tuning into 98.1MHz.  You may have noticed that radio stations are increments of 0.2 (98.1, 98.3, 98.5, …) in the US — there are no FM radio stations that end in even numbers because each station gets 200kHz, or 0.2MHz, so you see 98.1+0.2+0.2+0.2+… and the frequencies will thus always be odd.  Also worth noting: 98.1 is technically more like everything from 98.0-98.2 and they’re just talking about the “center frequency” of the transmission.  When you turned old radio knobs, you were generally changing the electro/physical properties of a tuning capacitor, which caused a circuit in your car or home to move the center frequency as you desired.  An electrical engineer that was designing such a radio and circuit needed to know that they were designing an antenna for 88.1-108.1MHz, what the modulation characteristics were of the signal, etc.  They’d design a physical circuit that met those properties and you’d get it in your car or home.  If, instead of FM radio, they were designing for TV, there would be a different set of frequencies and rules they’d design for.  Nowadays, though, you can buy a single integrated circuit that can listen to a huge range of frequencies and tune the bandwidth and demodulation all through a software interface!  What’s really neat is that you can get the whole thing wrapped into a single USB stick that works for Linux, Windows, or Mac for about $10-50, depending on what you buy and where you buy it.  So that’s what I did!

This post is my attempt to recreate a website like marinetraffic.com using a $20 piece of hardware, Elasticsearch, Kibana, and a few hundred lines of code.

Tuning In

I poked around and found a variety of AIS-related projects, including a C project that can directly decode AIS packets from a SDR.  I pulled that down and compiled it to see what it did.  That project publishes the data as a TCP or UDP stream of messages with a very special formatting.  You can also get it to output to the shell.  For example:

./rtl_ais -n
Edge tuning disabled.
DC filter enabled.
RTL AGC disabled.
Internal AIS decoder enabled.
Buffer size: 163.84 mS
Downsample factor: 64
Low pass: 25000 Hz
Output: 48000 Hz
Found 1 device(s):
0: Realtek, RTL2838UHIDIR, SN: 00000001

Using device 0: Generic RTL2832U OEM
Detached kernel driver
Found Rafael Micro R820T tuner
Log NMEA sentences to console ON
AIS data will be sent to 127.0.0.1 port 10110
Tuner gain set to automatic.
Tuned to 162000000 Hz.
Sampling at 1600000 S/s.
!AIVDM,1,1,,A,15N0;wS002G?MS6E`tCs3V3N0<2D,0*22

The first lines are all the hardware initialization.  The last line is a message from a nearby ship!  It’s not super consumable as-is: the message is in AIVDM format, so I need something to decode that.  By way of this example, we can see in the comma-separated output:

  • The !AIVDM header
  • The number of fragments (in this case, 1)
  • Which fragment number this message is (in this case, 1)
  • A sentence ID for multi-sequence messages (in this case, it’s empty — there’s only 1 sentence)
  • The radio channel code (A in this case — rtl_ais listens to both A and B channels)
  • The data payload (15N0;wS002G?MS6E`tCs3V3N0<2D in this case)
  • A checksum, which there are some rules around what the checksum looks like (0*22 in this case)

That data payload thing is probably the most interesting part of the message.  In order to process more complex messages, we’ll need to do something that can parse fragment numbers and combine them together, but we can get started on writing a parser with just this.  According to the spec:

The data payload is an ASCII-encoded bit vector. Each character represents six bits of data. To recover the six bits, subtract 48 from the ASCII character value; if the result is greater than 40 subtract 8. According to [IEC-PAS], the valid ASCII characters for this encoding begin with “0” (64) and end with “w” (87); however, the intermediate range “X” (88) to “\_” (95) is not used.

Fortunately, there’s even an existing python library for doing AIS message decoding.  A quick test with this message shows:

$ python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170124] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ais
>>> from pprint import pprint
>>> pprint(ais.decode('15N0;wS002G?MS6E`tCs3V3N0<2D',0))
{u'cog': 283.0,
u'id': 1L,
u'mmsi': 367004670L,
u'nav_status': 3L,
u'position_accuracy': 0L,
u'raim': False,
u'received_stations': 148L,
u'repeat_indicator': 0L,
u'rot': 0.0,
u'rot_over_range': False,
u'slot_timeout': 3L,
u'sog': 0.20000000298023224,
u'spare': 0L,
u'special_manoeuvre': 0L,
u'sync_state': 0L,
u'timestamp': 47L,
u'true_heading': 193L,
u'x': -122.45146166666666,
u'y': 37.818158333333336}

There are a lot of ways to get that data into a consumable location.  You can run one of a variety of applications that can show you the live output.  I like Elasticsearch, so I wanted to index my data there and I didn’t see any projects that did.  The best way would probably be to modify one of these programs to output to Elasticsearch instead of UDP packets or wherever else they shovel data, but I figured it would be easy to just write a little UDP server that listens to port 10110 (the default rtl_ais delivery port), formats the messages, and ships (get it, ships?) them off.  First, we need to see how rtl_ais sends the data.  I could read the source code, but it’s probably faster for me to write a simple UDP server in Python that listens for a buffer and prints what we get:

import socketserver
import ais
from pprint import pprint

class AISHandler(socketserver.DatagramRequestHandler):
  def handle(self):
    data = self.rfile.readline()
    pprint(data)

if __name__ == "__main__":
  server = socketserver.UDPServer(('', 10110), AISHandler)
  server.serve_forever()

Which we can run and see:

$ python3 rtl_ais_server.py
b'!AIVDM,1,1,,A,15MvK40P1go?WQ`E`eL<Ngwh28Pu,0*4F\r\n'
b'!AIVDM,1,1,,B,15Ngp6PPCpG?bs<E`oN:v8Op0hQp,0*1E\r\n'

So basically, rtl_ais sends the raw text followed by a carriage return and newline.  That’s easy enough.  I just need to keep track of multi-packet sequences, decode them, format them how I want, and send them off to Elasticsearch so we can visualize them in Kibana.  I’ll skip past the boring interim code and give you the final result.  The one tricky bit is keeping track of multi-sentence messages.  There are a few situations that complicate multi-sentence messages:

  • At least in theory, I suppose a message could arrive out of order.  I don’t know if that happens in practice or not: I haven’t seen it yet.  The only explanations I could see why they would is that faulty encoding/decoding software results in non-sequential packets.  Or a message traveling backwards in time (queue spooky ghost ship music).
  • It’s likely that you’ll receive only parts of messages, and that certainly needs to be accounted for.  For example, maybe my antenna hears sentences 1, 2, and 4 of 4 total sentences.
  • It’s possible you’ll receive messages from different AIS radios broadcasted in an interleaved message set.  That is, if you have boats 1 and 2 as B1, B2 and you have sentences 1 and 2 as S1 and S2, it’s possible you’ll receive B1S1,B2S1,B1S2,B2S2.  Because this is rare, some software I’ve seen ignores this.  Not me though!
  • Ships report different data components at different times.  For example, a ship may report its position and heading in one packet and it may report its destination and ETA in a different packet minutes later.  If we want to associate all of the data surrounding a ship, we’ll have to keep track of some kind of “ship” record that will need updating.

A simple solution for the second and third of these is just to assume you have some relatively small queue of messages with their sentence ID attached.  Evict messages from the queue if you get too many in order to keep from a huge queue.  Python has a convenient OrderedDict class that shares some properties of a queue (namely, that you can evict/pop items at will in a FIFO fashion).  For the third, since I’m storing the data in Elasticsearch, I can use doc_as_upsert to merge ship information together into a single ship object.  Altogether, the code looks like this:

import socketserver
import ais
from elasticsearch import Elasticsearch
from datetime import datetime
from collections import OrderedDict
from pprint import pprint

class AISHandler(socketserver.DatagramRequestHandler):
  _messagemap = OrderedDict()
  _es = Elasticsearch()

  def indexDoc(self, doc):
    timestamp = datetime.utcnow()
    dmy = timestamp.strftime('%Y-%m-%d')
    body = doc
    body['curtime'] = timestamp
    print(body)
    if ('x' in body and 'y' in body):
      body['location'] = {'lon': body['x'], 'lat': body['y']}
    self._es.index(index="ais-" + dmy, doc_type="_doc", body=body)
    self._es.update(index='ships-seen',doc_type='_doc',id=body['mmsi'],
                body={ "doc": body, "doc_as_upsert": True })

  def handle(self):
    data = self.rfile.readline()
    while data:
      data = data.strip().decode("utf-8")
      ais_arr = data.split(',')
      num_fragments = int(ais_arr[1])
      fragment_number = int(ais_arr[2])
      sentence_id = ais_arr[3]
      channel = ais_arr[4]
      data_payload = ais_arr[5]
      if num_fragments == 1:
        # this is a single-fragment payload, so we can decode it straight away
        decoded_message = ais.decode(data_payload,0)
        self.indexDoc(doc=decoded_message)
      elif fragment_number < num_fragments:
        # this is a multi-fragment payload and we haven't yet received them all.  Add this to any existing sentences
        if sentence_id in self._messagemap:
          self._messagemap[sentence_id] = self._messagemap[sentence_id] + data_payload
        else:
          # if we have too many items in queue, we'll force one out
          if (len(self._messagemap.keys()) > 100):
            self._messagemap.popitem(last=True)
          self._messagemap[sentence_id] = data_payload
      else:
        # we've hit the end of the multi-fragment data
        self._messagemap[sentence_id] = self._messagemap[sentence_id] + data_payload
        decoded_message = ais.decode(self._messagemap[sentence_id], num_fragments)
        self._messagemap.pop(sentence_id, None)
        self.indexDoc(doc=decoded_message)
      data = self.rfile.readline()


if __name__ == "__main__":
  server = socketserver.UDPServer(('', 10110), AISHandler)
  server.serve_forever()

Before we run this program, it does require first setting up our Elasticsearch index templates.  I don’t know a lot about all of the message types that the AIS system could send, but fortunately I didn’t need to.  All of the fields I’ve seen are interpreted correctly via Elasticsearch’s dynamic mapping other than the location (geo_point needs to be manually defined), but we should probably define at least 3 key ones just in case:

PUT /_template/ais
{
  "index_patterns": ["ais-*", "ships-*"],
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "location": {
          "type": "geo_point"
        },
        "curtime": {
          "type": "date"
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}
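
If you’d rather apply that template from Python (the server above already imports the Elasticsearch client), a minimal sketch looks something like this, assuming the same Elasticsearch-era cluster and client that accept the legacy put_template API and the _doc mapping type:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Same template as the PUT /_template/ais call above, applied via the Python client.
ais_template = {
    "index_patterns": ["ais-*", "ships-*"],
    "mappings": {
        "_doc": {
            "properties": {
                "name": {"type": "text"},
                "location": {"type": "geo_point"},
                "curtime": {"type": "date"}
            }
        }
    },
    "settings": {"number_of_shards": 1}
}
es.indices.put_template(name="ais", body=ais_template)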

Once I start up rtl_ais_server.py I see some documents flow into Elasticsearch:

{
  "_index": "ais-2018-08-17",
  "_type": "_doc",
  "_id": "1-oPRmUBdPKlebHLixCO",
  "_version": 1,
  "_score": 2,
  "_source": {
    "timestamp": 61,
    "raim": false,
    "assigned_mode": false,
    "aton_status": 0,
    "y": 37.80078833333334,
    "off_pos": false,
    "fix_type": 7,
    "x": -122.375045,
    "position_accuracy": 0,
    "dim_d": 0,
    "id": 21,
    "aton_type": 19,
    "spare": 0,
    "dim_b": 0,
    "location": {
      "lon": -122.375045,
      "lat": 37.80078833333334
    },
    "repeat_indicator": 0,
    "curtime": "2018-08-17T04:05:47.615296",
    "virtual_aton": true,
    "mmsi": 993692027,
    "dim_a": 0,
    "dim_c": 0,
    "name": "SF OAK BAY BR VAIS D@"
  },
  "fields": {
    "curtime": [
      "2018-08-17T04:05:47.615Z"
    ]
  }
}

The last step I wanted to do here is to build a visualization.  There are a few ways to do this, but what I thought would be interesting was to try my hand at a Vega visualization, which was released in version 6.2 of Kibana.  Coming into Vega cold (clearly I’m really not in tune with front-end frameworks these days!), a few things became clear to me after working on this:

  • The Vega declaration syntax is definitely not an easy thing to get the hang of!
  • Vega is super powerful!
  • Vega is super difficult to debug if you don’t understand the structure of how visualizations are executed.
  • There seem to be a variety of limitations.  For example, it appears that most of the marks it has are not rotatable, which I ran into (so I just chose a UTF-8 character that I could rotate).  I’ve also had some difficulties with things like changing the cursor on hover, and I’m not sure what it actually takes to have non-system tooltips.  I think you can probably attach the visibility of some text to another mark, but I’ve spent almost no time trying to figure that out yet.

Anyway, after a lot of fiddling, the following works.  It seems that text is the only rotatable “mark” right now, so I had to choose a character that kind of looks like I imagined a ship on a map to look.

{
   "$schema": "https://vega.github.io/schema/vega/v3.0.json",
   "config": {
     "kibana": {
       "type": "map",
       "latitude": 37.77,
       "longitude": -122.45,
       "zoom": 12,
       "mapStyle": "default",
       "minZoom": 5,
       "maxZoom": 17,
       "zoomControl": true,
       "delayRepaint": true,
       "controlsDirection": "horizontal"
     }
   },
   "data": [
     {
       "name": "ships",
       "url": {
         "index": "ships-*",
         "body": {
           "query": {
            "bool": {
              "must": [
                { "exists": { "field": "location" } },
                { "exists": { "field": "mmsi" } },
                { "range": {"curtime": {"%timefilter%": true}} }
              ]
            }
           },
           "size": 10000
         }
        },
        "format": { "type": "json", "property": "hits.hits" },
       "transform": [
          {
            "type": "geopoint",
            "projection": "projection",
            "fields": [
              {"expr": "datum._source.location.lon"},
              {"expr": "datum._source.location.lat"}
            ],
            "as": ["x", "y"]
          },
          {
            "type": "formula",
            "expr": "if (datum._source.dim_a != null && datum._source.dim_b != null && datum._source.dim_c != null, log(1 + datum._source.dim_a + datum._source.dim_b + datum._source.dim_c) * 10, 100)"
            "as": "shipsize"
          },
          {
            "type": "formula",
            "expr": "datum._source.mmsi % 20"
            "as": "shipcolor"
          },
          {
            "type": "formula",
            "expr": "if (datum._source.name != null, datum._source.name, 'Unnamed')",
            "as": "shipname"
          },
          {
            "type": "formula",
            "expr": "if (datum._source.true_heading != null, datum._source.true_heading, 0)",
            "as": "heading"
          }
        ]
      }
    ],
    "scales": [
    {
      "name": "shipcolorscale",
      "type": "ordinal",
      "domain": {"data": "ships", "field": "shipcolor"},
      "range": { "scheme": "category20" }
    }
  ],
    "marks": [
      {
          "type": "text",
          "from": {"data": "ships"},
          "encode": {
            "update": {
              "x": {"signal": "datum.x"},
              "y": {"signal": "datum.y"},
              "text": { "value": "⏏" },
              "fontSize": { "signal": "datum.shipsize" },
              "fill": { "scale": "shipcolorscale", "field": "shipcolor" },
              "align": { "value": "center" },
              "fontWeight": { "value": "bold" },
              "angle": { "signal": "datum.heading" },
              "tooltip": { "signal": "datum.shipname" },
            }
            "hover": {
              "fill": {"value": "red"},
              "tooltip": { "signal": "datum.shipname" }
            }
          }
      }
    ]
}

And it looks pretty good!  On hovering, I can see the S.S. Jeremiah O’Brien is safely docked in the bay.

And I can also see that one of our pilot boats is heading out to sea to guide a ship in.

A couple notes about the visualization:

  • I’ve sized the icons by the size of the ship.  Big ship = big icon.
  • The colors of the ships are a hash of the ship’s ID (MMSI) modulo 20, and clearly there are some collisions in such a small space.  I couldn’t find an easy way to derive a deterministic color from a ship’s MMSI in a larger space than 20; I think I’m just missing something, though.  (See the sketch after this list for one idea.)
  • This uses the ships-seen index, which only keeps the most recent position/heading of the ship due to the upsert.
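
As a sketch of one way around the modulo-20 collisions mentioned above: you could hash the MMSI into a hex color on the way into Elasticsearch (e.g. in indexDoc) and have Vega use that field directly.  This is just an illustration of the idea, not something the dashboard above does:

import hashlib

def mmsi_to_color(mmsi):
    """Deterministically map an MMSI to a hex color by hashing it.

    Unlike mmsi % 20, this spreads ships across the full RGB space,
    so collisions are far less likely.
    """
    digest = hashlib.md5(str(mmsi).encode("utf-8")).hexdigest()
    return "#" + digest[:6]  # first 3 bytes of the hash as an RGB color

print(mmsi_to_color(993692027))  # the MMSI from the example document above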

A visualization of ais-* shows us all of the events of the incoming/outgoing vessels, which is also interesting, because it allows you to see the sort of trail of the boats.

I’m interested in going further, both in the direction of other types of digital data radio signals and in entirely different types of signals.

That about wraps it up.  You can grab all the code for this project on GitHub.  Sea you next time!

Combining Tasker with Elasticsearch and Kibana for Personal Dashboards

Tasker is probably my favorite Android app.  It gives you total control over your device.  Not to mention there’s a great community that develops all sorts of interesting tasks/integrations.

I was recently on a 10+ hour flight where the entertainment system wasn’t working and my laptop was inaccessible, so I was kind of bored.  But I had my phone/Tasker, and I got the idea to write a little IoT type of integration while I was on the plane.  Basically, I wanted to ship the sensor data from my phone off to an Elasticsearch cluster.  With this, I could:

  • Keep track of where I’ve been, including my run paths and any hikes
  • Track and visualize my pedometer over days
  • Remember what that song was I was listening to on Play Music/Pandora/Spotify/etc but forgot the name to
  • See graphs of my battery usage over days so I can see how the profile changes as I add/update/remove apps, as I tether, and as my phone ages
  • Not be bored for a while on the plane

Building Kibana Dashboards

Before I walk through how I created this and give you the downloadable task, I just want to show these dashboards, which I really liked:

Here, I’ve plotted the current battery % on the left, and using a derivative pipeline aggregation (available in Elasticsearch and Kibana), we can see how much the device charged or how much it drew down.  We can use this to plot the charge speed of different chargers or to see how much power individual apps use.  Because I also have things like the screen brightness and network information in the same index, I can plot them against one another to see how much screen-on time/brightness drains the battery over a multi-day cycle.
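
For reference, here’s a rough sketch of that kind of derivative query against the tasker-* indices created by the task later in this post, using the Python Elasticsearch client (the battery and timestamp field names come from the task; the interval syntax may differ by Elasticsearch version):

from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "timestamp", "interval": "1h"},
            "aggs": {
                "avg_battery": {"avg": {"field": "battery"}},
                # derivative pipeline agg: change in battery % vs. the previous hour
                "battery_delta": {"derivative": {"buckets_path": "avg_battery"}}
            }
        }
    }
}
resp = es.search(index="tasker-*", body=body)
for bucket in resp["aggregations"]["per_hour"]["buckets"]:
    delta = bucket.get("battery_delta", {}).get("value")
    print(bucket["key_as_string"], bucket["avg_battery"]["value"], delta)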

I can plot my runs and corresponding step counts:

Or where I’ve been the last 7 days:

Or even the last 90 days:

I was hoping to get a signal strength map of my cellphone signal, but it looks like Tasker may always be returning “0” even when I have full strength.  I may have set that up wrong, so if you have info on how to get it to return non-zero values, let me know!

Anyway, on to the code.

Tasker Javascript Tasks: The Framework

Tasker has Javascript tasks, which are quite interesting: you can read Tasker variables and you have access to things like XMLHttpRequest.  That makes it pretty easy to get the Tasker variables and then send them across to a RESTful engine like Elasticsearch.

A few hundred miles across the Atlantic later, I had something like this:

var esServer = par[0];
if (esServer.substr(esServer.length - 1) !== "/") {
	esServer = esServer + "/";
}
setLocal('%esserver', esServer);
setLocal('%authinfo', par[1]);
var indexPrefix = "tasker";

var d = new Date();
var dateStr = d.toISOString().slice(0,10);

var intVars = { battery: '%BATT', cell_signal_strength: '%CELLSIG',
									  display_brightness: '%BRIGHT', light_level: '%LIGHT',
										uptime: '%UPS', free_memory: '%MEMF', pedometer: '%STEPSTAKEN'
									};
var doubleVars = { altitude: '%LOCALT', magnetic_strength: '%MFIELD', temperature: '%TEMP' };
var booleanVars = { bluetooth_on: '%BLUE', locked: '%KEYG', muted: '%MUTED', speakerphone: '%SPHONE',
									  wifi_enabled: '%WIFI', wimax_enabled: '%WIMAX', screen_on: '%SCREEN', roaming: '%ROAM',
airplane_mode: '%AIR'
									};
var keywordVars = { bluetooth_on: '%BLUE', cell_network: '%TNET', device: '%DEVID', device_id: '%DEVTID' };

postData = { timestamp: d.toISOString() };
template = { timestamp: { type: 'date' } };

for (var i in keywordVars) {
  var res = global(keywordVars[i]);
  if (typeof(res) !== 'undefined') {
  	postData[i] = res;
  }
  template[i] = { type: 'keyword' };
}

for (var i in booleanVars) {
  var res = global(booleanVars[i]);
  if (typeof(res) !== 'undefined' && res !== '') {
  	postData[i] = (res === 'true' || res === 'on');
  }
  template[i] = { type: 'boolean' };
}

for (var i in intVars) {
  var res = global(intVars[i]);
  if (typeof(res) !== 'undefined' && res !== '') {
  	if (res >= 0) {
    	postData[i] = parseInt(res);
    }
  }
  template[i] = { type: 'long' };
}

for (var i in doubleVars) {
  var res = global(doubleVars[i]);
  if (typeof(res) !== 'undefined' && res !== '') {
  	if (res !== 0) {
    	postData[i] = parseFloat(res);
    }
  }
  template[i] = { type: 'double' };
}

template['location'] = { type: 'geo_point' };
if (getLocation('any', true, 30)) {
  var loc = global('%LOC');
	var latlon = loc.split(',');
	postData['location'] = { 'lat': parseFloat(latlon[0]), 'lon': parseFloat(latlon[1]) };
}

template['song'] = { type: 'text' };
template['music_playing'] = { type: 'boolean' };
if (global('%ARTIST') !== '' && global('%TRACK') !== ''
   ) {
	  postData['music_playing'] = (global('%ISPLAYING') === 'true' || global('%ISPLAYING') === true);
	  if (postData['music_playing'] === true) {
		  postData['song'] = global('%ARTIST') + ' - ' + global('%TRACK');
		}
} else {
	if (global('%ISPLAYING') === 'false' || global('%ISPLAYING') === false) {
	  postData['music_playing'] = false;
	}
}

var wifii = global('%WIFII');
if (typeof(wifii) !== 'undefined' && wifii !== '') {
	var wifiarr = wifii.split("\n");
	if (wifiarr[0] === '>>> CONNECTION <<<') {
		postData['wifi_name'] = (wifiarr[2]).slice(1,-1);
	}
	template['wifi_name'] = { type: 'keyword' };
}

var jsondocstring = JSON.stringify(postData);
var indexName = indexPrefix + '-' + dateStr;
indexType = 'doc';
setLocal('%jsondocbulkheader', JSON.stringify({ "_index": indexName, "_type": indexType}));
setLocal('%jsondocstring',jsondocstring);

var xhrTemplate = new XMLHttpRequest();
xhrTemplate.open("PUT", esServer + "_template/" + indexPrefix, false);
xhrTemplate.setRequestHeader("Content-type", "application/json");
if (typeof(par[1]) !== 'undefined') {
	xhrTemplate.setRequestHeader("Authorization", "Basic " + btoa(par[1]));
}
var templateString = JSON.stringify({ template: indexPrefix + '-*', mappings: { doc: { properties: template } } });

try {
	xhrTemplate.send(templateString);
} catch (e) { }
try {
	var xhrDoc = new XMLHttpRequest();
	xhrDoc.open("POST", esServer + indexName + '/' + indexType, false);
	xhrDoc.setRequestHeader("Content-type", "application/json");
	if (typeof(par[1]) !== 'undefined') {
		xhrDoc.setRequestHeader("Authorization", "Basic " + btoa(par[1]));
	}

	xhrDoc.send(jsondocstring);
	setLocal('%sentdoc','1');
} catch (e) { }
exit();

Let’s walk through what this does.

In the beginning:

var esServer = par[0];
if (esServer.substr(esServer.length - 1) !== "/") {
	esServer = esServer + "/";
}
setLocal('%esserver', esServer);
setLocal('%authinfo', par[1]);
var indexPrefix = "tasker";

var d = new Date();
var dateStr = d.toISOString().slice(0,10);

Here, we set up an index prefix (“tasker”) and we’ll get 2 parameters passed in: the Elasticsearch server host/port (in par[0], which goes into %esserver), and any authentication in the form of username:password (in par[1], which goes into %authinfo).  So, as a reminder: when we add this Javascript task, we’ll need to pass those bits in.

Next:

var intVars = { battery: '%BATT', cell_signal_strength: '%CELLSIG',
									  display_brightness: '%BRIGHT', light_level: '%LIGHT',
										uptime: '%UPS', free_memory: '%MEMF', pedometer: '%STEPSTAKEN'
									};
var doubleVars = { altitude: '%LOCALT', magnetic_strength: '%MFIELD', temperature: '%TEMP' };
var booleanVars = { bluetooth_on: '%BLUE', locked: '%KEYG', muted: '%MUTED', speakerphone: '%SPHONE',
									  wifi_enabled: '%WIFI', wimax_enabled: '%WIMAX', screen_on: '%SCREEN', roaming: '%ROAM',
airplane_mode: '%AIR'
									};
var keywordVars = { bluetooth_on: '%BLUE', cell_network: '%TNET', device: '%DEVID', device_id: '%DEVTID' };

Here, we gather the hardware/software variables that Tasker defines into groups by type.  For example, we put %TEMP into a JSON object with the other double variables.  It’s important that we do this because we need to know what these types are so we can make some assumptions about how we process each of the variables.  For example, looking at Tasker’s variables page, we can see that the variable representing whether the device is in Airplane mode (%AIR) returns “on” or “off”, while we really want to process it as a boolean (true or false).  Many numbers in Tasker return -1 if the value is not defined or cannot be determined, whereas we really want that to be an empty value in storage.  So we store our expected types in a JSON object to check later.

The next set of for loops does 2 things:

for (var i in keywordVars) {
  var res = global(keywordVars[i]);
  if (typeof(res) !== 'undefined') {
  	postData[i] = res;
  }
  template[i] = { type: 'keyword' };
}

for (var i in booleanVars) {
  var res = global(booleanVars[i]);
  if (typeof(res) !== 'undefined' && res !== '') {
  	postData[i] = (res === 'true' || res === 'on');
  }
  template[i] = { type: 'boolean' };
}

for (var i in intVars) {
  var res = global(intVars[i]);
  if (typeof(res) !== 'undefined' && res !== '') {
  	if (res >= 0) {
    	postData[i] = parseInt(res);
    }
  }
  template[i] = { type: 'long' };
}

for (var i in doubleVars) {
  var res = global(doubleVars[i]);
  if (typeof(res) !== 'undefined' && res !== '') {
  	if (res !== 0) {
    	postData[i] = parseFloat(res);
    }
  }
  template[i] = { type: 'double' };
}

First, they try to convert, parse, or drop each value based upon the aforementioned logic.  Second, they record each field’s expected Elasticsearch type in a template object we’ll use later.

Next, we have a bit of custom logic for parsing the current location (which comes in a single string of the form “Lat,Lon”), the currently playing song (which comes through Media Utilities), and the WiFi connection info to get the current network name (the %WIFII variable has a rather bizarre format: the 3rd line contains the network name if the first line is “>>> CONNECTION <<<“):

template['location'] = { type: 'geo_point' };
if (getLocation('any', true, 30)) {
  var loc = global('%LOC');
	var latlon = loc.split(',');
	postData['location'] = { 'lat': parseFloat(latlon[0]), 'lon': parseFloat(latlon[1]) };
}

template['song'] = { type: 'text' };
template['music_playing'] = { type: 'boolean' };
if (global('%ARTIST') !== '' && global('%TRACK') !== ''
   ) {
	  postData['music_playing'] = (global('%ISPLAYING') === 'true' || global('%ISPLAYING') === true);
	  if (postData['music_playing'] === true) {
		  postData['song'] = global('%ARTIST') + ' - ' + global('%TRACK');
		}
} else {
	if (global('%ISPLAYING') === 'false' || global('%ISPLAYING') === false) {
	  postData['music_playing'] = false;
	}
}

var wifii = global('%WIFII');
if (typeof(wifii) !== 'undefined' && wifii !== '') {
	var wifiarr = wifii.split("\n");
	if (wifiarr[0] === '>>> CONNECTION <<<') {
		postData['wifi_name'] = (wifiarr[2]).slice(1,-1);
	}
	template['wifi_name'] = { type: 'keyword' };
}

Finally:

var jsondocstring = JSON.stringify(postData);
var indexName = indexPrefix + '-' + dateStr;
indexType = 'doc';
setLocal('%jsondocbulkheader', JSON.stringify({ "_index": indexName, "_type": indexType}));
setLocal('%jsondocstring',jsondocstring);

var xhrTemplate = new XMLHttpRequest();
xhrTemplate.open("PUT", esServer + "_template/" + indexPrefix, false);
xhrTemplate.setRequestHeader("Content-type", "application/json");
if (typeof(par[1]) !== 'undefined') {
	xhrTemplate.setRequestHeader("Authorization", "Basic " + btoa(par[1]));
}
var templateString = JSON.stringify({ template: indexPrefix + '-*', mappings: { doc: { properties: template } } });

try {
	xhrTemplate.send(templateString);
} catch (e) { }
try {
	var xhrDoc = new XMLHttpRequest();
	xhrDoc.open("POST", esServer + indexName + '/' + indexType, false);
	xhrDoc.setRequestHeader("Content-type", "application/json");
	if (typeof(par[1]) !== 'undefined') {
		xhrDoc.setRequestHeader("Authorization", "Basic " + btoa(par[1]));
	}

	xhrDoc.send(jsondocstring);
	setLocal('%sentdoc','1');
} catch (e) { }
exit();

In this block, we turn the document (and the index template!) into strings via JSON.stringify() and upload them both to Elasticsearch.  Tasker seemed to have some issues with the asynchronous nature of XMLHttpRequest (though it’s possible I was doing something wrong), so I converted the calls to synchronous ones.  At the very end, we set “%sentdoc” to 1 to indicate to the rest of the task that we’ve successfully sent the document.  If this isn’t set to 1, we know we didn’t have Internet access or hit some other barrier to uploading the metrics.  In a subsequent action in the task, we’ll write out the bulk header and document to a file on disk so we can upload all the data we missed later.

Putting It Into A Project

This project has 4 profiles:

  1. Every 10 steps, run a task that sets %STEPSTAKEN to 10+%STEPSTAKEN
  2. Every night at midnight, run a task that sets %STEPSTAKEN to 0
  3. Every 5 minutes, run the “Update Info” task
  4. Every time Media Utilities has a new track, run the “Update Info” task

That brings me to the “Update Info” task.  It simply grabs the current values of %ISPLAYING, %TRACK, and %ARTIST, and then runs the “Send to Elasticsearch” task.

The “Send to Elasticsearch” task is a bit more complicated.  It:

  • Sets %sentdoc to 0 (we’ll use this later!)
  • Runs the Javascriptlet above
  • If %sentdoc isn’t set to 1 at the end of the Javascript, we write out %jsondocbulkheader and then %jsondocstring to a file Tasker/tasks/taskQueue.json
  • Else, if %WIFII contains >>> CONNECTION <<<, this means we’re on a wifi network and we can bulk upload the taskQueue contents.  So we read that into %jsonqueue and send it up with another (much shorter) Javascriptlet (and then delete the taskQueue.json file so we don’t resend it):
var jsondocstring = jsonqueue;

var xhrDoc = new XMLHttpRequest();
xhrDoc.open("POST", esServer + '_bulk', false);
xhrDoc.setRequestHeader("Content-type", "application/json");
if (typeof(authinfo) !== 'undefined') {
	xhrDoc.setRequestHeader("Authorization", "Basic " + btoa(authinfo));
}

xhrDoc.send(jsondocstring);

setLocal('%jsonqueue','✓');
exit();

And that’s it!

Putting It All Together

I’ve uploaded the entire project here.  You can download it directly and import the project, or just look over the output to see how it’s tied together and make your own.  It does need the %ELASTICSEARCH and %ELASTICSEARCH_AUTH variables to be set, and you’ll see it needs Media Utilities to be installed as well.  I had fun making this and I’m hoping to add more to it in the future!

Connecting Domoticz and Elasticsearch

I use Domoticz for my home automation needs.  I’ll just start off by saying it has its pros and cons.

Pros:

  • It seems to have pretty wide device support
  • It allows for a completely offline home-automation setup (for those of us that don’t want our every home action to go through a 3rd party server, some of which have proven…problematic)
  • It has an API (though the output format is…not so great — more on that later)
  • You can create your own virtual devices (e.g. I use this to create a virtual device for when my phone connects to my home router)
  • It can run natively on a RaspberryPi (or a BananaPi or BananaPro if you fork the project and make some minor modifications).  Together with a RazBerry, this makes for a great self-contained unit.
  • It’s pretty scriptable.  They’ve embedded Blockly and both a Lua and Python API

Cons:

  • The default UI is pretty ugly
  • Not a lot of great mobile support
  • Integration with 3rd parties is limited.  For example, the community has made an InfluxDB/Grafana integration, but there doesn’t seem to be any Elasticsearch integration at this time
  • The API is very string-y: many components return a string like “3.8A” instead of value: 3.8 + unit: “A”.  This makes many things a bit challenging to parse.

This post is about trying to address a few of the cons by uploading data to Elasticsearch and using Kibana for visualizations.

Understanding the Domoticz API

First, it’s important to note the output of the Domoticz API.  It’s largely documented here, but for this demo, we’re going to focus on the type=devices component of the API.

Let’s have a look at the result:

{
   "ActTime" : 1510530122,
   "ServerTime" : "2017-11-12 15:42:02",
   "Sunrise" : "06:48",
   "Sunset" : "16:59",
   "result" : [
      {
         "AddjMulti" : 1.0,
         "AddjMulti2" : 1.0,
         "AddjValue" : 0.0,
         "AddjValue2" : 0.0,
         "BatteryLevel" : 100,
         "CustomImage" : 0,
         "Data" : "74.3 F, 41 %",
         "Description" : "",
         "DewPoint" : "49.10",
         "Favorite" : 0,
         "HardwareID" : 6,
         "HardwareName" : "RaZberry",
         "HardwareType" : "OpenZWave USB",
         "HardwareTypeVal" : 21,
         "HaveTimeout" : false,
         "Humidity" : 41,
         "HumidityStatus" : "Comfortable",
         "ID" : "0201",
         "LastUpdate" : "2017-11-12 15:41:28",
         "Name" : "Office Temperature",
         "Notifications" : "false",
         "PlanID" : "0",
         "PlanIDs" : [ 0 ],
         "Protected" : false,
         "ShowNotifications" : true,
         "SignalLevel" : "-",
         "SubType" : "WTGR800",
         "Temp" : 74.299999999999997,
         "Timers" : "false",
         "Type" : "Temp + Humidity",
         "TypeImg" : "temperature",
         "Unit" : 0,
         "Used" : 1,
         "XOffset" : "0",
         "YOffset" : "0",
         "idx" : "65"
      }
   ]
}

This gives us the basis for what we’ll need to do!

A Few Bits About the API

There are a few things that are a bit of pain with the API that you can see from the above:

  1. Domoticz treats everything as Celsius even if the device reports in Fahrenheit.  As a result, there can be multiple conversions back and forth, which can result in floating-point rounding troubles: as you can see above, the API returns 74.29999… instead of the 74.3 the device had reported.  I haven’t delved into every API or component return, but I assume this applies to other metric/imperial conversions as well, not just Celsius/Fahrenheit.
  2. The devices tend to return multiple values in the same device record, and those may have varying types.  For example, notice the “Data” above is “74.3 F, 41 %”, which is Temperature + Humidity combined in a single string, while we also get a separate “Humidity” value (integer), a separate “Temp” value (floating point), and a separate “DewPoint” value (floating point, but JSON-typed as a string and not in the Data block).  Other values you’d expect to be numbers (“SignalLevel”) may return strings (e.g. “-”) if no value is provided, “true”/“false” may come back as strings (as you can see in “Timers”), etc.
  3. A continuation of #2: if every type of device returned all of its values in a separate JSON object that was at least standardized against itself, this would just be mildly annoying, but alas that’s not the case.  For example, energy monitoring devices only seem to return their data in the “Data” string (no “Amperage” or “Wattage” attribute).

Also, Domoticz doesn’t seem to have any internal scheduled-scripting (that I’ve found) with any frequency higher than once every minute, so we have to poll this API to get the data off the device instead of using those lovely internals.  So this is what we have to work around.  Let’s get started!

Using Elasticsearch’s Ingest Nodes

Elasticsearch has Ingest Nodes, which allow us to transform this type of data directly in the Elasticsearch cluster without setting up any external dependencies.  First, we’ll need to create an ingest pipeline:

I set up the following:

curl -XPUT "https://MY_ES_USER:MY_ES_PASSWORD@MY_ES_HOST:MY_ES_PORT/_ingest/pipeline/domoticz" -H 'Content-Type: application/json' -d'
{
  "description": "Converts domoticz API to ES",
  "processors": [
    {
      "date": {
        "field": "LastUpdate",
        "target_field": "timestamp",
        "formats": [
          "yyyy-MM-dd HH:mm:ss"
        ],
        "timezone": "America/Los_Angeles"
      }
    },
    {
      "remove": {
        "field": "LastUpdate"
      }
    },
    {
      "set": {
        "field": "IngestTime",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "rename": {
        "field": "Data",
        "target_field": "DataString"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.Type == \"Current\") { ctx.Current = ctx.DataString.replace(\" A\",\"\") } else if (ctx.Type == \"General\") { if (ctx.Usage != null) { ctx.Wattage = ctx.Usage.replace(\" Watt\",\"\"); } else if (ctx.CounterToday != null) { ctx.EnergyToday = ctx.CounterToday.replace(\" kWh\",\"\"); } else if (ctx.DataString.endsWith(\" SNR\")) { ctx.SNR = ctx.DataString.replace(\" SNR\",\"\"); } } else if  (ctx.Type == \"Lux\") { ctx.Lux = ctx.DataString.replace(\" Lux\",\"\") }"
      }
    },
    {
      "date_index_name": {
        "field": "IngestTime",
        "index_name_format": "yyyy-MM-dd",
        "index_name_prefix": "devices-",
        "date_rounding": "d"
      }
    }
  ]
}'

This does a few things: it parses the “LastUpdate” field into a “timestamp” field, sets an “IngestTime” field to the time the document was received, adds “Current”, “Wattage”, “EnergyToday”, “Lux”, and “SNR” fields based on the “Type” and the Data field string, and finally routes each document into a date-based index.
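
Before wiring up the polling script, you can sanity-check the pipeline with the ingest simulate API.  Here’s a sketch using the Python client (the credential placeholders match the curl commands here, and the sample document is a made-up energy-monitor reading):

from elasticsearch import Elasticsearch

es = Elasticsearch("https://MY_ES_HOST:MY_ES_PORT",
                   http_auth=("MY_ES_USER", "MY_ES_PASSWORD"))

# A fabricated "General"-type device record, shaped like the API output above.
sample_doc = {
    "Type": "General",
    "Usage": "250 Watt",
    "Data": "250 Watt",
    "LastUpdate": "2017-11-12 15:41:28"
}

result = es.ingest.simulate(id="domoticz", body={"docs": [{"_source": sample_doc}]})
# Expect a Wattage field, a timestamp field, and DataString in place of Data.
print(result["docs"][0]["doc"]["_source"])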

Great.  Now we have an ingest pipeline that can process the results of each of the devices.  This will place the data into an index named devices-<date>.  We’ll also want an index template to store the data to:

curl -XPUT "https://MY_ES_USER:MY_ES_PASSWORD@MY_ES_HOST:MY_ES_PORT/_template/domoticz" -H 'Content-Type: application/json' -d'
{
  "domoticz": {
    "order": 0,
    "template": "devices-*",
    "settings": {
      "index": {
        "number_of_shards": "1"
      }
    },
    "mappings": {
      "devices": {
        "properties": {
          "BatteryLevel": {
            "type": "scaled_float",
            "scaling_factor": 100
          },
          "DataString": {
            "type": "text"
          },
          "HardwareName": {
            "type": "keyword"
          },
          "IngestTime": {
            "type": "date"
          },
          "Level": {
            "type": "double"
          },
          "LevelInt": {
            "type": "scaled_float",
            "scaling_factor": 100
          },
          "Name": {
            "type": "text"
          },
          "Status": {
            "type": "keyword"
          },
          "Type": {
            "type": "keyword"
          },
          "Current": {
            "type": "double"
          },
          "DewPoint": {
            "type": "double"
          },
          "Gust": {
            "type": "double"
          },
          "EnergyToday": {
            "type": "double"
          },
          "Rain": {
            "type": "double"
          },
          "RainRate": {
            "type": "double"
          },
          "Speed": {
            "type": "double"
          },
          "Chill": {
            "type": "double"
          },
          "UVI": {
            "type": "double"
          },
          "Direction": {
            "type": "double"
          },
          "Lux": {
            "type": "integer"
          },
          "Wattage": {
            "type": "double"
          },
          "CounterToday": {
            "type": "text"
          },
          "HumidityStatus": {
            "type": "text"
          },
          "Usage": {
            "type": "text"
          },
          "idx": {
            "type": "integer"
          },
          "timestamp": {
            "type": "date"
          }
        }
      }
    }
}'

Most of this template is not strictly necessary, as Elasticsearch would handle the data correctly for most if not all of these fields, but it sets up efficient mappings and makes sure we don’t have mistakes in our data model.

Uploading the API to Elasticsearch

We need to push the data to Elasticsearch at some regular interval.  I’ve used jq to do some lightweight filtering and reformatting of the data and then ultimately push it up to Elasticsearch.  We can add something like the following to the crontab to initiate it on boot.

#!/bin/bash

while true
  do curl "http://127.0.0.1:8080/json.htm?type=devices&filter=all&used=true&order=Name" | jq -c '.result[] | with_entries(select( .key as $k | $keys | index($k)))' --arg keys '["idx","HardwareName","Name","BatteryLevel","Gust","Direction","Chill","Speed","Temp","UVI","Rain","RainRate","Visibility","Radiation","Level","LevelInt","Status","Humidity","DewPoint","HumidityStatus","CounterToday","Usage","Voltage","Type","Data","LastUpdate"]' | sed -e 's/^/{ "index" : { "_index" : "devices", "_type" : "devices" } }\
/' > /tmp/devices_bulk.json && curl -H 'Content-type: application/json' -XPOST 'https://MY_ES_USER:MY_ES_PASSWORD@MY_ES_HOST:MY_ES_PORT/_bulk?pipeline=domoticz' --data-binary @/tmp/devices_bulk.json > /dev/null && sleep 5
done

This script does a few things: it selects the relevant pieces of information I want indexed (Temperature, HardwareName, etc.) from the response, formats them into an Elasticsearch _bulk index request, and then POSTs it.

In my case, I’ve used an Elastic Cloud instance to push the data to.  This allows me to externalize the data in a secure fashion and make it available no matter where I’m at, but only to me.  It also lets me fire off alerts, e.g. if one of my security devices goes off and I’m not home.

Drop this bash script into /opt/domoticz/scripts/ (where I’ve installed Domoticz) and in crontab, add:

@reboot bash /opt/domoticz/scripts/uploadES.sh

Setting Up Kibana

The two biggest quirks for setting up Kibana with this data are

  1. Domoticz reports the time as the last time the device had an update, but for most dashboards what you really want is to see the state of all devices at a given point in time.  This means we really want the ingest time, not the device update time (hence adding that to the ingest template).
  2. Using the device IDs for documents with similar types: for example, showing all motion sensors or door sensors on the same graph.  There are a few approaches for this, but I used a sub aggregation to split the charts by filtering for specific device IDs (“idx” in the data).  This looks like the following in Kibana:
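
For reference, that idx-based split boils down to a filters sub aggregation in query DSL.  Here’s a rough sketch via the Python client: the idx values are placeholders for whichever devices you want on a single chart (65 is the office temperature sensor from the example above), and the interval syntax may vary by version:

from elasticsearch import Elasticsearch

es = Elasticsearch("https://MY_ES_HOST:MY_ES_PORT",
                   http_auth=("MY_ES_USER", "MY_ES_PASSWORD"))

body = {
    "size": 0,
    "aggs": {
        "per_device": {
            # one bucket per device of interest, keyed by its Domoticz idx
            "filters": {
                "filters": {
                    "office_temperature": {"term": {"idx": 65}},
                    "outside_temperature": {"term": {"idx": 66}}  # placeholder idx
                }
            },
            "aggs": {
                "over_time": {
                    "date_histogram": {"field": "IngestTime", "interval": "5m"},
                    "aggs": {"avg_temp": {"avg": {"field": "Temp"}}}
                }
            }
        }
    }
}
resp = es.search(index="devices-*", body=body)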

Example Dashboard

Once we tie it all together, we get quite nice graphs!

This makes for a nice dashboard!  We can see the inside and outside temperatures, when there was motion in various spaces (in red), when doors were opened (in red), and when devices were turned on and off.  You can even see in the “office” and “living room” temperature charts when the heat kicked on in my apartment.  We can fire off alerts for security (e.g. “front door is opened but nobody’s home”) or for other conditions we care about.

Future Work

You’ll likely see some of the following in future posts or as future work from me:

  • The bottom right graph hides additional detail of something I’ve built out, which is tracking the wifi strength of connected/roaming devices.  I think this is quite useful as a virtual presence detection device.  I’ll add more detail in the future about how I’ve done this.
  • I’m adding more devices.  One of the things I really like about having the data in a place like this is I can start to gather and visualize other external inputs to the system (e.g. my phone’s sensors and location).  I’ve got some code written for this already.
  • I’d like to make a native Elasticsearch exporter and timer for Domoticz and fix some of this API weirdness, but I’m not yet familiar enough with the Domoticz C++ codebase to do this.
  • I think my habits could be fairly well modeled with Elastic’s machine learning.  I’d like to build/use machine learning models on top of this data to automatically flag and alert on anomalies in the data rather than hard-code rules.

JSparks

I’ve released my first Android application!  It’s a full vertical stack that helps teach kids how to program using the excitement of robots, while eliminating many of the headaches of starting to learn to write software (like syntax errors).

You can find full details at sparki.eskibars.com