Over the coming weeks I'll be releasing the code that runs this blog, using it as an excuse to tidy it up a bit. All code will be available at github.com/wbhb/blog
This week I'm looking at the upload tool, a node.js script that runs on my local development environment.
This is the part of the code I work with most often, as I use it every week to take the post I've written and get it onto the server for consumption. It performs several tasks:
- Finds all images in the post.
- Scales the images to several sizes.
- Uploads each image to Amazon's Simple Storage Service (S3).
- Uploads image metadata to Amazon's DynamoDB.
- Spellchecks the post.
- Uploads the post and its metadata to DynamoDB.
The script tries to do as much as possible asynchronously so that a slow upload doesn't hold up everything else. There are times, though, when a specific result is needed before anything else can continue, such as the base64-encoded image thumbnail.
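The general shape of that is easy to sketch. This isn't the blog's actual code - the function names and bodies here are invented stand-ins - but it shows the pattern: await the one result everything depends on, then let the independent work run concurrently with Promise.all.

```javascript
// Illustrative only: these helpers are invented stand-ins, not the real script.
const makeThumbnail = async (post) =>
  Buffer.from(`thumb of ${post.title}`).toString('base64');
const uploadImages = async (post) => post.images.map((i) => `s3://bucket/${i}`);
const spellcheck = async (post) => [];

async function processPost(post) {
  // the thumbnail must finish first, as later metadata embeds it inline
  const thumbnail = await makeThumbnail(post);
  // these tasks don't depend on each other, so they run concurrently
  const [uploaded, typos] = await Promise.all([
    uploadImages(post),
    spellcheck(post),
  ]);
  return { thumbnail, uploaded, typos };
}
```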
The posts themselves are stored locally in two files - an HTML file with the actual content and a JSON file with the metadata. These are stored together on the server, but I find it easier to write them separately, as it's hard to write HTML in the middle of a JSON file - it's far too easy to forget to escape something, and the syntax highlighting doesn't work.
To save me writing a lot of HTML manually, and to allow rendering of the document in a variety of formats such as AMP, certain elements, like images, are inserted using a basic templating system. This will be discussed more in a future post.
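The actual tag syntax will have to wait for that future post, so the `{{image:...}}` form below is purely invented for illustration - but the mechanism is just a regex replace that swaps each tag for whatever the renderer wants:

```javascript
// The {{image:...}} syntax here is invented for illustration; the real
// templating system's tag format isn't shown in this post.
function renderImages(html, renderTag) {
  return html.replace(/\{\{image:([^}]+)\}\}/g, (_, src) => renderTag(src));
}
```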
The script looks for all images in the above format and then uses ImageMagick (actually GraphicsMagick) via the gm library to open and resize each image. Originally this was a mess: the script started resizing every image at the same time, which led to severe memory issues on my small AWS C9 development environment, and also opened too many connections to AWS, which led to errors.
The image processing is now done one at a time, with a queue created using an async generator. Within the generator each image is processed into each of the required sizes, with the appropriate metadata yielded, signifying to the main function that it's ready to be uploaded. The metadata itself gets put onto DynamoDB for the templating engine to use at runtime.
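A minimal sketch of that queue, assuming invented sizes and a stand-in for the gm resize call, looks something like this - the async generator fully processes one image/size pair before starting the next, which is what keeps memory bounded:

```javascript
// Sizes and the resize body are invented stand-ins, not the real values.
const SIZES = [320, 640, 1280];

async function resize(image, width) {
  // stand-in for the gm open/resize step; yields the metadata the
  // main function needs to upload the result
  return { image, width, key: `${width}/${image}` };
}

async function* resizedImages(images) {
  for (const image of images) {
    for (const width of SIZES) {
      // each resize completes before the next one starts
      yield await resize(image, width);
    }
  }
}
```

The main function then consumes it with `for await...of`, uploading each yielded item as it arrives.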
The uploads themselves now use a worker pool arrangement, where a limited number of resizing and uploading operations are allowed to run concurrently, and any more have to wait for the previous ones to finish. I put this code in a separate file as it seemed reusable.
The spellcheck is accomplished using a node binding for hunspell. It's pretty rudimentary at the moment, but is enough to catch some of the large number of spelling mistakes I make - unfortunately C9 doesn't have spellcheck. I would like to employ my favourite editor, but as this blog makes me no money it wouldn't be cost effective, so hunspell it is.
A set of dodgy regular expressions gets rid of all the HTML tags, templates, and punctuation and then tokenises the text into words, each of which gets checked. If a word isn't in the dictionary I get a warning. The upload still goes ahead, as the spellcheck produces a lot of false positives, and uploading just the text again is very quick.
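In the same dodgy spirit, the strip-and-tokenise step can be approximated like this - the tag patterns are guesses (the template syntax in particular is invented), which is exactly why the false positives happen:

```javascript
// Rough approximation of the strip-and-tokenise pass; the template-tag
// pattern is invented, and these regexes are as dodgy as the originals.
function tokenise(html) {
  return html
    .replace(/<[^>]+>/g, ' ')       // drop HTML tags
    .replace(/\{\{[^}]+\}\}/g, ' ') // drop template tags (invented syntax)
    .split(/[^a-zA-Z']+/)           // break on punctuation and whitespace
    .filter((w) => w.length > 0);
}
```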
The post and its metadata then get assembled into the JSON-like format that DynamoDB uses. I plan to change this to plain JSON that gets programmatically converted into the DynamoDB format, but I haven't gotten around to it yet, so it's a bit hillbilly. This gets uploaded via the same pool the images used, and the script then spits out a URL to test the post at, to check nothing's gone wrong - I've mistyped template tags before, for instance, which means the runtime component doesn't swap them out.
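That planned conversion is straightforward to sketch, since DynamoDB's attribute-value format wraps each value in a type descriptor (`S` for strings, `N` for numbers-as-strings, `BOOL`, `L` for lists, `M` for maps). This toy converter handles only those types; the real AWS SDK marshaller covers more:

```javascript
// Toy plain-JSON → DynamoDB attribute-value converter; the AWS SDK's own
// marshaller handles many more types (sets, binary, null, etc.).
function toDynamo(value) {
  if (typeof value === 'string') return { S: value };
  if (typeof value === 'number') return { N: String(value) }; // numbers are strings
  if (typeof value === 'boolean') return { BOOL: value };
  if (Array.isArray(value)) return { L: value.map(toDynamo) };
  if (value && typeof value === 'object') {
    const m = {};
    for (const [k, v] of Object.entries(value)) m[k] = toDynamo(v);
    return { M: m };
  }
  throw new TypeError('unsupported value');
}
```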
Thanks to some time waiting at an airport the help/usage text is now displayed from a config file, using a small library I wrote. This uses minimist for the argument parsing, but adds automatic help generation and some error handling.
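The help-generation half of that is the simple part: given a declarative list of options (the config shape here is invented, not the library's actual format), the usage text is just a formatted dump of it.

```javascript
// The option-config shape here is invented for illustration; the real
// library's config format isn't shown in this post.
function usage(name, options) {
  const lines = options.map(
    (o) => `  --${o.flag.padEnd(12)} ${o.description}`
  );
  return [`Usage: ${name} [options]`, ...lines].join('\n');
}
```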
The final change I've been making this weekend is moving all the configuration constants into a separate config file. This way the script can be used by others, eventually, without needing to change all the addresses and folders that were originally hardcoded. It also stops me publishing these values, in case that is a security concern. It's likely the image sizes will also be moved into this config file at some point.
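The config module might end up looking something like this - every key and value below is a placeholder guess based on what the post says was hardcoded, not the real settings:

```javascript
// Hypothetical config module; all keys and values are placeholder guesses.
module.exports = {
  postsDir: './posts',               // where the HTML/JSON post pairs live
  s3Bucket: 'example-blog-images',   // placeholder, not the real bucket
  dynamoTable: 'example-blog-posts', // placeholder, not the real table
  imageSizes: [320, 640, 1280],      // may move here too, per the post
};
```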
At the time of writing, there's still some tidying to do, but it does the job it needs to for now (as evidenced by the post you've just read). There's at least one bug in it I've noticed today involving the base64-encoded images - kudos to anyone who finds it and puts a bug report on github, and a discretionary (and non-legally binding) beer to the first PR that fixes it.