2021 Wordcloud

Inspired by 2021 in titles by Jason Werner (whom I found by clicking the "next" link on the Indieweb ring, as I enjoy doing), I decided to play at my own Wordcloud for 2021. Rather than just grabbing all the titles for the year and throwing them at a website, I had to spice it up a bit.

Step 1: Dockerised https://github.com/amueller/word_cloud

The wordcloud tool Jason used is amueller/word_cloud - a pretty great wordclouder with templates and all sorts of power. I've been running almost every new tool in its own Docker container these days to avoid polluting my own environment, so I threw together a quick container (again, inspired by the work on Dockerised Puppeteering).

FROM python:3
RUN mkdir /wordcloud
WORKDIR /wordcloud
RUN pip install wordcloud

Simple and effective - grab the latest Python 3 image, create a /wordcloud working directory, and pip install wordcloud. Job done.
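
The build step itself isn't shown, so here is a minimal sketch of it. It assumes the Dockerfile lives in an otherwise empty folder (the temporary build_dir below is only a stand-in) and that the image is tagged wordcloud, the name the run command uses later on.

```shell
# Hypothetical build folder; any empty directory would do.
build_dir=$(mktemp -d)

# Write the same four-line Dockerfile shown above.
cat > "$build_dir/Dockerfile" <<'EOF'
FROM python:3
RUN mkdir /wordcloud
WORKDIR /wordcloud
RUN pip install wordcloud
EOF

# Build and tag the image as "wordcloud" (skipped when Docker isn't installed).
if command -v docker >/dev/null 2>&1; then
  docker build -t wordcloud "$build_dir" || echo "docker build failed" >&2
fi
```

Tagging with -t at build time is what lets the later docker run refer to the image by name rather than by ID.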

Step 2: Harvest the content from all my posts for 2021

This took me a while. I thought sed could pattern match my way through a find from the subdirectory... but sed works line by line, so that didn't work. I looked at other tools like csplit before googling my way to remembering "oh wait, this is what grep does." Reading through grep --help revealed a lot, and this post on multiline grep and the -P flag gave me the last little bit I needed.

> grep -hPor --include=*.md  "(?s)===(.*)" DIRECTORY/posts/*/2021 | pandoc -t plain > output/input.txt
  1. grep
  2. -h Hide the file name from the output
  3. -P use Perl-compatible regular expressions (PCRE), so I can add the (?s) flag, which lets . match newlines for multiline matching
  4. -o Only print the matched text, not the whole line
  5. -r to recurse through the directories.
  6. --include=*.md only look at Markdown files in the directory
  7. "(?s)===(.*)" My files use === as the Markdown preamble separator, and I make sure not to use it anywhere else in the file so my parser doesn't break. So I can tell grep to bring back only what comes after a ===.
  8. DIRECTORY/posts/*/2021 start at the posts/*/2021 directory and bring back all the matching files
  9. | Pass the output of that command to the next command
  10. pandoc This is the next command, my markdown converter
  11. -t plain output plain text, to remove all the formatting and be as close to just my words as possible.
  12. > output/input.txt send the resulting text to the input.txt file
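
To see the harvest pattern in isolation, here is a minimal sketch run against a throwaway folder, assuming GNU grep with PCRE support. The sample file and paths are made up. The -z and \K are additions that are not in the command above: as far as I know, GNU grep reads a line at a time unless -z tells it to treat the whole file as one record, and \K keeps the === itself out of the output. Other grep builds may behave differently.

```shell
# Throwaway sample tree standing in for DIRECTORY/posts/*/2021 (hypothetical paths).
demo=$(mktemp -d)
mkdir -p "$demo/posts/blog/2021"
printf 'Title: Hello World\n===\nFirst body line\nSecond body line\n' \
  > "$demo/posts/blog/2021/hello.md"

# -z reads each file as a single record, so (?s) can let .* run across newlines.
# \K drops the === from the printed match.
# tr strips the NUL byte that -z appends after each match.
body=$(grep -hzPor --include='*.md' '(?s)===\K.*' "$demo"/posts/*/2021 | tr -d '\0')
printf '%s\n' "$body"
```

The preamble (the Title line and the separator) stays behind and only the body lines come through, which is the text that would then be piped into pandoc.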

Count it!

> docker run -i --rm -v THISDIR\output:/wordcloud/output -t wordcloud wordcloud_cli --text output/input.txt --imagefile output/mep.png
  1. docker run Start a container from a prebuilt image
  2. -i keep STDIN open (interactive) so we can stop when we want, if we want
  3. --rm delete the container once it's finished running, so we don't leave stuff lying about
  4. -v THISDIR\output:/wordcloud/output Mount the local output directory into the container so it can access the input file
  5. -t wordcloud -t allocates a pseudo-terminal, and wordcloud is the image we want to run, the one we built earlier
  6. wordcloud_cli --text output/input.txt --imagefile output/mep.png run the wordcloud_cli command over the text we made earlier, giving us the image we want!
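
For a quick text-only preview of which words are likely to come out biggest, a rough frequency count can be sketched with standard tools. This is only an approximation under my own assumptions: wordcloud applies its own tokenising and stopword list, so its ranking will differ, and the sample file below is a made-up stand-in for output/input.txt.

```shell
# Sample text standing in for output/input.txt (hypothetical content).
sample=$(mktemp)
printf 'Code and more code.\nDocker runs the code.\n' > "$sample"

# Lower-case everything, split on non-letters, then count and rank the words.
top=$(tr '[:upper:]' '[:lower:]' < "$sample" \
  | tr -cs '[:alpha:]' '\n' \
  | sort | uniq -c | sort -rn | head -n 3)
printf '%s\n' "$top"
```

Pointing the first tr at the real input file instead of the sample gives a sanity check before rendering the image.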

And it works:

And I can do it just for my titles, like Jason did:

> grep -ihPor --include=*.md  "Title:.*" DIRECTORY/posts/*/2021 | pandoc -t plain > output/title.txt
> docker run -i --rm -v THISDIR\output:/wordcloud/output -t wordcloud wordcloud_cli --text output/title.txt --imagefile output/title.png

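One thing worth knowing about the title pattern: with -o, grep prints the whole match, so every harvested line still begins with "Title:" and that word gets counted along with the real ones. A minimal sketch of a variant that drops the label, assuming GNU grep with PCRE support; the \K and the sample files are my additions for illustration, not part of the commands above.

```shell
# Throwaway posts standing in for DIRECTORY/posts/*/2021 (hypothetical paths and titles).
demo=$(mktemp -d)
mkdir -p "$demo/posts/blog/2021"
printf 'Title: Dockerised Wordclouds\n===\nBody text\n' > "$demo/posts/blog/2021/a.md"
printf 'title: Second Post\n===\nMore body\n' > "$demo/posts/blog/2021/b.md"

# -i matches Title/title alike; \K discards the label and any spaces after it,
# so only the title text itself is printed.
titles=$(grep -ihPor --include='*.md' '^Title:\s*\K.*' "$demo"/posts/*/2021 | sort)
printf '%s\n' "$titles"
```

The sort is only there to make the output order predictable; it has no effect on the resulting cloud.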


I like coding, comedy, games (computer and tabletop RP), and steampunk. From the clouds, it's very obvious I've been focusing on the coding. Not sure if that's because it's so much easier during the pandemic to fire up the code windows rather than trying to steampunk things. I think I need to make an effort this year to broaden my hobby time.


Yes I now recognise the irony, thank you.