2021 Wordcloud
Inspired by 2021 in titles by Jason Werner (whom I found by clicking the "next" link on the Indieweb ring, as I enjoy doing), I decided to have a go at my own Wordcloud for 2021. Rather than just grabbing all the titles for the year and throwing them at a web site, I had to spice it up a bit.
Step 1: Dockerised https://github.com/amueller/word_cloud
The wordcloud tool Jason used is amueller/word_cloud - a pretty great wordclouder with templates and all sorts of power. I've been running almost every new tool in its own Docker container these days to try not to pollute my own environment, so I threw together a quick container (again, inspired by the work on Dockerised Puppeteering).
FROM python:3
RUN mkdir /wordcloud
WORKDIR /wordcloud
RUN pip install wordcloud
Simple and effective - grab the latest Python 3 image, create a /wordcloud working directory, and pip install wordcloud. Job done.
Step 2: Harvest the content from all my posts for 2021
This took me a while. I thought sed could pattern match my way through a find from the subdirectory... but sed works line by line, so that didn't work. I looked at other tools like csplit before googling my way to remembering "oh wait, this is what grep does." Reading through the grep --help revealed a lot, and this post on multiline grep and the -P flag gave me the last little bit I needed.
> grep -hPor --include=*.md "(?s)===(.*)" DIRECTORY/posts/*/2021 | pandoc -t plain > output/input.txt
grep
The command doing the searching.
-h
Hide the file name from the output.
-P
Use Perl-compatible regular expressions (PCRE), so I can add the (?s) flag to do multiline matching.
-o
Only print the part of the text that matched, not everything around it.
-r
Recurse through the directories.
--include=*.md
Only look at Markdown files in the directory.
"(?s)===(.*)"
My files use === as the Markdown preamble separator, and I make sure not to use it anywhere else in the file so my parser doesn't break. So I can tell grep to only bring back what comes after a ===.
DIRECTORY/posts/*/2021
Start at the posts/*/2021 directory and bring back all the matching files.
|
Pass the output of that command to the next command.
pandoc
This is the next command, my Markdown converter.
-t plain
Output plain text, to remove all the formatting and be as close to just my words as possible.
> output/input.txt
Send the resulting text to the input.txt file.
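For anyone without a grep that supports -P, the harvesting step can be roughly sketched in Python. This is a stdlib-only approximation, not the pipeline used above: the function name collect_bodies is made up for illustration, it drops the === marker itself (the grep match keeps it), and it skips the pandoc step, so Markdown formatting stays in the text.

```python
import re
import tempfile
from pathlib import Path

# Same idea as "(?s)===(.*)": DOTALL lets "." cross newlines, so the
# match runs from the first === to the end of the file.
BODY = re.compile(r"===(.*)", re.DOTALL)


def collect_bodies(root: Path) -> str:
    """Join the text after the first === of every .md file under root."""
    bodies = []
    for path in sorted(root.rglob("*.md")):
        match = BODY.search(path.read_text(encoding="utf-8"))
        if match:  # files without a === separator are skipped
            bodies.append(match.group(1).strip())
    return "\n\n".join(bodies)


if __name__ == "__main__":
    # Throwaway demo tree; the file names and contents are invented.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.md").write_text("Title: One\n===\nHello *world*\n", encoding="utf-8")
        (root / "notes.txt").write_text("===\nignored\n", encoding="utf-8")
        print(collect_bodies(root))
```

Pointing collect_bodies at a posts directory and writing the result to output/input.txt would give the wordcloud step something similar to work with, minus the plain-text conversion.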
Count it!
> docker run -i --rm -v THISDIR\output:/wordcloud/output -t wordcloud wordcloud_cli --text output/input.txt --imagefile output/mep.png
docker run
Let's run a prebuilt image.
-i
Interactive: keep STDIN open so we can stop when we want, if we want.
--rm
Delete the container once it's finished running, so we don't leave stuff laying about.
-v THISDIR\output:/wordcloud/output
Mount the local output directory into the container so it can access the input file.
-t
Allocate a pseudo-terminal for the container.
wordcloud
This is the image we want to run, the one we built earlier.
wordcloud_cli --text output/input.txt --imagefile output/mep.png
Run the wordcloud_cli command over the text we made earlier, giving us the image we want!
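A word cloud sizes each word by how often it turns up, so a quick frequency count gives a rough preview of what the image is going to shout about. A minimal stdlib sketch, with assumptions: STOPWORDS here is a tiny made-up list rather than the one wordcloud ships with, and the real tool does more (it can group common word pairs, for instance), so the numbers won't line up exactly.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not wordcloud's own list).
STOPWORDS = {"the", "a", "an", "and", "i", "to", "of", "it", "in", "my"}


def top_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most common words, lowercased and minus stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)


if __name__ == "__main__":
    sample = "I like coding and the coding likes me. Games, coding, comedy."
    print(top_words(sample, 3))
```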
And it works:

And I can do it just for my titles, like Jason did
> grep -ihPor --include=*.md "Title:.*" DIRECTORY/posts/*/2021 | pandoc -t plain > output/title.txt
> docker run -i --rm -v THISDIR\output:/wordcloud/output -t wordcloud wordcloud_cli --text output/title.txt --imagefile output/title.png
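The title harvest can be sketched in Python too. Two deliberate differences from the grep pattern, both assumptions made for this sketch: it only accepts Title: at the start of a line (the grep pattern matches it anywhere in a line), and it drops the Title: prefix, which the grep match keeps. The name extract_titles is hypothetical.

```python
import re

# Case-insensitive, like grep -i; MULTILINE anchors ^ and $ at each line.
TITLE = re.compile(r"^title:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def extract_titles(markdown: str) -> list[str]:
    """Return the text after every line that starts with "Title:"."""
    return [title.strip() for title in TITLE.findall(markdown)]


if __name__ == "__main__":
    post = "Title: 2021 Wordcloud\nDate: 2022-01-01\n===\nSome body text.\n"
    print(extract_titles(post))
```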

FUN!
Result
I like coding, comedy, games (computer and tabletop RP), and steampunk. From the clouds, it's very obvious I've been focusing on the coding. Not sure if that's because it's so much easier during the pandemic to fire up the code windows rather than trying to steampunk things. I think I need to make an effort this year to broaden my hobby time.
Edit
Yes I now recognise the irony, thank you.