Async is not new in Python, but I was not familiar with the concept. I have used it without fully grasping the idea, and that smells like disaster. This article, and the whole journey I went through, was sparked by one question: why shouldn't you run blocking code on the event loop? The answer is simple: it will block the whole thing. I kind of knew that, as everyone says it, so you can't really miss it. But why is that? How does it work? If you would like to know, read on.
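To make that answer concrete, here is a minimal sketch of my own (not taken from any particular project) showing what "blocking the whole thing" looks like: a single time.sleep call on the loop freezes every other coroutine.

import asyncio
import time

async def heartbeat():
    # prints once a second, as long as the event loop is free to run it
    for _ in range(5):
        print("tick")
        await asyncio.sleep(1)

async def blocking_handler():
    await asyncio.sleep(0)  # let the heartbeat start first
    time.sleep(3)           # blocking call: nothing else runs for these 3 seconds
    print("blocking work done")

async def main():
    await asyncio.gather(heartbeat(), blocking_handler())

asyncio.run(main())

The ticks pause for roughly three seconds while time.sleep holds the loop, which is why blocking work is usually pushed to a thread or process pool instead.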
Bridge to Kubernetes
When you are using Kubernetes you have a cluster, and clusters are rarely small. Your gripes may come in different flavours though. The cluster won't fit on your local box any more. You most probably have people on your team who need the cluster as well. You need feedback faster than deploying your changes to the cluster. What to do about all those problems?
Run a bridge that will route traffic to your local version of one of the microservices. Easy. You will run your service locally, and it will interact with the Kubernetes cluster without interrupting anyone. Sounds amazing, doesn't it?
To illustrate this, let's run a sample app in a cluster and then do some "development" and "debugging". This will not affect the "development" environment for other people who could be using it. The code will be isolated and your team will be happy.
Cloud agnostic apps with DAPR
DAPR is cool. As stated on the website, it provides "APIs for building portable and reliable microservices"; it works with many clouds and external services. As a result you only need to configure the services and then use the DAPR APIs. It is true, and I'll show you. You will find the code for this article here. A must-have tool when working with microservices.
Applications can use DAPR as a sidecar container or as a separate process. I’ll show you a local version of the app, where DAPR is configured to use Redis running in a container. The repo will have Azure, AWS, and GCP configuration as well.
Before we can start the adventure you have to install DAPR.
Running app locally
Configuration
You have to start off on the right foot. Because of that, we have to configure a secrets store, where you can keep passwords and such. I could skip this step, but then there is a risk someone would never find out how to do it properly and would ship passwords in plain text. Here we go.
# save under ./components/secrets.yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: local-secret-store
  namespace: default
spec:
  type: secretstores.local.file
  version: v1
  metadata:
  - name: secretsFile
    value: ../secrets.json
The file secrets.json should have all your secrets, like connection strings, user and pass pairs, etc. Don't commit this file.
{
  "redisPass": "just a password"
}
The next file is the publish/subscribe configuration. Dead simple, but I'd recommend going through the docs as there is much more to pub/sub. Here you can reference your secrets, as shown.
# save under ./components/pubsub.yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: order_pub_sub
spec:
  type: pubsub.redis
  version: v1
  metadata:
  - name: redisHost
    value: localhost:6379
  - name: redisPass
    secretKeyRef:
      name: redisPass
      key: redisPass
auth:
  secretStore: local-secret-store
Publisher and subscriber
With config out of the way, the only things left are the publisher part and the subscriber part. As mentioned before, the app you write talks to the DAPR API. This means you may use HTTP calls or the Dapr client. The best part is that no matter what is on the other end, be it Redis or PostgreSQL, your code will not change even when you swap your external services.
Here goes the publisher, which will send events to a topic. The topic can be hosted anywhere; here is a list of supported brokers. The list is long, however only 3 are stable. I really like how DAPR is approaching component certification though. There are well defined requirements to pass to advance from Alpha, to Beta, and finally to Stable.
# save under ./publisher/app.py
import logging

from dapr.clients import DaprClient
from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

app = FastAPI()


class Order(BaseModel):
    product: str


@app.post("/orders")
def orders(order: Order):
    logging.info("Received order")
    with DaprClient() as dapr_client:
        dapr_client.publish_event(
            pubsub_name="order_pub_sub",
            topic_name="orders",
            data=order.json(),
            data_content_type="application/json",
        )
    return order
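For completeness, the same publish can be done with a plain HTTP call to the Dapr sidecar instead of the client. A rough sketch, assuming the requests library is installed and using the sidecar port that dapr run exposes through the DAPR_HTTP_PORT environment variable:

import os
import requests

# talk to the sidecar's pub/sub HTTP API directly
dapr_port = os.environ.get("DAPR_HTTP_PORT", "3500")
requests.post(
    f"http://localhost:{dapr_port}/v1.0/publish/order_pub_sub/orders",
    json={"product": "falafel"},
)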
Here is a consumer.
# save under ./consumer/app.py
from dapr.ext.fastapi import DaprApp
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
dapr_app = DaprApp(app)


class CloudEvent(BaseModel):
    datacontenttype: str
    source: str
    topic: str
    pubsubname: str
    data: dict
    id: str
    specversion: str
    tracestate: str
    type: str
    traceid: str


@dapr_app.subscribe(pubsub="order_pub_sub", topic="orders")
def orders_subscriber(event: CloudEvent):
    print("Subscriber received : %s" % event.data["product"], flush=True)
    return {"success": True}
Running the apps
Now you can run both apps together in separate terminal windows and see how they talk to each other using the configured broker. For this example we are using Redis as the broker. You will see how easy it is to run them on different platforms.
In the first terminal run the consumer.
$ dapr run --app-id order-processor --components-path ../components/ --app-port 8000 -- uvicorn app:app
In the other terminal run the producer.
$ dapr run --app-id order-processor --components-path ../components/ --app-port 8001 -- uvicorn app:app --port 8001
After you make an HTTP call to the producer you should see both of them producing log messages as follows.
$ http :8001/orders product=falafel
# producer
== APP == INFO:root:Received order
== APP == INFO: 127.0.0.1:49698 - "POST /orders HTTP/1.1" 200 OK
# subscriber
== APP == Subscriber received : falafel
== APP == INFO: 127.0.0.1:49701 - "POST /events/order_pub_sub/orders HTTP/1.1" 200 OK
Running app in the cloud
It took us a bit to reach the crux of this post. We had to build something and run it, so that we can now run it in the cloud. The above example will run in the cloud with a simple change of configuration.
The simplest configuration is for Azure. Change your pubsub.yaml so it looks as follows, and update your secrets.json as well.
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: order_pub_sub
spec:
  type: pubsub.azure.servicebus
  version: v1
  metadata:
  - name: connectionString
    secretKeyRef:
      name: connectionStrings:azure
      key: connectionStrings:azure
Your secrets.json should look like this now:
{
  "connectionStrings": {
    "azure": "YOUR CONNECTION STRING"
  }
}
Rerun both commands in the terminal and the output will look the same as with the local env, but the app will run on Azure Service Bus.
Bloody magic if you ask me. You can mix and match your dependencies without changing your application. In some cases you may even use features not available on a particular cloud, like message routing based on the body in Azure Service Bus. That will be another post though.
Here is the repo for this post, it includes all the providers listed below:
- Azure
- Google Cloud
- AWS
Please remember to update your secrets.json.
Have fun 🙂
Debugging is difficult; what's even more difficult is debugging production apps. Live production apps.
There are tools designed for this purpose. Azure has Application Insights, a product that makes retracing the history of events easier. When set up correctly you may go from an HTTP request down to a DB call with all the query arguments. Pretty useful, and definitely more convenient than sifting through log messages.

Here you can see the exact query that had been executed on the database.

You may also see every log related to a particular request in Log Analytics.

Improving your work life like this is pretty simple. Everything here is done using opencensus and its extensions. Opencensus integrates with Azure pretty nicely. The first thing to do is to install the required dependencies.
# pip
pip install opencensus-ext-azure opencensus-ext-logging opencensus-ext-sqlalchemy opencensus-ext-requests
# pipenv
pipenv install opencensus-ext-azure opencensus-ext-logging opencensus-ext-sqlalchemy opencensus-ext-requests
# poetry
poetry add opencensus-ext-azure opencensus-ext-logging opencensus-ext-sqlalchemy opencensus-ext-requests
The next step is to activate them by including a couple of lines in your code. Here I activate 3 extensions: logging, requests, and sqlalchemy. Here is a list of other official extensions.
import logging

from opencensus.trace import config_integration
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
config_integration.trace_integrations(["logging", "requests", "sqlalchemy"])

handler = AzureLogHandler(connection_string="InstrumentationKey=YOUR_KEY")
handler.setFormatter(logging.Formatter("%(traceId)s %(spanId)s %(message)s"))
logger.addHandler(handler)
One last thing is a middleware that will instrument every request. This code is taken from Microsoft’s documentation.
@app.middleware("http")
async def middlewareOpencensus(request: Request, call_next):
tracer = Tracer(
exporter=AzureExporter(
connection_string="InstrumentationKey=YOUR_KEY"
),
sampler=ProbabilitySampler(1.0),
)
with tracer.span("main") as span:
span.span_kind = SpanKind.SERVER
response = await call_next(request)
tracer.add_attribute_to_current_span(
attribute_key=HTTP_STATUS_CODE, attribute_value=response.status_code
)
tracer.add_attribute_to_current_span(
attribute_key=HTTP_URL, attribute_value=str(request.url)
)
return response
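To see the correlation at work, a request handler only has to log as usual; with the logging integration active, records emitted inside the middleware's span carry the traceId and spanId used by Application Insights. A small sketch, assuming the app and logger from the snippets above (the route and its logic are made up for illustration):

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    # this record includes traceId/spanId thanks to the formatter configured above
    logger.info("Fetching order %s", order_id)
    return {"order_id": order_id}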
You are done 🙂 You will not lose information on what is going on in the app. You will be quicker at finding problems and resolving them. Life's good now.
I was looking for a way to deploy a custom model to Sagemaker. Unfortunately, my online searches failed to find anything that was not using Jupyter notebooks. I like them, but that way of deploying models is neither reproducible nor scalable.
After a couple of hours of looking, I decided to do it myself. Here comes a recipe for deploying a custom model to Sagemaker using AWS CDK.
The following steps assume you have knowledge of CDK and Sagemaker. I’ll try to explain as much as I can but if anything is unclear please refer to the docs.
Steps
- Prepare containerised application serving your model.
- Create Sagemaker model.
- Create Sagemaker Endpoint configuration.
- Deploy Sagemaker Endpoint.
Unfortunately, AWS CDK does not support higher-level constructs for Sagemaker. You have to use CloudFormation constructs, which start with the prefix Cfn. Higher-level constructs for Sagemaker are not on the roadmap as of March 2021.
Dockerfile to serve model
The first thing is to have your app in container form, so it can be deployed in a predictable way. It's difficult to help with this step as each model may require different dependencies or actions. What I can recommend is to go over https://docs.aws.amazon.com/sagemaker/latest/dg/build-multi-model-build-container.html. This page explains the steps required to prepare a container that can serve a model on Sagemaker. It may also be helpful to read this part https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html on how your docker image will be used.
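As a rough orientation of what those docs boil down to: Sagemaker expects the container to answer GET /ping (health check) and POST /invocations (inference) on port 8080. A minimal sketch using FastAPI, where predict is a stand-in for whatever your model actually does:

from fastapi import FastAPI, Request

app = FastAPI()


def predict(payload: dict) -> dict:
    # placeholder for loading the model and running inference
    return {"result": "stub"}


@app.get("/ping")
def ping():
    # Sagemaker uses this endpoint to check that the container is healthy
    return {"status": "ok"}


@app.post("/invocations")
async def invocations(request: Request):
    payload = await request.json()
    return predict(payload)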
Define Sagemaker model
Once you have your model in a container form it is time to create a Sagemaker model. There are 3 elements to a Sagemaker model:
- Container definition
- VPC configuration for a model
- Model definition
Adding a container definition to your app is simple (the hard part, creating the docker image, is already done). The container definition will be used by the Sagemaker model.
asset = DockerImageAsset(
    self,
    "MLInferenceImage",
    directory="../image",
)

primary_container_definition = sagemaker.CfnModel.ContainerDefinitionProperty(
    image=asset.image_uri,
)
Creating a VPC is pretty straightforward; you just have to remember to create public and private subnets.
vpc = ec2.Vpc(
    self,
    "VPC",
    subnet_configuration=[
        ec2.SubnetConfiguration(
            name="public-model-subnet", subnet_type=ec2.SubnetType.PUBLIC
        ),
        ec2.SubnetConfiguration(
            name="private-model-subnet", subnet_type=ec2.SubnetType.PRIVATE
        ),
    ],
)

model_vpc_config = sagemaker.CfnModel.VpcConfigProperty(
    security_group_ids=[vpc.vpc_default_security_group],
    subnets=[s.subnet_id for s in vpc.private_subnets],
)
Creating the model is a matter of putting together everything created so far.
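The model below references an execution role that is not shown in these snippets. A minimal sketch of creating one, assuming the broad AmazonSageMakerFullAccess managed policy is acceptable for your case:

from aws_cdk import aws_iam as iam

# assumed: a role Sagemaker can assume to pull the image and run the model
role = iam.Role(
    self,
    "SagemakerExecutionRole",
    assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
    managed_policies=[
        iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")
    ],
)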
model = sagemaker.CfnModel(
    self,
    "MLInference",
    execution_role_arn=role.role_arn,
    model_name="my-model",
    primary_container=primary_container_definition,
    vpc_config=model_vpc_config,
)
At this point, cdk deploy would create a Sagemaker model with an ML model of your choice.
Define endpoint configuration
We are not done yet, as the model has to be exposed. A Sagemaker Endpoint is perfect for this, and in the next step we create the endpoint configuration.
Endpoint configuration describes resources that will serve your model.
model_endpoint_config = sagemaker.CfnEndpointConfig(
    self,
    "model-endpoint-config",
    production_variants=[
        sagemaker.CfnEndpointConfig.ProductionVariantProperty(
            initial_instance_count=1,
            initial_variant_weight=1.0,
            instance_type="ml.t2.medium",
            model_name=model.model_name,
            variant_name="production-medium",
        ),
    ],
)
Create Sagemaker Endpoint
The last step is extremely simple. We take the configuration created earlier and create an endpoint.
model_endpoint = sagemaker.CfnEndpoint(
    self,
    "model-endpoint",
    endpoint_config_name=model_endpoint_config.attr_endpoint_config_name,
)
Congrats
Now you may call cdk deploy and the model is up and running on AWS Sagemaker 🙂
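To check that it serves traffic, you can invoke the endpoint with boto3. A sketch, assuming a JSON payload and that you look up the generated endpoint name (the CfnEndpoint above does not set endpoint_name, so CloudFormation generates one):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="YOUR-ENDPOINT-NAME",  # use the name of the deployed endpoint
    ContentType="application/json",
    Body=json.dumps({"data": [1, 2, 3]}),  # payload shape depends on your model
)
print(response["Body"].read())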
GitOps workflow
I have been using GitOps in my last project and I like the way it changed my workflow. It had to change, as in the world of microservices the old ways have to go. This is not a post about whether that is good or bad; I may write one someday. This post is about GitOps. If you do not know what GitOps is, read here. TLDR version: in practice, each commit deploys a new version of a service to one's cluster.
Going back to the subject of the workflow. I'll focus on the microservices workflow, as here, in my opinion, GitOps is extremely useful. One of the main pain points of microservices architecture is deployment. When deployments are mentioned you instinctively think about deploying the application. That may be difficult, but it is not as difficult as creating developer environments and QA environments.
Here comes GitOps. When applied to your project you immediately get a new service per commit. This applies to each service you have. Having this at your disposal you can set up your application in numerous combinations of versions. You can also easily replace one of the services in your stack. Sounds good, but nothing beats a demo, so here we go.
Demo
Let's say I have the task of verifying whether our changes to one of the microservices are correct. I'm using Rio to manage my Kubernetes cluster as it makes things smoother. A change in one service affects another service, and I have to verify it using the UI. This adds up to 3 services deployed in one namespace and configured so they talk to each other. After I add a commit in the service repository, there is a namespace already created on the cluster. Each commit creates a new version of the service.
% rio -n bugfix ps -q
bugfix:app
Now I need to add the missing services, and I can do it by branching off from master. The name of the branch must match in all services involved.
% cd other_services && git checkout -b bugfix
% git push
After pushing the changes Rio adds them to the same namespace.
% rio -n bugfix ps -q
bugfix:app
bugfix:web
bugfix:other_app
One thing left is to wire them up so the services talk to each other. As I'm following the recommendations from https://12factor.net/config, it is dead easy, and I can use Rio to do it. The edit command allows me to modify the environment variables of each service.
% rio -n bugfix edit web
This opens up your favourite text editor, where you can edit the variables and set up where the web app can find the other services. You can make the same changes in other services if necessary.
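Per the 12factor recommendation, the app simply reads those variables at startup; the variable names below are hypothetical and only illustrate the wiring:

import os

# hypothetical names; set through `rio edit` for the environment in question
OTHER_APP_URL = os.environ.get("OTHER_APP_URL", "http://other_app")
WEB_URL = os.environ.get("WEB_URL", "http://web")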
I have wired up the services, they are talking to each other, and I can proceed with my work. This is my workflow using GitOps, Rio, and microservices.
When integrating with third-party APIs you need to make sure that your requests reach the third party. In case of issues on their end you want to retry, and preferably not interrupt the flow of your application, or even worse pass information about such issues to the end user (like leaking 503 errors).
The most popular solution is to use a background task, and there are tools to help with that: celery, python-rq, or dramatiq. They do the job of executing the code in the background, but they require some extra infrastructure to make it work, plus all the dependencies they bring in. I have used them all in the past with great success, but most recently I decided to write a basic background task myself.

Why? As I mentioned earlier, all of them require extra infrastructure in the form of a broker, which most of the time is Redis. This implies changes to deployment, requires additional resources, and makes the stack more complex. The scope of what I had to do just did not justify bringing in this whole baggage.
I needed to retry calls to the AWS Glue service in case we maxed out capacity. Since the Glue job we are executing can take a couple of minutes, our calls to AWS Glue had to be pushed into the background. I'll give you the code and summarize what it does. By no means is this code perfect, but it works 🙂
# background.py
import threading
from queue import Queue

task_queue = Queue()
worker_thread = None


def enqueue_task(task):
    task_queue.put_nowait(task)

    global worker_thread
    if not worker_thread:
        worker_thread = _run_worker_thread()


def _process_tasks(task_queue):
    while task_queue.qsize():
        task = task_queue.get()
        try:
            print(f"Do stuff with task: {task}")
        except Exception as e:
            task_queue.put(task)

    global worker_thread
    worker_thread = None


def _run_worker_thread():
    t = threading.Thread(target=_process_tasks, args=(task_queue,))
    t.start()
    return t
The public interface of this small background module is one function, enqueue_task. When it is called, the task is put on the queue and a thread is started. Each subsequent call will enqueue a task, and the thread will be closed after it has processed all of them.
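For illustration, calling it from application code could look like this (the module path and task shape are hypothetical):

from background import enqueue_task

def handle_request(payload):
    # returns immediately; the task is processed (and re-queued on failure) by the worker thread
    enqueue_task({"glue_job": "my-job", "payload": payload})
    return {"status": "accepted"}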
I find this simple and flexible enough to handle communication with flaky services or services with usage caps. Since this cannot be scaled it has limited use, but HTTP calls are just fine. This code was inspired by one of Raymond Hettinger's talks on concurrency and the queue module.
I like types and I like TypeScript, mostly because of the types, even though it does not go as far as I'd like. After going to one interesting talk about keeping the domain safe in the code base, I had an idea to give it a go in TypeScript. This is going to be a short journey through the options I had for using types to make my code a bit safer.
I have a piece of code that integrates with two APIs, Bitbucket and Jira, and as usual it uses tokens to do it. The idea is to define a type describing each token so that they cannot be mixed up. The compiler would tell me if I made a mistake and passed a Jira token into a function that expects one from Bitbucket. Tokens are just strings, so the first option is a type alias.
Type alias
So I defined two type aliases, one for each API, and then a function that would only accept one of them. If you read the TypeScript documentation on types you know that this would not work.
Aliasing doesn’t actually create a new type – it creates a new name to refer to that type. Aliasing a primitive is not terribly useful, though it can be used as a form of documentation.
The below code will compile, and according to tsc there is nothing wrong here. Here is a link to the code in the TypeScript playground.
function runAlias(a: BitbucketToken) {
  return a;
}

type BitbucketToken = string;
type JiraToken = string;

runAlias("a" as JiraToken);
runAlias("a" as BitbucketToken);
Interface
My second thought was to try and use an interface, but it was disappointing as well. TypeScript uses what is called "structural subtyping", and since the token types have a similar structure they were identified as compatible, which was not my goal. Here is a link to the code in the TypeScript playground.
interface BitbucketToken {
  value: string;
}

interface JiraToken {
  value: string;
}

function runInterface(a: BitbucketToken) {
  return a.value;
}

runInterface({ value: "a" } as BitbucketToken);
runInterface({ value: "a" } as JiraToken);
Class
Next in line is class, and as you can see the boilerplate ramps up. The result is unfortunately the same as with the interface version. It should not be a surprise to me, as the documentation clearly says what is going on.
TypeScript is a structural type system. When we compare two different types, regardless of where they came from, if the types of all members are compatible, then we say the types themselves are compatible.
Here is a link to the code in the TypeScript playground.
// class version
class BitbucketToken {
  value: string;

  constructor(value: string) {
    this.value = value;
  }
}

class JiraToken {
  value: string;

  constructor(value: string) {
    this.value = value;
  }
}

function runClass(a: BitbucketToken) {}

runClass(new BitbucketToken("a"));
runClass(new JiraToken("a"));
Class with private or protected
The last and final option, which did the job, was a class with a private or protected property. Again the documentation helps with understanding why it works.
TypeScript is a structural type system. When we compare two different types, regardless of where they came from, if the types of all members are compatible, then we say the types themselves are compatible. However, when comparing types that have private and protected members, we treat these types differently. For two types to be considered compatible, if one of them has a private member, then the other must have a private member that originated in the same declaration. The same applies to protected members.
This version finally worked, and tsc complained when the tokens were mixed up, so I went with it in my personal project. Both options work, private or protected. Here is a link to the code in the TypeScript playground.
// class version
class BitbucketToken {
  private value: string;

  constructor(value: string) {
    this.value = value;
  }
}

class JiraToken {
  private value: string;

  constructor(value: string) {
    this.value = value;
  }
}

function runClass(a: BitbucketToken) {}

runClass(new BitbucketToken("a"));
runClass(new JiraToken("a"));
There is one thing that has bothered me for a couple of months. It felt wrong when I saw it in the codebase, but I could not tell why it was wrong. It was just a hunch that something was not right, but not enough to make me look for a reason.
For the last couple of days I have been struggling to sort out my bot configuration on Azure and decided I needed a break from that. Python being what I know best, it is a good candidate to feel comfortable and in control again.
I have decided to finally answer the question that was bugging me. Why does using f-strings in logger calls make me uneasy? Why does this feel wrong?
hero = "Mumen Rider"
logger.error(f"Class C, rank 1: {hero}")
f-strings
Most pythonistas will know by now what f-strings are. They are a convenient way of constructing strings. Values can be included directly in the string, which makes it much more readable. Here is an example from Python 3's f-Strings: An Improved String Formatting Syntax (Guide), which is worth at least skimming through even if you already know f-strings.
>>> name = "Eric"
>>> age = 74
>>> f"Hello, {name}. You are {age}."
'Hello, Eric. You are 74'
They have benefits, and my team has been using them ever since. That's fine, as they are awesome. However, I feel that they should not be used when we talk about logging.
logging
I'm not talking about poor man's logging, which is print. This is an example of logging in Python:
logger.info("This is an example of a log message, and a value of %s", 42)
When the code includes such a line and it is executed, it outputs a string according to the log configuration. Of course your log level needs to match, but I'm skipping that as it is not relevant here; I'll get back to it later.
The %s identifier in the log message means that anything passed into logger.info will replace the identifier. So the message will look like this.
INFO:MyLogger:This is an example of a log message, and a value of 42
logging + f-strings
Since logging accepts strings, and f-strings are so nice, they could be used together. Yes, it is possible of course, but I would not use f-strings for this purpose. The best way to illustrate why is an example followed by an explanation.
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger('klich.dev')


class MyClass:
    def __str__(self):
        print('Calling __str__')
        return "Hiya"


c = MyClass()

print("F style")
logger.debug(f'{c}')
logger.info(f'{c}')
logger.warning(f'{c}')
logger.error(f'{c}')

print()

print("Regular style")
logger.debug('%s', c)
logger.info('%s', c)
logger.warning('%s', c)
logger.error('%s', c)
This short example creates a logger and sets the logging level to ERROR. This means that only the logger.error calls will produce output. The __str__ method of the object used in the log messages prints information when it is called. So each logger call matching the level will print the Calling __str__ message and Hiya. Since there are two logger.error calls, we should get four lines in total. This is what actually gets printed out.
% python3 logg.py
F style
Calling __str__
Calling __str__
Calling __str__
Calling __str__
ERROR:klich.dev:Hiya
Regular style
Calling __str__
ERROR:klich.dev:Hiya
We can see that the logger lines using f-strings call __str__ even when the log message is not printed out. This is not a big penalty, but it may compound into something significant if you have many log calls with f-strings.
what is going on
According to the documentation on logging: "Formatting of message arguments is deferred until it cannot be avoided." The logger is smart enough not to format messages if it is not needed. It will refrain from calling __str__ until it is required, i.e. when the message is passed to stdout, to a file, or to the other outputs supported by the logger.
To dig a little bit deeper we can use the dis module from the Python standard library. After feeding our code to the dis.dis method we will get a list of operations that happened under the hood. For a detailed explanation of what the exact operations do, have a look at ceval.c in Python's sources.
>>> import logging
>>> logger = logging.getLogger()
>>> def f1():
...     logger.info("This is an example of a log message, and a value of %s", 42)
...
>>> def f2():
...     logger.info(f"This is an example of a log message, and a value of {42}")
...
>>> import dis
>>> dis.dis(f1)
0 LOAD_GLOBAL 0 (logger)
2 LOAD_METHOD 1 (info)
4 LOAD_CONST 1 ('This is an example of a log message, and a value of %s')
6 LOAD_CONST 2 (42)
8 CALL_METHOD 2
10 POP_TOP
12 LOAD_CONST 0 (None)
14 RETURN_VALUE
>>> dis.dis(f2)
0 LOAD_GLOBAL 0 (logger)
2 LOAD_METHOD 1 (info)
4 LOAD_CONST 1 ('This is an example of a log message, and a value of ')
6 LOAD_CONST 2 (42)
8 FORMAT_VALUE 0
10 BUILD_STRING 2
12 CALL_METHOD 1
14 POP_TOP
16 LOAD_CONST 0 (None)
18 RETURN_VALUE
In this case we won't get into much detail; it is enough to see that f-strings add two additional operations: FORMAT_VALUE (handles f-string value formatting) and BUILD_STRING.
After this small bit of research I can explain why we should not be using f-strings in this specific place, which is logging. I can also put my uneasiness to rest.
This may be a secret, as I didn't mention it here, but one of my hobbies is economics. The other one is programming, which is kind of obvious. I enjoy teaching as well, and run a Coding Dojo in my town, but that is a story for some other time. These three things, programming, economics, and teaching, mixed together with an idea might create something interesting.
A few months ago I decided to start calculating one economic indicator for Poland, called the Misery Index (https://en.wikipedia.org/wiki/Misery_index_(economics)). It is simple, easy to understand, and the data required to calculate it is available from the Polish Bureau of Statistics (GUS).
After having this idea I created https://jakjestw.pl. For the first few months I updated the data by hand. New data is not released often, and I could simply check and then apply the changes myself. That was fine at first but started to become cumbersome. I happen to fancy Elixir, and to test it out on something real I decided to create a poor man's static site generator and data parser.
The requirements were ridiculously simple. The app had to fetch data from the GUS website, which was in JSON format, then transform it into something that could be injected into an HTML file. Since I was done with manual labour, it had to run on Gitlab pipelines.
Easier said than done. My main language is Python, which influenced how I chose to model my data. This was my demise. I picked really poorly, by following the gut instinct of a Python programmer. Maps, tuples, and lists were my choice. In Python it makes sense; for such a data-transformation case it might not be the best, but a dict is still a go-to structure. My Elixir data looked like this (what a lovely year for Polish wallets, by the way):
%{
  2015 => [
    {12, %{cpi: -0.5}},
    {11, %{cpi: -0.6}},
    {10, %{cpi: -0.7}},
    {9, %{cpi: -0.8}},
    {8, %{cpi: -0.6}},
    {7, %{cpi: -0.7}},
    {6, %{cpi: -0.8}},
    {5, %{cpi: -0.9}},
    {4, %{cpi: -1.1}},
    {3, %{cpi: -1.5}},
    {2, %{cpi: -1.6}},
    {1, %{cpi: -1.4}}
  ]
}
My website displays the latest number, which is calculated each day. It also shows how the number is calculated by providing both components of the indicator, CPI and unemployment. One last thing is a comparison of the last four years, using the data from the last month of each year. Not a perfect situation, but it will do for comparison.
Extracting such information from the data structure presented above requires a lot of effort. A lot more than I expected, and I told myself that this was because I'm not fluent in Elixir. After I finished, I realised that it's not me, it is my data structure. Which is my fault, but it's not me.
That sparked an idea to change my data structure to something that map/reduce can handle more easily. This time, with some experience in processing data in pipelines, I decided to skip the nested structures, keep the data flat in a list, and use a proper date object.
[
  [~D[2016-12-01], {:unemployment, 8.2}],
  [~D[2016-11-01], {:unemployment, 8.2}],
  [~D[2016-10-01], {:unemployment, 8.2}],
  [~D[2016-09-01], {:unemployment, 8.3}],
  [~D[2016-08-01], {:unemployment, 8.4}],
  [~D[2016-07-01], {:unemployment, 8.5}],
  [~D[2016-06-01], {:unemployment, 8.7}],
  [~D[2016-05-01], {:unemployment, 9.1}],
  [~D[2016-04-01], {:unemployment, 9.4}],
  [~D[2016-03-01], {:unemployment, 9.9}],
  [~D[2016-02-01], {:unemployment, 10.2}],
  [~D[2016-01-01], {:unemployment, 10.2}]
]
This is perfect for map/reduce/filter operations. Saying that the code is simpler from my point of view does not make sense, as I spent a lot of time with it. A metric that can be helpful here is the number of added and removed lines. In total I removed 409 lines while adding 244, which is 165 lines fewer than before. After removing the lines that changed in tests, we get 82 removed and 67 added, which is around 25% less code doing the same thing. That is good news, but giving only LOCs could be misleading, as lines are not equal. So here is the code before:
def second_page(all_stats) do
  Enum.to_list(all_stats)
  |> Enum.map(fn {x, data} -> for d <- data, do: Tuple.insert_at(d, 0, x) end)
  |> List.flatten()
  |> Enum.sort(fn x, y -> elem(x, 0) >= elem(y, 0) && elem(x, 1) >= elem(y, 1) end)
  |> Enum.find(fn x -> map_size(elem(x, 2)) == 2 end)
  |> elem(2)
  |> Map.to_list()
end
And after.
def second_page(all_stats) when is_list(all_stats) do
  Enum.drop_while(all_stats, fn e -> length(e) < 3 end)
  |> hd
  |> tl
end
This is the most striking example from the codebase that illustrates what changes this can involve.
TIL:
My main takeaway from this experience is that mistakes at the start of a project may lead to disastrous consequences later on. The time spent on designing, which includes writing throw-away code when doing spikes, is the best investment you can make. Think about it before you start.
P.S.
Code is up on Gitlab, feel free to look and comment.