What I learnt from giving 120+ Data Science presentations


This post was originally published by Das Wijesundera at Towards Data Science

In 1869, French civil engineer Charles Minard visualised the movement of Napoleon’s army during the Russian campaign of 1812. Whilst such charts are difficult to construct even now, it is a stunningly simple chart that articulates six variables in one 2D plot: temperature, troop count, travelled distance, direction of travel, lat/long and location relative to specific dates. It literally tells the sad story of conquest through data.

Organise your insights so that they not only show data but are also easy to read. Two basic yet recurring problems I’ve spotted in presentations are a lack of comparative measures and poor use of colour.

Rulers and scales

I feel that some background is necessary to explain why our visual prowess often fails.

Human brains are tuned to spot differences in similar things, but only if they are noticeably different from one another. We easily misinterpret or underestimate certain data points in large charts (bar or line) if there is no common reference for spotting differences across multiple items.

This is probably a trivial explanation, but the relationship between actual difference and perceived (subjective) difference is described by Weber’s Law.

In addition, Stevens’ Power Law compounds this perceptual bias, as each person has a unique scaling system, one that is non-linear across length, area and volume.

2016 US Dem presidential candidate, NBC News — Clinton lines can be compared, but how about Sanders?

For example, when any two people estimate the difference between two bars in a bar graph (i.e. the ratio), they would come up with different estimates. Whilst this might be incredibly small in most circumstances, comparing multiple bars in unstructured graphs results in a compounded effect. This holds true for length, area, volume, brightness, sound intensity (loudness) and weight.
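To make the two laws concrete, here is a minimal Python sketch. Weber's Law says a difference is only noticeable once it exceeds a roughly constant fraction of the baseline (delta / baseline >= k), and Stevens' Power Law models perceived magnitude as k * stimulus ** a. The exponents and the 5% Weber fraction below are commonly quoted approximations used here as illustrative assumptions, not figures from this post, and the function names are hypothetical.

```python
# Illustrative, approximate exponents (assumptions for this sketch only).
STEVENS_EXPONENTS = {"length": 1.0, "area": 0.7}


def perceived_ratio(actual_ratio: float, exponent: float) -> float:
    """Stevens' Power Law applied to a ratio of two stimuli.

    Perceived magnitude is modelled as k * stimulus ** exponent, so the
    constant k cancels when two stimuli are compared as a ratio.
    """
    if actual_ratio <= 0:
        raise ValueError("actual_ratio must be positive")
    return actual_ratio ** exponent


def is_noticeable(baseline: float, other: float, weber_fraction: float = 0.05) -> bool:
    """Weber's Law: a difference registers when delta / baseline >= k.

    The default weber_fraction of 0.05 is a made-up illustrative value.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(other - baseline) / baseline >= weber_fraction


if __name__ == "__main__":
    for encoding, exponent in STEVENS_EXPONENTS.items():
        print(encoding, round(perceived_ratio(2.0, exponent), 2))
    print(is_noticeable(100, 102), is_noticeable(100, 110))
```

Under these assumptions, a shape with twice the area reads as only about 1.6 times larger, whereas a bar twice as long reads as roughly twice as long. That is one reason length against a common edge tends to compare more reliably than area or volume.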

To fix this, we can use periodic rulers (or a common edge as rulers), as they provide the best comparative tool in charts. This is invaluable in establishing a baseline to compare two or more items.
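As one way to add periodic rulers in Python, the sketch below draws a bar chart with horizontal gridlines at a fixed interval and a zero baseline as the common edge. It assumes matplotlib is installed; the function name, the numbers and the step size are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator


def bar_chart_with_rulers(labels, values, step):
    """Draw bars against evenly spaced horizontal gridlines (the 'rulers')."""
    fig, ax = plt.subplots()
    ax.bar(labels, values, color="steelblue")

    # Place a major tick, and therefore a gridline, every `step` units.
    ax.yaxis.set_major_locator(MultipleLocator(step))
    ax.yaxis.grid(True, color="lightgrey", linewidth=0.8)

    # Keep gridlines behind the bars and start the axis at zero so every
    # bar shares the same baseline.
    ax.set_axisbelow(True)
    ax.set_ylim(bottom=0)
    return fig, ax


# Made-up values, purely for illustration.
fig, ax = bar_chart_with_rulers(["A", "B", "C", "D"], [42, 47, 39, 45], step=10)
```

Calling fig.savefig("bars_with_rulers.png") afterwards would write the chart to disk. The step size is a judgment call: too fine and the gridlines become noise, too coarse and close values are still hard to tell apart.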

Whilst this fix seems very trivial, it is something we often forget to consider. These errors are most prominent in stacked and area charts, where it is much harder to use rulers. Therefore, I avoid using these and stick to simpler charts.

I spent 15 mins recreating the right side of the above chart (in Excel). I can now discern differences between candidates & race. It is not as clean as above, but now you can study it!

Contrast

Use of colour, specifically the difference in contrast, is critical in drawing attention. We usually include colour to differentiate between multiple variables, typically drawing on the sequential, diverging and qualitative colour palettes available in Python and R charting libraries.

It is important to question what is relevant to show: one variable, or several. The former is easier to interpret and requires little variation in colour, so it can make impactful distinctions with just two contrasting colours. The latter is ideal for articulating variation and complexity across variables, and for describing multi-faceted problems and insights. The fewer variables, the better. If you really want to go the extra mile, group them by importance and relevance to the audience.
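The single-variable, two-contrast idea can be sketched as a small helper that gives one category a strong colour and mutes everything else. The function name and the two hex colours are assumptions chosen for illustration; any strong and muted pair with enough contrast would do.

```python
def highlight_palette(categories, focus, focus_colour="#d62728", muted_colour="#c7c7c7"):
    """Return one colour per category: strong for `focus`, muted for the rest."""
    if focus not in categories:
        raise ValueError(f"{focus!r} is not one of the categories")
    return [focus_colour if c == focus else muted_colour for c in categories]


# Made-up category names, purely for illustration.
colours = highlight_palette(["North", "South", "East", "West"], focus="East")
print(colours)
```

The resulting list can be passed as the colour argument of most plotting calls, for example the color parameter of a matplotlib bar chart, so the audience's eye goes straight to the one series you are talking about.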

There’s a detailed post on the science of how we recognise patterns and interpret them — I highly recommend it!

The majority of our projects could eventually drain the enthusiasm out of us. No matter how much time we spend cleansing data and passionately optimising, ultimately all stakeholders care about the purpose and outcome of our code. The detail of that work is not interesting in the grand scheme of things, at least not for the non-technical audience who can fund further work.

Giphy: When someone asks me “So what?”, I hear Chandler’s voice.

I had to remind myself to focus on what is relevant, what made the frustrating challenges interesting to solve and what interesting things I had learnt along the way.

I incorporated different graphs to increase engagement during presentations, included long term goals of production models, strategised ways to gain additional benefits with other projects and even spent time finding well organised templates with animations.

At worst, your enthusiasm and genuine interest will get your stakeholders and audience excited for you, and you will learn something interesting from the experience and move on. At best, they will support you, spread your word and rally the right people to take the right decisions with you.

Interaction at the start of a presentation can ground your audience and focus them on what you want to draw their attention to. This holds particularly true if they have had to sit through multiple presentations before you are up! An interesting fact about the project, an intriguing statistic or a “show of hands…” question are all great ways of interacting with your audience.

In general, you can also use this opportunity to understand the level of prior knowledge they have and the level of complexity they are comfortable with.

Of course, they could lie or be disengaged, but by interacting with them, the audience has a chance to build trust and break any awkward barriers when it comes to asking the presenter questions.

If someone is new to our world, then measuring progress with terms like accuracy, precision/recall and AUC might make things more confusing.

Be very specific about how you are evaluating the performance of a model. Make time to recap or provide the audience cliff notes on how to interpret the results. If it is AUC, explain why you would use this method, the results and what it means for the project. Always link back to the bigger picture and the value it offers.
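One plain-language reading of AUC that tends to work for a non-technical audience is "the chance that the model scores a randomly chosen positive case above a randomly chosen negative one". The sketch below computes AUC directly from that definition in pure Python, counting ties as half a win. The function name, labels and scores are made up for illustration; in practice you would more likely call an existing library routine.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of model scores, higher meaning 'more likely positive'.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, purely for illustration.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
example_auc = auc_by_pairs(example_labels, example_scores)
print(f"AUC = {example_auc:.2f}")
```

Here eight of the nine positive/negative pairs are ordered correctly, so the AUC is about 0.89. A value of 0.5 corresponds to random ordering and 1.0 to perfect separation, which gives the audience two anchors for judging any number you show them.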

If the results aren’t promising, or you believe that more effort is necessary to improve, be transparent and justify a potential approach. Same applies if it is a dead-end and you’ve exhausted all paths.
