Why Scrum is awful for data science

This post was originally published by Isaac Godfried at Towards Data Science

More and more data science teams seem to be hopping on the scrum bandwagon, but is it a good idea?

Scrum is not suitable for most data science teams.

Scrum is a popular project management methodology in software engineering, and the trend has recently carried over to data science. While the utility of Scrum in standard software engineering may remain up for debate, here I will detail why it unquestionably has no place in data science (and data engineering as well). This is not to say that “Agile” as a whole is bad for data science, but rather that the specific practices of Scrum (sprints, a single product owner, a scrum master, daily stand-ups, and the litany of other meetings) fit data science teams poorly and ultimately result in poorer products.

The Sprint/Estimation

Scrum prioritizes creating “deliverables,” often in two-week sprints. While this might arguably work well for certain areas of software engineering, it fails spectacularly in the data science world. Data science is by its very nature a scientific process involving research, experimentation, and analysis. Data science projects are very difficult to estimate because they often ask the team to do something that hasn’t been done before. While data scientists may have designed similar models in the past, they likely haven’t worked with the same dataset or used the specific technique required, which means there is a lot of uncertainty in the process. Poorer-than-expected data quality, problems with hyper-parameter tuning, or a technique simply not working can all cause a failure to “deliver” by the end of the sprint. As a result, point estimates are often less than worthless, since they are based on prior projects that rarely resemble the one currently in progress.

Moreover, even when data scientists do “deliver” the required items for a sprint, in many cases they have sacrificed code quality, model robustness, or documentation in order to meet the arbitrary end of the sprint. I have often heard management praise the “added urgency” of a two-week sprint. But remember that this urgency also has drawbacks, chiefly that data scientists are more likely to make mistakes and overlook things.

On the opposite end of the spectrum, I’ve seen data scientists who finished their work early hesitate to pull in new stories for fear of not being able to complete them by the end of the sprint. Instead, they sit idle for several days until the next sprint begins.

But couldn’t we break up these big tasks? Proponents of Scrum will argue that the issue here is not Scrum, but simply the need to better break up big tasks (likely with additional time-consuming grooming sessions). However, even breaking up big tasks does not remove the uncertainty inherent in data science. For instance, a task like “train an XGBoost model and report results” might take much longer than a single sprint because the data contains missing values that need to be handled, or because the needed data is not present at all. Yes, this could be addressed with a prior story, “explore the dataset and fill missing values,” but as I will describe shortly, most POs lack the expertise to prioritize these types of stories because they don’t fulfill an immediate deliverable.
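The kind of hidden work that blows up such a story can be sketched in a few lines. Here is a minimal, hypothetical pre-flight audit (the column names and sample rows are invented for illustration) that surfaces missing values before anyone commits to a “train a model and report results” estimate:

```python
# Hypothetical sketch: audit a small dataset for missing values before
# estimating a model-training story. Columns and rows are made up.
rows = [
    {"age": 34, "income": 72000, "label": 1},
    {"age": None, "income": 58000, "label": 0},
    {"age": 41, "income": None, "label": 1},
]

def missing_report(rows):
    """Return {column: count of missing values} across all rows."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            if val is None:
                counts[col] = counts.get(col, 0) + 1
    return counts

report = missing_report(rows)
print(report)  # any non-empty report means extra imputation work first
```

If the report comes back non-empty, the “single” training task has just spawned imputation work that no point estimate accounted for.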

Constant Pivoting

Related to the above point, Scrum often results in constant pivoting from one project to another. True, many consider pivoting a “desirable” trait; however, this constant change in direction often results in nothing getting done and in promising projects being shelved simply because they aren’t producing immediate deliverables. This is particularly true in data science, where many projects require a long-term investment of both employee time and resources. I have seen promising projects discontinued many times because they didn’t deliver performance improvements fast enough, or because the product owner simply saw something flashier they wanted to focus on.

Lack of cross-team pollination

Scrum often creates a horribly narrow focus on one’s own team’s sprint tickets to the exclusion of everything else. It discourages data scientists (or really anyone) from contributing to other initiatives around their company; other initiatives where their skills could potentially be of help. It also has the tendency to push off important issues that could affect other teams unless that team is in direct contact with the product owner.

The role of the product owner (PO)

Another key problem with Scrum is that it places too much power in the hands of the PO. The PO is generally in charge of the backlog and determines which issues need to be prioritized. However, product owners generally have a poor understanding of the technical nuances of data science projects. Therefore, needed work such as refactoring code or further analyzing model performance often gets pushed to the back. Additionally, a lack of immediate “progress” might result in the product owner moving away from a project entirely. This isn’t to say that data scientists shouldn’t regularly communicate with stakeholders to determine the priority of tickets, but rather that having a dedicated product owner attend every meeting and decide the priority of tasks is counterproductive both for the team and, in the long term, for the product itself.

Daily Standups, grooming and other wastes of time

I’ve seen very few teams, if any, that need to meet on a daily basis. Communication between teammates is important, but meeting two or three times per week will usually more than suffice. Likewise, teammates should be encouraged to reach out whenever they are blocked or need help. A daily stand-up, however, often does nothing but micro-manage employees.

Grooming (or refinement) is a meeting of the Scrum team in which product backlog items are discussed and the next sprint planning is prepared.

Grooming is another session that needlessly wastes time. As I mentioned above, technical complexity in data science often means that sprint goals will often not be met or met with subpar results. This in turn often results in the justification for even more grooming meetings (or pre-grooming as we used to call them) in order to “break down those big issues.” In a never-ending cycle these meetings continue to eat up more and more data scientist time.

The retro is one of the few scrum meetings that I like, however, suggestions at these meetings are often not taken seriously. For instance, at several sprint retros I’ve attended during my career, the majority of teammates recommended not having a daily stand-up, but the scrum master and management discounted these suggestions because “that would not be scrum.” However, in contrast, suggestions for adding more grooming sessions are almost always enacted without question.

The Scrum Master

Another role that is essentially useless is the scrum master. The formal definition of a scrum master is:

“The scrum master is the team role responsible for ensuring the team lives agile values and principles and follows the processes and practices that the team agreed they would use.” - Agile Alliance

What…? In practice, the scrum master acts as a non-technical busybody who coerces team members into attending the aforementioned pointless meetings and drags JIRA cards around, all while preaching the canon of how Scrum will lead your team to salvation (e.g., more points delivered per sprint).

False Dichotomy and the no true Scrum argument

Finally, proponents of Scrum often set up a false dichotomy, comparing Scrum only to waterfall and other older project management methods. Moreover, in many companies, management takes an all-or-nothing approach. But it is possible to adopt aspects of Scrum, Agile, or other forms of project management without adhering to them completely. For instance, you could use ideas like stories and epics without a product owner, sprints, or a scrum master.

Another common trend I see frequently is for people to say, “What you experienced was not true Scrum; blah blah is actually waterfall. If only you had a better product owner…” What this argument fails to recognize is that Scrum as a system breeds these types of problems. Delegating prioritization to a singular product owner is bound to cause problems. Sure, you could have an exceptionally good PO with years of data science experience who understands the team, but that likely won’t be the case. Moreover, the sprint at its core encourages frantic rushing at the end of an arbitrarily decided duration in order to meet the “commitment.” It also fundamentally assumes that all work can be concretely estimated. Scrum could possibly work in areas of software engineering where problems are very well defined and only slight variations of previous ones (though even then you have the issues with POs and scrum masters). However, wherever there is real uncertainty, as there is in data science, data engineering, and DevOps, Scrum breaks down and wastes both time and resources.

What should you use instead?

This leads to the central question of how you should manage a data science team. There is no single answer. What I’ve found to work well is a Kanban-based approach without a product owner, but with regular discussions with stakeholders (weekly or every other week). Additionally, work-in-progress limits seem to help streamline the process. As I mentioned above, meetings twice per week (e.g., Tuesday/Friday) often work well.
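The work-in-progress limit is simple enough to express as a rule: a card may only be pulled into the active column when that column is under its limit. A minimal sketch, with an invented board and an assumed limit of two:

```python
# Hypothetical sketch of a Kanban WIP limit. The board contents and
# the limit of 2 are assumptions for illustration, not a prescription.
board = {
    "todo": ["story-4", "story-5", "story-6"],
    "in_progress": ["story-1", "story-2"],
    "done": ["story-0"],
}
WIP_LIMIT = 2  # max cards allowed in "in_progress" at once

def can_pull(board, limit=WIP_LIMIT):
    """A new card may be pulled only if the WIP column is under its limit."""
    return len(board["in_progress"]) < limit

print(can_pull(board))  # False: finish something before starting more
```

The point of the limit is exactly the opposite of the sprint’s end-of-cycle rush: instead of starting everything and finishing under pressure, the team finishes what is in flight before pulling more in.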

However, this approach may not work well for all teams. That is why, particularly for data science, I’d recommend trying out several different approaches to determine what works for your team. The key is to find a system that serves your team and stakeholders, rather than one that merely placates upper management’s ideas of how a data science team should operate.
