A woman dies in childbirth every fifteen minutes. That statistic is from India alone; the WHO estimates India accounts for roughly 12% of all maternal deaths worldwide. The majority are preventable. Not because the medicine doesn't exist, but because the information doesn't reach the people who need it.

This is the problem ARMMAN has been chipping away at since 2008. Their program mMitra sends free, timed voice calls to enrolled women throughout pregnancy and up to a year after childbirth — 141 messages in total, in the woman's chosen language, at a time she picks. It's reached over 2 million women. It's been shown to actually move health parameters in the right direction.

The problem: a lot of women stop picking up.

The Real Problem Is Attrition

When a woman stops engaging with mMitra, she doesn't send a cancellation email. She just gradually stops answering. Maybe life gets in the way. Maybe she moved. Maybe the calls started arriving at the wrong time. The program keeps calling, but nobody picks up, and the health information that could've mattered goes unheard.

ARMMAN has health workers who can intervene — make a personal call, send a reminder, connect a woman with a counselor. But there are over 300,000 women in the program. You can't intervene with everyone. You need to know who to call.

That's the problem we were asked to solve. Working with ARMMAN and Google Research India, our job was to build models that could flag women at risk of dropping out before they actually did — early enough for an intervention to still matter.

The Data

We had access to anonymized call records for 329,489 women registered in 2018, amounting to over 70 million individual call log entries. Each entry told us whether a call was attempted, whether it connected, and how long it lasted. On top of that, we had demographic information collected at registration: age, education level, income group, language, gestational age, and preferred call time.

A quick note on a data quirk that took some working through: not every call failure means the woman ignored it. A network failure looks identical to a missed call in the logs, and for every failed call the program retries up to two more times. We collapsed all of these into the best outcome per message (the longest call duration), which reduced the dataset from 70 million rows to a more manageable 27 million. We also drew a line at 30 seconds: anything shorter than that we didn't count as a real engagement, just a connected-but-not-really.

So: attempt means a call was made, connection means someone picked up, engagement means they stayed on for at least 30 seconds. Three distinct things that look deceptively similar in the raw logs.
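
For a feel of what that collapse looks like, here's a minimal sketch in pandas. The schema (`woman_id`, `message_id`, `duration_sec`) is hypothetical, not ARMMAN's actual log format:

```python
import pandas as pd

# Hypothetical call-log schema: one row per call attempt.
logs = pd.DataFrame({
    "woman_id":     [1, 1, 1, 2, 2],
    "message_id":   [7, 7, 7, 7, 7],     # same mMitra message, multiple retries
    "duration_sec": [0, 0, 95, 12, 0],   # 0 = never connected
})

# Collapse retries: keep the best outcome (longest duration) per message.
best = (
    logs.groupby(["woman_id", "message_id"], as_index=False)["duration_sec"]
        .max()
)

# Derive the three distinct events from that single duration value.
best["attempted"] = True                        # a row exists, so a call was made
best["connected"] = best["duration_sec"] > 0    # someone picked up
best["engaged"]   = best["duration_sec"] >= 30  # stayed on for at least 30 seconds

print(best)
```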

Two Different Problems

We framed this as two separate prediction tasks, because the interventions are different.

The first is short-term dropout prediction: given the last four weeks of a woman's call history, will she engage with any calls in the next two weeks? This is the "call her now" signal — the kind of flag that tells a health worker to make a personal check-in this week.

The second is long-term dropout prediction: given the first two months after registration, is this woman likely to disengage over the following six-plus months? This is the "she needs a counselor" signal — the kind of early warning that lets you plan a more sustained intervention.
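
Roughly how those two windows turn into training examples, assuming per-week engagement counts per woman. The window boundaries and the simple binary labels here are my simplification, not the paper's pipeline:

```python
import numpy as np

def short_term_example(weekly_engagements, week):
    """Features: the last 4 weeks of engagement counts.
    Label: did she engage with any call in the next 2 weeks?"""
    x = weekly_engagements[week - 4:week]
    y = int(weekly_engagements[week:week + 2].sum() > 0)
    return x, y

def long_term_example(weekly_engagements):
    """Features: the first ~2 months (8 weeks) after registration.
    Label: did she keep engaging over the following six months?"""
    x = weekly_engagements[:8]
    y = int(weekly_engagements[8:8 + 26].sum() > 0)
    return x, y

# One woman's (made-up) weekly engagement counts since registration.
weeks = np.array([2, 2, 1, 1, 1, 0, 0, 0] + [0] * 26)
print(short_term_example(weeks, week=8))  # ([1, 0, 0, 0], 0) -> short-term dropout risk
print(long_term_example(weeks))           # goes silent after week 5 -> long-term dropout
```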

We also split the long-term task further: are we predicting a low engagement-to-connection ratio (she connects but doesn't actually listen) or a low connection-to-attempt ratio (she's barely picking up at all)? These are different problems with different likely causes.
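
In code, those two signals are straightforward once you have per-woman counts over the prediction window. The 0.5 cutoffs below are illustrative placeholders, not the thresholds we actually used:

```python
def long_term_labels(attempts, connections, engagements,
                     low_e2c=0.5, low_c2a=0.5):
    """Two separate 'low engagement' signals over the prediction window.

    engagement-to-connection: she picks up but doesn't actually listen.
    connection-to-attempt:    she's barely picking up at all.
    Thresholds here are placeholders for illustration.
    """
    e2c = engagements / connections if connections else 0.0
    c2a = connections / attempts if attempts else 0.0
    return {"low_engagement_to_connection": e2c < low_e2c,
            "low_connection_to_attempt": c2a < low_c2a}

# A woman who answers most calls but rarely stays on for 30+ seconds.
print(long_term_labels(attempts=40, connections=30, engagements=6))
# {'low_engagement_to_connection': True, 'low_connection_to_attempt': False}
```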

The Models: CoNDiP and ReNDiP

Yes, we named them. It's a prerogative of grad school.

The core challenge was combining two kinds of information: static demographics (collected once at registration) and sequential call history (a time series of variable length). A random forest over demographics alone was our baseline — it got 70% accuracy and an AUC of 0.83. Decent, but it was ignoring most of the signal.
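
The baseline was nothing exotic; something in the spirit of the sklearn sketch below, run over encoded demographic features. The synthetic data and hyperparameters here are stand-ins, not the real setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in demographic features (age, education, income group, language, ...),
# already numerically encoded. Purely synthetic, just to make the sketch run.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 6))
y = (X_demo[:, 0] + rng.normal(size=5000) > 0).astype(int)  # fake dropout label

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("AUC:     ", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
```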

CoNDiP (Convolutional Neural Disengagement Predictor) encodes the call history using 1D convolutions — think of it as learning local temporal patterns in the sequence, like "three misses in a row" or "engagement dropping off over two weeks." The demographic features go through a separate feedforward network, and then both encodings get concatenated and fed into a final classification head.
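
A PyTorch sketch of that shape, with layer widths that are guesses rather than the paper's actual configuration:

```python
import torch
import torch.nn as nn

class CoNDiP(nn.Module):
    """Sketch of the convolutional variant: a 1D-CNN over the call-history
    sequence plus an MLP over static demographics, concatenated into one
    classification head. Layer sizes are illustrative."""

    def __init__(self, seq_channels, demo_dim, hidden=64):
        super().__init__()
        self.seq_encoder = nn.Sequential(
            nn.Conv1d(seq_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time -> fixed-size vector
        )
        self.demo_encoder = nn.Sequential(
            nn.Linear(demo_dim, hidden), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),      # logit: risk of disengagement
        )

    def forward(self, calls, demo):
        # calls: (batch, channels, weeks), e.g. per-week attempt/connect/engage counts
        # demo:  (batch, demo_dim) static registration features
        h_seq = self.seq_encoder(calls).squeeze(-1)
        h_demo = self.demo_encoder(demo)
        return self.head(torch.cat([h_seq, h_demo], dim=-1))

model = CoNDiP(seq_channels=3, demo_dim=6)
logits = model(torch.randn(8, 3, 4), torch.randn(8, 6))  # 8 women, 4 weeks of history
print(logits.shape)  # torch.Size([8, 1])
```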

ReNDiP (Recurrent Neural Disengagement Predictor) swaps the convolutional layers for a bidirectional LSTM, which reads the call sequence in both directions and captures longer-range dependencies. The rest of the architecture is the same.
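
And the recurrent sibling, again as a sketch with guessed sizes; it keeps the same demographic MLP and head structure and swaps the sequence encoder for a bidirectional LSTM:

```python
import torch
import torch.nn as nn

class ReNDiP(nn.Module):
    """Sketch of the recurrent variant: same demographic MLP and head as the
    CoNDiP sketch above, with the 1D-CNN replaced by a bidirectional LSTM."""

    def __init__(self, seq_channels, demo_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(seq_channels, hidden // 2,
                            batch_first=True, bidirectional=True)
        self.demo_encoder = nn.Sequential(nn.Linear(demo_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, calls, demo):
        # calls: (batch, weeks, channels); demo: (batch, demo_dim)
        _, (h_n, _) = self.lstm(calls)
        h_seq = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward + backward final states
        return self.head(torch.cat([h_seq, self.demo_encoder(demo)], dim=-1))

model = ReNDiP(seq_channels=3, demo_dim=6)
print(model(torch.randn(8, 8, 3), torch.randn(8, 6)).shape)  # torch.Size([8, 1])
```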

For short-term prediction, CoNDiP won: 83% accuracy, AUC of 0.908, versus 0.831 for the random forest. The convolutional approach was better at picking up on the recent local pattern — "she stopped answering three weeks ago" — which is exactly the kind of signal that matters for short-term risk.

For long-term prediction, ReNDiP pulled ahead slightly. Makes sense: predicting behavior over six months requires understanding the arc of a woman's engagement over the first two months, not just the last week. The LSTM was better at capturing that trajectory.

// the headline numbers

Short-term: CoNDiP achieves 83% accuracy, 85% recall on high-risk women — a 13% improvement over the random forest baseline.

Long-term: ReNDiP achieves 76% accuracy, 7% better than baseline; in the pilot deployment this rose to 84% as the model's predictions were validated against real behavior.

Putting It in the Field

The part of this project I'm proudest of isn't the model numbers — it's that we actually deployed it.

We built a dashboard for ARMMAN's health workers: a simple interface showing each beneficiary's predicted dropout risk, so workers could prioritize who to call. The pilot ran on 18,766 women who registered in November 2019. We had real call logs to validate against, and the results held up — better than held up, actually. In the pilot, the model hit 84% accuracy at a "minimum 15 connections" threshold, compared to 76% during training.

Some of that improvement was a distribution shift: the November 2019 cohort happened to have a higher proportion of low-risk women, and the model was better at identifying those. But even controlling for that, the core finding stood — the model was genuinely useful in the field.
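
Conceptually, the dashboard's worklist is just a ranking by predicted risk over women with enough call history to score reliably. The minimum-connections filter below mirrors the threshold from the pilot evaluation, though the column names and the exact logic are my assumptions, not the dashboard's actual code:

```python
import pandas as pd

# Hypothetical dashboard table: one row per beneficiary, with the model's
# predicted dropout risk and her total connected calls so far.
beneficiaries = pd.DataFrame({
    "woman_id":     [101, 102, 103, 104],
    "dropout_risk": [0.91, 0.40, 0.86, 0.77],
    "connections":  [22, 30, 9, 18],
})

# Only rank women above the minimum-connections threshold, highest risk first,
# so health workers see the most urgent check-ins at the top of their list.
worklist = (
    beneficiaries[beneficiaries["connections"] >= 15]
    .sort_values("dropout_risk", ascending=False)
)
print(worklist)
```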

We also ran a pre-pandemic analysis separately (cutting off at March 10, 2020), because lockdowns obviously changed call behavior in ways that had nothing to do with maternal health engagement. The numbers held there too.

What I'd Say Honestly

This paper isn't the most technically ambitious thing I've worked on. The models are solid applications of existing deep learning methods — 1D CNNs and LSTMs weren't new ideas in 2020. The novelty is in the framing, the data pipeline, and the deployment. Taking a well-understood problem (churn prediction) and making it work in a domain with noisy call logs, no clear outcome labels, and real-world stakes is its own kind of hard.

What made this project stick with me is simpler than that. The gap between "a model that works on a held-out test set" and "a model that a health worker actually uses to make decisions" is enormous, and most ML research never crosses it. We crossed it, at least at pilot scale. Every high-risk prediction that led to a health worker making a call that kept a woman in the program is the whole point.

The bigger problem — figuring out not just who to intervene with, but how — is still open. The paper gestures at reinforcement learning as a next step, using intervention data to learn optimal sequences of outreach.

// paper

Missed calls, Automated Calls and Health Support: Using AI to improve maternal health outcomes by increasing program engagement
Siddharth Nishtala, Harshavardhan Kamarthi, Divy Thakkar, et al.
ARMMAN & Google Research India · IIT Madras · arXiv:2006.07590