Friday, November 15, 2024
Google search engine
HomeData Modelling & AIHow to be data-driven without data…

How to be data-driven without data…

…and then make better use of the data you get.

The usefulness of data science begins long before you collect the first data point. It can be used to describe very clearly your questions and your assumptions, and to analyze in a consistent manner what they imply. This is neither a simple exercise nor an academic one: informal approaches are notoriously bad at handling the interplay of complex probabilities, yet even the a priori knowledge embedded in personal experience and publicly available research, when properly organized and queried, can answer many questions that mass quantities of data, processed carelessly, wouldn’t be able to, as well as suggest what measurements should be attempted first, and what for.

The larger the gap between the complexity of a system and the existing data capture and analysis infrastructure, the more important it is to set up initial data-free (which doesn’t mean knowledge-free) formal models as a temporary bridge between both. Toy models are a good way to begin this approach; as the British statistician George E.P. Box wrote, all models are wrong, but some are useful (at least for a while, we might add, but that’s as much as we can ask of any tool).

Let’s say you’re evaluating an idea for a new network-like service for specialized peer-to-peer consulting that will have the possibility of monetizing a certain percentage of the interactions between users. You will, of course, capture all of the relevant information once the network is running — and there’s no substitute for real data — but that doesn’t mean you have to wait until then to start thinking about it as a data scientist, which in this context means probabilistically.

Note that the following numbers are wrong: it takes research, experience, and time to figure out useful guesses. What matters for the purposes of this post is describing the process, oversimplified as it will be.

You don’t know a priori how large the network will be after, say, one year, but you can look at other competitors, the size of the relevant market, and so on, and guess, not a number (“our network in one year will have a hundred thousand users”), but the relative likelihood of different values.

 /></p>
<p>The graph above shows one possible set of guesses. Instead of giving a single number, it “says” that there’s a 50% chance that the network will have <i>at least</i> a hundred thousand users, and a 5.4% chance that it’ll have at least half a million (although note that decimals points in this context are rather pointless; a guess based on experience and research can be extremely useful, but will rarely be this precise). On the other hand, there’s almost a 25% chance that the network will have less than fifty thousand users, and a 10% chance that it’ll have less than twenty-eight thousand.</p>
<p>How do you build such a graph, or rather, how do you assemble the information represented on it? The answer will probably look surprisingly old-fashioned: by learning as much as you can about the topic, talking with people who know about it, exercising your judgment, and then using formal mathematics to force yourself to write your best guess in a way that’s explicitly clear about what it says and what it doesn’t. The first steps are things you were already doing to help you with your problem, but the last one is what will allow you to coordinate knowledge and experience from different sources to give you the best possible answer to your question, given whatever you know at that moment.</p>
<p>You can use the same process to codify your educated guesses about other key aspects of the application, like the rate at which members of the network will interact, and the average revenue you’ll be able to get from each interaction. As always, neither these numbers nor the specific shape of the curves matter for this toy example, but note how different degrees and forms of uncertainty are represented through different types of probability distributions:</p>
<p><img decoding=blog.rinesi.com/

RELATED ARTICLES

Most Popular

Recent Comments