Hands-on Tutorials

Stratification, CUPED, Variance-Weighted Estimators, and ML-based methods CUPAC and MLRATE

Why do we need variance reduction?

When we run online experiments (A/B tests), we need high statistical power, that is, a high probability of detecting the experimental effect when it exists. Four factors affect power: sample size, the sampling variance of the experiment metric, the significance level alpha, and the effect size.

The canonical way to improve power is to increase the sample size. However, the gains are limited, since the minimum detectable effect (MDE) is proportional to 1/sqrt(sample_size). …
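To make the 1/sqrt(sample_size) relationship concrete, here is a minimal sketch. It assumes a two-sided, two-sample z-test with equal group sizes and a known metric standard deviation. The function name `minimum_detectable_effect` and its parameters are illustrative and do not come from the article.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, sigma, alpha=0.05, power=0.8):
    """Approximate MDE for a two-sided, two-sample z-test.

    Assumes equal group sizes and a known standard deviation `sigma`
    of the experiment metric (an illustrative simplification).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value
    z_power = standard_normal.inv_cdf(power)          # quantile for the power
    standard_error = sigma * sqrt(2 / n_per_group)    # SE of the mean difference
    return (z_alpha + z_power) * standard_error


mde_1k = minimum_detectable_effect(1_000, sigma=1.0)
mde_4k = minimum_detectable_effect(4_000, sigma=1.0)
mde_half_sigma = minimum_detectable_effect(1_000, sigma=0.5)

print(f"MDE with 1,000 users per group: {mde_1k:.4f}")
print(f"MDE with 4,000 users per group: {mde_4k:.4f}")
print(f"MDE with 1,000 users and half the sigma: {mde_half_sigma:.4f}")
```

Under these assumptions, quadrupling the sample size only halves the MDE, and halving the metric's standard deviation achieves the same result without any extra traffic.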

My favorite Python Viz tools — HoloViz

It is surprising to me that many data scientists do not know HoloViz. HoloViz is my favorite Python viz ecosystem, which comprises seven Python libraries — Panel, hvPlot, HoloViews, GeoViews, Datashader, Param, and Colorcet.

Why do I love HoloViz?

HoloViz lets users build Python visualizations and interactive dashboards with simple, flexible Python code. It supports several plotting backends, including bokeh, matplotlib, and plotly, so you can choose the one you prefer. Plus, it’s 100% open source!

Unlike the other Python viz and dashboarding options, HoloViz is very serious about supporting every reasonable context in which…

Hands-on Tutorials

Streaming and Refreshing

Data scientists use data visualization to communicate data and generate insights. It’s essential for data scientists to know how to create meaningful visualization dashboards, especially real-time dashboards. This article talks about two ways to get your real-time dashboard in Python:

  • First, we use streaming data and create an auto-updated streaming dashboard.
  • Second, we use a “Refresh” button to refresh the dashboard on demand.

For demonstration purposes, the plots and dashboards are very basic, but they show how a real-time dashboard is put together.

The code for this article can be found…

Minimal effort to make slides and host an HTML file on GitHub

Check out the slideshow of this article here: https://sophiamyang.github.io/slides_github_pages/.

There are two parts to this article:

  1. How to turn your Jupyter Notebooks into a slideshow and output it to an HTML file.
  2. How to host an HTML file on GitHub.

Jupyter Notebook slides

First, let’s create a new environment named slideshow, install the Jupyter Notebook extension RISE, and launch Jupyter Notebook:

conda create -n slideshow -c conda-forge python=3.9 rise
conda activate slideshow
jupyter notebook

Then create a Jupyter Notebook file as usual:

  • Click View→Toolbar→Slideshow to define the slide type for each cell.
  • RISE creates a button “Enter/Exit Live Reveal Slideshow” in the top right of…

Intake driver for Salesforce

A Salesforce database can be a hot mess. The figure below illustrates the relationships among some of the data tables in Salesforce. As you can see, the relationships among data tables (i.e., objects) can be complicated and hard to work with. I previously wrote a blog post on how to understand and query Salesforce data using the Salesforce Object Query Language (SOQL) through the Python API simple-salesforce. SOQL is a SQL-like language designed specifically for the relational data in Salesforce, and it is not the easiest to understand and write for people who are…

SQLAlchemy, Python Client for Google BigQuery, and bq command-line tool

How do you query BigQuery data? This article covers three ways to query BigQuery data in Python. I hope you find them useful.


conda install notebook google-cloud-bigquery sqlalchemy pybigquery


To authenticate with Google Cloud locally, you need to install the Google Cloud SDK and log in with the following command. More information can be found in the official documentation.

gcloud auth login

To authenticate through a credential file, you can create a service account and get the credential from it: go to the Google Cloud service account page, click on a project, click “+CREATE SERVICE ACCOUNT” and then…

Math and gradient descent implementation in Python

Multiclass logistic regression is also called multinomial logistic regression or softmax regression. It is used when we want to predict more than two classes. A lot of people use multiclass logistic regression all the time but don’t really know how it works. So, I am going to walk you through the math and implement it from scratch in Python using gradient descent.

Disclaimer: there are various notations on this topic. I am using the notation that I think is easy to understand and visualize. …
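As a rough illustration of the idea, here is a minimal pure-Python sketch of softmax regression trained with batch gradient descent on the cross-entropy loss. It is not the article's implementation: the helper names (`softmax`, `class_scores`, `fit`, `predict`), the toy data, the learning rate, and the iteration count are all assumptions chosen for this example.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(x, W):
    """Linear scores for one sample; W has shape n_features x n_classes."""
    n_classes = len(W[0])
    return [sum(x[j] * W[j][k] for j in range(len(x))) for k in range(n_classes)]


def fit(X, y, n_classes, lr=0.1, n_iter=1000):
    """Batch gradient descent on the mean cross-entropy loss."""
    n_features = len(X[0])
    W = [[0.0] * n_classes for _ in range(n_features)]
    for _ in range(n_iter):
        grad = [[0.0] * n_classes for _ in range(n_features)]
        for x, label in zip(X, y):
            probs = softmax(class_scores(x, W))
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. score k: p_k - 1[k == label]
                error = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    grad[j][k] += x[j] * error
        for j in range(n_features):
            for k in range(n_classes):
                W[j][k] -= lr * grad[j][k] / len(X)
    return W


def predict(X, W):
    """Return the index of the most probable class for each sample."""
    predictions = []
    for x in X:
        probs = softmax(class_scores(x, W))
        predictions.append(probs.index(max(probs)))
    return predictions


# Toy data (hypothetical): the first feature is a constant 1 acting as the bias.
X = [[1, 0.0, 0.0], [1, 0.1, 0.2],   # class 0, near the origin
     [1, 5.0, 0.0], [1, 5.2, 0.1],   # class 1, large second feature
     [1, 0.0, 5.0], [1, 0.1, 5.1]]   # class 2, large third feature
y = [0, 0, 1, 1, 2, 2]

W = fit(X, y, n_classes=3)
predictions = predict(X, W)
print(predictions)
```

Plain lists keep the sketch dependency-free. A NumPy version would replace the nested loops with matrix products.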

Using pytest and hypothesis for unit testing

Software testing is essential for software development. Software engineers are often advised to use test-driven development (TDD), a process in which test cases are written first and the software is developed afterwards. For data scientists, writing tests first is not always easy or feasible. Nevertheless, software testing is important: every data scientist should know how to write unit tests and use them in their data science workflow. A lot of data scientists already use assertions, which is a very important first step toward test-driven development. This article will step up from assertions and focus…
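To illustrate the step from inline assertions to unit tests, here is a small sketch of pytest-style test functions. The `standardize` function and the test names are hypothetical examples, not taken from the article. pytest collects functions whose names start with `test_`, so the block itself needs no pytest import. Property-based testing with hypothesis is not shown here.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mu) / sd for v in values]


def test_standardize_centers_and_scales():
    result = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(mean(result)) < 1e-12
    assert abs(pstdev(result) - 1.0) < 1e-12


def test_standardize_rejects_constant_input():
    # Written with try/except so the sketch also runs without pytest installed.
    try:
        standardize([5.0, 5.0, 5.0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for constant input")
```

Saved in a file such as `test_standardize.py` (an assumed name), these tests would run with the `pytest` command. The same functions can also be called directly.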

Setup, debug, version control, and deployment

Many data scientists like to use Jupyter Notebook or JupyterLab to do their data explorations, visualizations, and model building. I know some data scientists refuse to use Jupyter Notebook. But, I love to use Jupyter Notebook/Lab to do my experiments and explorations. Here is a Jupyter Notebook workflow that might be helpful.


Setup environment

Whenever you start a new project, you should begin with a fresh environment. If you are lazy and don’t want to create a new environment for every project, you should at least create one new environment that’s separate from your base environment…

cProfile and line_profiler

Profiling helps us identify bottlenecks and optimize performance in our code. If our code runs slowly, we can identify where our code is slow and then make corresponding improvements.

Here are the explanations of Python profilers from the Python documentation:

cProfile and profile provide deterministic profiling of Python programs. A profile is a set of statistics that describes how often and for how long various parts of the program executed…cProfile is recommended for most users; it’s a C extension with reasonable overhead that makes it suitable for profiling long-running programs.

cProfile is a Python built-in profiler, which is great for…
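As a minimal sketch of what profiling with cProfile can look like, the example below profiles a deliberately slow function and prints the five most expensive entries by cumulative time. The function `slow_sum` is a made-up workload for illustration. line_profiler is a third-party package and is not shown.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
# runcall profiles a single function call and returns its result.
result = profiler.runcall(slow_sum, 200_000)

# Collect the report in a string instead of writing straight to stdout.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, the same profiler can be run from the command line with `python -m cProfile -s cumulative script.py`, where `script.py` stands in for your own file.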

Sophia Yang

Ph.D. | Senior Data Scientist @ Anaconda | Twitter @ sophiamyang | All views are my own
