When we run online experiments or A/B tests, we need to ensure the test has high statistical power, so that we have a high probability of detecting the experimental effect if it exists. What factors affect power? Sample size, the sampling variance of the experiment metric, the significance level alpha, and the effect size.
The canonical way to improve power is to increase the sample size. However, the gain is limited, since the minimum detectable effect (MDE) is proportional to 1/sqrt(sample_size), so halving the MDE takes four times as many samples. …
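The 1/sqrt(sample_size) relationship can be sketched in a few lines. This is a minimal illustration, not the article's own code: it assumes a two-sided, two-sample z-test with equal group sizes and a known metric standard deviation, and the function name `minimum_detectable_effect` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, sigma, alpha=0.05, power=0.8):
    """Approximate MDE for a two-sided, two-sample z-test (assumed setup).

    n_per_group: samples in each of the two groups (assumed equal)
    sigma: standard deviation of the experiment metric (assumed known)
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    # Standard error of the difference in means is sigma * sqrt(2 / n).
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_group)


# Quadrupling the sample size only halves the MDE.
for n in (1_000, 4_000, 16_000):
    print(n, round(minimum_detectable_effect(n, sigma=1.0), 4))
```

Under these assumptions, each fourfold increase in sample size cuts the MDE in half, which is why sample size alone is an expensive lever for power.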
It is surprising to me that many data scientists do not know HoloViz. HoloViz is my favorite Python viz ecosystem, which comprises seven Python libraries — Panel, hvPlot, HoloViews, GeoViews, Datashader, Param, and Colorcet.
HoloViz lets users build Python visualizations and interactive dashboards with simple, flexible Python code. It supports several plotting backends, including Bokeh, Matplotlib, and Plotly, so you can pick one based on your preferences. Plus, it’s 100% open source!
Unlike the other Python viz and dashboarding options, HoloViz is very serious about supporting every reasonable context in which…
Data scientists use data visualization to communicate data and generate insights. It’s essential for data scientists to know how to create meaningful visualization dashboards, especially real-time dashboards. This article talks about two ways to get your real-time dashboard in Python:
For demonstration purposes, the plots and dashboards are very basic, but you will get the idea of how to build a real-time dashboard.
Check out the slideshow of this article here: https://sophiamyang.github.io/slides_github_pages/.
There are two parts to this article:
First, let’s create a new environment named slideshow, install the Jupyter Notebook extension RISE, and launch Jupyter Notebook:
conda create -n slideshow -c conda-forge python=3.9 rise
conda activate slideshow
Then create a Jupyter Notebook file as usual:
A Salesforce database can be a hot mess. The figure below illustrates the relationship among some of the data tables in Salesforce. As you can see, the relationships among data tables (i.e., objects) can be complicated and hard to work with. I previously wrote a blog post on how to understand and query Salesforce data using the Salesforce Object Query Language (SOQL) through the Python API simple-salesforce. SOQL is a SQL-like language designed specifically for the relational data in Salesforce, and it is not the easiest to understand and write for people who are…
How do you query BigQuery data? This article talks about 3 ways to query BigQuery data in Python. Hope you find them useful.
conda install notebook google-cloud-bigquery sqlalchemy pybigquery
To authenticate with Google Cloud locally, you will need to install the Google Cloud SDK and log in with the following command. More information can be found in the official documentation.
gcloud auth login
To authenticate through a credential file, you can create a service account and get the credential from it: go to the Google Cloud service account page, click on a project, click “+CREATE SERVICE ACCOUNT” and then…
Multiclass logistic regression is also called multinomial logistic regression and softmax regression. It is used when we want to predict more than 2 classes. A lot of people use multiclass logistic regression all the time, but don’t really know how it works. So, I am going to walk you through how the math works and implement it using gradient descent from scratch in Python.
Disclaimer: there are various notations on this topic. I am using the notation that I think is easy to understand and visualize. …
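To make the idea concrete, here is a minimal sketch of softmax regression trained with batch gradient descent on the cross-entropy loss. It is not the article's implementation: the notation, the function names (`softmax`, `fit_softmax_regression`, `predict`), the learning rate, and the synthetic three-cluster data are all assumptions chosen for illustration, and it relies on NumPy.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)


def fit_softmax_regression(X, y, n_classes, lr=0.1, n_iter=500):
    """Fit weights W (n_features x n_classes) by batch gradient descent."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    Y = np.eye(n_classes)[y]                  # one-hot encoded labels
    for _ in range(n_iter):
        P = softmax(X @ W)                    # predicted class probabilities
        grad = X.T @ (P - Y) / n_samples      # gradient of mean cross-entropy
        W -= lr * grad
    return W


def predict(X, W):
    """Return the most probable class for each row of X."""
    return softmax(X @ W).argmax(axis=1)


# Synthetic data (assumed for illustration): three well-separated 2-D clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 4.0], [-4.0, -4.0], [4.0, -4.0]])
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.normal(size=(150, 2))
X = np.hstack([X, np.ones((150, 1))])         # constant column acts as the bias

W = fit_softmax_regression(X, y, n_classes=3)
accuracy = (predict(X, W) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The key step is the gradient `X.T @ (P - Y) / n_samples`: the difference between predicted probabilities and one-hot labels, projected back onto the features.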
Software testing is essential for software development. Software engineers are encouraged to use test-driven development (TDD), a process in which test cases are written first and the software afterwards. For data scientists, it is not always easy or feasible to write tests first. Nevertheless, software testing is important, and every data scientist should know how to write unit tests and use them in their data science workflow. A lot of data scientists already use assertions, which is a very important first step to test-driven development. This article will step up from assertions and focus…
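As a rough illustration of the step from assertions to unit tests, the sketch below wraps a few checks in a `unittest.TestCase`. The function under test, `min_max_scale`, is a hypothetical helper invented for this example, and the suite is run programmatically so the same code also works inside a notebook cell.

```python
import unittest


def min_max_scale(values):
    """Hypothetical helper: rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be equal")
    return [(v - lo) / (hi - lo) for v in values]


# A bare assertion: quick, but it stops at the first failure and reports little.
assert min_max_scale([2, 4, 6]) == [0.0, 0.5, 1.0]


class TestMinMaxScale(unittest.TestCase):
    """The same kind of check as named, independent test cases."""

    def test_scales_to_unit_range(self):
        self.assertEqual(min_max_scale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_order_is_preserved(self):
        self.assertEqual(min_max_scale([10, 0, 5]), [1.0, 0.0, 0.5])

    def test_constant_input_raises(self):
        with self.assertRaises(ValueError):
            min_max_scale([3, 3, 3])


# Run the suite without exiting the interpreter (handy in notebooks).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMinMaxScale)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Unlike a chain of bare assertions, every test case runs even if another one fails, and the runner reports which cases passed and which did not.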
Many data scientists like to use Jupyter Notebook or JupyterLab to do their data explorations, visualizations, and model building. I know some data scientists refuse to use Jupyter Notebook. But, I love to use Jupyter Notebook/Lab to do my experiments and explorations. Here is a Jupyter Notebook workflow that might be helpful.
Whenever you are working on a new project, you should always start with a fresh environment. If you are lazy and don’t want to create a new environment for every project, you should at least create one new environment that’s separate from your base environment…
Profiling helps us identify bottlenecks and optimize performance in our code. If our code runs slowly, we can identify where our code is slow and then make corresponding improvements.
Here are the explanations of Python profilers from the Python documentation:
cProfile and profile provide deterministic profiling of Python programs. A profile is a set of statistics that describes how often and for how long various parts of the program executed…
cProfile is recommended for most users; it’s a C extension with reasonable overhead that makes it suitable for profiling long-running programs.
cProfile is a Python built-in profiler, which is great for…
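A minimal sketch of how cProfile might be used from within a script is shown below. The function `slow_sum_of_squares` is a hypothetical stand-in for slow code; the profiling calls themselves (`cProfile.Profile`, `pstats.Stats`, `sort_stats`, `print_stats`) come from the standard library.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical slow function used only to have something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect statistics only for the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(200_000)
profiler.disable()

# Format the five most expensive entries by cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists, per function, the number of calls and the total and cumulative time, which is usually enough to see where a script spends most of its time.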