What I learned from a data science conference
I attended Data Council 2022, a data science and data engineering conference, this week. It was my first time attending. One special thing about this conference is how many startups and investors are there. There are several interesting themes we noticed:
- There are many interesting problems and issues with the current way we do data engineering and data science: it can be a frustration with specific tooling, the complexity of data infra and data clouds, the issues with data, the issues with models, and others. The issues and problems bring opportunities. Many startups are building new tools to solve various data problems.
- Enterprises have adopted seemingly overlapping or redundant tools which serve different parts of the business.
- Many companies are looking to start an OSS offering and commercialize from there, but this is by no means straightforward and requires intentional community development.
- Self-serve analytics has been a goal for years, but we still haven’t effectively achieved it.
I’d like to share what I learned and highlight some of my favorite talks. Note that some of the wording below comes directly from the speakers’ slides and talks. Please treat all of the content as quoted from, and credited to, each talk and presenter.
Opening keynote by Peter Wang, CEO at Anaconda
I love Peter’s keynote on “Enterprise Data Science Comes of Age”. He talked about Python adoption and growth in the past 10 years. Python was not an analytics tool 10 years ago. But because of the unique advantages of Python (easy to learn, rich library ecosystem, etc.), Python took off in the data community. Enterprise IT orgs are at the center of adopting new tech. Do you want to be settlers or pioneers? I especially like when Peter talks about “against efficiency … All creative growth requires “slack” to explore the adjacent possible”. He is probably one of the few CEOs seeing the value of slacking. We love Peter!
AB Experimentation by Chad Sanderson, Head of Product at Convoy & Chetan Sharma, CEO at Eppo
I think this was the only experimentation talk at the conference, and I always like a good talk on experiments. Chad talked about the importance of measuring the things that matter. He shared three case studies from Convoy, Microsoft, and Subway. Each company has a different set of challenges and focuses on a different set of metrics. The ideal stack includes randomized assignment, metrics, and analysis. However, third-party tools are often limited in each of these areas: which entities we can randomize, which metrics we can create and monitor, and what kind of analysis we can do. The takeaway is to structure the experimentation system around use cases and business needs.
Chetan described and provided tips for each building block of the experimentation architecture:
- Randomization: use md5() hashes for assignments
- Metrics: use the business metrics that matter. It’s common to focus on one part of the funnel. However, the experiment might boost one part of the funnel and damage another.
- Sufficient stats
- Stat tests: many companies only use t-tests without knowing their restrictions (e.g., they are not suited to ratio metrics or to data with outliers/power-law distributions, and you should not look at the results until the experiment is done). It’s better to use sequential testing, so that people can look at the results at any time. Use CUPED to save money and speed up experiments (I actually wrote a blog post summarizing various variance reduction methods including CUPED. Go check it out!)
- Diagnostics: use a sample ratio mismatch (SRM) test to make sure you have balanced groups. Use a non-parametric test when there are outliers. Alternatively, use Winsorization and CUPED to handle outlier problems.
- Investigations: slice-dice your experiments to investigate different groups. Check your funnels to investigate each step of the funnel.
- Reporting: provide good reporting that does not require readers to understand statistics or infrastructure.
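To make a few of these building blocks concrete, here is a minimal Python sketch of my own. It is not code from the talk or from any vendor. It shows MD5-based assignment, an SRM check for an intended 50/50 split, and the basic CUPED adjustment. The function names, the experiment-name salt, and the 50/50 split are all assumptions I made for illustration.

```python
import hashlib
import math


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant via an MD5 hash."""
    # Assumption: salting with the experiment name so that assignments
    # in different experiments are not correlated with each other.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def srm_p_value(n_control, n_treatment):
    """Chi-square goodness-of-fit p-value against an intended 50/50 split."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    `pre_metric` is the same metric measured before the experiment.
    Assumes `pre_metric` is not constant (otherwise its variance is zero).
    """
    n = len(metric)
    mean_x = sum(pre_metric) / n
    mean_y = sum(metric) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pre_metric, metric))
    var_x = sum((x - mean_x) ** 2 for x in pre_metric)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]
```

A very small SRM p-value (a common rule of thumb is below 0.001) suggests the groups are not balanced the way the design intended, so the assignment or logging should be investigated before anyone reads the results. The CUPED adjustment keeps the mean of the metric unchanged while removing the variance that the pre-experiment data can explain.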
Declarative machine learning by Tristan Zajonc, CEO at continual.ai
Tristan is one of my favorite presenters at this conference. He is very engaging and great at telling technical stories. He described the complexity of MLOps, where people are trying to stitch together many different ML tools and frameworks. This kind of technical complexity has been tamed before through higher-level declarative abstractions. For example, Terraform solves the complexity of Bash commands. People moved from Hadoop MapReduce to SQL, and from jQuery to React.
Tristan proposed a similar high-level abstraction for operational AI. We don’t need pipelines. Instead, all we really need to know about ML models is task and policy.
- Task defines the inputs and outputs. It’s quite interesting to me that once we define the input and output, we know what kind of model we need. For example, if the input is text and an image and the output is text, then the model is probably image question answering. If the input is a sequence and the output is a sequence, then the model is probably reinforcement learning.
- Policy defines the logistics of the model, for example, training schedules, promotion policies, and prediction schedules.
We can simply define the feature set and the model declaratively through tasks and policies. Then the system can run all relevant models for users to choose the best one.
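Here is how I picture the idea, as a toy Python sketch. This is my own illustration and not continual.ai's API: the `Task` and `Policy` classes, their fields, and the `candidate_models` lookup are all hypothetical names I made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """What the model does: its input and output types."""
    name: str
    inputs: dict  # feature name -> type, e.g. {"review": "text"}
    output: str   # e.g. "category", "number", "text"


@dataclass(frozen=True)
class Policy:
    """How the model is operated over time (all values are placeholders)."""
    train_schedule: str = "weekly"
    predict_schedule: str = "daily"
    promote_if: str = "beats_current_model"


def candidate_models(task):
    """Toy lookup: the input/output types narrow down the model family."""
    input_types = frozenset(task.inputs.values())
    if input_types == {"text", "image"} and task.output == "text":
        return ["visual_question_answering"]
    if task.output == "category":
        return ["gradient_boosted_trees", "logistic_regression"]
    if task.output == "number":
        return ["gradient_boosted_trees", "linear_regression"]
    return ["generic_neural_network"]


churn = Task("churn", {"plan": "category", "monthly_spend": "number"}, "category")
policy = Policy(train_schedule="daily")
print(candidate_models(churn), policy)
```

The point is that the user only writes down the `Task` and the `Policy`. Picking, training, and promoting a model is left to the system.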
Deepnote Notebooks by Elizabeth Dlha, Head of Community & Partnerships at Deepnote
I love how Elizabeth uses colors to describe different users and use cases! Elizabeth started her talk by describing tech people/data producers as “blue” people and business people/data consumers as “red” people. The technical “blue” side cares about language interoperability, integrations, and code intelligence. The business “red” side cares about reactivity, seamless editing (like Google Docs), and data apps (not notebooks). Deepnote aims to fulfill all these needs: it allows real-time collaboration and comments, organizes knowledge through folders and templates, and provides a general-purpose platform for the “purple” people who are involved in both business and tech. I’m interested in trying it out.
Hex Notebooks by Caitlin Colgrove, CTO at Hex
I feel like we all have a love-hate relationship with Jupyter Notebooks. Many startups are trying to make the notebook experience better. Caitlin focused on the interpretability, reproducibility (out-of-order cells are hard to reproduce), and performance (reruns are wasteful) issues with notebooks and proposed a reactive programming approach. I like her Excel example. Excel is reactive: whenever we change a cell value, every cell that is calculated from it gets updated. Hex also uses DAGs to show the dependency graph of the notebook cells, which can be SQL code, charts, Python code, and widgets. The UI is actually pretty nice.
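To show what “reactive” means here, below is a tiny Python sketch of my own. It has nothing to do with Hex's actual implementation, and the `ReactiveNotebook` class and its methods are hypothetical. Cells declare their dependencies, and when one cell changes, only that cell and the cells downstream of it are rerun, in dependency order.

```python
from graphlib import TopologicalSorter  # Python 3.9+


class ReactiveNotebook:
    def __init__(self):
        self.cells = {}   # name -> (function, names of upstream cells)
        self.values = {}  # name -> last computed value
        self.runs = []    # log of executed cells, in execution order

    def cell(self, name, func, deps=()):
        """Define or redefine a cell."""
        self.cells[name] = (func, tuple(deps))

    def _downstream(self, name):
        """The cell itself plus everything that depends on it, transitively."""
        affected = {name}
        grew = True
        while grew:
            grew = False
            for cell, (_, deps) in self.cells.items():
                if cell not in affected and affected.intersection(deps):
                    affected.add(cell)
                    grew = True
        return affected

    def run(self, changed=None):
        """Run everything, or only the cells affected by `changed`."""
        graph = {cell: deps for cell, (_, deps) in self.cells.items()}
        stale = self._downstream(changed) if changed else set(self.cells)
        for cell in TopologicalSorter(graph).static_order():
            if cell in stale:
                func, deps = self.cells[cell]
                self.values[cell] = func(*(self.values[d] for d in deps))
                self.runs.append(cell)


nb = ReactiveNotebook()
nb.cell("price", lambda: 10)
nb.cell("quantity", lambda: 3)
nb.cell("total", lambda p, q: p * q, deps=("price", "quantity"))
nb.cell("title", lambda: "Report")
nb.run()
```

After redefining `price` and calling `nb.run(changed="price")`, only `price` and `total` are recomputed. `quantity` and `title` keep their old values. That addresses both the out-of-order problem and the wasteful reruns.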
Type-safe ML pipeline with Flyte and Pandera by Niels Bantilan, ML engineer at Union
I came to this talk because I’ve been wanting to use Pandera for a while now. Pandera is a statistical typing and data testing library for dataframes. Niels is the creator of Pandera. I have tried out a little bit of both Pandera and Great Expectations. As a data scientist who mainly works with pandas DataFrames, I find Great Expectations kind of heavyweight. It provides a UI, data profiling, and many other features I might not need. Niels mentioned that Great Expectations is for data infra and data engineers and Pandera is for data scientists and data analysts. I tend to agree, although I don’t have extensive knowledge of either tool, so don’t take my word for it. Fun fact: Niels chose the name Pandera because he likes Pandaria from World of Warcraft, and also because Pandera is close to Pandas in spelling.
Okay, back to his talk. He cares a lot about type safety. A type defines the set of values that data can take and the operations that are allowed on it. Flyte is a data and ML orchestration tool with strong type safety to ensure the reliability, efficiency, and auditability of an ML workflow. Flyte uses task decorators to define containerized tasks as the building blocks of its workflows. The workflows are dynamic DAGs with built-in parallelism for executing tasks. Flyte works well with Pandera, which expresses statistical types in the codebase and ensures the quality of data in your ML pipelines.
Privacy Protection methods by Will Thompson, Principal Software Engineer at Privacy Dynamics
Will pointed out that synthetic data is not automatically protected. This is a common misconception, and many people share synthetic data without understanding the risk. He went through various privacy protection methods that I thought were very interesting.
- Suppression/Generalization: This approach basically puts data into buckets. For example, age 25 can be generalized into a 25–35 bucket. However, when it’s applied ad-hoc, it doesn’t guarantee privacy.
- Global differential privacy: This is an interactive model, which adds noise to a single statistic and uses epsilon values to measure the worst-case scenario.
- K-anonymity: It generalizes the data so that each individual’s identifying attributes are the same as those of at least k-1 other records. It is complex and computationally expensive.
- Local differential privacy: it protects the whole dataset and provides strong privacy guarantees.
- Synthetic data: use a model to produce a new dataset that looks like the original data.
- Microaggregation: achieves k-anonymity with less distortion than classical methods.
I think all these methods are super interesting. I’m not sure if I understand all of them yet. I will probably read more on them later.
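To help myself digest a few of these, I wrote a small Python sketch. It is my own simplification and not anything from Will's talk or from Privacy Dynamics. It shows generalization into buckets, a k-anonymity check, and a noisy count using the Laplace mechanism (a standard way to add differential-privacy noise to a single statistic). The bucket boundaries, field names, and function names are assumptions.

```python
import random
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a bucket label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a counting query (sensitivity 1)."""
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution with that scale.
    return true_count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


people = [
    {"age": 25, "zip": "94107"},
    {"age": 27, "zip": "94107"},
    {"age": 31, "zip": "94107"},
    {"age": 38, "zip": "94107"},
]
bucketed = [{**p, "age": generalize_age(p["age"])} for p in people]
print(is_k_anonymous(people, ["age", "zip"], k=2))
print(is_k_anonymous(bucketed, ["age", "zip"], k=2))
```

The raw records fail the 2-anonymity check because every age is unique, while the bucketed version passes. A smaller `epsilon` in `noisy_count` means more noise and stronger privacy.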
Data checking through models by Peter Gao, CEO at Aquarium
Peter emphasized the data problems in the ML workflow. There are different types of data problems: invalid data, labeling errors, difficult edge cases, and out-of-sample data. Identifying failure cases in the data is difficult. Peter proposed improving datasets efficiently through model feedback: let your model check your data for problems. High-loss disagreements with labels indicate labeling errors, error patterns vs labels indicate errors in the metadata or raw data, and checking distributional shifts between training and production environments is always a good idea. In his computer vision examples, there were many cases where the model’s output was correct but the images were labeled incorrectly. I thought those examples were pretty cool.
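The “high-loss disagreement” idea is easy to sketch in Python. This is my own minimal version and not Aquarium's product or API. It assumes each example is a dict with hypothetical `id`, `label`, and `probs` (predicted class probabilities) fields.

```python
import math


def suspicious_labels(examples, top_n=3, eps=1e-12):
    """Rank examples where the model disagrees with the label, by loss."""
    scored = []
    for ex in examples:
        probs = ex["probs"]
        predicted = max(probs, key=probs.get)
        if predicted == ex["label"]:
            continue  # model agrees with the label, nothing to review
        # Cross-entropy loss of the given label under the model's prediction.
        loss = -math.log(max(probs.get(ex["label"], 0.0), eps))
        scored.append((loss, ex["id"], predicted))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


examples = [
    {"id": "img1", "label": "cat", "probs": {"cat": 0.90, "dog": 0.10}},
    {"id": "img2", "label": "cat", "probs": {"cat": 0.02, "dog": 0.98}},
    {"id": "img3", "label": "dog", "probs": {"cat": 0.60, "dog": 0.40}},
]
for loss, example_id, predicted in suspicious_labels(examples):
    print(example_id, predicted, round(loss, 2))
```

The examples at the top of the list are the ones where the model is most confident that the label is wrong, so they are the first ones a human should re-check. A high loss can also mean the model is wrong on a hard edge case, so this is a review queue and not an automatic relabeling step.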
Model decay/data drift by Bastien Boutonnet, Lead Data Scientist at Soda
Data drift is “when the distribution of one or more of your input features has changed between, for example, training time and deployment”. It’s important to have a data quality monitoring framework to check for data drift. Data quality monitoring is hard, and it should not add time to a release. Soda provides an easy way to statistically check the difference between two distributions (e.g., with a KS test).
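For intuition, here is a minimal Python sketch of a two-sample KS check written with only the standard library. It is my own illustration and not Soda's API. The function names are made up, and the decision rule uses the standard large-sample critical value, where 1.358 corresponds to a 5% significance level.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap is always reached at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def has_drifted(train, prod, c_alpha=1.358):
    """Large-sample KS decision rule (c_alpha = 1.358 for alpha = 0.05)."""
    n, m = len(train), len(prod)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, prod) > critical


train_feature = list(range(100))
prod_feature = [x + 50 for x in range(100)]  # same shape, shifted by 50
print(ks_statistic(train_feature, prod_feature))
print(has_drifted(train_feature, prod_feature))
```

In practice I would reach for a tested implementation such as `scipy.stats.ks_2samp`, which also returns a p-value. The sketch is only meant to show what the test measures: the largest vertical distance between the training and production distributions.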
Overall, I think Data Council is a very interesting and unique conference. I met a lot of startup CEOs, cofounders, Python package creators, and investors. The startups’ creative energy is inspiring.
Acknowledgment: Thank you so much, Christian Capdeville, for the feedback and suggestions. Christian suggested three of the four themes mentioned in the first paragraph.