Here are my notes on synthetic data. Take them with a grain of salt, as I am new to this area.
Let's say you have a generative AI (generative machine learning) problem: you have some input data and some corresponding output data, but only a tiny quantity of this training data.
Synthetic data is data generated by a machine learning model; here we will discuss primarily large language models (LLMs). Synthetic data helps where you need to add missing, hard-to-collect data for your predictions. This data could be something people never write down or say. For example, human problem-solving thoughts usually don't get written down. Let's Verify Step by Step uses synthetic data to create PRM800K, a math step-wise reasoning dataset not available elsewhere:
Data of this kind is rare, and with an additional ranking for each step it is even rarer:
Let's call the numerator x.
So the denominator is 3x-7.
We know that x/(3x-7) = 2/5.
So 5x = 2(3x-7).
5x = 6x-14.
So x = 7.
The rarer and more critical the data, the more impactful the synthetic data can be.
If nothing like the required data was present during the model pretraining, you won't be able to prompt-instruct the model to perform the required action. Few-shot examples can help, but the more complex the problem, the more examples you will likely need, and these are costly to write by hand.
Another advantage of synthetic data is that it can be cleaner than data collected from random places on the internet. The disadvantage is that it can be less diverse and less representative than real data, and it can contain more hallucinations.
Real data costs human time; it can contain copyrighted material or personally identifiable information (PII), and it can be noisy, incomplete, or irrelevant. Synthetic data can be a way around these problems. But you cannot create the required data synthetically out of thin air.
You may need:
Another way of looking at 2 is to use other data that was trained into the LLM to shift the output distribution the way you want. For example, you can polish the LLM's behavior by making it consistent with selected good patterns in some parts of its training data; these patterns are retrieved and applied with prompting instructions, which leads to generating good synthetic data.
Terms of service must be checked to comply with the provider's conditions. Mistral.ai allows using the output of their GPT-4-level Large model as synthetic data for training LLMs, in contrast to OpenAI's policy, which does not allow that.
Another alternative is open-source models, e.g., Llama-70b, which has some use restrictions, or Mixtral, which is much more open and requires only 12B active parameters. You can buy Mixtral inference from many providers, for example OpenRouter or Fireworks.ai. As of 2024-03-18, the prices for Mixtral are similar to GPT-3.5.
Here is a spectrum of decreasing human involvement, i.e., increasing human leverage, in the data generation and model development process.
Garbage in implies garbage out. The more complex the data to generate, or the more distant it is from the real training data, the harder it is to synthesize correct data to train on.
In some problems, verification is easier than generation, so you can remove the invalid data from the generated data. For example, the goal may be to generate a program function that passes an executable set of tests. In this case, verifying that the generated function is correct is easy: just run the tests. Another example could be playing chess.
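Here is a minimal sketch of this filter-by-verification idea; generate_candidate, prompt, and tests are hypothetical placeholders for your own LLM call and test suite:

def passes_tests(code: str, tests: str) -> bool:
    scope = {}
    try:
        exec(code, scope)   # define the candidate function
        exec(tests, scope)  # run assert-based tests against it
        return True
    except Exception:
        return False

# keep only the generated samples that the executable tests verify
synthetic_dataset = [
    code
    for code in (generate_candidate(prompt) for _ in range(100))
    if passes_tests(code, tests)
]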
Constitutional AI uses an LLM to label synthetically generated responses to follow specific rules (constitution about harm, bias, and more).
Below is a prompt from Self-Alignment with Instruction Backtranslation for self-curation of synthetic data:
Below is an instruction from an user and a candidate answer. Evaluate whether or
not the answer is a good example of how AI Assistant should respond to the user’s
instruction. Please assign a score using the following 5-point scale:
1: It means the answer is incomplete, vague, off-topic, ...
2: It means the answer addresses most of the asks from the user. ...
3: It means the answer is helpful but not written by an AI Assistant. ...
4: It means the answer is written from an AI assistant’s perspective with a clear focus of addressing the instruction. ...
5: It means it is a perfect answer from an AI Assistant. ...
Please first provide a brief reasoning you used to derive the rating score, and then write "Score: <rating>" in the last line.
<generated instruction>
<output>
WizardLM Evol Instruct uses human-written coding examples to generate more difficult (complex) examples with GPT-3.5. Then, they use the synthetic dataset to fine-tune and improve performance on coding problems in general.
In some cases, generating the questions (inputs) is easy given the answers (outputs). This inverted generation also allows you to control the distribution of the labeled samples. In the case of text classification, it is easier to generate a text that matches the given category label. Another example of this method is Self-Alignment with Instruction Backtranslation, which also involves self-curation.
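A minimal sketch of inverted generation for text classification, assuming a hypothetical llm() function that returns a completion for a prompt:

import random

labels = ["positive", "negative", "neutral"]

def generate_labeled_sample(llm):
    # choose the label first, so we control the class distribution of the dataset
    label = random.choice(labels)
    prompt = (
        f"Write a short product review with a clearly {label} sentiment. "
        "Return only the review text."
    )
    text = llm(prompt)  # hypothetical LLM call
    return {"text": text, "label": label}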
The model cannot learn out of thin air, but if you can reformulate or abstract the given problems into other problems, on which the model was trained more, you can reapply those lessons and improve the model by polishing it this way.
For example, verifying and rating the response quality may be easier than writing the reply, so the model can self-improve on the given task this way.
What synthetic data tools can you use today?
Mistral 7B fine-tuned on the OpenHermes synthetic datasets (generated with OpenAI models) is one of the best open-source models in this parameter-count category.
Nous-Hermes-2-Mistral-7B-DPO was trained on:
DSPy can take just tens of labeled examples and high-level LLM chains and generate prompts to use, generate synthetic data, and fine-tune a smaller model.
This library generally helps you build prompt chains or pipelines where the LLMs have well-defined inputs and outputs and various tools like RAG.
The library abstracts away manual prompt engineering and instead optimizes the prompts for you, such that you only focus on structured and documented inputs and outputs. DSPy seems much more practical than LangChain.
The library uses a selected larger model (GPT-3.5 or Llama2 13b) to generate prompts and few-shot examples for your smaller LLM like T5. Not only that, you can compose an entire pipeline or prompt chain. Generating and optimizing the prompts within the prompt chain is called compiling. For example, DSPy can generate reasoning examples and optionally fine-tune a smaller LLM on them. The question is, how good are the examples? Un-compiled chains use zero-shot prompting.
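Here is a rough sketch of what a small DSPy program and its compilation can look like; the exact API has changed across versions, so treat the names (dspy.OpenAI, BootstrapFewShot, the metric signature) as approximate rather than authoritative:

import dspy
from dspy.teleprompt import BootstrapFewShot

# configure the larger "teacher" model used during compilation
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField()

class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.step = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, question):
        return self.step(question=question)

# tens of labeled examples are enough to bootstrap prompts and demonstrations
trainset = [
    dspy.Example(question=q, answer=a).with_inputs("question")
    for q, a in labeled_pairs  # labeled_pairs: your tens of (question, answer) tuples
]

def exact_match(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

compiled_qa = BootstrapFewShot(metric=exact_match).compile(QA(), trainset=trainset)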
Here is an example DSPy project in a video.
Below is a partial list, and I may expand on this in the future.
A math reasoning dataset. Contributions of this paper are:
An LLM is trained to generate reasoning steps (chains) that use general tools like a calculator or search. The tools are then executed in the order given in the reasoning chain, where the output of one tool call can be an input of another.
The word "abstraction" here expresses that the reasoning chains strip out problem-specific details and instead lean on re-usable, general problem-solving tools.
Llama-70b was used to generate the synthetic data to train on.
A smaller teacher model generates the inputs and labels. The bigger student model can learn to outperform the weaker teacher if allowed to be "over-confident." However, this approach is not generally proven for all situations and still shows an upper performance limit.
Here are my notes on Q-learning and the Q-Transformer. Take them with a grain of salt, as I am new to this area.
The Q-Transformer is an important paper because it describes a successful application of suboptimal synthetic (autonomously collected) data and the transformer architecture to a robotic reinforcement learning problem.
Before the Q-Transformer, let's first talk about a bigger topic: the Bellman update in reinforcement learning.
Let’s suppose we have a game with game-states and rewarded actions we can take at each game-state. For example, in chess this is a state of the chessboard and actions are allowed moves we can make. Or for example, a Mario and Luigi computer game.
In chess, the reward is served only at the end, and the opponent may behave randomly. In Mario and Luigi instead, we collect coin rewards cumulatively throughout the game play, and the world is mostly rule-based.
In cases where the game world is deterministic and the decision maker is in full control, we call these games Deterministic Sequential Decision Problems or Optimal Control Problems. In cases where randomness impacts the outcome of the decisions and the decision maker is not in full control, the problem is called a Markov Decision Process.
Let’s focus on Deterministic Sequential Decision Problems.
An even simpler example is below, where we have just a single state, a single possible action, and a single reward for that action.
In this diagram, if we keep looping, we keep stacking rewards. If we discount future rewards with a 0.5 discount factor, the total reward is 1 + 0.5 + 0.25 + ... = 2, so the value of the state is 2.
A more interesting example is where we have 2 possible actions:
In this case, if we keep making the right decision, we still get a total reward of 2, so the value of the state is still 2.
The above example demonstrates the importance of the discount factor, but it is still a bit confusing because of the infinitely many possible paths. Let's look at this 3-state example:
Can you see what the best path is in the example above?
Again, the best path is always choosing the first action.
With discount_factor=0.5
the value of the state 1 is:
value[state1] = 1 + 0.5 * 1 = 1.5
How do I know that? By working backwards from the last state: from State2 the best reward is through action1, and then again through action1. This solution approach is called backward induction. Notice that backward induction has some similarities to Dijkstra's shortest path algorithm in that we memorize the best paths to a certain subset of states.
An optimal decision maker always achieves the best possible total from each situation. Because the rewards add up to the total value, we can decompose the value of a state into the best action's reward plus the value of the next state.
The Principle of Optimality simply says that for the best decision maker (policy), no matter where you start or what your first step is, the next steps should always form the best plan for the situation after that first step, where the best plan is the one with the highest total reward. This principle is captured by the Bellman Equation, which is a necessary condition for optimality.
# Bellman Equation
value(current_state) == (
    reward(current_state, the_best_action)
    + discount * value(the_best_next_state)
)
We can see this decomposition in Example 3. We can also see how backward induction solves the equation.
Notice the best next state, which is determined by maximizing the total value. We use the maximum function here, which makes the Bellman equation non-linear.
We can explore paths through states and actions and estimate a (minimal) value of a state as the total path reward starting from that state. Every time we find a better path, we can use the Bellman equation above to update the state value. We iterate this until we learn the best decision for every starting state.
From the above, we can see that we can apply the principle of optimality as an update rule to refine our decision-making based on the trajectories we explored. We do this with the Bellman update.
We explore a different path, or a different action in a state, and update the corresponding value-function entry using the action that leads along the highest-reward path. In many scenarios, over time we converge to the accurate value function.
Since we are storing values for each state, we represent the value function as an array or python dictionary:
# Bellman Update
value[current_state] = (
    reward[current_state, the_best_action]
    + discount * value[the_best_next_state]
)
At a high level, the value-iteration method repeatedly sweeps over all states and applies the Bellman update until the values stop changing.
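Here is a minimal sketch of value iteration for a small tabular, deterministic environment; get_next_state, reward, and actions are hypothetical environment functions:

def value_iteration(states, actions, reward, get_next_state, gamma=0.5, tol=1e-6):
    # start with zero value for every state
    value = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: best immediate reward plus discounted next-state value
            best = max(
                reward(s, a) + gamma * value[get_next_state(s, a)]
                for a in actions(s)
            )
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < tol:  # values stopped changing, so we have converged
            return value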
There is also the policy-iteration method, which focuses on finding a policy rather than the value function.
I read that the value-iteration method is a fixed-point method and is likely to converge under reasonable conditions.
Instead of a value function, it is easier to work with a Q-function:
# Bellman Equation with q_function
q_function[current_state, the_best_action] == (
    reward[current_state, the_best_action]
    + discount * max(
        q_function[the_best_next_state, the_next_action]
        for the_next_action in actions[the_best_next_state]
    )
)
With that we can describe the Bellman update rule in more detail:
states = [...]   # list of all states
actions = [...]  # list of all actions
gamma = 0.9      # discount factor

# The Q-table is initially filled with zeros
q_function = {(state, action): 0 for state in states for action in actions}

def bellman_optimal_operator_update(q_function, state, action):
    # the next state is defined by the environment
    next_state = get_next_state(state, action)
    # the next action in the next step is chosen by the optimal policy maximizing
    # the q_function; we directly update the q_function in place
    q_function[state, action] = (
        reward(state, action)
        + gamma * max(q_function[next_state, next_action] for next_action in actions)
    )
    return q_function

def optimal_policy(q_function, state):
    # the optimal policy picks the action maximizing the q_function
    return max(actions, key=lambda a: q_function[state, a])
Instead of a model-free tabulation of the Q-function, which is very memory-intensive, we can model the Q-function so that it interpolates the table from less than the full data.
Temporal difference learning (TD-learning) is related to value iteration and Q-learning, but it makes fewer assumptions about the environment. The method is called temporal difference because of the difference between the current estimate and a one-step-lookahead estimate based on future-state Q-function values.
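A minimal sketch of a single tabular TD update (hypothetical names; this is the textbook Q-learning update, not the Q-Transformer implementation):

def td_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    # one-step-lookahead target: observed reward plus discounted best next Q-value
    target = reward + gamma * max(q[next_state, a] for a in actions)
    # temporal difference: the gap between the target and the current estimate
    td_error = target - q[state, action]
    # move the current estimate a small step (alpha) towards the target
    q[state, action] += alpha * td_error
    return td_error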
The state consists of a textual instruction, a 320 × 256 camera image, the robot arm position, the arm orientation, and the gripper state.
The action consists of 8 dimensions: 3D position, 3D orientation, a gripper closure command, and an episode termination command.
The reward is received only at the end, and the termination command must be triggered for the policy to receive a positive reward upon task completion.
For example, in the Q-transformer a multi-modal neural network with transformer architecture is used for modeling the Q-function and TD-learning is used for offline training.
More specifically, the input camera image goes into an instruction-conditioned convolutional network: the text instruction conditions an EfficientNet convolutional network for the visual modality via FiLM layers. The conditioned network's output then goes into a transformer, which outputs the Q-function value predictions.
At initialization, the neural model has a cold-start problem and is very bad at estimating state values. But if we tabulate (memoize) the returns of successful trajectories, we can immediately provide a minimal (lower-bound) value for any point along a successful pathway. This speeds up the learning of the Q-function neural network. This tabulation method is called the Monte Carlo return. In a way, we are combining brute force with neural network interpolation.
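A minimal sketch of this lower-bound idea (hypothetical names, not the paper's exact formulation): the TD target is clipped from below by the return actually observed on a stored successful trajectory.

def target_with_mc_bound(td_target, monte_carlo_return):
    # the Monte Carlo return is a valid lower bound on the optimal value, because
    # at least that much reward was actually collected from this state once
    return max(td_target, monte_carlo_return)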
This is one of the tricks used in Q-Transformer.
The most foundational ideas applied in Q-transformer paper were described above. Here is a summary of other contributions in this paper:
Autoregressive Discretization of Actions: To accommodate the high-capacity Transformer architecture, the Q-Transformer discretizes each dimension of the continuous action space separately and treats each dimension as a different timestep in the learning process. This allows the model to learn Q-values for each action dimension separately, enabling efficient scaling of the method to high-dimensional action spaces without encountering the curse of dimensionality.
Conservative Q-Function Regularization: The Q-Transformer uses a modified version of Conservative Q-learning (CQL) that introduces a regularizer to minimize Q-values for actions not present in the dataset explicitly. This conservative approach biases the model towards in-distribution actions, i.e., those seen during training, and serves to mitigate overestimation of Q-values for unseen or suboptimal actions. This approach ensures that during training, the estimated Q-values are kept closer to the minimal possible cumulative reward, which is consistent with the non-negative nature of the rewards in the tasks targeted by the Q-Transformer. This method differs from softmax method of pushing Q-values down for unobserved actions and up for the observed actions, which may prevent keeping Q-values low for suboptimal in-distribution actions that fail to achieve high reward.
Loss Function: The loss function for the Q-Transformer combines both the temporal difference error (between the current and target Q-values) and the conservative regularization term. The action space dimensionality is expanded to include the discrete bins of action values, and the update rule is applied separately for each action dimension.
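A toy sketch of two of these ideas, per-dimension action discretization and the conservative regularizer; the names, bin count, and penalty form are my own illustration, not the paper's code:

import numpy as np

# Per-dimension discretization: each continuous action dimension is mapped to one
# of num_bins tokens and predicted autoregressively, one dimension at a time.
def discretize(action, low, high, num_bins=256):
    action = np.clip(action, low, high)
    return np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)

# Conservative regularization idea: push the Q-values of action bins that are NOT
# in the dataset towards 0 (the minimal possible return), while the dataset
# action's Q-value is fit by the usual TD loss.
def conservative_penalty(q_values_per_bin, dataset_bin):
    mask = np.ones_like(q_values_per_bin)
    mask[dataset_bin] = 0.0
    return float(np.mean((q_values_per_bin * mask) ** 2))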
Q-Transformer outperforms QT-Opt and Decision Transformer in a reinforcement learning task where suboptimal synthetic data is available for offline training. QT-Opt also performs TD-learning, in contrast to Decision Transformer, and that seems to be the biggest factor here for good performance with suboptimal data.
You want to install a specific Python version, e.g., Python 3.9, on Ubuntu 23.10, and neither the Ubuntu PPA nor the Deadsnakes PPA distributes it. Fortunately, you don't have to compile CPython from source manually. And you don't have to change your Ubuntu to an LTS version, e.g., 22.04 or 24.04, where the Deadsnakes PPA supports your version.
You have these options:
pyenv lets you easily switch between multiple versions of Python. Each version is compiled on your machine for you, and for the most popular versions the compilation should work reliably.
pyenv install 3.9
and it should compile it from source.
Conda instead provides precompiled versions of Python for different platforms, although support may not be guaranteed and compiling from source remains a fallback option. In contrast to pyenv, conda is both an environment manager and a package manager, so you can create multiple separate environments and install different precompiled packages into them.
conda create -n my-python-3.9-environment python=3.9
conda activate my-python-3.9-environment
asdf appears to be popular, but I have no experience with it. Here is an example.
asdf plugin add python
asdf install python 3.7.4
This is the least practical but the most flexible and reliable approach. Docker is usually only used to run single-version applications; it is not intended as a development environment. But there is support in IDEs like PyCharm for running a Python interpreter in Docker containers.
This is the webpage I wanted to find when a colleague was having this problem. I hope it helps you now.
There may be a scary, secret problem in your use of lambda in Python when combined with AsyncIO or multi-threading. It is called late binding.
Can you see what is unexpected about the results of the doctest below?
By the way, you are familiar with the add_done_callback method, right?
import asyncio

async def process(i):
    await asyncio.sleep(2)
    print(i, end=', ')

async def process_all():
    """
    >>> asyncio.run(process_all())
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,
    """
    tasks = []
    for i in range(10):
        task = asyncio.create_task(process(i))
        # will this lambda change value?
        task.add_done_callback(lambda _task: asyncio.create_task(process(i + 10)))
        tasks.append(task)
    await asyncio.sleep(0.1)
    await asyncio.gather(*tasks)
    await asyncio.sleep(2)
The output values are all stuck on the final value of 19!
Always bind the values of all variables that you use in a lambda with AsyncIO or threading, e.g., via default arguments.
Or create an object that carries the specific values intended to be used during the later execution of the function, preventing them from changing.
The repeated 19 in the results is due to the late-binding behavior of closures in Python. The lambda function captures the variable i by reference, not by value. By the time the lambda is executed, the for loop has completed and i has its final value of 9. When i + 10 is evaluated inside the lambda, it always equals 19.
import asyncio

async def process(i):
    await asyncio.sleep(2)
    print(i, end=', ')

async def process_all():
    """
    >>> asyncio.run(process_all())
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
    """
    tasks = []
    for i in range(10):
        task = asyncio.create_task(process(i))
        # will this lambda change value?
        # task.add_done_callback(lambda task: asyncio.create_task(process(i + 10)))
        # Instead of the above, capture the current value of i by passing it
        # to the lambda as a default argument.
        task.add_done_callback(lambda _task, _i=i: asyncio.create_task(process(_i + 10)))
        tasks.append(task)
    await asyncio.sleep(0.1)
    await asyncio.gather(*tasks)
    await asyncio.sleep(2)
Try it for yourself: switch the lines above and see the difference in the results. Can you run a doctest? It is also a useful tip for you.
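Another common fix, instead of the default-argument trick, is to bind the value with functools.partial; here is a minimal sketch (schedule_next is a hypothetical helper name):

import asyncio
import functools

def schedule_next(i, _task):
    # i was bound when the partial was created, so it keeps the loop's value
    asyncio.create_task(process(i + 10))

# inside the loop above, instead of the lambda:
# task.add_done_callback(functools.partial(schedule_next, i))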
We witness hints of reasoning capability in the large language models of today. If Descartes is not right and thinking does not imply that I am, then what does? What is then left for a human to be? If there is only matter, is the human mind also matter, and is the mind replicable in a machine?
Are there any alternatives to materialism? The belief that life is a primary building block? Religion seems to put the idea of life, the idea of a god, front and center of its worldview instead of the dead matter of materialism. Does this partially explain the higher birth rates in religious populations?
Because life is self-directing, self-replicating, and self-improving, stationarity or uniformity would go against it, and so from this a concept of a single central god would be expected. People find it easier to relate to and follow other people, so it is simpler to understand a god as a person. Notice David Deutsch's ideas on conjecture and criticism (Popper, Xenophanes) and his points on the necessity of disobedience for the creation of an AGI.
Was René Girard right that the Bible's exposure of the scapegoat mechanism (e.g., Leviticus 16:21-22) shortcuts mimetic conflicts, and so is necessary? I wonder if individualism also provides a similar mechanism.
Did Soviet Russia collapse partially because the ruling communists attempted to remove religion with their materialist view but themselves had lower birthrates, so the demographics eventually shifted towards people who grew up in non-communist communities, who eventually rejected communism and caused the union to collapse? Look at the chart of Russian birthrates plummeting after 1910, which coincides with the Russian Revolution. Note that Lenin had no children, Stalin had 3, Khrushchev (de-Stalinization) had 5, Brezhnev (neo-Stalinist) had 2.
Here are 4 great quotes from Sabine Hossenfelder (a contemporary theoretical physicist and science communicator) and her book Existential Physics:
Science progresses through conjecture and criticism if we follow Popper’s ideas. David Deutsch explained these ideas and popularized them in The Beginning of Infinity. Bible proverbs often praise criticism: “A wise man listens to advice” and “Better is open rebuke”. Regarding the conjectures, the bible verses promote understanding and wisdom. We have a biblical proverb: “The heart of the discerning acquires knowledge, for the ears of the wise seek it out.”
Prayer may also be a problem-solving method: working backward from the end goal (e.g., Amazon's working-backwards process) or Charlie Munger's inversion. Sometimes, it is a method of practicing gratitude, which has positive psychological effects. From another point of view, prayer can be akin to meditation, which also seems to have positive psychological effects.
A very unusual book, The Pragmatist's Guide to Crafting Religion, is a compilation of information on cultures, religions, and traditions. It identifies elements of soft or pop-culture beliefs: wishful thinking, crystals, wow-effect large things like the universe, lacking specificity or containing the unknowable. While these seem similar to intuitive early religions, in wealthy societies they are associated with very low birth rates.
How does your philosophy shape your life? I hope you became more by brainstorming today with me.
I read a book of 38 letters from Rockefeller to his son. I really enjoyed it at first, but then it started to sound suspicious to me. The language was somehow too modern, like a self-help book from 2010. Reading online reviews, I found similar complaints that referred to the archives of the letters, noting that the book was first published in China and that its content is completely different from the original sources.
I studied the online archive, bought another book of Rockefeller quotes, and ordered a physical book called "Dear Father, Dear Son", which I skimmed through in one evening. I have a completely different feeling from it. Those 38 letters are fictional: partially based on various quotes and historical events, but mostly invented and sometimes completely made up (e.g., the car chase).
The real letters from the archive, collected by J.W. Ernst in "Dear Father, Dear Son", are completely different. They are less instructive in the self-help direction, but beautiful. They are full of familial love and gratitude, which I think is also a productivity-enhancing mental technology, based on my other recent readings. The archived letters reveal deep family love, a father's guidance to his son, the son's acceptance of that wisdom, the weight of a near-billion-dollar legacy, and an unshakable belief in a benevolent God.
"The 38 Letters from J.D. Rockefeller to his son" are fictional letters. The signs are:
This is a reminder why you have to be skeptical. Read more on how to spot fakes here: Validate Reliability of a Research Paper.
Regardless of the above, any quotes or books are at risk of not being useful to You, the reader. You are unique or can become unique, living in a unique time and place. Books are mere words that gain meaning when acted on. You may find it more beneficial to describe the world in your own words from the experience of interacting with it.
These are extracts from the letters in the book "Dear Father, Dear Son" that captured my attention.
I verified the sources for a couple of these, but not all yet. You can consider reading the books Titan: The Life of John D. Rockefeller, Sr. and John D. Rockefeller on Making Money: Advice and Words of Wisdom on Building and Sharing Wealth.
People keep asking me about the difference between the encoder, the decoder, and the normal (full) transformer with self-attention. It is a simple thing you can master quickly.
BERT has an encoder-only architecture. The input is text and the output is a sequence of embeddings. Use cases are sequence classification (via the class token) and token classification. It uses bidirectional attention, so the model can see forwards and backwards.
Another encoder-only model example is ViT (Vision Transformer) for image classification.
GPT-2 has a decoder-only architecture. The input is text and the output is the next word (token), which is then appended to the input. Use cases are mostly text generation (autoregressive), but with prompting we can do many things, including sequence classification. The attention is almost always causal (unidirectional), so the model can see only the previous tokens (the prefix).
T5 has an encoder-decoder (full transformer) architecture. The input is text and the output is the next word (token), which is then appended to the decoder input. The encoder-decoder uses cross-attention to introduce information from the encoder into the decoder.
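A small illustration of the three architectures using the Hugging Face transformers auto classes; the model names are just common examples:

from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# encoder-only: text in, embeddings out (e.g., BERT)
bert = AutoModel.from_pretrained("bert-base-uncased")

# decoder-only: text in, next token out, appended autoregressively (e.g., GPT-2)
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# encoder-decoder: the encoder reads the input, the decoder generates the output (e.g., T5)
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")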
The intuition is that the decoder model just appends text, so if there is a significant distribution difference between the input and the output, for example a completely different set of tokens, we can expect the encoder-decoder to work better. And the decoder (prefix model) sees only the past, so any task that involves seeing the entire text context and addressing specific tokens is a bit more complex for it. However, decoder-only is a simpler architecture than encoder-decoder; it is already Turing-complete, and the size of the model and the training are likely the biggest factors in most cases (The Bitter Lesson).
To make a relevant apples-to-apples comparison, we can compare these in a latency-matched, compute-matched, or parameter-matched way, but it is hard to get rid of the major differences in training objectives, which likely play the decisive role.
In the Flan-UL2 paper, the authors attempted to reduce the training differences by reformulating the fill-in-the-blank task (denoising) into a generative (autoregressive or prefix language modeling) setting; this is called Mixture of Denoisers. Furthermore, they seem to use the same encoder-decoder model in both a decoder-only way and an encoder-decoder way. Also in the Flan-UL2 paper, their best model was a 20B-parameter encoder-decoder.
Furthermore, compute-matched encoder-decoder models in the UL2 paper have approximately twice the number of parameters of the decoder models but similar speed and accuracy. This indicates that the encoder-decoder may have more sparsity, which might be taken out with pruning or distillation techniques to eventually outperform.
In this older, pre-RLHF paper, encoder-decoder models trained with masked language modeling achieve the best zero-shot performance after multitask fine-tuning.
As a detail, there is a difference between a decoder-only causal LM and a prefix LM. A prefix LM has a section (the prefix) with non-causal (bidirectional-attention) token dependencies, like BERT:
Personally, I would choose based on what pretrained model is available and how easy it is to adapt to the task at hand. It is unclear from the start which architecture would be the best. Perhaps a minor consideration could be the following:
Create and share a Google Calendar event link in no time, irrespective of whether you have a Google Calendar account or not. With this tool, you can seamlessly generate a URL for your event and share it with participants. All they need to do after that is to click on the link to add the event to their respective Google Calendars.
No Google Calendar Account Required: With our tool, there’s no need for you to have a Google Calendar account yourself. Simply fill up the form and generate a URL for your event in seconds.
Streamlined Calendar Addition for Participants: Once you share the link with your participants, they can add the event to their calendars with just a click.
Better Event Recall for Attendees: With the event securely added to their calendars, attendees can keep track of it easily, reducing the risk of forgetting about it.
All you need to do is fill out the given form, which comprises fields for the event name, start and end times, and time zone.
Then copy the event URL or go directly to the link by clicking on the respective buttons.
Double-check your timezone settings and test the URL with someone.
Message me to request more features or report bugs.
Organizing Webinars or Online Training: Creators or educators can utilize this tool to share schedules of their upcoming webinars or online courses, boosting enrolment rates.
Scheduling Business Meetings: The tool can be used to generate Google Calendar Event Links for business meetings quickly. Participants can add the event to their calendars with a simple click, improving meeting attendance.
Planning Community Events: Organizations can utilize this tool to create links for community events such as fundraisers, charity runs, or social gatherings, ensuring maximum participation.
Get started now, and save your time, as well as that of your participants, with our Google Calendar Event Link Creator. Share events and schedules quickly and efficiently, and make sure that all your attendees mark their calendars without fail. Happy event planning!
Click "Go to link" after you have filled out the event details to navigate to the Google Calendar page, or click the copy button to copy the URL!
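Under the hood, a link like this can be built as a standard Google Calendar "render" URL; the helper below is my own sketch of that format, and the exact parameters this tool uses are an assumption:

from urllib.parse import urlencode

def google_calendar_event_link(title, start_utc, end_utc, timezone, details=""):
    params = {
        "action": "TEMPLATE",
        "text": title,
        "dates": f"{start_utc}/{end_utc}",  # e.g. 20240318T100000Z/20240318T110000Z
        "ctz": timezone,                    # e.g. "Europe/Prague"
        "details": details,
    }
    return "https://calendar.google.com/calendar/render?" + urlencode(params)

print(google_calendar_event_link(
    "Team sync", "20240318T100000Z", "20240318T110000Z", "Europe/Prague"))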
A deep neural network consumes input numbers, passes them through a multi-layer calculation, and produces a prediction. The loss function provides the error: how much each prediction differs from the desired prediction target. Gradient descent calculates corrections to the network backwards through the layers. The neuron activation values between the layers before the output, which form arrays of numbers (vectors), are called embeddings (representations).
Gradient descent calculates the weight corrections (gradients) with the backpropagation algorithm. Backpropagation takes the distance from the correct results and calculates gradients (derivatives) starting from the output and iterating through the neural network layers back to the input. Because deep neural networks have a layered structure, backpropagation uses the chain rule and analytical derivatives of known functions. The weights are then changed in the opposite direction of the gradient with a small learning step. In this way, backpropagation increases or decreases reliance on neuron outputs in proportion to their influence on pointing towards the false label.
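A tiny, self-contained sketch of one gradient-descent step on a one-layer "network" y = w @ x with a squared-error loss; the chain rule gives the gradient:

import numpy as np

x = np.array([1.0, 2.0])        # input numbers
target = np.array([1.0])        # desired prediction
w = np.array([[0.1, -0.2]])     # weights: 1 output neuron, 2 inputs

prediction = w @ x              # forward pass
loss = 0.5 * np.sum((prediction - target) ** 2)

# backward pass (chain rule): dloss/dw = (prediction - target) * dprediction/dw
error = prediction - target
grad_w = np.outer(error, x)

learning_rate = 0.1
w = w - learning_rate * grad_w  # step against the gradient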
Overfitting refers to when a model has a low training-set loss but a high testing-set loss. For example, if a model has sufficient capacity and "insufficient regularization", it may memorize the training data. Read more about overfitting and double descent here.
A decision tree is an if-else lookup table, and with sufficient size and without pruning regularization it can memorize the training set. That is because the tree can create an individual bin for each dataset input and then recover the desired training-set label.
If a ReLU neuron activates, we can say that the neuron memorized to respond. Each neuron represents a dot product of the input vector with its weight vector; if the dot product is positive, the neuron outputs a non-zero value. Because we also have bias values, this defines not only a direction but a hyperplane. In this way, we can see that a neural network of sufficient size can also learn to split the hyperspace into regions, such that for each dataset input there is a bin into which its hidden representation falls and which activates a neuron corresponding to its label, so it can also overfit.
There are various regularization methods for neural networks to prevent overfitting and increase generalization. For example, see Dropout below.
Let's consider a fully connected neural network with a lower internal dimension than the input and output. In a way, this is an auto-encoder configuration.
This set of ReLU neurons can memorize more vectors than their count, which is called superposition (Anthropic). In other words, a ReLU network can embed and recover more vectors than its dimension (number of neurons), thanks to superposition. Put differently, a ReLU network's memory is greater than the sum of its neurons or its hidden dimension, because of the non-linearity.
This means that the internal embeddings of features are not fully orthogonal and have a small non-zero dot product. During reconstruction, the ReLU will only activate for the original feature being reconstructed, thanks to bias weights preventing activation for the others.
A similar effect was observed in Transformers (Hopfield Networks is All You Need). During stored-vector reconstruction, the hidden activations form vectors with maximally different directions (polytopes) from those formed when reconstructing other stored vectors.
Instead of embeddings, we can look at weight vectors. In Superposition, Memorization, and Double Descent, generalization was observed when the weight vectors instead formed polytopes, while the embeddings did not.
Note that we can see language modeling as a compression problem, so memorization is a legitimate solution for the model. On the other hand, generalization may be what we actually want. Thus, we push the model training to instead find the generalization, which will compress even more.
In the toy model, they observed that often-repeated patterns were memorized instead of generalized.
When memorization is no longer possible, generalization will happen. The testing-set error may get worse for some time before improving (double descent).
From the paper Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning, which compares average ensembling, ensemble knowledge distillation, training averaged models, and individual models on classification problems.
The theory of random feature mapping (RFM) cannot explain the ensembling behavior of deep neural networks, as these are too different as models. They also behave differently.
Random feature maps in machine learning are techniques used for dimensionality reduction and feature extraction. RFM is based on sampling random matrices, which have been found to often preserve dot products well. For example, Gaussian random projection is a simple element-wise matrix sampling. After these random features are computed, we can employ, for example, gradient boosting on them. Similar methods are used by the Performer transformer and word mover's embedding.
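A minimal sketch of a Gaussian random projection: project d-dimensional data to k dimensions with a randomly sampled matrix, which roughly preserves dot products:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))              # 100 samples, 512 features
k = 64
R = rng.normal(size=(512, k)) / np.sqrt(k)   # random projection matrix
X_projected = X @ R                          # random features, shape (100, 64)
# downstream, a model such as gradient boosting can be trained on X_projected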
An ensemble is a combination of several models used to make a prediction. Ensemble distillation works much better in deep learning and performs similarly to the ensemble, contrary to random feature mapping. Training a single model defined as the average of the outputs of 10 models does not improve results in the case of deep learning, because once a simple solution is found in one of the models, the gradients will prevent further exploration in the other models. On the other hand, in RFM this does not seem to be a problem, perhaps because gradient descent is not used?
Note that the statistical distribution of the input matters for the results of every machine learning algorithm. Fortunately, most real-world problems deal with a similar class of distributions. In the case of a Gaussian mixture, deep learning ensembling does not help: the test variance tends to go down, but the test accuracy does not improve.
They define a multi-view assumption: samples are compositions of smaller features. If these features appear together, they trigger the classification class. The authors indicate this is plausible because of explainable visualizations.
Then the authors show that each model learns these local features differently and at different speeds. And because a model picks up the simplest features first, it becomes difficult to find the other features. The model then overfits: it is not able to learn the other features and instead learns noise in the small number of samples.
Perhaps the layered nature of deep neural networks explains why some features are forgotten across the layers if not reinforced enough. This is what training multiple separate models helps to prevent. Random feature mapping with boosting, being more shallow, has access to all the features, but then fails to optimize very well.
Distillation works for deep learning because the network gets a signal that there must be a feature it has to find. The model has the capacity, and with the additional signal, it can learn to detect all the features.
Dropout regularization randomly prevents the usage of neurons or entire input features from the previous layer. Dropout is turned off at inference (prediction) time. Dropout helps to reduce overfitting during training, probably because it prevents the network from relying too much on a small set of features. You can also see dropout as a form of random pruning.
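A minimal sketch of (inverted) dropout applied to a layer's activations during training:

import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    if not training:
        return activations  # dropout is turned off at inference time
    keep_mask = np.random.rand(*activations.shape) >= p_drop
    # scale by 1 / (1 - p_drop) so the expected activation stays the same
    return activations * keep_mask / (1.0 - p_drop)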
Deduplication and better sampling of the training data help by preventing overfitting, because deduplication reduces repetition, which reduces memorization.
Active learning is one of the methods to create more training samples while minimizing the labeling cost. For example, in confidence-based active learning (pool-based sampling), we select for labeling the samples where the network is least confident.
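A minimal sketch of least-confidence selection from an unlabeled pool; predict_proba is a hypothetical scikit-learn-style function returning class probabilities:

import numpy as np

def select_for_labeling(predict_proba, unlabeled_pool, budget=10):
    probs = predict_proba(unlabeled_pool)   # shape: (n_samples, n_classes)
    confidence = probs.max(axis=1)          # top-class probability per sample
    least_confident = np.argsort(confidence)[:budget]
    return least_confident                  # indices of the samples to label next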
More data increases diversity and thus again reduces repetition and encourages generalization; see, e.g., Chinchilla: Training Compute-Optimal Large Language Models.
Other topics in training neural networks:
Need to measure the average time of an event that may be different every time, to get an approximate mean period? Use this tool.
This tool shows the current measured time in milliseconds, one toggle button for start and stop, and one clear button. It has a table of the measured time intervals in milliseconds, where each row can be edited or deleted, and an average result output that is updated whenever the values in the table change by addition, edit, or deletion.
| Interval [ms] | Action |
|---|---|