### Bitonic Sequences

Question

Bitonic Sequences

Time limit: 2000 ms
Memory limit: 128 MB

Description

We define a positive bitonic sequence of sum N as a sequence of positive integers S₁, S₂, ..., Sₖ such that the sum of all elements equals N, and there exists an index j satisfying:

Sᵢ ≤ Sᵢ₊₁ for all 1 ≤ i < j, and
Sᵢ ≥ Sᵢ₊₁ for all j ≤ i < k.

Find the number of distinct positive bitonic sequences with a total sum of N, for all 1 ≤ N ≤ M, modulo 10⁹ + 7.

Two bitonic sequences S₁ and S₂ are considered different if they have different lengths, or if there exists an index j' such that S₁,j' ≠ S₂,j'.

Standard input

The first line of input contains a single integer M.

Standard output

Print the number of distinct positive bitonic sequences for each N (1 ≤ N ≤ M), modulo 10⁹ + 7.

Constraints and notes

1 ≤ M ≤ 3 × 10⁵.

Example

Input:

5

Output:

1 2 4 8 15

```cpp
#include <bits/stdc++.h>
using namespace std;

const int MOD = 1000000007;

int addmod(int a, int b) { a += b; if (a >= MOD) a -= MOD; return a; }
int mulmod(long long a, long long b) { return int((a * b) % MOD); }

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int M;
    if (!(cin >> M)) return 0;
    vector<int> f(M + 1, 0);
    // Q_p coefficients up to degree T_p where needed. We maintain current Q array up to M.
    vector<int> Q(M + 1, 0), Qnext(M + 1, 0);
    Q[0] = 1; // Q_1 = 1, since product over v=1..p-1 empty

    // For p = 1, B_1(x) = x / (1 - x) = sum_{m >= 1} x^m
    // Contribution to f[N] from p=1 is A_1(N - m) with A_1(t) = coeff of Q_1(t) = [t == 0]
    // So f[N] += 1 for all N >= 1.
    for (int N = 1; N <= M; ++N) f[N] = addmod(f[N], 1);

    // Now iterate p from 2..M, build Q_p from Q_{p-1} by multiplying (1 - x^{p-1})^{-2}
    for (int p = 2; p <= M; ++p) {
        int k = p - 1;
        // Compute Qnext = Q * (1 - x^k)^{-2} truncated to degree M
        // For each residue r in [0..k-1], process indices t = r + i*k
        for (int r = 0; r < k; ++r) {
            // collect sequence a[i] = Q[r + i*k] for i s.t. idx <= M
            vector<int> a;
            for (int idx = r; idx <= M; idx += k) a.push_back(Q[idx]);
            int L = (int)a.size();
            if (L == 0) continue;
            vector<long long> pref(L, 0), prefi(L, 0);
            pref[0] = a[0];
            prefi[0] = 0LL; // 0 * a[0]
            for (int i = 1; i < L; ++i) {
                pref[i] = pref[i-1] + a[i];
                if (pref[i] >= MOD) pref[i] -= MOD;
                prefi[i] = prefi[i-1] + 1LL * i * a[i];
                prefi[i] %= MOD;
            }
            // b[i] = sum_{c=0..i} (c+1) * a[i-c] = (i+1) * sum_{j=0..i} a[j] - sum_{j=0..i} j * a[j]
            for (int i = 0; i < L; ++i) {
                long long sumA = pref[i];
                long long sumJA = prefi[i];
                long long val = ((long long)(i + 1) * sumA - sumJA) % MOD;
                if (val < 0) val += MOD;
                int idx = r + i * k;
                Qnext[idx] = (int)val;
            }
        }
        Q.swap(Qnext);
        // clear Qnext for next iteration
        fill(Qnext.begin(), Qnext.end(), 0);

        // Add contributions for this p: B_p(x) = sum_{c>=1} x^{pc}
        // For each N, f[N] += sum over m = p, 2p, ... ≤ N of A_p(N - m), where A_p = Q
        for (int m = p; m <= M; m += p) {
            int maxN = M;
            // for N from m..M: f[N] += Q[N - m]
            // We can do this as a loop; still O(M/p) per m, totals O(M log M) per p? Actually sum over m is M/p terms, each loop M - m ~ M, gives O(M^2/p)
            for (int N = m; N <= maxN; ++N) {
                f[N] = addmod(f[N], Q[N - m]);
            }
        }
    }

    // Output results for N=1..M
    for (int N = 1; N <= M; ++N) {
        if (N > 1) cout << ' ';
        cout << f[N];
    }
    cout << '\n';
    return 0;
}
```

My solution passes only a few test cases; all the others time out. How can I optimize it?

Answer 1

You have the right idea: use the generating function

$$F(x)=\sum_{p\ge1}\Bigg(\prod_{v=1}^{p-1}\frac1{(1-x^v)^2}\Bigg)\cdot\frac{x^p}{1-x^p}.$$

The problem is cost. Adding the $p$-block takes $O(M^2/p)$ for each $p$, which sums to $O(M^2\log M)$ overall, and your update of $Q_p$ is $O(M)$ per $p$, so $O(M^2)$ in total. With $M\le 3\cdot10^5$ that will time out.
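For reference, here is one way to see where this generating function comes from. This is a sketch; the labels $p$ (the peak value) and $c$ (how many times the peak occurs) are only illustrative names.

```latex
% A bitonic sequence with maximum value p splits uniquely into
%   (non-decreasing run of values < p)  (c >= 1 copies of p)  (non-increasing run of values < p).
% Each run is determined by its multiset of values, i.e. a partition into parts < p.
\begin{align*}
\text{one run (partition into parts } < p\text{)} &: \prod_{v=1}^{p-1}\frac{1}{1-x^{v}},\\
\text{block of } c\ge 1 \text{ copies of } p &: \sum_{c\ge 1}x^{cp}=\frac{x^{p}}{1-x^{p}},\\
F(x) &= \sum_{p\ge 1}\Bigg(\prod_{v=1}^{p-1}\frac{1}{1-x^{v}}\Bigg)^{2}\frac{x^{p}}{1-x^{p}}.
\end{align*}
% Check against the sample: [x^5] F = 1 + 6 + 5 + 2 + 1 = 15 for p = 1, ..., 5.
```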

Below are two optimizations, in order of impact.

1) Do it in $O(M)$ polynomial multiplications via divide and conquer (a segment tree of polynomials)

Key observation: define the (truncated) polynomials

$H_k(x)= (1-x^k)^{-2} = \sum_{c\ge 0}(c+1)\,x^{ck}$ (cut off at degree $M$),
$G_p(x)=\dfrac{x^p}{1-x^p} = \sum_{c\ge 1}x^{cp}$ (cut off at degree $M$).

Then

$$F(x)=\sum_{p=1}^{M}\Big(\prod_{v=1}^{p-1}H_v(x)\Big)\cdot G_p(x)\quad\text{(truncated to degree }M\text{)}.$$

Instead of forming each prefix product and convolving it with $G_p$ one by one, build a segment tree over $p\in[1,M]$ and compute everything with $O(M)$ polynomial multiplications (each NTT/CRT-based, with degrees trimmed to $\le M$):

- Build a product tree $\mathcal P$: each node stores $P_{[L,R]}(x)=\prod_{v=L}^R H_v(x)$.
- Build a “weighted sum” tree $\mathcal D$ bottom-up with the identity $D_{[L,R]}(x)=D_{[L,mid]}(x)+P_{[L,mid]}(x)\cdot D_{[mid+1,R]}(x)$, with leaf $D_{[p,p]}(x)=G_p(x)$.
- Then $D_{[1,M]}(x)=\sum_{p=1}^{M}\Big(\prod_{v=1}^{p-1}H_v(x)\Big)\,G_p(x)=F(x)$.
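The recurrence can be checked by induction on the segment. A short derivation, using the same $P$, $D$, $H_v$ and $G_p$ as above:

```latex
\begin{align*}
\text{Claim: } D_{[L,R]} &= \sum_{p=L}^{R}\Big(\prod_{v=L}^{p-1}H_v\Big)G_p
  \quad\text{(true at a leaf, where the product is empty).}\\
P_{[L,mid]}\cdot D_{[mid+1,R]}
  &= \Big(\prod_{v=L}^{mid}H_v\Big)\sum_{p=mid+1}^{R}\Big(\prod_{v=mid+1}^{p-1}H_v\Big)G_p
   = \sum_{p=mid+1}^{R}\Big(\prod_{v=L}^{p-1}H_v\Big)G_p,\\
D_{[L,mid]} + P_{[L,mid]}\cdot D_{[mid+1,R]}
  &= \sum_{p=L}^{mid}\Big(\prod_{v=L}^{p-1}H_v\Big)G_p + \sum_{p=mid+1}^{R}\Big(\prod_{v=L}^{p-1}H_v\Big)G_p
   = \sum_{p=L}^{R}\Big(\prod_{v=L}^{p-1}H_v\Big)G_p .
\end{align*}
```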

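Every internal node performs one polynomial multiplication for the product tree and one for the D-tree, both truncated to degree $M$. The usual $O(M\log^2 M)$ bound for product trees assumes that each node's polynomial has length proportional to its segment. Here the $H_v$ are power series, so a node can be dense up to degree $M$ however short its segment is; in the worst case that means $O(M)$ multiplications of length-$M$ polynomials, so measure the node sizes before assuming this fits the time limit.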

Implementation tips:

- Leaves:
  - $H_k$ is very sparse: set poly[0]+=1, poly[k]+=2, poly[2k]+=3, … until the degree exceeds $M$.
  - $G_p$ is a simple 0/1 periodic vector: set poly[cp]=1 for cp in {p, 2p, … ≤ M}.
- Always truncate to degree $M$ after each multiply.
- Use NTT(s). Since the modulus $10^9+7$ is not NTT-friendly, do MTT/CRT:
  - run the NTT under three friendly primes (e.g., 998244353, 1004535809, 469762049); a product coefficient can reach about $M\cdot(10^9+7)^2\approx 3\cdot10^{23}$, which the product of two such primes (about $10^{18}$) cannot represent,
  - Garner / CRT the result back mod $10^9+7$.
- For very small degrees, switch to a naive $O(nm)$ multiply to reduce constants.
- Pre-allocate buffers; reuse vectors to avoid allocations.
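The Garner step can be sketched as follows. This is a minimal illustration, not a tuned implementation: the names `garnerToTarget`, `powMod` and `invMod` are hypothetical, and it assumes the true (unreduced) coefficient is smaller than the product of the three primes, which holds for the bound mentioned above.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// The three NTT-friendly primes named above, and the target modulus.
constexpr u64 P1 = 998244353, P2 = 1004535809, P3 = 469762049;
constexpr u64 TARGET_MOD = 1000000007;

// Modular exponentiation; all moduli here are below 2^30, so products fit in 64 bits.
u64 powMod(u64 base, u64 exp, u64 mod) {
    u64 result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Inverse modulo a prime via Fermat's little theorem.
u64 invMod(u64 a, u64 prime) { return powMod(a, prime - 2, prime); }

// Given X mod P1, X mod P2, X mod P3 for some 0 <= X < P1*P2*P3,
// returns X mod TARGET_MOD without ever materialising X.
u64 garnerToTarget(u64 r1, u64 r2, u64 r3) {
    assert(r1 < P1 && r2 < P2 && r3 < P3);
    // Mixed-radix digits: X = x1 + x2*P1 + x3*P1*P2.
    const u64 x1 = r1;
    const u64 x2 = (r2 + P2 - x1 % P2) % P2 * invMod(P1 % P2, P2) % P2;
    const u64 partial3 = (x1 % P3 + x2 % P3 * (P1 % P3)) % P3;
    const u64 x3 = (r3 + P3 - partial3) % P3
                   * invMod(P1 % P3 * (P2 % P3) % P3, P3) % P3;
    // Evaluate the mixed-radix form modulo the target.
    const u64 p1T = P1 % TARGET_MOD;
    const u64 p12T = p1T * (P2 % TARGET_MOD) % TARGET_MOD;
    u64 ans = x1 % TARGET_MOD;
    ans = (ans + x2 % TARGET_MOD * p1T) % TARGET_MOD;
    ans = (ans + x3 % TARGET_MOD * p12T) % TARGET_MOD;
    return ans;
}
```

In a multiply routine this would be called once per output coefficient, with `r1`, `r2`, `r3` taken from the three NTT results at the same index.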

This replaces your nested loops: you never form $Q_p$ explicitly and you never add “multiples of $p$” one at a time, because the tree does the algebra for all $p$ at once.

Sketch (only the structure; drop-in NTT is standard and omitted for brevity):

```cpp
#include <vector>
using std::vector;

const int MOD = 1000000007;

// Structure only: P = product tree, D = weighted-sum tree.
// D[node] = D[left] + P[left] * D[right], everything truncated to degree M.
// The answer polynomial is the D returned for the root [1, M]; print coefficients 1..M.

struct Poly { vector<int> a; };  // coefficients mod 1e9+7

Poly mul(const Poly& A, const Poly& B, int M);  // NTT + CRT, truncated to degree M (not shown)
void add_inplace(Poly& A, const Poly& B);       // A += B (mod 1e9+7), resizing A as needed (not shown)
void trim(Poly& A, int M);                      // drop coefficients above degree M (not shown)

struct Node { Poly P, D; };

// Leaf k: P = H_k = (1 - x^k)^{-2}, D = G_k = x^k / (1 - x^k), both cut off at degree M.
// Leaves are built on demand; storing all M of them up front would take O(M^2) memory.
Node make_leaf(int k, int M) {
    Node leaf;
    leaf.P.a.assign(M + 1, 0);
    leaf.D.a.assign(M + 1, 0);
    for (int t = 0; t * k <= M; ++t) leaf.P.a[t * k] = (t + 1) % MOD;
    for (int deg = k; deg <= M; deg += k) leaf.D.a[deg] = 1;
    return leaf;
}

// Returns P and D for the segment [L, R]. Children are temporaries, so only the
// nodes on the current recursion path are alive at any time.
Node build(int L, int R, int M) {
    if (L == R) return make_leaf(L, M);
    int mid = (L + R) >> 1;
    Node left = build(L, mid, M);
    Node right = build(mid + 1, R, M);
    Node res;
    res.P = mul(left.P, right.P, M);
    res.D = mul(left.P, right.D, M);  // P[left] * D[right]
    add_inplace(res.D, left.D);       // + D[left]
    trim(res.D, M);
    return res;
}
```

After Node root = build(1, M, M), root.D.a[n] (for 1 ≤ n ≤ M) are your answers.
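Before wiring in an NTT, the recurrence can be sanity-checked on small inputs with a naive multiply. The following is a minimal sketch under that assumption; `mulNaive`, `buildNaive` and `bitonicCounts` are illustrative names, and the quadratic multiply makes it suitable only for small $M$.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Poly = std::vector<long long>;  // always M + 1 coefficients, reduced mod MOD
constexpr long long MOD = 1000000007;

// Schoolbook product of A and B, truncated to degree M.
Poly mulNaive(const Poly& A, const Poly& B, int M) {
    Poly C(M + 1, 0);
    for (int i = 0; i <= M; ++i) {
        if (A[i] == 0) continue;
        for (int j = 0; i + j <= M; ++j)
            C[i + j] = (C[i + j] + A[i] * B[j]) % MOD;
    }
    return C;
}

// Returns {P_[L,R], D_[L,R]} using D = D_left + P_left * D_right.
std::pair<Poly, Poly> buildNaive(int L, int R, int M) {
    if (L == R) {
        Poly H(M + 1, 0), G(M + 1, 0);
        for (int t = 0; t * L <= M; ++t) H[t * L] = t + 1;  // (1 - x^L)^{-2}
        for (int deg = L; deg <= M; deg += L) G[deg] = 1;   // x^L / (1 - x^L)
        return {H, G};
    }
    const int mid = (L + R) / 2;
    const auto left = buildNaive(L, mid, M);
    const auto right = buildNaive(mid + 1, R, M);
    Poly P = mulNaive(left.first, right.first, M);
    Poly D = mulNaive(left.first, right.second, M);
    for (int i = 0; i <= M; ++i) D[i] = (D[i] + left.second[i]) % MOD;
    return {P, D};
}

// Coefficient n of the result is the number of bitonic sequences of sum n.
std::vector<long long> bitonicCounts(int M) {
    assert(M >= 1);
    return buildNaive(1, M, M).second;
}
```

For $M=5$ the coefficients at indices 1 to 5 reproduce the sample output 1 2 4 8 15, which makes this a convenient reference to diff a fast implementation against.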

2) If you must keep your current shape, at least drop a log-factor (still too slow at 3e5 though)

Right now you do:

```cpp
for (int m = p; m <= M; m += p)
    for (int N = m; N <= M; ++N)
        f[N] += Q[N - m];
```

Rewrite that using residue-class prefix sums (for each r in [0..p-1]):

- let a[i] = Q[r + i*p] for i ≥ 0,
- take the prefix sums pref[i] = pref[i-1] + a[i],
- then for j ≥ 1, add pref[j-1] to f[r + j*p].

This makes the $B_p$ update $O(M)$ per $p$ (down from $O(M^2/p)$), but the whole algorithm remains $O(M^2)$ and will still TLE at the given limit. Use it only as an incremental improvement if you can’t implement #1 yet.
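The residue-class prefix sums do not need separate arrays per residue: a running sum with stride $p$ over a copy of Q computes all of them in one pass. A minimal sketch, where `addPeakContribution` and `run` are illustrative names rather than anything from your code:

```cpp
#include <cassert>
#include <vector>

constexpr int MOD = 1000000007;

// Adds the coefficients of Q(x) * x^p / (1 - x^p) to f, truncated to the
// common length of Q and f. Runs in O(M) for one value of p.
void addPeakContribution(const std::vector<int>& Q, int p, std::vector<int>& f) {
    assert(p >= 1 && Q.size() == f.size());
    const int n = static_cast<int>(f.size());
    std::vector<int> run(Q);
    // After this loop run[i] = Q[i] + Q[i-p] + Q[i-2p] + ..., i.e. Q / (1 - x^p).
    for (int i = p; i < n; ++i) {
        run[i] += run[i - p];
        if (run[i] >= MOD) run[i] -= MOD;
    }
    // Multiplying by x^p is a shift by p.
    for (int N = p; N < n; ++N) {
        f[N] += run[N - p];
        if (f[N] >= MOD) f[N] -= MOD;
    }
}
```

The same stride-$p$ running sum, applied twice in place to Q, multiplies it by $(1-x^p)^{-2}$, so it could also stand in for the explicit pref/prefi computation in the $Q_p$ update.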

Complexity & practicality

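Segment-tree D&C + NTT/CRT: $O(M)$ polynomial multiplications. The $O(M\log^2 M)$ figure applies only when each node's polynomial has length proportional to its segment; with every node truncated at degree $M$, a single multiplication can cost $O(M\log M)$ on its own, so benchmark near $M=3\cdot10^5$ before relying on it.
Memory: the limit is 128 MB, so keep only the polynomials on the current recursion path, trim to degree $M$, and reuse buffers rather than storing every node.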

If you want, I can share a full drop-in NTT+CRT template wired to the tree above; it’s a bit long for a single message, but that’s the exact code pattern to make this pass.
